ASSIGNMENT 3 Subject : Minimum Spanning Trees solution


Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment


5/5 - (1 vote)

1 Introduction
Graphs are widely used in Computer Science for a wide range of subjects. These subjects may includeNaturalLanguageProcessing, ImageProcessing, PatternRecognition, SpeechRecognition, Bioinformatics etc. Graphs provide a natural way of constructing relationships between objects; thereby it is a natural way of learning by using this relationship between objects. These objects may be genes in a dna, phonemes in a speech signal, pixels in an image, words in a text (see Figure 1), and so on. Depending on the type of the problem, everything can be modelled by using the graph theory. There are many operations defined on graphs. You will practice Cosine Similarity and Minimum Spanning Trees in this assignment: The cosine similarity between two vectors (or two documents on the Vector Space) is a measure that calculates the cosine of the angle between them. A Minimum Spanning Tree is a subgraph of a given graph which has the minimum sum of the weights on the edges in the spanning tree. Minimum Spanning Trees have a wide range of application fields. For example, they are widely used in the design of the computer networks or telecommunication networks, handwriting recognition, image segmentation, clustering (see Figure 2), and so on.
Figure 1: A word graph [3]
2 Problem Definition
In this programming assignment, you will practice on one application field of graphs. The field that you will apply graphs is Natural Language Processing. You are expected to measure semantic similarity between words by using the word vectors. Subsequently, you will find word clusters that bear semantically similar words in each of them.
Page 1 of 6
Spring 2018 BBM 204: Software Laboratory II
Semanticsimilarityisaconceptthatmeasurehowsimilartwowords/documents/terms/concepts/senses are in meaning. For example, words ’blue’, ’red’, ’yellow’ are semantically similar, whereas ’book’ is no semantically similar to these words. However, book is semantically similar to ’notebook’. In Figure 3, a graph which is constructed by the Flickr tags is given (see Figure 4 for visual images for the given tags). You can see that semantically similar words are located close to each other on the graph, whereas semantically different words are quite distant from each other.
Figure 2: An example of minimum spanning tree clustering [4]
Another point that you need to observe on the graph is that, vertices in different colours refer to different clusters. Moreover, these clusters bear semantically similar words. In this assignment, using the word vectors provided in assignment you should build a graph. Movie review text is taken as input and word vector is produced. For each word you should assign edge weights equal to the cosine similarity. You can calculate cosine similarity by using formula which is shown below.
Figure 3: Disambiguation of Flickr Tags
Once the initial graph is constructed as described, you will use your graph for practicing two operations: a) measuring semantic similarity, b) finding clusters.
2.1 Measuring the Semantic Similarity Between Words
You will use your graph to measure semantic similarity between words. There are several ways to measure semantic similarity between words. However, you will use the cosine similarity:
Page 2 of 6
Spring 2018 BBM 204: Software Laboratory II
cos_sim(w1,w2) = w~1.w~2 ||w1||.||w2||
In this equation w1 and w2 represents different word vectors. In the word vector file, each line represents a different word vector.
Figure 4: Disambiguation of Flickr Tags
2.2 Clustering
In this part of the assignment, you will use minimum spanning trees on your graph to find word clusters (see Figure 2). As indicated before, minimum spanning trees can be used for clustering data. Once the minimum spanning tree on a given graph is found,k clusters can be obtained by removing k-1 edges that have the minimum weights on the minimum spanning tree. Therefore, k clustersareobtained. Eachoftheseclustersareexpectedtohavesemanticallysimilarwords. Since your initial graph is not weighted, you will construct a weighted graph before finding the minimum spanning tree. In order to create a weighted graph, you will remove all the edges from your initial graph, and you will add only the edges between the word pairs that you measured the semantic similarities in the first step. Once you have a weighted graph, you can find the minimum spanning tree and cut k-1 edges that have the minimum weights on the graph to obtain k clusters. The number of clusters will be provided as a parameter. Therefore, the number of clusters to be obtained will be fixed initially.
3 Input Output And Testing
You will have two input files and one output files in this assignment. All of the file names will be provided to the program as command line arguments. • word vector file: A word vector is given in this file. In this word vector file, each line represents a vector of unique word in the text. That is every word and it’s vector are given in different lines. A sample word vectors are given in Figure 5. • word pairs file: A word pairs list is given in this file. Your program is expected to measure the semantic similarities between the word pairs. A sample word-pairs file is given in Figure 6.
Page 3 of 6
Spring 2018 BBM 204: Software Laboratory II
Figure 5: A sample word vectors
Figure 6: A sample of word pairs file
• cluster file: Finally, your program will produce the minimum spanning tree and produce the clusters out of this spanning tree. Contents of each cluster will be written to this file. Cluster members will be written in alphabetical order (in increasing order) and clusters will be written according to the number of members in each cluster in increasing order (i.e. cluster that has the minimum number of members will be written first). Cluster members must be delimited by commas. A sample clusters file is given in Figure ??.
4 Execution of the Program
Your implementation should run with following command: java -jar SemanticGraph.jar dictionary wordpairs clusters numberofclusters Here:
• wordVec is the name of the file that has the dictionary, • wordpairs is the name of the file that has word pairs that your program will measure how similar they are, • clusters is the name of the file that the contents of each cluster will be written to, • numberofclusters is the number of clusters that your program has to find by using the minimum spanning tree. Please keep in mind that all of the names of the files will be taken as command line arguments and will not be fixed names, otherwise your program will not be able to run with other file
Page 4 of 6
Spring 2018 BBM 204: Software Laboratory II
names. A screenshot of the terminal that shows how to run your program is given in Figure 9. You can test your program by creating your own word_pairs.
• Do not miss the deadline. • Save all your work until the assignment is graded. • The assignment must be original, individual work. Duplicate or very similar assignments are both going to be considered as cheating. • YoucanaskyourquestionsviaPiazza( and you are supposed to be aware of everything discussed in Piazza. • You will submit your work from with the file hierarchy as below: This file hierarchy must be zipped before submitted (Not .rar, only .zip files are supported by the system)