# ASSIGNMENT 3 Subject : Minimum Spanning Trees solution

\$24.99

Original Work ?

## Description

5/5 - (1 vote)

1 Introduction
Graphs are widely used in Computer Science for a wide range of subjects. These subjects may includeNaturalLanguageProcessing, ImageProcessing, PatternRecognition, SpeechRecognition, Bioinformatics etc. Graphs provide a natural way of constructing relationships between objects; thereby it is a natural way of learning by using this relationship between objects. These objects may be genes in a dna, phonemes in a speech signal, pixels in an image, words in a text (see Figure 1), and so on. Depending on the type of the problem, everything can be modelled by using the graph theory. There are many operations deﬁned on graphs. You will practice Cosine Similarity and Minimum Spanning Trees in this assignment: The cosine similarity between two vectors (or two documents on the Vector Space) is a measure that calculates the cosine of the angle between them. A Minimum Spanning Tree is a subgraph of a given graph which has the minimum sum of the weights on the edges in the spanning tree. Minimum Spanning Trees have a wide range of application ﬁelds. For example, they are widely used in the design of the computer networks or telecommunication networks, handwriting recognition, image segmentation, clustering (see Figure 2), and so on.
Figure 1: A word graph [3]
2 Problem Deﬁnition
In this programming assignment, you will practice on one application ﬁeld of graphs. The ﬁeld that you will apply graphs is Natural Language Processing. You are expected to measure semantic similarity between words by using the word vectors. Subsequently, you will ﬁnd word clusters that bear semantically similar words in each of them.
Page 1 of 6
Spring 2018 BBM 204: Software Laboratory II
Semanticsimilarityisaconceptthatmeasurehowsimilartwowords/documents/terms/concepts/senses are in meaning. For example, words ’blue’, ’red’, ’yellow’ are semantically similar, whereas ’book’ is no semantically similar to these words. However, book is semantically similar to ’notebook’. In Figure 3, a graph which is constructed by the Flickr tags is given (see Figure 4 for visual images for the given tags). You can see that semantically similar words are located close to each other on the graph, whereas semantically diﬀerent words are quite distant from each other.
Figure 2: An example of minimum spanning tree clustering [4]
Another point that you need to observe on the graph is that, vertices in diﬀerent colours refer to diﬀerent clusters. Moreover, these clusters bear semantically similar words. In this assignment, using the word vectors provided in assignment you should build a graph. Movie review text is taken as input and word vector is produced. For each word you should assign edge weights equal to the cosine similarity. You can calculate cosine similarity by using formula which is shown below.
Figure 3: Disambiguation of Flickr Tags
Once the initial graph is constructed as described, you will use your graph for practicing two operations: a) measuring semantic similarity, b) ﬁnding clusters.
2.1 Measuring the Semantic Similarity Between Words
You will use your graph to measure semantic similarity between words. There are several ways to measure semantic similarity between words. However, you will use the cosine similarity:
Page 2 of 6
Spring 2018 BBM 204: Software Laboratory II
cos_sim(w1,w2) = w~1.w~2 ||w1||.||w2||
(1)
In this equation w1 and w2 represents diﬀerent word vectors. In the word vector ﬁle, each line represents a diﬀerent word vector.
Figure 4: Disambiguation of Flickr Tags
2.2 Clustering
In this part of the assignment, you will use minimum spanning trees on your graph to ﬁnd word clusters (see Figure 2). As indicated before, minimum spanning trees can be used for clustering data. Once the minimum spanning tree on a given graph is found,k clusters can be obtained by removing k-1 edges that have the minimum weights on the minimum spanning tree. Therefore, k clustersareobtained. Eachoftheseclustersareexpectedtohavesemanticallysimilarwords. Since your initial graph is not weighted, you will construct a weighted graph before ﬁnding the minimum spanning tree. In order to create a weighted graph, you will remove all the edges from your initial graph, and you will add only the edges between the word pairs that you measured the semantic similarities in the ﬁrst step. Once you have a weighted graph, you can ﬁnd the minimum spanning tree and cut k-1 edges that have the minimum weights on the graph to obtain k clusters. The number of clusters will be provided as a parameter. Therefore, the number of clusters to be obtained will be ﬁxed initially.
3 Input Output And Testing
You will have two input ﬁles and one output ﬁles in this assignment. All of the ﬁle names will be provided to the program as command line arguments. • word vector ﬁle: A word vector is given in this ﬁle. In this word vector ﬁle, each line represents a vector of unique word in the text. That is every word and it’s vector are given in diﬀerent lines. A sample word vectors are given in Figure 5. • word pairs ﬁle: A word pairs list is given in this ﬁle. Your program is expected to measure the semantic similarities between the word pairs. A sample word-pairs ﬁle is given in Figure 6.
Page 3 of 6
Spring 2018 BBM 204: Software Laboratory II
Figure 5: A sample word vectors
Figure 6: A sample of word pairs ﬁle
• cluster ﬁle: Finally, your program will produce the minimum spanning tree and produce the clusters out of this spanning tree. Contents of each cluster will be written to this ﬁle. Cluster members will be written in alphabetical order (in increasing order) and clusters will be written according to the number of members in each cluster in increasing order (i.e. cluster that has the minimum number of members will be written ﬁrst). Cluster members must be delimited by commas. A sample clusters ﬁle is given in Figure ??.
4 Execution of the Program
Your implementation should run with following command: java -jar SemanticGraph.jar dictionary wordpairs clusters numberofclusters Here:
• wordVec is the name of the ﬁle that has the dictionary, • wordpairs is the name of the ﬁle that has word pairs that your program will measure how similar they are, • clusters is the name of the ﬁle that the contents of each cluster will be written to, • numberofclusters is the number of clusters that your program has to ﬁnd by using the minimum spanning tree. Please keep in mind that all of the names of the ﬁles will be taken as command line arguments and will not be ﬁxed names, otherwise your program will not be able to run with other ﬁle
Page 4 of 6
Spring 2018 BBM 204: Software Laboratory II
names. A screenshot of the terminal that shows how to run your program is given in Figure 9. You can test your program by creating your own word_pairs.
Notes
• Do not miss the deadline. • Save all your work until the assignment is graded. • The assignment must be original, individual work. Duplicate or very similar assignments are both going to be considered as cheating. • YoucanaskyourquestionsviaPiazza(https://piazza.com/hacettepe.edu.tr/spring2018/bbm204) and you are supposed to be aware of everything discussed in Piazza. • You will submit your work from https://submit.cs.hacettepe.edu.tr/index.php with the ﬁle hierarchy as below: This ﬁle hierarchy must be zipped before submitted (Not .rar, only .zip ﬁles are supported by the system)