Description
1. Introduction / Aim
In this assignment, you are expected to design a document analysis program. The primary goal of this experiment is to get you practice on linked lists and matrices.
2. Background Information
Document analytics is the process of deriving information from text. The documents are examined to obtain information about authors by analyzing the texts. It is expected that an author’s documents will be similar to each other. Therefore, author of a document can be identified using similarity between other documents.
The similarity is a measure of the resemblance between datasets; i.e., how similar or alike the sets are. The most commonly used datasets are terms when the similarity of documents is considered. Documents similarities can be found using similarity metrics based on the frequency of terms in documents.
In Information Retrieval approaches, the commonly used similarity measure is the Cosine Similarity. Cosine similarity is a measure of similarity between non-zero vectors of an inner product space that measures the cosine of the angle between them. When the documents are the subject, this measure is computed by representing datasets as vectors in term space and then comparing the angles formed by the vectors.
Given two vectors of terms, A and B, the cosine similarity is represented using a dot product and magnitude (Magnitude is defined as the length of a vector. The notation for absolute value [|| ||]is also used for the magnitude of a vector.) as follows:
The cosine of 0o means that cos(0) = 1 and that the vectors have identical directions, in other words, the corresponding datasets are completely similar to one another. The cosine of 90o means that cos(90) = 0 and that the vectors are perpendicular, in other words, corresponding variables are
2
Hacettepe University, Computer Engineering Department (2017) BBM203 Experiment III
uncorrelated. Consequently, if the cosine similarity is closer to 1, the similarity of vectors increases. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. The angle between two vectors of term frequency cannot be greater than 90o.
3. Experiment
The primary goal of this experiment is to make you acquainted with operations in linked lists like adding, deleting, sorting etc.
In this experiment, you are expected to write an application that calculates the document similarity and holds the pairs of words in a document. And you are also expected to use linked lists and matrices when performing this assignment.
You will design and implement an application program named Simon, in ANSI C and compile it with gcc. You will also use the make utility and write a makefile [7] for your project.
The application is a document analysis program. Simon will get some commands and parameters as arguments from the console until the user sends –q parameter.
Example
–r bbm203/p3/d1.txt # read d1.txt form the given path
–a w2 3 d1.txt #add w2 with count 3 in d1.txt
–n2 d1.txt #get first 3 pairs of words that pass the most on the document
–d w1 d1.txt #delete w1 from d1.txt
–r bbm203/p3/d2.txt # read d2.txt form the given path
–s d1 d2.txt #return similarity of d1.txt and d2.txt
-q #quit the application
Details of example
With –r command, the application will read the given document word by word. In order to obtain words, you should stop reading on all non-alphanumeric characters. Also, you should hold all words and counts in a linked list for each read document.
With –a w2 3 d1.txt command and parameters, the application will add the word w2 and its count (3) to the list of d1 document.
With –n2 d1.txt command and parameter, the application will return the first 3 pairs of words that pass the most on the document d1.txt. (Pairs of words means that two consecutive words. For example, the sentence “today the weather is so cold” has 5 pairs of words; today-the, the-weather, weather-is, is-so, so-cold) To get the pairs of words, you have to design nested linked list.
With –d w1 d1.txt command and parameters, the application will delete the node which holds the information of w1, from the linked list
With –s d1.txt d2.txt command and parameters, the application will return the cosine similarity of d1.txt and d2.txt. To simplify operations, the application will calculate just fist 10 words of the documents’ sorted list. It means that there will be the maximum 20×2 (word x document) matrix for calculation of similarity.
3
Hacettepe University, Computer Engineering Department (2017) BBM203 Experiment III
With –q command, the application will free to memory and close.
A sample application
A sample application is detailed below, so you can understand the solution of the problem better.
The contents of d1.txt and d2.txt are as follows.
d1.txt The funct printf( ) and scanf( ) types, plus null-term and char strings. Char strings char and. Blah blah blah blah
d2.txt This is testing for printf… This is testing for scanf… char strings and char. fgetc(file) to5 blah blah blah blah blah the and
-r d1.txt: With this command d1.txt document is read by the application. While reading, the application creates a linked list for the document and adds words to list with their pair. The final representation of this process is shown in Figure 1.
the funct printf and scanf types plus null term char strings blah
1 1 1 3 1 1 1 1 1 3 2 4
strings 2
and 1
char 1term 1null 1plus 1types 1scanf 1
char 1
blah 1
and 1printf 1funct 1 char 2 blah 3
Figure 1. Linked list of words of d1.txt before sorting
You have to sort the linked list. You can sort the list while adding the words or after reading all text. The final representation of sorting is shown in Figure 2. (P.s. The bubble sort algorithm is used on this sample. You can use any sorting algorithm in your application.) There won’t be any output.
blah and char strings the funct printf scanf types plus null term
4 3 3 2 1 1 1 1 1 1 1 1
strings 2
and 1
char 1term 1null 1plus 1types 1scanf 1
char 1
blah 1
and 1printf 1funct 1char 2blah 3
Figure 2. The linked list of words of d1.txt after applying bubble sort algorithm.
4
Hacettepe University, Computer Engineering Department (2017) BBM203 Experiment III
–a Ccode 3 d1.txt: the application will add the word Ccode and its count (3) to the sorted linked list of d1 document. The final representation of the linked list of d1.txt after adding the given word and its count is shown in Figure 3.
Figure 3. The representation of the linked list after adding given word and its count
–n2 d1.txt: For this command, the application will print first 3 pairs of words that pass the most on the document d1.txt on the screen of the console.
Blah-blah, char-strings, strings-char
–d types d1.txt: For this command, the application will delete the node which holds the information of the given word (“types”). There won’t be any output but you have to update the linked list of given document (“d1”). The representation of the linked list of d1.txt after delete operation is shown in Figure 4.
blah and char ccode strings the funct printf scanf plus null term
4 3 3 3 2 1 1 1 1 1 1 1
strings 2
and 1
char 1term 1null 1types 1scanf 1
char 1
blah 1
and 1printf 1funct 1char 2blah 3 Null
Figure 4. The representation of the linked list after deleting the given word
-r d2.txt : d2.txt document is read by the application. The final representation of this process is shown in Figure 5.
blah this is testing for char printf scanf strings fgetc file to5 the and
5 2 2 2 2 2 1 1 1 1 1 1 1 1
Figure 5. The linked list of the document d2.txt after reading and sorting
5
Hacettepe University, Computer Engineering Department (2017) BBM203 Experiment III
–s d1.txt d2.txt: After getting this command the application will return the cosine similarity of d1.txt and d2.txt. First, you will get the first 10 words from the linked list of d1.txt and d2.txt. Then you will create a document-word matrix according to the number of different words you have. And then, you will put the words counts into the document-word matrix. The document-word matrix of this sample application is shown in Figure 6. Finally, you will apply cosine similarity on the matrix and will print the result of similarity on the screen of the console like below.
blah and char ccode strings the funct printf scanf plus this is testing for fgetc
d1 4 3 3 3 2 1 1 1 1 1 0 0 0 0 0
d2 5 1 2 0 1 1 0 1 1 0 2 2 2 2 1
Figure 6. The document-word matrix of this sample application
Cosine Similarity of d1.txt and d2.txt is 0.660
Important Issues In order to obtain terms, you should stop reading on all non-alphanumeric characters. All terms in your list must be in lower case Be careful! If the magnitude of two documents is zero, there will be division by zero error when calculating cosine similarity. In this case, the result will be zero. Test your program on “dev.cs.hacettepe.edu.tr” before submission. Your submission will be compiled and executed on this machine and this machine only. Projects without a proper Makefile won’t be graded. Your application will be controlled for possible memory leaks and you will get bonus points.
Assignment Report Report will be 30 points Please explain your design, data structures, and algorithms. Give necessary details without unnecessary clutter. Do a Big-O notation algorithm analysis of your functions Guide: ftp://ftp.cs.hacettepe.edu.tr/pub/dersler/genel/FormatForLabReports.doc
Last Remarks Regardless of the length, use UNDERSTANDABLE names to your variables, classes, and Functions Write READABLE SOURCE CODE block o These will cost you points. Deadline is 23:59 pm. No other submission method will be accepted. Save all your work until the assignment is graded. The assignment must be original, INDIVIDUAL work. Duplicate or very similar assignments are both going to be punished. General discussion of the problem is allowed, but DO NOT SHARE answers, algorithms or source codes. Copying without citing will be considered as cheating. Citing does not make it yours; cited work will get 0 from the respective section.
6
Hacettepe University, Computer Engineering Department (2017) BBM203 Experiment III
You will submit your work from https://submit.cs.hacettepe.edu.tr/index.php with the file hierarchy as below:
This file hierarchy must be zipped before submission (Not .rar, only .zip files are supported by the system) References 1- C The Complete Reference 4th Ed Herbert Schildt 2000 2- Garcia, E. (2015b). A Tutorial on Distance and Similarity. Retrieved from
https://www.minerazzi.com/tutorials/distance-similarity-tutorial.pdf 3- C The Complete Reference 4th Ed Herbert Schildt 2000 4- Calculus II For Dummies, 2nd Edition, Mark Zegarelli 5- Fundamentals of Data Structures in C. Ellis Horowitz, Sartaj Sahni, and Susan AndersonFreed, 1993 6- BBM 201 – Data Structures (Fall 2016