CMSC 409: Artificial Intelligence Project 4 solution

Pr.4.
1. Download and unzip “Project4_sentences.zip” and “Project4_code.zip” files.
A set of sentences is given in the file “sentences.txt”. Each sentence is a line in the file. Create the
feature vector by writing a program that applies the following text mining techniques to this set of
sentences.
A. Tokenize sentences
B. Remove punctuation and special characters
C. Remove numbers
D. Convert upper-case to lower-case
E. Remove stop words. A set of stop words is provided in the file “stop_words.txt”
F. Perform stemming. Use the Porter stemming code provided in the file
“Porter_Stemmer_X.txt”
G. Combine stemmed words.
Provide the feature vector in your report.
Note:
The feature vector contains the set of unique stemmed words that appear in the set of sentences provided.
The file “Project4_code.zip” contains implementations of the Porter Stemmer in several languages. You
can use any of the provided versions (Java, Matlab, Python, and C).
Make sure you rename your file accordingly. More source code for the Porter Stemmer can be found here:
https://tartarus.org/martin/PorterStemmer/
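
Steps A to G can be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not the required solution: the `stem` parameter is a hypothetical hook where the provided Porter stemmer would be plugged in (its actual function name depends on the version chosen), it defaults to an identity function so the example runs on its own, and the sample sentences and stop words are made up rather than taken from the project files.

```python
import re

def preprocess(sentence, stop_words, stem=lambda w: w):
    """Apply steps A-F to one sentence and return its list of tokens."""
    tokens = sentence.split()                                  # A. tokenize on whitespace
    tokens = [re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens]  # B. strip punctuation and special characters
    tokens = [re.sub(r"[0-9]", "", t) for t in tokens]         # C. strip digits
    tokens = [t.lower() for t in tokens if t]                  # D. lower-case, dropping emptied tokens
    tokens = [t for t in tokens if t not in stop_words]        # E. remove stop words
    return [stem(t) for t in tokens]                           # F. stem (identity unless a stemmer is passed)

def build_feature_vector(sentences, stop_words, stem=lambda w: w):
    """G. Combine the stemmed words of all sentences into a sorted list of unique terms."""
    vocab = set()
    for s in sentences:
        vocab.update(preprocess(s, stop_words, stem))
    return sorted(vocab)

# Made-up sample data; the real input would be read from sentences.txt and stop_words.txt.
sentences = ["The car was identified in 2019!", "Anonymous cars, 3 of them."]
stop_words = {"the", "was", "in", "of", "them"}
features = build_feature_vector(sentences, stop_words)
print(features)  # ['anonymous', 'car', 'cars', 'identified']
```

Because the default `stem` does nothing, "car" and "cars" stay separate terms here; passing a real stemmer is what would merge them.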
2. Using the feature vector generated in the first task, write a program that generates the Term Document
Matrix (TDM) for ALL the sentences in “sentences.txt”, similar to the TDM below.
Example TDM
Keyword set    anonymous   identify   car   …
Sentence 1     1           4          3     …
Sentence 2     2           0          1     …
…              …           …          …     …
Sentence 20    2           0          0     …
2.1 Provide the TDM in your report.
2.2 For each of the text mining steps (A to G), explain why they are used, and what sort of
information is lost while applying each of the text-mining steps.
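
One way to build such a matrix is to count, for each sentence, how often every feature term occurs. The sketch below assumes the sentences have already been reduced to lists of stemmed tokens; the function name `term_document_matrix` and the tiny token lists are illustrative only and not part of the provided code.

```python
from collections import Counter

def term_document_matrix(token_lists, features):
    """Return one row per sentence; each entry counts a feature term in that sentence."""
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)  # a Counter returns 0 for terms that are absent
        matrix.append([counts[term] for term in features])
    return matrix

# Hand-written stand-ins for preprocessed sentences (not the project data).
features = ["anonym", "car", "identifi"]
token_lists = [
    ["car", "identifi", "car"],
    ["anonym", "car"],
]
tdm = term_document_matrix(token_lists, features)
for i, row in enumerate(tdm, start=1):
    print(f"Sentence {i}", row)
# Sentence 1 [0, 2, 1]
# Sentence 2 [1, 1, 0]
```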
3. Write a program implementing the clustering algorithm of your choice (WTA or FCAN). Apply that
algorithm to the TDM to group similar sentences together. Show and comment on the results. What could
you have done to obtain different results (with respect to the algorithm implementation or the way
the data is fed to it)?
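
As an illustration of the first option, the sketch below shows one simple form of winner-take-all (WTA) competitive learning applied to TDM rows. It is a hedged example rather than a reference implementation: the number of clusters, the learning rate `alpha`, the epoch count, initialising the neurons from the first rows, and the made-up 4 x 3 matrix are all assumptions chosen for brevity.

```python
import math

def normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def winner(weights, x):
    """Index of the neuron whose weight vector has the largest dot product with x."""
    return max(range(len(weights)),
               key=lambda k: sum(w * xi for w, xi in zip(weights[k], x)))

def wta_cluster(rows, n_clusters=2, alpha=0.5, epochs=20):
    """Cluster rows with winner-take-all learning and return one label per row."""
    data = [normalize(r) for r in rows]
    # Assumption: start each neuron at one of the first n_clusters rows.
    weights = [list(data[k]) for k in range(n_clusters)]
    for _ in range(epochs):
        for x in data:
            k = winner(weights, x)
            # Only the winning neuron learns: move it toward x, then renormalise.
            weights[k] = normalize([w + alpha * (xi - w) for w, xi in zip(weights[k], x)])
    return [winner(weights, x) for x in data]

tdm = [[2, 0, 0], [3, 1, 0], [0, 0, 4], [0, 1, 3]]  # made-up 4 x 3 TDM
labels = wta_cluster(tdm)
print(labels)  # rows 1-2 share one label, rows 3-4 share the other
```

In a sketch like this, the outcome depends on `n_clusters`, `alpha`, `epochs`, the initial weights, and the order in which the rows are presented, which are the kinds of choices the question above asks about.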
——————————————————————
Note:
1. Your software must be user friendly. The TA must be able to test it simply by executing the code.
2. Project deliverable should be a zip file containing:
a. A written report with answers to the questions above in Word, PDF, PS, or TXT format
b. The source code.
3. Submit your zip file to the instructor (mmanic@vcu.edu) and cc the TA, Darshini (Samantha) Mahendran
(mahendrand@vcu.edu). Please use the subject line [CMSC 409] Family name, Project 4.