Description
Objectives
• Read a file of unknown size and store its contents in a vector
• Loop through a vector
• Store, search, and iterate through data in an array of struct
• Use array doubling via dynamic memory to increase the size of the array
Background
There are several fields in computer science that aim to understand how people use
language. This can include analyzing the most frequently used words by certain
authors, and then going one step further to ask a question such as: “Given what we
know about Hemingway’s language patterns, do we believe Hemingway wrote this
lost manuscript?”
In this assignment, we will take a first step into document analysis by
determining the number of unique words and the most frequently used words in
two documents. If you enjoy this, consider taking elective courses in Natural Language Processing.
What your program needs to do
There is one test file on the website – HW2-HungerGames_edit.txt – which contains
the full text of Hunger Games Book 1. We have pre-processed the file to remove all
punctuation and convert every word to lowercase. We will test on a different file!
There is also the ignore-words file – HW2-ignoreWords.txt – which contains the top 50
common words usually ignored during natural-language processing.
Your program will calculate the following information on any text file:
• The top n words (excluding stop words; n is also a command-line argument)
and the number of times each word was found
• The total number of unique words (excluding stop words) in the file
• The total number of words (excluding stop words) in the file
• The number of array doublings needed to store all unique words in the file
Example:
Your program takes three command-line arguments: the number of most common
words to print out, the name of the file to process, and the stop word list file.
Running your program using:
./a.out 10 HW2-HungerGames_edit.txt HW2-ignoreWords.txt
would return the 10 most common words in the file HW2-HungerGames_edit.txt and
should produce the following results:
682 – is
492 – peeta
479 – its
431 – im
427 – can
414 – says
379 – him
368 – when
367 – no
356 – are
#
Array doubled: 7
#
Unique non-stop words: 7682
#
Total non-stop words: 59157
Program Specifications
The following are requirements for your program:
• Read in the name of the file to process from the second command-line
argument.
• Read in the number of most common words to process from the first
command-line argument.
• Write a function named getStopwords that takes the name of the ignore-words
file and a reference to a vector as parameters (returns void). Read the
file to get the list of the top 50 most common words to ignore (e.g., Table 1).
These are commonly referred to as ‘stopwords’ in NLP (Natural Language
Processing). (Create this file yourself.)
o The file will have one word per line and will always contain exactly 50
words. We will test with files containing different words!
o Your function will update the vector passed to it with a list of the
words from the file.
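A minimal sketch of getStopwords, assuming the vector holds std::string (the spec fixes only the function name, the void return, and the two parameters, so the exact parameter types here are an assumption):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Sketch: read one stopword per line from the ignore-words file
// and append each word to the vector passed in by reference.
void getStopwords(const std::string& ignoreWordFileName,
                  std::vector<std::string>& stopwords) {
    std::ifstream in(ignoreWordFileName);
    std::string word;
    while (in >> word) {   // one word per line, so >> is sufficient
        stopwords.push_back(word);
    }
}
```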
• Store the unique words found in the file that are not in the stopword list in a
dynamically allocated array.
o Call a function to check if the word is a stopword first, and if it is, then
ignore that word.
o Use an array of structs to store each unique word (variable name
word) and a count (variable name count) of how many times it
appears in the text file.
o Use the array-doubling algorithm to increase the size of your
array
§ We don’t know ahead of time how many unique words the
input file will have, so you don’t know how big the array should
be. Start with an array size of 100 (use the constant declared in
the starter code), and double the size as words are read in from
the file and the array fills up with new words.
• Use dynamic memory allocation to create your array
• Copy the values from the current array into the new
array, and then
• Free the memory used for the current array.
(The index of any given word must be the same before and
after resizing.)
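The doubling step above might be sketched as follows. The struct fields word and count come from the spec; the function name doubleArray and the by-reference capacity parameter are illustrative assumptions:

```cpp
#include <string>

struct WordEntry {
    std::string word;  // the unique word
    int count;         // how many times it appears in the file
};

// Allocate an array twice as large, copy every entry to the SAME index,
// free the old array, and return the new one.
WordEntry* doubleArray(WordEntry* current, int& capacity) {
    WordEntry* bigger = new WordEntry[capacity * 2];
    for (int i = 0; i < capacity; ++i) {
        bigger[i] = current[i];  // index i is preserved across the resize
    }
    delete[] current;
    capacity *= 2;
    return bigger;
}
```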
• Output the top n most frequent words
Write a function named printTopN that takes a reference to the array of
structs and the value of n to determine the top n words in the array.
Generate an array of the top n items sorted from most frequent to least
frequent, and print them in that order.
Array MUST be sorted before calling printTopN.
• Output the number of times you had to double the array.
• Output the number of unique non-stop words.
• Output the total number of non-stop words.
o Write a function named getTotalNumberNonStopWords that takes a
reference to your array of unique non-stop words and returns the
total number of words found.
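The total is just the sum of the count fields over the unique-word array. A sketch, assuming the number of stored entries is passed as a second parameter (the spec names only the function and the array parameter):

```cpp
#include <string>

struct WordEntry {
    std::string word;
    int count;
};

// Sum the count field over every unique word stored in the array.
int getTotalNumberNonStopWords(WordEntry* words, int numWords) {
    int total = 0;
    for (int i = 0; i < numWords; ++i) {
        total += words[i].count;
    }
    return total;
}
```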
• Format your output the following way (and reference the example above).
o When you output the top n words in the file, the output needs to be in
order, with the most frequent word printed first. The format for the
output needs to be:
Count – Word
#
Array doubled: