Description
A small but important aspect of text mining and natural language processing is measuring word frequency. This assignment is a heavily simplified exercise in loading a text file into Python and computing word frequency statistics. It requires working with text files, strings, and data-frames, so you are strongly encouraged to review the relevant sessions (14-17) if you have not already done so.
(a) Locate a movie script, play script, poem, or book of your choice in .txt format. You are free to choose nearly any novel, movie script, or play that you like, provided that your chosen document has at least 5 chapters, scenes, and/or acts that distinguish one portion of the document’s narrative from another. For example, the novel “Great Expectations” has 59 chapters, the script for “Jaws” has about 27 scenes, and all or almost all Shakespearean plays have exactly five acts. These segments must exist in your document, because part (e) of the assignment depends on them. Project Gutenberg is a great resource if you’re not sure where to start.
(b) Load the words of this document, in order of appearance, into a one-dimensional Python list (i.e. the first word should be the first element in the list, and the last word should be the last element). The list should be case insensitive, meaning that words differing only in capitalization (e.g. “The” and “the”) are treated as the same word. It’s up to you how to deal with special characters: you can remove them manually, ignore them during the loading process, or even count them as words, for example. Make sure you have this list clearly assigned to a variable, so we can evaluate it during grading.
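One possible way to approach part (b) is sketched below. It is only an illustration under stated assumptions: the function name `load_words`, the sample sentence, and the filename in the comment are all hypothetical, and the choice to drop punctuation is just one of the options described above.

```python
import re

def load_words(text):
    """Return the words of `text` in order of appearance, all lowercase."""
    # Lowercasing first makes "The" and "the" the same word.
    # The pattern keeps runs of letters, digits, and straight apostrophes;
    # anything else (punctuation, whitespace) acts as a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

# With a real document, the text would come from a file, for example:
#   with open("my_document.txt", encoding="utf-8") as f:  # hypothetical filename
#       words = load_words(f.read())
sample = "The balcony! the BALCONY, said Juliet."
words = load_words(sample)
print(words)
```

Note that this pattern silently discards non-ASCII letters and curly apostrophes, so a different pattern may suit your document better.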
(c) Use your list to create and print a two-column pandas data-frame with the following properties:
i. The first column of each row should contain a word from the text.
ii. The second column should represent the number of times that particular word appears in the text.
iii. The rows of the data-frame should be ordered according to the first occurrence of each word.
iv. It’s up to you whether or not your data-frame will include an index per row.
Make sure you have this data-frame clearly assigned to a variable, so we can evaluate it during grading.
Example: if the first word in your text is “the”, which occurs 500 times, and the second is “balcony”, which appears only twice, your data-frame should begin as follows:
| | Word | Count |
|---|---|---|
| 1 | “the” | 500 |
| 2 | “balcony” | 2 |
| … | … | … |

Again, the indices are optional.
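As a rough illustration of the ordering requirement in part (c), the sketch below relies on the fact that `collections.Counter` is a dictionary subclass and so keeps keys in the order they were first inserted (Python 3.7+). The function name `count_words`, the column names, and the short word list are assumptions made for the example.

```python
from collections import Counter

import pandas as pd

def count_words(words):
    """Build a two-column data-frame of words and counts, in first-occurrence order."""
    # Each word is inserted into the Counter the first time it is seen,
    # so iterating over the Counter follows the order of first occurrence.
    counts = Counter(words)
    return pd.DataFrame({"Word": list(counts.keys()),
                         "Count": list(counts.values())})

words = ["the", "balcony", "the", "window", "balcony", "the"]
word_counts = count_words(words)
print(word_counts)
```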
(d) Stop-words are commonly used words in a given language that often fail to communicate useful summative information about a text’s content. The attached stop_words.py file has a simple list of common stop-words assigned to a variable. For this part of the assignment, you are to create a modified copy of the data-frame from (c) with the following modifications: i. all stop-words have been removed from the data-frame, and ii. the data-frame rows have been sorted in decreasing order of frequency count. Again, make sure you have this data-frame clearly assigned to a variable, so we can evaluate it during grading.
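The two modifications in part (d) could be sketched as follows. The variable name inside stop_words.py is not given here, so the `stop_words` list below is a small stand-in, and `remove_stop_words` and the sample data-frame are likewise hypothetical.

```python
import pandas as pd

# Stand-in for the list defined in stop_words.py; the real variable name
# and contents are not known here, so this is an assumption.
stop_words = ["the", "a", "of"]

def remove_stop_words(df, stop_words):
    """Return a copy of `df` without stop-words, sorted by decreasing count."""
    # isin() marks rows whose word is a stop-word; ~ inverts the mask.
    kept = df[~df["Word"].isin(stop_words)]
    # sort_values returns a new data-frame, so the original is left unchanged.
    return kept.sort_values("Count", ascending=False).reset_index(drop=True)

word_counts = pd.DataFrame({"Word": ["the", "balcony", "of", "window"],
                            "Count": [3, 1, 2, 2]})
filtered = remove_stop_words(word_counts, stop_words)
print(filtered)
```

Resetting the index is optional; it is done here only so that the row labels match the new order.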
(e) While total word counts can provide a useful measure of the content of a document, they cannot reveal much about its underlying trends. In the context of document analysis, the term trend implies a direction (in terms of theme, mood, etc.) in which the content changes throughout the narrative. For example, some works of fiction begin with a comedic tone and take on a more serious tone in later stages, or vice versa. For the last part of your assignment, you are going to modify the approach taken in part (d) to address individual segments of the document. More specifically, you are to divide the raw document into partitions according to the chapters, acts, etc. that are present, and then produce a list of data-frames, where each list element is a single data-frame containing word frequencies for a single segment, in the same format as the data-frame from part (d). You are free to use whatever means you prefer to split the text into chapters and construct the list of data-frames, but one option is to use regular expressions on the raw document. Once again, you must ensure your list is readily accessible to us in the form of a variable.
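The regular-expression option mentioned in part (e) might look roughly like the sketch below. It assumes that every segment starts with a heading line such as "CHAPTER 1" or "Chapter IV"; real documents vary, so the pattern will almost certainly need adjusting (for acts, scenes, or body lines that happen to begin with the word "chapter"). All names and the sample text are hypothetical.

```python
import re
from collections import Counter

import pandas as pd

# Assumed heading format: a line beginning with "chapter", then a number or numeral.
CHAPTER_HEADING = re.compile(r"^chapter[ \t]+\w+.*$", re.IGNORECASE | re.MULTILINE)

def segment_frames(raw_text, stop_words):
    """Return one word-frequency data-frame per chapter, in chapter order."""
    # split() drops the heading lines themselves; the piece before the
    # first heading is treated as front matter and discarded.
    segments = CHAPTER_HEADING.split(raw_text)[1:]
    frames = []
    for segment in segments:
        words = re.findall(r"[a-z0-9']+", segment.lower())
        counts = Counter(w for w in words if w not in stop_words)
        df = pd.DataFrame({"Word": list(counts.keys()),
                           "Count": list(counts.values())})
        frames.append(df.sort_values("Count", ascending=False).reset_index(drop=True))
    return frames

raw = ("Title page\n"
       "CHAPTER 1\nThe cat sat. The cat slept.\n"
       "CHAPTER 2\nA dog barked at the dog.\n")
chapter_frames = segment_frames(raw, stop_words={"the", "a", "at"})
print(len(chapter_frames))
print(chapter_frames[0])
```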
You can use .py files, .ipynb files, or a combination of the two in your solution. Package these file(s), along with a simple README telling me what to run to generate the list and data-frames, into a zip file named <LN_FN_2.zip>, where LN is your last name and FN is your first name, and submit this file to Blackboard.