MIE 1513 Lab and Assignment 4: Data Science and NLP for Customer Review Analysis




This lab and assignment involve analyzing real hotel review data crawled from the
Tripadvisor website. The goals are to automatically identify positive and negative keywords and
phrases associated with hotels, to extract explanatory review summaries, and to better understand
the characteristics of data analysis tools and of human reviewing behavior.
• Programming language: Python (Google Colab Environment)
• Due Date: Posted in Syllabus
Marking scheme and requirements: Full marks will be given for (1) working, readable, reasonably efficient, documented code that achieves the assignment goals and (2) for providing appropriate
answers to the questions in a Jupyter notebook ds-assignment.ipynb committed to the student’s
assignment repository.
Please note the plagiarism policy in the syllabus. If you borrow or modify any multiline snippets of
code from the web, you are required to cite the URL in a comment above the code that you used.
You do not need to cite tutorials or reference content that demonstrate how to use packages – you
should certainly be making use of such content.
What/how to submit your work:
• All your code should be included in a notebook named ds-assignment.ipynb.
• Commit and push your work to your github repository in order to submit it. Your last commit
and push before the assignment deadline will be considered to be your submission. You can
check your repository online to make sure that all required files have actually been committed
and pushed to your repository.
• A link to create a personal repository for this assignment is posted on QUERCUS.
1 Before and in the Introductory lab
In this lab, we will familiarize ourselves with the nltk library and an nltk sentiment analysis package
called Vader. Please familiarize yourself with this tool through the examples presented
in the lab.
In the lab, we will provide some examples of the types of data analysis required for the assignment;
however, you will be required to extend these analyses to additional cases.
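As a rough illustration of what scoring reviews with Vader looks like, the sketch below wraps an analyzer in a small helper. It assumes nltk's `SentimentIntensityAnalyzer`, whose `polarity_scores` method returns a dict with `neg`, `neu`, `pos`, and `compound` keys, and it assumes the `vader_lexicon` resource can be downloaded. The helper names `make_vader` and `compound_scores` are hypothetical and are not part of the lab materials.

```python
from typing import Iterable, List


def make_vader():
    """Build an nltk Vader analyzer (assumes nltk is installed)."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    return SentimentIntensityAnalyzer()


def compound_scores(texts: Iterable[str], analyzer) -> List[float]:
    """Return the Vader compound score (in [-1, 1]) for each text.

    `analyzer` is any object with a polarity_scores(text) method that
    returns a dict containing a "compound" key.
    """
    return [analyzer.polarity_scores(text)["compound"] for text in texts]
```

A call such as `compound_scores(["Great location, friendly staff.", "Dirty room."], make_vader())` would then return one compound score per review.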
2 Tripadvisor Crawler
For full credit in this assignment, all students will be required to crawl a specific dataset of hotel
reviews for a city that does not overlap with any classmate's choice. Note that crawling is time-consuming and cannot be done at the last minute. There are some pre-collected datasets that you can
choose to use, but, if you use the pre-collected data to finish the assignment, you will receive a
20% deduction from your assignment score.
To use the Tripadvisor crawler, please download the Python files from
https://github.com/aesuli/trip-advisor-crawler. First run the crawler trip-advisor-crawler.py from the command
line with your own city selection, and then parse all crawled HTML files into CSV using trip-advisor-parser.py
before loading the result into your DSVM environment. More details are provided below.
2.1 Command-line Examples
If you search for hotel reviews for the city of Niagara Falls on Tripadvisor, you will see that the web address is
https://www.tripadvisor.ca/Hotels-g154998-Niagara_Falls_Ontario-Hotels.html. From
the web address, you will find two important pieces of information for the command line invocation.
First, the domain is ‘ca’. Second, the city code of Niagara Falls on Tripadvisor is 154998.
To download the reviews, you will also need to specify the local path for the files. In this
example, the location is the ‘data’ folder under your assignment folder. Then you can extract the
reviews using the following command line:
$ python3 trip-advisor-crawler.py -o data ca:154998
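The two pieces of information described above can also be pulled out of the listing URL programmatically. The following is a minimal sketch that assumes listing URLs always follow the `Hotels-g<code>-` pattern seen in the Niagara Falls example; the function names `parse_listing_url` and `crawler_command` are hypothetical helpers, not part of the crawler.

```python
import re
from typing import List, Tuple

# Pattern for listing URLs such as
# https://www.tripadvisor.ca/Hotels-g154998-Niagara_Falls_Ontario-Hotels.html
_LISTING_URL = re.compile(r"https?://www\.tripadvisor\.([a-z.]+)/Hotels-g(\d+)-")


def parse_listing_url(url: str) -> Tuple[str, str]:
    """Return (domain, city_code) extracted from a hotel listing URL."""
    match = _LISTING_URL.match(url)
    if match is None:
        raise ValueError(f"Not a recognised Tripadvisor hotel listing URL: {url}")
    return match.group(1), match.group(2)


def crawler_command(url: str, out_dir: str = "data") -> List[str]:
    """Build the crawler invocation shown above as an argument list."""
    domain, city_code = parse_listing_url(url)
    return ["python3", "trip-advisor-crawler.py", "-o", out_dir, f"{domain}:{city_code}"]
```

The returned list can be passed to `subprocess.run(..., check=True)` if you prefer to launch the crawler from Python rather than from the shell.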
The downloaded reviews are in HTML format. You still need to convert the data into CSV
format so that it can be loaded easily into Python. Fortunately, you do NOT need to write the code
yourself. Please run the following command after the extraction in the previous step:
$ python3 parser.py -d data -o reviews.csv
Warning: You may not be able to crawl all the hotels within a city; stop when you have
crawled for more than 10 hours (i.e., run your crawler overnight). Be aware that there are two
additional things you need to do if you stop crawling manually or the file cannot be parsed due to
an incomplete crawl:
• In the ids.txt file, delete the last hotel’s reviews entirely (all reviews for that hotel).
• In the data folders, delete the folder of the same hotel entirely.
Helpful Hints: Choose a small town (not a city!) with about 10-20 hotels. Do not choose a town
with fewer than 10 hotels (this will provide insufficient data for the assignment) though feel free
to exceed 20 if you wish. Some hotels may have a lot of reviews (e.g., 1000), in which case you
can read through the crawler code to determine how to limit the number of retrieved reviews to a
maximum of 100. If you wish to perform review analysis for a different domain (e.g., restaurant
reviews from the Yelp academic dataset), please first discuss with Scott for approval.
3 Main Assignment
Please answer the questions below in an IPython notebook that you must submit via github. In
the following, ground truth rating (star rating) can be binarized; if so, explain your rationale for
how you binarize.
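One possible binarization is sketched below. The 4-star threshold is purely an assumption for illustration; the choice of threshold and its justification are left to you.

```python
def binarize_rating(stars: float, positive_threshold: float = 4.0) -> int:
    """Map a star rating to 1 (positive) or 0 (negative).

    The default threshold of 4 stars is an assumption chosen for
    illustration, not a requirement of the assignment.
    """
    return 1 if stars >= positive_threshold else 0
```

An alternative design would treat 3-star reviews as neutral and exclude them; whichever rule you adopt, state it explicitly in your notebook.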
Q1. Sentiment Analysis and Aggregation
(a) Compute average Vader sentiment and average ground truth rating per hotel.
(b) Rank hotels by
(i) Average Ground Truth Sentiment
(ii) Average Vader Compound Sentiment Score
Show both top-5 and bottom-5 for both ranking methods. Do they agree or are there interesting differences?
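The per-hotel aggregation and ranking in Q1 can be sketched with the standard library as follows. This assumes each review is a dict with a `hotel` key plus numeric fields such as `rating` and `vader`; these field names are hypothetical and will depend on how you load your CSV. A Pandas group-by would serve the same purpose.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_by_hotel(rows: Iterable[dict], key: str) -> Dict[str, float]:
    """Average the numeric field `key` over all reviews of each hotel."""
    values = defaultdict(list)
    for row in rows:
        values[row["hotel"]].append(row[key])
    return {hotel: mean(vals) for hotel, vals in values.items()}


def top_and_bottom(averages: Dict[str, float], k: int = 5) -> Tuple[List[str], List[str]]:
    """Return (top-k, bottom-k) hotel names; the bottom list starts with the worst."""
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked[:k], ranked[::-1][:k]
```

Calling `top_and_bottom(average_by_hotel(rows, "rating"))` and `top_and_bottom(average_by_hotel(rows, "vader"))` gives the two rankings to compare.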
Q2. Frequency Analysis
(a) Use term frequency of the words in (i) positive reviews and (ii) negative reviews, as determined by ground truth
sentiment, to rank the top-50 most frequent non-stopwords in the review collection. Do you
note anything interesting and/or locale-specific about these top-ranked words?
(b) Repeat this analysis for the top-50 noun phrases and note any interesting results.
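A minimal sketch of the word-level frequency count in Q2(a) is given below. It uses a simple regular-expression tokenizer and a tiny stand-in stopword list, both of which are assumptions; in practice you would likely use nltk's tokenizer and English stopword list. Noun-phrase extraction for part (b) needs a chunker and is not shown.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Tiny illustrative stopword list (an assumption); a fuller list such as
# nltk's English stopwords would normally be used instead.
STOPWORDS: Set[str] = {"the", "a", "an", "and", "was", "is", "it", "in", "of", "to", "we"}


def top_terms(reviews: Iterable[str], k: int = 50,
              stopwords: Set[str] = STOPWORDS) -> List[Tuple[str, int]]:
    """Return the k most frequent non-stopword tokens with their counts."""
    counts: Counter = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(k)
```

Run it once on the positive reviews and once on the negative reviews to obtain the two ranked lists.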
Q3. Mutual Information
(a) Use mutual information (MI) with ground truth sentiment to rank the top-50 most sentiment-bearing non-stopwords in the review collection. Do you note anything interesting and/or
locale-specific about these top-ranked words?
(b) Repeat this analysis for the top-50 noun phrases and note any interesting results.
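For reference, one common formulation of mutual information between a term and a binary class works from a 2x2 table of review counts. The sketch below follows that formulation; the function name and argument layout are assumptions, and the lab examples may compute MI differently (for instance via a library routine).

```python
from math import log2


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information (in bits) between a term and a binary class.

    n11: positive reviews containing the term
    n10: negative reviews containing the term
    n01: positive reviews without the term
    n00: negative reviews without the term
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, class positive
        (n10, n11 + n10, n10 + n00),  # term present, class negative
        (n01, n01 + n00, n11 + n01),  # term absent, class positive
        (n00, n01 + n00, n10 + n00),  # term absent, class negative
    ]
    # Empty cells contribute 0 (the limit of p * log p as p -> 0).
    return sum(
        (joint / n) * log2(n * joint / (term_total * class_total))
        for joint, term_total, class_total in cells
        if joint > 0
    )
```

A term that appears equally often in both classes scores 0, while a term that perfectly separates two equally sized classes scores 1 bit. Note that MI is symmetric in the class, so it ranks words by how sentiment-bearing they are without saying which sentiment they indicate.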
Q4. Pointwise Mutual Information
(a) For ground truth sentiment, calculate the top-50 words according to PMI of the word occurring
with (i) positive reviews and (ii) negative reviews. Do you note anything interesting and/or
locale-specific about these top-ranked words?
(b) Repeat this analysis for the top-50 noun phrases and note any interesting results.
(c) Repeat this analysis for the single top and single bottom hotel (according to the ground
truth rating). Do you gain any useful hotel-specific insights about what is good and bad
about these two hotels? If not, explain why not.
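Pointwise mutual information compares how often a term co-occurs with a class against what independence would predict. The sketch below estimates the probabilities from review counts; the function name and the count-based arguments are assumptions made for illustration.

```python
from math import log2


def pmi(n_term_and_class: int, n_term: int, n_class: int, n_total: int) -> float:
    """PMI (in bits) of a term occurring in reviews of a given class.

    PMI = log2( P(term, class) / (P(term) * P(class)) ), with probabilities
    estimated as review counts divided by n_total.
    """
    if n_term_and_class == 0:
        return float("-inf")  # the term never occurs with this class
    return log2(n_term_and_class * n_total / (n_term * n_class))
```

Because PMI tends to be inflated for very rare terms, it is common to discard terms below some minimum count before ranking; this matters especially in part (c), where a single hotel contributes few reviews.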
Q5. General Plots
(a) Histogram
(a) Show separate histograms of ground truth and Vader sentiment scores (ignore hotel ID).
Do you notice any interesting differences?
(b) Show a histogram of the number of reviews per hotel. Do you notice any interesting
trends? Are these expected?
(b) Boxplots
(a) In two plots, one for ground truth star rating and one for Vader sentiment, show a plot
of 5 side-by-side boxplots of these scores.
(b) Report the mean and variance of the ground truth and Vader sentiment scores for the
top-5 ranked hotels according to star rating.
(c) Which do you find more informative, the boxplots or the mean and variance, or are they
equally informative? Why?
(c) Scatterplots and heatmaps
(a) Show both a scatterplot and heatmap of ground truth score (star rating) versus Vader
sentiment score. Each review is a point on the scatterplot. Do you notice anything
interesting? What does this tell you about star ratings vs. Vader sentiment scores?
What does this tell you about human ratings and/or Vader sentiment analysis?
(b) Show two scatterplots and two heatmaps of the length of reviews versus each of ground
truth score and Vader sentiment score. Each review is a point on the scatterplot. Are
there any trends?
(c) Show two scatterplots of the number of reviews per hotel versus each of average ground
truth score and average Vader sentiment score. In this case, each hotel is a single point
on the scatterplot. Are there any trends?
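When comparing boxplots with the mean and variance, it can help to compute the underlying numbers side by side. The sketch below uses only the standard library and reports quartiles computed with the `inclusive` method; a plotting library may compute quartiles and whiskers slightly differently, so treat the output as approximately what a boxplot displays. The function name `summarize` is hypothetical.

```python
from statistics import mean, quantiles, variance
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, sample variance, and a five-number summary (needs >= 2 scores)."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "mean": mean(scores),
        "variance": variance(scores),  # sample variance (n - 1 denominator)
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }
```

Applying `summarize` to the star ratings and to the Vader scores of each top-5 hotel gives the values requested in the boxplot question in one place.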
4 Clarifications
• This assignment is graded entirely via code review – your submitted IPython notebook
must contain all experimental output – notebooks with empty output cells will not be graded.
• While there is no AutoGrader for this assignment, you must still submit on github by
the required deadline. Timestamps are checked.
• As a backup in case your cell data is lost, you should add and commit a .pdf version of your
submitted .ipynb to your github repo. (This .pdf should show up when you browse your
assignment repo via the web interface.)
• Please do not commit/push the crawled data to your github Repository (github is not for
data storage) – it should only contain your IPython notebook.
• Some crawled reviews do not contain any words/strings. You should perform data preprocessing to remove those rows. Real data is always dirty.
• Before you use a Pandas Dataframe for merge and join operations, please first read and
understand the respective function descriptions in the online official Pandas documentation.
• You can use the pre-collected data from the reviews.zip file that is downloaded and unzipped with the command line provided in your assignment file (ds-assignment.ipynb).
However, as noted previously, using pre-collected data results in a 20% deduction
from your assignment score.
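The preprocessing step mentioned above, removing reviews that contain no text, can be sketched as follows. It assumes each review is a dict whose text lives under a key named `text`; that key name is hypothetical and should be replaced with the column name in your parsed CSV. The same filter can be expressed on a Pandas DataFrame.

```python
from typing import Iterable, List


def drop_empty_reviews(rows: Iterable[dict], text_key: str = "text") -> List[dict]:
    """Keep only rows whose review text has at least one non-space character."""
    kept = []
    for row in rows:
        text = row.get(text_key)
        if isinstance(text, str) and text.strip():
            kept.append(row)
    return kept
```

Rows with missing, empty, or whitespace-only text are dropped, so later steps such as tokenization and Vader scoring never see an empty string.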