Description
1. Dataset Preparation (10 points)
We will use the Amazon reviews dataset, which contains real reviews for
beauty products sold on Amazon. The dataset is downloadable at:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
(a)
Read the data into a Pandas data frame and keep only the Reviews and
Ratings fields of the input data frame. Our goal is to train sentiment
analysis classifiers that can predict the rating class for a given review.
We create a three-class classification problem according to the ratings.
To this end, let ratings with the values of 1 and 2 form class 1, ratings
with the value of 3 form class 2, and ratings with the values of 4 and 5
form class 3. The original dataset is large. To avoid the computational
burden, select 20,000 random reviews from each rating class and create a
balanced dataset to perform the required tasks on the downsized dataset.
Split your dataset into an 80% training set and a 20% testing set. Note
that you can split your dataset after step 4, when the TF-IDF features are
extracted.
Follow the given order of data processing, but you can change the order if
it improves your final results.
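The class mapping, balanced sampling, and 80/20 split described above can be sketched as follows. This is a minimal illustration, not a reference solution: the column names `review_body` and `star_rating`, the helper names, and the default seed are assumptions that should be checked against the header of the downloaded TSV file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed column names; verify them against the TSV header.
TEXT_COL = "review_body"
RATING_COL = "star_rating"


def load_reviews(path):
    """Read the (optionally gzipped) TSV, keeping only text and rating."""
    return pd.read_csv(path, sep="\t", usecols=[TEXT_COL, RATING_COL],
                       on_bad_lines="skip")


def rating_to_class(rating):
    """Map star ratings to classes: {1, 2} -> 1, {3} -> 2, {4, 5} -> 3."""
    if rating <= 2:
        return 1
    if rating == 3:
        return 2
    return 3


def build_balanced_dataset(df, n_per_class=20000, seed=0):
    """Sample n_per_class reviews from each of the three classes."""
    df = df.copy()
    # Ratings may be read as strings; coerce and drop unusable rows.
    df[RATING_COL] = pd.to_numeric(df[RATING_COL], errors="coerce")
    df = df.dropna(subset=[TEXT_COL, RATING_COL])
    df["label"] = df[RATING_COL].apply(rating_to_class)
    # sample() raises ValueError if a class has fewer than n_per_class rows.
    parts = [group.sample(n=n_per_class, random_state=seed)
             for _, group in df.groupby("label")]
    balanced = pd.concat(parts).sample(frac=1.0, random_state=seed)
    return balanced.reset_index(drop=True)


def split_dataset(df, seed=0):
    """Stratified 80/20 split into training and testing portions."""
    return train_test_split(df[TEXT_COL], df["label"], test_size=0.2,
                            stratify=df["label"], random_state=seed)
```

Stratifying the split keeps the three classes balanced in both portions, which is a design choice rather than a requirement of the assignment.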
2. Data Cleaning (20 points)
Use some data cleaning steps to preprocess the dataset you created. For
example, you can:
– convert all reviews to lowercase
– remove HTML and URLs from the reviews
– remove non-alphabetical characters
– remove extra spaces
– expand contractions in the reviews, e.g., won’t → will not. Include as
many English contractions as you can think of.
You can use other cleaning procedures that can help to improve performance. You can either use Pandas package functions or any other built-in
functions. Do not try to implement the above processes manually.
In your report, print the average character length of the reviews in your dataset before and after cleaning (to be printed by the .py
file).
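One possible shape for the cleaning steps listed above, using only the standard `re` module, is sketched here. The contraction table is deliberately short and purely illustrative, the regular expressions are simple approximations, and the function names are hypothetical.

```python
import re

# Illustrative, incomplete contraction table; extend it for real use.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
    "'m": " am",
}

# Longer keys first so "won't" is matched before the generic "n't".
_CONTRACTION_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def clean_review(text):
    """Lowercase, strip HTML and URLs, expand contractions, keep letters."""
    text = str(text).lower()
    text = text.replace("\u2019", "'")                   # curly apostrophe
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = expand_contractions(text)
    text = re.sub(r"[^a-z\s]", " ", text)                # non-alphabetical
    return re.sub(r"\s+", " ", text).strip()             # extra spaces


def average_length(texts):
    """Average character length over an iterable of strings."""
    texts = list(texts)
    return sum(len(t) for t in texts) / len(texts)
```

With a Pandas data frame, `clean_review` can be applied to the review column through `Series.map`, and the two averages can be computed on the column before and after that call.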
3. Preprocessing (20 points)
Use the NLTK package to process your dataset:
– remove the stop words
– perform lemmatization
In your report and the .py file, print the average character length of the
reviews before and after preprocessing.
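A minimal sketch of stop-word removal and lemmatization with NLTK follows. It assumes the reviews have already been cleaned and can be tokenized by whitespace. The helper names are hypothetical, and the NLTK resources are downloaded inside `load_nltk_tools` so that the text-processing function itself stays independent of how those resources are obtained.

```python
def load_nltk_tools():
    """Download NLTK resources and return (stop_words, lemmatize_fn)."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)  # required by some NLTK versions
    stop_words = set(stopwords.words("english"))
    return stop_words, WordNetLemmatizer().lemmatize


def preprocess(text, stop_words, lemmatize):
    """Drop stop words, then lemmatize each remaining token."""
    tokens = text.split()  # assumes cleaned, whitespace-separated text
    kept = [lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)
```

Typical usage would be `stop_words, lemmatize = load_nltk_tools()` followed by `preprocess(review, stop_words, lemmatize)` for each review. Note that NLTK's English stop-word list contains negations such as "not", so removing stop words after expanding contractions may discard sentiment cues; whether to keep them is a choice worth testing.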
4. Feature Extraction (10 points)
Use sklearn to extract TF-IDF features. At this point, you should have
created a dataset that consists of features and labels for the reviews you
selected.
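TF-IDF extraction with sklearn could look like the sketch below. It splits before fitting the vectorizer so that the vocabulary and document frequencies come from the training portion only; this is one option, since the assignment also permits splitting after the features are extracted. The function name and the default `TfidfVectorizer` settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def tfidf_split(texts, labels, test_size=0.2, seed=0):
    """Split the reviews, then fit TF-IDF on the training portion only."""
    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    vectorizer = TfidfVectorizer()  # default settings; tune as needed
    X_train = vectorizer.fit_transform(train_txt)  # learn vocabulary + IDF
    X_test = vectorizer.transform(test_txt)        # reuse the fitted vocabulary
    return X_train, X_test, y_train, y_test, vectorizer
```

Options such as `ngram_range` or `min_df` on `TfidfVectorizer` are natural places to experiment when trying to improve the classifiers.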
5. Perceptron (10 points)
Train a Perceptron model on your training dataset using the sklearn built-in
implementation.
Study the generalizations of Precision, Recall, and f1-score to multiclass settings. Report Precision, Recall, and f1-score per class and their
averages on the testing split of your dataset. These 12 values should be
printed in separate lines by the .py file.
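In the multiclass setting, Precision, Recall, and f1-score are computed per class by treating that class as positive and all others as negative. The sketch below assumes that "average" means the macro average, i.e. the unweighted mean of the per-class values; on a balanced test set this is close to the weighted average. The function names and the Perceptron settings are illustrative.

```python
from sklearn.linear_model import Perceptron
from sklearn.metrics import precision_recall_fscore_support


def per_class_and_average(y_true, y_pred, labels=(1, 2, 3)):
    """Return 4 rows of (precision, recall, f1): one per class + macro mean."""
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(labels), zero_division=0
    )
    rows = [(float(a), float(b), float(c)) for a, b, c in zip(p, r, f)]
    # Macro average: unweighted mean of the per-class scores (an assumption).
    rows.append((float(p.mean()), float(r.mean()), float(f.mean())))
    return rows


def train_perceptron_and_report(X_train, y_train, X_test, y_test, seed=0):
    """Fit sklearn's Perceptron and score it on the testing split."""
    model = Perceptron(random_state=seed)
    model.fit(X_train, y_train)
    return per_class_and_average(y_test, model.predict(X_test))
```

Each returned row can be printed as three comma-separated values, giving four lines and 12 values in total.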
6. SVM (10 points)
Train an SVM model on your training dataset using the sklearn built-in
implementation. Report Precision, Recall, and f1-score per class and their
averages on the testing split of your dataset. These 12 values should be
printed in separate lines by the .py file.
7. Logistic Regression (10 points)
Train a Logistic Regression model on your training dataset using the sklearn
built-in implementation. Report Precision, Recall, and f1-score per class and
their averages on the testing split of your dataset. These 12 values should be
printed in separate lines by the .py file.
8. Multinomial Naive Bayes (10 points)
Train a Multinomial Naive Bayes model on your training dataset using the
sklearn built-in implementation. Report Precision, Recall, and f1-score per
class and their averages on the testing split of your dataset. These 12 values
should be printed in separate lines by the .py file.
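The SVM, Logistic Regression, and Multinomial Naive Bayes questions share the same train-and-evaluate pattern, which can be sketched with one loop. The specific estimator choices are assumptions: `LinearSVC` is used as the SVM because linear models tend to scale well to sparse TF-IDF features, and `max_iter=1000` is an arbitrary setting to help Logistic Regression converge. Only the macro averages are returned here to keep the example short.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_models(X_train, y_train, X_test, y_test):
    """Fit three sklearn classifiers; return macro (P, R, F1) per model."""
    models = {
        "SVM": LinearSVC(),                                  # assumed SVM variant
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multinomial Naive Bayes": MultinomialNB(),          # needs features >= 0
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        p, r, f, _ = precision_recall_fscore_support(
            y_test, model.predict(X_test), average="macro", zero_division=0
        )
        results[name] = (float(p), float(r), float(f))
    return results
```

Multinomial Naive Bayes requires non-negative features, which TF-IDF values satisfy. Hyperparameters such as `C` for the SVM and Logistic Regression or `alpha` for Naive Bayes are reasonable candidates to tune.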
Note 1: For questions 5-8, part of grading is based on being competitive.
For each question, we will sort the computed average precision values across
the class. For each question, the top 40% will receive full credit, the next
30% will lose 1 point, and the bottom 30% will lose 2 points. We use
this grading scheme to motivate you to explore ideas for increasing your
performance values.
Note 2: To be consistent, when the .py file is run, the following should
be printed, each on its own line:
– Average length of reviews before and after data cleaning (with a comma
between them)
– Average length of reviews before and after data preprocessing (with a
comma between them)
– Precision, Recall, and f1-score for the testing split in 4 lines (in the order
of rating classes and then the average) for Perceptron (with a comma between
the three values)
– Precision, Recall, and f1-score for the testing split in 4 lines (in the order
of rating classes and then the average) for SVM (with a comma between the
three values)
– Precision, Recall, and f1-score for the testing split in 4 lines (in the order
of rating classes and then the average) for Logistic Regression (with a comma
between the three values)
– Precision, Recall, and f1-score for the testing split in 4 lines (in the
order of rating classes and then the average) for Naive Bayes (with a comma
between the three values)
In both your Jupyter notebook and your .py file, print the Precision,
Recall, and f1-score for the above models in separate lines.
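The required output layout amounts to 18 lines: two lines of average lengths followed by four lines for each of the four models. A small sketch that assembles these lines is given below. The function name, the dictionary keys, and the choice of ", " as the separator are assumptions, since the text only asks for a comma between values.

```python
MODEL_ORDER = ["Perceptron", "SVM", "Logistic Regression", "Naive Bayes"]


def build_output_lines(clean_lengths, prep_lengths, results):
    """Assemble the output lines in the order listed in Note 2.

    clean_lengths, prep_lengths: (before, after) average lengths.
    results: maps each model name to 4 rows of (precision, recall, f1),
             i.e. classes 1, 2, 3 and then the average.
    """
    lines = [
        f"{clean_lengths[0]}, {clean_lengths[1]}",
        f"{prep_lengths[0]}, {prep_lengths[1]}",
    ]
    for name in MODEL_ORDER:
        rows = results[name]
        if len(rows) != 4:
            raise ValueError(f"{name}: expected 4 rows, got {len(rows)}")
        for precision, recall, f1 in rows:
            lines.append(f"{precision}, {recall}, {f1}")
    return lines
```

Calling `print("\n".join(build_output_lines(...)))` at the end of the script then emits everything in one place, which makes it easier to confirm that the order and line count match the specification.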