Description
CSCI544: Homework Assignment №1 sentiment analysis
This assignment gives you hands-on experience with text representations and the use of text classification for sentiment analysis.
Sentiment analysis is extensively used to study customer behavior using reviews and survey responses, online and social media, and healthcare materials for marketing and customer service applications. The assignment is accompanied by a Jupyter Notebook to structure your code.
Please submit:
1. A PDF report which contains answers to the questions in the assignment along with brief explanations of your solution. Please also print the completed Jupyter Notebook in PDF format and merge it with your report. If you prefer, you may answer the questions in the Jupyter Notebook as well; in either case, submit a single PDF file that merges your written answers and the completed Jupyter Notebook. In your completed Jupyter Notebook, please also print the requested values.
2. An executable .py file which, when run, generates the requested numerical outputs listed at the end of the assignment description. We need the .py file to check overlap between submissions to detect plagiarism. Please include the Python version you use.
The libraries that you will need are included in the HW1.ipynb file. You can use other libraries as long as they are reasonably similar to the ones included in the HW1.ipynb file, but do not use more advanced libraries. If you use an online resource, you need to cite it and explain how you used it. At the beginning of the .py file, add a read command that reads the data.tsv file from the current directory as the input to your .py file.
1. Dataset Preparation (10 points)
We will use the Amazon reviews dataset which contains real reviews for office
products sold on Amazon. The dataset is downloadable at:
https://web.archive.org/web/20201127142707if_/https://s3.amazonaws.
com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_
00.tsv.gz
Be patient, as it may take some time for the dataset to download, but it should be done in a few minutes.
(a)
Read the data as a Pandas data frame using the Pandas package and keep only the Reviews and Ratings fields in the input data frame. Our goal is to train sentiment analysis classifiers.
We create a binary classification problem according to the ratings. Let ratings with the values 1, 2, and 3 form class 1, and ratings with the values 4 and 5 form class 2. The original dataset is large. To avoid the computational burden, select 50,000 random reviews from each rating class to create a balanced dataset, and perform the required tasks on this downsized dataset. Split your dataset into an 80% training set and a 20% testing set. Note that you can split your dataset after step 4, once the TF-IDF features are extracted.
Follow the given order of data processing, but you may change the order if it improves your final results.
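A possible sketch of this step is shown below. It assumes the downloaded TSV exposes the rating and review text under the column names star_rating and review_body; verify against your copy of the file and adjust as needed.

import pandas as pd

# Assumed column names; check them against the actual data.tsv file.
df = pd.read_csv("data.tsv", sep="\t", on_bad_lines="skip",
                 usecols=["star_rating", "review_body"])
df["star_rating"] = pd.to_numeric(df["star_rating"], errors="coerce")
df = df.dropna()

# Class 1: ratings 1-3, class 2: ratings 4-5.
df["label"] = df["star_rating"].map(lambda r: 2 if r >= 4 else 1)

# Balanced downsampling: 50,000 random reviews per class.
balanced = (df.groupby("label", group_keys=False)
              .apply(lambda g: g.sample(n=50000, random_state=42))
              .reset_index(drop=True))
# The 80%/20% split can be done here with sklearn's train_test_split,
# or deferred until after the TF-IDF features are extracted (step 4).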
2. Data Cleaning (20 points)
Use data cleaning steps to preprocess the dataset you created. For example, you can:
– convert all reviews to lowercase
– remove HTML tags and URLs from the reviews
– remove non-alphabetical characters
– remove extra spaces
– expand contractions in the reviews, e.g., won’t → will not. Include as many English contractions as you can think of.
You can use other cleaning procedures that help to improve performance. You can use either Pandas functions or any other built-in functions; do not try to implement the above processes manually.
In your report, print the average length of the reviews (in characters) in your dataset before and after cleaning (to be printed by the .py file).
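A minimal cleaning sketch under these assumptions: the contraction map is illustrative and should be extended as the assignment asks, and the data frame and column names come from the sketch in section 1.

import re

# Illustrative, incomplete contraction map; extend it with more English contractions.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not",
                "'re": " are", "'ve": " have", "'ll": " will", "'d": " would"}

def clean_review(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    for pat, repl in CONTRACTIONS.items():
        text = text.replace(pat, repl)             # expand contractions
    text = re.sub(r"[^a-z\s]", " ", text)          # keep alphabetical characters only
    text = re.sub(r"\s+", " ", text).strip()       # collapse extra spaces
    return text

balanced["clean"] = balanced["review_body"].apply(clean_review)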
3. Preprocessing (20 points)
Use the NLTK package to process your dataset:
– remove the stop words
– perform lemmatization
In your report and the .py file, print the average length of the reviews (in characters) before and after preprocessing.
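A minimal preprocessing sketch with NLTK, assuming the cleaned column from the previous sketch; whitespace tokenization is a simplification and you may use an NLTK tokenizer instead.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Remove stop words and lemmatize the remaining tokens.
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

balanced["processed"] = balanced["clean"].apply(preprocess)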
4. Feature Extraction (10 points)
Use sklearn to extract both TF-IDF and Bag of Words (BoW) features.
Note that BoW may need a little more programming but is not difficult to generate. At this point, you should have created two datasets that consist of features and labels for the reviews you selected.
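A possible sketch of this step with sklearn, assuming the processed column and labels from the earlier sketches; the split is done here, as allowed in section 1.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    balanced["processed"], balanced["label"], test_size=0.2,
    random_state=42, stratify=balanced["label"])

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train_txt)   # fit on training data only
X_test_tfidf = tfidf.transform(X_test_txt)

bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train_txt)       # raw counts as BoW features
X_test_bow = bow.transform(X_test_txt)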
5. Perceptron (10 points)
Train a Perceptron model on your training dataset using the sklearn built-in
implementation.
Report Precision, Recall, and f1-score for the Perceptron trained using both BoW and TF-IDF features. These 6 values should be printed by the .py file on two separate lines, first for BoW and then for TF-IDF, as follows:
– Precision Recall F1
– Precision Recall F1
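A minimal training and reporting sketch, assuming the feature matrices from step 4; treating class 2 as the positive class is an assumption, and you may report macro-averaged scores instead.

from sklearn.linear_model import Perceptron
from sklearn.metrics import precision_recall_fscore_support

def fit_and_report(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    # Class 2 (ratings 4-5) is taken as the positive class here.
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="binary", pos_label=2)
    print(p, r, f1)

fit_and_report(Perceptron(), X_train_bow, y_train, X_test_bow, y_test)      # BoW line
fit_and_report(Perceptron(), X_train_tfidf, y_train, X_test_tfidf, y_test)  # TF-IDF line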
6. SVM (10 points)
Train an SVM model on your training dataset using the sklearn built-in implementation. Report Precision, Recall, and f1-score in the same format as the previous question, in output lines 3 and 4.
7. Logistic Regression (10 points)
Train a Logistic Regression model on your training dataset using the sklearn built-in implementation. Report Precision, Recall, and f1-score in the same format as the previous question, in output lines 5 and 6 of the .py file.
8. Naive Bayes (10 points)
Train a Naive Bayes model on your training dataset using the sklearn built-in implementation. Report Precision, Recall, and f1-score in the same format as the previous question, in output lines 7 and 8 of the .py file.
Note 1: To be consistent, when the .py file is run, the following should be printed, each on its own line:
– Average length of reviews before and after data cleaning (with a comma between them)
– Average length of reviews before and after data preprocessing (with a comma between them)
– Precision, Recall, and f1-score of the Perceptron on the testing split (2 lines: BoW, then TF-IDF)
– Precision, Recall, and f1-score of the SVM on the testing split (2 lines)
– Precision, Recall, and f1-score of Logistic Regression on the testing split (2 lines)
– Precision, Recall, and f1-score of Naive Bayes on the testing split (2 lines)
Note that in your Jupyter notebook, print the Precision, Recall, and f1-score for the above models separated by spaces, and in the .py file on separate lines.
Note 2: Your models should have a decent performance to receive full credit. What counts as decent performance will be determined later by checking all submissions.
CSCI544: Homework Assignment №2 HMMs
Introduction
This assignment gives you hands-on experience with using HMMs for part-of-speech tagging. We will use the Wall Street Journal section of the Penn Treebank to build an HMM model for part-of-speech tagging. In the folder named data, there are three files: train, dev, and test. In the train and dev files, we provide you with sentences and their human-annotated part-of-speech tags. In the test file, we provide only the raw sentences, for which you need to predict the part-of-speech tags.
File Formats
The train.json, dev.json, and test.json files are in JSON format and can be read with

import json

with open('train.json') as f:
    train_data = json.load(f)
JSON Schema in train.json and dev.json:
The data is a list of (sentence, labels) records; each record has an index, a sentence (a list of words), and labels (a list of tags):

[
    {
        "index": 0,
        "sentence": ["This", "is", "a", "sample", "sentence."],
        "labels": ["label1", "label2", "label3", "label4", "label5"]
    },
    {
        "index": 1,
        "sentence": ["Another", "example", "sentence."],
        "labels": ["label1", "label2", "label3"]
    },
    {
        "index": 2,
        "sentence": ["Yet", "another", "sentence", "for", "analysis."],
        "labels": ["label1", "label2", "label3", "label4", "label5"]
    }
    // More records...
]
JSON Schema in test.json:
It is similar to the schema in train.json and dev.json, but we don’t give you
the labels.
[
    {
        "index": 0,
        "sentence": ["This", "is", "a", "sample", "sentence."]
    },
    {
        "index": 1,
        "sentence": ["Another", "example", "sentence."]
    },
    {
        "index": 2,
        "sentence": ["Yet", "another", "sentence", "for", "analysis."]
    }
    // More records...
]
Task 1: Vocabulary Creation (20 points)
The first task is to create a vocabulary using the training data. In HMMs, one important problem when creating the vocabulary is how to handle unknown words. One simple solution is to replace rare words, whose occurrences are less than a threshold (e.g., 3), with a special token '<unk>'.
Task. Generate a vocabulary from the training data stored in the train file and save this vocabulary as vocab.txt. The format of the vocab.txt file should adhere to the following specifications: each line should consist of a word type, its index within the vocabulary, and its frequency of occurrence, with these elements separated by the tab symbol (\t). The first line should contain the special token '<unk>', followed by subsequent lines sorted in descending order of occurrence frequency.
Please take into account that you are only allowed to use the training data to construct this vocabulary; you must refrain from using the development and test data.
For instance:
<unk> 0 2000
word1 1 20000
word2 2 10000
Additionally, kindly provide answers to the following questions:
What threshold value did you choose for identifying unknown words for replacement?
What is the overall size of your vocabulary, and how many times does the special token '<unk>' occur after the replacement process?
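A minimal sketch of one way to build vocab.txt; the threshold of 3 and the handling of frequency ties are choices, not requirements.

import json
from collections import Counter

with open('train.json') as f:
    train_data = json.load(f)

counts = Counter(word for record in train_data for word in record['sentence'])

threshold = 3
unk_count = sum(c for c in counts.values() if c < threshold)
kept = sorted(((w, c) for w, c in counts.items() if c >= threshold),
              key=lambda wc: -wc[1])

with open('vocab.txt', 'w') as f:
    f.write(f"<unk>\t0\t{unk_count}\n")
    for idx, (word, count) in enumerate(kept, start=1):
        f.write(f"{word}\t{idx}\t{count}\n")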
Task 2: Model Learning (20 points)
The second task is to learn an HMM from the training data. Remember that the solutions for the emission and transition parameters of the HMM are given by the following formulas:
t(s′ | s) = count(s → s′) / count(s)
e(x | s) = count(s → x) / count(s)
π(s) = count(null → s) / count(sentences)
t(·|·) is the transition parameter.
e(·|·) is the emission parameter.
π(·) is the initial-state parameter (the state with which a sentence begins), also called the prior probabilities.
Task. Learn a model using the training data in the train file and output the learned model to a model file in JSON format named hmm.json. The model file should contain two dictionaries, for the transition and emission parameters respectively. The first dictionary, named transition, contains items with pairs (s, s′) as keys and t(s′ | s) as values. The second dictionary, named emission, contains items with pairs (s, x) as keys and e(x | s) as values.
Additionally, kindly provide answers to the following question:
How many transition and emission parameters are there in your HMM?
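A minimal sketch of the counting step. The way the (s, s′) pairs are encoded as JSON keys (here "s,s′" as one string) is a choice you must make and then apply consistently when reloading the model.

import json
from collections import Counter

transition_counts = Counter()   # (prev_tag, tag) -> count
emission_counts = Counter()     # (tag, word) -> count
tag_counts = Counter()          # tag -> count
initial_counts = Counter()      # tag -> count of sentences starting with that tag

# For consistency with Task 1, you may replace rare words with '<unk>'
# before counting the emissions.
for record in train_data:
    words, tags = record['sentence'], record['labels']
    initial_counts[tags[0]] += 1
    for i, (word, tag) in enumerate(zip(words, tags)):
        tag_counts[tag] += 1
        emission_counts[(tag, word)] += 1
        if i > 0:
            transition_counts[(tags[i - 1], tag)] += 1

# Tuple-keyed dictionaries are convenient for decoding; keep the prior for later tasks.
transition = {(s, s2): c / tag_counts[s] for (s, s2), c in transition_counts.items()}
emission = {(s, x): c / tag_counts[s] for (s, x), c in emission_counts.items()}
prior = {s: c / len(train_data) for s, c in initial_counts.items()}

with open('hmm.json', 'w') as f:
    json.dump({'transition': {f"{s},{s2}": p for (s, s2), p in transition.items()},
               'emission': {f"{s},{x}": p for (s, x), p in emission.items()}}, f)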
Task 3: Greedy Decoding with HMM (30 points)
The third task is to implement the greedy decoding algorithm with the HMM.
Task. Implement the greedy decoding algorithm and evaluate it on the development data. Predict the part-of-speech tags of the sentences in the test data and output the predictions in a file named greedy.json, in the same format as the training data.
Additionally, kindly provide answers to the following questions:
What is the accuracy on the dev data?
Moreover, you are encouraged to use the following algorithm as a point of reference.
Algorithm 1 Greedy Decoding for a sentence
Input: an HMM with
    1. π(s_i): an element of the initial-state vector in R^|S|
    2. t(s_i | s_j): an element of the transition matrix in R^(|S| × |S|)
    3. e(w_i | s_i): an element of the emission matrix in R^(|W| × |S|)
where S is the set of all tags and W is the set of all words.
sentence ← {w_1, w_2, ..., w_T}
Output: {y_1, y_2, ..., y_T} (a list of tags)
function Decode({w_1, w_2, ..., w_T})
    y_1 ← argmax_{s ∈ S} π(s) · e(w_1 | s)
    for i ← 2 to T do
        y_i ← argmax_{s ∈ S} t(s | y_{i-1}) · e(w_i | s)
    end for
    return {y_1, y_2, ..., y_T}
end function
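A minimal Python sketch of this decoder, assuming the tuple-keyed prior, transition, and emission dictionaries from Task 2, a list of all tags, and a vocabulary set; the small default for unseen (tag, word) pairs is an assumption you may handle differently.

def greedy_decode(words, tags, prior, transition, emission, vocab, eps=1e-10):
    # Map out-of-vocabulary words to <unk>; this assumes rare training words
    # were also replaced with <unk> before the emission counts were computed.
    words = [w if w in vocab else '<unk>' for w in words]
    preds = []
    prev = None
    for i, w in enumerate(words):
        best_tag, best_score = None, -1.0
        for s in tags:
            trans = prior.get(s, eps) if i == 0 else transition.get((prev, s), eps)
            score = trans * emission.get((s, w), eps)
            if score > best_score:
                best_tag, best_score = s, score
        preds.append(best_tag)
        prev = best_tag
    return preds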
Task 4: Viterbi Decoding with HMM (30 Points)
The fourth task is to implement the Viterbi decoding algorithm with the HMM.
Task. Implement the Viterbi decoding algorithm and evaluate it on the development data. Predict the part-of-speech tags of the sentences in the test data and output the predictions in a file named viterbi.json, in the same format as the training data.
Additionally, kindly provide answers to the following question:
What is the accuracy on the dev data?
For the Viterbi algorithm, you can adapt the algorithm on Wikipedia to our use case:
https://en.wikipedia.org/wiki/Viterbi_algorithm
Please note that the meanings of x and y are interchanged in that article.
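A minimal Viterbi sketch in the same setting as the greedy decoder above; probabilities are multiplied directly here, and working in log space is a common, numerically safer alternative.

def viterbi_decode(words, tags, prior, transition, emission, vocab, eps=1e-10):
    words = [w if w in vocab else '<unk>' for w in words]
    T = len(words)
    # v[i][s]: best probability of any tag sequence ending in state s at position i
    v = [{} for _ in range(T)]
    backptr = [{} for _ in range(T)]
    for s in tags:
        v[0][s] = prior.get(s, eps) * emission.get((s, words[0]), eps)
    for i in range(1, T):
        for s in tags:
            best_prev, best_score = None, -1.0
            for s_prev in tags:
                score = v[i - 1][s_prev] * transition.get((s_prev, s), eps)
                if score > best_score:
                    best_prev, best_score = s_prev, score
            v[i][s] = best_score * emission.get((s, words[i]), eps)
            backptr[i][s] = best_prev
    # Backtrack from the best final state.
    last = max(v[T - 1], key=v[T - 1].get)
    path = [last]
    for i in range(T - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))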
Submission
Please follow the instructions and submit a zipped folder containing:
1. A txt file named vocab.txt, containing the vocabulary created from the training data. The format of the vocabulary file is that each line contains a word type, its index, and its occurrence count, separated by the tab symbol '\t' (see Task 1).
For example,
<unk> 0 2000
word1 1 20000
word2 2 10000
2. A json file named hmm.json, containing the emission and transition
probabilities (see task 2).
3. Two prediction files named greedy.json and viterbi.json, containing the
predictions of your model on the test data with the greedy and viterbi
decoding algorithms.
4. python code and a README file to describe how to run your code to
produce your prediction files. (see task 3 and task 4).
5. A PDF file which contains answers to the questions in the assignment
along with brief explanations about your solution.
CSCI544: Homework Assignment №3 Dataset Generation
1. Dataset Generation
We will use the Amazon reviews dataset from HW1. Load the dataset and build a balanced dataset of 100K reviews along with their labels through random selection, similar to HW1. You can store your dataset after generation and reuse it to reduce the computational load. For your experiments, use an 80%/20% training/testing split.
2. Word Embedding (25 points)
In this part of the assignment, you will generate Word2Vec features for the dataset you generated. You can use the Gensim library for this purpose. A helpful tutorial is available at the following link:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.
html
(a) (5 points)
Load the pretrained “word2vec-google-news-300” Word2Vec model and learn how to extract word embeddings for your dataset. Try to check the semantic similarities of the generated vectors using three examples of your own, e.g., King − Man + Woman = Queen, or excellent ∼ outstanding.
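A possible sketch of this step with Gensim's downloader API; the first load downloads a large model file.

import gensim.downloader as api

# Downloads the pretrained model on first use (about 1.6 GB).
wv = api.load("word2vec-google-news-300")

# Analogy check: King - Man + Woman ≈ Queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Similarity check: excellent ~ outstanding
print(wv.similarity("excellent", "outstanding"))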
(b) (20 points)
Train a Word2Vec model using your own dataset. You will use these extracted features in the subsequent questions of this assignment. Set the embedding size to 300 and the window size to 13. You can also consider a minimum word count of 9. Check the semantic similarities for the same examples as in part (a). What do you conclude from comparing the vectors generated by your own model and by the pretrained model? Which of the Word2Vec models seems to encode semantic similarities between words better?
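A minimal sketch for training your own model with the stated hyperparameters; the whitespace tokenization over your cleaned reviews is an assumption, and cleaned_reviews is a placeholder for the list of review strings from section 1.

from gensim.models import Word2Vec

# cleaned_reviews: a list of cleaned review strings built in section 1 (placeholder name).
sentences = [review.split() for review in cleaned_reviews]

own_w2v = Word2Vec(sentences, vector_size=300, window=13, min_count=9, workers=4)
print(own_w2v.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))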
For the rest of this assignment, use the pretrained “word2vec-google-news-300” Word2Vec features.
3. Simple models (20 points)
Using the Google pretrained Word2Vec features, train a single perceptron and an SVM model for the classification problem. For this purpose, use the average Word2Vec vector of each review as the input feature (x = (1/N) Σ_{i=1}^{N} W_i for a review with N words). Report your accuracy values on the testing split for these models similar to HW1, i.e., for each of the perceptron and SVM models, report two accuracy values: one for Word2Vec features and one for TF-IDF features.
What do you conclude from comparing the performance of the models trained using the two different feature types (TF-IDF and your trained Word2Vec features)?
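A possible sketch of the averaging and training step. The names train_texts, test_texts, y_train, and y_test are placeholders for your own split, and LinearSVC is one choice of sklearn SVM.

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def average_vector(tokens, wv, dim=300):
    # Average the Word2Vec vectors of the in-vocabulary tokens of a review.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

X_train = np.vstack([average_vector(r.split(), wv) for r in train_texts])
X_test = np.vstack([average_vector(r.split(), wv) for r in test_texts])

for model in (Perceptron(), LinearSVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))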
4. Feedforward Neural Networks (25 points)
Using the Word2Vec features, train a feedforward multilayer perceptron network for classification. Consider a network with two hidden layers of 50 and 5 nodes, respectively. You can use the cross-entropy loss and your own choice for other hyperparameters, e.g., nonlinearity, number of epochs, etc. Part of getting good results is selecting suitable values for these hyperparameters.
You can also refer to the following tutorial to familiarize yourself:
Although the above tutorial is for image data, the concept of training an MLP is very similar to what we want to do.
(a) (10 points)
To generate the input features, use the average Word2Vec vectors similar to
the “Simple models” section and train the neural network. Report accuracy
values on the testing split for your MLP.
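A minimal PyTorch sketch of such an MLP; the ReLU nonlinearity, the Adam optimizer, the full-batch loop, and the epoch count are illustrative choices, and X_train/y_train are the averaged features and 0/1 class indices (adjust your labels accordingly).

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=300, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 50), nn.ReLU(),
            nn.Linear(50, 5), nn.ReLU(),
            nn.Linear(5, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.tensor(X_train, dtype=torch.float32)  # averaged Word2Vec features
y = torch.tensor(y_train, dtype=torch.long)     # class indices in {0, 1}
for epoch in range(20):                         # mini-batching is a sensible improvement
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()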
(b) (15 points)
To generate the input features, concatenate the first 10 Word2Vec vectors of each review as the input feature (x = [W_1^T, ..., W_{10}^T]) and train the neural network. Report the accuracy value on the testing split for your MLP model.
What do you conclude by comparing the accuracy values you obtain with those obtained in the “Simple Models” section?
5. Recurrent Neural Networks (30 points)
Using the Word2Vec features, train a recurrent neural network (RNN) for
classification. You can refer to the following tutorial to familiarize yourself:
https://pytorch.org/tutorials/intermediate/char_rnn_classification_
tutorial.html
(a) (10 points)
Train a simple RNN for sentiment analysis. You can consider an RNN cell with a hidden state size of 10. To feed your data into the RNN, limit the maximum review length to 10 by truncating longer reviews and padding shorter reviews with a null value (0). Report the accuracy value on the testing split for your RNN model.
What do you conclude by comparing the accuracy values you obtain with those obtained with the feedforward neural network models?
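A minimal PyTorch sketch of the simple RNN variant; how reviews are turned into fixed-length sequences of Word2Vec vectors and how batches are built are assumptions left to you. For parts (b) and (c), nn.GRU or nn.LSTM can be swapped in for nn.RNN (note that nn.LSTM returns a (h_n, c_n) tuple as its second output).

import numpy as np
import torch
import torch.nn as nn

MAX_LEN = 10

def review_to_sequence(tokens, wv, dim=300):
    # Truncate to MAX_LEN in-vocabulary words; pad with zero vectors if shorter.
    vecs = [wv[t] for t in tokens if t in wv][:MAX_LEN]
    vecs += [np.zeros(dim, dtype=np.float32)] * (MAX_LEN - len(vecs))
    return np.stack(vecs)

class RNNClassifier(nn.Module):
    def __init__(self, in_dim=300, hidden=10, num_classes=2):
        super().__init__()
        self.rnn = nn.RNN(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):               # x: (batch, MAX_LEN, 300)
        _, h_n = self.rnn(x)            # h_n: (1, batch, hidden)
        return self.fc(h_n.squeeze(0))  # logits: (batch, num_classes)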
(b) (10 points)
Repeat part (a) by considering a gated recurrent unit cell.
(c) (10 points)
Repeat part (a) by considering an LSTM unit cell.
What do you conclude by comparing the accuracy values you obtain with the GRU, the LSTM, and the simple RNN?
Note: In total, you need to report accuracy values for:
2 (Perceptron + SVM) + 2 (FNN) + 3 (RNN) = 7 cases.
CSCI544: Homework Assignment №4 Named Entity Recognition (NER)
Introduction
This assignment gives you hands-on experience in building deep learning
models on named entity recognition (NER).
Here is a table of the models that will be covered in this assignment:

Model                  Expected F1 (dev)
BiLSTM                 77
BiLSTM with GloVe      88
Transformer Encoder    61

Table 1: Models
Dataset
We will use the CoNLL-2003 corpus to build a neural network for NER. Link: https://huggingface.co/datasets/conll2003. It is convenient to use the datasets library to get this dataset:

import datasets

dataset = datasets.load_dataset("conll2003")

Use the convenient .map() function to preprocess your dataset seamlessly. It can be used to add keys and update keys.
Example:

def convert_word_to_id(sample):
    # Code to convert all tokens to their respective indexes
    return {
        'input_ids': [
            word2idx[token]
            for token in sample['tokens']
        ]
    }

dataset = dataset.map(convert_word_to_id)

This adds the input_ids column to the dataset (note that .map() returns a new dataset, so assign its result).
Since we do not permit you to use 'pos_tags' and 'chunk_tags', remove the following columns from the dataset: ['pos_tags', 'chunk_tags']. Rename the ner_tags column to labels (a short sketch follows after the link below).
The string values of NER tags are https://huggingface.co/datasets/
conll2003#:~:text=%3A%2022%7D-,ner_tags,-%3A%20a%20list%20of
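A possible sketch of the column handling with the datasets library, using its remove_columns and rename_column methods:

dataset = dataset.remove_columns(["pos_tags", "chunk_tags"])
dataset = dataset.rename_column("ner_tags", "labels")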
GloVe Embeddings
We provide you with a file named glove.6B.100d.gz, which contains the GloVe word embeddings [1]. Alternatively, you can download it from https://nlp.stanford.edu/data/glove.6B.zip
Download script for .ipynb files:

!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
Evaluation
We will also provide you with an evaluation function 'evaluate' from: https://github.com/sighsmile/conlleval. Whenever we ask you for accuracy, precision, recall, or F1 score, we are referring you to use this script. If this script is not used, that will lead to a penalty.
Download script for .ipynb:

!wget https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py
Usage:

# This is an example and the code will fail
# because preds does not have the required length
from conlleval import evaluate
import itertools

# labels = ner_tags
# Map the labels back to their corresponding tag strings
labels = [
    list(map(idx2tag.get, labels))
    for labels in dataset['validation']['labels']
]
# This is the prediction by your model
preds = [
    ['O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
    ['B-LOC', 'O'],
    ...
]

precision, recall, f1 = evaluate(
    itertools.chain(*labels),
    itertools.chain(*preds)
)
Important: GPUs
Use Google Colab for the GPUs. You do not necessarily need to write an .ipynb file; a .py file works on Google Colab by using the following trick:
1. Upload the .py file to Google Colab. Let's name it task1.py.
2. pip install all requirements:

!pip install datasets accelerate

3. Run task1.py in a Colab cell using this line:

!python task1.py

Alternatively, you can use GitHub to push → clone to Colab.
If you cannot use Google Colab anymore, then Kaggle gives you 30 hours a week of free GPUs (more than enough to complete this homework 15 times in a week). The free tier of Kaggle is better than Colab, but you can only use the GitHub push → clone trick on Kaggle.
Task 1: Bidirectional LSTM model (40 points)
The first task is to build a simple bidirectional LSTM model for NER.
Implement the bidirectional LSTM network with PyTorch. The architecture of the network is:
Embedding → BiLSTM → Linear → ELU → classifier
There is no flattening; the linear layer is applied to every single hidden output of the BiLSTM layer.
The hyperparameters of the network are listed in the following table:

Layer hyperparameter    Value
Embedding dim           100
Num LSTM layers         1
LSTM hidden dim         256
LSTM dropout            0.33
Linear output dim       128

Table 2: Layer Specification
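A minimal PyTorch sketch of this architecture. The vocabulary size, the tag count, and the placement of the dropout are assumptions: with a single LSTM layer, PyTorch applies no internal LSTM dropout, so it is added as a separate layer here.

import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=256, linear_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.33)
        self.linear = nn.Linear(2 * hidden, linear_dim)
        self.elu = nn.ELU()
        self.classifier = nn.Linear(linear_dim, num_tags)

    def forward(self, input_ids):                 # (batch, seq_len)
        x = self.embedding(input_ids)             # (batch, seq_len, 100)
        x, _ = self.lstm(x)                       # (batch, seq_len, 512)
        x = self.elu(self.linear(self.dropout(x)))
        return self.classifier(x)                 # (batch, seq_len, num_tags)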
Train this BiLSTM model with the training data on NER with any optimizer
you like. Tune other parameters that are not specified in the above table,
such as batch size, learning rate, and learning rate scheduling.
Additionally, kindly provide answers to the following questions:
What are the precision, recall, and F1 score on the validation data?
What are the precision, recall, and F1 score on the test data?
Task 2: Using GloVe word embeddings (60 points)
Use the GloVe word embeddings to improve the BiLSTM from Task 1. Helpful link. The way we use the GloVe word embeddings is straightforward: we initialize the embeddings in our neural network with the corresponding vectors in GloVe. Freeze the embeddings. Note that GloVe is case-insensitive, but our NER model should be case-sensitive, because capitalization is important information for NER. You are asked to find a way to deal with this conflict.
You may use the same solution to boost the score for Task 1.
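A possible sketch of initializing the embedding matrix from glove.6B.100d. Lower-casing the lookup while preserving case information elsewhere (e.g., an extra capitalization feature) is one way to handle the case conflict; the exact solution is up to you. The names word2idx and model refer to your own vocabulary mapping and the Task 1 model.

import gzip
import numpy as np
import torch

# Load GloVe vectors: one word and 100 floats per line.
glove = {}
with gzip.open("glove.6B.100d.gz", "rt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

emb = np.random.normal(scale=0.1, size=(len(word2idx), 100)).astype(np.float32)
for word, idx in word2idx.items():
    vec = glove.get(word.lower())   # GloVe is lower-cased
    if vec is not None:
        emb[idx] = vec

model.embedding.weight.data.copy_(torch.from_numpy(emb))
model.embedding.weight.requires_grad = False   # freeze the embeddings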
Additionally, kindly provide answers to the following questions:
What is the precision, recall, and F1 score on the validation data?
What are the precision, recall, and F1 score on the test data?
The BiLSTM with GloVe embeddings outperforms the model without them. Can you provide a rationale for this?
Bonus: The Transformer Encoder (40 points)
Transformer [2] is currently the most dominant architecture in NLP research.
Let’s apply the transformer to the CoNLL-2003 dataset.
– Build a BERT-like model by stacking nn.TransformerEncoderLayer.
– Use nn.TransformerEncoder to stack the transformer encoder layers.
– Define a positional embedding for the transformer.
– Define a token embedding for the transformer.
Now define a class for your transformer model that uses:
1. Positional Embedding
2. Token Embedding
3. Transformer Encoder Stack
4. Linear Layer as a classifier
Figure 1: Transformer Encoder
Layer hyperparameter       Value
Embedding size             128
Num attention heads        8
Sequence max length        128
Feed-forward dimensions    128

Table 3: Transformer Specification
Do not forget to handle the src_key_padding_mask; it is very important for transformers.
The code from the above links works when batch_first=False.
This means that the input to the transformer is of dimension (sequence length, batch size, word vocab size), and the output of the transformer is of dimension (sequence length, batch size, tag vocab size).
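A minimal sketch of such a model. The learned positional embedding, the number of encoder layers, and the use of batch_first=True are choices rather than requirements; note how the padding token 0 is turned into the src_key_padding_mask.

import torch
import torch.nn as nn

class TransformerTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, d_model=128, nhead=8,
                 num_layers=4, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tags)

    def forward(self, input_ids):                     # (batch, seq_len)
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        pad_mask = input_ids.eq(0)                    # True where padded
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.classifier(x)                     # (batch, seq_len, num_tags)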
Additionally, provide answers to the following questions:
What is the precision, recall, and F1 score on the validation data?
What are the precision, recall, and F1 score on the test data?
What is the reason behind the poor performance of the transformer?
Submission
Please follow the instructions and submit a zipped folder containing:
1. A PDF file that contains answers to the questions in the assignment along with a clear description of your solution, including all the hyperparameters used in the network architecture and model training.
2. Python code that only produces results on test data for task 1 (no
training). You may need to save your models for this.
3. Python code that only produces results on test data for task 2 (no
training). You may need to save your models for this.
4. A README file describing how to run your code to produce your results. In the README file, you need to provide the command line to produce the prediction files. (We will execute your command to reproduce your reported results on the test data.)
5. BONUS: The code you used.
References
[1] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove:
Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),
pages 1532–1543, 2014.
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
is all you need. Advances in neural information processing systems, 30,
2017.