DS-GA-1011 Fall 2024: Representing and Classifying Text

Problem 1: Naive Bayes classifier
In this problem, we will study the decision boundary of the multinomial Naive Bayes model for binary
text classification. The decision boundary is often specified as the level set of a function,
{x ∈ X : h(x) = 0}, where x for which h(x) > 0 is in the positive class and x for which h(x) < 0 is
in the negative class.
1. [2 points] Give an expression of h(x) for the Naive Bayes model pθ(y | x), where θ denotes the
parameters of the model.
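
One natural choice (though not the only admissible one) is the log-odds of the two classes, which is
positive exactly when the model assigns higher probability to the positive class:

    h(x) = \log \frac{p_\theta(y = 1 \mid x)}{p_\theta(y = 0 \mid x)}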

2. [3 points] Recall that for multinomial Naive Bayes, we have the input X = (X_1, . . . , X_n), where n
is the number of words in an example. In general, n changes with each example, but we can ignore that
for now. We assume that X_i | Y = y ∼ Categorical(θ_{w_1,y}, . . . , θ_{w_m,y}), where Y ∈ {0, 1},
w_i ∈ V, and m = |V| is the vocabulary size. Further, Y ∼ Bernoulli(θ_1). Show that the multinomial
Naive Bayes model has a linear decision boundary, i.e. show that h(x) can be written in the form
w · x + b = 0.
[RECALL: The categorical distribution is a multinomial distribution with one trial. Its PMF is
p(x_1, . . . , x_m) = ∏_{i=1}^m θ_i^{x_i}, where x_i = 1[x = i], ∑_{i=1}^m x_i = 1, and ∑_{i=1}^m θ_i = 1.]
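
As a sketch of where the algebra should land: expanding the log-odds from part 1 under the Naive Bayes
factorization, and writing x_j for the number of times word w_j appears in the example (a notation the
problem statement leaves to you), gives

    h(x) = \sum_{j=1}^{m} x_j \log \frac{\theta_{w_j,1}}{\theta_{w_j,0}} + \log \frac{\theta_1}{1 - \theta_1},

which is of the form w · x + b with w_j = log(θ_{w_j,1} / θ_{w_j,0}) and b = log(θ_1 / (1 − θ_1)).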

3. [2 points] In the above model, Xi represents a single word, i.e. it’s a unigram model. Think of an
example in text classification where the Naive Bayes assumption might be violated. How would you
alleviate the problem?

4. [2 points] Since the decision boundary is linear, the Naive Bayes model works well if the data is
linearly separable. Discuss ways to make text data work in this setting, i.e. to make the data more
linearly separable, and the influence of these choices on model generalization.

Problem 2: Natural language inference
In this problem, you will build a logistic regression model for textual entailment. Given a premise
sentence and a hypothesis sentence, we would like to predict whether the hypothesis is entailed by the
premise, i.e. if the premise is true, then the hypothesis must be true.
Example:
label            premise                            hypothesis
entailment       The kids are playing in the park   The kids are playing
non-entailment   The kids are playing in the park   The kids are happy
1. [1 point] Given a dataset D = {(x^(i), y^(i))}_{i=1}^n where y ∈ {0, 1}, let ϕ be the feature
extractor and w be the weight vector. Write the maximum log-likelihood objective as a minimization
problem.
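
For reference, one common way to state it (assuming the labels are coded as y ∈ {0, 1} and using the
f_w notation from part 3 below) is the negative log-likelihood:

    \min_{w} \sum_{i=1}^{n} -\Big[ y^{(i)} \log f_w(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - f_w(x^{(i)})\big) \Big]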

2. [3 points, coding] We first need to decide the features to represent x. Implement
extract_unigram_features, which returns a BoW feature vector for the premise and the hypothesis.
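
A minimal sketch of such an extractor is below. It assumes each sentence arrives as a list of tokens and
that features can be returned as a dict from feature name to count; the starter code's actual signature
and data types may differ.

    from collections import defaultdict

    def extract_unigram_features(premise, hypothesis):
        """Sketch: bag-of-words counts, with a prefix marking which
        sentence each word came from so the two do not collide."""
        features = defaultdict(float)
        for word in premise:        # assumes a list of tokens
            features["premise:" + word] += 1.0
        for word in hypothesis:     # assumes a list of tokens
            features["hypothesis:" + word] += 1.0
        return features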

3. [2 points] Let ℓ(w) be the objective you obtained above. Compute the gradient of ℓ_i(w) given a
single example (x^(i), y^(i)). Note that ℓ(w) = ∑_{i=1}^n ℓ_i(w). You can use
f_w(x) = 1 / (1 + e^{−w·ϕ(x)}) to simplify the expression.
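
For reference, if the objective is written as the negative log-likelihood above, the per-example
gradient works out to the familiar logistic-regression form

    \nabla \ell_i(w) = \big( f_w(x^{(i)}) - y^{(i)} \big)\, \phi(x^{(i)})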

4. [5 points, coding] Use the gradient you derived above to implement learn_predictor. You must obtain
an error rate less than 0.3 on the training set and less than 0.4 on the test set using the unigram
feature extractor to get full credit.
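
A minimal stochastic-gradient-descent sketch is shown below. It assumes examples are
(premise, hypothesis, label) triples with labels in {0, 1} and that features are sparse dicts as in the
extractor above; the starter code's interfaces, learning rate, and number of epochs may differ.

    import math

    def dot(weights, features):
        # sparse dot product between the weight dict and a feature dict
        return sum(weights.get(f, 0.0) * v for f, v in features.items())

    def learn_predictor(train_examples, feature_extractor, num_epochs=20, lr=0.01):
        """Sketch: SGD on the logistic negative log-likelihood."""
        weights = {}
        for _ in range(num_epochs):
            for premise, hypothesis, y in train_examples:
                phi = feature_extractor(premise, hypothesis)
                f = 1.0 / (1.0 + math.exp(-dot(weights, phi)))  # f_w(x)
                step = lr * (y - f)   # SGD step: w <- w - lr * (f - y) * phi
                for feat, value in phi.items():
                    weights[feat] = weights.get(feat, 0.0) + step * value
        return weights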

5. [3 points] Discuss the potential problems with the unigram feature extractor and describe your
design of a better feature extractor.

6. [3 points, coding] Implement your feature extractor in extract_custom_features. You must get a
lower error rate on the dev set than what you got using the unigram feature extractor to get full
credit.
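
One possible design (by no means the only one, and not guaranteed to clear the dev-set bar) adds
premise/hypothesis cross-unigram pairs on top of the unigram features, so the model can pick up on word
overlap between the two sentences:

    def extract_custom_features(premise, hypothesis):
        """Sketch: unigram features plus premise/hypothesis cross-unigram pairs."""
        features = extract_unigram_features(premise, hypothesis)
        for p_word in premise:
            for h_word in hypothesis:
                key = "pair:" + p_word + "|" + h_word
                features[key] = features.get(key, 0.0) + 1.0
        return features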

7. [3 points] When you run the tests in test_classification.py, it will output a file
error_analysis.txt. (You may want to take a look at the error_analysis function in util.py.) Select
five examples misclassified by the classifier using custom features. For each example, briefly state
your intuition for why the classifier got it wrong and what information about the example would be
needed to get it right.

8. [3 points] Change extract_unigram_features so that it only extracts features from the hypothesis.
How does it affect the accuracy? Is this what you expected? If yes, explain why. If not, give a
hypothesis for why this is happening. Don’t forget to change the function back before you submit.
You do not need to submit your code for this part.

Problem 3: Word vectors
In this problem, you will implement functions to compute dense word vectors from a word co-occurrence
matrix using SVD, and explore similarities between words. You will be using the Python packages nltk
and numpy for this problem.
We will estimate word vectors using the corpus Emma by Jane Austen from nltk. Take a look at the
function read_corpus in util.py, which downloads the corpus.

1. [3 points, coding] First, let’s construct the word co-occurrence matrix. Implement the function
count_cooccur_matrix using a window size of 4 (i.e. considering the 4 words before and the 4 words
after the center word).
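
A possible shape for this function, assuming the corpus arrives as one flat list of tokens and that both
the vocabulary and the matrix are returned (the starter code's exact inputs and outputs may differ):

    import numpy as np

    def count_cooccur_matrix(tokens, window_size=4):
        """Sketch: symmetric co-occurrence counts within +/- window_size words."""
        vocab = sorted(set(tokens))
        word2ind = {w: i for i, w in enumerate(vocab)}
        matrix = np.zeros((len(vocab), len(vocab)))
        for center, word in enumerate(tokens):
            lo = max(0, center - window_size)
            hi = min(len(tokens), center + window_size + 1)
            for context in range(lo, hi):
                if context != center:
                    matrix[word2ind[word], word2ind[tokens[context]]] += 1
        return vocab, matrix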

2. [1 point] Next, let’s perform dimensionality reduction on the co-occurrence matrix to obtain dense
word vectors. You will implement truncated SVD using the numpy.linalg.svd function in the next part.
Read its documentation (https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)
carefully. Can we set hermitian to True to speed up the computation? Explain your answer in one
sentence.

3. [3 points, coding] Now, implement the cooccur_to_embedding function that returns word embeddings
based on truncated SVD.
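
A minimal sketch using numpy.linalg.svd, keeping the top-k singular directions (whether and how to
weight by the singular values, and the value of k, are choices the assignment may pin down):

    import numpy as np

    def cooccur_to_embedding(cooccur_matrix, k=100):
        """Sketch: truncated SVD of the co-occurrence matrix."""
        u, s, vt = np.linalg.svd(cooccur_matrix, full_matrices=False)
        return u[:, :k] * s[:k]   # each row is a k-dimensional word vector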

4. [2 points, coding] Let’s play with the word embeddings and see what they capture. In particular, we
will find the most similar words to a given word. In order to do that, we need to define a similarity
function between two word vectors. The dot product is one such metric. Implement it in top_k_similar
(where metric='dot').
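
A sketch of the dot-product branch, assuming embeddings is the matrix from the previous part and
word2ind maps words to row indices (both names are assumptions about the starter code):

    import numpy as np

    def top_k_similar(query_word, embeddings, word2ind, k=10, metric="dot"):
        """Sketch of the dot-product branch; cosine is added in a later part."""
        query = embeddings[word2ind[query_word]]
        scores = embeddings @ query               # dot product with every word vector
        scores[word2ind[query_word]] = -np.inf    # never return the query word itself
        ind2word = {i: w for w, i in word2ind.items()}
        return [ind2word[i] for i in np.argsort(-scores)[:k]]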

5. [1 point] Now, run test_embedding.py to get the top-k words. What’s your observation? Explain
why that is the case.

6. [2 points, coding] To fix the issue, implement the cosine similarity function in top_k_similar
(where metric='cosine').
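
Only the scoring step changes: normalize the vectors before taking dot products. A sketch of that
scoring, under the same assumptions as above:

    import numpy as np

    def cosine_scores(query, embeddings, eps=1e-12):
        # cosine similarity = dot product of L2-normalized vectors;
        # eps guards against division by zero for all-zero rows
        norms = np.linalg.norm(embeddings, axis=1) + eps
        return (embeddings @ query) / (norms * (np.linalg.norm(query) + eps))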

7. [1 point] Among the given word list, take a look at the top-k similar words of “man” and “woman”,
in particular the adjectives. How do they differ? Explain what makes sense and what is surprising.

8. [1 point] Among the given word list, take a look at the top-k similar words of “happy” and “sad”.
Do they contain mostly synonyms, mostly antonyms, or both? What do you expect and why?