Description
Problems
Problem 1. Hinge Loss
Here are two reviews of “Frozen,” courtesy of Rotten Tomatoes (no spoilers!):
Rotten Tomatoes has classified these reviews as “positive” and “negative,” respectively, as
indicated by the in-tact tomato on the left and the splattered tomato on the right. In this
assignment, you will create a simple text classification system that can perform this task
automatically.
Problem 1a [2 points]
We’ll warm up with the following set of four mini-reviews, each labeled positive (+1) or
negative (−1):
• (+1) so touching
• (+1) quite impressive
• (−1) not impressive
• (−1) quite boring
Each review x is mapped onto a feature vector ϕ(x), which maps each word to the number
of occurrences of that word in the review. For example, the first review maps to the (sparse)
feature vector ϕ(x) = {so : 1,touching : 1}. Recall the definition of the hinge loss:
Losshinge(x, y, w) = max{0, 1 − w · ϕ(x)y},
where y is the correct label.
Suppose we run stochastic gradient descent, updating the weights according to
w ← w − η∇wLosshinge(x, y, w),
once for each of the four examples in order. After the classifier is trained on the given four
data points, what are the weights of the six words (‘so’, ‘touching’, ‘quite’, ‘impressive’,
‘not’, ‘boring’) that appear in the above reviews? Use η = 1 as the step size and initialize
w = [0, …, 0]. Assume that ∇wLosshinge(x, y, w) = 0 when the margin is exactly 1.
Problem 2: Sentiment Classification
In this problem, we will build a binary linear classifier that reads movie reviews and guesses
whether they are “positive” or “negative.”
Problem 2a [2 points]
Implement the function extractWordFeatures, which takes a review (string) as input and
returns a feature vector ϕ(x) (you should represent the vector ϕ(x) as a dict in Python).
Problem 2b [8 points]
We’re going to train a linear predictor, which can be represented by a logistic regression
model. Here is the definition of linear predict:
fw(x) = (
+1 if w · ϕ(x) > 0
−1 if w · ϕ(x) < 0
=
(
+1 if σ(w · ϕ(x)) > 0.5
−1 if σ(w · ϕ(x)) < 0.5
where σ is a logistic(or sigmoid) function.
Your task is to implement the function learnPredictor using stochastic gradient descent,
minimizing the negative log-likelihood loss (NLL) defined as:
LossNLL(x, y, w) = − log(pw(y | x))
pw(y | x) = (
σ(w · ϕ(x)) if y = 1
1 − σ(w · ϕ(x)) if y = −1
You should first derive ∇wLossNLL(x, y, w), then exploit the formula to update weights
for each example. Also, you can print the training error and test error after each iteration
through the data, so it’s easy to see if your code is working.
Problem 2c [3 points]
The previous features include unigram(single) words only, which cannot consider the context
of a word in an utterance. In this task, we’ll incorporate n-gram words into features. In
other words, feature vector will counts number of occurrences of consecutive words of length
n. Implement extractNgramFeatures which extract n-gram word features.
Problem 3: K-means Clustering
Problem 3a [2 points]
Suppose we have a feature extractor ϕ that produces 2-dimensional feature vectors, and a
toy dataset Dtrain = {x1, x2, x3, x4} with
1. ϕ(x1) = [−1, 0]
2. ϕ(x2) = [2, 0]
3. ϕ(x3) = [0, 3]
4. ϕ(x4) = [4, 3]
Run 2-means on this dataset. What are the final cluster centers µ? Run this algorithm
until it converges, with initial centers:
1. µ1 = [−2, 0] and µ2 = [3, 0]
2. µ1 = [−1, −1] and µ2 = [2, 3]
Problem 3b [6 points]
Implement the kmeans function. You should initialize your k cluster centers to random
elements of examples.
After a few iterations of k-means, your centers will be very dense vectors. In order for
your code to run efficiently and to obtain full credit, you may need to precompute certain
quantities. As a reference, our code runs in under a second on all test cases. You might find
generateClusteringExamples in util.py useful for testing your code.