CM146 Problem Set 1: Decision trees
1 Maximum Likelihood Estimation [15 pts]
Suppose we observe the values of n independent random variables X1, . . . , Xn drawn from the same
Bernoulli distribution with parameter θ.1 In other words, for each Xi, we know that
P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.
Our goal is to estimate the value of θ from these observed values of X1 through Xn.
For any hypothetical value θ̂, we can compute the probability of observing the outcome X1, . . . , Xn if the true parameter value θ were equal to θ̂. This probability of the observed data is often called the likelihood, and the function L(θ) that maps each θ to the corresponding likelihood is called the likelihood function. A natural way to estimate the unknown parameter θ is to choose the θ that maximizes the likelihood function. Formally,

θ̂MLE = arg max_θ L(θ).
(a) Write a formula for the likelihood function, L(θ) = P(X1, . . . , Xn; θ). Your function should depend on the random variables X1, . . . , Xn and the hypothetical parameter θ. Does the likelihood function depend on the order in which the random variables are observed?
(b) Since the log function is increasing, the θ that maximizes the log likelihood ℓ(θ) = log(L(θ)) is the same as the θ that maximizes the likelihood. Find ℓ(θ) and its first and second derivatives, and use these to find a closed-form formula for the MLE.
(c) Suppose that n = 10 and the data set contains six 1s and four 0s. Write a short program likelihood.py that plots the likelihood function of this data for each value of θ̂ in {0, 0.01, 0.02, . . . , 1.0} (use np.linspace(…) to generate this spacing). For the plot, the x-axis should be θ and the y-axis L(θ). Scale your y-axis so that you can see some variation in its value. Include the plot in your writeup (there is no need to submit your code). Estimate θ̂MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood. Does the answer agree with the closed-form answer?
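A minimal sketch of such a script (assuming only numpy and matplotlib; the data enter only through the counts of 1s and 0s):

import numpy as np
import matplotlib.pyplot as plt

# Six 1s and four 0s, so L(theta) = theta^6 * (1 - theta)^4.
n_ones, n_zeros = 6, 4

thetas = np.linspace(0, 1, 101)                      # {0, 0.01, ..., 1.0}
likelihood = thetas**n_ones * (1 - thetas)**n_zeros

theta_mle = thetas[np.argmax(likelihood)]            # empirical maximizer
plt.plot(thetas, likelihood)
plt.axvline(theta_mle, linestyle='--', label='MLE = %.2f' % theta_mle)
plt.xlabel('theta')
plt.ylabel('L(theta)')
plt.legend()
plt.show()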
(d) Create three more likelihood plots: one where n = 5 and the data set contains three 1s and
two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where
n = 10 and there are five 1s and five 0s. Include these plots in your writeup, and describe
how the likelihood functions and maximum likelihood estimates compare for the different
data sets.
2 Splitting Heuristic for Decision Trees [14 pts]
Recall that the ID3 algorithm iteratively grows a decision tree from the root downwards. On each
iteration, the algorithm replaces one leaf node with an internal node that splits the data based on
one decision attribute (or feature). In particular, the ID3 algorithm chooses the split that reduces the entropy the most, but there are other choices. For example, since our goal in the end is to have the lowest error, why not instead choose the split that reduces error the most? In this problem, we will explore one reason why reducing entropy is a better criterion.

1This is a common assumption for sampling data, so we will denote it as iid, short for independent and identically distributed, meaning that each random variable has the same distribution and is drawn independently of all the other random variables.
Consider the following simple setting. Let us suppose each example is described by n boolean features: X = ⟨X1, . . . , Xn⟩, where Xi ∈ {0, 1}, and where n ≥ 4. Furthermore, the target function to be learned is f : X → Y, where Y = X1 ∨ X2 ∨ X3. That is, Y = 1 if X1 = 1 or X2 = 1 or X3 = 1, and Y = 0 otherwise. Suppose that your training data contains all of the 2^n possible examples, each labeled by f. For example, when n = 4, the data set would be
X1 X2 X3 X4 Y
0 0 0 0 0
1 0 0 0 1
0 1 0 0 1
1 1 0 0 1
0 0 1 0 1
1 0 1 0 1
0 1 1 0 1
1 1 1 0 1
0 0 0 1 0
1 0 0 1 1
0 1 0 1 1
1 1 0 1 1
0 0 1 1 1
1 0 1 1 1
0 1 1 1 1
1 1 1 1 1
(a) How many mistakes does the best 1-leaf decision tree make over the 2^n training examples? (The 1-leaf decision tree does not split the data even once. Make sure you answer for the general case when n ≥ 4.)
(b) Is there a split that reduces the number of mistakes by at least one? (That is, is there a
decision tree with 1 internal node with fewer mistakes than your answer to part (a)?) Why
or why not?
(c) What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?
(d) Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what
is it, and what is the resulting conditional entropy of Y given this split?
3 Entropy and Information [2 pts]
The entropy of a Bernoulli (Boolean 0/1) random variable X with p(X = 1) = q is given by
B(q) = −q log q − (1 − q) log(1 − q).
Suppose that a set S of examples contains p positive examples and n negative examples. The entropy of S is defined as H(S) = B(p/(p + n)).
(a) Based on an attribute Xj, we split our examples into k disjoint subsets Sk, with pk positive and nk negative examples in each. If the ratio pk/(pk + nk) is the same for all k, show that the information gain of this attribute is 0.
4 Programming exercise: Applying decision trees [24 pts]
Submission instructions
• Only provide answers and plots. Do not submit code.
Introduction2
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,
1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out
of 2224 passengers and crew. This sensational tragedy shocked the international community and
led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats
for the passengers and crew. Although there was some element of luck involved in surviving the
sinking, some groups of people were more likely to survive than others, such as women, children,
and the upper-class.
In this problem, we ask you to complete the analysis of what sorts of people were likely to survive.
In particular, we ask you to apply the tools of machine learning to predict which passengers survived
the tragedy.
Starter Files
code and data
• code : titanic.py
• data : titanic_train.csv
documentation
• Decision Tree Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
• Cross-Validation:
https://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
• Metrics:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
Download the code and data sets from the course website. For more information on the data set,
see the Kaggle description: https://www.kaggle.com/c/titanic/data. (The provided data sets are modified versions of the data available from Kaggle.3)
2This assignment is adapted from the Kaggle Titanic competition, available at https://www.kaggle.com/c/
titanic. Some parts of the problem are copied verbatim from Kaggle.
3Passengers with missing values for any feature have been removed. Also, the categorical feature Sex has been mapped to {'female': 0, 'male': 1} and Embarked to {'C': 0, 'Q': 1, 'S': 2}. If you are interested in learning more about this process of data munging, Kaggle has an excellent tutorial available at https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii.
Note that any portions of the code that you must modify have been indicated with TODO. Do not
change any code outside of these blocks.
4.1 Visualization [4 pts]
One of the first things to do before trying any formal machine learning technique is to dive into the data. This can include looking for funny values in the data, looking for outliers, looking at the range of feature values, seeing which features seem important, etc.
(a) Run the code (titanic.py) to make histograms for each feature, separating the examples
by class (e.g. survival). This should produce seven plots, one for each feature, and each plot
should have two overlapping histograms, with the color of the histogram indicating the class.
For each feature, what trends do you observe in the data?
4.2 Evaluation [20 pts]
Now, let us use scikit-learn to train a DecisionTreeClassifier on the data.
Using the predictive capabilities of the scikit-learn package is very simple. In fact, it can be
carried out in three simple steps: initializing the model, fitting it to the training data, and predicting
new values.4
(b) Before trying out any classifier, it is often useful to establish a baseline. We have implemented
one simple baseline classifier, MajorityVoteClassifier, that always predicts the majority
class from the training set. Read through the MajorityVoteClassifier and its usage and
make sure you understand how it works.
Your goal is to implement and evaluate another baseline classifier, RandomClassifier, that
predicts a target class according to the distribution of classes in the training data set. For
example, if 60% of the examples in the training set have Survived = 0 and 40% have
Survived = 1, then, when applied to a test set, RandomClassifier should randomly predict
60% of the examples as Survived = 0 and 40% as Survived = 1.
Implement the missing portions of RandomClassifier according to the provided specifications. Then train your RandomClassifier on the entire training data set, and evaluate its
training error. If you implemented everything correctly, you should have an error of 0.485.
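A minimal sketch of the idea (the actual skeleton defines its own base class and interface, so treat the names here as illustrative):

import numpy as np

class RandomClassifier:
    """Predicts labels at random according to the training class distribution."""

    def fit(self, X, y):
        # Estimate p(class) from the training labels.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.probabilities_ = counts / counts.sum()
        return self

    def predict(self, X):
        # Sample one label per test example from the training distribution.
        return np.random.choice(self.classes_, size=X.shape[0],
                                p=self.probabilities_)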
(c) Now that we have a baseline, train and evaluate a DecisionTreeClassifier (using the class
from scikit-learn and referring to the documentation as needed). Make sure you initialize
your classifier with the appropriate parameters; in particular, use the 'entropy' criterion discussed in class. What is the training error of this classifier?
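The scikit-learn calls involved look roughly like this (assuming X and y hold the training features and labels loaded by the starter code):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion='entropy')  # entropy, not the default 'gini'
clf.fit(X, y)
train_error = 1 - clf.score(X, y)                  # error = 1 - accuracy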
(d) So far, we have looked only at training error, but as we learned in class, training error is a
poor metric for evaluating classifiers. Let us use cross-validation instead.
4Note that almost all of the model techniques in scikit-learn share a few common named functions, once
they are initialized. You can always find out more about them in the documentation for each model. These are
some-model-name.fit(…), some-model-name.predict(…), and some-model-name.score(…).
Implement the missing portions of error(…) according to the provided specifications. You
may find it helpful to use train_test_split(…) from scikit-learn. To ensure that we
always get the same splits across different runs (and thus can compare the classifier results),
set the random_state parameter to be the trial number.
Next, use your error(…) function to evaluate the training error and (cross-validation) test
error of each of your three models. To do this, generate a random 80/20 split of the training
data, train each model on the 80% fraction, evaluate the error on either the 80% or the 20%
fraction, and repeat this 100 times to get an average result. What are the average training
and test error of each of your classifiers on the Titanic data set?
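A sketch of one possible error(…) (the skeleton may prescribe a different signature; in the older scikit-learn releases this assignment links to, train_test_split lives in sklearn.cross_validation):

import numpy as np
from sklearn.model_selection import train_test_split

def error(clf, X, y, ntrials=100, test_size=0.2):
    """Average training and test error over ntrials random 80/20 splits."""
    train_errors, test_errors = [], []
    for trial in range(ntrials):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=trial)  # trial number as seed
        clf.fit(X_train, y_train)
        train_errors.append(np.mean(clf.predict(X_train) != y_train))
        test_errors.append(np.mean(clf.predict(X_test) != y_test))
    return np.mean(train_errors), np.mean(test_errors)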
(e) One problem with decision trees is that they can overfit to training data, yielding complex
classifiers that do not generalize well to new data. Let us see whether this is the case for the
Titanic data.
One way to prevent decision trees from overfitting is to limit their depth. Repeat your cross-validation experiments, but with increasing depth limits, specifically 1, 2, . . . , 20. Then plot the average training error and test error against the depth limit. (Also plot the average test error
for your baseline classifiers. As the baseline classifiers are independent of the depth limit,
their plots should be flat lines.) Include this plot in your writeup, making sure to label all
axes and include a legend for your classifiers. What is the best depth limit to use for this
data? Do you see overfitting? Justify your answers using the plot.
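A sketch of the experiment loop, reusing the error(…) sketch from part (d):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

depths = range(1, 21)
train_errs, test_errs = [], []
for d in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=d)
    tr, te = error(clf, X, y)
    train_errs.append(tr)
    test_errs.append(te)

plt.plot(depths, train_errs, label='train error')
plt.plot(depths, test_errs, label='test error')
plt.xlabel('maximum depth')
plt.ylabel('average error')
plt.legend()
plt.show()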
(f) Another useful tool for evaluating classifiers is learning curves, which show how classifier
performance (e.g. error) relates to experience (e.g. amount of training data).
Run another experiment using a decision tree with the best depth limit you found above.
This time, vary the amount of training data by starting with splits of 0.05 (5% of the data
used for training) and working up to splits of size 0.95 (95% of the data used for training) in
increments of 0.05. Then plot the decision tree training and test error against the amount of
training data. (Also plot the average test error for your baseline classifiers.) Include this plot
in your writeup, and provide a 1-2 sentence description of your observations.
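A sketch of the learning-curve loop, continuing the sketches above (best_depth stands for the depth limit chosen in part (e)):

import numpy as np

fractions = np.arange(0.05, 1.00, 0.05)            # 0.05, 0.10, ..., 0.95
train_errs, test_errs = [], []
for frac in fractions:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=best_depth)
    tr, te = error(clf, X, y, test_size=1 - frac)  # train on frac of the data
    train_errs.append(tr)
    test_errs.append(te)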
CM146 Problem Set 2: Perceptron and regression
1 Perceptron [2 pts]
Design (specify θ for) a two-input perceptron (with an additional bias or offset term) that computes
the following boolean functions. Assume T = 1 and F = −1. If a valid perceptron exists, show
that it is not unique by designing another valid perceptron (with a different hyperplane, not simply
through normalization). If no perceptron exists, state why.
(a) OR (b) XOR
2 Logistic Regression [10 pts]
Consider the objective function that we minimize in logistic regression:

J(θ) = −∑_{n=1}^{N} [yn log hθ(xn) + (1 − yn) log(1 − hθ(xn))],

where hθ(x) = σ(θᵀx) is the logistic function.
(a) Find the partial derivatives ∂J/∂θj.
(b) Find the second partial derivatives ∂²J/(∂θj ∂θk), and show that the Hessian (the matrix H of second derivatives, with elements Hjk = ∂²J/(∂θj ∂θk)) can be written as

H = ∑_{n=1}^{N} hθ(xn) (1 − hθ(xn)) xn xnᵀ.
(c) Show that J is a convex function and therefore has no local minima other than the global one.

Hint: A function J is convex if its Hessian is positive semi-definite (PSD), written H ⪰ 0. A matrix is PSD if and only if

zᵀHz ≡ ∑_{j,k} zj zk Hjk ≥ 0

for all real vectors z.
3 Locally Weighted Linear Regression [14 pts]
Consider a linear regression problem in which we want to “weight” different training instances
differently because some of the instances are more important than others. Specifically, suppose we
want to minimize
J(θ0, θ1) = ∑_{n=1}^{N} wn (θ0 + θ1 xn,1 − yn)².
Here wn > 0. In class, we worked out what happens for the case where all the weights (the wn’s)
are the same. In this problem, we will generalize some of those ideas to the weighted setting.
(a) Calculate the gradient by computing the partial derivatives of J with respect to each of the
parameters (θ0, θ1).
(b) Set each partial derivative to 0 and solve for θ0 and θ1 to obtain values of (θ0, θ1) that minimize J.
(c) Show that J(θ) can also be written as

J(θ) = (Xθ − y)ᵀ W (Xθ − y)

for an appropriate diagonal matrix W, where

X = [1 x1,1; 1 x2,1; . . . ; 1 xN,1] (one row per instance), y = (y1, y2, . . . , yN)ᵀ, and θ = (θ0, θ1)ᵀ.

State clearly what W is.
4 Implementation: Polynomial Regression [20 pts]
In this exercise, you will work through linear and polynomial regression. Our data consists of
inputs xn ∈ R and outputs yn ∈ R, n ∈ {1, . . . , N}, which are related through a target function
y = f(x). Your goal is to learn a linear predictor hθ(x) that best approximates f(x). But this time,
rather than using scikit-learn, we will further open the “black-box”, and you will implement the
regression model!
code and data
• code : regression.py
• data : regression_train.csv, regression_test.csv
This is likely the first time that many of you are working with numpy and matrix operations within
a programming environment. For the uninitiated, you may find it useful to work through a numpy
tutorial first.1 Here are some things to keep in mind as you complete this problem:
• If you are seeing many errors at runtime, inspect your matrix operations to make sure that
you are adding and multiplying matrices of compatible dimensions. Printing the dimensions
of variables with the X.shape command will help you debug.
• When working with numpy arrays, remember that numpy interprets the * operator as element-wise multiplication. This is a common source of size incompatibility errors. If you want matrix multiplication, you need to use the dot function in Python. For example, A*B does element-wise multiplication while dot(A,B) does a matrix multiply (see the short demo after this list).
• Be careful when handling numpy vectors (rank-1 arrays): the vector shapes 1 × N, N × 1, and N are all different things. For these dimensions, we follow the conventions of scikit-learn's LinearRegression class2. Most importantly, unless otherwise indicated (in the code documentation), both column and row vectors are rank-1 arrays of shape N, not rank-2 arrays of shape N × 1 or shape 1 × N.
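For example:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)
v = np.array([1.0, 2.0])                 # rank-1 array, shape (2,)

A * A           # element-wise product, shape (2, 2)
np.dot(A, A)    # matrix product, shape (2, 2)
np.dot(A, v)    # matrix-vector product, shape (2,)
v.shape         # (2,) -- neither (2, 1) nor (1, 2)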
Visualization [1 pt]
As we learned last week, it is often useful to understand the data through visualizations. For this
data set, you can use a scatter plot to visualize the data since it has only two properties to plot (x
and y).
(a) Visualize the training and test data using the plot_data(…) function. What do you observe? For example, can you make an educated guess on the effectiveness of linear regression
in predicting the data?
1Try out SciPy’s tutorial (https://wiki.scipy.org/Tentative_NumPy_Tutorial), or use your favorite search engine to find an alternative. Those familiar with Matlab may find the “Numpy for Matlab Users” documentation
(https://wiki.scipy.org/NumPy_for_Matlab_Users) more helpful.
2https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Linear Regression [12 pts]
Recall that linear regression attempts to minimize the objective function

J(θ) = ∑_{n=1}^{N} (hθ(xn) − yn)².
In this problem, we will use the matrix-vector form, where

y = (y1, y2, . . . , yN)ᵀ, X = [x1ᵀ; x2ᵀ; . . . ; xNᵀ] (one row per instance), θ = (θ0, θ1, θ2, . . . , θD)ᵀ,

and each instance xn = (1, xn,1, . . . , xn,D)ᵀ. In this problem, the number of input features is D = 1.
Rather than working with this fully generalized, multivariate case, let us start by considering a simple linear regression model:

hθ(x) = θᵀx = θ0 + θ1 x1
regression.py contains the skeleton code for the class PolynomialRegression. Objects of this class can be instantiated as model = PolynomialRegression(m), where m is the degree of the polynomial; the feature vector for instance n is (1, xn,1, xn,1², . . . , xn,1^m)ᵀ. Setting m = 1 instantiates an object where the feature vector for instance n is (1, xn,1)ᵀ.
(b) Note that to take into account the intercept term (θ0), we can add an additional "feature" to each instance and set it to one, e.g., xi,0 = 1. This is equivalent to adding an additional first column to X and setting it to all ones.
Modify PolynomialRegression.generate_polynomial_features(…) to create the matrix
X for a simple linear model.
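The core operation is a single column stack; a standalone sketch (the skeleton wraps this in a class method, so the function name here is illustrative):

import numpy as np

def generate_linear_features(X):
    # Prepend the all-ones column so that theta_0 acts as the intercept.
    return np.column_stack((np.ones(X.shape[0]), X))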
(c) Before tackling the harder problem of training the regression model, complete
PolynomialRegression.predict(…) to predict y from X and θ.
(d) One way to solve linear regression is through gradient descent (GD).
Recall that the parameters of our model are the θj values. These are the values we will adjust
to minimize J(θ). In gradient descent, each iteration performs the update

θj ← θj − 2α ∑_{n=1}^{N} (hθ(xn) − yn) xn,j   (simultaneously update θj for all j).
With each step of gradient descent, we expect our updated parameters θj to come closer to
the parameters that will achieve the lowest value of J(θ).
• As we perform gradient descent, it is helpful to monitor the convergence by computing
the cost, i.e., the value of the objective function J. Complete PolynomialRegression.cost(…)
to calculate J(θ).
If you have implemented everything correctly, then the following code snippet should
return 40.234.
train_data = load_data('regression_train.csv')
model = PolynomialRegression()
model.coef_ = np.zeros(2)
model.cost(train_data.X, train_data.y)
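A standalone sketch of the computation cost(…) performs (assuming X already includes the column of ones):

import numpy as np

def cost(X, y, theta):
    # J(theta) = sum_n (h_theta(x_n) - y_n)^2
    residuals = X.dot(theta) - y
    return residuals.dot(residuals)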
• Next, implement the gradient descent step in PolynomialRegression.fit_GD(…).
The loop structure has been written for you, and you only need to supply the updates to θ and the new predictions ŷ = hθ(x) within each iteration.
We will use the following specifications for the gradient descent algorithm:
– We run the algorithm for 10,000 iterations.
– We terminate the algorithm earlier if the value of the objective function is unchanged across consecutive iterations.
– We will use a fixed step size.
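The body of the update is short; a standalone sketch (X with the ones column, eta the step size α):

y_pred = X.dot(theta)                           # current predictions h_theta(x_n)
theta = theta - eta * 2 * X.T.dot(y_pred - y)   # simultaneous update of all theta_j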
• So far, you have used a default learning rate (or step size) of η = 0.01. Try different values η = 10^{-4}, 10^{-3}, 10^{-2}, 0.0407, and make a table of the coefficients, the number of iterations until convergence (this number will be 10,000 if the algorithm did not converge in a smaller number of iterations), and the final value of the objective function. How do the coefficients compare? How quickly does each algorithm converge?
(e) In class, we learned that the closed-form solution to linear regression is
θ = (XᵀX)⁻¹Xᵀy.
Using this formula, you will get an exact solution in one calculation: there is no “loop until
convergence” like in gradient descent.
• Implement the closed-form solution PolynomialRegression.fit(…).
• What is the closed-form solution? How do the coefficients and the cost compare to those
obtained by GD? How quickly does the algorithm run compared to GD?
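A standalone sketch of the closed-form fit (np.linalg.solve is numerically preferable to forming the inverse explicitly):

import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))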
(f) Finally, set a learning rate η for GD that is a function of k (the number of iterations), using ηk = 1/(1 + k), and that converges to the same solution yielded by the closed-form optimization (up to possible rounding errors). Update PolynomialRegression.fit_GD(…) with your proposed
learning rate. How long does it take the algorithm to converge with your proposed learning
rate?
Polynomial Regression [7 pts]
Now let us consider the more complicated case of polynomial regression, where our hypothesis is
hθ(x) = θᵀφ(x) = θ0 + θ1 x + θ2 x² + · · · + θm x^m.
(g) Recall that polynomial regression can be considered as an extension of linear regression in
which we replace our input matrix X with
Φ = [φ(x1)ᵀ; φ(x2)ᵀ; . . . ; φ(xN)ᵀ] (one row per instance),

where φ(x) is a function such that φj(x) = x^j for j = 0, . . . , m.
Update PolynomialRegression.generate_polynomial_features(…) to create an m + 1
dimensional feature vector for each instance.
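A standalone sketch of the mapping (np.vander generates exactly these powers; as before, the skeleton wraps this in a class method):

import numpy as np

def generate_polynomial_features(x, m):
    # Map each scalar x_n to (1, x_n, x_n^2, ..., x_n^m).
    return np.vander(x, N=m + 1, increasing=True)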
(h) Given N training instances, it is always possible to obtain a “perfect fit” (a fit in which all
the data points are exactly predicted) by setting the degree of the regression to N − 1. Of
course, we would expect such a fit to generalize poorly. In the remainder of this problem, you
will investigate the problem of overfitting as a function of the degree of the polynomial, m.
To measure overfitting, we will use the Root-Mean-Square (RMS) error, defined as

E_RMS = √(J(θ)/N),

where N is the number of instances.3
Why do you think we might prefer RMSE as a metric over J(θ)?
Implement PolynomialRegression.rms_error(…).
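A standalone sketch, reusing the cost(…) sketch above:

import numpy as np

def rms_error(X, y, theta):
    # E_RMS = sqrt(J(theta) / N)
    return np.sqrt(cost(X, y, theta) / len(y))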
(i) For m = 0, . . . , 10, use the closed-form solver to determine the best-fit polynomial regression
model on the training data, and with this model, calculate the RMSE on both the training
data and the test data. Generate a plot depicting how RMSE varies with model complexity
(polynomial degree) – you should generate a single plot with both training and test error, and
include this plot in your writeup. Which degree polynomial would you say best fits the data?
Was there evidence of under/overfitting the data? Use your plot to justify your answer.
3Note that the RMSE as defined is a biased estimator. To obtain an unbiased estimator, we would have to divide
by n − k, where k is the number of parameters fitted (including the constant), so here, k = m + 1.
CM146 Problem Set 3: SVM and Kernels
1 Kernels [8 pts]
(a) For any two documents x and z, define k(x, z) to equal the number of unique words that occur
in both x and z (i.e., the size of the intersection of the sets of words in the two documents).
Is this function a kernel? Give justification for your answer.
(b) One way to construct kernels is to build them from simpler ones. We have seen various
“construction rules”, including the following: Assuming k1(x, z) and k2(x, z) are kernels,
then so are
• (scaling) f(x)k1(x, z)f(z) for any function f(x) ∈ R
• (sum) k(x, z) = k1(x, z) + k2(x, z)
• (product) k(x, z) = k1(x, z)k2(x, z)
Using the above rules and the fact that k(x, z) = x · z is (clearly) a kernel, show that the
following is also a kernel:
(1 + (x/‖x‖) · (z/‖z‖))³
(c) Given vectors x and z in R², define the kernel kβ(x, z) = (1 + β x · z)³ for any value β > 0. Find the corresponding feature map φβ(·).1 What are the similarities/differences from the kernel k(x, z) = (1 + x · z)³, and what role does the parameter β play?
2 SVM [8 pts]
Suppose we are looking for a maximum-margin linear classifier through the origin, i.e., b = 0 (also hard margin, i.e., no slack variables). In other words, we minimize (1/2)‖θ‖² subject to yn θᵀxn ≥ 1, n = 1, . . . , N.
Parts of this assignment are adapted from course material by Tommi Jaakola (MIT), Andrew Ng (Stanford), and Jenna Wiens (UMich).
1You may use any external program to expand the cubic.
(a) Given a single training vector x = (a, e)ᵀ with label y = −1, what is the θ* that satisfies the above constrained minimization?
(b) Suppose we have two training examples, x1 = (1, 1)ᵀ and x2 = (1, 0)ᵀ, with labels y1 = 1 and y2 = −1. What is θ* in this case, and what is the margin γ?
(c) Suppose we now allow the offset parameter b to be non-zero. How would the classifier and the margin change in the previous question? What are (θ*, b*) and γ? Compare your solutions with and without offset.
3 Twitter analysis using SVMs [26 pts]
In this project, you will be working with Twitter data. Specifically, we have supplied you with a number of tweets that are reviews/reactions to movies,2 e.g., "@nickjfrost just saw The Boat That Rocked/Pirate Radio and I thought it was brilliant! You and the rest of the cast were fantastic! <3".
You will learn to automatically classify such tweets as either positive or negative reviews. To do
this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of
classification problems.
Download the code and data sets from the course website. It contains the following data files:
• tweets.txt contains 630 tweets about movies. Each line in the file contains exactly one
tweet, so there are 630 lines in total.
• labels.txt contains the corresponding labels. If a tweet praises or recommends a movie, it is classified as a positive review and labeled +1; otherwise it is classified as a negative review and labeled −1. These labels are ordered, i.e., the label for the ith tweet in tweets.txt corresponds to the ith number in labels.txt.
• held_out_tweets.txt contains 70 tweets for which we have withheld the labels.
Skim through the tweets to get a sense of the data.
The python file twitter.py contains skeleton code for the project. Skim through the code to
understand its structure.
3.1 Feature Extraction [2 pts]
We will use a bag-of-words model to convert each tweet into a feature vector. A bag-of-words
model treats a text file as a collection of words, disregarding word order. The first step in building
a bag-of-words model involves building a “dictionary”. A dictionary contains all of the unique
words in the text file. For this project, we will be including punctuations in the dictionary too.
For example, a text file containing "John likes movies. Mary likes movies2!!" will have the dictionary {'John': 0, 'Mary': 1, 'likes': 2, 'movies': 3, 'movies2': 4, '.': 5, '!': 6}. Note that the (key, value) pairs are (word, index), where each word is assigned an index in the order it is first encountered, so the indices run from 0 to d − 1 for a dictionary of d unique words.
2Please note that these data were selected at random, and thus the content of these tweets does not reflect the views of the course staff. :)
Given a dictionary containing d unique words, we can transform the n variable-length tweets into n feature vectors of length d by setting the ith element of the jth feature vector to 1 if the ith dictionary word is in the jth tweet, and 0 otherwise.
(a) We have implemented extract_words(…) that processes an input string to return a list of
unique words. This method takes a simplistic approach to the problem, treating any string
of characters (that does not include a space) as a “word” and also extracting and including
all unique punctuations.
Implement extract_dictionary(…) that uses extract_words(…) to read all unique
words contained in a file into a dictionary (as in the example above). Process the tweets in
the order they appear in the file to create this dictionary of d unique words/punctuations.
(b) Next, implement extract_feature_vectors(…) that produces the bag-of-words representation of a file based on the extracted dictionary. That is, for each tweet i, construct a feature vector of length d, where the jth entry in the feature vector is 1 if the jth word in the dictionary is present in tweet i, and 0 otherwise. For n tweets, save the feature vectors in a feature matrix, where the rows correspond to tweets (examples) and the columns correspond to words (features). Maintain the order of the tweets as they appear in the file.
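A minimal sketch of the two functions (extract_words(…) is the provided helper; the skeleton's specifications take precedence over the details here):

import numpy as np

def extract_dictionary(infile):
    """Map each unique word/punctuation to an index, in order of first appearance."""
    word_list = {}
    with open(infile, 'r') as fid:
        for line in fid:
            for word in extract_words(line):
                if word not in word_list:
                    word_list[word] = len(word_list)
    return word_list

def extract_feature_vectors(infile, word_list):
    """(n, d) 0/1 matrix: entry (i, j) is 1 iff dictionary word j occurs in tweet i."""
    with open(infile, 'r') as fid:
        lines = fid.readlines()
    X = np.zeros((len(lines), len(word_list)))
    for i, line in enumerate(lines):
        for word in extract_words(line):
            if word in word_list:
                X[i, word_list[word]] = 1
    return X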
(c) In main(…), we have provided code to read the tweets and labels into a (630, d) feature matrix and a (630,) label array. Split the feature matrix and corresponding labels into your training and test sets: the first 560 tweets will be used for training and the last 70 tweets will be used for testing. All subsequent operations will be performed on these data.
3.2 Hyperparameter Selection for a Linear-Kernel SVM [10 pts]
Next, we will learn a classifier to separate the training data into positive and negative tweets. For
the classifier, we will use SVMs with two different kernels: linear and radial basis function (RBF).
We will use the sklearn.svm.SVC class and explicitly set only three of the initialization parameters:
kernel, gamma, and C. As usual, we will use SVC.fit(X,y) to train our SVM, but in lieu of using
SVC.predict(X) to make predictions, we will use SVC.decision_function(X), which returns the
(signed) distance of the samples to the separating hyperplane.
SVMs have hyperparameters that must be set by the user. For both linear and RBF-kernel SVMs,
we will select the hyperparameters using 5-fold cross-validation (CV). Using 5-fold CV, we will
select the hyperparameters that lead to the ‘best’ mean performance across all 5 folds.
(a) The result of hyperparameter selection often depends upon the choice of performance measure. Here, we will consider the following performance measures: accuracy, F1-score, AUROC, precision, sensitivity, and specificity.

Implement performance(…). All measures except sensitivity and specificity are implemented in the sklearn.metrics library. You can use sklearn.metrics.confusion_matrix(…) to calculate the other two.
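A sketch of one possible implementation (the metric strings are illustrative; the skeleton may use different names):

import numpy as np
from sklearn import metrics

def performance(y_true, y_pred, metric='accuracy'):
    """Score (signed) decision values y_pred against true labels y_true."""
    y_label = np.sign(y_pred)          # map decision values to {-1, +1}
    y_label[y_label == 0] = 1
    if metric == 'accuracy':
        return metrics.accuracy_score(y_true, y_label)
    if metric == 'f1_score':
        return metrics.f1_score(y_true, y_label)
    if metric == 'auroc':
        return metrics.roc_auc_score(y_true, y_pred)   # uses the raw scores
    if metric == 'precision':
        return metrics.precision_score(y_true, y_label)
    # sensitivity and specificity come from the confusion matrix
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_label,
                                              labels=[-1, 1]).ravel()
    if metric == 'sensitivity':
        return tp / (tp + fn)
    if metric == 'specificity':
        return tn / (tn + fp)
    raise ValueError('unknown metric: %s' % metric)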
(b) Next, implement cv_performance(…) to return the mean k-fold CV performance for the
performance metric passed into the function. Here, you will make use of SVC.fit(X,y) and
SVC.decision_function(X), as well as your performance(…) function.
You may have noticed that the proportions of the two classes (positive and negative) are not equal in the training data. When dividing the data into folds for CV, you should try to keep the class proportions roughly the same across folds. In your write-up, briefly describe why it might be beneficial to maintain class proportions across folds. Then, in main(…), use sklearn.cross_validation.StratifiedKFold(…) to split the data for 5-fold CV, making sure to stratify using only the training labels.
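A sketch of cv_performance(…), written against the older sklearn.cross_validation API that the assignment references (newer releases moved StratifiedKFold to sklearn.model_selection with a different calling convention):

import numpy as np
from sklearn.cross_validation import StratifiedKFold

def cv_performance(clf, X, y, kf, metric='accuracy'):
    """Mean k-fold CV performance; kf is e.g. StratifiedKFold(y, n_folds=5)."""
    scores = []
    for train, test in kf:
        clf.fit(X[train], y[train])
        y_pred = clf.decision_function(X[test])   # signed distances, not labels
        scores.append(performance(y[test], y_pred, metric))
    return np.mean(scores)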
(c) Now, implement select_param_linear(…) to choose a setting for C for a linear SVM based on the training data and the specified metric. Your function should call cv_performance(…), passing in instances of SVC(kernel='linear', C=c) with different values for C, e.g., C = 10^{-3}, 10^{-2}, . . . , 10².
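A sketch of the selection loop, built on the cv_performance(…) sketch above:

import numpy as np
from sklearn.svm import SVC

def select_param_linear(X, y, kf, metric='accuracy'):
    """Return the C with the best mean CV performance for the given metric."""
    C_range = 10.0 ** np.arange(-3, 3)    # 10^-3, ..., 10^2
    scores = [cv_performance(SVC(kernel='linear', C=c), X, y, kf, metric)
              for c in C_range]
    return C_range[np.argmax(scores)]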
(d) Finally, using the training data from Section 3.1 and the functions implemented here, find
the best setting for C for each performance measure mentioned above. Report your findings
in tabular format (up to the fourth decimal place):
C        accuracy   F1-score   AUROC   precision   sensitivity   specificity
10^{-3}
10^{-2}
10^{-1}
10^{0}
10^{1}
10^{2}
best C
Your select_param_linear(…) function returns the 'best' C given a range of values. How does the 5-fold CV performance vary with C and the performance metric?
3.3 Hyperparameter Selection for an RBF-kernel SVM [8 pts]
Similar to the hyperparameter selection for a linear-kernel SVM, you will perform hyperparameter
selection for an RBF-kernel SVM.
(a) Describe the role of the additional hyperparameter γ for an RBF-kernel SVM. How does γ
affect generalization error?
(b) Implement select_param_rbf(…) to choose a setting for C and γ via a grid search. Your function should call cv_performance(…), passing in instances of SVC(kernel='rbf', C=c, gamma=gamma) with different values for C and gamma. Explain what kind of grid you used and why.
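A sketch of a log-spaced grid search (the grid bounds are illustrative):

import numpy as np
from sklearn.svm import SVC

def select_param_rbf(X, y, kf, metric='accuracy'):
    """Grid-search C and gamma by mean CV performance; returns (C, gamma, score)."""
    best = (None, None, -np.inf)
    for c in 10.0 ** np.arange(-3, 3):
        for g in 10.0 ** np.arange(-3, 3):
            score = cv_performance(SVC(kernel='rbf', C=c, gamma=g),
                                   X, y, kf, metric)
            if score > best[2]:
                best = (c, g, score)
    return best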
(c) Finally, using the training data from Section 3.1 and the function implemented here, find
the best setting for C and γ for each performance measure mentioned above. Report your
findings in tabular format. This time, because we have a two-dimensional grid search, report
only the best score for each metric, along with the accompanying C and γ setting.
metric score C γ
accuracy
F1-score
AUROC
precision
sensitivity
specificity
How does the CV performance vary with the hyperparameters of the RBF-kernel SVM?
3.4 Test Set Performance [6 pts]
In this section, you will apply the two classifiers learned in the previous sections to the test data
from Section 3.1. Once you have predicted labels for the test data, you will measure performance.
(a) Based on the results you obtained in Section 3.2 and Section 3.3, choose a hyperparameter
setting for the linear-kernel SVM and a hyperparameter setting for the RBF-kernel SVM.
Explain your choice.
Then, in main(…), using the training data extracted in Section 3.1 and SVC.fit(…),
train a linear- and an RBF-kernel SVM with your chosen settings.
(b) Implement performance_test(…) which returns the value of a performance measure, given
the test data and a trained classifier.
(c) For each performance metric, use performance_test(…) and the two trained linear- and RBF-kernel SVM classifiers to measure performance on the test data. Report the results. Be sure to include the name of the performance metric employed and the performance on the test data. How does the test performance of your two classifiers compare?
CM146 Problem Set 4: Clustering and PCA
Introduction
Machine learning techniques have been applied to a variety of image interpretation problems. In
this project, you will investigate facial recognition, which can be treated as a clustering problem
(“separate these pictures of Joe and Mary”).
For this project, we will use a small part of a huge database of faces of famous people (the Labeled Faces in the Wild [LFW] people dataset1). The images have already been cropped out of the original
image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally,
we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of
1850 “raw” features). Our dataset has a total of 1867 images of 19 different people. You will
apply dimensionality reduction using principal component analysis (PCA) and explore clustering
methods such as k-means and k-medoids to the problem of facial recognition on this dataset.
Download the starter files from the course website. It contains the following source files:
• util.py – Utility methods for manipulating data, including through PCA.
• cluster.py – Code for the Point, Cluster, and ClusterSet classes, on which you will build
the clustering algorithms.
• faces.py – Main code for the project.
Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage
you to include your own additional methods and functions. However, you are not allowed to use
any scikit-learn classes or functions other than those already imported in the skeleton code.
1 PCA and Image Reconstruction [4 pts]
Before attempting automated facial recognition, you will investigate a general problem with images. That is, images are typically represented as thousands (in this project) to millions (more generally) of pixel values, and a high-dimensional vector of pixels must be reduced to a reasonably low-dimensional vector of features.
(a) As always, the first thing to do with any new dataset is to look at it. Use get_lfw_data(…)
to get the LFW dataset with labels, and plot a couple of the input images using show_image(…).
Then compute the mean of all the images, and plot it. (Remember to include all requested
images in your writeup.) Comment briefly on this “average” face.
(b) Perform PCA on the data using util.PCA(…). This function returns a matrix U whose
columns are the principal components, and a vector mu which is the mean of the data. If
you want to look at a principal component (referred to in this setting as an eigenface), run
show_image(vec_to_image(v)), where v is a column of the principal component matrix.
(This function will scale vector v appropriately for image display.) Show the top twelve
eigenfaces:
plot_gallery([vec_to_image(U[:,i]) for i in xrange(12)])
Comment briefly on your observations. Why do you think these are selected as the top
eigenfaces?
1https://vis-www.cs.umass.edu/lfw/
(c) Explore the effect of using more or fewer dimensions to represent images. Do this by:
• Finding the principal components of the data
• Selecting a number l of components to use
• Reconstructing the images using only the first l principal components
• Visually comparing the images to the originals
To perform PCA, use apply_PCA_from_Eig(…) to project the original data into the lower-dimensional space, and then use reconstruct_from_PCA(…) to reconstruct high-dimensional images out of lower-dimensional ones. Then, using plotGallery(…), submit a gallery of the
first 12 images in the dataset, reconstructed with l components, for l = 1, 10, 50, 100, 500, 1288.
Comment briefly on the effectiveness of differing values of l with respect to facial recognition.
We will revisit PCA in the last section of this project.
2 K-Means and K-Medoids [16 pts]
Next, we will explore clustering algorithms in detail by applying them to a toy dataset. In particular,
we will investigate k-means and k-medoids (a slight variation on k-means).
(a) In k-means, we attempt to find k cluster centers μj ∈ R^d, j ∈ {1, . . . , k}, and n cluster assignments c^(i) ∈ {1, . . . , k}, i ∈ {1, . . . , n}, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to find μ1, . . . , μk and c^(1), . . . , c^(n) that minimize

J(c, μ) = ∑_{i=1}^{n} ‖x^(i) − μ_{c^(i)}‖².

To do so, we iterate between assigning each x^(i) to the nearest cluster center c^(i) and updating each cluster center μj to the average of all points assigned to the jth cluster.
Instead of holding the number of clusters k fixed, one can think of minimizing the objective
function over µ, c, and k. Show that this is a bad idea. Specifically, what is the minimum
possible value of J(c, µ, k)? What values of c, µ, and k result in this value?
(b) To implement our clustering algorithms, we will use Python classes to help us define three
abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read
through the documentation for these classes. (You will be using these classes later, so make
sure you know what functionality each class provides!) Some of the class methods are already
implemented, and other methods are described in comments. Implement all of the methods
marked TODO in the Cluster and ClusterSet classes.
(c) Next, implement random_init(…) and kMeans(…) based on the provided specifications.
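Your implementation should be built on the Point/Cluster/ClusterSet classes, but the underlying loop is the familiar one; a plain-numpy sketch for orientation (empty clusters are not handled):

import numpy as np

def kmeans_numpy(X, k, n_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initialization
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign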
(d) Now test the performance of k-means on a toy dataset.
Use generate_points_2d(…) to generate three clusters each containing 20 points. (You
can modify generate_points_2d(…) to test different inputs while debugging your code,
but be sure to return to the initial implementation before creating any plots for submission.)
You can plot the clusters for each iteration using the plot_clusters(…) function.
In your writeup, include plots for the k-means cluster assignments and corresponding cluster
“centers” for each iteration when using random initialization.
(e) Implement kMedoids(…) based on the provided specification.
Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code
to use a helper function kAverages(points, k, average, init=’random’, plot=True),
where average is a method that determines how to calculate the average of points in a
cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids).2
As before, include plots for k-medoids clustering for each iteration when using random initialization.
(f) Finally, we will explore the effect of initialization. Implement cheat_init(…).
Now compare clustering by initializing using cheat_init(…). Include plots for k-means
and k-medoids for each iteration.
3 Clustering Faces [12 pts]
Finally (!), we will apply clustering algorithms to the image data. To keep things simple, we will
only consider data from four individuals. Make a new image dataset by selecting 40 images each
from classes 4, 6, 13, and 16, then translate these images to (labeled) points:3
X1, y1 = util.limit_pics(X, y, [4, 6, 13, 16], 40)
points = build_face_image_points(X1, y1)
(a) Apply k-means and k-medoids to this new dataset with k = 4 and initializing the centroids
randomly. Evaluate the performance of each clustering algorithm by computing the average
cluster purity with ClusterSet.score(…). As the performance of the algorithms can vary
widely depending upon the initialization, run both clustering methods 10 times and report
the average, minimum, and maximum performance.
average min max
k-means
k-medoids
How do the clustering methods compare in terms of clustering performance and runtime?
2In Python, if you have a function stored in the variable func, you can apply it to parameters arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.
3There is a bug in fetch_lfw in version 0.18.1 where the results of the loaded images are not always in the same order. This is not a problem for the previous parts, but it can affect the subset selected in this part. Thus, you may see varying results. Results that show the correct qualitative behavior will get full credit.
Now construct another dataset by selecting 40 images each from individuals 4 and 13.
(b) Explore the effect of lower-dimensional representations on clustering performance. To do
this, compute the principal components for the entire image dataset, then project the newly
generated dataset into a lower dimension (varying the number of principal components), and
compute the scores of each clustering algorithm.
So that we are only changing one thing at a time, use init='cheat' to generate the same initial set of clusters for k-means and k-medoids. For each value of l, the number of principal components, you will have to generate a new list of points using build_face_image_points(…). Let l = 1, 3, 5, . . . , 41 and set the number of clusters K = 2. Then, on a single plot, plot the clustering score versus the number of components for each clustering algorithm (be sure to label the algorithms). Discuss the results in a few sentences.

Some pairs of people are more similar to one another than others.
(c) Experiment with the data to find a pair that clustering can discriminate very well and another
pair that it finds very difficult (assume you have 40 images for each individual). Describe your
methodology (you may choose any of the clustering algorithms you implemented). Report the
two pairs in your writeup (display the pairs of images using plot_representative_images),
and comment briefly on the results.
CM146 Problem Set 5: Boosting, Unsupervised learning
1 AdaBoost [5 pts]
In the lecture on ensemble methods, we said that in iteration t, AdaBoost is picking the (ht, βt) that minimizes the objective:

(h*t(x), β*t) = arg min_{(ht(x), βt)} ∑n wt(n) e^{−yn βt ht(xn)}
             = arg min_{(ht(x), βt)} (e^{βt} − e^{−βt}) ∑n wt(n) I[yn ≠ ht(xn)] + e^{−βt} ∑n wt(n)
We define the weighted misclassification error at time t, εt, to be εt = ∑n wt(n) I[yn ≠ ht(xn)]. Also, the weights are normalized so that ∑n wt(n) = 1.
(a) Take the derivative of the above objective function with respect to βt, set it to zero, and solve for βt to obtain its update.
(b) Suppose the training set is linearly separable, and we use a hard-margin linear support vector
machine (no slack) as a base classifier. In the first boosting iteration, what would the resulting
β1 be?
2 K-means for single dimensional data [5 pts]
In this problem, we will work through K-means for single-dimensional data.

(a) Consider the case where K = 3 and we have 4 data points x1 = 1, x2 = 2, x3 = 5, x4 = 7. What is the optimal clustering for this data? What is the corresponding value of the objective?
Parts of this assignment are adapted from course material by Jenna Wiens (UMich) and Tommi Jaakola (MIT).
(b) One might be tempted to think that Lloyd’s algorithm is guaranteed to converge to the
global minimum when d = 1. Show that there exists a suboptimal cluster assignment (i.e.,
initialization) for the data in the above part that Lloyd’s algorithm will not be able to improve
(to get full credit, you need to show the assignment, show why it is suboptimal and explain
why it will not be improved).
3 Gaussian Mixture Models [8 pts]
We would like to cluster data {x1, . . . , xN}, xn ∈ R^d, using a Gaussian Mixture Model (GMM) with K mixture components. To do this, we need to estimate the parameters θ of the GMM, i.e., we need to set the values θ = {ωk, μk, Σk}_{k=1}^{K}, where ωk is the mixture weight associated with mixture component k, and μk and Σk denote the mean and the covariance matrix of the Gaussian distribution associated with mixture component k.
If we knew which cluster each sample xn belongs to (i.e., if we had complete data), we showed in the lecture on Clustering that the log likelihood ℓ is given by the expression below, and we can compute the maximum likelihood estimate (MLE) of all the parameters:

ℓ(θ) = ∑n log p(xn, zn)
     = ∑k ∑n γnk log ωk + ∑k { ∑n γnk log N(xn | μk, Σk) }    (1)
Since we do not have complete data, we use the EM algorithm. The EM algorithm works by iterating between setting each γnk to the posterior probability p(zn = k | xn) (step 1 on slide 26 of the lecture on Clustering) and then using γnk to find the value of θ that maximizes ℓ (step 2 on slide 26). We will now derive the update for one of the parameters: μj, the mean parameter associated with mixture component j.
(a) To maximize ℓ, compute ∇_{μj} ℓ(θ): the gradient of ℓ(θ) with respect to μj.
(b) Set the gradient to zero and solve for μj to show that

μj = (1/∑n γnj) ∑n γnj xn.
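As a sanity check, this update vectorizes to one line of numpy (gamma being the N × K responsibility matrix and X the N × d data matrix):

import numpy as np

def update_means(X, gamma):
    # mu_j = (sum_n gamma_nj x_n) / (sum_n gamma_nj), one row per component
    return gamma.T.dot(X) / gamma.sum(axis=0)[:, None]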
(c) Suppose that we are fitting a GMM to data using K = 2 components. We have N = 5 samples in our training data, with xn, n ∈ {1, . . . , N}, equal to {5, 15, 25, 30, 40}.

We use the EM algorithm to find the maximum likelihood estimates for the model parameters, which are the mixing weights for the two components, ω1 and ω2, and the means for the two components, μ1 and μ2. The standard deviations for the two components are fixed at 1.

Suppose that at the end of step 1 of iteration 5 in the EM algorithm, the soft assignments γnk for the five data items are as shown in Table 1.
γn1   γn2
0.2   0.8
0.2   0.8
0.8   0.2
0.9   0.1
0.9   0.1

Table 1: The entry in row n and column k of the table corresponds to γnk.
What are the updated values of the parameters ω1, ω2, μ1, and μ2 at the end of step 2 of the EM algorithm?