## Description

This homework contains 4 questions. The last two questions require programming. The maximum

number of points is 100 plus 10 bonus points.

1 Question 1 – Boosting (20 points)

We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page 555 Murphy claims

that “it was proved that one could boost the performance (on the training set) of any weak learner arbitrarily

high, provided the weak learned could always perform slightly better than chance.” We will now verify this

in the AdaBoost framework.

1. (5 points) Given a set of N observations (x

j

, yj

) where y

j

is the label y

j ∈ {−1, 1}, let ht(x) be the

weak classifier at step t and let αt be its weight. First we note that the final classi

3. (10 points) We showed above that training error is bounded above by QT

t=1 Zt

. At step t the values

Z1, Z2, . . . , Zt−1 are already fixed therefore at step t we can choose αt

to minimize Zt

. Let

t =

X

N

j=1

w

t

j

δ(ht(x

j

) 6= y

j

)

be the weighted training error for weak classifier ht(x) then we can re-write the formula for Zt as:

Zt = (1 − t) exp(−αt) + t exp(αt)

(a) First find the value of αt

that minimizes Zt

then show that

Z

opt

t = 2p

t(1 − t)

(b) Assume we choose Zt

this way. Then re-write t =

1

2 − γt where γt 0 implies better than

random and γt < 0 implies worse than random. Then show that:
Zt ≤ exp(−2γ
2
t
)
You may want to use the fact that log(1 − x) ≤ −x for 0 ≤ x < 1
Thus we have:
training ≤
Y
T
t=1
Zt ≤ exp(−2
X
T
t=1
γ
2
t
)
(c) Finally, show that if each classifier is better than random (e.g. γt ≥ γ for all t and γ 0) that:
training ≤ exp(−2T γ2
)
Which shows that the training error can be made arbitrarily small with enough steps.
2 PCA via Successive Deflation [20 points]
(Adapted from Murphy Exercise 12.7)
Suppose we have a set of n data points x1, . . . , xn, where each xi
is represented as a d-dimensional
column vector.
Let X = [x1; . . . ; xn] be the (d × n) matrix where column i is equal to xi
. Define C =
1
nXXT
to be
the covariance matrix of X, where cij =
Pn
l=1 xilxjl = covar(i, j).
Next, order the eigenvectors of C by their eigenvalues (largest first), and let v1, v2, . . . , vk be the first k
eigenvectors. These satisfy
v
T
i vj =
(
0 if i 6= j
1 if i = j
v1 is the first principal eigenvector of C (the eigenvector with the largest eigenvalue), and as such satisfies
Cv1 = λ1v1. Now define x˜i as the orthogonal projection of xi onto the space orthogonal to v1:
x˜i = (I − v1v
T
1
)xi
Finally, define X˜ = [x˜1; . . . ; x˜n] as the deflated matrix of rank d − 1, which is obtained by removing
from the d-dimensional data the component that lies in the direction of the first principal eigenvector:
X˜ = (I − v1v
T
1
)X
2
1. [5 points] Show that the covariance of the deflated matrix,
C˜ =
1
n
X˜ X˜ T
is given by
C˜ =
1
n
XXT − λ1v1v
T
1
(Hint: Some useful facts: (I − v1v
T
1
) is symmetric, XXT v1 = nλ1v1, and v
T
1 v1 = 1. Also, for any
matrices A and B, (AB)
T = BT AT
.)
2. [5 points] Show that for j 6= 1, if vj is a principal eigenvector of C with corresponding eigenvalue λj
(that is, Cvj = λjvj ), then vj is also a principal eigenvector of C˜ with the same eigenvalue λj .
3. [5 points] Let u be the first principal eigenvector of C˜ . Explain why u = v2. (You may assume u is
unit norm.)
4. [5 points] Suppose we have a simple method f for finding the leading eigenvector and eigenvalue of
a positive-definite matrix, denoted by [λ, u] = f(C). Write some pseudocode for finding the first k
principal basis vectors of X that only uses the special f function and simple vector arithmetic.
(Hint: This should be a simple iterative routine that takes only a few lines to write. The input is C, k,
and the function f, the output should be vj and λj for j ∈ 1, · · · , k)
3 Programming Question (clustering with K-means) [30 points]
In class we discussed the K-means clustering algorithm. Your programming assignment this week is to
implement the K-means algorithm on digit data.
3.1 The Data
There are two files with the data. The first digit.txt contains the 1000 observations of 157 pixels
(a subset of the original 785) from images containing handwritten digits. The second file labels.txt
contains the true digit label (either 1, 3, 5, or 7). You can read both data files in with
X = load(‘../hw3data/digits/digit.txt’);
Y = load(‘../hw3data/digits/labels.txt’);
Please note that there aren’t IDs for the digits. Please assume the first line is ID 1, the second line is ID
2, and so on. The labels correspond to the digit file, so the first line of labels.txt is the label for the digit in
the first line of digit.txt.
3.2 The algorithm
Your algorithm should be implemented as follows:
1. Select k starting centers that are points from your data set. You should be able to select these centers
randomly or have them given as a parameter.
2. Assign each data point to the cluster associated with the nearest of the k center points.
3. Re-calculate the centers as the mean vector of each cluster from (2).
4. Repeat steps (2) and (3) until convergence or iteration limit.
Define convergence as no change in label assignment from one step to another or you have iterated 20
times (whichever comes first). Please count your iterations as follows: after 20 iterations, you should have
assigned the points 20 times.
3
3.3 Within group sum of squares
The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing
the variation between groups. A good model has low sum of squares within each group.
We define sum of squares in the traditional way. Let Ck be the k
th cluster and let µk be the mean of the
observations in cluster Ck. Then the within group sum of squares for cluster Ck is defined as:
SS(k) = X
i∈Ck
||xi − µk
||2
Please note that the term ||xi − µk
||2
is the euclidean distance between xi and µk
.
If there are K clusters total then the “total within group sum of squares” is just the sum of all K of these
individual SS(k) terms.
3.4 Pair-counting measures
Given that we know the actual assignment labels for each data point we can attempt to analyze how well the
clustering recovered this. WE will performance criteria which are based on two principles: i) two data points
that belong to the same class should be assigned to the same cluster; and ii) two data points that belong to
different classes should be assigned to different clusters. Formally speaking, consider all pairs of same-class
data points, let p1 be the percentage of pairs of which both data points were assigned to the same cluster.
Consider all pairs of different-class data points, let p2 be the percentage of pairs of which two data points are
assigned to different clusters. Let p3 be the average of these two values p3 = (p1+p2)/2, which summarizes
the clustering performance.
3.5 Questions
When you have implemented the algorithm please report the following:
1. [8pts] The values of the total within group sum of squares and pair-counting measures (p1, p2, p3) for
k = 2, k = 4 and k = 6. Start your centers with the first k points in the dataset. So, if k = 2, your
initial centroids will be ID 1 and ID 2, which correspond to the first two lines in the file.
2. [7pts] The number of iterations that k-means ran for k = 6, starting the centers as in the previous item.
Make sure you count the iterations correctly. If you start with iteration i = 1 and at i = 4 the cluster
assignments don’t change, the number of iterations was 4, as you had to do step 2 four times to figure
this out.
3. [7pts] A plot of the total within group sum of squares versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start
your centers randomly (choose k points from the dataset at random).
4. [8pts] A plot of p1, p2, p3 versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start your centers randomly
(choose k points from the dataset at random).
For the last two items, you should run k-means algorithm several times (e.g., 10 times) and average the
results. For each question, submit a single plot, which is the average of the runs.
4 Programming Question (scene classification) [30 points + 10 bonus]
This question gives you the opportunities to learn LibSVM, which is one of the best software tool for classification
problem. You will train SVMs with different kernels and use them to classify scenes from The
Big Bang Theory, your favorite TV series. To classify an image, you will use Bag-of-Word representation,
which is one of the most popular image representation. This representation views an images as the histogram
of image features, or visual words. You MUST use LibSVM and your K-means implementation
from Question 3.
4
4.1 LibSVM installation
First download LibSVM https://www.csie.ntu.edu.tw/˜cjlin/libsvm/. Follow the instruction
in README to install LibSVM for Matlab. Two main functions of LibSVM that you should pay attention
to are: svmtrain and svmpredict. Note that Matlab also has a machine learning toolbox that
comes with these two functions with exactly the same names. However, Matlab’s SVM implementation is
not as good as LibSVM, so you need to make sure that you are using svmtrain and svmpredict from LibSVM.
To check if you have installed the program correctly, in Matlab do:
which svmtrain
which svmpredict
Matlab should return the paths to the svmtrain and svmpredict of LibSVM. To learn how to use these
functions, type the names of the function in Matlab:
svmtrain
svmpredict
4.2 Data
Training and test images are provided in the subdirectory bigbangtheory. The training image ids and
labels are given in train.mat. This file contains two variables: imgIds and lbs. imgIds is a column
vector and each row has a name of image in the training set. lbs is a matrix denoting the label for the
image with the corresponding index. There are total 8 classes for the dataset: living room (1), kitchen (2),
hallway (3), Penny’s living room (4), cafeteria (5), Cheesecake factory (6), laundry room (7), and comic
bookstore (8).
Validation set is not provided for this question. You have to do cross validation to find the parameter for
the best performance. You can implement cross validation by yourself, or you can use LibSVM functionality.
Image ids for test set are given in test.mat.
4.3 Helper functions
A number of Matlab helper functions are provided. Also, some functions are left unfinished and you have to
complete them.
1. Use HW3 BoW.learnDictionary() to learn visual vocabulary for scene representation. You
have to fill your K-means implementation from Question 3 in this function.
2. Use HW3 BoW.cmpFeatVecs() to compute feature vectors. This function will return the histogram
of visual words for each image, which is our feature vector.
4.4 What to implement?
1. Complete the function HW3 BoW.learnDictionary(). This function learns a visual vocabulary
by running k-means on random image patches. Learn a visual dictionary with K = 1000 clusters.
2. (5 points) Using SVM with RBF kernel, report the 5-fold cross-validation accuracy for the default
kernel parameters for C and γ.
3. (10 points) Tune C and γ for SVM with RBF kernel using 5-fold cross validation. Report the values
of C, γ, and the 5-fold cross-validation accuracy.
4. Unfortunately, LibSVM does not support exponential X
2 kernel, so you will need to implement it.
Implement a function:
[trainK, testK] = cmpExpX2Kernel(trainD, testD, gamma)
5
that takes train and test data and the kernel parameter gamma and return the train and test kernels.
Recall that the exponential X
2 kernel value for two d-dimensional feature vector x and y is:
k(x, y) = exp
−
1
γ
X
d
i=1
(xi − yi)
2
xi + yi +
!
(1)
The term is added to avoid division by 0.
5. (10 points) LibSVM allows using pre-computed kernels. Train an SVM with exponential X
2 kernel.
Report the 5-fold cross validation accuracy with the best tuned values of C and γ. Note that a good
default value of γ is the average of X
2 distance between training data points, as discussed in the
lecture.
6. (5 points) Use your model to predict on the test set. Save the predicted labels on a CSV file
predTestLabels.csv (see 5.2 for the format). The order of predicted labels should correspond
to the order of the reviews in test.mat. Submit this predTestLabels.csv file on Kaggle and
report classification accuracy.
7. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You
must use your Stony Brook email to register on Kaggle. We will maintain a leader board, and the
top three entries at the end of the competition (due date) will receive 10 bonus points. The ranking is
based on classification accuracy in your PDF submission file.
To prevent exploiting test data, you are allowed to make a maximum of 2 submissions per 24 hours.
Your submission will be evaluated immediately and the leader board will be updated.
5 What to submit?
5.1 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Do not submit the
provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g.,
10947XXXX heeyoung kwon). Zip this folder and submit the zip file on Blackboard. Your submission must
be a zip file, i.e, SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it
should contain:
1. Answers to Question 1 and 2
2. Answers to Question 3.5, including the requested plots.
3. Answers to Question 4.4
5.2 Kaggle submission
For Questions 4.4.4 and 4.4.5, you must submit a CSV file to get the classification accuracy from the competition
site inclass.kaggle.com/c/hw3-scene-classification/. A submission file should
contain two columns: ImgId and Prediction. The file should contain a header and have the following format.
ImgId, P rediction
001778, 2
001779, 5
... ...
(2)
A sample submission file is available from the competition site and our handout.
6
6 Cheating warnings
Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You must use your real name
to register on Kaggle. Do not create multiple accounts to bypass the submission limitation per 24 hours.
Doing so will be considered cheating.