Homework 3 solution


Original Work


4.3/5 - (3 votes)

This homework contains 4 questions. The last two questions require programming. The maximum
number of points is 100 plus 10 bonus points.
1 Question 1 – Boosting (20 points)
We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page 555 Murphy claims
that “it was proved that one could boost the performance (on the training set) of any weak learner arbitrarily
high, provided the weak learned could always perform slightly better than chance.” We will now verify this
in the AdaBoost framework.
1. (5 points) Given a set of N observations (x
, yj
) where y
is the label y
j ∈ {−1, 1}, let ht(x) be the
weak classifier at step t and let αt be its weight. First we note that the final classi
3. (10 points) We showed above that training error is bounded above by QT
t=1 Zt
. At step t the values
Z1, Z2, . . . , Zt−1 are already fixed therefore at step t we can choose αt
to minimize Zt
. Let
t =
) 6= y
be the weighted training error for weak classifier ht(x) then we can re-write the formula for Zt as:
Zt = (1 − t) exp(−αt) + t exp(αt)
(a) First find the value of αt
that minimizes Zt
then show that
t = 2p
t(1 − t)
(b) Assume we choose Zt
this way. Then re-write t =
2 − γt where γt 0 implies better than
random and γt < 0 implies worse than random. Then show that: Zt ≤ exp(−2γ 2 t ) You may want to use the fact that log(1 − x) ≤ −x for 0 ≤ x < 1 Thus we have: training ≤ Y T t=1 Zt ≤ exp(−2 X T t=1 γ 2 t ) (c) Finally, show that if each classifier is better than random (e.g. γt ≥ γ for all t and γ 0) that: training ≤ exp(−2T γ2 ) Which shows that the training error can be made arbitrarily small with enough steps. 2 PCA via Successive Deflation [20 points] (Adapted from Murphy Exercise 12.7) Suppose we have a set of n data points x1, . . . , xn, where each xi is represented as a d-dimensional column vector. Let X = [x1; . . . ; xn] be the (d × n) matrix where column i is equal to xi . Define C = 1 nXXT to be the covariance matrix of X, where cij = Pn l=1 xilxjl = covar(i, j). Next, order the eigenvectors of C by their eigenvalues (largest first), and let v1, v2, . . . , vk be the first k eigenvectors. These satisfy v T i vj = ( 0 if i 6= j 1 if i = j v1 is the first principal eigenvector of C (the eigenvector with the largest eigenvalue), and as such satisfies Cv1 = λ1v1. Now define x˜i as the orthogonal projection of xi onto the space orthogonal to v1: x˜i = (I − v1v T 1 )xi Finally, define X˜ = [x˜1; . . . ; x˜n] as the deflated matrix of rank d − 1, which is obtained by removing from the d-dimensional data the component that lies in the direction of the first principal eigenvector: X˜ = (I − v1v T 1 )X 2 1. [5 points] Show that the covariance of the deflated matrix, C˜ = 1 n X˜ X˜ T is given by C˜ = 1 n XXT − λ1v1v T 1 (Hint: Some useful facts: (I − v1v T 1 ) is symmetric, XXT v1 = nλ1v1, and v T 1 v1 = 1. Also, for any matrices A and B, (AB) T = BT AT .) 2. [5 points] Show that for j 6= 1, if vj is a principal eigenvector of C with corresponding eigenvalue λj (that is, Cvj = λjvj ), then vj is also a principal eigenvector of C˜ with the same eigenvalue λj . 3. [5 points] Let u be the first principal eigenvector of C˜ . Explain why u = v2. (You may assume u is unit norm.) 4. [5 points] Suppose we have a simple method f for finding the leading eigenvector and eigenvalue of a positive-definite matrix, denoted by [λ, u] = f(C). Write some pseudocode for finding the first k principal basis vectors of X that only uses the special f function and simple vector arithmetic. (Hint: This should be a simple iterative routine that takes only a few lines to write. The input is C, k, and the function f, the output should be vj and λj for j ∈ 1, · · · , k) 3 Programming Question (clustering with K-means) [30 points] In class we discussed the K-means clustering algorithm. Your programming assignment this week is to implement the K-means algorithm on digit data. 3.1 The Data There are two files with the data. The first digit.txt contains the 1000 observations of 157 pixels (a subset of the original 785) from images containing handwritten digits. The second file labels.txt contains the true digit label (either 1, 3, 5, or 7). You can read both data files in with X = load(‘../hw3data/digits/digit.txt’); Y = load(‘../hw3data/digits/labels.txt’); Please note that there aren’t IDs for the digits. Please assume the first line is ID 1, the second line is ID 2, and so on. The labels correspond to the digit file, so the first line of labels.txt is the label for the digit in the first line of digit.txt. 3.2 The algorithm Your algorithm should be implemented as follows: 1. Select k starting centers that are points from your data set. You should be able to select these centers randomly or have them given as a parameter. 2. Assign each data point to the cluster associated with the nearest of the k center points. 3. Re-calculate the centers as the mean vector of each cluster from (2). 4. Repeat steps (2) and (3) until convergence or iteration limit. Define convergence as no change in label assignment from one step to another or you have iterated 20 times (whichever comes first). Please count your iterations as follows: after 20 iterations, you should have assigned the points 20 times. 3 3.3 Within group sum of squares The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing the variation between groups. A good model has low sum of squares within each group. We define sum of squares in the traditional way. Let Ck be the k th cluster and let µk be the mean of the observations in cluster Ck. Then the within group sum of squares for cluster Ck is defined as: SS(k) = X i∈Ck ||xi − µk ||2 Please note that the term ||xi − µk ||2 is the euclidean distance between xi and µk . If there are K clusters total then the “total within group sum of squares” is just the sum of all K of these individual SS(k) terms. 3.4 Pair-counting measures Given that we know the actual assignment labels for each data point we can attempt to analyze how well the clustering recovered this. WE will performance criteria which are based on two principles: i) two data points that belong to the same class should be assigned to the same cluster; and ii) two data points that belong to different classes should be assigned to different clusters. Formally speaking, consider all pairs of same-class data points, let p1 be the percentage of pairs of which both data points were assigned to the same cluster. Consider all pairs of different-class data points, let p2 be the percentage of pairs of which two data points are assigned to different clusters. Let p3 be the average of these two values p3 = (p1+p2)/2, which summarizes the clustering performance. 3.5 Questions When you have implemented the algorithm please report the following: 1. [8pts] The values of the total within group sum of squares and pair-counting measures (p1, p2, p3) for k = 2, k = 4 and k = 6. Start your centers with the first k points in the dataset. So, if k = 2, your initial centroids will be ID 1 and ID 2, which correspond to the first two lines in the file. 2. [7pts] The number of iterations that k-means ran for k = 6, starting the centers as in the previous item. Make sure you count the iterations correctly. If you start with iteration i = 1 and at i = 4 the cluster assignments don’t change, the number of iterations was 4, as you had to do step 2 four times to figure this out. 3. [7pts] A plot of the total within group sum of squares versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start your centers randomly (choose k points from the dataset at random). 4. [8pts] A plot of p1, p2, p3 versus k for k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Start your centers randomly (choose k points from the dataset at random). For the last two items, you should run k-means algorithm several times (e.g., 10 times) and average the results. For each question, submit a single plot, which is the average of the runs. 4 Programming Question (scene classification) [30 points + 10 bonus] This question gives you the opportunities to learn LibSVM, which is one of the best software tool for classification problem. You will train SVMs with different kernels and use them to classify scenes from The Big Bang Theory, your favorite TV series. To classify an image, you will use Bag-of-Word representation, which is one of the most popular image representation. This representation views an images as the histogram of image features, or visual words. You MUST use LibSVM and your K-means implementation from Question 3. 4 4.1 LibSVM installation First download LibSVM https://www.csie.ntu.edu.tw/˜cjlin/libsvm/. Follow the instruction in README to install LibSVM for Matlab. Two main functions of LibSVM that you should pay attention to are: svmtrain and svmpredict. Note that Matlab also has a machine learning toolbox that comes with these two functions with exactly the same names. However, Matlab’s SVM implementation is not as good as LibSVM, so you need to make sure that you are using svmtrain and svmpredict from LibSVM. To check if you have installed the program correctly, in Matlab do: which svmtrain which svmpredict Matlab should return the paths to the svmtrain and svmpredict of LibSVM. To learn how to use these functions, type the names of the function in Matlab: svmtrain svmpredict 4.2 Data Training and test images are provided in the subdirectory bigbangtheory. The training image ids and labels are given in train.mat. This file contains two variables: imgIds and lbs. imgIds is a column vector and each row has a name of image in the training set. lbs is a matrix denoting the label for the image with the corresponding index. There are total 8 classes for the dataset: living room (1), kitchen (2), hallway (3), Penny’s living room (4), cafeteria (5), Cheesecake factory (6), laundry room (7), and comic bookstore (8). Validation set is not provided for this question. You have to do cross validation to find the parameter for the best performance. You can implement cross validation by yourself, or you can use LibSVM functionality. Image ids for test set are given in test.mat. 4.3 Helper functions A number of Matlab helper functions are provided. Also, some functions are left unfinished and you have to complete them. 1. Use HW3 BoW.learnDictionary() to learn visual vocabulary for scene representation. You have to fill your K-means implementation from Question 3 in this function. 2. Use HW3 BoW.cmpFeatVecs() to compute feature vectors. This function will return the histogram of visual words for each image, which is our feature vector. 4.4 What to implement? 1. Complete the function HW3 BoW.learnDictionary(). This function learns a visual vocabulary by running k-means on random image patches. Learn a visual dictionary with K = 1000 clusters. 2. (5 points) Using SVM with RBF kernel, report the 5-fold cross-validation accuracy for the default kernel parameters for C and γ. 3. (10 points) Tune C and γ for SVM with RBF kernel using 5-fold cross validation. Report the values of C, γ, and the 5-fold cross-validation accuracy. 4. Unfortunately, LibSVM does not support exponential X 2 kernel, so you will need to implement it. Implement a function: [trainK, testK] = cmpExpX2Kernel(trainD, testD, gamma) 5 that takes train and test data and the kernel parameter gamma and return the train and test kernels. Recall that the exponential X 2 kernel value for two d-dimensional feature vector x and y is: k(x, y) = exp − 1 γ X d i=1 (xi − yi) 2 xi + yi +  ! (1) The  term is added to avoid division by 0. 5. (10 points) LibSVM allows using pre-computed kernels. Train an SVM with exponential X 2 kernel. Report the 5-fold cross validation accuracy with the best tuned values of C and γ. Note that a good default value of γ is the average of X 2 distance between training data points, as discussed in the lecture. 6. (5 points) Use your model to predict on the test set. Save the predicted labels on a CSV file predTestLabels.csv (see 5.2 for the format). The order of predicted labels should correspond to the order of the reviews in test.mat. Submit this predTestLabels.csv file on Kaggle and report classification accuracy. 7. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leader board, and the top three entries at the end of the competition (due date) will receive 10 bonus points. The ranking is based on classification accuracy in your PDF submission file. To prevent exploiting test data, you are allowed to make a maximum of 2 submissions per 24 hours. Your submission will be evaluated immediately and the leader board will be updated. 5 What to submit? 5.1 Blackboard submission You will need to submit both your code and your answers to questions on Blackboard. Do not submit the provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g., 10947XXXX heeyoung kwon). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e, SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it should contain: 1. Answers to Question 1 and 2 2. Answers to Question 3.5, including the requested plots. 3. Answers to Question 4.4 5.2 Kaggle submission For Questions 4.4.4 and 4.4.5, you must submit a CSV file to get the classification accuracy from the competition site inclass.kaggle.com/c/hw3-scene-classification/. A submission file should contain two columns: ImgId and Prediction. The file should contain a header and have the following format. ImgId, P rediction 001778, 2 001779, 5 ... ... (2) A sample submission file is available from the competition site and our handout. 6 6 Cheating warnings Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You must use your real name to register on Kaggle. Do not create multiple accounts to bypass the submission limitation per 24 hours. Doing so will be considered cheating.