## Description

### 1 [24 points] K-means for image compression

In this problem, we will apply the K-means algorithm to lossy image compression by reducing the number of colors used in an image. CTools contains a 512×512 image of a mandrill represented in 24-bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The straightforward representation of this image therefore takes about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use K-means to reduce the image to K = 16 colors. More specifically, each pixel in the image is considered a point in the three-dimensional (r, g, b)-space. To compress the image, we will cluster these points in color-space into 16 clusters, and replace each pixel with the closest cluster centroid.

Follow the instructions below. Be warned that some of these operations can take a while (several minutes even on a fast computer!).

(a) Load mandrill-large.tiff into Matlab by typing `A = double(imread('mandrill-large.tiff'));`. Now, A is a "three-dimensional matrix," and `A(:,:,1)`, `A(:,:,2)` and `A(:,:,3)` are 512×512 arrays that respectively contain the red, green, and blue values for each pixel. Enter `imshow(uint8(round(A)));` to display the image.

(b) Since the large image has 262144 pixels and would take a while to cluster, we will instead run vector quantization on a smaller image. Repeat part (a) with mandrill-small.tiff. Treating each pixel's (r, g, b) values as an element of ℝ³, run K-means with 16 clusters on the pixel data from this smaller image, iterating (preferably) to convergence, but in no case for fewer than 30 iterations. For initialization, set each cluster centroid to the (r, g, b)-values of a randomly chosen pixel in the image.
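The clustering in part (b) can be sketched in NumPy as follows (a Python analogue of the Matlab workflow; a synthetic pixel array stands in for the data loaded from mandrill-small.tiff, and the function name `kmeans` is illustrative):

```python
import numpy as np

def kmeans(X, K=16, iters=30, seed=0):
    """Basic K-means on X of shape (n_pixels, 3); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize each centroid to the (r, g, b) values of a randomly chosen pixel.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest centroid (squared Euclidean distance).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned pixels.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, labels

# Synthetic stand-in for the small image's pixel data.
rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(1024, 3)).astype(float)
centroids, labels = kmeans(pixels, K=16, iters=30)
```

On the real image you would flatten the H×W×3 array to (H·W, 3) before clustering.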

(c) Take the matrix A from mandrill-large.tiff, and replace each pixel's (r, g, b) values with the value of the closest cluster centroid. Display the new image, and compare it visually to the original image. Hand in all your code and a printout of your compressed image. Comment on which regions of the original image are preserved well and which are not.
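The replacement step in part (c) can be sketched as below (NumPy; `centroids` is assumed to be the 16×3 array produced by the clustering in part (b), and a small synthetic array stands in for the 512×512 image):

```python
import numpy as np

def quantize(A, centroids):
    """Replace each pixel's (r, g, b) values with the closest cluster centroid."""
    H, W, _ = A.shape
    flat = A.reshape(-1, 3).astype(float)
    # Squared distance from every pixel to every centroid, then pick the nearest.
    d2 = ((flat[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return centroids[nearest].reshape(H, W, 3)

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=(32, 32, 3)).astype(float)          # stand-in image
centroids = rng.integers(0, 256, size=(16, 3)).astype(float)      # stand-in centroids
B = quantize(A, centroids)
```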


(d) If we represent the image with these reduced (16) colors as described in (c), by (approximately) what factor have we compressed the image?

$$\text{compression factor} = \frac{\text{space used to store original image}}{\text{space used to store compressed image}}$$
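As a sanity check on the arithmetic, the sketch below computes one reasonable version of this ratio; the assumptions (4 bits per pixel index since log₂(16) = 4, plus a 16-entry 24-bit codebook) are mine, not stated in the problem:

```python
# Original: 262144 pixels at 24 bits (3 bytes) each.
original_bits = 512 * 512 * 24
# Compressed: a 4-bit index per pixel plus a 16-entry, 24-bit codebook.
compressed_bits = 512 * 512 * 4 + 16 * 24
factor = original_bits / compressed_bits
print(round(factor, 2))  # approximately 6
```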

### 2 [20 points] Cross-Validation on Hyper-Parameters of SVM

In the previous assignment (Homework3), you implemented an SVM classifier using two different optimization methods, Batch Gradient Descent and Stochastic Gradient Descent (SGD). In this question, we will only consider SGD, since it converges faster and requires fewer iterations over the training set. SGD solves this minimization:

$$\min_{w,b}\; E(w, b) = \sum_{i=1}^{N} E^{(i)}(w, b)$$

$$\text{where } E^{(i)}(w, b) = \frac{1}{2N}\|w\|^2 + C \max\left(0,\; 1 - t^{(i)}\left(w^T x^{(i)} + b\right)\right)$$

The pseudo-code for SGD minimization is summarized in Homework3 Q3; in that question, you were given the values of C and η₀ to use. Your task in this question is to find good values for C and η₀ (a.k.a. hyper-parameters¹) using cross-validation. Load the Matlab file q2_data.mat, which loads the variables q2x_train, q2t_train, q2x_test and q2t_test.

[Note: the dataset svm_data.mat is different from the dataset of Homework3 Q3, therefore the best C and η₀ may be different from what was recommended for use in Homework3. Nonetheless, you may reuse parts of your code from Homework3.]
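For reference while debugging, the objective above can be evaluated directly; this is a NumPy sketch with synthetic data, and the names `example_loss` and `total_loss` are illustrative, not from the provided files:

```python
import numpy as np

def example_loss(w, b, x, t, C, N):
    """Per-example objective: ||w||^2 / (2N) + C * max(0, 1 - t * (w.x + b))."""
    margin = t * (w @ x + b)
    return (w @ w) / (2 * N) + C * max(0.0, 1.0 - margin)

def total_loss(w, b, X, T, C):
    """E(w, b): sum of the per-example objectives over the training set."""
    N = len(X)
    return sum(example_loss(w, b, X[i], T[i], C, N) for i in range(N))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
T = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}
w, b = np.zeros(2), 0.0
loss0 = total_loss(w, b, X, T, C=1.0)  # at w = 0, b = 0 every hinge term is C * 1
```

A useful check: with all-zero parameters and C = 1 the loss equals N, since every example sits exactly on the margin boundary.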

(a) [3 points] Write down the expressions for $\nabla_w E(w, b)$ and $\frac{\partial}{\partial b} E(w, b)$, as well as the SGD update rules for both parameters.

(b) [8 points] Implement "hold-out cross-validation" using a (50% : 50%) split. In other words, train on the first half of the training examples, and validate on the second half of the training examples to choose a good configuration of C and η₀. Try all pairs of C and η₀ where C ∈ {0.01, 0.1, 1, 10, 100, 1000} and η₀ ∈ {0.01, 0.5, 1.0, 10, 100}. Run 200 iterations of SGD for each combination of C and η₀. Report the C and η₀ that minimize the validation error. If multiple pairs (C, η₀) achieve the minimum error, choose the pair for which η₀ is smallest. Finally, using those cross-validated hyper-parameters, train your SVM on the entire training set, then compute and report the test error (i.e. the number of mis-classified test examples).
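The grid search and tie-breaking rule in part (b) can be organized as in this sketch; `validation_error` is a hypothetical placeholder for "train SGD on the first half, count mistakes on the second half" (here a toy function chosen so that several η₀ values tie):

```python
Cs = [0.01, 0.1, 1, 10, 100, 1000]
etas = [0.01, 0.5, 1.0, 10, 100]

def validation_error(C, eta0):
    # Placeholder: replace with your SGD train + validate routine.
    # This toy surface ties across all eta0 values when C == 1.
    return abs(C - 1)

# Evaluate every (C, eta0) pair; store eta0 before C so that tuple
# comparison breaks ties on the smallest eta0, as the problem requires.
results = [(validation_error(C, eta0), eta0, C) for C in Cs for eta0 in etas]
best_err = min(err for err, _, _ in results)
_, best_eta0, best_C = min(r for r in results if r[0] == best_err)
```

With this toy error surface the search returns C = 1 and, by the tie-break, η₀ = 0.01.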

(c) [9 points] Implement "k-fold cross-validation" using k = 10, known as 10-fold validation. Try all pairs of C and η₀ where C ∈ {0.01, 0.1, 1, 10, 100, 1000} and η₀ ∈ {0.01, 0.5, 1.0, 10, 100}. Run 200 iterations of SGD for each combination of C and η₀. Report the C and η₀ that minimize the validation error. Finally, using those cross-validated hyper-parameters, train your SVM on the entire training set, then compute and report the test error (i.e. the number of mis-classified test examples).
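One way to generate the 10 train/validation splits for part (c) is sketched below (NumPy; `kfold_indices` is an illustrative helper, and `np.array_split` handles the case where N is not divisible by k):

```python
import numpy as np

def kfold_indices(N, k=10, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(N)          # shuffle once, then carve into folds
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(100, k=10))
```

For each (C, η₀) pair you would average the validation error over the 10 folds before comparing pairs.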

### 3 [20 points] Intuitions behind Kernels in classification

In the previous assignment, you proved some useful properties of kernels. In addition, you kernelized an algorithm. In this (short) question, you will explore why kernels are useful. For this question, load the Matlab file q3_data.mat, which loads the variables q3x_train, q3t_train, q3x_test and q3t_test.

¹The parameters of the SVM are w and b, which are fit to the training data. The hyper-parameters are chosen before training begins.


(a) [6 points] Train an SVM over the training data using Matlab's built-in svmtrain with a linear kernel (i.e. no kernel). Plot the training data and the separating hyperplane. Use svmclassify to classify all test examples. Calculate and report the classification accuracy.

[Hint: svmtrain accepts a 'showplot' argument which plots the training points and the separating hyperplane.]

(b) [5 points] Train an SVM over the training data using a Radial Basis Function (RBF) kernel, also known as a Gaussian kernel. Plot the training data and the separating hyperplane. Report the test classification accuracy using the RBF kernel.
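The RBF (Gaussian) kernel compares points via k(x, x′) = exp(−‖x − x′‖² / (2σ²)); the sketch below computes the Gram matrix in NumPy (the function name is illustrative, and the σ parameterization shown is the common convention, which may differ from a given toolbox's):

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X, sigma=1.0)
```

Note that every point has kernel value 1 with itself, and the value decays toward 0 as points move apart, which is what lets the RBF SVM carve non-linear boundaries.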

(c) [3 points] Which kernel did better, RBF or linear (i.e. no kernel)? Why did this happen?

(d) [6 points] Use 5-fold cross-validation to determine the best hyper-parameter σ for the SVM with RBF kernel. Try σ ∈ {0.2, 0.5, 1, 1.5, 2, 2.5, 3}. [Note: you can use the help command to find out how to set hyper-parameters when using the svmtrain command.]

(e) [Extra credit, 8 points] Repeat steps (a) through (d) using LibSVM instead of Matlab's SVM toolkit. You can find information about how to download and use LibSVM by searching for libsvm on Google.

### 4 [16 points] Update rules for a 2-layer Neural Network

In this question, you will derive the update rules for a 2-layer neural network. Our small neural network is depicted in this diagram:

• The first (input) layer takes a training example x that is 3-dimensional (M = 3).

• Given an example x on the input layer, the second layer (a.k.a. the first hidden layer) computes a tanh(·) transformation of the input layer². In particular, each of the 3 depicted hidden neurons computes the following:

$$h_k = \tanh\left(b_k + \sum_{j=1}^{3} w_{kj} x_j\right) = \tanh\left(b_k + w_k^T x\right) \quad \text{for } k = 1, 2, 3$$

where $w_{kj}$ is the weight of the parameter (or edge, as depicted in the graph) connecting $x_j$ and $h_k$, and the vector $w_k^T = [w_{k1}, w_{k2}, w_{k3}]$. As discussed in class, we can define a matrix W to contain all the $w_k$ vectors:

$$W = \begin{bmatrix} w_1^T \\ w_2^T \\ w_3^T \end{bmatrix}$$

²The lecture slides define $z_k = b_k + w_k^T x$ and $h_k = \tanh(z_k)$.


• Using the matrix W, we can define the vector h to be:

$$h = \tanh(b + Wx)$$

where $h = [h_1, h_2, h_3]$ and $b = [b_1, b_2, b_3]$.

• Finally, the output layer (which contains only one neuron, the output neuron) is a logistic layer. In particular, the output neuron computes the following:

$$\hat{t} = \sigma(b_4 + \theta^T h)$$

where $\sigma(\cdot)$ is the logistic function, $\sigma(a) = \frac{1}{1 + e^{-a}}$, $\theta$ is the vector containing the weights of the edges connecting hidden neurons to the output neuron, and $b_4$ is the bias for the output neuron.

Given a training example x and its label t, we define the error on the example as the cross-entropy loss:

$$E(x, t) = -\left[\, t \log(\hat{t}) + (1 - t) \log(1 - \hat{t}) \,\right]$$
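The forward pass and loss defined above can be sketched as follows (NumPy; the weight values here are arbitrary stand-ins chosen to make the expected output easy to verify):

```python
import numpy as np

def forward(x, W, b, theta, b4):
    """2-layer network: tanh hidden layer, then a single logistic output."""
    h = np.tanh(b + W @ x)                            # hidden activations, shape (3,)
    t_hat = 1.0 / (1.0 + np.exp(-(b4 + theta @ h)))   # logistic output
    return t_hat

def cross_entropy(t, t_hat):
    """E(x, t) = -[t log t_hat + (1 - t) log(1 - t_hat)]."""
    return -(t * np.log(t_hat) + (1 - t) * np.log(1 - t_hat))

x = np.array([0.5, -1.0, 2.0])
W = np.zeros((3, 3)); b = np.zeros(3)
theta = np.zeros(3); b4 = 0.0
t_hat = forward(x, W, b, theta, b4)   # all-zero weights give t_hat = 0.5
loss = cross_entropy(1.0, t_hat)      # -log(0.5) = log 2
```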

If we want to optimize the network parameters ($\theta$, $b_4$, $W$, $b$) via SGD, we would first need to find expressions for the derivatives of the error function above with respect to each of the parameters.

(a) [6 points] Given a training example (x, t), find the derivatives of the error function with respect to the parameters of the output neuron. In particular, calculate expressions for $\nabla_\theta E(x, t)$ and $\frac{\partial}{\partial b_4} E(x, t)$.

(b) [10 points] Given a training example (x, t), find the derivatives of the error function with respect to the parameters of the hidden layer. In particular, calculate expressions for $\nabla_W E(x, t)$ and $\nabla_b E(x, t)$.
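Whatever expressions you derive in (a) and (b), you can check them numerically with central finite differences before trusting them in SGD; here is a generic checker, demonstrated on a simple function whose gradient is known (the helper name is illustrative):

```python
import numpy as np

def numerical_grad(f, p, eps=1e-6):
    """Central finite-difference gradient of scalar f at parameter vector p."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        d = np.zeros_like(p); d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

# Example: f(p) = ||p||^2 has analytic gradient 2p.
p = np.array([1.0, -2.0, 3.0])
g = numerical_grad(lambda q: q @ q, p)
```

To check your derivation, wrap the network's loss as a function of one parameter block (flattening W if needed) and compare the analytic gradient against this estimate on random inputs; agreement to roughly 1e-6 is a strong sign the expressions are right.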
