## Description

1. (30 points) Let X = {x

1

, . . . , x

N } with x

t ∈ R

D, t = 1, . . . , N be a given training set.

Assume that the dataset is centered, i.e., PN

t=1 x

t = 0 ∈ R

D. We focus on performing linear

dimensionality reduction on the dataset using PCA (principal component analysis). With

PCA, for each x

t ∈ R

D, we get z

t = Wx

t

, where z

t ∈ R

d

, d < D, is the low dimensional

projection, and W ∈ R

d×D is the PCA projection matrix. Let Σ = 1

N

PN

t=1 x

t

(x

t

)

T be the

sample covariance matrix. Further, let v

t = WT z

t

so that v

t ∈ R

D.

(a) (10 points) Professor HighLowHigh claims: v

t = x

t

for all t = 1, . . . , N. Is the claim

correct? Clearly explain and prove your answer with necessary (mathematical) details.

(b) (20 points) Professor HighLowHigh also claims:

X

N

t=1

kx

t

k

2

2 −

X

N

t=1

kv

t

k

2

2 =

X

N

t=1

kx

t − v

t

k

2

2

,

where for a vector a = [a1 · · · aD], kak

2

2 =

PD

k=1(ak)

2

. Is the claim correct? Clearly

explain and prove your answer with necessary (mathematical) details.

2. (30 points) Let Z = {(x

1

, r

1

), . . . ,(x

N , r

N )}, x

t ∈ R

d

, r

t ∈ R

k be a set of N training samples.

We consider training a multilayer perceptron as shown in Figure 1. We consider a general

setting where the transfer functions at each stage are denoted by g, i.e.,

z

t

h = g(a

t

h

) = g

X

d

j=1

wh,jx

t

j + w0

and y

t

i = g(a

t

i

) = g

X

H

h=1

vi,hz

t

h + vi0

!

,

where a

t

h

, at

i

respectively denote the input activation for hidden node h and output node i.

Further, let L(·, ·) be the loss function, so that the learning focuses on minimizing:

E(W, V |Z) = X

N

t=1

X

k

i=1

L(r

t

i

, yt

i

) .

(a) (10 points) Show that the stochastic gradient descent update for vi,h is of the form

v

new

i,h = v

old

i,h + ∆vi,h, with the update

∆vi,h = η∆t

i

z

t

h

, where ∆t

i = g

0

(a

t

i

)

−

∂L(r

t

i

, yt

i

)

∂yt

i

. (1)

(b) (20 points) Show that the stochastic gradient descent update for wh,j is of the form

w

new

h,j = w

old

h,j + ∆wh,j , with the update

∆wh,j = η∆t

hx

t

j

, where ∆t

h = g

0

(a

t

h

)

X

k

i=1

∆t

i

vi,h!

. (2)

Figure 1: Two layer perceptron.

Programming assignment:

The next problem involves programming. For Question 3, we will be using the 2-class classification datasets from Boston50 and Boston75. In particular, we will develop code for 2-class

Support Vector Machines (SVMs) using gradient descent. The goal will be to modify your code for

MyLogisticReg2 from HW3.

3. (40 points) We will develop code for 2-class SVMs with parameters (w, w0) where w ∈

R

d

, w0 ∈ R. Assume a given dataset {(x

t

, yt

), t = 1, . . . , N}, where x

t ∈ R

d and y

t ∈ {−1, 1}.

Recall from our discussion in class that training SVMs involves minimizing the following

objective function:

f(w, w0) = 1

n

Xn

i=1

max{0, 1 − y

t

(wT x

t + w0)} +

λ

2

kwk

2

. (3)

We will use λ = 5 in this assignment.

For reference, compare the objective function to that of regularized logistic regression which

you recently worked on as part of HW3:

f(w, w0) = 1

n

Xn

i=1

{−y

t

(wT x

t + w0) + log(1 + exp(wT x

t + w0))} +

λ

2

kwk

2

, (4)

where we had used λ = 0 for the HW3 code.

We will develop code for MySVM2 with corresponding MySVM2.fit(X,y) and MySVM2.predict(X)

functions. Parameters for the model can be initialized following what you had done for

MyLogisticReg2. In the fit function, the parameters will be estimated using mini-batch

stochastic gradient descent with different mini-batch sizes m ≤ n. In particular, you will

modify your MyLogisticReg2 code by using gradients for the SVM objective in (3) instead of

the logistic regression objective in (4). Further, you will have to add the mini-batch stochastic

gradient descent (SGD) functionality which, for a pre-specified mini-batch size m, picks m

unique points at random to do the gradient descent in each iteration. We will run experiments

with different values of m.

We will compare the performance of MySVM2 for different values of mini-batch size m with

LogisticRegression1 on two datasets: Boston50 and Boston75. Recall that Boston has

506 data points, and a 5-fold cross-validation leaves n ≈ 400 points for training in each fold.2

For mini-batch SGD, we will consider three different values of m:

(i) m = 40, which is ≈ 10% of the dataset in each fold for 5-fold cross-validation,

(ii) m = 200, which is ≈ 50% of the dataset in each fold for 5-fold cross-validation, and

(iii) m = n, which is the full dataset in each fold for 5-fold cross-validation.

Note that m = n uses the full dataset (available for that fold) in each iteration and hence

corresponds to the usual gradient descent.3

Using my cross val with 5-fold cross-validation, report the error rates in each fold as well as

the mean and standard deviation of error rates across all folds for the four methods: MySVM2

with m = 40, m = 200, and m = n, and LogisticRegression, applied to the two 2-class

classification datasets: Boston50 and Boston75.

You will have to submit (a) code and (b) summary of results:

(a) Code: You will have to submit code for MySVM2() as well as a wrapper code q3().

For developing MySVM2(), you are encouraged to consult the code for MyLogisticReg2()

from HW3. You need to make sure you have init , fit, and predict implemented in

MySVM2. init (d,m) will initialize the parameters and will take the data dimensionality

d and mini-batch size m as input. You can add additional inputs such as the step size or

the convergence threshold. fit(X,y) will take the data features X and labels y and will

use mini-batch SGD to estimate the parameters w, w0. predict(X) will take a feature

matrix corresponding to the test set and return the predicted labels. Your class MySVM2()

will NOT inherit any base class in sklearn.

The wrapper code (main file) has no input and is used to prepare the datasets,

and make calls to my cross val(method,X,y,k) to generate the error rate results for

each dataset and each method. The code for my cross val(method,X,y,k) must be

yours (e.g., code you made in HW1 with modifications as needed) and you cannot use

cross val score() in sklearn. The results should be printed to terminal (not generating an additional file in the folder). Make sure the calls to my cross val(method,X,y,k)

1You should use LogisticRegression from sklearn, as we did for HW3. Note that Linear SVMs are implemented in sklearn as LinearSVC, but we will not use it since we have not discussed it in class. We will stick to

LogisticRegression for comparisons.

2Note that we are denoting the number of training points available for training in each fold as n, which is smaller

than the size of the full dataset.

3The exact value of n may differ mildly across the 5-folds since 506 cannot be exactly divided by 5. Your code for

HW3 is already doing these splits, so this aspect should not need additional effort. In the code, the m = n needs to

be passed as a special option (say, m = 106

or m =“all”) so the code knows it has use the the full dataset for that

fold.

are made in the following order and add a print to the terminal before each call to show

which method and dataset is being used:

i. MySVM2 with m = 40 for Boston50;

ii. MySVM2 with m = 200 for Boston50;

iii. MySVM2 with m = n for Boston50;

iv. LogisticRegression for Boston50;

v. MySVM2 with m = 40 for Boston75;

vi. MySVM2 with m = 200 for Boston75;

vii. MySVM2 with m = n for Boston75;

viii. LogisticRegression for Boston75.

*For the wrapper code, you need to make a q3.py file for it, and one should be able to

run your code by calling “python q3.py” in command line window.

(b) Summary of results: For each dataset and each method, report the test set error rates

for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation

of the error rates over the k folds. Make a table to present the results for each method

and each dataset (4 tables in total). Each column of the table represents a fold and add

two columns at the end to show the overall mean error rate and standard deviation over

the k folds. For example:

Error rates for MySVM2 with m = 40 for Boston50

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean SD

# # # # # # #

Additional instructions: Code can only be written in Python; no other programming languages

will be accepted. One should be able to execute your code for Q3 directly from command prompt

(e.g., “python q3.py”) without the need to run Python interactive shell first. Test your code

yourself before submission and suppress any warning messages that may be printed. Your code

must be run on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu). Please make sure you

specify the full Python version you are using as well as instructions on how to run your program

in the README file (must be readable through a text editor such as Notepad). Information on

the size of the datasets, including number of data points and dimensionality of features, as well

as number of classes can be readily extracted from the datasets in scikit-learn. Each function

must take the inputs in the order specified in the problem and display the output via the terminal

or as specified. For Q3, you can submit additional files/functions (as needed) which will be used

by the main file. Please put comments in your code so that one can follow the key parts and steps

in your code.

Extra Credit Problem:

EC1 (15 points) Consider Problem 2 with specific choices of the activation function g(a). We will

assume L(r

t

i

, yt

i

) = (r

t

i − y

t

i

)

2

.

(a) (5 points) Let g(u) = max(0, u). What is the gradient g

0

(a)? What does the update in

(1) look like with these specific choices of g(·) and L(·, ·)? Clearly explain your answer

and show details.

(b) (5 points) For some α ∈ [0, 1], let

g(a) = max(0, a) + α min(0, a) . (5)

What is the gradient g

0

(a) in terms of α? Clearly explain your answer and show details.

(c) (5 points) For the activation function in (5), is there a specific choice of α ∈ [0, 1] which

makes the two layer perceptron effectively a linear model, i.e., the predictions y

t

i

are a

linear function of the inputs x

t

j

? Clearly explain your answer and show details.

Follow the rules strictly. If we cannot run your code, you will not get any credit.

• Things to submit

1. hw4.pdf: A document which contains the solution to Problems 1, 2, 3 (and extra credit

problem) including the summary of results for Problem 3. This document must be in

PDF format (no word, photo, etc., is accepted). If you submit a scanned copy of a

hand-written document, make sure the copy is clearly readable, otherwise no credit may

be given.

2. Python code for Problem 3 (must include the required q3.py).

3. README.txt: README file that contains your name, student ID, email, instructions

on how to run your code, the full Python version (e.g., Python 3.5) you are using, any

assumptions you are making, and any other necessary details. The file must be readable

by a text editor such as Notepad.

4. Any other files, except the data, which are necessary for your code.