# CSCI 5521 Homework 4 linear dimensionality solution

\$30.00

Category:

## Description

1. (30 points) Let X = {x
1
, . . . , x
N } with x
t ∈ R
D, t = 1, . . . , N be a given training set.
Assume that the dataset is centered, i.e., PN
t=1 x
t = 0 ∈ R
D. We focus on performing linear
dimensionality reduction on the dataset using PCA (principal component analysis). With
PCA, for each x
t ∈ R
D, we get z
t = Wx
t
, where z
t ∈ R
d
, d < D, is the low dimensional
projection, and W ∈ R
d×D is the PCA projection matrix. Let Σ = 1
N
PN
t=1 x
t
(x
t
)
T be the
sample covariance matrix. Further, let v
t = WT z
t
so that v
t ∈ R
D.
(a) (10 points) Professor HighLowHigh claims: v
t = x
t
for all t = 1, . . . , N. Is the claim
(b) (20 points) Professor HighLowHigh also claims:
X
N
t=1
kx
t
k
2
2 −
X
N
t=1
kv
t
k
2
2 =
X
N
t=1
kx
t − v
t
k
2
2
,
where for a vector a = [a1 · · · aD], kak
2
2 =
PD
k=1(ak)
2
. Is the claim correct? Clearly
2. (30 points) Let Z = {(x
1
, r
1
), . . . ,(x
N , r
N )}, x
t ∈ R
d
, r
t ∈ R
k be a set of N training samples.
We consider training a multilayer perceptron as shown in Figure 1. We consider a general
setting where the transfer functions at each stage are denoted by g, i.e.,
z
t
h = g(a
t
h
) = g

X
d
j=1
wh,jx
t
j + w0

 and y
t
i = g(a
t
i
) = g
X
H
h=1
vi,hz
t
h + vi0
!
,
where a
t
h
, at
i
respectively denote the input activation for hidden node h and output node i.
Further, let L(·, ·) be the loss function, so that the learning focuses on minimizing:
E(W, V |Z) = X
N
t=1
X
k
i=1
L(r
t
i
, yt
i
) .
(a) (10 points) Show that the stochastic gradient descent update for vi,h is of the form
v
new
i,h = v
old
i,h + ∆vi,h, with the update
∆vi,h = η∆t
i
z
t
h
, where ∆t
i = g
0
(a
t
i
)


∂L(r
t
i
, yt
i
)
∂yt
i

. (1)
(b) (20 points) Show that the stochastic gradient descent update for wh,j is of the form
w
new
h,j = w
old
h,j + ∆wh,j , with the update
∆wh,j = η∆t
hx
t
j
, where ∆t
h = g
0
(a
t
h
)
X
k
i=1
∆t
i
vi,h!
. (2)
Figure 1: Two layer perceptron.
Programming assignment:
The next problem involves programming. For Question 3, we will be using the 2-class classification datasets from Boston50 and Boston75. In particular, we will develop code for 2-class
Support Vector Machines (SVMs) using gradient descent. The goal will be to modify your code for
MyLogisticReg2 from HW3.
3. (40 points) We will develop code for 2-class SVMs with parameters (w, w0) where w ∈
R
d
, w0 ∈ R. Assume a given dataset {(x
t
, yt
), t = 1, . . . , N}, where x
t ∈ R
d and y
t ∈ {−1, 1}.
Recall from our discussion in class that training SVMs involves minimizing the following
objective function:
f(w, w0) = 1
n
Xn
i=1
max{0, 1 − y
t
(wT x
t + w0)} +
λ
2
kwk
2
. (3)
We will use λ = 5 in this assignment.
For reference, compare the objective function to that of regularized logistic regression which
you recently worked on as part of HW3:
f(w, w0) = 1
n
Xn
i=1
{−y
t
(wT x
t + w0) + log(1 + exp(wT x
t + w0))} +
λ
2
kwk
2
, (4)
where we had used λ = 0 for the HW3 code.
We will develop code for MySVM2 with corresponding MySVM2.fit(X,y) and MySVM2.predict(X)
functions. Parameters for the model can be initialized following what you had done for
MyLogisticReg2. In the fit function, the parameters will be estimated using mini-batch
stochastic gradient descent with different mini-batch sizes m ≤ n. In particular, you will
the logistic regression objective in (4). Further, you will have to add the mini-batch stochastic
gradient descent (SGD) functionality which, for a pre-specified mini-batch size m, picks m
unique points at random to do the gradient descent in each iteration. We will run experiments
with different values of m.
We will compare the performance of MySVM2 for different values of mini-batch size m with
LogisticRegression1 on two datasets: Boston50 and Boston75. Recall that Boston has
506 data points, and a 5-fold cross-validation leaves n ≈ 400 points for training in each fold.2
For mini-batch SGD, we will consider three different values of m:
(i) m = 40, which is ≈ 10% of the dataset in each fold for 5-fold cross-validation,
(ii) m = 200, which is ≈ 50% of the dataset in each fold for 5-fold cross-validation, and
(iii) m = n, which is the full dataset in each fold for 5-fold cross-validation.
Note that m = n uses the full dataset (available for that fold) in each iteration and hence
corresponds to the usual gradient descent.3
Using my cross val with 5-fold cross-validation, report the error rates in each fold as well as
the mean and standard deviation of error rates across all folds for the four methods: MySVM2
with m = 40, m = 200, and m = n, and LogisticRegression, applied to the two 2-class
classification datasets: Boston50 and Boston75.
You will have to submit (a) code and (b) summary of results:
(a) Code: You will have to submit code for MySVM2() as well as a wrapper code q3().
For developing MySVM2(), you are encouraged to consult the code for MyLogisticReg2()
from HW3. You need to make sure you have init , fit, and predict implemented in
MySVM2. init (d,m) will initialize the parameters and will take the data dimensionality
d and mini-batch size m as input. You can add additional inputs such as the step size or
the convergence threshold. fit(X,y) will take the data features X and labels y and will
use mini-batch SGD to estimate the parameters w, w0. predict(X) will take a feature
matrix corresponding to the test set and return the predicted labels. Your class MySVM2()
will NOT inherit any base class in sklearn.
The wrapper code (main file) has no input and is used to prepare the datasets,
and make calls to my cross val(method,X,y,k) to generate the error rate results for
each dataset and each method. The code for my cross val(method,X,y,k) must be
yours (e.g., code you made in HW1 with modifications as needed) and you cannot use
cross val score() in sklearn. The results should be printed to terminal (not generating an additional file in the folder). Make sure the calls to my cross val(method,X,y,k)
1You should use LogisticRegression from sklearn, as we did for HW3. Note that Linear SVMs are implemented in sklearn as LinearSVC, but we will not use it since we have not discussed it in class. We will stick to
LogisticRegression for comparisons.
2Note that we are denoting the number of training points available for training in each fold as n, which is smaller
than the size of the full dataset.
3The exact value of n may differ mildly across the 5-folds since 506 cannot be exactly divided by 5. Your code for
HW3 is already doing these splits, so this aspect should not need additional effort. In the code, the m = n needs to
be passed as a special option (say, m = 106
or m =“all”) so the code knows it has use the the full dataset for that
fold.
are made in the following order and add a print to the terminal before each call to show
which method and dataset is being used:
i. MySVM2 with m = 40 for Boston50;
ii. MySVM2 with m = 200 for Boston50;
iii. MySVM2 with m = n for Boston50;
iv. LogisticRegression for Boston50;
v. MySVM2 with m = 40 for Boston75;
vi. MySVM2 with m = 200 for Boston75;
vii. MySVM2 with m = n for Boston75;
viii. LogisticRegression for Boston75.
*For the wrapper code, you need to make a q3.py file for it, and one should be able to
run your code by calling “python q3.py” in command line window.
(b) Summary of results: For each dataset and each method, report the test set error rates
for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation
of the error rates over the k folds. Make a table to present the results for each method
and each dataset (4 tables in total). Each column of the table represents a fold and add
two columns at the end to show the overall mean error rate and standard deviation over
the k folds. For example:
Error rates for MySVM2 with m = 40 for Boston50
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean SD
# # # # # # #
Additional instructions: Code can only be written in Python; no other programming languages
will be accepted. One should be able to execute your code for Q3 directly from command prompt
(e.g., “python q3.py”) without the need to run Python interactive shell first. Test your code
yourself before submission and suppress any warning messages that may be printed. Your code
must be run on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu). Please make sure you
specify the full Python version you are using as well as instructions on how to run your program
the size of the datasets, including number of data points and dimensionality of features, as well
as number of classes can be readily extracted from the datasets in scikit-learn. Each function
must take the inputs in the order specified in the problem and display the output via the terminal
or as specified. For Q3, you can submit additional files/functions (as needed) which will be used
Extra Credit Problem:
EC1 (15 points) Consider Problem 2 with specific choices of the activation function g(a). We will
assume L(r
t
i
, yt
i
) = (r
t
i − y
t
i
)
2
.
(a) (5 points) Let g(u) = max(0, u). What is the gradient g
0
(a)? What does the update in
(1) look like with these specific choices of g(·) and L(·, ·)? Clearly explain your answer
and show details.
(b) (5 points) For some α ∈ [0, 1], let
g(a) = max(0, a) + α min(0, a) . (5)
0
(a) in terms of α? Clearly explain your answer and show details.
(c) (5 points) For the activation function in (5), is there a specific choice of α ∈ [0, 1] which
makes the two layer perceptron effectively a linear model, i.e., the predictions y
t
i
are a
linear function of the inputs x
t
j