CSCI 5521 Homework 3 logistic regression solution


Original Work ?


5/5 - (4 votes)

1. (25 points) Let {(x1, y1), . . . ,(xn, yn)} be a given training set where xi ∈ R
, yi ∈ {0, 1}. We
consider the following regularized logistic regression objective function:
f(w) = 1

−yiwT xi + log(1 + exp(wT xi))
where λ > 0 is a constant. Let w∗ be the global minimizer of the objective, and let kw∗k2 ≤ c,
for some known constant c > 0.
(a) (10 points) Clearly show and explain the steps of the projected gradient descent algorithm for optimizing the regularized logistic regression objective function. The steps
should include an exact expression for the gradient.
(b) (5 points) Is the objective function strongly convex? Clearly explain your answer by
stating and using the definition of strong convexity.1
(c) (5 points) Is the objective function smooth? Clearly explain your answer by stating and
using the definition of smoothness.1
(d) (5 points) Let wT be the iterate after T steps of the projected gradient descent algorithm.
What is a bound on the difference f(wT ) − f(w∗
)? Clearly explain all quantities in the
bound (hint:please consider step size).
2. (25 points) Let X = {x1, . . . , xn}, xi ∈ R
d be a set of n samples drawn i.i.d. from a mixture of
k multivariate Gaussian distributions in R
. For component Gh, h = 1, . . . , k, let πh, µh, Σh
respectively denote the prior probability, mean, and covariance matrix of Gh. We will focus on
the expectation maximization (EM) algorithm for learning the mixture model, in particular
for estimating the parameters {(πh, µh, Σh), h = 1, . . . , k} as well as the posterior probabilities
(a) (10 points) In your own words, describe the EM algorithm for mixture of Gaussians,
highlighting the two key steps (E- and M-), illustrating the methods used in the steps
on a high level, and what information they need.
(b) (10 points) Assuming the posterior probabilities p(Gh|xi) are known, show the estimates
of the component prior, mean, and covariance πh, µh, Σh, h = 1, . . . , k given by the Mstep (you do not need to show the derivation).
(c) (5 points) Assuming the component prior, mean, and covariance πh, µh, Σh, h = 1, . . . , k
are known, show how the posterior probabilities p(Gh|xi) are computed in the E-step.
1Since the function is twice differentiable, you can answer the question by considering the second derivative.
Programming assignments:
The next two problems involve programming. For Question 3, we will be using the 2-class
classification datasets from Boston50 and Boston75, and for Question 4, we will be using the the
Digits dataset. For Q3, we will develop code for 2-class logistic regression with only one set of
parameters (w, w0). For Q4, we will develop code for PCA.
3. (25 points) We will develop code for 2-class logistic regression with one set of parameters
(w, w0) where w ∈ R
, w0 ∈ R. Assuming the two classes are {0, 1}, and the data x ∈ R
the posterior probability of class C1 is given by
P(1|x) = exp(wT x + w0)
1 + exp(wT x + w0)
and P(0|x) = 1 − P(1|x). f We will develop code for MyLogisticReg2 with corresponding,y) and MyLogisticReg2.predict(X) functions. Parameters for the
model can be initialized following suggestions in the textbook. In the fit function, the
parameters will be estimated using gradient descent as described in the textbook and in
We will compare the performance of MyLogisticReg2 with LogisticRegression2 on two
datasets: Boston50 and Boston75. Using my cross val with 5-fold cross-validation, report
the error rates in each fold as well as the mean and standard deviation of error rates across
all folds for the two methods: MyLogisticReg2 and LogisticRegression, applied to the two
2-class classification datasets: Boston50 and Boston75.
You will have to submit (a) code and (b) summary of results:
(a) Code: You will have to submit code for MyLogisticReg2() as well as a wrapper code
For MyLogisticReg2(), you are encouraged to consult the code for MultiGaussClassify()
from HW2. You need to make sure you have init , fit, and predict implemented
in MyLogisticReg2. init (d) will initialize the parameters (w, w0) and will take the
data dimensionality d as input. fit(X,y) will take the data features X and labels y
and will use gradient descent to estimate the parameters w, w0. Convergence of gradient descent can be determined by checking the difference between subsequent iterates
, wt−1
) and (wt
, wt
) and making sure the change is below a pre-specified (small)
threshold. You can also specify tmax, the maximum number of iterations gradient descent
can run, and set it to a suitably large value. predict(X) will take a feature matrix corresponding to the test set and return the predicted labels. Your class MyLogisticReg2()
will not inherit any base class in sklearn.
The wrapper code (main file has no input and is used to prepare the datasets,
and make calls to my cross val(method,X,y,k) to generate the error rate results for
each dataset and each method. The code for my cross val(method,X,y,k) must be
yours (e.g., code you wrote in HW1 with modifications as needed) and you cannot use
cross val score() in sklearn. The results should be printed to terminal (not generating an additional file in the folder). Make sure the calls to my cross val(method,X,y,k)
2You should use LogisticRegression from scikit-learn, similar to HW1 and HW2.
are made in the following order and add a print to the terminal before each call to show
which method and dataset is being used:
(1) MyLogisticReg2 with Boston50;
(2) MyLogisticReg2 with Boston75;
(3) LogisticRegression with Boston50;
(4) LogisticRegression with Boston75.
One should be able to run your wrapper code by “python” in command
line window.
(b) Summary of results: For each dataset and each method, report the test set error rates
for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation
of the error rates over the k folds. Make a table to present the results for each method
and each dataset (4 tables in total). Each column of the table represents a fold and add
two columns at the end to show the overall mean error rate and standard deviation over
the k folds. For example:
Error rates for MyLogisticReg2 with Boston50
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean SD
# # # # # # #
4. (25 points) We will modify the provided ipython notebook generate digits.ipynb to add
our own code for the parts where PCA is getting used. The notebook shows how to learn
and subsequently generate data which look like the data from the Digits dataset. Let the
data matrix be denoted as X ∈ R
D×n where D is the number of features and n is the number
of samples. Each column of X corresponds to a data point xj ∈ R
D. For Digits, we have
D = 64 and n ≈ 1800. The feature covariance matrix of the data can be computed as:
Σ =ˆ
(xi − µˆ)(xi − µˆ)
, (1)
where µˆ =
i=1 xi
is the mean of the data points. The notebook has the following steps:
(a) Compute a PCA projection Z ∈ R
, d ≤ D of the original data X ∈ R
so that α%
of the variance is preserved in the projected space. The notebook has details for α = 90,
where d = 21 was sufficient. The notebook also has details for α = 99, where d = 41
was sufficient. The projection can be represented by a matrix W ∈ R
In the homework, instead of using sklearn’s functionality for determining d and W, we
will develop our own function myPCA (see below) using numpy and scipy as needed. This
will be the only major change you will have to do in the notebook.
(b) Consider the dataset Z = {z1, . . . , zn} of n points in R
, and fit a mixture of Gaussians
(MoG) model where the number of mixture components is determined by a suitable
model selection technique. The notebook already implements this part, and we will not
change it.
(c) Sample a set of m new points Z˜ = {z˜1, . . . , z˜m}, z˜i ∈ R
from the MoG model in the
projected space. Let Z˜ ∈ R
d×m be the matrix corresponding to the sampled data Z˜,
where the columns are z˜i
. The notebook already implements this part with m = 100,
and we will not change it.
(d) Inverse transform the new points Z˜ in the d-dimensional space to the original Ddimensional space. The notebook currently uses sklearn’s functionality for doing the
inverse transformation. Instead we will use our own code to do the inverser transformation. In particular, the inverse transformation can be computed using W and µˆ which
we will get as outputs from myPCA (see below). The inverse transform is a simple linear
algebraic operation (say, 1-2 lines of code), so we will not write a separate function for
this, but just inline the code.
The main function we will write is myPCA(X,α), which takes in the original data X ∈ R
and the percentage α ∈ [0, 100] of variance to be preserved, and returns three things as a
(i) the PCA projection dimensionality d which is sufficient to preserve α% of the original
variance in the projected space,
(ii) the projection matrix W ∈ R
d×D, and
(iii) the estimated mean ˆµ ∈ R
D of the original data.
Add myPCA as a function in the notebook and update the rest of the notebook to use myPCA
instead of sklearn’s PCA. Please make sure the current functionality in the notebook is
maintained. In particular, your notebook should not import PCA but still maintain the same
functionality. You will have to submit your notebook my generate digits.ipynb.
Additional instructions: Code can only be written in Python; no other programming languages
will be accepted. One should be able to execute your code for Q3 directly from command prompt
(e.g., “python”) without the need to run Python interactive shell first. Test your code
yourself before submission and suppress any warning messages that may be printed. Your code
must be run on a CSE lab machine (e.g., Please make sure you
specify the full Python version you are using as well as instructions on how to run your program
in the README file (must be readable through a text editor such as Notepad). Information on
the size of the datasets, including number of data points and dimensionality of features, as well as
number of classes can be readily extracted from the datasets in scikit-learn. Each function must
take the inputs in the order specified in the problem and display the output via the terminal or as
specified. For Q3, you can submit additional files/functions (as needed) which will be used by the
main file. Please put comments in your code so that one can follow the key parts and steps in your
code. Code for Q4 will be a modification of the provided notebook generate digits.ipynb.
Follow the rules strictly. If we cannot run your code, you will not get any credit.
• Things to submit
1. hw3.pdf: A document which contains the solution to Problems 1, 2, 3 and 4 including
the summary of results for 3 and 4. This document must be in PDF format (no word,
photo, etc., is accepted). If you submit a scanned copy of a hand-written document,
make sure the copy is clearly readable, otherwise no credit may be given.
2. Python code for Problems 3 and 4 (must include the required and
my generate digits.ipynb ).
3. README.txt: README file that contains your name, student ID, email, instructions
on how to run your code, the full Python version (e.g., Python 3.5) your are using, any
assumptions you are making, and any other necessary details. The file must be readable
by a text editor such as Notepad.
4. Any other files, except the data, which are necessary for your code.