Homework 4 COMS 4771 solution


Problem 1 (Linear regression; 20 points). In this problem, we'll consider an optimization formulation for linear regression. This is a supervised learning problem where the output space is $\mathcal{Y} = \mathbb{R}$ (rather than a discrete space like $\{0, 1\}$). Instead of classification error, the goal here is to learn a function $f \colon \mathbb{R}^d \to \mathbb{R}$ so as to minimize the expected squared error with respect to a distribution $P$ over $\mathbb{R}^d \times \mathbb{R}$: $\mathbb{E}[(f(X) - Y)^2]$ (the expectation is taken with respect to the random couple $(X, Y)$, which has distribution $P$). We consider the case where $f$ is a linear function represented by a weight vector $w \in \mathbb{R}^d$. Let $S$ be an i.i.d. sample from $P$; consider the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{|S|} \sum_{(x, y) \in S} \big( \langle w, x \rangle - y \big)^2 .$$
(Above, $\lambda > 0$ is assumed to be a positive scalar.)
(a) Prove that the Hessian of the objective function, at any point $w \in \mathbb{R}^d$, is positive definite.
Explain why this establishes that the optimization problem is convex.
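For reference, writing $J(w)$ for the objective above, a term-by-term differentiation gives
$$\nabla J(w) = \lambda w + \frac{2}{|S|} \sum_{(x, y) \in S} \big( \langle w, x \rangle - y \big)\, x, \qquad \nabla^2 J(w) = \lambda I + \frac{2}{|S|} \sum_{(x, y) \in S} x x^{\mathsf{T}} .$$
Since each $x x^{\mathsf{T}}$ is positive semidefinite and $\lambda I$ is positive definite for $\lambda > 0$, the Hessian is positive definite at every $w$; recall that a twice-differentiable function whose Hessian is positive semidefinite everywhere is convex.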
(b) Derive a gradient descent algorithm for solving the optimization problem (following the recipe
from lecture). Give concise and unambiguous pseudocode for the algorithm, and be explicit
about how to compute gradients. You may use vector addition, scaling, and inner products
as primitive operations (in addition to usual arithmetic operations). Furthermore, assume
the initial solution, step sizes, and number of iterations are provided as inputs.
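As a concrete illustration of the kind of algorithm part (b) asks for, here is a minimal NumPy sketch of gradient descent on the objective above, using the gradient $\nabla J(w)$ from the calculation in part (a). The function name, the matrix representation of $S$, and the argument names (`lam`, `etas`, `T`) are my own choices, not anything prescribed by the assignment.

```python
import numpy as np

def ridge_gd(X, y, lam, w0, etas, T):
    """Gradient descent for (lam/2)*||w||_2^2 + (1/|S|)*sum((<w, x> - y)^2).

    X    : (n, d) array whose rows are the feature vectors x in S
    y    : (n,) array of the corresponding responses
    lam  : the regularization parameter lambda > 0
    w0   : (d,) initial solution
    etas : sequence of T step sizes
    T    : number of iterations
    """
    n = X.shape[0]
    w = np.array(w0, dtype=float)
    for t in range(T):
        residuals = X @ w - y                        # <w, x_i> - y_i for each i
        grad = lam * w + (2.0 / n) * (X.T @ residuals)
        w = w - etas[t] * grad                       # gradient step
    return w
```

Each iteration is dominated by the two matrix-vector products, so it costs $O(nd)$ arithmetic operations.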
(c) Suppose we add the following constraints to the optimization problem:
$$w_i^2 \le 1 \quad \text{for all } i = 1, 2, \ldots, d .$$
Is the optimization problem still convex? Briefly explain why or why not.
(d) Same as (c), but with the following constraints instead (assuming $d$ is even):
$$w_{2i-1} = 1 - w_{2i} \quad \text{for all } i = 1, 2, \ldots, d/2 .$$
(Hint: can you write an equality constraint as a pair of inequality constraints?)
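To make the hint concrete: an equality constraint is equivalent to a pair of inequality constraints, for instance
$$w_{2i-1} = 1 - w_{2i} \quad\Longleftrightarrow\quad w_{2i-1} + w_{2i} \le 1 \ \text{ and } \ w_{2i-1} + w_{2i} \ge 1 .$$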
(e) Same as (c), but with the following constraints instead:
$$w_i^2 = 1 \quad \text{for all } i = 1, 2, \ldots, d .$$
Problem 2 (Logistic regression; 30 points). Let $S$ be a training data set from $\mathbb{R}^d \times \{0, 1\}$ for a binary classification problem. Consider the following optimization problem for computing the MLE of the logistic regression parameters:
$$\min_{\beta_0 \in \mathbb{R},\; \beta \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \ln\big(1 + \exp(\beta_0 + \langle \beta, x_i \rangle)\big) - y_i \big(\beta_0 + \langle \beta, x_i \rangle\big) \Big) .$$
(a) Derive a gradient descent algorithm for solving the optimization problem. Give concise and
unambiguous pseudocode for your algorithm, and be explicit about how to compute gradients.
You may use vector addition, scaling, and inner products as primitive operations (in addition
to usual arithmetic operations); and natural logarithm (ln) and exponential (exp) functions
as subroutines. Furthermore, assume the initial solution, step sizes, and number of iterations
are provided as inputs.
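For concreteness, here is a minimal NumPy sketch of the kind of algorithm part (a) asks for. It uses the standard gradient of the objective above, namely $\frac{\partial}{\partial \beta_0} = \frac{1}{n}\sum_i \big(\sigma(\beta_0 + \langle \beta, x_i \rangle) - y_i\big)$ and $\nabla_\beta = \frac{1}{n}\sum_i \big(\sigma(\beta_0 + \langle \beta, x_i \rangle) - y_i\big)\, x_i$ with $\sigma(z) = 1/(1 + e^{-z})$; the function and argument names are my own, not prescribed by the assignment.

```python
import numpy as np

def logistic_gd(X, y, beta0_init, beta_init, etas, T):
    """Gradient descent for (1/n)*sum(ln(1 + exp(b0 + <b, x_i>)) - y_i*(b0 + <b, x_i>))."""
    n = X.shape[0]
    beta0 = float(beta0_init)
    beta = np.array(beta_init, dtype=float)
    for t in range(T):
        z = beta0 + X @ beta                    # b0 + <b, x_i> for each i
        p = 1.0 / (1.0 + np.exp(-z))            # sigmoid(z_i) = d/dz ln(1 + exp(z_i))
        g0 = np.mean(p - y)                     # partial derivative w.r.t. beta0
        g = X.T @ (p - y) / n                   # gradient w.r.t. beta
        beta0 = beta0 - etas[t] * g0
        beta = beta - etas[t] * g
    return beta0, beta
```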
(b) Implement the gradient descent algorithm from part (a), except use $\beta_0 = 0$ and $\beta = 0$ as the initial solution, choose the step sizes using a line search, and use as many iterations as are required to achieve a prescribed objective value. Run your gradient descent code on the data set hw4data.mat from Courseworks (which has training feature vectors and labels stored as data and labels, respectively). Roughly how many iterations are needed to achieve an objective value that is at most 0.65064?[1]
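The assignment does not specify which line search to use; a backtracking (Armijo) line search is one common choice. Below is a sketch under that assumption: `objective` evaluates the Problem 2 objective (using `np.logaddexp` for numerical stability, an implementation choice of mine), and the starting step size, shrink factor, and sufficient-decrease constant are illustrative defaults rather than prescribed values.

```python
import numpy as np

def objective(beta0, beta, X, y):
    """The Problem 2 objective: (1/n) * sum(ln(1 + exp(z_i)) - y_i * z_i)."""
    z = beta0 + X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)   # logaddexp(0, z) = ln(1 + exp(z))

def backtracking_step_size(beta0, beta, g0, g, X, y, eta=1.0, shrink=0.5, c=1e-4):
    """Shrink eta until the Armijo sufficient-decrease condition holds for the step
    (beta0, beta) -> (beta0 - eta*g0, beta - eta*g), where (g0, g) is the gradient."""
    f = objective(beta0, beta, X, y)
    grad_sq = g0 * g0 + g @ g
    while objective(beta0 - eta * g0, beta - eta * g, X, y) > f - c * eta * grad_sq:
        eta *= shrink
    return eta
```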
(c) The feature vectors in the data set from hw4data.mat are three-dimensional, so they are relatively easy to inspect. Investigate the data by plotting it and/or computing some statistics about the features. Do you notice anything peculiar about the features? Use what you discover to design a simple invertible linear transformation of the feature vectors $x_i \mapsto A x_i$ such that running your gradient descent code with this transformed data set $\{(A x_i, y_i)\}_{i=1}^{n}$ reaches an objective value of 0.65064 in (many) fewer iterations.
Describe the steps you took in this investigation, and your reasoning behind them, as well as your chosen linear transformation (represented as a $3 \times 3$ matrix). State roughly how many iterations were required to achieve this stated objective value.
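As an illustration of the kind of investigation part (c) has in mind, the snippet below loads the data, prints per-feature summary statistics, and then forms a diagonal rescaling as one hypothetical choice of $A$. Whether a diagonal rescaling is the appropriate transformation depends entirely on what the statistics actually reveal about the features; this is a starting point, not the intended answer.

```python
import numpy as np
from scipy.io import loadmat

mat = loadmat('hw4data.mat')
X = np.asarray(mat['data'], dtype=float)               # (n, 3) feature vectors
y = np.asarray(mat['labels'], dtype=float).ravel()     # (n,) labels in {0, 1}

# Per-feature summary statistics: look for coordinates on very different scales.
print('min :', X.min(axis=0))
print('max :', X.max(axis=0))
print('mean:', X.mean(axis=0))
print('std :', X.std(axis=0))

# Hypothetical choice of A: rescale each coordinate by its standard deviation.
# This is one invertible 3x3 linear transformation of the required form x -> A x;
# whether it is the right choice depends on what the statistics above reveal.
s = X.std(axis=0)
A = np.diag(1.0 / s)
X_transformed = X @ A.T                                # rows are A x_i
```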
(d) Modify your gradient descent code in the following way.
• Use only the first $\lfloor 0.8n \rfloor$ examples to define the objective function; keep the remaining $n - \lfloor 0.8n \rfloor$ examples as a hold-out set.[2]
• The stopping condition is as follows. After every power-of-two iterations of gradient descent, record the hold-out error rate for the linear classifier based on the current $(\beta_0, \beta)$. If this hold-out error rate is more than 0.99 times that of the best hold-out error rate previously computed, and the number of iterations executed is at least 32 (which is somewhat of an arbitrary number), then stop.
Run this modified gradient descent code on the original hw4data.mat data, as well as the linearly transformed data (from part (c)). In each case, report: (1) the number of iterations executed, (2) the final objective value, and (3) the final hold-out error rate.
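Here is a minimal sketch of the modified training loop described in part (d). The helper `gradient_step` is a placeholder of mine standing in for one update of the gradient descent from parts (a)/(b); the linear classifier used for the hold-out error rate predicts label 1 exactly when $\beta_0 + \langle \beta, x \rangle > 0$, i.e., when the model's estimated probability exceeds 1/2.

```python
import numpy as np

def train_with_holdout(X, y, gradient_step, max_iters=2**20):
    """Gradient descent with the hold-out-based stopping rule described above (sketch).

    gradient_step(beta0, beta, X_train, y_train) -> (beta0, beta) is a placeholder
    for one update of the gradient descent from parts (a)/(b); max_iters is only
    a safety cap on the loop.
    """
    n = X.shape[0]
    m = int(np.floor(0.8 * n))                  # first floor(0.8 n) examples for training
    X_tr, y_tr = X[:m], y[:m]
    X_ho, y_ho = X[m:], y[m:]

    def holdout_error(beta0, beta):
        preds = (beta0 + X_ho @ beta > 0).astype(float)   # predict 1 iff probability > 1/2
        return np.mean(preds != y_ho)

    beta0, beta = 0.0, np.zeros(X.shape[1])
    best_err, err, t = np.inf, np.inf, 0
    while t < max_iters:
        beta0, beta = gradient_step(beta0, beta, X_tr, y_tr)
        t += 1
        if (t & (t - 1)) == 0:                  # t is a power of two
            err = holdout_error(beta0, beta)
            if err > 0.99 * best_err and t >= 32:
                break
            best_err = min(best_err, err)
    return beta0, beta, t, err
```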
Please also submit your gradient descent source code (both versions) in separate files.
[1] The actual minimum value is less than 0.65064.
[2] Normally you would not simply select the first $\lfloor 0.8n \rfloor$ examples, but rather pick a random subset of $\lfloor 0.8n \rfloor$ examples. But here, the order of the examples has already been randomized.