## Description

**Problem 1 (10 points) Different class conditional probabilities.** Consider a classification problem with features in $\mathbb{R}^d$ and labels in $\{-1, +1\}$. Consider the class of linear classifiers of the form $(\bar{w}, 0)$, namely all the classifiers (hyperplanes) that pass through the origin (i.e., $t = 0$). Instead of logistic regression, suppose the class probabilities are given by the following function, where $\bar{X} \in \mathbb{R}^d$ are the features:

$$
P\left(y = +1 \mid \bar{X}, \bar{w}\right) = \frac{1}{2}\left(1 + \frac{\bar{w} \cdot \bar{X}}{\sqrt{1 + (\bar{w} \cdot \bar{X})^2}}\right), \tag{1}
$$

where $\bar{w} \cdot \bar{X}$ is the dot product between $\bar{w}$ and $\bar{X}$.

Suppose we obtain $n$ examples $(\bar{X}_i, y_i)$ for $i = 1, \ldots, n$.
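For concreteness, the class probability in (1) can be written out directly in code. This is only an illustrative sketch (NumPy assumed, function name chosen here), not part of what the problem asks for:

```python
import numpy as np

def prob_plus_one(w, X):
    """P(y = +1 | X, w) from equation (1):
    one half of (1 + (w . X) / sqrt(1 + (w . X)^2))."""
    s = np.dot(w, X)
    return 0.5 * (1.0 + s / np.sqrt(1.0 + s**2))

# w . X = 0 gives probability exactly 1/2, and the output always lies in (0, 1),
# so this is a valid class probability, playing the role of the logistic sigmoid.
```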

1. Show that the log-likelihood function is
$$
J(\bar{w}) = -n \log 2 + \sum_{i=1}^{n} \log\left(1 + \frac{y_i (\bar{w} \cdot \bar{X}_i)}{\sqrt{1 + (\bar{w} \cdot \bar{X}_i)^2}}\right). \tag{2}
$$

2. Compute the gradient of $J(\bar{w})$ and write one step of gradient ascent. Namely, fill in the blank:
$$
\bar{w}_{j+1} = \bar{w}_j + \eta \cdot \underline{\qquad\qquad}
$$
Hint: use the chain rule and $\nabla_{\bar{w}} \, (\bar{w} \cdot \bar{X}) = \bar{X}$.
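As a sanity check on parts 1 and 2 (without giving away the analytic gradient), the log-likelihood (2) can be implemented directly and its gradient estimated by finite differences. The data below is synthetic and purely illustrative:

```python
import numpy as np

def log_likelihood(w, X, y):
    """J(w) from equation (2); X has shape (n, d) and y entries are +/-1."""
    s = X @ w  # the n dot products w . X_i
    return -len(y) * np.log(2.0) + np.sum(np.log(1.0 + y * s / np.sqrt(1.0 + s**2)))

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite-difference estimate of the gradient of J at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (log_likelihood(w + e, X, y) - log_likelihood(w - e, X, y)) / (2 * eps)
    return g

# Synthetic, illustrative data: labels from a noiseless linear classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))

# One gradient-ascent step with step size eta; the analytic gradient derived
# in part 2 should agree with numerical_gradient to within roughly 1e-6.
w, eta = np.zeros(3), 0.1
w_next = w + eta * numerical_gradient(w, X, y)
```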

In Problem 2 and Problem 3, we will study linear regression. We will assume in both problems that $w^0 = 0$. (This can be arranged by translating the features and labels to have mean zero, but we will not worry about that here.) For $\bar{w} = (w^1, \ldots, w^d)$ and $\bar{X} = (\bar{X}^1, \ldots, \bar{X}^d)$, the regression we want is:

$$
y = w^1 \bar{X}^1 + \ldots + w^d \bar{X}^d = \bar{w} \cdot \bar{X}. \tag{3}
$$

We consider the following regularized least-squares objective, which is called Ridge Regression. For the dataset $S = \{(\bar{X}_1, y_1), \ldots, (\bar{X}_n, y_n)\}$,

$$
J(\bar{w}, \lambda) = \sum_{i=1}^{n} \left(y_i - \bar{w} \cdot \bar{X}_i\right)^2 + \lambda \cdot \|\bar{w}\|_2^2.
$$

**Problem 2 (10 points) Gradient Descent for regression.**

1. Instead of using the closed-form expression we mentioned in class, suppose we want to perform gradient descent to find the optimal solution for $J(\bar{w})$. Please compute the gradient of $J$, and write one step of gradient descent with step size $\eta$.

2. Suppose we get a new point $\bar{X}_{n+1}$. What will the predicted $y_{n+1}$ be when $\lambda \to \infty$?
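Both parts can be explored numerically. The sketch below (synthetic data, illustrative only) implements the ridge objective and runs gradient descent with a finite-difference gradient, so the analytic gradient asked for in part 1 is not spoiled; note the much smaller step size needed when $\lambda$ is huge:

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J(w, lambda) = sum_i (y_i - w . X_i)^2 + lambda * ||w||_2^2."""
    r = y - X @ w
    return np.sum(r**2) + lam * np.sum(w**2)

def gd_step(w, X, y, lam, eta, eps=1e-6):
    """One gradient-descent step, using a central finite-difference gradient
    in place of the analytic one derived in part 1."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (ridge_objective(w + e, X, y, lam)
                - ridge_objective(w - e, X, y, lam)) / (2 * eps)
    return w - eta * g

# Synthetic, illustrative data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=30)

def minimize(lam, steps=500, eta=0.005):
    w = np.zeros(2)
    for _ in range(steps):
        w = gd_step(w, X, y, lam, eta)
    return w

# A huge lambda requires a much smaller step size for the iteration to converge.
w_small = minimize(0.1)
w_large = minimize(1e6, eta=1e-7)
# As lambda grows, the minimizer is pushed toward the zero vector, which is
# what determines the prediction w . X_{n+1} for a new point in part 2.
```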

**Problem 3 (15 points) Regularization increases training error.** In class we said that when we regularize, we expect to get weight vectors with smaller norm, but we never proved it. We also displayed a plot showing that the training error increases as we regularize more (larger $\lambda$). In this assignment, we will formalize these intuitions rigorously.

Let $0 < \lambda_1 < \lambda_2$ be two regularizer values. Let $\bar{w}_1$ and $\bar{w}_2$ be the minimizers of $J(\bar{w}, \lambda_1)$ and $J(\bar{w}, \lambda_2)$, respectively.

1. Show that $\|\bar{w}_1\|_2^2 \geq \|\bar{w}_2\|_2^2$. Therefore more regularization implies a smaller-norm solution!

   Hint: Observe that $J(\bar{w}_1, \lambda_1) \leq J(\bar{w}_2, \lambda_1)$ and $J(\bar{w}_2, \lambda_2) \leq J(\bar{w}_1, \lambda_2)$ (why?).

2. Show that the training error for $\bar{w}_1$ is less than that of $\bar{w}_2$. In other words, show that
$$
\sum_{i=1}^{n} \left(y_i - \bar{w}_1 \cdot \bar{X}_i\right)^2 \leq \sum_{i=1}^{n} \left(y_i - \bar{w}_2 \cdot \bar{X}_i\right)^2.
$$
   Hint: Use the first part of the problem.
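Both claims can also be checked empirically with sklearn's `Ridge`, which calls $\lambda$ `alpha`; `fit_intercept=False` matches the $w^0 = 0$ assumption above. A minimal, illustrative check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic, illustrative data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=50)

lam1, lam2 = 0.1, 100.0  # 0 < lam1 < lam2
norms, errs = [], []
for lam in (lam1, lam2):
    w = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    norms.append(np.sum(w**2))            # ||w||_2^2
    errs.append(np.sum((y - X @ w)**2))   # squared training error

# Part 1 predicts norms[0] >= norms[1]; part 2 predicts errs[0] <= errs[1].
```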

**Problem 4 (25 points) Linear and Quadratic Regression.** Please refer to the Jupyter Notebook in the assignment and complete the coding part in it! You can use the sklearn Ridge regression package:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
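Since the notebook itself is not included here, the following is only a generic sketch of how linear and quadratic ridge regression might be set up with sklearn (`PolynomialFeatures` supplies the quadratic features); the data and names are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=(40, 1))
y = 1.5 * x[:, 0]**2 - x[:, 0] + 0.1 * rng.normal(size=40)

# Linear ridge regression on the raw feature.
linear = Ridge(alpha=1.0).fit(x, y)

# Quadratic ridge regression: expand to [1, x, x^2], then fit Ridge.
quadratic = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(x, y)

# R^2 on the training data; the quadratic model should fit this data far better.
print(linear.score(x, y), quadratic.score(x, y))
```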