## Description

Question 1 (Gradient Computation)

For a scalar-valued function $f : \mathbb{R}^d \to \mathbb{R}$, the gradient evaluated at $w \in \mathbb{R}^d$ is

$$\nabla f(w) = \left[ \frac{\partial f(w)}{\partial w_1} \;\; \cdots \;\; \frac{\partial f(w)}{\partial w_d} \right]^\top \in \mathbb{R}^d.$$

Using this definition, compute the gradients of the following functions, where $A \in \mathbb{R}^{d \times d}$ is not necessarily a symmetric matrix.

(i) $f(w) = w^\top A v + w^\top A^\top v + v^\top A w + v^\top A^\top w$, where $v \in \mathbb{R}^d$

(ii) $f(w) = w^\top A w$

Compute the gradients of the following functions using the above definition and the chain rule.

(iii) $f(w) = \sum_{i=1}^{d} \log(1 + \exp(w_i))$

(iv) $f(w) = \sqrt{1 + \|w\|_2^2}$

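A convenient way to sanity-check a hand-derived gradient is a central finite-difference approximation. The sketch below (in NumPy) checks candidate gradients for (ii) and (iii) — $(A + A^\top)w$ and the componentwise logistic function, which are the standard identities one expects the derivation to produce — against numerical differences; the helper `num_grad` is our own illustration, not part of the assignment.

```python
import numpy as np

def num_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))   # deliberately non-symmetric
w = rng.standard_normal(d)

# (ii) f(w) = w^T A w; the standard identity gives (A + A^T) w
f2 = lambda w: w @ A @ w
assert np.allclose(num_grad(f2, w), (A + A.T) @ w, atol=1e-4)

# (iii) f(w) = sum_i log(1 + exp(w_i)); componentwise logistic 1/(1+e^{-w_i})
f3 = lambda w: np.sum(np.log1p(np.exp(w)))
assert np.allclose(num_grad(f3, w), 1 / (1 + np.exp(-w)), atol=1e-4)
```

The same checker can be pointed at (i) and (iv) once you have derived closed forms for them.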

Homework Problems – Tutorial #3 Due: February 6, 2022 11:59 PM

Question 2 (Logistic Regression)

You are given a dataset $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, $d \ge 1$, and $y_n \in \{+1, -1\}$. For $w \in \mathbb{R}^{d+1}$ and $x \in \mathbb{R}^{d+1}$, we wish to train a logistic regression model

$$h(x) = \theta\Big(b + \sum_{i=1}^{d} w_i x_i\Big) = \theta(w^\top x), \tag{1}$$

where $\theta(z) = \dfrac{e^z}{1 + e^z}$, $z \in \mathbb{R}$, is the logistic function. Following the arguments on page 91 of LFD, the in-sample error can be written as

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \log \frac{1}{P_w(y_n \mid x_n)}, \tag{2}$$

where

$$P_w(y \mid x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases}. \tag{3}$$

(a) Show that $E_{\text{in}}(w)$ can be expressed as

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \left[ \llbracket y_n = +1 \rrbracket \log \frac{1}{h(x_n)} + \llbracket y_n = -1 \rrbracket \log \frac{1}{1 - h(x_n)} \right], \tag{4}$$

where $\llbracket \text{argument} \rrbracket$ evaluates to 1 if the argument is true and 0 if it is false.

(b) Show that $E_{\text{in}}(w)$ can also be expressed as

$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \log\left(1 + \exp(-y_n w^\top x_n)\right). \tag{5}$$
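Before doing the algebra, it can help to confirm numerically that the two forms of $E_{\text{in}}$ agree. A minimal sketch, using arbitrary random data and absorbing the bias into $w$ (so $x \in \mathbb{R}^d$ here rather than the augmented $\mathbb{R}^{d+1}$ of the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3
X = rng.standard_normal((N, d))          # one example per row
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d)

theta = lambda z: 1 / (1 + np.exp(-z))   # logistic function
h = theta(X @ w)

# Form (2): average of log(1 / P_w(y_n | x_n)), with P_w given case-by-case
Pw = np.where(y == 1, h, 1 - h)
E_form2 = np.mean(np.log(1 / Pw))

# Form (5): the margin-based expression log(1 + exp(-y_n w^T x_n))
E_form5 = np.mean(np.log1p(np.exp(-y * (X @ w))))

assert np.allclose(E_form2, E_form5)
```

The agreement rests on the identity $1 - \theta(z) = \theta(-z)$, which is also the key step in the written proof.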

(c) Use (5) to show that

$$\nabla E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} -y_n x_n \, \theta(-y_n w^\top x_n),$$

and argue that a "misclassified" example contributes more to the gradient than a correctly classified one.
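The per-example weight in the gradient is $\theta(-y_n w^\top x_n)$. For a misclassified point the margin $y_n w^\top x_n$ is negative, so this weight exceeds $1/2$; for a correctly classified point it is below $1/2$. A tiny illustration with hand-picked values (the vectors below are arbitrary, chosen only to make the signs obvious):

```python
import numpy as np

theta = lambda z: 1 / (1 + np.exp(-z))

w = np.array([1.0, -0.5])
x_correct = np.array([2.0, 1.0])    # w^T x = 1.5; with label y = +1, correct
x_wrong   = np.array([-1.0, 1.0])   # w^T x = -1.5; with label y = +1, misclassified
y = 1.0

# correctly classified: small weight on the gradient term
assert theta(-y * (w @ x_correct)) < 0.5
# misclassified: large weight on the gradient term
assert theta(-y * (w @ x_wrong)) > 0.5
```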

(d) Show that $\nabla E_{\text{in}}(w)$ can be expressed as

$$\nabla E_{\text{in}}(w) = \frac{1}{N} X^\top p, \tag{6}$$

for some expression $p$, where $X$ is the data matrix you are familiar with from linear regression. What is $p$, and how does it compare with the gradient of the in-sample error of linear regression?
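Whatever candidate $p$ you derive can be checked against the summation form from part (c). The sketch below does exactly that for one natural candidate (our guess to be verified, not the given answer): a vector whose $n$-th entry is $-y_n \theta(-y_n w^\top x_n)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 10, 3
X = rng.standard_normal((N, d))          # data matrix, one example per row
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d)

theta = lambda z: 1 / (1 + np.exp(-z))

# Summation form of the gradient from part (c)
grad_loop = np.mean(
    [-y[n] * X[n] * theta(-y[n] * (X[n] @ w)) for n in range(N)], axis=0
)

# Matrix form (1/N) X^T p with the candidate p
p = -y * theta(-y * (X @ w))
grad_mat = X.T @ p / N

assert np.allclose(grad_loop, grad_mat)
```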


Question 3 (Problem 4, Midterm 2017)

Consider the logistic regression setup as in the previous question. Suppose we are given a dataset $D = \{(x_1, y_1), (x_2, y_2)\}$ with

$$x_1 = \begin{bmatrix} 1 & 1 \end{bmatrix}^\top,\; y_1 = 1 \quad \text{and} \quad x_2 = \begin{bmatrix} 1 & 0 \end{bmatrix}^\top,\; y_2 = -1.$$

We consider the $\ell_2$-regularized error

$$E_{\text{in}}(w) = -\sum_{n=1}^{N} \log\left[P_w(y_n \mid x_n)\right] + \lambda \|w\|_2^2, \quad \lambda > 0, \tag{7}$$

where

$$P_w(y \mid x) = \begin{cases} h(x) & y = +1 \\ 1 - h(x) & y = -1 \end{cases}, \tag{8}$$

and $h(x) = \dfrac{e^{w^\top x}}{1 + e^{w^\top x}} = \dfrac{1}{1 + e^{-w^\top x}}$.

(a) For $\lambda = 0$, find the optimal $w$ that minimizes $E_{\text{in}}(w)$ and the minimum value of $E_{\text{in}}(w)$. (Hint: you are given $x_n$ and $y_n$, so plug those values into the expression for the in-sample error.)

(b) Suppose $\lambda$ is a very large constant, so that it suffices to consider weights satisfying $\|w\|_2 \ll 1$. Since $w$ has a small magnitude, we may use the Taylor series approximation

$$\log\left(1 + \exp(-y_n w^\top x_n)\right) \approx \log(2) - \frac{1}{2} y_n w^\top x_n. \tag{9}$$

Assuming the above approximation is exact, find the $w$ that minimizes $E_{\text{in}}(w)$ (it should be expressed in terms of $\lambda$).
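A quick numerical check of the first-order expansion in (9): at $z = 0$, $\log(1 + e^{-z}) = \log 2$ and its derivative is $-1/2$, so for small $z = y_n w^\top x_n$ the approximation error should shrink quadratically in $z$.

```python
import numpy as np

# First-order Taylor check: log(1 + e^{-z}) ≈ log(2) - z/2 for small z.
# The leading error term is z^2 / 8 (second derivative at 0 is 1/4),
# so 0.2 * z^2 is a safe bound for the |z| values tried here.
for z in [0.1, 0.01, -0.05]:
    exact = np.log1p(np.exp(-z))
    approx = np.log(2) - 0.5 * z
    assert abs(exact - approx) <= 0.2 * z**2 + 1e-12
```

This is why the approximation is only trusted in the $\|w\|_2 \ll 1$ regime that the large $\lambda$ enforces.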