## Description

## 1 Written Problems (50 points)

1.1. (Matrix Differential, 20 points) Deriving the derivatives of common functions is a fundamental skill in machine learning. Nine basic cases can be summarized as in Table 1. Nevertheless, only three of them (i.e., the cells shown with a gray background) are common in our course. In this problem, you will follow the steps below to derive the derivatives for these cases.

(1) Layout of Derivatives

Table 1: Vector/Matrix derivatives

| w.r.t. \ function | $f$ (scalar) | $\mathbf{f}$ (vector) | $F$ (matrix) |
| --- | --- | --- | --- |
| $w$ (scalar) | $\frac{df}{dw}$ | $\frac{d\mathbf{f}}{dw}$ | $\frac{dF}{dw}$ |
| $\mathbf{w}$ (vector) | $\frac{df}{d\mathbf{w}}$ | $\frac{d\mathbf{f}}{d\mathbf{w}}$ | $\frac{dF}{d\mathbf{w}}$ |
| $W$ (matrix) | $\frac{df}{dW}$ | $\frac{d\mathbf{f}}{dW}$ | $\frac{dF}{dW}$ |

- Differentiation of a scalar function w.r.t. a vector: If $f : \mathbb{R}^n \mapsto \mathbb{R}$ is a scalar function of $n$ variables and $\mathbf{w} \in \mathbb{R}^n$ is an $n \times 1$ vector, then differentiation of $f(\mathbf{w})$ w.r.t. $\mathbf{w}$ results in an $n \times 1$ vector:

$$\frac{df}{d\mathbf{w}} = \begin{pmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_n} \end{pmatrix}$$

- Differentiation of a scalar function w.r.t. a matrix: If $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$ is a scalar function of $mn$ variables and $W \in \mathbb{R}^{m \times n}$ is an $m \times n$ matrix, then differentiation of $f(W)$ w.r.t. $W$ results in an $m \times n$ matrix:

$$\frac{df}{dW} = \begin{pmatrix} \frac{\partial f}{\partial W_{11}} & \cdots & \frac{\partial f}{\partial W_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial W_{m1}} & \cdots & \frac{\partial f}{\partial W_{mn}} \end{pmatrix}$$

- Differentiation of a vector function w.r.t. a vector: If $\mathbf{f} : \mathbb{R}^n \mapsto \mathbb{R}^m$ is a vector function of size $m \times 1$ and $\mathbf{w}$ is an $n \times 1$ vector, then differentiation of $\mathbf{f}(\mathbf{w})$ w.r.t. $\mathbf{w}$ results in an $n \times m$ matrix:

$$\frac{d\mathbf{f}^T}{d\mathbf{w}} = \begin{pmatrix} \frac{\partial f_1}{\partial w_1} & \cdots & \frac{\partial f_m}{\partial w_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial w_n} & \cdots & \frac{\partial f_m}{\partial w_n} \end{pmatrix}$$

We also call this matrix the gradient matrix of $\mathbf{f}$ w.r.t. $\mathbf{w}$. The transpose of this matrix is exactly the Jacobian matrix $\frac{d\mathbf{f}}{d\mathbf{w}^T}$ (i.e., $\frac{d\mathbf{f}}{d\mathbf{w}}$ by convention):

$$\frac{d\mathbf{f}}{d\mathbf{w}^T} = \begin{pmatrix} \frac{\partial f_1}{\partial w_1} & \cdots & \frac{\partial f_1}{\partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial w_1} & \cdots & \frac{\partial f_m}{\partial w_n} \end{pmatrix}$$
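The two layouts can be checked numerically with central finite differences. A minimal sketch, where the elementwise-square function is just an illustrative choice (so the Jacobian is $\mathrm{diag}(2\mathbf{w})$):

```python
import numpy as np

def numerical_jacobian(f, w, eps=1e-6):
    """Jacobian layout df/dw^T: entry (i, j) is d f_i / d w_j."""
    m, n = f(w).size, w.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

w = np.array([1.0, -2.0, 0.5])
f = lambda w: w ** 2               # f: R^3 -> R^3, elementwise square

J = numerical_jacobian(f, w)       # Jacobian layout, m x n
G = J.T                            # gradient-matrix layout, n x m
assert np.allclose(J, np.diag(2 * w), atol=1e-6)
```

The only difference between the two conventions is this transpose; keeping track of which layout a formula uses is most of the work in the exercises below.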

(2) (10 points) Derivation by definition: Use the definition from multivariate calculus to derive the derivatives.

Example 1: For $f(\mathbf{w}) = \mathbf{a}^T \mathbf{w}$ with $\mathbf{w} \in \mathbb{R}^n$, we have $\frac{df}{d\mathbf{w}} = \mathbf{a}$ since, for each component $i$,

$$\left(\frac{df}{d\mathbf{w}}\right)_i = \frac{\partial \sum_{j=1}^n a_j w_j}{\partial w_i} = \frac{\partial\, a_i w_i}{\partial w_i} = a_i$$
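The claim in Example 1 can be sanity-checked numerically. A minimal sketch using NumPy central differences (the sizes and seed are arbitrary choices):

```python
import numpy as np

# Numerical check of Example 1: for f(w) = a^T w, df/dw = a.
rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal(n)
w = rng.standard_normal(n)
f = lambda w: a @ w

# Central differences approximate each partial derivative df/dw_i.
eps = 1e-6
grad = np.array([
    (f(w + eps * np.eye(n)[i]) - f(w - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])
assert np.allclose(grad, a, atol=1e-6)
```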

Derive the following derivatives by definition.

- (5 pts) $f(\mathbf{w}) = \mathbf{w}^T A \mathbf{w}$ with $\mathbf{w} \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$;
- (5 pts) $\mathbf{f}(\mathbf{w}) = A\mathbf{w}$ with $A \in \mathbb{R}^{m \times n}$ and $\mathbf{w} \in \mathbb{R}^n$.

(3) (Bonus task: 5 points) Derivation by differentiation: derive the derivatives based on the relation between the differential and the gradient:

$$f : \mathbb{R}^n \mapsto \mathbb{R} : \quad df = \sum_{i=1}^n \frac{\partial f}{\partial w_i}\, dw_i = \left(\frac{df}{d\mathbf{w}}\right)^T d\mathbf{w} = \mathrm{tr}\!\left(\left(\frac{df}{d\mathbf{w}}\right)^T d\mathbf{w}\right)$$

$$f : \mathbb{R}^{m \times n} \mapsto \mathbb{R} : \quad df = \sum_{i,j} \frac{\partial f}{\partial W_{ij}}\, dW_{ij} = \mathrm{tr}\!\left(\left(\frac{df}{dW}\right)^T dW\right)$$

Example 2: For $f(W) = \mathbf{a}^T W \mathbf{b}$ with $W \in \mathbb{R}^{m \times n}$, $\mathbf{a} \in \mathbb{R}^m$, and $\mathbf{b} \in \mathbb{R}^n$, we have $\frac{df}{dW} = \mathbf{a}\mathbf{b}^T$ since

$$df = d(\mathbf{a}^T W \mathbf{b}) = d(\mathbf{a}^T) W \mathbf{b} + \mathbf{a}^T d(W) \mathbf{b} + \mathbf{a}^T W d(\mathbf{b}) = \mathbf{a}^T d(W) \mathbf{b} = \mathrm{tr}(\mathbf{a}^T d(W) \mathbf{b}) = \mathrm{tr}(\mathbf{b}\mathbf{a}^T dW) = \mathrm{tr}((\mathbf{a}\mathbf{b}^T)^T dW)$$
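Example 2's result can likewise be verified with finite differences over the matrix entries. A short NumPy sketch (sizes and seed are arbitrary):

```python
import numpy as np

# Numerical check of Example 2: for f(W) = a^T W b, df/dW = a b^T.
rng = np.random.default_rng(1)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
W = rng.standard_normal((m, n))
f = lambda W: a @ W @ b

# Central differences over each entry W_ij.
eps = 1e-6
grad = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        grad[i, j] = (f(W + E) - f(W - E)) / (2 * eps)
assert np.allclose(grad, np.outer(a, b), atol=1e-6)
```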

Hint: Some useful formulas:

$$d(X + Y) = dX + dY \qquad d(XY) = (dX)Y + X\,d(Y) \qquad dX^T = (dX)^T$$

$$\mathrm{tr}(X^T) = \mathrm{tr}(X) \qquad \mathrm{tr}(XY) = \mathrm{tr}(YX) \qquad d(\mathrm{tr}(X)) = \mathrm{tr}(dX)$$

Derive the following derivatives by differentiation.

- (2 pts) $f(\mathbf{w}) = \mathbf{w}^T A \mathbf{w}$ with $A \in \mathbb{R}^{n \times n}$ and $\mathbf{w} \in \mathbb{R}^n$;
- (3 pts) $f(W) = \mathrm{tr}(W^T A W)$ with $A \in \mathbb{R}^{m \times m}$ and $W \in \mathbb{R}^{m \times n}$.

(4) (10 points) Derivation by chain rule: In machine learning, the input is commonly a feature vector while the loss is almost always a scalar objective function. Therefore, the chain rule is useful for gradient updates.

(a) (10 pts) Vector chain ending with a scalar: For $\mathbf{x} \to \mathbf{y}_1 \to \mathbf{y}_2 \to \cdots \to \mathbf{y}_n \to z$, the chain rule is

$$\frac{dz}{d\mathbf{x}^T} = \frac{dz}{d\mathbf{y}_n^T} \frac{\partial \mathbf{y}_n}{\partial \mathbf{y}_{n-1}^T} \cdots \frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1^T} \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}^T}$$

Assume $\ell = \|X\mathbf{w} - \mathbf{y}\|_2^2$ and let $\mathbf{z} = X\mathbf{w} - \mathbf{y}$; then $\ell = \|\mathbf{z}\|_2^2 = \mathbf{z}^T \mathbf{z}$. Consequently, there is a chain $\mathbf{w} \to \mathbf{z} \to \ell$. Following the chain rule, the derivative is $\frac{\partial \ell}{\partial \mathbf{w}^T} = \frac{\partial \ell}{\partial \mathbf{z}^T} \frac{\partial \mathbf{z}}{\partial \mathbf{w}^T}$, which implies that

$$\frac{\partial \ell}{\partial \mathbf{w}} = \left(\frac{\partial \mathbf{z}}{\partial \mathbf{w}^T}\right)^T \frac{\partial \ell}{\partial \mathbf{z}}$$

Determine the closed-form solution of $\frac{\partial \ell}{\partial \mathbf{w}}$.
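Before committing to a closed form, it can help to compare a candidate gradient against finite differences. A small checker sketch; `check_gradient` and `my_grad` are hypothetical helpers introduced here, not part of the assignment:

```python
import numpy as np

def check_gradient(f, grad, w, eps=1e-6, atol=1e-5):
    """Compare an analytic gradient against central finite differences."""
    w = np.asarray(w, dtype=float)
    num = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        num[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return np.allclose(grad(w), num, atol=atol)

# Usage on the least-squares loss l(w) = ||Xw - y||_2^2:
rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
loss = lambda w: float(np.sum((X @ w - y) ** 2))
# Plug in your derived closed form as `my_grad`, e.g.:
# my_grad = lambda w: ...
# assert check_gradient(loss, my_grad, rng.standard_normal(3))
```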

(b) (Bonus Task, 5 points) Matrix chain ending with a scalar: For $X \to Y \to \ell$, the chain rule is

$$\frac{\partial \ell}{\partial X_{ij}} = \sum_{k,l} \frac{\partial \ell}{\partial Y_{kl}} \frac{\partial Y_{kl}}{\partial X_{ij}}$$

Assume $\ell = f(Y)$ where $Y = AX + B$; derive the derivative $\frac{\partial \ell}{\partial X}$.

Hint: First,

$$\frac{\partial Y_{kl}}{\partial X_{ij}} = \frac{\partial \sum_s A_{ks} X_{sl}}{\partial X_{ij}} = \frac{\partial A_{ki} X_{il}}{\partial X_{ij}} = A_{ki}\,\delta_{lj}$$

where $\delta_{lj} = 1$ for $l = j$; otherwise, $\delta_{lj} = 0$. Second,

$$\frac{\partial \ell}{\partial X_{ij}} = \sum_{k,l} \frac{\partial \ell}{\partial Y_{kl}} A_{ki}\,\delta_{lj} = \sum_k \frac{\partial \ell}{\partial Y_{kj}} A_{ki} = (A^T)_{i,:} \left(\frac{\partial \ell}{\partial Y}\right)_{:,j}$$

Lastly, you can derive the form of $\frac{\partial \ell}{\partial X}$.
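Whatever matrix form you assemble from the hint's index computation can be sanity-checked with finite differences. This sketch uses the particular choice $\ell = \sum_{k,l} Y_{kl}^2$, for which $\partial\ell/\partial Y = 2Y$ is known; the asserted matrix form is the one the hint's second step suggests:

```python
import numpy as np

# Numerical sanity check for the matrix chain rule with Y = AX + B.
rng = np.random.default_rng(3)
m, p, n = 3, 4, 2
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X = rng.standard_normal((p, n))

loss = lambda X: np.sum((A @ X + B) ** 2)
dl_dY = 2 * (A @ X + B)                # dl/dY for this particular f

# Central differences over each entry X_ij.
eps = 1e-6
num = np.zeros_like(X)
for i in range(p):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = eps
        num[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)

assert np.allclose(num, A.T @ dl_dY, atol=1e-5)
```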

1.2. (Convexity of functions, 10 points) Prove that: (1) $f(x) = \mathrm{ReLU}(x) = (x)_+ = \max(0, x)$ is convex for $x \in \mathbb{R}$; (2) $f(x) = |x|$ is convex for $x \in \mathbb{R}$; (3) $f(\mathbf{x}) = \|A\mathbf{x} - \mathbf{b}\|_2^2$ is convex, where $A \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^n$.
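These claims are to be proved analytically. As a quick numerical sanity check (not a proof), one can test the defining inequality $f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$ at random points:

```python
import numpy as np

# Random spot-checks of the convexity inequality for the three functions.
rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

scalar_funcs = [
    lambda x: max(0.0, x),     # ReLU
    lambda x: abs(x),          # absolute value
]
for f in scalar_funcs:
    for _ in range(1000):
        x, y = rng.standard_normal(2) * 10
        t = rng.uniform()
        assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9

g = lambda x: float(np.sum((A @ x - b) ** 2))   # least-squares objective
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    t = rng.uniform()
    assert g(t * x + (1 - t) * y) <= t * g(x) + (1 - t) * g(y) + 1e-9
```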

1.3. (Gradient Descent, 10 points) Suppose we have training data $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $\mathbf{y}_i \in \mathbb{R}^k$, $i = 1, 2, \ldots, N$. (1) Find the closed-form solution of the following problem:

$$\min_{W, \mathbf{b}} \sum_{i=1}^N \alpha_i \|\mathbf{y}_i - W\mathbf{x}_i - \mathbf{b}\|_2^2,$$

where $\alpha_1, \alpha_2, \ldots, \alpha_N$ (the diagonal entries of the diagonal matrix $A$) are weights for the different samples; (2) Show how to use gradient descent to solve the problem.

Hint: You can use either the definition or the differentiation method to derive the derivatives. If you use the differentiation method, please note that

$$\sum_{i=1}^N \alpha_i \|\mathbf{y}_i - W\mathbf{x}_i - \mathbf{b}\|_2^2 = \mathrm{tr}[(Y - X\widetilde{W})^T A (Y - X\widetilde{W})]$$

where $Y = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N)^T \in \mathbb{R}^{N \times k}$, $X = [(\mathbf{x}_1^T, 1)^T, (\mathbf{x}_2^T, 1)^T, \ldots, (\mathbf{x}_N^T, 1)^T]^T \in \mathbb{R}^{N \times (d+1)}$, $\widetilde{W} = (W, \mathbf{b})^T \in \mathbb{R}^{(d+1) \times k}$ is the augmented parameter matrix, and $A = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_N)$.
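For part (2), a minimal gradient-descent sketch on the trace-form objective, with synthetic stand-in data. It assumes the standard weighted least-squares gradient $2X^T A(XW - Y)$ for the augmented parameter matrix; verify this against your own derivation from part (1) before relying on it:

```python
import numpy as np

# Gradient descent on tr[(Y - XW)^T A (Y - XW)], synthetic data throughout.
rng = np.random.default_rng(5)
N, d, k = 50, 4, 2
X = np.hstack([rng.standard_normal((N, d)), np.ones((N, 1))])  # bias column
Y = rng.standard_normal((N, k))
alpha = rng.uniform(0.5, 2.0, N)
A = np.diag(alpha)

def objective(W):
    R = Y - X @ W
    return float(np.trace(R.T @ A @ R))

W = np.zeros((d + 1, k))
lr = 1e-3
for _ in range(5000):
    # Assumed gradient of the trace objective; check it yourself first.
    W -= lr * (2 * X.T @ A @ (X @ W - Y))

# Gradient descent should reach the minimizer of the normal equations.
W_star = np.linalg.solve(X.T @ A @ X, X.T @ A @ Y)
assert np.allclose(W, W_star, atol=1e-4)
```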

1.4. (Maximum Likelihood Estimation, 10 points) Suppose $x_1, x_2, \ldots, x_N$ are drawn from $\mathcal{N}(\mu, \sigma^2)$. Show that the maximum likelihood estimate (MLE) of $\sigma^2$ is

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu}_{\mathrm{MLE}})^2.$$
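A quick numerical illustration (not a substitute for the derivation): over a grid of candidate values, the Gaussian log-likelihood peaks at the biased $1/N$ variance estimate:

```python
import numpy as np

# Grid search over sigma^2 to locate the maximum of the log-likelihood.
rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.5, size=200)

mu_hat = x.mean()                        # MLE of mu is the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # claimed MLE of sigma^2 (1/N, biased)

def log_likelihood(sigma2):
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
            - np.sum((x - mu_hat) ** 2) / (2 * sigma2))

grid = np.linspace(0.5 * sigma2_hat, 2.0 * sigma2_hat, 2001)
best = grid[np.argmax([log_likelihood(s) for s in grid])]
# The grid argmax lands within one grid step of the claimed estimate.
assert abs(best - sigma2_hat) < 2 * (grid[1] - grid[0])
```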

## 2 Programming (50 points)

2.1. (Polynomial regression, 20 points) In this exercise, we will try to fit a non-linear function

g with polynomial regression on the feasible space X = [0, 11]:

Unknown g(x) =?

Construct f(x) = Xn

i=0

αix

i ⇐⇒ f(x) = w

T x

′

, x′ =

1

x

x

2

.

.

.

x

n

, s.t. ∀x ∈ X, f(x) ≈ g(x)

Where n is the polynomial degree of freedom and is manually chosen.

Follow the instructions given in the jupyter notebook. At the end of the exercise, you will retrieve

an estimation of the desired function and make some comment on this method.
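The construction above can be sketched with NumPy's Vandermonde matrix. The function `g` and degree `n` below are placeholder choices, not the notebook's actual target:

```python
import numpy as np

# Polynomial feature map x -> (1, x, x^2, ..., x^n) plus a least-squares fit.
def poly_features(x, n):
    """Map a 1-D array of inputs to the (len(x), n+1) Vandermonde matrix."""
    return np.vander(x, N=n + 1, increasing=True)

g = lambda x: np.sin(x) + 0.1 * x      # placeholder for the unknown g
x = np.linspace(0, 11, 200)
n = 9                                  # degree, chosen manually
Phi = poly_features(x, n)
w, *_ = np.linalg.lstsq(Phi, g(x), rcond=None)

f = lambda x: poly_features(np.atleast_1d(x), n) @ w
# f should now roughly track g on [0, 11]
assert np.max(np.abs(f(x) - g(x))) < 0.5
```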

2.2. (Linear regression, 30 points) The CSV or XLS file contains a dataset for regression. There are 7,750 samples with 25 features (described in the doc file). The data are intended for bias correction of the next-day maximum and minimum air temperature forecasts of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea.

The dataset consists of summer data from 2013 to 2017. The input data are largely composed of the LDAPS model's next-day forecast data, the in-situ maximum and minimum temperatures of the present day, and geographic auxiliary variables. There are two outputs (i.e., the next-day maximum and minimum air temperatures). Hindcast validation was conducted for the period from 2015 to 2017.

You need to delete the first two attributes (station and date) and use attributes 3-23 to predict attributes 24 and 25. Randomly split the data into two parts, one containing 80% of the samples and the other containing 20%. Use the first part as training data, train a linear regression model, and make predictions on the second part. Report the training error and testing error in terms of RMSE.

Repeat the splitting, training, and testing ten times. Use a loop and print the RMSEs of each trial.

Note that you need to write the code for learning the parameters yourself. Do not use the classification or regression packages of Sklearn. You may use its tools to shuffle the data randomly for splitting.
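One possible skeleton for the experiment loop, with synthetic stand-ins for the real feature matrix `X` (attributes 3-23) and targets `Y` (attributes 24-25); the parameters are learned by solving the least-squares normal equations by hand, with no Sklearn estimator involved:

```python
import numpy as np

# Synthetic placeholders; replace with the loaded dataset in practice.
rng = np.random.default_rng(7)
N = 7750
X = rng.standard_normal((N, 21))
Y = X @ rng.standard_normal((21, 2)) + 0.1 * rng.standard_normal((N, 2))

def fit_linear(Xtr, Ytr):
    """Least-squares fit with a bias column, solved via normal equations."""
    Xb = np.hstack([Xtr, np.ones((len(Xtr), 1))])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ Ytr)

def rmse(Xe, Ye, W):
    Xb = np.hstack([Xe, np.ones((len(Xe), 1))])
    return float(np.sqrt(np.mean((Xb @ W - Ye) ** 2)))

for trial in range(10):
    idx = rng.permutation(N)             # random shuffle for the 80/20 split
    split = int(0.8 * N)
    tr, te = idx[:split], idx[split:]
    W = fit_linear(X[tr], Y[tr])
    print(f"trial {trial}: train RMSE {rmse(X[tr], Y[tr], W):.4f}, "
          f"test RMSE {rmse(X[te], Y[te], W):.4f}")
```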