Description
1. The probability density function of the normal distribution is defined as
\[
f(x) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right),
\]
where
\[
Z = \int_{x \in \mathbb{R}^d} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right) dx = (2\pi)^{d/2} |\Sigma|^{1/2},
\]
and $|\Sigma|$ is the determinant of the covariance matrix. Let us assume that the covariance matrix $\Sigma$ is a diagonal matrix, as below:
\[
\Sigma =
\begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_d^2
\end{pmatrix}.
\]
The probability density function then simplifies to
\[
f(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{1}{2} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right).
\]
Show that this is indeed true.
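A quick numerical sanity check of this factorization (not a substitute for the derivation) is sketched below, assuming NumPy and SciPy are available; the dimensionality, mean, and standard deviations are arbitrary illustrative choices.

```python
# Sanity check (not a proof): for a diagonal covariance matrix, the multivariate
# normal density should equal the product of the d univariate normal densities.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
d = 4                                    # illustrative dimensionality
mu = rng.normal(size=d)                  # arbitrary mean vector
sigmas = rng.uniform(0.5, 2.0, size=d)   # arbitrary standard deviations
Sigma = np.diag(sigmas**2)               # diagonal covariance matrix

x = rng.normal(size=d)                   # an arbitrary test point

# Left-hand side: full multivariate density with covariance Sigma.
lhs = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Right-hand side: product of d independent univariate densities.
rhs = np.prod(norm(loc=mu, scale=sigmas).pdf(x))

print(lhs, rhs)                          # the two values should agree
assert np.isclose(lhs, rhs)
```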
2. (a) Show that the following equation, called Bayes' rule, is true:
\[
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}.
\]
(b) We learned the definition of expectation:
\[
E[X] = \sum_{x \in \Omega} x\, p(x).
\]
Assuming that $X$ and $Y$ are discrete random variables, show that
\[
E[X + Y] = E[X] + E[Y].
\]
(c) Further assuming that $c \in \mathbb{R}$ is a scalar and not a random variable, show that $E[cX] = c\,E[X]$.
(d) We learned the definition of variance:
\[
\mathrm{Var}(X) = \sum_{x \in \Omega} (x - E[X])^2\, p(x).
\]
Assuming that $X$ is a discrete random variable, show that $\mathrm{Var}(X) = E[X^2] - (E[X])^2$.
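The identities in (a)-(d) can also be checked numerically on a small discrete joint distribution; the sketch below (assuming NumPy, with illustrative supports and an arbitrary joint pmf) is only a sanity check, not the requested proofs.

```python
# Sanity check (not a proof) of Bayes' rule, linearity of expectation,
# and Var(X) = E[X^2] - (E[X])^2 for an arbitrary discrete joint pmf.
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([0.0, 1.0, 2.0])       # support of X (illustrative)
ys = np.array([-1.0, 0.0, 3.0])      # support of Y (illustrative)

p_xy = rng.random((len(xs), len(ys)))
p_xy /= p_xy.sum()                   # arbitrary joint pmf p(X = x, Y = y)
p_x = p_xy.sum(axis=1)               # marginal p(X)
p_y = p_xy.sum(axis=0)               # marginal p(Y)

# (a) Bayes' rule: p(Y|X) = p(X|Y) p(Y) / p(X)
p_y_given_x = p_xy / p_x[:, None]
p_x_given_y = p_xy / p_y[None, :]
assert np.allclose(p_y_given_x, p_x_given_y * p_y[None, :] / p_x[:, None])

# (b), (c) Linearity of expectation: E[X + Y] = E[X] + E[Y], E[cX] = c E[X]
E_x = (xs * p_x).sum()
E_y = (ys * p_y).sum()
E_x_plus_y = ((xs[:, None] + ys[None, :]) * p_xy).sum()
assert np.isclose(E_x_plus_y, E_x + E_y)
c = 2.5
assert np.isclose((c * xs * p_x).sum(), c * E_x)

# (d) Variance identity: Var(X) = E[X^2] - (E[X])^2
var_x = ((xs - E_x) ** 2 * p_x).sum()
assert np.isclose(var_x, (xs**2 * p_x).sum() - E_x**2)

print("all identities hold numerically")
```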
3. An optimal linear regression machine (without any regularization term) that minimizes the empirical cost function given a training set $D_{\mathrm{tra}} = \{(x_1, y_1^*), \ldots, (x_N, y_N^*)\}$ can be found directly, without any gradient-based optimization algorithm. Assuming that the distance function is defined as
\[
D(M^*(x), M, x) = \frac{1}{2} \left\| M^*(x) - M(x) \right\|_2^2 = \frac{1}{2} \sum_{k=1}^{q} (y_k^* - y_k)^2,
\]
derive the optimal weight matrix $W$. (Hint: Moore–Penrose pseudoinverse.)
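As a companion to the derivation, the sketch below (assuming NumPy, a plain linear model without a bias term, and illustrative array shapes) shows a closed-form least-squares solution of the kind the hint points to, computed with the Moore–Penrose pseudoinverse and cross-checked against a standard least-squares solver.

```python
# Sketch (not the requested derivation): the least-squares weight matrix can be
# obtained in closed form with the Moore-Penrose pseudoinverse, with no
# gradient-based optimization. Shapes and the lack of a bias term are illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, d, q = 100, 5, 3                      # examples, input dim, output dim (illustrative)
X = rng.normal(size=(N, d))              # design matrix, one example per row
W_true = rng.normal(size=(d, q))         # ground-truth linear map
Y = X @ W_true + 0.01 * rng.normal(size=(N, q))   # noisy targets

# Closed-form least-squares solution: W = X^+ Y, with X^+ the pseudoinverse of X.
W_hat = np.linalg.pinv(X) @ Y

# Cross-check against NumPy's built-in least-squares solver.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(W_hat, W_lstsq)
print(np.max(np.abs(W_hat - W_true)))    # should be small for low noise
```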
4. Suppose that we have a data distribution $Y = f(X) + \varepsilon$, where $X$ is a random vector, $\varepsilon$ is an independent random variable with zero mean and fixed but unknown variance $\sigma^2$, and $f$ is an unknown deterministic function that maps a vector to a scalar. Now, we wish to approximate $f(x)$ with our own model $\hat{f}(x; \Theta)$ with some learnable parameters $\Theta$.
(a) Show that, considering all possible $\hat{f}$ and $\Theta$, the minimum of the L2 loss $E_X[(Y - \hat{f}(X; \Theta))^2]$ is achieved when $\hat{f}(x; \Theta) = f(x)$ for all $x$.
(Hint: find the minimum of the L2 loss for a single example first.)
(b) If we train the same model with varying initializations and examples drawn from the underlying data distribution, we may end up with different $\Theta$, so we can also consider $\Theta$ as a random variable if we fix $\hat{f}$.
Show that for a single unseen input vector $x_0$ and a fixed $\hat{f}$, the expected squared error between the ground truth $y_0 = f(x_0) + \varepsilon$ and the prediction $\hat{f}(x_0; \Theta)$ can be decomposed as
\[
E[(y_0 - \hat{f}(x_0; \Theta))^2] = \left( E[f(x_0) - \hat{f}(x_0; \Theta)] \right)^2 + \mathrm{Var}[\hat{f}(x_0; \Theta)] + \sigma^2.
\]
(Side note: this is usually known as the bias-variance decomposition, closely related to the bias-variance tradeoff and to other concepts such as underfitting and overfitting.)
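A Monte Carlo illustration of the decomposition in (b) is sketched below (assuming NumPy; the data-generating function, noise level, training-set size, and the deliberately simple degree-1 polynomial model are all illustrative choices). Averaged over many re-trainings and noise draws, the left-hand side should match the sum of the squared bias, the variance, and the noise variance up to Monte Carlo error.

```python
# Monte Carlo illustration (not the requested derivation) of the bias-variance
# decomposition at a single test input x0. f, sigma, and the model class are
# illustrative choices, not anything fixed by the problem statement.
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                        # "unknown" deterministic function (illustrative)
sigma = 0.3                       # noise standard deviation
x0 = 1.2                          # the single unseen input
n_train, n_trials = 30, 20000

preds = np.empty(n_trials)        # f_hat(x0; Theta) across re-trainings
errs = np.empty(n_trials)         # (y0 - f_hat(x0; Theta))^2 across trials
for t in range(n_trials):
    # Fresh training set drawn from Y = f(X) + eps.
    x = rng.uniform(-2.0, 2.0, size=n_train)
    y = f(x) + sigma * rng.normal(size=n_train)
    # "Train": fit a straight line (a deliberately biased model class).
    w1, w0 = np.polyfit(x, y, deg=1)
    preds[t] = w1 * x0 + w0
    # Fresh noisy ground truth at x0.
    y0 = f(x0) + sigma * rng.normal()
    errs[t] = (y0 - preds[t]) ** 2

lhs = errs.mean()                              # E[(y0 - f_hat(x0; Theta))^2]
bias_sq = (f(x0) - preds.mean()) ** 2          # (E[f(x0) - f_hat(x0; Theta)])^2
var = preds.var()                              # Var[f_hat(x0; Theta)]
print(lhs, bias_sq + var + sigma**2)           # should agree up to MC error
```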