MATH5473 Homework 3. MLE and James-Stein Estimator


A Mathematical Introduction to Data Science
1. Maximum Likelihood Method: consider $n$ random samples from a multivariate normal distribution, $X_i \in \mathbb{R}^p$, $X_i \sim \mathcal{N}(\mu, \Sigma)$, $i = 1, \ldots, n$.
(a) Show the log-likelihood function
\[
\ell_n(\mu, \Sigma) = -\frac{n}{2}\,\mathrm{trace}(\Sigma^{-1} S_n) - \frac{n}{2}\log\det(\Sigma) + C,
\]
where $S_n = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)(X_i - \mu)^T$, and the constant $C$ does not depend on $\mu$ and $\Sigma$;
(b) Show that $f(X) = \mathrm{trace}(AX^{-1})$ with $A, X \succ 0$ has the first-order approximation
\[
f(X + \Delta) \approx f(X) - \mathrm{trace}(X^{-1} A X^{-1} \Delta),
\]
hence formally $df(X)/dX = -X^{-1} A X^{-1}$ (note $(I + X)^{-1} \approx I - X$);
(c) Show that $g(X) = \log\det(X)$ with $X \succ 0$ has the first-order approximation
\[
g(X + \Delta) \approx g(X) + \mathrm{trace}(X^{-1}\Delta),
\]
hence $dg(X)/dX = X^{-1}$ (note: consider the eigenvalues of $X^{-1/2}\Delta X^{-1/2}$);
(d) Use these formal derivatives with respect to positive semi-definite matrix variables to show that the maximum likelihood estimator of $\Sigma$ is
\[
\hat{\Sigma}^{\mathrm{MLE}}_n = S_n .
\]
A reference for (b) and (c) can be found in Convex Optimization by Boyd and Vandenberghe, examples in Appendix A.4.1 and A.4.3:
https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
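The approximations in (b) and (c) and the conclusion of (d) are easy to sanity-check numerically. Below is a minimal sketch, assuming NumPy; the dimension, sample size, seed, and perturbation scale are arbitrary illustrative choices, not part of the assignment.

\begin{verbatim}
# Numerical sanity check for (b)-(d): compare the first-order approximations of
# trace(A X^{-1}) and log det(X) with exact values, and check that Sigma = S_n
# maximizes the Gaussian log-likelihood against nearby alternatives.
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 500

def random_spd(p):
    """Draw a random symmetric positive definite p x p matrix."""
    B = rng.standard_normal((p, p))
    return B @ B.T + p * np.eye(p)

A, X = random_spd(p), random_spd(p)
Delta = 1e-4 * random_spd(p)                 # small symmetric perturbation
Xinv = np.linalg.inv(X)

# (b)  f(X) = trace(A X^{-1}):  f(X+Delta) - f(X)  vs  -trace(X^{-1} A X^{-1} Delta)
f = lambda M: np.trace(A @ np.linalg.inv(M))
print(f(X + Delta) - f(X), -np.trace(Xinv @ A @ Xinv @ Delta))

# (c)  g(X) = log det(X):  g(X+Delta) - g(X)  vs  trace(X^{-1} Delta)
g = lambda M: np.linalg.slogdet(M)[1]
print(g(X + Delta) - g(X), np.trace(Xinv @ Delta))

# (d)  log-likelihood (up to the constant C), with mu fixed at the sample mean
Xs = rng.multivariate_normal(np.zeros(p), random_spd(p), size=n)
mu_hat = Xs.mean(axis=0)
S_n = (Xs - mu_hat).T @ (Xs - mu_hat) / n

def loglik(Sigma):
    return -0.5 * n * np.trace(np.linalg.inv(Sigma) @ S_n) - 0.5 * n * g(Sigma)

# the likelihood at S_n should dominate nearby positive definite alternatives
for _ in range(5):
    assert loglik(S_n) >= loglik(S_n + 1e-2 * random_spd(p))
\end{verbatim}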
2. Shrinkage: Suppose $y \sim \mathcal{N}(\mu, I_p)$.
(a) Consider the Ridge regression
\[
\min_{\mu} \; \frac{1}{2}\|y - \mu\|_2^2 + \frac{\lambda}{2}\|\mu\|_2^2 .
\]
Show that the solution is given by
\[
\hat{\mu}^{\mathrm{ridge}}_i = \frac{1}{1 + \lambda}\, y_i .
\]
Compute the risk (mean square error) of this estimator. Note that this is a linear estimator $\hat{\mu} = Cy$ with $C = \frac{1}{1+\lambda} I$; the risk of the MLE is the case $C = I$, i.e. $\lambda = 0$.
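A minimal numerical sketch of part (a), assuming NumPy; the dimension, value of $\lambda$, trial count, and the particular $\mu$ are arbitrary illustrative choices. It checks the closed-form minimizer against random perturbations and estimates the ridge risk by Monte Carlo for comparison with the MLE risk $p$.

\begin{verbatim}
# Ridge sanity check: y/(1+lambda) should minimize the objective, and its
# Monte Carlo risk is compared with the MLE risk E||y - mu||^2 = p.
import numpy as np

rng = np.random.default_rng(1)
p, lam, trials = 10, 0.5, 20000
mu = rng.standard_normal(p)

def ridge_objective(m, y):
    return 0.5 * np.sum((y - m) ** 2) + 0.5 * lam * np.sum(m ** 2)

y = mu + rng.standard_normal(p)
mu_ridge = y / (1.0 + lam)
# the closed form should beat randomly perturbed candidates
assert all(ridge_objective(mu_ridge, y)
           <= ridge_objective(mu_ridge + 0.1 * rng.standard_normal(p), y)
           for _ in range(100))

# Monte Carlo risk of the ridge estimator vs the MLE risk p
Y = mu + rng.standard_normal((trials, p))
risk_ridge = np.mean(np.sum((Y / (1.0 + lam) - mu) ** 2, axis=1))
print(risk_ridge, p)
\end{verbatim}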
(b) Consider the LASSO problem
\[
\min_{\mu} \; \frac{1}{2}\|y - \mu\|_2^2 + \lambda\|\mu\|_1 .
\]
Show that the solution is given by soft-thresholding,
\[
\hat{\mu}^{\mathrm{soft}}_i = \mu^{\mathrm{soft}}(y_i;\lambda) := \mathrm{sign}(y_i)\,(|y_i| - \lambda)_+ .
\]
For the choice $\lambda = \sqrt{2\log p}$, show that the risk is bounded by
\[
\mathbb{E}\|\hat{\mu}^{\mathrm{soft}}(y) - \mu\|^2 \le 1 + (2\log p + 1)\sum_{i=1}^{p}\min(\mu_i^2, 1).
\]
Under what conditions on $\mu$ is this risk smaller than that of the MLE? Note: see Gaussian Estimation by Iain Johnstone, Lemma 2.9 and the reasoning before it.
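A hedged sketch of part (b), assuming NumPy; the value of $\lambda$, the grid resolution, the sparse $\mu$, and the trial count are illustrative choices. It verifies the soft-thresholding formula coordinate-wise by grid search and probes the stated risk bound by Monte Carlo.

\begin{verbatim}
# Soft-thresholding check and a Monte Carlo probe of the risk bound.
import numpy as np

rng = np.random.default_rng(2)

def soft(y, lam):
    """Soft-thresholding: sign(y) * (|y| - lam)_+, elementwise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# coordinate-wise: 1/2 (y - m)^2 + lam |m| is minimized at m = soft(y, lam)
lam = 1.3
for y_i in rng.standard_normal(20) * 3:
    grid = np.linspace(-10, 10, 20001)
    obj = 0.5 * (y_i - grid) ** 2 + lam * np.abs(grid)
    assert abs(grid[np.argmin(obj)] - soft(y_i, lam)) < 1e-3

# Monte Carlo probe of the bound with lam = sqrt(2 log p) and a sparse mu
p, trials = 200, 2000
mu = np.zeros(p); mu[:5] = 3.0                  # only 5 nonzero coordinates
lam = np.sqrt(2 * np.log(p))
Y = mu + rng.standard_normal((trials, p))
risk = np.mean(np.sum((soft(Y, lam) - mu) ** 2, axis=1))
bound = 1 + (2 * np.log(p) + 1) * np.sum(np.minimum(mu ** 2, 1))
print(risk, bound, p)       # here risk <= bound < p, the MLE risk
\end{verbatim}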
(c) Consider the $\ell_0$ regularization
\[
\min_{\mu} \; \|y - \mu\|_2^2 + \lambda^2\|\mu\|_0 ,
\]
where $\|\mu\|_0 := \sum_{i=1}^{p} I(\mu_i \neq 0)$. Show that the solution is given by hard-thresholding,
\[
\hat{\mu}^{\mathrm{hard}}_i = \mu^{\mathrm{hard}}(y_i;\lambda) := y_i\, I(|y_i| > \lambda).
\]
Rewriting $\hat{\mu}^{\mathrm{hard}}(y) = (1 - g(y))\,y$, is $g(y)$ weakly differentiable? Why?
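A small sketch of part (c), assuming NumPy and an illustrative $\lambda = 1$: the coordinate-wise $\ell_0$ problem $(y - m)^2 + \lambda^2 I(m \neq 0)$ has only the two candidates $m = 0$ (cost $y^2$) and $m = y$ (cost $\lambda^2$), which reproduces hard-thresholding.

\begin{verbatim}
# Hard-thresholding solves the coordinate-wise l0 problem.
import numpy as np

rng = np.random.default_rng(3)
lam = 1.0

def hard(y, lam):
    """Hard-thresholding: keep y where |y| > lam, zero otherwise."""
    return y * (np.abs(y) > lam)

for y_i in rng.standard_normal(20) * 2:
    cost_zero, cost_keep = y_i ** 2, lam ** 2   # objective at m = 0 and m = y_i
    best = 0.0 if cost_zero <= cost_keep else y_i
    assert np.isclose(best, hard(y_i, lam))

# g(y) = I(|y| <= lam) in mu_hard(y) = (1 - g(y)) y jumps at |y| = lam,
# which is what the weak-differentiability question is about.
\end{verbatim}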
(d) Consider the James-Stein estimator
\[
\hat{\mu}^{\mathrm{JS}}(y) = \left(1 - \frac{\alpha}{\|y\|^2}\right) y .
\]
Show that the risk is
\[
\mathbb{E}\|\hat{\mu}^{\mathrm{JS}}(y) - \mu\|^2 = \mathbb{E}\, U_\alpha(y),
\]
where $U_\alpha(y) = p - \bigl(2\alpha(p-2) - \alpha^2\bigr)/\|y\|^2$. Find the optimal $\alpha^* = \arg\min_\alpha U_\alpha(y)$. Show that for $p > 2$, the risk of the James-Stein estimator is smaller than that of the MLE for all $\mu \in \mathbb{R}^p$.
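A Monte Carlo sketch of part (d), assuming NumPy; the dimension, trial count, and the values of $\|\mu\|$ are illustrative choices, and the usual James-Stein amount $\alpha = p - 2$ is used. The simulated risk can be compared against the MLE risk $p$.

\begin{verbatim}
# James-Stein risk by simulation (alpha = p - 2) vs the MLE risk p.
import numpy as np

rng = np.random.default_rng(4)
p, trials = 10, 50000
alpha = p - 2

for mu_norm in [0.0, 2.0, 10.0]:
    mu = np.zeros(p); mu[0] = mu_norm
    Y = mu + rng.standard_normal((trials, p))
    shrink = 1.0 - alpha / np.sum(Y ** 2, axis=1, keepdims=True)
    risk_js = np.mean(np.sum((shrink * Y - mu) ** 2, axis=1))
    print(f"||mu|| = {mu_norm:5.1f}:  JS risk ~ {risk_js:6.3f},  MLE risk = {p}")
\end{verbatim}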
(e) In general, an odd, monotone, unbounded function $\Theta : \mathbb{R} \to \mathbb{R}$, written $\Theta_\lambda(t)$ with parameter $\lambda \ge 0$, is called a shrinkage rule if it satisfies
[shrinkage] $0 \le \Theta_\lambda(|t|) \le |t|$;
[odd] $\Theta_\lambda(-t) = -\Theta_\lambda(t)$;
[monotone] $\Theta_\lambda(t) \le \Theta_\lambda(t')$ for $t \le t'$;
[unbounded] $\lim_{t \to \infty} \Theta_\lambda(t) = \infty$.
Which of the rules above are shrinkage rules?
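The four defining properties can be probed numerically for any candidate rule. Below is a hedged helper, assuming NumPy; the grid range, resolution, and the example rules (soft- and hard-thresholding from (b) and (c)) are illustrative, and the "unbounded" check is only a finite-grid heuristic, so this does not settle the classification by itself.

\begin{verbatim}
# Numerically probe the shrinkage-rule properties on a grid.
import numpy as np

def probe_rule(theta, lam=1.0, tmax=50.0, n=10001):
    """Check the four properties of Theta_lambda on a symmetric grid."""
    t = np.linspace(-tmax, tmax, n)
    v = theta(t, lam)
    pos = t >= 0
    return {
        "shrinkage": bool(np.all((0 <= v[pos]) & (v[pos] <= t[pos]))),
        "odd":       bool(np.allclose(theta(-t, lam), -v)),
        "monotone":  bool(np.all(np.diff(v) >= -1e-12)),
        "unbounded (grid heuristic)": bool(v[-1] > 0.9 * (tmax - lam)),
    }

soft = lambda t, lam: np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)
hard = lambda t, lam: t * (np.abs(t) > lam)
print(probe_rule(soft))
print(probe_rule(hard))
\end{verbatim}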
3. Necessary Condition for Admissibility of Linear Estimators. Consider the linear estimator for $y \sim \mathcal{N}(\mu, \sigma^2 I_p)$,
\[
\hat{\mu}_C(y) = Cy .
\]
Show that $\hat{\mu}_C$ is admissible only if
(a) $C$ is symmetric;
(b) $0 \le \rho_i(C) \le 1$, where $\rho_i(C)$ are the eigenvalues of $C$;
(c) $\rho_i(C) = 1$ for at most two indices $i$.
These conditions are satisfied by the MLE when $p = 1$ and $p = 2$.
Reference: Theorem 2.3 in Gaussian Estimation by Iain Johnstone,
https://statweb.stanford.edu/~imj/Book100611.pdf
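A small numerical illustration of condition (b), assuming NumPy; the diagonal matrices, $\sigma$, the trial count, and the test values of $\mu$ are arbitrary illustrative choices. A diagonal $C$ with an eigenvalue above 1 is compared against the same matrix with that eigenvalue clipped to 1; the clipped estimator has smaller estimated risk at every $\mu$ tried, consistent with (b) being necessary for admissibility.

\begin{verbatim}
# A diagonal C with an eigenvalue > 1 is dominated by clipping it to 1.
import numpy as np

rng = np.random.default_rng(6)
p, sigma, trials = 3, 1.0, 100000
C_bad  = np.diag([1.5, 0.7, 0.3])       # eigenvalue 1.5 violates condition (b)
C_clip = np.diag([1.0, 0.7, 0.3])       # same matrix with 1.5 clipped to 1

for mu in [np.zeros(p), np.array([2.0, -1.0, 0.5]), 10 * rng.standard_normal(p)]:
    Y = mu + sigma * rng.standard_normal((trials, p))
    risk = lambda C: np.mean(np.sum((Y @ C.T - mu) ** 2, axis=1))
    print(mu.round(2), risk(C_bad), risk(C_clip))
\end{verbatim}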
4. *James-Stein Estimator for p = 1, 2 and upper bound:
If we use SURE to calculate the risk of the James-Stein estimator,
\[
R(\hat{\mu}^{\mathrm{JS}}, \mu) = \mathbb{E}\, U(Y) = p - \mathbb{E}_\mu \frac{(p-2)^2}{\|Y\|^2} < p = R(\hat{\mu}^{\mathrm{MLE}}, \mu),
\]
it seems that even for $p = 1$ the James-Stein estimator should have lower risk than the MLE for any $\mu$. Can you find out what actually happens in the $p = 1$ and $p = 2$ cases?
Moreover, can you derive the following upper bound on the risk of the James-Stein estimator?
\[
R(\hat{\mu}^{\mathrm{JS}}, \mu) \le p - \frac{(p-2)^2}{p - 2 + \|\mu\|^2} = 2 + \frac{(p-2)\|\mu\|^2}{p - 2 + \|\mu\|^2} .
\]
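A simulation sketch for this starred question, assuming NumPy; the value of $\|\mu\|$, the trial count, and the dimensions tried are illustrative choices. It estimates the James-Stein risk (with $\alpha = p - 2$) for small $p$ and, where it is defined, compares it with the upper bound above; note that for $p = 1$ the term $\mathbb{E}[1/\|Y\|^2]$ is infinite, so the simulated value blows up and is unstable across runs.

\begin{verbatim}
# James-Stein risk for p = 1, 2, 3, 10 vs MLE risk and the claimed upper bound.
import numpy as np

rng = np.random.default_rng(5)
trials = 200000
mu_norm = 1.0

for p in [1, 2, 3, 10]:
    mu = np.zeros(p); mu[0] = mu_norm
    Y = mu + rng.standard_normal((trials, p))
    shrink = 1.0 - (p - 2) / np.sum(Y ** 2, axis=1, keepdims=True)
    risk_js = np.mean(np.sum((shrink * Y - mu) ** 2, axis=1))
    # for p = 1, alpha = p - 2 = -1 and E[1/||Y||^2] = infinity: the estimate
    # is huge and unstable; for p = 2 the estimator coincides with the MLE.
    bound = p - (p - 2) ** 2 / (p - 2 + mu_norm ** 2) if p >= 3 else float("nan")
    print(f"p = {p:2d}: JS risk ~ {risk_js:.3f}, MLE risk = {p}, bound = {bound:.3f}")
\end{verbatim}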