CSE/ISYE 6740 Homework 4 Kernels solution

1 Kernels [20 points]
(a) Identify which of the following is a valid kernel. If it is a kernel, write your answer explicitly as 'True' and give a mathematical proof. If it is not a kernel, write your answer explicitly as 'False' and give an explanation. [6 pts]
Suppose $k_1$ and $k_2$ are valid kernels (symmetric and positive definite) defined on $\mathbb{R}^m \times \mathbb{R}^m$.
1. $K(u, v) = k_1(u, v) - k_2(u, v)$, for $k_1, k_2$ valid kernels
2. $K(u, v) = k_1(u, v)\,k_2(u, v)$, for $k_1, k_2$ valid kernels
3. $K(u, v) = \exp\!\big(\gamma \|u - v\|^2\big)$, for some $\gamma > 0$
¹Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
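A quick way to build intuition for part (a): a valid kernel must produce a positive semidefinite Gram matrix for every finite set of points, so a single negative eigenvalue on any point set disproves validity. The sketch below is not a proof, just a numerical sanity check on a handful of arbitrary points; the sample points and the choice $\gamma = 1$ are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Not a proof, just a sanity check: one negative Gram eigenvalue
# on ANY point set disproves kernel validity.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five arbitrary points in R^3

def gram(k, X):
    """Gram matrix K[i, j] = k(x_i, x_j)."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

k1 = lambda u, v: u @ v            # linear kernel (valid)
k2 = lambda u, v: 2.0 * (u @ v)    # scaled linear kernel (valid)
gamma = 1.0                        # illustrative choice, gamma > 0

candidates = {
    "k1 - k2 (difference)":  lambda u, v: k1(u, v) - k2(u, v),
    "k1 * k2 (product)":     lambda u, v: k1(u, v) * k2(u, v),
    "exp(+gamma||u-v||^2)":  lambda u, v: np.exp(gamma * np.sum((u - v) ** 2)),
}

for name, k in candidates.items():
    min_eig = np.linalg.eigvalsh(gram(k, X)).min()
    # >= 0 (up to floating-point error) is consistent with validity;
    # < 0 proves the candidate is not a kernel.
    print(f"{name}: smallest Gram eigenvalue = {min_eig:.4f}")
```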
(b) By considering the determinant of a 2 × 2 Gram matrix, show that a positive-definite kernel function $k(u, v)$ satisfies the Cauchy-Schwarz inequality. [4 pts]
In the 2 × 2 case, the Gram matrix is
$$K = \begin{pmatrix} k(u, u) & k(u, v) \\ k(v, u) & k(v, v) \end{pmatrix}$$
and the Cauchy-Schwarz inequality is
$$k(u, v)^2 \le k(u, u)\,k(v, v).$$
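As a sketch of the intended argument (the grader will expect the steps spelled out):

```latex
% A PSD 2x2 matrix has nonnegative eigenvalues, so det K >= 0:
\[
\det K \,=\, k(u,u)\,k(v,v) - k(u,v)\,k(v,u) \,\ge\, 0 ,
\]
% and symmetry of the kernel, k(v,u) = k(u,v), then gives
\[
k(u,v)^2 \,\le\, k(u,u)\,k(v,v).
\]
```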
(c) Gaussian Kernel [10 pts]
Given the Gaussian kernel
$$k(u, v) = \exp\!\big(-\|u - v\|^2 / 2\sigma^2\big),$$
by making use of the expansion
$$k(u, v) = \exp\!\big(-u^{\top}u / 2\sigma^2\big)\,\exp\!\big(u^{\top}v / \sigma^2\big)\,\exp\!\big(-v^{\top}v / 2\sigma^2\big)$$
and then expanding the middle factor as a power series, show that the Gaussian kernel can be expressed as the inner product of an infinite-dimensional feature vector.
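The key step is the power series itself; the following sketch shows the expansion the problem points to, with the conclusion left as an exercise:

```latex
% The middle factor expands as a power series in the inner product:
\[
\exp\!\left(\frac{u^{\top}v}{\sigma^{2}}\right)
   \,=\, \sum_{k=0}^{\infty} \frac{\big(u^{\top}v\big)^{k}}{\sigma^{2k}\, k!} .
\]
% Each term (u^T v)^k is an inner product of degree-k monomial feature
% maps, so stacking all degrees (with the two Gaussian prefactors folded
% in) yields k(u,v) = <phi(u), phi(v)> for an infinite-dimensional phi.
```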
2 Markov Random Field [20 pts]
TA: Pranathi B. Suresha
(a) Given an undirected graph G over variables X = {A, B, C, D} as given in the image below, calculate the joint probability Pr(A, B, C, D) as a normalized product of factors. Show both the unnormalized and the normalized product of factors in your solution. [10 pts]
Hint:
$$\Pr(A = a, B = b, C = c, D = d) = \frac{\phi_1(a, b)\,\phi_2(b, c)\,\phi_3(c, d)\,\phi_4(d, a)}{\sum_{a'} \sum_{b'} \sum_{c'} \sum_{d'} \phi_1(a', b')\,\phi_2(b', c')\,\phi_3(c', d')\,\phi_4(d', a')} \qquad (1)$$
The factors ϕ1(A, B), ϕ2(B, C), ϕ3(C, D) and ϕ4(D, A) are as given below:
A    B    φ1(A, B)
a0   b0   5
a0   b1   10
a1   b0   15
a1   b1   20

B    C    φ2(B, C)
b0   c0   30
b0   c1   40
b1   c0   50
b1   c1   60

C    D    φ3(C, D)
c0   d0   25
c0   d1   50
c1   d0   75
c1   d1   100

D    A    φ4(D, A)
d0   a0   150
d0   a1   200
d1   a0   250
d1   a1   300
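Since all four variables are binary, the normalization in eq. (1) can be checked by brute-force enumeration of the 2⁴ assignments. A minimal sketch using the factor tables above (the dictionary encoding of the tables is my own, not part of the assignment):

```python
import itertools

# Factor tables from the problem, keyed by the 0/1 values of the variables.
phi1 = {(0, 0): 5,   (0, 1): 10,  (1, 0): 15,  (1, 1): 20}   # phi1(A, B)
phi2 = {(0, 0): 30,  (0, 1): 40,  (1, 0): 50,  (1, 1): 60}   # phi2(B, C)
phi3 = {(0, 0): 25,  (0, 1): 50,  (1, 0): 75,  (1, 1): 100}  # phi3(C, D)
phi4 = {(0, 0): 150, (0, 1): 200, (1, 0): 250, (1, 1): 300}  # phi4(D, A)

def unnormalized(a, b, c, d):
    """Unnormalized product of factors for one assignment."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function Z: the denominator of eq. (1), summed over all 2^4 assignments.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))

for a, b, c, d in itertools.product([0, 1], repeat=4):
    u = unnormalized(a, b, c, d)
    print(f"A=a{a} B=b{b} C=c{c} D=d{d}: unnormalized = {u:>9}, Pr = {u / Z:.4f}")
```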
(b) Answer the following questions. Please give only a one-line explanation for each. [10 pts]
• What is a clique, and what is a maximal clique? [2 pts]
• Write one difference between directed graphical models (DGM)/Bayesian networks (BN) and undirected graphical models (UGM)/Markov networks (MN). [2 pts]
• In Markov random fields, what is the reason for using a potential function instead of a probability function? [2 pts]
• Write the exponential form of the probability distribution P(X₁, …, Xₙ) for pairwise Markov networks (see the sketch after this list). [2 pts]
• Write one application of pairwise Markov random fields in computer vision.
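For reference on the fourth bullet, one standard way to write the pairwise exponential (log-linear) form, expressing each pairwise potential as $\phi_{ij} = \exp(-E_{ij})$ for an energy function $E_{ij}$; the notation here is a common convention, not the only acceptable answer:

```latex
\[
P(X_1, \dots, X_n) \,=\, \frac{1}{Z}
   \exp\!\Big( -\sum_{(i,j) \in E} E_{ij}(X_i, X_j) \Big),
\qquad
Z \,=\, \sum_{x_1, \dots, x_n} \exp\!\Big( -\sum_{(i,j) \in E} E_{ij}(x_i, x_j) \Big).
\]
```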
3 Hidden Markov Model [10 pts]
TA: Chukang Zhong
This problem will help you get familiar with HMM algorithms by doing the calculations by hand.
There are three coins (1, 2, 3); at each step one coin is tossed and the result is recorded. S = {1, 2, 3}; V = {H, T} (Head or Tail); A, B, and π are given as
A:
     1     2     3
1   0.90  0.05  0.05
2   0.45  0.10  0.45
3   0.45  0.45  0.10

B:
     1     2     3
H   0.50  0.75  0.25
T   0.50  0.25  0.75

π: (1/3, 1/3, 1/3)
(a) Given the model above, what is the probability of the observation sequence O = (H, T, H)? [10 pts]
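The intended method is the forward algorithm, which the hand calculation should follow step by step. A minimal sketch that can be used to check the arithmetic (it prints the answer rather than asserting a value):

```python
import numpy as np

# Transition matrix A[i, j] = P(next state j | current state i), from the problem.
A = np.array([[0.90, 0.05, 0.05],
              [0.45, 0.10, 0.45],
              [0.45, 0.45, 0.10]])

# Emission matrix B[v, i] = P(observe v | state i); rows are H, T.
B = np.array([[0.50, 0.75, 0.25],   # P(H | state)
              [0.50, 0.25, 0.75]])  # P(T | state)

pi = np.array([1/3, 1/3, 1/3])

obs = [0, 1, 0]  # O = (H, T, H), encoded as row indices into B

# Forward recursion: alpha[i] = P(o_1..o_t, x_t = i)
alpha = pi * B[obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[o]

print("P(O = H, T, H) =", alpha.sum())
```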
4 Neural networks [20 pts]
TA: Chukang Zhong
(a) Consider a neural network for binary classification that uses the sigmoid function for each unit. If the network has no hidden layer, explain why the model is equivalent to logistic regression. [5 pts]
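As a starting point for the argument (the full explanation is left to the answer):

```latex
% With no hidden layer, the network's output on input x is a single
% sigmoid unit applied to an affine function of x:
\[
P(y = 1 \mid x) \,=\, \sigma\big(w^{\top} x + b\big)
   \,=\, \frac{1}{1 + e^{-(w^{\top} x + b)}},
\]
% which is exactly the model posited by logistic regression.
```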
(b) Consider a simple two-layer network. Given the cost function used to train the neural network
$$\ell(w, \alpha, \beta) = \sum_{i=1}^{m} \big(y^i - \sigma(w^{\top} z^i)\big)^2,$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, show that the gradient is given by
$$\frac{\partial \ell(w, \alpha, \beta)}{\partial w} = -\sum_{i=1}^{m} 2\big(y^i - \sigma(u^i)\big)\,\sigma(u^i)\big(1 - \sigma(u^i)\big)\,z^i,$$
where $u^i = w^{\top} z^i$, $z^i_1 = \sigma(\alpha^{\top} x^i)$, and $z^i_2 = \sigma(\beta^{\top} x^i)$. Also find the gradient of $\ell$ with respect to $\alpha$ and $\beta$.
[15 pts]
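A useful habit when deriving gradients by hand is to verify them numerically with finite differences. The sketch below does that for $\partial \ell / \partial w$ on randomly generated toy data (the data and dimensions are my own illustrative choices, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

m, d = 20, 4                              # toy sizes, chosen arbitrarily
X = rng.normal(size=(m, d))
y = rng.integers(0, 2, size=m).astype(float)
alpha, beta, w = rng.normal(size=d), rng.normal(size=d), rng.normal(size=2)

def loss(w, alpha, beta):
    # Hidden layer z^i = (sigma(alpha^T x^i), sigma(beta^T x^i)), output sigma(w^T z^i).
    z = np.column_stack([sigma(X @ alpha), sigma(X @ beta)])
    return np.sum((y - sigma(z @ w)) ** 2)

def grad_w(w, alpha, beta):
    # Analytic gradient from the chain rule, matching the formula above.
    z = np.column_stack([sigma(X @ alpha), sigma(X @ beta)])
    u = z @ w
    s = sigma(u)
    return -np.sum((2 * (y - s) * s * (1 - s))[:, None] * z, axis=0)

# Central finite differences as an independent check.
eps, g_num = 1e-6, np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w); e[j] = eps
    g_num[j] = (loss(w + e, alpha, beta) - loss(w - e, alpha, beta)) / (2 * eps)

print("analytic:", grad_w(w, alpha, beta))
print("numeric: ", g_num)
```

The two printed vectors should agree to several decimal places, which also confirms the leading minus sign in the stated gradient.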
5 Programming [30 pts]
TA: Zongen Li
In this problem, you will implement an algorithm to analyze the behavior of the SP500 index over a period of time. For each week, we measure the price movement relative to the previous week and denote it using a binary variable (+1 indicates up and −1 indicates down). The price movements from week 1 (the week of January 5) to week 39 (the week of September 28) are included in sp500.mat.
Consider a Hidden Markov Model in which $x_t$ denotes the economic state (good or bad) of week $t$ and $y_t$ denotes the price movement (up or down) of the SP500 index. We assume that $x_{t+1} = x_t$ with probability 0.8, and that $P(y_t = +1 \mid x_t = \text{good}) = P(y_t = -1 \mid x_t = \text{bad}) = q$. In addition, assume that $P(x_1 = \text{bad}) = 0.8$. Load sp500.mat, implement the algorithm in algorithm.m/algorithm.py, and submit this file. In your report, briefly describe how you implemented your algorithm and report the following:
(a) Assuming q = 0.7, plot $P(x_t = \text{good} \mid y)$ for t = 1, 2, …, 39. What is the probability that the economy is in a good state in week 39? [15 pts]
(b) Repeat (a) for q = 0.9, and compare the result to that of (a). Explain your
comparison in one or two sentences. [15 pts]
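The smoothed posteriors $P(x_t = \text{good} \mid y)$ call for the forward-backward algorithm. A minimal sketch under the problem's parameters is below; loading the ±1 observation vector from sp500.mat (e.g., via scipy.io.loadmat) is left out, since the variable name inside the file depends on the provided data.

```python
import numpy as np

def smoothed_good_probs(y, q):
    """Forward-backward smoothing: P(x_t = good | y_1..y_T) for all t.

    y: array of +/-1 weekly price movements. State 0 = good, state 1 = bad.
    """
    T = len(y)
    A = np.array([[0.8, 0.2],
                  [0.2, 0.8]])      # x_{t+1} = x_t with probability 0.8
    pi = np.array([0.2, 0.8])       # P(x_1 = good) = 0.2, P(x_1 = bad) = 0.8

    def emit(yt):
        # P(y_t | x_t): up is likely in a good state, down in a bad state.
        return np.array([q, 1 - q]) if yt == +1 else np.array([1 - q, q])

    # Forward pass (normalized each step to avoid underflow).
    alpha = np.zeros((T, 2))
    alpha[0] = pi * emit(y[0])
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * emit(y[t])
        alpha[t] /= alpha[t].sum()

    # Backward pass.
    beta = np.ones((T, 2))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (emit(y[t + 1]) * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[:, 0]              # P(x_t = good | y) for t = 1..T

# Usage, once y is loaded from sp500.mat:
#   probs = smoothed_good_probs(y, q=0.7)   # plot this for part (a)
#   probs[-1]                               # P(x_39 = good | y)
```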
6 Extra credits : Support Vector Machines [20 pts]
TA: Pranathi B Suresha
Recall that in class, we talked about the soft-margin SVM:
$$\min_{w, b, \xi} \; \|w\|^2 + C \sum_{i=1}^{m} \xi^i \quad \text{s.t.} \quad y^i (w^{\top} x^i + b) \ge 1 - \xi^i, \;\; \xi^i \ge 0, \;\; \forall i \qquad (2)$$
Let $(x^i, y^i)_{i=1}^{m}$, with $x^i \in \mathbb{R}^n$ and $y^i \in \{\pm 1\}$ for $i \in [1 : m]$, be a linearly separable set of training data.
Show that if C is sufficiently large, the solution of the primal soft-margin SVM problem will give the unique maximum-margin separating hyperplane. How large does C need to be?
(Hint: (1) The hard-margin SVM has a unique solution for a linearly separable set, but the soft-margin SVM (eq. 2) does not give a unique solution for an arbitrary C unless C satisfies some condition, which is what the question asks for. For more information, please refer to this paper. (2) In order to obtain the same unique hyperplane as the solution to the hard-margin SVM, we need to make all slack variables $\xi^i = 0$; in that case, the soft-margin SVM becomes the hard-margin SVM. (3) Derive the KKT conditions for the soft-margin SVM and use them to find a C for which hint (2) is satisfied.)
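For hint (3), the following sketch lays out the Lagrangian and the two KKT facts that drive the argument; turning them into the required bound on C is the exercise:

```latex
% Lagrangian of the soft-margin problem (eq. 2), with multipliers
% lambda_i >= 0 for the margin constraints and mu_i >= 0 for xi^i >= 0:
\[
L(w, b, \xi, \lambda, \mu) \,=\, \|w\|^2 + C \sum_i \xi^i
   - \sum_i \lambda_i \big[ y^i (w^{\top} x^i + b) - 1 + \xi^i \big]
   - \sum_i \mu_i \xi^i .
\]
% Stationarity in xi^i gives C = lambda_i + mu_i, and complementary
% slackness gives mu_i xi^i = 0; so whenever lambda_i < C we have
% mu_i > 0 and hence xi^i = 0, which is the situation hint (2) needs.
```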