# CS 6375 Final Exam solution

\$45.00

Original Work ?
Category:

5/5 - (1 vote)

## 1. Missing Entries:

Suppose that you are given a symmetric matrix A ∈ R
n×n
that is missing
some entries, e.g., Aij =? for some indices i, j ∈ {1, . . . , n}. To determine which entries are
missing, we will use an index matrix Q ∈ {0, 1}
n×n
such that Qij = 1 if Aij =? and Qij = 0
otherwise.
(a) (3 points) Suppose A =

3 1 ?
1 1 −2
? −2 6

. Find a matrix B such that B is positive semidefinite and that the Frobenius norm, ||A − B||2
F =
P
i,j (Aij − Bij )
2
, is as small as possible.
(b) (3 points) Any symmetric positive semidefinite matrix B ∈ R
n×n
can be written as Bij =
v
(i)
T
v
(j)
for some v
(1), . . . , v(n) ∈ R
n
. Using this observation, formulate the problem of
filling in the missing entries of A as using the same strategy as part (a) as an optimization
problem.

(c) (10 points) The optimization problem in part (b) is not convex. Describe, in detail, a
block coordinate descent scheme for your objective in part (b) that has the property that
the optimization problem over each block is convex (recall that the EM algorithm enjoys
a similar property). For full credit, you should prove that your algorithm has the desired
property.

(d) (4 points) Explain why adding an `2 penalty to the optimization problem in part (b)
might be important in practice. Rewrite the optimization problem to include it.
Page 2 of 6
CS 6375 Final Exam Initials:

## 2. Neural Networks :

(a) (10 points) Derive the backpropagation algorithm in the case of a feedforward neural
network with softplus activation functions and a squared loss function. For full credit, you
should explicitly compute the derivatives as we did in class.

(b) (15 points) Suppose that we want to fit a conditional mixture model for a regression task.
That is, given data points x
(1), . . . , x(M) ∈ R
n with corresponding labels y
(1), . . . , y(M) ∈ R,
we would like to fit a model of the form
p(y|x, λ, θ1, . . . , θK) = X
K
k=1
λkp(y|x, θk),

where p(y|x, θk) is a conditional probability distribution parameterized by θk.
Suppose we choose these distributions to be of the form
p(y|x, θk) = N (y; NNk(x|θk)),

where NNk(x) is a neural network, whose parameters are given by θk, that takes as input
x and returns a pair of outputs µk and σ
2
k > 0, which act as the mean and variance of the
k
th normal distribution.

1. Explain how to apply the EM algorithm to fit this conditional mixture model to data.
For full credit, your explanation should provide sufficient algorithmic details.

2. Give an example to illustrate why you might want to apply models of this form in
practice. Describe how you might structure the neural networks for this application.

## 3. Rolling Dice:

Consider the following maximum likelihood estimation problems.
(a) (5 points) Suppose you are given a collection of observed 4-sided die rolls (i.e., the numbers one through 4) that were generated by picking one of two die (A or B) with respect
to some probability distribution and then rolling it.

If, in addition to the observed die
rolls, you are also told which die was rolled, write the formula for the log-likelihood, and
compute the maximum likelihood estimate of the parameters using the following sequence
of observations:

Outcome Observed Die
1 A
2 A
3 A
2 A
4 A
1 B
4 B
4 B
4 B
4 B
3 B
2 B
3 B
1 B

(b) Now suppose that you are given a sequence of data observations generated by rolling k,
6-sided dice for some k ∈ {1, 2, 3} and summing up their pips, e.g., if k = 2 and die one is
a 5 and die two is a six then the observation would be 11.

Suppose that each of the k die
are identical, i.e., the probability of observing n pips, denoted θn, for n ∈ {1, . . . , 6} is the
same for each die. However, k is unknown and must be estimated from data.

1. (10 points) Write the formula for the log-likelihood and compute the maximum likelihood estimate for k and θ given the following data.
Observed Sum
5
9
10
12
12
12
9
12
12
10
2. (5 points) What would be a good choice of prior be for θ? What is the MAP estimate
for the above data under this prior?

(c) (10 points) Finally, consider data generated by first flipping a coin with bias b. If the coin
comes up heads, then a possibly loaded dice D1 is rolled twice. If the coin comes up tails,
then a possibly loaded dice D2 is rolled twice. Given that you only observe the sum of
whichever die is rolled, explain how to estimate the bias of the coin and the probability
distribution of the outcomes of each of the loaded dice using the EM algorithm. For full
credit, you should completely specify the E and the M steps.

## 4. Dots and Boxes:

Consider the classic children’s game in which, starting from a 9 × 9 grid of
evenly spaced dots, two players alternatively take turns adding either a vertical or horizontal
line between neighboring dots. Whenever a player places a line that completes a box around
four neighboring dots, the player scores a single point and gets to take another turn.

The game
ends when no more lines can be placed. The aim is to be the player that completes the most
boxes / scores the most points. For more details and a visualization, see the Wikipedia page
for Dots-and-Boxes.

(a) (15 points) Describe a Markov decision process for an agent acting as a player in this
game whose aim is to win the game. You should provide the states, actions, transition
function, and reward function.

(b) (5 points) Given your MDP in part (a), does the choice of discount factor, i.e., γ ∈ (0, 1),
matter in terms of the policy that maximizes the cumulative reward function?

(c) (5 points) Explain, in detail, how to apply deep Q-learning with a neural network to
approximate the Q-value function. In particular, you should describe the neural network
and how you can use it to find the ”optimal” action, with respect to its Q-value, for a
given state.
Page 6 of 6