Description
CSC412 Assignment 1
The goal of this assignment is to get you familiar with the basics of decision theory and
gradient-based model fitting.
1 Decision theory [13pts]
One successful use of probabilistic models is for building spam filters, which take in an email
and take different actions depending on the likelihood that it’s spam.
Imagine you are running an email service. You have a well-calibrated spam classifier that
tells you the probability that a particular email is spam: p(spam|email). You have three
options for what to do with each email: You can show it to the user, put it in the spam
folder, or delete it entirely.
Depending on whether or not the email really is spam, the user will suffer a different amount
of wasted time for the different actions we can take, L(action,spam):
Action Spam Not spam
Show 10 0
Folder 1 50
Delete 0 200
1. [3pts] Plot the expected wasted user time for each of the three possible actions, as a
function of the probability of spam: p(spam|email)
losses = [[10, 0],
[1, 50],
[0, 200]]
num actions = length(losses)
function expected loss of action(prob spam, action)
#TODO: Return expected loss over a Bernoulli random variable
# with mean prob spam.
# Losses are given by the table above.
end
prob range = range(0., stop=1., length=500)
# Make plot
1
using Plots
for action in 1:num actions
display(plot!(prob range, expected loss of action(prob range, action)))
end
2. [2pts] Write a function that computes the optimal action given the probability of spam.
function optimal action(prob spam)
#TODO: return best action given the probability of spam.
# Hint: Julia’s findmin function might be helpful.
end
3. [4pts] Plot the expected loss of the optimal action as a function of the probability of
spam.
Color the line according to the optimal action for that probability of spam.
prob range = range(0, stop=1., length=500)
optimal losses = []
optimal actions = []
for p in prob range
# TODO: Compute the optimal action and its expected loss for
# probability of spam given by p.
end
plot(prob range, optimal losses, linecolor=optimal actions)
4. [4pts] For exactly which range of the probabilities of an email being spam should we
delete an email?
Find the exact answer by hand using algebra.
2 Regression
2.1 Manually Derived Linear Regression [10pts]
Suppose that X ∈ Rm×n with n ≥ m and Y ∈ Rn, and that Y ∼ N (XT β, σ2I).
In this question you will derive the result that the maximum likelihood estimate βˆ of β is
given by
βˆ = (XXT )
−1
XY
1. [1pts] What happens if n<m?
2. [2pts] What are the expectation and covariance matrix of βˆ, for a given true value of
β?
3. [2pts] Show that maximizing the likelihood is equivalent to minimizing the squared
error !n
i=1(yi − xiβ)2. [Hint: Use !n
i=1 a2
i = aT a]
2
4. [2pts] Write the squared error in vector notation, (see above hint), expand the expression, and collect like terms. [Hint: Use βT xT y = yT xβ and xT x is symmetric]
5. [3pts] Use the likelihood expression to write the negative log-likelihood. Write the
derivative of the negative log-likelihood with respect to β, set equal to zero, and solve
to show the maximum likelihood estimate βˆ as above.
2.2 Toy Data [2pts]
For visualization purposes and to minimize computational resources we will work with 1-
dimensional toy data.
That is X ∈ Rm×n where m = 1.
We will learn models for 3 target functions
• target f1, linear trend with constant noise.
• target f2, linear trend with heteroskedastic noise.
• target f3, non-linear trend with heteroskedastic noise.
using LinearAlgebra
function target f1(x, σ true=0.3)
noise = randn(size(x))
y = 2x .+ σ true.*noise
return vec(y)
end
function target f2(x)
noise = randn(size(x))
y = 2x + norm.(x)*0.3.*noise
return vec(y)
end
function target f3(x)
noise = randn(size(x))
y = 2x + 5sin.(0.5*x) + norm.(x)*0.3.*noise
return vec(y)
end
1. [1pts] Write a function which produces a batch of data x ∼ Uniform(0, 20) and y =
target f(x)
function sample batch(target f, batch size)
x =#TODO
y =#TODO
return (x,y)
end
3
using Test
@testset “sample dimensions are correct” begin
m = 1 # dimensionality
n = 200 # batch-size
for target f in (target f1, target f2, target f3)
x,y = sample batch(target f,n)
@test size(x) == (m,n)
@test size(y) == (n,)
end
end
2. [1pts] For all three targets, plot a n = 1000 sample of the data. Note: You will use
these plots later, in your writeup display once other questions are complete.
using Plots
x1,y1 =#TODO
plot f1 = #TODO
x2,y2 =#TODO
plot f2 = #TODO
x3,y3 = #TODO
plot f3 = #TODO
2.3 Linear Regression Model with βˆ MLE [4pts]
1. [2pts] Program the function that computes the the maximum likelihood estimate given
X and Y . Use it to compute the estimate βˆ for a n = 1000 sample from each target
function.
function beta mle(X,Y)
beta = #TODO
return beta
end
n=1000 # batch size
x 1, y 1 = #TODO
β mle 1 = #TODO
x 2, y 2 = #TODO
β mle 2 = #TODO
x 3, y 3 =#TODO
β mle 3 = #TODO
2. [2pts] For each function, plot the linear regression model given by Y ∼ N (XT βˆ, σ2I)
for σ = 1.. This plot should have the line of best fit given by the maximum likelihood
estimate, as well as a shaded region around the line corresponding to plus/minus one
standard deviation (i.e. the fixed uncertainty σ = 1.0). Using Plots.jl this shaded
uncertainty region can be achieved with the ribbon keyword argument. Display 3
plots, one for each target function, showing samples of data and maximum
likelihood estimate linear regression model
4
plot!(plot f1,#TODO)
plot!(plot f2,#TODO)
plot!(plot f3,#TODO)
2.4 Log-likelihood of Data Under Model [6pts]
1. [2pts] Write code for the function that computes the likelihood of x under the Gaussian
distribution N (µ, σ). For reasons that will be clear later, this function should be able
to broadcast to the case where x, µ, σ are all vector valued and return a vector of
likelihoods with equivalent length, i.e., xi ∼ N (µi, σi).
function gaussian log likelihood(µ, σ, x)
“””
compute log-likelihood of x under N(µ,σ)
“””
return #TODO: log-likelihood function
end
# Test Gaussian likelihood against standard implementation
@testset “Gaussian log likelihood” begin
using Distributions: pdf, Normal
# Scalar mean and variance
x = randn()
µ = randn()
σ = rand()
@test size(gaussian log likelihood(µ,σ,x)) == () # Scalar log-likelihood
@test gaussian log likelihood.(µ,σ,x) ≈ log.(pdf.(Normal(µ,σ),x)) # Correct Value
# Vector valued x under constant mean and variance
x = randn(100)
µ = randn()
σ = rand()
@test size(gaussian log likelihood.(µ,σ,x)) == (100,) # Vector of log-likelihoods
@test gaussian log likelihood.(µ,σ,x) ≈ log.(pdf.(Normal(µ,σ),x)) # Correct Values
# Vector valued x under vector valued mean and variance
x = randn(10)
µ = randn(10)
σ = rand(10)
@test size(gaussian log likelihood.(µ,σ,x)) == (10,) # Vector of log-likelihoods
@test gaussian log likelihood.(µ,σ,x) ≈ log.(pdf.(Normal.(µ,σ),x)) # Correct Values
end
2. [2pts] Use your gaussian log-likelihood function to write the code which computes the
negative log-likelihood of the target value Y under the model Y ∼ N (XT β, σ2 ∗ I) for
a given value of β.
function lr model nll(β,x,y;σ=1.)
return #TODO: Negative Log Likelihood
end
3. [1pts] Use this function to compute and report the negative-log-likelihood of a n ∈
{10, 100, 1000} batch of data under the model with the maximum-likelihood estimate
βˆ and σ ∈ {0.1, 0.3, 1., 2.} for each target function.
5
for n in (10,100,1000)
println(“——– $n ————“)
for target f in (target f1,target f2, target f3)
println(“——– $target f ————“)
for σ model in (0.1,0.3,1.,2.)
println(“——– $σ model ————“)
x,y = #TODO
β mle = #TODO
nll = #TODO
println(“Negative Log-Likelihood: $nll”)
end
end
end
4. [1pts] For each target function, what is the best choice of σ?
Please note that σ and batch-size n are modelling hyperparameters. In the expression of
maximum likelihood estimate, σ or n do not appear, and in principle shouldn’t affect the
final answer. However, in practice these can have significant effect on the numerical stability
of the model. Too small values of σ will make data away from the mean very unlikely, which
can cause issues with precision. Also, the negative log-likelihood objective involves a sum
over the log-likelihoods of each datapoint. This means that with a larger batch-size n, there
are more datapoints to sum over, so a larger negative log-likelihood is not necessarily worse.
The take-home is that you cannot directly compare the negative log-likelihoods achieved by
these models with different hyperparameter settings.
2.5 Automatic Differentiation and Maximizing Likelihood [3pts]
In a previous question you derived the expression for the derivative of the negative loglikelihood with respect to β. We will use that to test the gradients produced by automatic
differentiation.
1. [3pts] For a random value of β, σ, and n = 100 sample from a target function, use
automatic differentiation to compute the derivative of the negative log-likelihood of the
sampled data with respect β. Test that this is equivalent to the hand-derived value.
using Zygote: gradient
@testset “Gradients wrt parameter” begin
β test = randn()
σ test = rand()
x,y = sample batch(target f1,100)
ad grad = #TODO
hand derivative = #TODO
@test ad grad[1] ≈ hand derivative
end
2.5.1 Train Linear Regression Model with Gradient Descent [5pts]
In this question we will compute gradients of of negative log-likelihood with respect to β.
We will use gradient descent to find β that maximizes the likelihood.
6
1. [3pts] Write a function train lin reg that accepts a target function and an initial
estimate for β and some hyperparameters for batch-size, model variance, learning rate,
and number of iterations. Then, for each iteration:
– sample data from the target function
– compute gradients of negative log-likelihood with respect to β
– update the estimate of β with gradient descent with specified learning rate
and, after all iterations, returns the final estimate of β.
using Logging # Print training progress to REPL, not pdf
function train lin reg(target f, β init; bs= 100, lr = 1e-6, iters=1000, σ model = 1. )
β curr = β init
for i in 1:iters
x,y = #TODO
@info “loss: $() β: $β curr” #TODO: log loss, if you want to monitor training
progress
grad β = #TODO: compute gradients
β curr = #TODO: gradient descent
end
return β curr
end
2. [2pts] For each target function, start with an initial parameter β, learn an estimate
for βlearned by gradient descent. Then plot a n = 1000 sample of the data and the
learned linear regression model with shaded region for uncertainty corresponding to
plus/minus one standard deviation.
β init = 1000*randn() # Initial parameter
β learned= #TODO: call training function
#TODO: For each target function, plot data samples and learned regression
2.5.2 Non-linear Regression with a Neural Network [9pts]
In the previous questions we have considered a linear regression model
Y ∼ N (XT β, σ2
)
This model specified the mean of the predictive distribution for each datapoint by the product
of that datapoint with our parameter.
Now, let us generalize this to consider a model where the mean of the predictive distribution
is a non-linear function of each datapoint. We will have our non-linear model be a simple
function called neural net with parameters θ (collection of weights and biases).
Y ∼ N (neural net(X, θ), σ2
)
7
1. [3pts] Write the code for a fully-connected neural network (multi-layer perceptron)
with one 10-dimensional hidden layer and a tanh nonlinearirty. You must write this
yourself using only basic operations like matrix multiply and tanh, you may not use
layers provided by a library.
This network will output the mean vector, test that it outputs the correct shape for
some random parameters.
# Neural Network Function
function neural net(x,θ)
return #TODO
end
# Random initial Parameters
θ = #TODO
@testset “neural net mean vector output” begin
n = 100
x,y = sample batch(target f1,n)
µ = neural net(x,θ)
@test size(µ) == (n,)
end
2. [2pts] Write the code that computes the negative log-likelihood for this model where
the mean is given by the output of the neural network and σ = 1.0
function nn model nll(θ,x,y;σ=1)
return #TODO
end
3. [2pts] Write a function train nn reg that accepts a target function and an initial
estimate for θ and some hyperparameters for batch-size, model variance, learning rate,
and number of iterations. Then, for each iteration:
– sample data from the target function
– compute gradients of negative log-likelihood with respect to θ
– update the estimate of θ with gradient descent with specified learning rate
and, after all iterations, returns the final estimate of θ.
using Logging # Print training progress to REPL, not pdf
function train nn reg(target f, θ init; bs= 100, lr = 1e-5, iters=1000, σ model = 1. )
θ curr = θ init
for i in 1:iters
x,y = #TODO
@info “loss: $()” #TODO: log loss, if you want to montior training
grad θ = #TODO: compute gradients
θ curr = #TODO: gradient descent
end
return θ curr
end
8
4. [2pts] For each target function, start with an initialization of the network parameters,
θ, use your train function to minimize the negative log-likelihood and find an estimate
for θlearned by gradient descent. Then plot a n = 1000 sample of the data and the
learned regression model with shaded uncertainty bounds given by σ = 1.0
#TODO: For each target function
θ init = #TODO
θ learned = #TODO
#TODO: plot data samples and learned regression
2.5.3 Non-linear Regression and Input-dependent Variance with a Neural Network [8pts]
In the previous questions we’ve gone from a gaussian model with mean given by linear
combination
Y ∼ N (XT β, σ2
)
to gaussian model with mean given by non-linear function of the data (neural network)
Y ∼ N (neural net(X, θ), σ2
)
However, in all cases we have considered so far, we specify a fixed variance for our model
distribution. We know that two of our target datasets have heteroscedastic noise, meaning
any fixed choice of variance will poorly model the data.
In this question we will use a neural network to learn both the mean and log-variance of our
gaussian model.
µ, log σ = neural net(X, θ)
Y ∼ N (µ, exp(log σ)
2
)
1. [1pts] Write the code for a fully-connected neural network (multi-layer perceptron)
with one 10-dimensional hidden layer and a tanh nonlinearirty, and outputs both a
vector for mean and log σ. Test the output shape is as expected.
# Neural Network Function
function neural net w var(x,θ)
return #TODO
end
# Random initial Parameters
θ = #TODO
@testset “neural net mean and logsigma vector output” begin
n = 100
x,y = sample batch(target f1,n)
µ, logσ = neural net w var(x,θ)
@test size(µ) == (n,)
@test size(logσ) == (n,)
end
9
2. [2pts] Write the code that computes the negative log-likelihood for this model where
the mean and log σ is given by the output of the neural network. (Hint: Don’t forget
to take exp log σ)
function nn with var model nll(θ,x,y)
return #TODO
end
3. [1pts] Write a function train nn w var reg that accepts a target function and an initial
estimate for θ and some hyperparameters for batch-size, learning rate, and number of
iterations. Then, for each iteration:
– sample data from the target function
– compute gradients of negative log-likelihood with respect to θ
– update the estimate of θ with gradient descent with specified learning rate
and, after all iterations, returns the final estimate of θ.
function train nn w var reg(target f, θ init; bs= 100, lr = 1e-4, iters=10000)
θ curr = θ init
for i in 1:iters
x,y = #TODO
@info “loss: $()” #TODO: log loss
grad θ = #TODO compute gradients
θ curr = #TODO gradient descent
end
return θ curr
end
4. [4pts] For each target function, start with an initialization of the network parameters,
θ, learn an estimate for θlearned by gradient descent. Then plot a n = 1000 sample of
the dataset and the learned regression model with shaded uncertainty bounds corresponding to plus/minus one standard deviation given by the variance of the predictive
distribution at each input location (output by the neural network). (Hint: ribbon
argument for shaded uncertainty bounds can accept a vector of σ)
Note: Learning the variance is tricky, and this may be unstable during training. There
are some things you can try:
– Adjusting the hyperparameters like learning rate and batch size
– Train for more iterations
– Try a different random initialization, like sample random weights and bias matrices with lower variance.
For this question you will not be assessed on the final quality of your model.
Specifically, if you fails to train an optimal model for the data that is okay. You are
expected to learn something that is somewhat reasonable, and demonstrates that
this model is training and learning variance.
If your implementation is correct, it is possible to learn a reasonable model with fewer
than 10 minutes of training on a laptop CPU. The default hyperparameters should
help, but may need some tuning.
10
#TODO: For each target function
θ init = #TODO
θ learned = #TODO
#TODO: plot data samples and learned regression
If you would like to take the time to train a very good model of the data (specifically for
target functions 2 and 3) with a neural network that outputs both mean and log σ you can
do this, but it is not necessary to achieve full marks. You can try
• Using a more stable optimizer, like Adam. You may import this from a library.
• Increasing the expressivity of the neural network, increase the number of layers or the
dimensionality of the hidden layer.
• Careful tuning of hyperparameters, like learning rate and batchsize.
11
CSC412 Assignment 2 version 2: Stochastic Variational Inference in the TrueSkill Model
Assignment 2 version 2:
Stochastic Variational Inference in the TrueSkill Model
STA414/STA2014 and CSC412/CSC2506
The goal of this assignment is to get you familiar with the basics of Bayesian inference in large models
with continuous latent variables, and the basics of stochastic variational inference.
Background We’ll implement a variant of the TrueSkill model, a player ranking system for competitive
games originally developed for Halo 2. It is a generalization of the Elo rating system in Chess. For the
curious, the original 2007 NIPS paper introducing the trueskill paper can be found here: https://papers.
nips.cc/paper/3079-trueskilltm-a-bayesian-skill-rating-system.pdf
This assignment is based on one developed by Carl Rasmussen at Cambridge for his course on probabilistic
machine learning: https://mlg.eng.cam.ac.uk/teaching/4f13/1920/
0.1 Model definition
We’ll consider a slightly simplified version of the original trueskill model. We assume that each player has a
true, but unknown skill zi 2 R. We use N to denote the number of players.
The prior. The prior over each player’s skill is a standard normal distribution, and all player’s skills are
a prior independent.
The likelihood. For each observed game, the probability that player i beats player j, given the player’s
skills zA and zB, is:
p(A beat B|zA, zB) = (zi zj )
where
(y) = 1
1 + exp(y)
There can be more than one game played between a pair of players, and in this case the outcome of each
game is independent given the players’ skills. We use M to denote the number of games.
The data. The data will be an array of game outcomes. Each row contains a pair of player indices. The
first index in each pair is the winner of the game, the second index is the loser. If there were M games
played, then the array has shape M ⇥ 2.
1
1 Implementing the model [10 points]
(a) [2 points] Implement a function log prior that computes the log of the prior over all player’s skills.
Specifically, given a K ⇥ N array where each row is a setting of the skills for all N players, it returns
a K ⇥ 1 array, where each row contains a scalar giving the log-prior for that set of skills.
(b) [3 points] Implement a function logp a beats b that, given a pair of skills za and zb evaluates the
log-likelihood that player with skill za beat player with skill zb under the model detailed above. To
ensure numerical stability, use the function log1pexp that computes log(1 + exp(x)) in a numerically
stable way. This function is provided by StatsFuns.jl and imported already, and also by Python’s
numpy.
(c) [3 points] Assuming all game outcomes are i.i.d. conditioned on all players’ skills, implement a
function all games log likelihood that takes a batch of player skills zs and a collection of observed
games games and gives a batch of log-likelihoods for those observations. Specifically, given a K ⇥ N
array where each row is a setting of the skills for all N players, and an M ⇥ 2 array of game outcomes,
it returns a K ⇥ 1 array, where each row contains a scalar giving the log-likelihood of all games for
that set of skills. Hint: You should be able to write this function without using for loops, although you
might want to start that way to make sure what you’ve written is correct. If A is an array of integers,
you can index the corresponding entries of another matrix B for every entry in A by writing B[A].
(d) [2 points] Implement a function joint log density which combines the log-prior and log-likelihood
of the observations to give p(z1, z2,…,zN , all game outcomes)
2 Examining the posterior for only two players and toy data [10
points]
To get a feel for this model, we’ll first consider the case where we only have 2 players, A and B. We’ll
examine how the prior and likelihood interact when conditioning on di↵erent sets of games.
Provided in the starter code is a function skillcontour! which evaluates a provided function on a grid of
zA and zB’s and plots the isocontours of that function. As well there is a function plot line equal skill!.
We have included an example for how you can use these functions.
We also provided a function two player toy games which produces toy data for two players. I.e.
two player toy games(5,3) produces a dataset where player A wins 5 games and player B wins 3 games.
(a) [2 points] For two players A and B, plot the isocontours of the joint prior over their skills. Also plot
the line of equal skill, zA = zB. Hint: you’ve already implemented the log of the likelihood function.
(b) [2 points] Plot the isocontours of the likelihood function. Also plot the line of equal skill, zA = zB.
(c) [2 points] Plot isocountours of the joint posterior over zA and zB given that player A beat player B
in one match. Since the contours don’t depend on the normalization constant, you can simply plot
the isocontours of the log of joint distribution of p(zA, zB, A beat B) Also plot the line of equal skill,
zA = zB.
(d) [2 points] Plot isocountours of the joint posterior over zA and zB given that 10 matches were played,
and player A beat player B all 10 times. Also plot the line of equal skill, zA = zB.
(e) [2 points] Plot isocountours of the joint posterior over zA and zB given that 20 matches were played,
and each player beat the other 10 times. Also plot the line of equal skill, zA = zB.
For all plots, label both axes.
2
3 Stochastic Variational Inference on Two Players and Toy Data
[18 points]
One nice thing about a Bayesian approach is that it separates the model specification from the approximate inference strategy. The original Trueskill paper from 2007 used message passing. Carl Rasmussen’s
assignment uses Gibbs sampling, a form of Markov Chain Monte Carlo. We’ll use gradient-based stochastic
variational inference, which wasn’t invented until around 2014.
In this question we will optimize an approximate posterior distribution with stochastic variational inference to approximate the true posterior.
(a) [5 points] Implement a function elbo which computes an unbiased estimate of the evidence lower
bound. As discussed in class, the ELBO is equal to the KL divergence between the true posterior
p(z|data), and an approximate posterior, q(z|data), plus an unknown constant. Use a fully-factorized
Gaussian distribution for q(z|data). This estimator takes the following arguments:
• params, the parameters of the approximate posterior q(z|data).
• A function logp, which is equal to the true posterior plus a constant. This function must take a
batch of samples of z. If we have N players, we can consider B-many samples from the joint over
all players’ skills. This batch of samples zs will be an array with dimensions (N,B).
• num samples, the number of samples to take.
This function should return a single scalar. Hint: You will need to use the reparamterization trick
when sampling zs.
(b) [2 points] Write a loss function called neg toy elbo that takes variational distribution parameters
and an array of game outcomes, and returns the negative elbo estimate with 100 samples.
(c) [5 points] Write an optimization function called fit toy variational dist which takes initial variational parameters, and the evidence. Inside it will perform a number of iterations of gradient descent
where for each iteration :
(a) Compute the gradient of the loss with respect to the parameters using automatic di↵erentiation.
(b) Update the parameters by taking an lr-scaled step in the direction of the descending gradient.
(c) Report the loss with the new parameters (using @info or print statements)
(d) On the same set of axes plot the target distribution in red and the variational approximation in
blue.
Return the parameters resulting from training.
(d) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 1 game. Report the final loss. Also plot the optimized variational
approximation contours (in blue) aand the target distribution (in red) on the same axes.
(e) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 10 games. Report the final loss. Also plot the optimized variational
approximation contours (in blue) aand the target distribution (in red) on the same axes.
(f) [2 points] Initialize a variational distribution parameters and optimize them to approximate the joint
where we observe player A winning 10 games and player B winning 10 games. Report the final loss.
Also plot the optimized variational approximation contours (in blue) aand the target distribution (in
red) on the same axes.
For all plots, label both axes.
3
4 Approximate inference conditioned on real data [24 points]
Load the dataset from tennis data.mat containing two matrices:
• W is a 107 by 1 matrix, whose i’th entry is the name of player i.
• G is a 1801 by 2 matrix of game outcomes (actually tennis matches), one row per game. The first
column contains the indices of the players who won. The second column contains the indices of the
player who lost.
Compute the following using your code from the earlier questions in the assignment, but conditioning on
the tennis match outcomes:
(a) [1 point] For any two players i and j, p(zi, zj |all games) is always proportional to p(zi, zj , all games). In
general, are the isocontours of p(zi, zj |all games) the same as those of p(zi, zj |games between i and j)?
That is, do the games between other players besides i and j provide information about the skill of
players i and j? A simple yes or no suces.
Hint: One way to answer this is to draw the graphical model for three players, i, j, and k, and the
results of games between all three pairs, and then examine conditional independencies. If you do this,
there’s no need to include the graphical models in your assignment.
(b) [5 points] Write a new optimization function fit variational dist like the one from the previous
question except it does not plot anything. Initialize a variational distribution and fit it to the joint
distribution with all the observed tennis games from the dataset. Report the final negative ELBO
estimate after optimization.
(c) [2 points] Plot the approximate mean and variance of all players, sorted by skill. For example, in
Julia, you can use: perm = sortperm(means); plot(means[perm], yerror=exp.(logstd[perm]))
There’s no need to include the names of the players.
(d) [2 points] List the names of the 10 players with the highest mean skill under the variational model.
(e) [3 points] Plot the joint approximate posterior over the skills of Roger Federer and Rafael Nadal. Use
the approximate posterior that you fit in question 4 part b.
(f) [5 points] Derive the exact probability under a factorized Guassian over two players’ skills that one
has higher skill than the other, as a function of the two means and variances over their skills. Express
your answer in terms of the cumulative distribution function of a one-dimensional Gaussian random
variable.
• Hint 1: Use a linear change of variables yA, yB = zA zB, zB. What does the line of equal skill
look like after this transformation?
• Hint 2: If X ⇠ N (µ, ⌃), then AX ⇠ N (Aµ, A⌃AT ) where A is a linear transformation.
• Hint 3: Marginalization in Gaussians is easy: if X ⇠ N (µ, ⌃), then the ith element of X has a
marginal distribution Xi ⇠ N (µi, ⌃ii)
(g) [2 points] Using the formula from part c, compute the exact probability under your approximate
posterior that Roger Federer has higher skill than Rafael Nadal. Then, estimate it using simple Monte
Carlo with 10000 examples, again using your approximate posterior.
(h) [2 points] Using the formula from part c, compute the probability that Roger Federer is better than
the player with the lowest mean skill. Compute this quantity exactly, and then estimate it using simple
Monte Carlo with 10000 examples, again using your approximate posterior.
(i) [2 points] Imagine that we knew ahead of time that we were examining the skills of top tennis players,
and so changed our prior on all players to Normal(10, 1). Which answers in this section would this
change? No need to show your work, just list the letters of the questions whose answers would be
di↵erent in expectation.
4
CSC412 Assignment 3: Variational Autoencoders
Assignment 3: Variational Autoencoders
STA414/STA2014 and CSC412/CSC2506
0.1 Background
In this assignment, we will implement and investigate the Variational Autoencoder on binarized MNIST
digits, as introduced by the paper Auto-Encoding Variational Bayes by Kingma and Welling (2013). Before
starting, we recommend reading this paper.
Data. Each datapoint in the MNIST dataset is a 28×28 grayscale image (i.e. pixels are values between 0
and 1) of a handwritten digit in {0 … 9}, and a label indicating which number. MNIST is the ‘fruit fly’ of
machine learning – a simple standard problem useful for comparing the properties of different algorithms.
Use the first 10000 samples for training, and the second 10000 for testing. Hint: Also build a dataset of
only 100 training samples to use when debugging, to make loading and training faster.
Tools. As before, you can (and should) use automatic differentiation provided by your package of choice.
Whereas in previous assignments you implemented neural network layers and stochastic gradient descent
manually, in this assignment feel free to use those provided by a machine learning framework. In Julia, these
will be provided by Flux.jl. You can also freely copy and adapt the Python autograd starter code provided.
If you do, you should probably remove batch normalization.
However, you may not use any probabilistic modelling elements from these frameworks. In
particular, sampling from and evaluating densities under distributions must be written by you or provided
by the starter code.
0.2 Model Definition
Prior. The prior over each digit’s latent representation is a multivariate standard normal distribution. For
all questions, we’ll set the dimension of the latent space Dz to 2. A larger latent dimension would provide a
more powerful model, but for this assignment we’ll use a two-dimensional latent space to make visualization
and debugging easier..
Likelihood. Given the latent representation z for an image, the distribution over all 784 pixels in the
image is given by a product of independent Bernoullis, whose means are given by the output of a neural
network fθ(z):
p(x|z, θ) = !784
d=1
Ber(xd|fθ(z)d)
The neural network fθ is called the decoder, and its parameters θ will be optimized to fit the data.
1
1 Implementing the Model [5 points]
For your convenience we have provided the following functions:
• factorized gaussian log density that accepts the mean and log standard deviations for a product
of independent gaussian distribtuions and computes the likelihood under them. This function will
produce the log-likelihood for each batch element 1 × B
• bernoulli log density that accepts the logits of a bernoulli distribution over D-dimensional data
and returns D × B log-likelihoods.
• sample diag gaaussian that accepts above parameters for a factorized Gaussian distribution and
samples with the reparameterization trick.
• sample bernoulli that accepts above parameters for a Bernoulli distribution and samples from it.
• load binarized mnist that loads and binarizes the MNIST dataset.
• batch x and batch x that splits the data, and just the images, into batches.
Further, in the file example flux model.jl we demonstrate how to specify neural network layers with
Flux library. Note that Flux provides convenience functions that allow us to take gradients of functions with
respect to parameters that are not passed around explicitly. Other AD frameworks, of if you prefer to
implement your own network layers, recycling code from previous assignments, you may need to explicilty
provide the network parameters to the functions below.
(a) [1 point] Implement a function log prior that computes the log of the prior over a digit’s representation log p(z).
(b) [2 points] Implement a function decoder that, given a latent representation z and a set of neural
network parameters θ (again, implicitly in Flux), produces a 784-dimensional mean vector of a product
of Bernoulli distributions, one for each pixel in a 28×28 image. Make the decoder architecture a multilayer perceptron (i.e. a fully-connected neural network) with a single hidden layer with 500 hidden
units, and a tanh nonlinearity. Its input will be a batch two-dimensional latent vectors (zs in Dz × B)
and its output will be a 784-dimensional vector representing the logits of the Bernoulli means for each
dimension Ddata × B. For numerical stability, instead of outputting the mean µ ∈ [0, 1], you should
output log ” µ
1−µ
#
∈ R called “logit”.
(c) [1 point] Implement a function log likelihood that, given a latent representation z and a binarized
digit x, computes the log-likelihood log p(x|z).
(d) [1 point] Implement a function joint log density which combines the log-prior and log-likelihood
of the observations to give log p(z, x) for a single image.
All of the functions in this section must be able to be evaluated in parallel, vectorized and non-mutating,
on a batch of B latent vectors and images, using the same parameters θ for each image. In particular,
you can not use a for loop over the batch elements.
2
2 Amortized Approximate Inference and training [13 points]
(a) [2 points] Write a function encoder that, given an image x (or batch of images) and recognition
parameters φ, evaluates an MLP to outputs the mean and log-standard deviation of a factorized
Gaussian of dimension Dz = 2. Make the encoder architecture a multi-layer perceptron (i.e. a fullyconnected neural network) with a single hidden layer with 500 hidden units, and a tanh nonlinearity.
This function must be able to be evaluated in parallel on a batch of images, using the same parameters
φ for each image.
(b) [1 points] Write a function log q that given the parameters of the variational distribution, evaluates
the likelihood of z.
(c) [5 points] Implement a function elbo which computes an unbiased estimate of the mean variational
evidence lower bound on a batch of images. Use the output of encoder to give the parameters for
qφ(z|data). This estimator takes the following arguments:
• x, an batch of B images, Dx × B.
• encoder params, the parameters φ of the encoder (recognition network). Note: these are not
required with Flux as parameters are implicit.
• decoder params, the parameters θ of the decoder (likelihood). Note: these are not required with
Flux as parameters are implicit.
This function should return a single scalar. Hint: You will need to use the reparamterization trick
when sampling zs. You can use any form of the ELBO estimator you prefer. (i.e., if you want you can
write the KL divergence between q and the prior in closed form since they are both Gaussians). You
only need to sample a single z for each image in the batch.
(d) [2 points] Write a loss function called loss that returns the negative elbo estimate over a batch of
data.
(e) [3 points] Write a function that initializes and optimizes the encoder and decoder parameters jointly
on the training set. Note that this function should optimize with gradients on the elbo estimate over
batches of data, not the entire dataset. Train the data for 100 epochs (each epoch involves a loop over
every batch). Report the final ELBO on the test set. Tip: Save your trained weights here (e.g. with
BSON.jl, see starter code, or by pickling in Python) so that you don’t need to train them again.
3
3 Visualizing Posteriors and Exploring the Model [15 points]
In this section we will investigate our model by visualizing the distribution over data given by the generative
model, sampling from it, and interpolating between digits.
(a) [5 points] Plot samples from the trained generative model using ancestral sampling:
(a) First sample a z from the prior.
(b) Use the generative model to compute the bernoulli means over the pixels of x given z. Plot these
means as a greyscale image.
(c) Sample a binary image x from this product of Bernoullis. Plot this sample as an image.
Do this for 10 samples z from the prior.
Concatenate all your plots into one 2×10 figure where each image in the first row shows the Bernoulli
means of p(x|z) for a separate sample of z, and each image in the the second row is a bianry image,
sampled from the distribution above it. Make each column an independent sample.
(b) [5 points] One way to understand the meaning of latent representations is to see which parts of the
latent space correspond to which kinds of data. Here we’ll produce a scatter plot in the latent space,
where each point in the plot represents a different image in the training set.
(a) Encode each image in the training set.
(b) Take the 2D mean vector of each encoding qφ(z|x).
(c) Plot these mean vectors in the 2D latent space with a scatterplot.
(d) Colour each point according to the class label (0 to 9).
Hopefully our latent space will group images of different classes, even though we never provided class
labels to the model!
(c) [5 points] Another way to examine a latent variable model with continuous latent variables is to
interpolate between the latent representations of two points.
Here we will encode 3 pairs of data points with different classes. Then we will linearly interpolate
between the mean vectors of their encodings. We will plot the generative distributions along the linear
interpolation.
(a) First, write a function which takes two points za and zb, and a value α ∈ [0, 1], and outputs the
linear interpolation zα = αza + (1 − α)zb.
(b) Sample 3 pairs of images, each having a different class.
(c) Encode the data in each pair, and take the mean vectors
(d) Linearly interpolate between these mean vectors
(e) At 10 equally-space points along the interpolation, plot the Bernoulli means p(x|zα)
(f) Concatenate these plots into one figure.
4
4 Predicting the Bottom of Images given the Top [15 points]
Now we’ll use the trained generative model to perform inference for p(z|top half of image x). Unfortunately,
we can’t re-use our recognition network, since it can only input entire images. However, we can still do
approximate inference without the encoder.
To illustrate this, we’ll approximately infer the distribution over the pixels in the bottom half an image
conditioned on the top half of the image:
p(bottom half of image x|top half of image x) = $
p(bottom half of image x|z)p(z|top half of image x)dz
To approximate the posterior p(z|top half of image x), we’ll use stochastic variational inference.
(a) [5 points] Write a function that computes p(z,top half of image x)
(a) First, write a function which returns only the top half of a 28×28 array. This will be useful for
plotting, as well as selecting the correct Bernoulli parameters.
(b) Write a function that computes log p(top half of image x|z). Hint: Given z, the likelihood factorizes, and all the unobserved dimensions of x are leaf nodes, so can be integrated out exactly.
(c) Combine this likelihood with the prior to get a function that takes an x and an array of zs, and
computes the log joint density log p(z,top half of image x) for each z in the array.
(b) [5 points] Now, to approximate p(z|top half of image x) in a scalable way, we’ll use stochastic variational inference. For a digit of your choosing from the training set (choose one that is modelled well,
i.e. the resulting plot looks reasonable):
(a) Initialize variational parameters φµ and φlog σ for a variational distribution q(z|top half of x).
(b) Write a function that computes estimates the ELBO over K samples z ∼ q(z|top half of x). Use
log p(z), log p(top half of x|z), and log q(z|top half of x).
(c) Optimize φµ and φlog σ to maximize the ELBO.
(d) On a single plot, show the isocontours of the joint distribution p(z,top half of image x), and the
optimized approximate posterior qφ(z|top half of image x).
(e) Finally, take a sample z from your approximate posterior, and feed it to the decoder to find the
Bernoulli means of p(bottom half of image x|z). Contatenate this greyscale image to the true top
of the image. Plot the original whole image beside it for comparison.
(c) [5 points] True or false: Questions about the model and variational inference.
There is no need to explain your work in this section.
(a) Does the distribution over p(bottom half of image x|z) factorize over the pixels of the bottom half
of image x?
(b) Does the distribution over p(bottom half of image x|top half of image x) factorize over the pixels
of the bottom half of image x?
(c) When jointly optimizing the model parameters θ and variational parameters φ, if the ELBO
increases, has the KL divergence between the approximate posterior qφ(z|x) and the true posterior
pθ(z|x) necessarily gotten smaller?
(d) If p(x) = N (x|µ, σ2), for some x ∈ R, µ ∈ R, σ ∈ R+, can p(x) < 0?
(e) If p(x) = N (x|µ, σ2), for some x ∈ R, µ ∈ R, σ ∈ R+, can p(x) > 1?
5