HOMEWORK 3 – V3 Sta414/Sta2104 Nature’s rejection sampler solution



1. Nature’s rejection sampler, 30 points. A light source emits photons whose angle x is
normally distributed around θ:

p(x|θ) = N(x | µ = θ, σ² = 4²)

Then the photons pass through a grating which absorbs photons with a probability that depends
on their angle. We denote the event that a photon makes it through the grating by saying that
the Bernoulli variable g = 1. For this particular grating, p(g|x, θ) = Ber(g | f(x, θ)), where

f(x, θ) = sin²(5(x − θ)) / (25 sin²(x − θ))   (1.1)
The resulting unnormalized probability p(x, g = 1 | θ = 0) is shown below as a function of x.
(a) [4 points] Use numerical integration (i.e. a brute-force sum over many values) to estimate
the fraction of photons that get absorbed on average, p(g = 0 | θ = 0). That is, estimate
∫ p(x, g = 0 | θ = 0) dx by summing over 10,000 values of x, equally spaced from −20 to 20.
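A minimal sketch of the brute-force sum, assuming σ = 4 (reading σ² = 4²) and taking f(x, 0) = 1 where sin(x) = 0, by continuity:

```python
import numpy as np

def f(x, theta=0.0):
    """Transmission probability from Eq. (1.1); the limit at x - theta = k*pi is 1."""
    u = np.asarray(x, dtype=float) - theta
    s = np.sin(u)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(5 * u) ** 2 / (25 * s ** 2)
    return np.where(np.isclose(s, 0.0), 1.0, out)

# Riemann sum over 10,000 equally spaced points in [-20, 20].
xs = np.linspace(-20, 20, 10_000)
dx = xs[1] - xs[0]
p_x = np.exp(-(xs ** 2) / (2 * 4.0 ** 2)) / np.sqrt(2 * np.pi * 4.0 ** 2)  # p(x | theta=0)
p_absorbed = np.sum(p_x * (1.0 - f(xs)) * dx)   # integral of p(x, g=0 | theta=0)
print(p_absorbed)
```

The integrand is p(x|θ=0)·p(g=0|x, θ=0) = p(x|θ=0)·(1 − f(x, 0)); the grid spacing `dx` turns the sum into a Riemann approximation of the integral.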
(b) [4 points] Implement a rejection sampler to sample from p(x|g = 1, θ = 0). Hint: since
it’s a pmf, p(g|x, θ = 0) ≤ 1. Another hint: doing ancestral sampling in the above model
forms a rejection sampler. Plot a histogram of the 10,000 accepted samples, and print the
fraction of samples that were accepted.
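One way to implement the ancestral-sampling hint (a sketch; the seed, proposal count, and σ = 4 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta=0.0):
    # Transmission probability from Eq. (1.1), with the limit 1 where sin(x - theta) = 0.
    u = np.asarray(x, dtype=float) - theta
    s = np.sin(u)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(5 * u) ** 2 / (25 * s ** 2)
    return np.where(np.isclose(s, 0.0), 1.0, out)

# Ancestral sampling: x ~ p(x | theta=0), then g ~ Bernoulli(f(x, 0)).
# Keeping only the x's with g = 1 is rejection sampling from p(x | g=1, theta=0).
n_proposals = 100_000
xs = rng.normal(0.0, 4.0, size=n_proposals)
accepted = rng.random(n_proposals) < f(xs)
fraction_accepted = accepted.mean()            # estimates p(g = 1 | theta = 0)
samples = xs[accepted][:10_000]                # histogram these, e.g. plt.hist(samples, 100)
print(fraction_accepted, len(samples))
```

Because the Bernoulli acceptance probability is exactly p(g = 1 | x, θ = 0), no extra envelope constant is needed.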
(c) [4 points] Implement a self-normalized importance sampler to estimate the fraction of all
photons that get absorbed, p(g = 0 | θ = 0). Use p(x | θ = 0) as a proposal distribution. As
a reminder, the formula for a self-normalized importance sampler for the expectation of a
function f over an (unnormalized) target distribution p̃(x) and proposal distribution q(x) is

ê(x₁, x₂, . . . , x_K) = [Σ_{i=1}^K f(x_i) p̃(x_i)/q(x_i)] / [Σ_{j=1}^K p̃(x_j)/q(x_j)],  where x_i ∼ iid q(x)   (1.2)

If you can derive a simpler expression by canceling terms, you don’t have to implement the
entire formula. Print the estimate of the fraction of photons that are absorbed, using 1000
samples.
(d) [4 points] A physicist is trying to estimate the location θ of the center of a light source.
She has a Cauchy prior p(θ) = 1 / (10π(1 + (θ/10)²)). She observes a photon emitted from the particle
at position x = 1.7. Plot the unnormalized density p(x = 1.7, g = 1, θ) as a function of θ.
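A sketch of the density evaluation (the grid range and σ = 4 are assumptions); the joint factors as prior × Gaussian likelihood × transmission probability:

```python
import numpy as np

def f(x, theta):
    # Transmission probability from Eq. (1.1), with the limit 1 where sin(x - theta) = 0.
    u = np.asarray(x, dtype=float) - theta
    s = np.sin(u)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(5 * u) ** 2 / (25 * s ** 2)
    return np.where(np.isclose(s, 0.0), 1.0, out)

def unnorm_density(theta, x=1.7, sigma=4.0):
    prior = 1.0 / (10 * np.pi * (1 + (theta / 10) ** 2))                       # Cauchy prior
    lik = np.exp(-((x - theta) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return prior * lik * f(x, theta)                                           # p(x, g=1, theta)

thetas = np.linspace(-20, 20, 2001)
vals = unnorm_density(thetas)
peak = thetas[np.argmax(vals)]
print(peak)   # the global mode sits near theta = 1.7
# plt.plot(thetas, vals) to visualize
```

The density is multimodal: f(1.7, θ) = 1 whenever 1.7 − θ is a multiple of π, so secondary modes appear at roughly π spacing under the Gaussian envelope.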
(e) [10 points] Write a Metropolis-Hastings sampler to sample from p(θ|x = 1.7, g = 1). For
the proposal distribution, use a Gaussian centered around the current sample. Fiddle with
the proposal distribution and the number of iterations of your sampler until the histogram
of samples looks roughly like the true unnormalized density. Plot the histogram of samples
from your chain.
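A minimal Metropolis–Hastings sketch (the proposal standard deviation, seed, iteration count, and σ = 4 are all assumptions to fiddle with):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, theta):
    # Transmission probability from Eq. (1.1), with the limit 1 where sin(x - theta) = 0.
    u = np.asarray(x, dtype=float) - theta
    s = np.sin(u)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(5 * u) ** 2 / (25 * s ** 2)
    return np.where(np.isclose(s, 0.0), 1.0, out)

def unnorm_density(theta, x=1.7, sigma=4.0):
    prior = 1.0 / (10 * np.pi * (1 + (theta / 10) ** 2))
    lik = np.exp(-((x - theta) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return prior * lik * f(x, theta)

def mh_chain(n_steps=50_000, prop_std=5.0, theta0=0.0):
    theta = theta0
    p_cur = unnorm_density(theta)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        prop = theta + prop_std * rng.standard_normal()
        p_prop = unnorm_density(prop)
        # Gaussian proposal is symmetric, so the acceptance ratio is p(prop) / p(cur).
        if rng.random() < p_prop / p_cur:
            theta, p_cur = prop, p_prop
        chain[t] = theta
    return chain

chain = mh_chain()
print(chain.mean(), chain.std())
# plt.hist(chain, bins=100, density=True) for the histogram
```

Only the unnormalized density is needed, since the normalization constant cancels in the acceptance ratio.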
(f) [4 points] The scientist wants to know how confident she should be, based on her observation
of the photon, that her instrument is calibrated to within 3 degrees around zero. Use samples
from your MH chain to estimate the posterior probability p(−3 < θ < 3|x = 1.7, g = 1).
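The posterior probability is simply the fraction of chain samples falling in (−3, 3). A compact self-contained sketch, re-running a chain with the same assumed settings as in part (e):

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x, theta):
    # Transmission probability from Eq. (1.1), with the limit 1 where sin(x - theta) = 0.
    u = np.asarray(x, dtype=float) - theta
    s = np.sin(u)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sin(5 * u) ** 2 / (25 * s ** 2)
    return np.where(np.isclose(s, 0.0), 1.0, out)

def unnorm_density(theta, x=1.7, sigma=4.0):
    prior = 1.0 / (10 * np.pi * (1 + (theta / 10) ** 2))
    lik = np.exp(-((x - theta) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return prior * lik * f(x, theta)

# Short Metropolis-Hastings run, then count samples inside (-3, 3).
theta, p_cur, chain = 0.0, unnorm_density(0.0), []
for _ in range(50_000):
    prop = theta + 5.0 * rng.standard_normal()
    p_prop = unnorm_density(prop)
    if rng.random() < p_prop / p_cur:
        theta, p_cur = prop, p_prop
    chain.append(theta)
chain = np.array(chain)
prob_within_3 = np.mean((chain > -3) & (chain < 3))
print(prob_within_3)
```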
2. Gradient estimators, 20 points. All else being equal, it’s useful for a gradient estimator
to be unbiased. The unbiasedness of a gradient estimator guarantees that, if we decay the step
size and run stochastic gradient descent for long enough (see Robbins & Monro), it will converge
to a local optimum.
The standard REINFORCE, or score-function, estimator is defined as:

ĝ_SF[f] = f(b) ∂/∂θ log p(b|θ),  b ∼ p(b|θ)   (2.1)
(a) [5 points] First, let’s warm up with the score function. Prove that the score function has
zero expectation, i.e. E_{p(x|θ)}[∇_θ log p(x|θ)] = 0. Assume that you can swap the derivative
and integral operators.
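A sketch of the intended argument, assuming the derivative–integral swap the problem grants:

```latex
\mathbb{E}_{p(x|\theta)}\!\left[\nabla_\theta \log p(x|\theta)\right]
  = \int p(x|\theta)\,\frac{\nabla_\theta p(x|\theta)}{p(x|\theta)}\,\mathrm{d}x
  = \nabla_\theta \int p(x|\theta)\,\mathrm{d}x
  = \nabla_\theta 1
  = 0.
```

The key step is that p(x|θ) integrates to 1 for every θ, so its derivative with respect to θ vanishes.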
(b) [5 points] Show that REINFORCE is unbiased, i.e. show that

E_{p(b|θ)}[ f(b) ∂/∂θ log p(b|θ) ] = ∂/∂θ E_{p(b|θ)}[f(b)]
(c) [5 points] Show that REINFORCE with a fixed baseline is still unbiased, i.e. show that

E_{p(b|θ)}[ (f(b) − c) ∂/∂θ log p(b|θ) ] = ∂/∂θ E_{p(b|θ)}[f(b)]

for any fixed c.
(d) [5 points] If the baseline depends on b, then REINFORCE will in general give biased
gradient estimates. Give an example where

E_{p(b|θ)}[ (f(b) − c(b)) ∂/∂θ log p(b|θ) ] ≠ ∂/∂θ E_{p(b|θ)}[f(b)]

for some function c(b), and show that it is biased.
The takeaway is that you can use a baseline to reduce the variance of REINFORCE, but not
one that depends on the current action.
3. Comparing variances of gradient estimators, 25 points. If we restrict ourselves to
consider only unbiased gradient estimators, then the main (and perhaps only) property we need
to worry about is the variance of our estimators. In general, optimizing with a lower-variance
unbiased estimator will converge faster than a high-variance unbiased estimator. However, which
estimator has the lowest variance can depend on the function being optimized. In this question,
we’ll look at which gradient estimators scale to large numbers of parameters, by computing their
variance as a function of dimension.
For simplicity, we’ll consider a toy problem. The goal will be to estimate the gradients of
the expectation of a sum of D independent one-dimensional Gaussians. Each Gaussian has unit
variance, and its mean is given by an element of the D-dimensional parameter vector θ:
f(x) = Σ_{d=1}^D x_d   (3.1)

L(θ) = E_{x∼p(x|θ)}[f(x)] = E_{x∼N(θ,I)}[ Σ_{d=1}^D x_d ]   (3.2)
(a) [4 points] As a warm-up, compute the variance of a single-sample simple Monte Carlo
estimator of the objective L(θ):

L̂_MC = Σ_{d=1}^D x_d, where each x_d ∼ iid N(θ_d, 1)   (3.3)

That is, compute V[L̂_MC] as a function of D.
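A quick empirical sanity check (the choices of D, θ, and trial count are arbitrary): since the x_d are independent with unit variance, the variance of their sum should come out to D.

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical variance of the single-sample Monte Carlo estimator of L(theta).
D, n_trials = 50, 100_000
theta = rng.normal(size=D)                       # arbitrary parameter vector
x = rng.normal(loc=theta, scale=1.0, size=(n_trials, D))
L_hat = x.sum(axis=1)                            # one MC estimate of L(theta) per row
print(L_hat.var())                               # should be close to D
```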
(b) [5 points] The score-function, or REINFORCE, estimator with a baseline has the form:

ĝ_SF[f] = [f(x) − c(θ)] ∂/∂θ log p(x|θ),  x ∼ p(x|θ)   (3.4)

Derive a closed form for this gradient estimator for the objective defined above as a deterministic function of ε, a D-dimensional vector of standard normals. Set the baseline to
c(θ) = Σ_{d=1}^D θ_d. Hint: when simplifying ∂/∂θ log p(x|θ), you shouldn’t take the derivative
through x, even if it depends on θ. To help keep track of what is being differentiated, you
can use the notation ∂/∂θ g(θ̄, θ) to denote taking the derivative only w.r.t. the second θ.
(c) [8 points] Derive the variance of the above gradient estimator. Because gradients are
D-dimensional vectors, their covariance is a D × D matrix. To make things easier, we’ll consider
only the variance of the gradient with respect to the first element of the parameter vector,
θ₁. That is, derive the scalar value V[ĝ_SF[f]₁] as a function of D. Hint: the third moment of
a standard normal is 0, and the fourth moment is 3. As a sanity check, consider the case
where D = 1.
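Carrying out the derivation from (b) with x = θ + ε gives f(x) − c(θ) = Σ_d ε_d and ∂/∂θ_d log p(x|θ) = ε_d, so the estimator is (Σ_d ε_d)·ε. A simulation can sanity-check how its first-coordinate variance scales with D (D, trial count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# First coordinate of the baselined score-function estimator: (sum_d eps_d) * eps_1.
D, n_trials = 20, 200_000
eps = rng.standard_normal((n_trials, D))
g_hat_first = eps.sum(axis=1) * eps[:, 0]
print(g_hat_first.mean(), g_hat_first.var())   # mean ~ 1; variance grows linearly with D
```

The empirical mean near 1 matches the true gradient ∂L/∂θ₁ = 1, consistent with unbiasedness.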
(d) [8 points] Next, let’s look at the gold standard of gradient estimators, the reparameterization gradient estimator, where we reparameterize x = T(θ, ε):

ĝ_REPARAM[f] = ∂f(T(θ, ε))/∂θ,  ε ∼ p(ε)   (3.5)

In this case, we can use the reparameterization x = θ + ε, with ε ∼ N(0, I). Derive this
gradient estimator for ∇_θ L(θ), and give V[ĝ_REPARAM[f]₁] as a function of D.
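For this objective the reparameterization estimator turns out to be constant in ε: since f(θ + ε) = Σ_d (θ_d + ε_d), the gradient is the all-ones vector for every draw. A sketch checking this numerically (the finite-difference step and D are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    return x.sum()

def reparam_grad(theta, eps, h=1e-4):
    # Central finite differences for d f(theta + eps) / d theta_d.
    g = np.empty_like(theta)
    for d in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[d] += h
        tm[d] -= h
        g[d] = (f(tp + eps) - f(tm + eps)) / (2 * h)
    return g

D = 20
theta = rng.normal(size=D)
grads = np.stack([reparam_grad(theta, rng.standard_normal(D)) for _ in range(200)])
print(grads.std(axis=0).max())   # the estimator does not vary with eps at all
```

The per-coordinate spread across draws is (numerically) zero, illustrating why reparameterization can beat the score-function estimator on this problem.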
4. Complete course quiz, 5 points. We’re planning to update the prerequisites for this
course next year. To help us plan, go to this course’s page on Quercus and fill out the short quiz
on which courses you took before this one, and how well they prepared you for the different parts
of this course.