Description
1. Python and Numpy Warm-up Exercises [20 pts]: This part of the homework is intended to help you practice your linear algebra and learn how to implement linear algebraic and statistical operations in Python using numpy (to which we refer in the code below as np). For each of the problems below, write a method (e.g., problem1a) that returns the answer for the corresponding problem.
In all problems, you may assume that the dimensions of the matrices and/or vectors that are given
as input are compatible for the requested mathematical operations. You do not need to perform
error-checking.
Note 1: In mathematical notation we usually start indices with j = 1. However, in numpy (and many
other programming settings), it is more natural to use 0-based array indexing. When answering the
questions below, do not worry about “translating” from 1-based to 0-based indexes. For example, if
the (i, j)th element of some matrix is requested, you can simply write A[i,j].
Note 2: To represent and manipulate vectors and matrices, please use numpy’s array class (not the
matrix class).
Note 3: While the difference between a row vector and a column vector is important when doing math,
numpy does not care about this difference as long as the array is 1-D. This means, for example, that
if you want to compute the inner product between two vectors x and y, you can just write x.dot(y)
without needing to transpose the x. If x and y are 2-D arrays, however, then it does matter whether
they are row-vectors or column-vectors, and hence you might need to transpose accordingly.
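For example (illustrative only), the following snippet shows the difference:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # 1-D array: shape (3,)
y = np.array([4.0, 5.0, 6.0])
print(x.dot(y))                 # inner product, no transpose needed -> 32.0

x2 = x.reshape(-1, 1)           # 2-D column vector: shape (3, 1)
y2 = y.reshape(-1, 1)
print(x2.T.dot(y2))             # now the transpose matters -> [[32.]]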
(a) Given two matrices A and B, compute and return an expression for A + B. [ 0 pts ]
Answer : While it is completely valid to use np.add(A, B), this is unnecessarily verbose; you really
should make use of the “syntactic sugar” provided by Python’s/numpy’s operator overloading and
just write: A + B. Similarly, you should use the more compact (and arguably more elegant)
notation for the rest of the questions as well.
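In terms of the expected function format described above, the answer could be written as follows (the exact function name is just an example):

import numpy as np

def problem1a(A, B):
    # Element-wise sum of two same-shaped numpy arrays.
    return A + B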
(b) Given matrices A, B, and C, compute and return AB − C (i.e., right-multiply matrix A by
matrix B, and then subtract C). Use dot or np.dot. [ 1 pts ]
(c) Given matrices A, B, and C, return A⊙B+C⊤, where ⊙ represents the element-wise (Hadamard)
product and ⊤ represents matrix transpose. In numpy, the element-wise product is obtained simply
with *. [ 1 pts ]
(d) Given column vectors x and y, compute the inner product of x and y (i.e., x⊤y). [ 1 pts ]
(e) Given matrix A and integer i, return the sum of all the entries in the ith row whose column index is even, i.e., ∑_{j : j is even} Aij. Do not use a loop, which in Python can be very slow. Instead use the np.sum function. [ 2 pts ]
(f) Given matrix A and scalars c, d, compute the arithmetic mean over all entries of A that are between c and d (inclusive). In other words, if S = {(i, j) : c ≤ Aij ≤ d}, then compute (1/|S|) ∑_{(i,j)∈S} Aij. Use np.nonzero along with np.mean. [ 2 pts ]
(g) Given an (n×n) matrix A and integer k, return an (n×k) matrix containing the right-eigenvectors
of A corresponding to the k eigenvalues of A with the largest absolute value. Use np.linalg.eig.
[ 2 pts ]
(h) Given a column vector x (with n components), an integer k, and positive scalars m, s, return an (n × k) matrix, each of whose columns is a sample from the multidimensional Gaussian distribution N(x + mz, sI), where z is a column vector (with n components) containing all ones and I is the identity matrix. Use either np.random.multivariate_normal or np.random.randn. [ 2 pts ]
(i) Given a matrix A with n rows, return a matrix that results from randomly permuting the
columns (but not the rows) in A. [ 2 pts]
(j) Z-scoring: Given a vector x, return a vector y such that each yi = (xi − x̄)/σ, where x̄ is the mean (use np.mean) of the elements of x and σ is the standard deviation (use np.std). [ 2 pts ]
(k) Given an n-vector x and a non-negative integer k, return an n × k matrix consisting of k copies of x. You can use numpy methods such as np.newaxis, np.atleast_2d, and/or np.repeat. [ 2 pts ]
(l) Given a k × n matrix X = [x^(1) · · · x^(n)] and a k × m matrix Y = [y^(1) · · · y^(m)], compute an n × m matrix D whose (i, j)th entry is the pairwise L2 distance dij = ∥x^(i) − y^(j)∥₂. In this problem you may not use loops. Instead, you can avail yourself of numpy objects & methods such as np.newaxis, np.atleast_3d, np.repeat, np.swapaxes, etc. (There are various ways of solving it.) Hint: from X (resp. Y), construct a 3-d matrix that contains multiple copies of each of the vectors in X (resp. Y); then subtract these 3-d matrices (a short broadcasting illustration follows below). [ 3 pts ]
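To make the hint concrete, here is a small illustration (a sketch only, not a required solution) of how np.newaxis broadcasting produces the 3-d array of differences:

import numpy as np

k, n, m = 4, 3, 5
X = np.random.randn(k, n)   # columns are x^(1), ..., x^(n)
Y = np.random.randn(k, m)   # columns are y^(1), ..., y^(m)

# Insert new axes so that subtraction broadcasts to shape (k, n, m):
# X[:, :, np.newaxis] has shape (k, n, 1); Y[:, np.newaxis, :] has shape (k, 1, m).
diff = X[:, :, np.newaxis] - Y[:, np.newaxis, :]   # diff[:, i, j] = x^(i) - y^(j)
print(diff.shape)                                  # (k, n, m)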
2. Training 2-Layer Linear Neural Networks with Stochastic Gradient Descent [25 pts]:
(a) Train an age regressor that analyzes a (48 × 48 = 2304)-pixel grayscale face image and outputs a real number ŷ that estimates how old the person is (in years). The training and testing data are
available here:
• https://s3.amazonaws.com/jrwprojects/age_regression_Xtr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_ytr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_Xte.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_yte.npy
Your prediction model g should be a 2-layer linear neural network that computes ŷ = g(x; w) = x⊤w + b, where w is the vector of weights and b is the bias term. The cost function you should optimize is

f_MSE(w, b) = (1/(2n)) ∑_{i=1}^{n} (ŷ^(i) − y^(i))²

where n is the number of examples in the training set D_tr = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, each x^(i) ∈ R^2304 and each y^(i) ∈ R. To optimize the weights, you should implement stochastic gradient descent (SGD).
Note: you must complete this problem using only linear algebraic operations in numpy – you may
not use any off-the-shelf linear regression or neural network software, as that would defeat the
purpose.
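For orientation only, here is a minimal sketch of what a single mini-batch SGD update could look like under the definition of f_MSE above; the variable names, batch size, and learning rate are illustrative assumptions, not prescribed values:

import numpy as np

# Illustrative sketch of one mini-batch SGD step (not a complete solution).
X_batch = np.random.randn(32, 2304)   # stand-in mini-batch of flattened images
y_batch = np.random.randn(32)         # stand-in ages
w = np.zeros(2304)                    # weights
b = 0.0                               # bias
epsilon = 1e-3                        # learning rate (a hyperparameter)

y_hat = X_batch.dot(w) + b            # predictions for the mini-batch
residual = y_hat - y_batch
grad_w = X_batch.T.dot(residual) / len(y_batch)   # gradient of f_MSE w.r.t. w
grad_b = residual.mean()                          # gradient of f_MSE w.r.t. b
w = w - epsilon * grad_w
b = b - epsilon * grad_b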
There are several different hyperparameters that you will need to optimize:
• Mini-batch size ñ.
• Learning rate ϵ.
• Number of epochs.
In order not to cheat (in the machine learning sense) – and thus overestimate the performance
of the network – it is crucial to optimize the hyperparameters only on a validation set. (The
training set would also be acceptable but typically leads to worse performance.) To create
a validation set, simply set aside a fraction (e.g., 20%) of age_regression_Xtr.npy and age_regression_ytr.npy to be the validation set; the remainder (80%) of these data files will
constitute the “actual” training data. While there are fancier strategies (e.g., Bayesian optimization) that can be used for hyperparameter optimization, it’s often effective to just use a grid
search over a few values for each hyperparameter. In this problem, you are required to explore
systematically (e.g., using nested for loops) at least 2 different values for each hyperparameter.
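For illustration, a grid search driven by nested for loops might be structured roughly as follows; the specific values and the train_and_evaluate helper are hypothetical placeholders:

import numpy as np

def train_and_evaluate(batch_size, epsilon, num_epochs):
    # Hypothetical placeholder: in the real homework this would run SGD on the
    # training split and return f_MSE on the validation split.
    return np.random.rand()

best_cost, best_hparams = float('inf'), None
for batch_size in [32, 64]:              # mini-batch size
    for epsilon in [1e-4, 1e-3]:         # learning rate
        for num_epochs in [50, 100]:     # number of epochs
            val_cost = train_and_evaluate(batch_size, epsilon, num_epochs)
            if val_cost < best_cost:
                best_cost, best_hparams = val_cost, (batch_size, epsilon, num_epochs)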
Performance evaluation: Once you have tuned the hyperparameters and optimized the weights
and bias term so as to minimize the cost on the validation set, then: (1) stop training the network
and (2) evaluate the network on the test set. Report both the training and the test f_MSE in the
PDF document you submit, as well as the training cost values for the last 10 iterations of gradient
descent (just to show that you actually executed your code).
3. Gradient descent: what can go wrong? [30 pts] Please enter your code in a file named homework1_problem3.py, with one Python function (e.g., problem3a) for each subproblem.
(a) [10 pts]: The graph below plots the function f(x) that is defined piece-wise as:

f(x) =
    −x³ : x < −0.1
    −3x/100 − 1/500 : −0.1 ≤ x < 3
    −(x − 31/10)³ − 23/250 : 3 ≤ x < 5
    (1083/200)(x − 6)² − 6183/500 : x ≥ 5
[Figure: plot of f(x) for x from −4 to 10; the vertical axis ranges from 0 to about 60.]
As you can see, the function has a long, nearly flat section (sometimes known as a plateau) just before the minimum. (The ugly constants in f were chosen to give rise to these characteristics while ensuring that it remains differentiable.) Plateaux can cause big problems during optimization. To show this:
i. Derive by hand the (piece-wise) function ∇f and implement it in Python/numpy.
ii. Use your implementation of ∇f to conduct gradient descent for T = 100 iterations. Always start from an initial x = −3. Try using various learning rates: 1e-3, 1e-2, 1e-1, 1e0, 1e1. Plot f and ∇f, as well as superimposed dots showing the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent (a generic descent-loop sketch appears after this list). Use plt.legend to indicate which scatter plot corresponds to which learning rate.
iii. Describe in 1-2 sentences what you observe during gradient descent for the set of learning
rates listed above.
iv. Find a learning rate ϵ for which gradient descent successfully converges to min f(x), and
report ϵ in the PDF file.
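Below is the generic descent-loop sketch referenced in (ii); grad_f here is only a stand-in and must be replaced by your own piece-wise derivative from (i):

import numpy as np

def grad_f(x):
    # Placeholder gradient; replace with your piece-wise derivation of grad f from (i).
    return 2.0 * x

T = 100
epsilon = 1e-2            # one of the learning rates to try
x = -3.0                  # required starting point
xs = []
for t in range(T):
    xs.append(x)
    x = x - epsilon * grad_f(x)
# xs now holds x^(1), ..., x^(T); pair with f(x) values to get the dots to plot.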
(b) [8 pts]: Even a convex paraboloid – i.e., a parabola in multiple dimensions that has only one local minimum and no plateaux – can cause problems for “vanilla” SGD (i.e., the kind we’ve learned in class so far). Examine the scatter-plot below, which shows the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent on a convex paraboloid f, starting at x^(1) = [1, −3]⊤, where each x ∈ R². The descent produces a zig-zag pattern that takes a long time to converge to the local minimum.
[Figure: scatter-plot of the gradient descent sequence in the (x1, x2) plane, spanning roughly x1 ∈ [−4, 4] and x2 ∈ [−3, 4].]
i. Speculate how the SGD trajectory would look if the learning rate were made to be very small
(e.g., 100x smaller than in the figure above).
ii. Let f(x1, x2) = a1(x1 − c1)² + a2(x2 − c2)². Pick values for a1, a2, c1, c2 so that – when the scatter-plot shown above is superimposed onto it – the gradient descent is realistic for f. Rather than just guess randomly, consider: why would the zig-zag be stronger in one dimension than another, and how would this be reflected in the function’s constants? Plot a contour graph using plt.contour and superimpose the scatter-plot using plt.scatter (a minimal plotting sketch appears after this list). You can find the gradient descent sequence in gradient_descent_sequence.txt. Note that you are not required to find the exactly correct constants or even to estimate them algorithmically. Rather, you can combine mathematical intuition with some trial-and-error. Your solution should just look visually “pretty close” to get full credit. Note: to ensure proper rendering, use plt.axis('equal') right before calling plt.show().
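Below is the minimal plotting sketch referenced in (ii); the constants and axis ranges are arbitrary placeholders, and the two-column format of gradient_descent_sequence.txt is an assumption:

import numpy as np
import matplotlib.pyplot as plt

a1, a2, c1, c2 = 1.0, 1.0, 0.0, 0.0                 # placeholder constants -- pick your own
seq = np.loadtxt('gradient_descent_sequence.txt')    # columns assumed to be x1, x2

x1, x2 = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-4, 5, 200))
f = a1 * (x1 - c1) ** 2 + a2 * (x2 - c2) ** 2

plt.contour(x1, x2, f)                # contour graph of f
plt.scatter(seq[:, 0], seq[:, 1])     # superimposed descent sequence
plt.axis('equal')                     # as required, right before plt.show()
plt.show()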
(c) [6 pts]: This problem is inspired by this paper. Consider the function f(x) = (2/3)|x|^(3/2). Derive ∇f, implement gradient descent, and plot the descent trajectories ((x^(1), y^(1)), . . . , (x^(T), y^(T))) for a variety of learning rates (1e-3, 1e-2, 1e-1) and a variety of starting points. See what trend emerges, and report it in the PDF.
(d) [6 pts]: While very (!) unlikely, it is theoretically possible for gradient descent to converge to a
local maximum. Give the formula of a function (in my own solution, I used a degree-4 polynomial),
a starting point, and a learning rate such that gradient descent converges to a local maximum
after 1 descent iteration (i.e., after 1 iteration, it reaches the local maximum, and the gradient
is exactly 0). Prove (by deriving the exact values for the descent trajectory) that this is true.
You do not need to implement this in code (and, in fact, due to finite-precision floating-point
arithmetic, it might not actually converge as intended).
Submission: Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If
you are working as part of a group, then only one member of your group should submit (but make sure you
have already signed up in a pre-allocated team for the homework on Canvas).