Description
1. Linear Regression/SVD.
Throughout this problem let $X \in \mathbb{R}^{n \times d}$ with $i$-th row $x_i \in \mathbb{R}^d$. Suppose $X = \sum_{i=1}^{r} s_i u_i v_i^\top$ is its singular value decomposition, where $u_i \in \mathbb{R}^n$ are left singular vectors, $v_i \in \mathbb{R}^d$ are right singular vectors, $s_i > 0$ are singular values, and $r = \operatorname{rank}(X)$.
(a) [hw1] Let the dataset consist of $n_i > 0$ copies of the standard basis vector $e_i \in \mathbb{R}^d$ for each $i \in \{1, \ldots, d\}$, with labels $(y_{ij})_{j=1}^{n_i}$ and $\sum_{i=1}^{d} n_i = n$. Equivalently, the training set is
\[
\bigl\{ (e_i, y_{ij}) : i \in \{1, \ldots, d\},\ j \in \{1, \ldots, n_i\} \bigr\}.
\]
Show that for a vector $w$ that minimizes the empirical risk under squared loss, the components $w_i$ of $w$ are the averages of the labels $(y_{ij})_{j=1}^{n_i}$: $w_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$.
Hint: Write out the expression for the empirical risk with the squared loss and set the gradient equal
to zero.
Remark: This illustrates the classic “regression toward the mean” phenomenon.
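As a possible first step for the hint (an illustration, not a full solution): with this dataset and the $\frac{1}{2n}$ normalization used in Problem 2, the empirical risk can be written as
\[
\widehat{R}(w) = \frac{1}{2n} \sum_{i=1}^{d} \sum_{j=1}^{n_i} \bigl(w^\top e_i - y_{ij}\bigr)^2 = \frac{1}{2n} \sum_{i=1}^{d} \sum_{j=1}^{n_i} \bigl(w_i - y_{ij}\bigr)^2,
\]
so each coordinate $w_i$ appears only in its own group of terms and can be optimized separately (any positive scaling of the risk yields the same minimizer).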
(b) [hw1] Returning to a general matrix $X$, show that if the label vector $y$ is a linear combination of $\{u_i\}_{i=1}^{r}$ then there exists a $w$ for which the empirical risk is zero (meaning $Xw = y$).
Hint: Either consider the range of $X$ and use the SVD, or compute the empirical risk explicitly with $y = \sum_{i=1}^{r} a_i u_i$ for some constants $a_i$ and $\hat{w}_{\mathrm{ols}} = X^{+} y$.
Remark: It's also not hard to show that if $y$ is not a linear combination of $\{u_i\}_{i=1}^{r}$, then the empirical risk must be nonzero.
(c) [hw1] Show that $X^\top X$ is invertible if and only if $(x_i)_{i=1}^{n}$ spans $\mathbb{R}^d$.
Hint: Recall that the squares of the singular values of $X$ are eigenvalues of $X^\top X$.
Remark: This characterizes when linear regression has a unique solution via the normal equations (note that we always have at least one solution, obtained by the pseudoinverse). We would not have had a unique solution for part (a) if some $n_i = 0$.
(d) [hw1] Provide a matrix $X$ such that $X^\top X$ is invertible and $XX^\top$ is not. Include a formal verification of this for full points.
Hint: Use part (c). It may be helpful to think about conditions under which a matrix is not invertible.
Solution.
2. Linear Regression.
Recall that the empirical risk in the linear regression method is defined as $\widehat{R}(w) := \frac{1}{2n} \sum_{i=1}^{n} (w^\top x_i - y_i)^2$, where $x_i \in \mathbb{R}^d$ is a data point and $y_i$ is an associated label.
(a) [hw1code] Implement linear regression using gradient descent in the linear_gd(X, Y, lrate, num_iter) function of hw1.py. You are given as input a training set X as an $n \times d$ tensor, training labels Y as an $n \times 1$ tensor, a learning rate lrate, and the number of iterations of gradient descent to run, num_iter. Using gradient descent, find parameters $w$ that minimize the empirical risk $\widehat{R}(w)$. Use $w = 0$ as your initial parameters, and return your final $w$ as output. Prepend a column of ones to X in order to accommodate a bias term in $w$. Return $w \in \mathbb{R}^{(d+1) \times 1}$ with the bias as the first entry.
Library routines: torch.matmul (@), X.shape, X.t(), torch.cat, torch.ones, torch.zeros, torch.reshape.
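For orientation, here is a minimal sketch of one way linear_gd could look; the default values for lrate and num_iter are assumptions, and the gradient is the one obtained from $\widehat{R}(w)$ above.

import torch

def linear_gd(X, Y, lrate=0.01, num_iter=1000):
    # Prepend a column of ones so that w[0] acts as the bias term.
    n, d = X.shape
    Xb = torch.cat([torch.ones(n, 1), X], dim=1)   # n x (d+1)
    w = torch.zeros(d + 1, 1)                      # initialize w = 0
    for _ in range(num_iter):
        residual = Xb @ w - Y                      # n x 1
        grad = Xb.t() @ residual / n               # gradient of (1/(2n)) * sum of squared errors
        w = w - lrate * grad
    return w                                       # (d+1) x 1, bias first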
(b) [hw1code] Implement linear regression by using the pseudoinverse to solve for $w$ in the linear_normal(X, Y) function of hw1.py. You are given a training set X as an $n \times d$ tensor and training labels Y as an $n \times 1$ tensor. Return your parameters $w$ as output. As before, make sure to accommodate a bias term by prepending ones to the training examples X. Return $w \in \mathbb{R}^{(d+1) \times 1}$ with the bias as the first entry.
Library routines: torch.matmul (@), torch.cat, torch.ones, torch.pinverse.
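A minimal sketch of the closed-form version (an illustration of the expected input/output shapes, not necessarily the required code):

import torch

def linear_normal(X, Y):
    # Augment with a ones column for the bias; the pseudoinverse solution
    # w = (X_b)^+ Y minimizes the squared-loss empirical risk.
    Xb = torch.cat([torch.ones(X.shape[0], 1), X], dim=1)
    return torch.pinverse(Xb) @ Y                  # (d+1) x 1, bias first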
(c) [hw1] Implement the plot_linear() function in hw1.py. Use the provided function hw1_utils.load_reg_data() to generate a training set X and training labels Y. Plot the curve generated by linear_normal() along with the points from the data set. Return the plot as output. Include the plot in your written submission.
Library routines: torch.matmul (@), torch.cat, torch.ones, plt.plot, plt.scatter, plt.show, plt.gcf, where plt refers to the matplotlib.pyplot library.
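One possible shape for plot_linear, under the assumption that the data returned by hw1_utils.load_reg_data() is one-dimensional ($d = 1$); the exact plotting style is up to you:

import matplotlib.pyplot as plt
import torch
import hw1_utils

def plot_linear():
    X, Y = hw1_utils.load_reg_data()
    Xb = torch.cat([torch.ones(X.shape[0], 1), X], dim=1)
    w = torch.pinverse(Xb) @ Y                     # same fit as linear_normal()
    preds = Xb @ w
    order = torch.argsort(X[:, 0])                 # sort so the fitted curve is drawn left to right
    plt.scatter(X[:, 0], Y[:, 0])
    plt.plot(X[order, 0], preds[order, 0], color="red")
    fig = plt.gcf()
    plt.show()
    return fig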
Solution.
3. Polynomial Regression.
In Problem 2 you constructed a linear model $w^\top x = \sum_{i=1}^{d} x_i w_i$. In this problem you will use the same setup as in the previous problem, but enhance your linear model by doing a quadratic expansion of the features. Namely, you will construct a new linear model $f_w$ with parameters
\[
(w_0, w_{01}, \ldots, w_{0d}, w_{11}, w_{12}, \ldots, w_{1d}, w_{22}, w_{23}, \ldots, w_{2d}, \ldots, w_{dd})^\top,
\]
defined by
\[
f_w(x) = w^\top \phi(x) = w_0 + \sum_{i=1}^{d} w_{0i} x_i + \sum_{1 \le i \le j \le d} w_{ij} x_i x_j .
\]
Warning: If the computational complexity of your implementation is high, it may crash the autograder
(try to optimize your algorithm if it does)!
(a) [hw1] Given a 3-dimensional feature vector $x = (x_1, x_2, x_3)$, completely write out the quadratically expanded feature vector $\phi(x)$.
(b) [hw1code] Implement the poly_gd() function in hw1.py. The input is in the same format as it was in Problem 2. Implement gradient descent on this training set with $w$ initialized to 0. Return $w$ as the output with terms in this exact order: bias, linear, then quadratic. For example, if $d = 3$ then you would return $(w_0, w_{01}, w_{02}, w_{03}, w_{11}, w_{12}, w_{13}, w_{22}, w_{23}, w_{33})$.
Library routines: torch.cat, torch.ones, torch.zeros, torch.stack.
Hint (feature order, important for autograder): Prepend a column of ones to X, then append the quadratic features (including cross terms). Use the following exact order:
\[
\phi(x) = \bigl[\, 1,\ x_1, \ldots, x_d,\ x_1^2,\ x_1 x_2, \ldots, x_1 x_d,\ x_2^2,\ x_2 x_3, \ldots, x_2 x_d,\ \ldots,\ x_d^2 \,\bigr]^\top .
\]
Equivalently, generate terms in lexicographic order over index pairs $(i, j)$ with $1 \le i \le j \le d$: $(1,1), (1,2), \ldots, (1,d), (2,2), (2,3), \ldots, (d,d)$. Do not include duplicates (e.g., include $x_1 x_2$ but not $x_2 x_1$). The autograder expects this exact ordering.
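To illustrate that ordering only (an illustrative helper under the assumptions above, not necessarily how your poly_gd must be structured), a minimal sketch of the feature expansion:

import torch

def quadratic_features(X):
    # X: n x d. Output: n x (1 + d + d(d+1)/2) with columns in the order
    # [1, x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_1 x_d, x_2^2, ..., x_d^2].
    n, d = X.shape
    cols = [torch.ones(n, 1), X]
    for i in range(d):
        for j in range(i, d):                      # only i <= j, so no duplicate cross terms
            cols.append((X[:, i] * X[:, j]).reshape(n, 1))
    return torch.cat(cols, dim=1)

Gradient descent (part (b)) and the pseudoinverse (part (c)) can then both be run on this transformed matrix.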
(c) [hw1code] Implement the poly_normal function in hw1.py. You are given the same data set as in part (b), but this time determine $w$ by using the pseudoinverse. Return $w$ in the same order as in part (b): $w \in \mathbb{R}^{\left(1 + d + \frac{d(d+1)}{2}\right) \times 1}$ with the bias first, then linear, then quadratic terms.
Library routines: torch.pinverse.
Hint: You will still need to transform the matrix X in the same way as in part (b).
(d) [hw1] Implement the plot_poly() function in hw1.py. Use the provided function hw1_utils.load_reg_data() to generate a training set X and training labels Y. Plot the curve generated by poly_normal() along with the points from the data set. Return the plot as output and include it in your written submission. Compare and contrast this plot with the plot from Problem 2. Which model appears to approximate the data better? Justify your answer.
Library routines: plt.plot, plt.scatter, plt.show, plt.gcf.
(e) [hw1] The Minsky-Papert XOR problem is a classification problem with data set
\[
X = \{(-1, +1),\ (+1, -1),\ (-1, -1),\ (+1, +1)\},
\]
where the label for a given point $(x_1, x_2)$ is given by its product $x_1 x_2$. For example, the point $(-1, +1)$ would be given label $y = (-1)(+1) = -1$. Implement the poly_xor() function in hw1.py. In this function you will load the XOR data set by calling the hw1_utils.load_xor_data() function, and then apply the linear_normal() and poly_normal() functions to generate predictions for the XOR points. Include a plot of contour lines that show how each model classifies points in your written submission. Return the predictions for both the linear model and the polynomial model, and use contour_plot() in hw1_utils.py to help with the plot. Do both models correctly classify all points?
(Note that red corresponds to larger values and blue to smaller values when using contour_plot with the "coolwarm" colormap.)
Using contour_plot() effectively: Define a small prediction function that maps an $(n \times d)$ tensor of grid points to an $(n \times 1)$ tensor of scores, then pass it to contour_plot. For a linear model with parameters $w \in \mathbb{R}^{(d+1) \times 1}$, one convenient choice is
\[
\mathrm{pred\_lin}(Z) = \begin{bmatrix} \mathbf{1} & Z \end{bmatrix} w, \quad \text{where } Z \in \mathbb{R}^{n \times d} \text{ and } \mathbf{1} \in \mathbb{R}^{n \times 1}.
\]
For the quadratic model, use your feature map $\phi(Z)$ constructed in the order specified above and compute $\mathrm{pred\_poly}(Z) = \phi(Z)\, w$. The decision boundary corresponds to the 0-level contour; you can emphasize it by supplying a levels argument (e.g., levels=[0]) to draw just that contour.
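A minimal sketch of such prediction functions (the call to contour_plot itself is omitted, since its exact signature is given in hw1_utils.py; w_lin and w_poly are assumed to be the parameter vectors returned by linear_normal() and poly_normal(), and feature_map is your quadratic expansion):

import torch

def make_pred_lin(w_lin):
    # w_lin: (d+1) x 1 with the bias first.
    def pred_lin(Z):
        ones = torch.ones(Z.shape[0], 1)
        return torch.cat([ones, Z], dim=1) @ w_lin
    return pred_lin

def make_pred_poly(w_poly, feature_map):
    # feature_map(Z) must produce columns in exactly the order used to fit w_poly.
    def pred_poly(Z):
        return feature_map(Z) @ w_poly
    return pred_poly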
Hint: A "contour plot" is a way to represent a 3-dimensional surface in a 2-D figure. In this example, the data points are pinned to the figure with their features $(x_1, x_2)$ as the coordinates in 2-D space (i.e., the x and y axes); the third dimension (e.g., the predictions of the data points) is labeled on the points in the figure. The lines or curves that link the grid points with the same predictions together are called the "contours". See contour_plot() in hw1_utils.py for details.
Solution.
4. Logistic Regression.
Recall the empirical risk $\widehat{R}$ for logistic regression (as presented in lecture 3). Throughout this problem, assume labels $y_i \in \{-1, +1\}$:
\[
\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln\bigl(1 + \exp(-y_i w^\top x_i)\bigr).
\]
Here you will minimize this risk using gradient descent.
(a) [hw1] In your written submission, derive the gradient descent update rule for this empirical risk by taking the gradient. Write your answer in terms of the learning rate $\eta$, previous parameters $w$, new parameters $w'$, number of examples $n$, and training examples $(x_i, y_i)$. Show all of your steps.
(b) [hw1code] Implement the logistic() function in hw1.py. You are given as input a training set X, training labels Y, a learning rate lrate, and number of gradient updates num_iter. Implement gradient descent to find parameters $w$ that minimize the empirical risk $\widehat{R}_{\log}(w)$. Perform gradient descent for num_iter updates with a learning rate of lrate, initializing $w = 0$ and returning $w$ as output. Don't forget to prepend X with a column of ones. Return $w \in \mathbb{R}^{(d+1) \times 1}$ with the bias as the first entry.
Library routines: torch.matmul (@), X.t(), torch.sigmoid, torch.nn.functional.softplus (or torch.logaddexp), torch.exp.
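For orientation, a minimal sketch of the loop; the gradient shown is one standard form for this loss, so verify it against your own derivation from part (a), and treat the default arguments as assumptions:

import torch

def logistic(X, Y, lrate=0.01, num_iter=1000):
    # Y holds labels in {-1, +1}; prepend ones so that w[0] is the bias.
    n, d = X.shape
    Xb = torch.cat([torch.ones(n, 1), X], dim=1)
    w = torch.zeros(d + 1, 1)
    for _ in range(num_iter):
        margins = Y * (Xb @ w)                     # n x 1, entries y_i * w^T x_i
        # d/dw ln(1 + exp(-m_i)) contributes -sigmoid(-m_i) * y_i * x_i; average over i.
        coeff = -torch.sigmoid(-margins) * Y       # n x 1
        w = w - lrate * (Xb.t() @ coeff / n)
    return w                                       # (d+1) x 1, bias first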
(c) [hw1] Implement the logistic_vs_ols() function in hw1.py. Use hw1_utils.load_logistic_data() to generate a training set X and training labels Y. Run logistic(X, Y) from part (b), taking X and Y as input, to obtain parameters $w$. Also run linear_gd(X, Y) from Problem 2 to obtain parameters $w$. Plot the decision boundaries for your logistic regression and least squares models along with the data X. Which model appears to classify the data better? Explain why you believe your choice is the better classifier for this problem.
Library routines: torch.linspace, plt.scatter, plt.plot, plt.show, plt.gcf.
Hints:
• The positive and negative points are guaranteed to be linearly separable (though an algorithm
may or may not find the optimal line to separate them).
• Decision boundary with bias: let $z = \begin{bmatrix} 1 \\ x \end{bmatrix}$ denote the augmented feature and let $w$ include the bias; the boundary is $\{z : w^\top z = 0\}$. When plotting in 2D with $w = (w_0, w_1, w_2)^\top$, draw the line $x_2 = -(w_0 + w_1 x_1)/w_2$ for $|w_2| > 0$, or a vertical line at $x_1 = -w_0/w_1$ if $|w_2|$ is tiny (see the sketch after these hints).
• Average the gradient over examples (divide by n); do not sum.
• In order to make the two models significantly different, we recommend that you train the logistic regression with a large num_iter (e.g., 1,000,000 or even larger).
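A minimal sketch of drawing one decision boundary line, assuming a weight vector w of shape (3, 1) with the bias first (plot_boundary is a hypothetical helper name, not part of the assignment interface):

import matplotlib.pyplot as plt
import torch

def plot_boundary(w, x1_min=-5.0, x1_max=5.0, label="boundary"):
    # Solve w0 + w1*x1 + w2*x2 = 0 for x2 along a grid of x1 values.
    w0, w1, w2 = w[0].item(), w[1].item(), w[2].item()
    x1 = torch.linspace(x1_min, x1_max, 100)
    if abs(w2) > 1e-8:
        plt.plot(x1, -(w0 + w1 * x1) / w2, label=label)
    else:
        plt.axvline(x=-w0 / w1, label=label)       # nearly vertical boundary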
Solution.
5. N-Gram Next Token Prediction.
Recall the empirical risk $\widehat{R}$ for cross entropy (as presented in lecture 2). In this problem, let $X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times k}$, and labels $Y \in \{0, \ldots, k-1\}^{n \times 1}$. We consider a linear predictor $f : \mathbb{R}^d \to \mathbb{R}^k$ given by $f(x) = W^\top x$ (no bias term):
\[
\widehat{R}_{\mathrm{CE}}(W) = -\frac{1}{n} \sum_{i=1}^{n} \ln\!\left( \frac{\exp(f_{y_i}(x_i))}{\sum_{j=1}^{k} \exp(f_j(x_i))} \right).
\]
You will minimize this risk using gradient descent, and apply it to next token prediction.
(a) [hw1] In your written submission, derive the gradient descent update rule for this empirical risk by taking the gradient. Write your answer in terms of the learning rate $\eta$, previous parameters $W$, new parameters $W'$, number of examples $n$, training examples $(x_i, y_i)$, and probabilities $p(c \mid x_i) = \frac{\exp(f_c(x_i))}{\sum_{j=1}^{k} \exp(f_j(x_i))}$. Show all of your steps.
(b) [hw1code] Implement the cross_entropy() function in hw1.py. You are given as input a training set X ($n \times d$), training labels Y (integer class indices in $\{0, \ldots, k-1\}$, shaped $n \times 1$), the number of classes k, a learning rate lrate, and the number of gradient updates num_iter. Implement gradient descent to find parameters $W \in \mathbb{R}^{d \times k}$ that minimize the empirical risk $\widehat{R}_{\mathrm{CE}}(W)$. Perform gradient descent for num_iter updates with a learning rate of lrate, initializing $W = 0$ and returning $W$ as output. Unlike previous problems, do not incorporate a bias term. Average the gradient over the batch (divide by $n$); do not sum. Construct one-hot targets via either torch.nn.functional.one_hot(Y.view(-1), num_classes=k).float() or torch.zeros(n, k).scatter_(1, Y, 1.0).
Library routines: torch.matmul (@), X.t(), torch.softmax; optionally torch.logsumexp / torch.log_softmax for stability.
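For orientation, a minimal sketch using the standard softmax cross-entropy gradient $\frac{1}{n} X^\top (P - T)$, where $P$ holds the row-wise softmax probabilities and $T$ the one-hot targets; check this against your derivation from part (a), and treat the default arguments as assumptions:

import torch

def cross_entropy(X, Y, k, lrate=0.01, num_iter=1000):
    # X: n x d, Y: n x 1 integer class indices, W: d x k, no bias term.
    n, d = X.shape
    W = torch.zeros(d, k)
    T = torch.zeros(n, k).scatter_(1, Y.long(), 1.0)   # one-hot targets
    for _ in range(num_iter):
        P = torch.softmax(X @ W, dim=1)                # n x k predicted probabilities
        W = W - lrate * (X.t() @ (P - T) / n)          # gradient averaged over the batch
    return W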
(c) [hw1code] Implement the get_ntp_weights() function in hw1.py. You are given as input the context size n and the embedding dimension embedding_dim. We will use a small subset of the TinyStories dataset (Eldan and Li, 2023) as our text sample, which you can extract using hw1_utils.load_ntp_data(). This function returns the text data split into tokens, tokenized_data (our tokenizer treats each word as a token); the list of all tokens in order of their id, sorted_words; and the inverse mapping from token to id, word_to_idx. You will then need to create the appropriate N-gram training data from tokenized_data. To do so, for every list of words in tokenized_data (which is a list of lists of words), create a training sample from every window of n + 1 consecutive words (the first n words are the context, and the last word is the target). Given a list of $w$ words, you should create exactly $w - n$ training samples (assuming $w \ge n$, of course); skip sentences shorter than n + 1. Use hw1_utils.load_random_embeddings() to get random feature embeddings for each word in the vocabulary. Run cross_entropy(X, Y, k) from part (b), taking X, Y, and the vocabulary size k as input, to obtain parameters W. Use the default learning rate and number of iterations.
Library routines: torch.stack.
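One possible way to assemble the training pairs, under the assumption that the embeddings object returned by hw1_utils.load_random_embeddings() can be indexed by token id to give a 1-D tensor of length embedding_dim (build_ngram_data is a hypothetical helper name):

import torch

def build_ngram_data(tokenized_data, word_to_idx, embeddings, n):
    # Each window of n+1 consecutive words yields one sample: the concatenated
    # embeddings of the first n words (context) and the id of the last word (target).
    xs, ys = [], []
    for words in tokenized_data:
        if len(words) < n + 1:
            continue                                   # skip sentences that are too short
        for start in range(len(words) - n):
            context_ids = [word_to_idx[w] for w in words[start:start + n]]
            xs.append(torch.cat([embeddings[i] for i in context_ids]))
            ys.append(word_to_idx[words[start + n]])
    X = torch.stack(xs)                                # num_samples x (n * embedding_dim)
    Y = torch.tensor(ys).reshape(-1, 1)                # num_samples x 1
    return X, Y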
(d) [hw1code] Implement the generate_text() function in hw1.py. You are given as input the parameters W from the get_ntp_weights() function in part (c), the context size n, the number of additional tokens to generate num_tokens, the embedding dimension embedding_dim, and an initial context string context. Use hw1_utils.load_ntp_data() and hw1_utils.load_random_embeddings() to get the rest of the necessary data. Greedy decoding: assume the context contains at least n tokens; if it is longer, use the last n tokens at each step. At each step, map tokens to indices, concatenate the embeddings of the last n indices to form $x \in \mathbb{R}^d$ with $d = n \cdot$ embedding_dim, compute logits $W^\top x \in \mathbb{R}^k$, take the argmax to pick the next token index, append it, and repeat. Return a string containing the initial context as well as all of the generated words, with each word separated by a space.
Library routines: torch.argmax.
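A minimal sketch of the greedy decoding loop, reusing the embedding/indexing assumptions from the previous sketch (sorted_words maps an id back to its token; greedy_decode is a hypothetical helper name):

import torch

def greedy_decode(W, n, num_tokens, context, word_to_idx, sorted_words, embeddings):
    # Greedy decoding: repeatedly score the last n tokens and append the argmax token.
    tokens = context.split()
    for _ in range(num_tokens):
        ids = [word_to_idx[w] for w in tokens[-n:]]                  # last n tokens
        x = torch.cat([embeddings[i] for i in ids]).reshape(-1, 1)   # (n * embedding_dim) x 1
        logits = W.t() @ x                                           # k x 1 scores
        next_id = torch.argmax(logits).item()
        tokens.append(sorted_words[next_id])
    return " ".join(tokens)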
(e) [hw1] Try generating at least 5 strings with different initial context strings using the generate_text() function from part (d) (use n = 4, num_tokens ≥ 20, and embedding_dim = 10), and include the results here. Do you notice anything unusual about the generated strings? Why do you think this happens?
Solution.
6. LLM Use, collaboration, and other sources.
[hw1] Please document, in detail, all your sources, including LLMs, friends, internet resources, etc.
For example:
1a. I asked my friend, then I found a different way to derive the same solution.
1b. GPT-5 solved the problem in one shot, but then I rewrote it once on paper, and a few days later
tried to re-derive an answer from scratch.
1c. I accidentally found this via a google search, and had trouble forgetting the answer I found, but still
typed it from scratch without copy-paste.
1d. ...
⋮
6. I used my solution to problem 5 to write this answer.
Answer.
References
Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? 2023. URL https://arxiv.org/abs/2305.07759.

