Description
Linear Regression – closed form solution
Now you should be ready to implement Linear Regression. First, recall the equation for our general linear
model in matrix/vector notation
hθ(x) = x
>θ
where we’re using convention that x0 = 1 (we prepend 1 to our feature vector). This tells us how to use θ to
predict y for a given x. To actually find θ we can construct a system of linear equations out of our training
data like so:
X =
x
(1)
0 x
(1)
1
· · · x
(1)
D
x
(2)
0 x
(2)
1
· · · x
(2)
D
.
.
.
.
.
.
x
(N)
0 x
(N)
1
· · · x
(N)
D
=
— x
(1)> —
— x
(2)> —
.
.
.
— x
(N)> —
y =
y
(1)
y
(2)
.
.
.
y
(N)
y = Xθ
which we can use to find an estimate for the parameters, ˆθ:
ˆθ =
X>X
−1
X>y (1)
Note that since this is just a bunch of matrix multiplications, transposes, and one matrix inverse, we can
actually code this as a single line using numpy, although you can break it out into multiple steps to make it
obvious and easier to debug.
Look at the documentation for numpy.dot(…), numpy.transpose(…), and
numpy.linalg.inv(…), play around with them for simple 2-by-2 matrices until you understand them, then
use them to build up equation 1 in the function linreg closed form(…). You can test your work on the
provided datasets or data you make yourself, and compare with the results that numpy.linalg.lstsq(…)
produces. (Note: to get credit for this part be sure to actually implement the equation, don’t use the output of
numpy.linalg.lstsq(…)).
You can also visualize the model you are finding with the vis linreg model(…)
function.
Gradient Descent
What happens when we have a lot of data points or a lot of features? Notice we’re computing
X>X
−1
which becomes computationally expensive as that matrix gets larger. In the section after this we’re going
to need to be able to compute the solution for some really large matrices, so we’re going to need a method
that doesn’t compute this inverse.
Gradient Descent is one such method, one that’s used in a variety of algorithms in ML. The basic outline
of the gradient descent method is to “follow” the gradient of the loss function, ∇θLS(hθ), to a minimum for
the parameters, θ. First lets define the loss function as the squared-loss:
LS(hθ) = 1
2N
X
N
i=1
hθ(x
(i)
) − y
(i)
2
(2)
Implement equation 2 in the function loss(…). You can check that it works by using numpy.linalg.lstsq(…)
on the provided datasets. Note that the residuals returned by numpy.linalg.lstsq(…) are the sum of
squared errors, so you will have to scale appropriately.
We can use Gradient Descent on this loss function to find the optimal parameters for our linear model
by iteratively updating θ, adding in a small bit of the negative of the gradient. Here’s a formal description
of the algorithm:
Data: initial estimate ˆθ
(0), loss function LS(hθ), number of iterations K, learning rate α
Result: estimate of optimal parameters, ˆθ
(K)
for k ← 1 to K do
ˆθ
(k) ← ˆθ
(k−1) − α∇LS(
ˆθ
(k−1))
end
As we showed in class, for our linear model and using equation 2 for the loss, we can write the gradient
as:
∇LS(hθ) = 1
N
X>Xθ − X>Y
(3)
Implement Gradient Descent using equation 3 in the function linreg grad desc(…). You can check
that it works by comparing your answers for the closed form solution or numpy.linalg.lstsq(…). The
default values for alpha and num iters should be enough for gradient descent to converge to the same
answers for the linear datasets, but you may want to test for higher or lower values to see how it affects the
results. Again, it will probably be helpful to use vis linreg model(…) to visualize the resulting models.
Random Fourier Features
Linear Regression, unsurprisingly enough, can only fit linear data with any degree of accuracy. However,
as we discussed in the class it’s fairly easy to extend Linear Regression by transforming the data using a set
of K non-linear feature functions, which we’ll denote Φ = {φk(x)}
K
k=1.
There are many possible ways to choose the set Φ4
, but there are actually a few methods that extend
Linear Regression to fit models from almost any class of functions. One particularly simple (to implement)
method is called Random Fourier Features which is described by Rahimi and Recht (2008).
The key idea is
that any “well behaved” function can be represented by its Fourier transform, which is basically a sum of
an infinite number of sin and cos functions with different amplitudes and frequencies. The trick is that we
can approximate this infinite sum with a finite number of samples, as long as we sample appropriately at
randomly selected frequencies.
The way this is implemented is by using the following set of basis functions:
φk(x) = cos(ωkx + bk), k = 1, . . . , K
So using using this set of basis functions, and randomly selecting ωk and bk, gives us a problem that’s
linear in the features (so we can use linear regression), and which should let us fit almost any function (where
almost includes any function for which the Fourier inversion theorem holds).
This also means that as we
add more and more features (as we increase K), we should be able to more and more accurately estimate
any crazy non-linear function we like! The proof that this is true, and the details of how exactly to do this
sampling for ωk and bk are in the cited paper, which you’ll need to read in order to implement the function
random fourier features(…).
The paper gives different ways of sampling for different “kernels;” we’ll
talk about these in more detail later, but for now use the method for the Gaussian kernel.
The math in the paper may be a bit advanced, but the actual code that you have to write is minimal (less
than 5 lines total), as a lot has been filled in for you, there and in the helper functions. Perhaps the most
complicated part is getting the shape of the matrix right for the helper funciton apply RFF transform(…)
to work with the code as written.
The helper function is expecting a 1D array for b and a D-by-K matrix
for the ω’s:
Ω =
ω1 ω2 · · · ωK
4and if you have any insight into the kinds of functions that generated your training data, it’s a good idea to start with
those.
The comments in that function also have good pointers for which numpy methods you’ll want to use for
the sampling.
A note about testing RFF
Since you are randomly sampling the transform for Random Fourier Features, your results may vary from
one run to the next. Be sure to test your work, using the provided non-linear datasets like 1D-exp-*.txt and
1D-quad-*.txt, multiple times to get a good idea of how this method performs.
As with gradient descent,
the default parameters should produce reasonable results, but you will want to change these settings to get
a good feel for what’s going on, as well as to help with debugging.
Questions
1. Linear Regression
(a) Use your closed form implementation to fit a model to the data in 1D-no-noise-lin.txt and
2D-noisy-lin.txt. What is the loss of the fit model in both cases? Include plots for both. Note:
for some of these datasets the y-intercept will be non-zero, make sure you handle this case!
(b) What happens when you’re using the closed form solution and one of the features (columns of X)
is duplicated? Explain why.
(c) Does the same thing happen if one of the training points (rows of X) is duplicated? Explain why.
(d) Does the same thing happen with Gradient Descent? Explain why.
2. Gradient Descent
(a) Now use your Gradient Descent implementation to fit a model to 1D-no-noise-lin.txt with
alpha=1.0, num iters=10, and initial Theta set to the zero vector. What is the output (the
full list of (loss, θ) tuples for each of the 10 iterations)?
(b) Using the default parameters for alpha and num iters, and with initial Theta to 0, do you get
the same model parameters as you did with the closed form solution? the same loss?
(c) Find a set of values of the learning rate and iterations where you get the same answers and a set
of values where the answers are noticeably different. Explain what’s going on in both cases.
3. Random Fourier Features
(a) Use your Random Fourier Features implementation to fit models for 1D-exp-samp.txt, 1D-exp-uni.txt,
1D-quad-uni.txt, and 1D-quad-uni-noise.txt, each with 3 different values for K: small,
medium, and large. The small value should noticeably under-fit the data, the large value should
fit almost perfectly, and the medium value should be somewhere in between. Report the values
for K and include graphs for each of the four datasets (so you should report 12 values of K and
have 12 graphs in total).
(b) EXTRA CREDIT: What happens with Random Fourier Features if you try to predict a value
far away from any points in the training dataset? Modify the plotting functions (or write your
own) and show what the model looks like far away from the training points. Describe what you
are seeing qualitatively, compare the model’s output with the “true” output quantitatively, and
explain why this is happening. You may want to run these experiments multiple times to get a
better understanding of what is going on.
References
Rahimi, Ali, and Benjamin Recht. “Random features for large-scale kernel machines.” Advances in neural
information processing systems. 2008.