
## Linear Regression – closed form solution

Now you should be ready to implement Linear Regression. First, recall the equation for our general linear model in matrix/vector notation:

$$h_\theta(x) = x^\top \theta$$

where we're using the convention that $x_0 = 1$ (we prepend a 1 to our feature vector). This tells us how to use $\theta$ to predict $y$ for a given $x$. To actually find $\theta$ we can construct a system of linear equations out of our training data like so:

$$
X =
\begin{bmatrix}
x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_D \\
x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_D \\
\vdots    & \vdots    &        & \vdots    \\
x^{(N)}_0 & x^{(N)}_1 & \cdots & x^{(N)}_D
\end{bmatrix}
=
\begin{bmatrix}
x^{(1)\top} \\
x^{(2)\top} \\
\vdots \\
x^{(N)\top}
\end{bmatrix},
\qquad
y =
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N)}
\end{bmatrix}
$$

$$y = X\theta$$

which we can use to find an estimate for the parameters, $\hat{\theta}$:

$$\hat{\theta} = \left(X^\top X\right)^{-1} X^\top y \tag{1}$$

Note that since this is just a bunch of matrix multiplications, transposes, and one matrix inverse, we can actually code this as a single line using numpy, although you can break it out into multiple steps to make it clearer and easier to debug.

Look at the documentation for `numpy.dot(...)`, `numpy.transpose(...)`, and `numpy.linalg.inv(...)`; play around with them on simple 2-by-2 matrices until you understand them, then use them to build up equation 1 in the function `linreg_closed_form(...)`. You can test your work on the provided datasets or data you make yourself, and compare with the results that `numpy.linalg.lstsq(...)` produces. (Note: to get credit for this part, be sure to actually implement the equation; don't use the output of `numpy.linalg.lstsq(...)`.)
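For concreteness, here is a minimal sketch of how those calls might combine inside `linreg_closed_form(...)`; the signature is a guess at the provided stub, and it assumes the column of ones has already been prepended to `X`:

```python
import numpy as np

def linreg_closed_form(X, y):
    """Sketch of equation 1: theta-hat = (X^T X)^{-1} X^T y.

    Assumes X is N-by-(D+1) with the leading column of ones
    already prepended, and y is a length-N vector.
    """
    XtX = np.dot(np.transpose(X), X)   # X^T X, a (D+1)-by-(D+1) matrix
    Xty = np.dot(np.transpose(X), y)   # X^T y, a (D+1)-vector
    return np.dot(np.linalg.inv(XtX), Xty)

# Sanity check against numpy's least-squares solver:
# theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```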

You can also visualize the model you are finding with the `vis_linreg_model(...)` function.

## Gradient Descent

What happens when we have a lot of data points or a lot of features? Notice we're computing $\left(X^\top X\right)^{-1}$, which becomes computationally expensive as that matrix gets larger. In the section after this we're going to need to be able to compute the solution for some really large matrices, so we're going to need a method that doesn't compute this inverse.

Gradient Descent is one such method, one that's used in a variety of algorithms in ML. The basic outline of the gradient descent method is to "follow" the gradient of the loss function, $\nabla_\theta L_S(h_\theta)$, to a minimum for the parameters, $\theta$. First, let's define the loss function as the squared loss:

$$L_S(h_\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \tag{2}$$

Implement equation 2 in the function `loss(...)`. You can check that it works by using `numpy.linalg.lstsq(...)` on the provided datasets. Note that the residuals returned by `numpy.linalg.lstsq(...)` are the sum of squared errors, so you will have to scale appropriately.
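A minimal sketch of what `loss(...)` might look like, under the same assumed data layout (the stub's actual signature may differ):

```python
import numpy as np

def loss(X, y, Theta):
    """Sketch of equation 2: (1/2N) * sum_i (h_theta(x^(i)) - y^(i))^2.

    Assumes X is N-by-(D+1) with the ones column prepended.
    """
    N = X.shape[0]
    residuals = np.dot(X, Theta) - y   # h_theta(x^(i)) - y^(i) for all i
    return np.sum(residuals ** 2) / (2.0 * N)

# Cross-check: np.linalg.lstsq returns the *sum* of squared residuals,
# so loss(X, y, theta_hat) should equal that residual sum / (2 * N).
```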

We can use Gradient Descent on this loss function to find the optimal parameters for our linear model by iteratively updating $\theta$, adding in a small bit of the negative of the gradient. Here's a formal description of the algorithm:

**Data:** initial estimate $\hat{\theta}^{(0)}$, loss function $L_S(h_\theta)$, number of iterations $K$, learning rate $\alpha$

**Result:** estimate of the optimal parameters, $\hat{\theta}^{(K)}$

**for** $k \leftarrow 1$ **to** $K$ **do**

$$\hat{\theta}^{(k)} \leftarrow \hat{\theta}^{(k-1)} - \alpha \nabla L_S\!\left(\hat{\theta}^{(k-1)}\right)$$

**end**

As we showed in class, for our linear model and using equation 2 for the loss, we can write the gradient as:

$$\nabla L_S(h_\theta) = \frac{1}{N} \left( X^\top X \theta - X^\top y \right) \tag{3}$$

Implement Gradient Descent using equation 3 in the function `linreg_grad_desc(...)`. You can check that it works by comparing your answers with the closed form solution or `numpy.linalg.lstsq(...)`. The default values for `alpha` and `num_iters` should be enough for gradient descent to converge to the same answers for the linear datasets, but you may want to test higher or lower values to see how they affect the results. Again, it will probably be helpful to use `vis_linreg_model(...)` to visualize the resulting models.
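As a rough sketch (the signature, defaults, and return value are guesses, inferred partly from how Question 2(a) below asks for a list of (loss, $\theta$) tuples), gradient descent with the equation-3 gradient might look like this, reusing the `loss(...)` sketch above:

```python
import numpy as np

def linreg_grad_desc(X, y, alpha=0.1, num_iters=1000, initial_Theta=None):
    """Sketch of gradient descent using equation 3:
    grad = (1/N) * (X^T X Theta - X^T y).

    Returns the list of (loss, Theta) tuples, one per iteration.
    Signature and default values are guesses at the provided stub.
    """
    N, num_features = X.shape
    Theta = np.zeros(num_features) if initial_Theta is None else initial_Theta
    XtX = np.dot(X.T, X)   # precompute X^T X -- note: never inverted!
    Xty = np.dot(X.T, y)   # precompute X^T y
    history = []
    for _ in range(num_iters):
        grad = (np.dot(XtX, Theta) - Xty) / N
        Theta = Theta - alpha * grad
        history.append((loss(X, y, Theta), Theta))  # loss(...) sketched above
    return history
```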

## Random Fourier Features

Linear Regression, unsurprisingly enough, can only fit linear data with any degree of accuracy. However, as we discussed in class, it's fairly easy to extend Linear Regression by transforming the data using a set of $K$ non-linear feature functions, which we'll denote $\Phi = \{\phi_k(x)\}_{k=1}^{K}$.

There are many possible ways to choose the set $\Phi$ (if you have any insight into the kinds of functions that generated your training data, it's a good idea to start with those), but there are actually a few methods that extend Linear Regression to fit models from almost any class of functions. One particularly simple (to implement) method is called Random Fourier Features, which is described by Rahimi and Recht (2008).

The key idea is that any "well behaved" function can be represented by its Fourier transform, which is basically a sum of an infinite number of sine and cosine functions with different amplitudes and frequencies. The trick is that we can approximate this infinite sum with a finite number of samples, as long as we sample appropriately at randomly selected frequencies.

The way this is implemented is by using the following set of basis functions:

$$\phi_k(x) = \cos\!\left(\omega_k^\top x + b_k\right), \quad k = 1, \dots, K$$

Using this set of basis functions, and randomly selecting $\omega_k$ and $b_k$, gives us a problem that's linear in the features (so we can use linear regression), and which should let us fit almost any function (where "almost" includes any function for which the Fourier inversion theorem holds).

This also means that as we add more and more features (as we increase $K$), we should be able to more and more accurately estimate any crazy non-linear function we like! The proof that this is true, and the details of how exactly to do this sampling for $\omega_k$ and $b_k$, are in the cited paper, which you'll need to read in order to implement the function `random_fourier_features(...)`. The paper gives different ways of sampling for different "kernels"; we'll talk about these in more detail later, but for now use the method for the Gaussian kernel.

The math in the paper may be a bit advanced, but the actual code that you have to write is minimal (less than 5 lines total), as a lot has been filled in for you, there and in the helper functions. Perhaps the most complicated part is getting the shape of the matrix right for the helper function `apply_RFF_transform(...)` to work with the code as written. The helper function is expecting a 1D array for $b$ and a D-by-K matrix for the $\omega$'s:

$$\Omega = \begin{bmatrix} \omega_1 & \omega_2 & \cdots & \omega_K \end{bmatrix}$$


The comments in that function also have good pointers for which numpy methods you'll want to use for the sampling.
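As a rough sketch of the sampling step (the signature here and the helper's argument order are guesses at the provided stubs, and unit kernel bandwidth is assumed), the Gaussian-kernel case might look like:

```python
import numpy as np

def random_fourier_features(X, K):
    """Sketch of RFF sampling for the Gaussian kernel (Rahimi & Recht, 2008).

    For the Gaussian kernel, each frequency omega_k is drawn from a
    standard normal distribution and each offset b_k uniformly from
    [0, 2*pi]. Omega is D-by-K (one omega per column) and b is a 1D
    array of length K, matching what apply_RFF_transform expects.
    """
    D = X.shape[1]
    Omega = np.random.normal(size=(D, K))             # omega_k ~ N(0, I_D)
    b = np.random.uniform(0.0, 2.0 * np.pi, size=K)   # b_k ~ Uniform[0, 2pi]
    return apply_RFF_transform(X, Omega, b)           # helper provided for you
```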

## A note about testing RFF

Since you are randomly sampling the transform for Random Fourier Features, your results may vary from one run to the next. Be sure to test your work multiple times, using the provided non-linear datasets like 1D-exp-*.txt and 1D-quad-*.txt, to get a good idea of how this method performs.

As with gradient descent, the default parameters should produce reasonable results, but you will want to change these settings to get a good feel for what's going on, as well as to help with debugging.

## Questions

## 1. Linear Regression

(a) Use your closed form implementation to fit a model to the data in 1D-no-noise-lin.txt and 2D-noisy-lin.txt. What is the loss of the fit model in both cases? Include plots for both. Note: for some of these datasets the y-intercept will be non-zero; make sure you handle this case!

(b) What happens when you're using the closed form solution and one of the features (columns of X) is duplicated? Explain why.

(c) Does the same thing happen if one of the training points (rows of X) is duplicated? Explain why.

(d) Does the same thing happen with Gradient Descent? Explain why.

## 2. Gradient Descent

(a) Now use your Gradient Descent implementation to fit a model to 1D-no-noise-lin.txt with `alpha=1.0`, `num_iters=10`, and `initial_Theta` set to the zero vector. What is the output (the full list of (loss, $\theta$) tuples for each of the 10 iterations)?

(b) Using the default parameters for `alpha` and `num_iters`, and with `initial_Theta` set to 0, do you get the same model parameters as you did with the closed form solution? The same loss?

(c) Find a set of values for the learning rate and number of iterations where you get the same answers, and a set of values where the answers are noticeably different. Explain what's going on in both cases.

## 3. Random Fourier Features

(a) Use your Random Fourier Features implementation to fit models for 1D-exp-samp.txt, 1D-exp-uni.txt, 1D-quad-uni.txt, and 1D-quad-uni-noise.txt, each with 3 different values for K: small, medium, and large. The small value should noticeably under-fit the data, the large value should fit almost perfectly, and the medium value should be somewhere in between. Report the values for K and include graphs for each of the four datasets (so you should report 12 values of K and have 12 graphs in total).

(b) EXTRA CREDIT: What happens with Random Fourier Features if you try to predict a value far away from any points in the training dataset? Modify the plotting functions (or write your own) and show what the model looks like far away from the training points. Describe what you are seeing qualitatively, compare the model's output with the "true" output quantitatively, and explain why this is happening. You may want to run these experiments multiple times to get a better understanding of what is going on.

## References

Rahimi, Ali, and Benjamin Recht. "Random features for large-scale kernel machines." Advances in Neural Information Processing Systems. 2008.