## Description

Implement Lasso using the Coordinate Descent (CD) algorithm and test your algorithm on

the Boston housing data. Check [Rcode W3 VarSel RidgeLasso.html] for relevant background information on this dataset.

• First, load the transformed Boston Housing Data, Coding2 myData.csv.

• Next, write your own function MyLasso to implement CD. It should output estimated Lasso coefficients similar to the ones returned by R's glmnet with option “standardize = TRUE”.

In case you don’t know where to start, you can follow the structure given on the next

page to prepare your function. In our script, we run a fixed number of iterations,

“maxit = 100,” which seems enough for this assignment. You could set it to be a

bigger number, or change it to a while loop to stop when some convergence criterion

is satisfied.

• Test your algorithm with the following lambda sequence.

```r
lam.seq = exp(seq(-1, -8, length.out = 80))
myout = MyLasso(X, y, lam.seq, maxit = 100)
```

• Produce a path plot for the 13 non-intercept coefficients along the lambda values on the log scale.

• Check the accuracy of your algorithm against the output from glmnet. The maximum

difference between the two coefficient matrices should be less than 0.005.

```r
lasso.fit = glmnet(X, y, alpha = 1, lambda = lam.seq)
max(abs(coef(lasso.fit) - myout))
```

Students who use Python for this assignment can find the dataset Coding2 myData.csv and the target Lasso coefficients, Coding2 lasso coefs.csv, in the Coding Assignments folder on the Resources page on Piazza.

What you need to submit

An R/Python Markdown file in HTML format, which should contain all code used to

produce your results.

• You are allowed to use two packages: MASS (for the data) and glmnet.

• Name your file starting with Assignment 2 xxxx netID, where “xxxx” is the last 4 digits of your University ID.

```r
One_var_lasso = function(r, x, lam) {
  ###############
  # YOUR CODE
  ###############
}
```

```r
MyLasso = function(X, y, lam.seq, maxit = 50) {
  # X: n-by-p design matrix without the intercept
  # y: n-by-1 response vector
  # lam.seq: sequence of lambda values
  # maxit: number of updates for each lambda

  n = length(y)
  p = dim(X)[2]
  nlam = length(lam.seq)

  ##############################
  # YOUR CODE
  # Center and scale X
  # Record the corresponding means and scales
  ##############################

  # Initialize coef vector b and residual vector r
  b = XXX
  r = XXX
  B = XXX

  # Triple nested loop
  for (m in 1:nlam) {
    lam = XXX  # assign lambda value
    for (step in 1:maxit) {
      for (j in 1:p) {
        r = r + (X[, j] * b[j])
        b[j] = One_var_lasso(r, X[, j], lam)
        r = r - X[, j] * b[j]
      }
    }
    B[m, -1] = b
  }

  ##############################
  # YOUR CODE
  # Scale back the coefficients and update the intercepts B[, 1]
  ##############################

  return(t(B))
}
```

Note: You need to write your own function One_var_lasso to solve the one-variable Lasso problem for βj. Check the hints given on the next page.
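For Python users, the same triple-loop structure can be sketched with NumPy. This is only an illustration under my own assumptions: the function name `my_lasso_sketch` is mine, the sketch keeps the coefficients on the standardized-X scale (the back-transformation to the original scale that the assignment requires is omitted), and its penalty scaling is not guaranteed to match glmnet's — matching glmnet is the part you need to work out.

```python
import numpy as np

def my_lasso_sketch(X, y, lam_seq, maxit=100):
    """Coordinate descent over a lambda sequence (illustrative sketch only).

    Minimizes ||r - X b||^2 + lam * sum|b_j| on centered/scaled data.
    Row 0 of the returned matrix holds the intercept (on the standardized
    scale); rows 1..p hold the coefficients, one column per lambda.
    """
    n, p = X.shape
    y_mean = y.mean()
    x_mean = X.mean(axis=0)
    x_sd = X.std(axis=0)            # population sd, as glmnet uses
    Xs = (X - x_mean) / x_sd        # center and scale X
    yc = y - y_mean                 # center y

    B = np.zeros((len(lam_seq), p + 1))
    b = np.zeros(p)                 # coefficients, warm-started across lambdas
    r = yc.copy()                   # residual for the current b
    for m, lam in enumerate(lam_seq):
        for _ in range(maxit):
            for j in range(p):
                r = r + Xs[:, j] * b[j]            # add back j-th contribution
                xx = Xs[:, j] @ Xs[:, j]           # ||x_j||^2
                a = (r @ Xs[:, j]) / xx            # one-predictor LS estimate
                b[j] = np.sign(a) * max(abs(a) - lam / (2 * xx), 0.0)
                r = r - Xs[:, j] * b[j]            # subtract updated contribution
        B[m, 0] = y_mean            # intercept on the standardized-X scale
        B[m, 1:] = b
    return B.T
```

With a huge lambda every coefficient is thresholded to zero, and with a tiny lambda the update reduces to ordinary least squares on the standardized columns — two easy sanity checks before comparing against glmnet.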


In the CD algorithm, at each iteration, we repeatedly solve a one-dimensional Lasso problem

for βj while holding the other (p − 1) coefficients at their current values:

$$
\min_{\beta_j}\ \sum_{i=1}^{n}\Big(y_i - \sum_{k \neq j} x_{ik}\hat{\beta}_k - x_{ij}\beta_j\Big)^2 + \lambda \sum_{k \neq j} |\hat{\beta}_k| + \lambda |\beta_j|,
$$

which is equivalent to solving

$$
\min_{\beta_j}\ \sum_{i=1}^{n}(r_i - x_{ij}\beta_j)^2 + \lambda |\beta_j|, \tag{1}
$$

where

$$
r_i = y_i - \sum_{k \neq j} x_{ik}\hat{\beta}_k.
$$

How to solve (1)? In class we have discussed how to find the minimizer of

$$
f(x) = (x - a)^2 + \lambda |x|,
$$

which is given by

$$
x^* = \arg\min_x f(x) = \operatorname{sign}(a)\,(|a| - \lambda/2)_+ =
\begin{cases}
a - \lambda/2, & \text{if } a > \lambda/2; \\
0, & \text{if } |a| \le \lambda/2; \\
a + \lambda/2, & \text{if } a < -\lambda/2.
\end{cases} \tag{2}
$$
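The closed-form minimizer in (2) is the soft-thresholding operator. A minimal Python sketch (the function name `soft_threshold` is mine, not part of the assignment):

```python
import numpy as np

def soft_threshold(a, lam):
    """Minimize f(x) = (x - a)^2 + lam * |x|.

    Returns sign(a) * max(|a| - lam/2, 0), which covers all three
    cases in (2): shrink toward zero by lam/2, or snap to zero.
    """
    return np.sign(a) * max(abs(a) - lam / 2.0, 0.0)
```

For example, `soft_threshold(3.0, 2.0)` gives 2.0 (the case a > λ/2), while `soft_threshold(0.4, 2.0)` gives 0.0 (the case |a| ≤ λ/2).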

We can rewrite (1) in the form of f(x) and then use the solution given above.

Define two $n \times 1$ vectors: $\mathbf{r} = (r_1, \ldots, r_n)^T$ with $r_i$ as its $i$-th element, and $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^T$ with $x_{ij}$ as its $i$-th element. Then

$$
\begin{pmatrix} r_1 - x_{1j}\beta_j \\ r_2 - x_{2j}\beta_j \\ \vdots \\ r_n - x_{nj}\beta_j \end{pmatrix}
=
\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix}
-
\begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix} \beta_j
= \mathbf{r} - \mathbf{x}_j \beta_j.
$$

Then rewrite (1) as

$$
\sum_{i=1}^{n}(r_i - x_{ij}\beta_j)^2 + \lambda |\beta_j| = \|\mathbf{r} - \mathbf{x}_j \beta_j\|^2 + \lambda |\beta_j|. \tag{3}
$$

The first term above is like the RSS from a regression model with only one predictor (whose

coefficient is βj ) without the intercept. The corresponding LS estimate is given by

$$
\hat{\beta}_j = \mathbf{r}^T \mathbf{x}_j / \|\mathbf{x}_j\|^2.
$$


Then we have

$$
\begin{aligned}
\|\mathbf{r} - \mathbf{x}_j \beta_j\|^2
&= \|(\mathbf{r} - \mathbf{x}_j \hat{\beta}_j) - \mathbf{x}_j(\beta_j - \hat{\beta}_j)\|^2 \\
&= \|\mathbf{r} - \mathbf{x}_j \hat{\beta}_j\|^2 + \|\mathbf{x}_j(\beta_j - \hat{\beta}_j)\|^2
- 2\,\big\langle \mathbf{r} - \mathbf{x}_j \hat{\beta}_j,\ \mathbf{x}_j(\beta_j - \hat{\beta}_j) \big\rangle \\
&= \|\mathbf{r} - \mathbf{x}_j \hat{\beta}_j\|^2 + \|\mathbf{x}_j(\beta_j - \hat{\beta}_j)\|^2,
\end{aligned} \tag{4}
$$

where the last equality holds because the inner product term is zero: the vector $\mathbf{r} - \mathbf{x}_j \hat{\beta}_j$ is orthogonal to $\mathbf{x}_j$.¹
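The orthogonality fact behind (4) is easy to verify numerically. A quick check with random data (all variable names here are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.standard_normal(50)       # residual vector r
xj = rng.standard_normal(50)      # j-th column x_j of the design matrix

beta_hat = r @ xj / (xj @ xj)     # one-predictor LS estimate r'x_j / ||x_j||^2
residual = r - xj * beta_hat      # r - x_j * beta_hat_j

# The LS residual is orthogonal to x_j, so the cross term in (4) vanishes
# (up to floating-point error).
print(abs(residual @ xj))
```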

Note that the first term of (4) has nothing to do with βj . So to minimize (1) or equivalently

(3) with respect to βj , we can ignore the first term and instead minimize

$$
\begin{aligned}
\|\mathbf{x}_j(\beta_j - \hat{\beta}_j)\|^2 + \lambda |\beta_j|
&= \|\mathbf{x}_j\|^2 (\beta_j - \hat{\beta}_j)^2 + \lambda |\beta_j| \\
&= \|\mathbf{x}_j\|^2 \Big[(\beta_j - \hat{\beta}_j)^2 + \frac{\lambda}{\|\mathbf{x}_j\|^2} |\beta_j|\Big] \\
&\propto (\beta_j - \hat{\beta}_j)^2 + \frac{\lambda}{\|\mathbf{x}_j\|^2} |\beta_j|.
\end{aligned}
$$

Now we can use (2), the solution we derived for f(x), with

$$
a = \hat{\beta}_j = \mathbf{r}^T \mathbf{x}_j / \|\mathbf{x}_j\|^2
$$

and with $\lambda$ replaced by $\lambda / \|\mathbf{x}_j\|^2$.

¹This is because $\mathbf{r} - \mathbf{x}_j \hat{\beta}_j$ represents the residual vector from a regression model with $\mathbf{x}_j$ as a column (actually the only column) of the design matrix, so it is orthogonal to $\mathbf{x}_j$.
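Putting the pieces together, the one-variable update is just a soft-threshold of the one-predictor LS estimate. A hedged Python sketch of the update the skeleton's One_var_lasso is meant to perform (the interface mirrors the R skeleton; the body is my reconstruction from the derivation, so verify it against glmnet yourself):

```python
import numpy as np

def one_var_lasso(r, x, lam):
    """Solve min_b ||r - x*b||^2 + lam * |b| for a single coefficient b.

    Per the derivation: soft-threshold a = r'x / ||x||^2 with the
    rescaled penalty lam / ||x||^2, i.e. threshold level lam / (2||x||^2).
    """
    xx = x @ x                  # ||x_j||^2
    a = (r @ x) / xx            # one-predictor LS estimate beta_hat_j
    t = lam / xx                # rescaled penalty lambda / ||x_j||^2
    return np.sign(a) * max(abs(a) - t / 2.0, 0.0)
```

As a sanity check, with `r = [2, 2]` and `x = [1, 1]` the unpenalized solution is 2; with `lam = 2` the minimizer of $2(2-b)^2 + 2b$ is 1.5, which the update reproduces.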
