## Description

## Problem 1 Pegasos: a stochastic gradient based solver for linear SVM

In this question, you will build a linear SVM (i.e. without kernel) classifier using the Pegasos algorithm [1].

Given a training set {(xn ∈ RD, yn ∈ {1, −1})}

N

n=1

, the primal formulation of linear SVM can be written as

L2-regularized hinge loss as shown in the class:

min

w

λ

2

||w||2

2 +

1

N ∑n

max{0, 1 − ynwT

xn}. (1)

Note that here we include the bias term b into parameter w by appending x with 1, which is slightly different

from what was discussed in the class.

Instead of turning it into dual formulation, we are going to solve the primal formulation directly with

a gradient-base algorithm. Recall for (batch) gradient descent, at each iteration of parameter update, we

compute the gradients for all data points and take the average (or sum).

When the training set is large

(i.e., N is a large number), this is often too computationally expensive. Stochastic gradient descent with

mini-batch alleviates this issue by computing the gradient on a subset of the data at each iteration.

Pegasos, a representative solver of eq. (1), is exactly applying stochastic gradient descent with minibatch to this problem, with an extra step of projection described below (this is also called projected stochastic gradient descent). The pseudocode of Pegasos is given in Algorithm 1. At the t-th iteration of parameter

update, we first take a mini-batch of data At of size K [step (3)], and define A

+

t ⊂ At to be the subset

of samples for which wt suffers a non-zero loss [step (4)]. Next we set the learning rate ηt = 1/(λt)

[step (5)].

We then perform a two-step parameter update as follows. We first compute (1 − ηtλ)wt

and for all samples (x, y) ∈ A

+

t we add the vector

yηt

K

x to (1 − ηtλ)wt

. We denote the resulting vector by

wt+ 1

2

[step (6)]. This step is in fact exactly stochastic gradient descent: wt+ 1

2

= wt − ηt∇t where

∇t = λwt −

1

K ∑

(x,y)∈A

+

t

yx

is the (sub)gradient of the objective function on the mini-batch At at wt

. Last, we set wt+1 to be the projection of wt+ 1

2

onto the set

B = {w : ||w||2 ≤ 1/√

λ}

to control its L2 norm. This can be obtained simply by scaling wt+1 by min (

1,

1/√

λ

||wt+ 1

2

||)

[step (e)].

For more details of Pegasos algorithm you may refer to the original paper [1].

Now you are going to implement Pegasos and train a binary linear SVM classifier on the dataset

mnist subset.json. You are going to implement three functions—objective function, pegasos train,

and pegasos test—in a script named pegasos.py. You will find detailed programming instructions in

the script.

1.1 Finish the implementation of the function objective function, which corresponds to the objective

function in the primal formulation of SVM.

1.2 Finish the implementation of the function pegasos train, which corresponds to Algorithm 1.

3

Algorithm 1 The Pegasos algorithm

Require: a training set S = {(xn ∈ RD, yn ∈ {1, −1})}

N

n=1

, the total number of iterations T, the batch size

K, and the regularization parameter λ.

Output: the last weight wT+1.

1: Initialization: w1 = 0

2: for t = 1, 2, · · · , T do

3: Randomly choose a subset of data At ∈ S (with replacement) such that |At

| = K

4: Set A

+

t = {(x, y) ∈ At

: ywT

t

xt < 1} 5: Set ηt = 1 λt 6: Set wt+ 1 2 = (1 − ηtλ)wt + ηt K ∑(x,y)∈A + t yx 7: Set wt+1 = min ( 1, 1/√ λ ||wt+ 1 2 ||) wt+ 1 2 8: end for 1.3 After you train your model, run your classifier on the test set and report the accuracy, which is defined as: # of correctly classified test samples # of test samples Finish the implementation of the function pegasos test. 1.4 After you complete above steps, run pegasos.sh, which will run the Pegasos algorithm for 500 iterations with 6 settings (mini-batch size K = 100 with different λ ∈ {0.01, 0.1, 1} and λ = 0.1 with different K ∈ {1, 10, 1000}), and output a pegasos.json that records the test accuracy and the value of objective function at each iteration during the training process. What to do and submit: run script pegasos.sh.

It will take mnist subset.json as input and generate pegasos.json. Add, commit, and push both pegasos.py and pegasos.json before the due date. 1.5 Based on pegasos.json, you are encouraged to make plots of the objective function value versus the number of iterations (i.e., a convergence curve) in different settings of λ and K, but you are not required to submit these plots.

## References

[1] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter Mathematical Programming, 2011, Volume 127, Number 1, Page 3. Pegasos: primal estimated sub-gradient solver for SVM. 4 Problem 2 Boosting In the lectures, we discussed boosting algorithms, which construct a strong (binary) classifier based on iteratively adding one weak (binary) classifier into it. A weak classifier is learned to (approximately) maximize the weighted training accuracy at the corresponding iteration, and the weight of each training example is updated after every iteration.

In this question, you will implement decision stumps as weak classifiers and AdaBoost. For simplicity, instead of implementing an actual base algorithm that learns decision stumps, we fix a finite set of decision stumps H in advance and pick the one with smallest weighted error at each round of boosting.

2.1 General boosting framework A boosting algorithm sequentially learns βt and ht ∈ H for t = 0, . . . , T − 1 and finally constructs a strong classifier as H(x) = sign h ∑ T−1 t=0 βtht(x) i . Please read the class “Boosting” defined in boosting.py, and then complete the class by implementing the “TODO” part(s) as indicated in boosting.py.

2.2 Decision stump A decision stump h ∈ H is a classifier characterized by a triplet (s ∈ {+1, −1}, b ∈ R, d ∈ {0, 1, · · · , D − 1}) such that h(s,b,d) (x) = ( s, if xd > b,

−s, otherwise.

(2)

That is, each decision stump function only looks at a single dimension xd of the input vector x, and check

whether xd

is larger than b. Then s decides which label to predict if xd > b.

Please first read classifier.py and decision stump.py, and then complete the class “DecisionStump” by implementing the “TODO” part(s) as indicated in decision stump.py.

2.3 AdaBoost AdaBoost is a powerful and popular boosting method, summarized in Algorithm

2.

Algorithm 2 The AdaBoost algorithm

Require: H: A set of classifiers, where h ∈ H takes a D-dimensional vector as input and outputs either +1

or −1, a training set {(xn ∈ RD, yn ∈ {+1, −1})}

N−1

n=0

, the total number of iterations T.

Ensure: Learn H(x) = sign h

∑

T−1

t=0

βtht(x)

i

, where ht ∈ H and βt ∈ R.

1: Initialization D0(n) = 1

N

, ∀n ∈ {0, 1, · · · , N − 1}.

2: for t = 0, 1, · · · , T − 1 do

3: find ht = argminh∈H ∑n Dt(n)I[yn 6= h(xn)]

4: compute et = ∑n Dt(n)I[yn 6= ht(xn)]

5: compute βt =

1

2

log

1 − et

et

6: compute Dt+1(n) = (

Dt(n) exp(−βt), if yn = ht(xn)

Dt(n) exp(βt), otherwise

, ∀n ∈ {0, 1, · · · , N − 1}

7: normalize Dt+1(n) ←

Dt+1(n)

∑n

0 Dt+1(n

0)

, ∀n ∈ {0, 1, · · · , N − 1}

8: end for

Please read the class ”AdaBoost” defined in boosting.py, and then complete the class by implementing the ”TODO” part(s) as indicated in boosting.py.

2.4 Final submission Please run boosting.sh, which will generate boosting.json. Add, commit,

and push boosting.py, decision stump.py and boosting.json before the due date.

5

## Problem 3 Decision Tree

In this question, you will implement a simple decision tree algorithm. For simplicity, only discrete features

are considered here.

3.1 Conditional Entropy Recall that the quality of a split can be measured by the conditional entropy.

Please read decision tree.py and complete the function “conditional entropy” (where we use base 2

for logarithm).

3.2 Tree construction The building of a decision tree involves growing and pruning. For simplicity, in

this programming set you are only asked to grow a tree.

As discussed a tree can be constructed by recursively splitting the nodes if needed, as summarized in

Algorithm 3 (whose interface is slightly different from the pseudocode discussed in the class). Please read

decision tree.py and complete the function “split”.

Algorithm 3 The recursive procedure of decision tree algorithm

1: function TREENODE.SPLIT(self)

2: find the best attribute to split the node

3: split the node into child nodes

4: for child node in child nodes do

5: if child node.splittable then . See Slide 27 of Lec 7 for the three “non-splittable” cases

6: child node.split()

7: end if

8: end for

9: end function

3.3 Final submission Please run decision tree.sh, which will generate decision tree.json. Add,

commit, and push decision tree.py, decision tree.json before the due date.

6