## Description

a. Prove that e

−x ≥ (1 − x).

Before you answer the following questions, you need to read the paper

’A decision-theoretic generalization of on-line learning and an application to

boosting’ by Y. Freund and R.E. Schapire, 1995 (also available from Blackboard, under Course Documents, Reading Material, Computation Learning

Theory). You do not have to focus too much on sections 1, 2 and 3, but

section 4 and 4.1 are important.

b. Implement a ’weak learner’: the decision stump. The decision stump

is a very simple classifier that chooses one feature f from a dataset,

and checks if the feature value xf is larger (or smaller) than a certain

threshold θ. If xf > (<)θ then the object is assigned to class ω1 and

otherwise to ω2.

To find the optimal feature f and threshold θ, you have to do an exhaustive search for a training set. So, try all values for f and θ and the

sign y of < or >, and remember those values f

∗

, θ∗

, y∗

for which the

classification error on the trainingset is minimum.

Make sure that the function accepts a dataset with labels as input, and

that the function outputs the optimal f

∗

, θ

∗ and y

∗

.

Show your (Matlab) code.

1

c. Test the implementation on the dataset generated by gendats from Prtools.

(If you don’t want to use Prtools, just generate the two classes from

two Gaussian distributions, where the means are µ1 = [0, 0]T and

µ2 = [2, 0]T

, and the covariance matrices are identify matrices.) Make

a scatterplot of the data, and give the optimal parameters obtained by

your decision stump.

Does the decision stump change if you rescale one of the features (for

instance, when you multiply feature 2 by a factor of 10)?

d. Test the implementation on dataset optdigitsubset of last week (available from Blackboard). It consists of the pixel values of small 8 × 8

images, which are ordered in N = 1125 rows of 64 columns wide. The

first 554 rows contains the values of 8 × 8 images of zeros, while the remaining block of 571 rows contains the 64 pixel values of 571 ones. Use

the first 50 objects for each class for training, and the rest for testing.

What is the classification error on the test objects? How much does

this vary when you take other random subsets of 50 for training? Show

the mean and standard deviation of the error. Is the performance with

training on the first 50 object unexpectedly good or bad?

e. Extend the implementation of the weak learner such that it accepts a

weight per object. Therefore, next to a dataset and labels, the function

should accept a weight wi > 0 per object xi

. The weighted weak learner

should now minimize the weighted classification error.

Test the code by using a simple classification problem again, and convince yourself that the code is Bug Free. (That means not only that ‘it

does not crash’, but also show convincingly that objects with heigher

weight have a larger influence on the solution.)

f. Implement the algorithm that is described in Figure 2 of the paper: AdaBoost. Show the code.

Also implement the code to classify new and unseen data. Show the

code.

g. Test your implementation on some dataset. Not only use a simple dataset

like gendats but also a more complicated dataset like gendatb. Find

2

out which objects obtain a large weight w

t

i

.

1 Keep the number of iterations T fixed to, say, T = 100.

h. Test the implementation on dataset optdigitsubset of last week. Use

the first 50 objects for each class for training, and the rest for testing.

What is the classification error on the test objects? How does this error

depend on the number of iterations T? If you take the classifier with

the optimal T, which training objects get a high weight? Show them!

1And no, it is not sufficient to just say ’object 13 and 7 have high weight’. Show me,

do the more simple or more difficult objects get high weight?

3