## Description

This assignment is related to the simulation study described in Section 2.3.1 (the so-called Scenario 2) of “Elements of Statistical Learning” (ESL).

Scenario 2: the two-dimensional data $X \in \mathbb{R}^2$ in each class is generated from a mixture of 10 different bivariate Gaussian distributions with uncorrelated components and different means, i.e.,

$$X \mid Y = k, Z = l \sim N\left(m_{kl}, s^2 I_2\right),$$

where $k = 0, 1$, $l = 1, \dots, 10$, $P(Y = k) = 1/2$, and $P(Z = l) = 1/10$. In other words, given $Y = k$, $X$ follows a mixture distribution with density function

$$\frac{1}{10} \sum_{l=1}^{10} \left(\frac{1}{\sqrt{2\pi s^2}}\right)^2 e^{-\|x - m_{kl}\|^2/(2s^2)}.$$
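The density above translates fairly directly into R. The following is a minimal sketch, not a reference solution: the function names `mixture_density` and `bayes_rule` and the argument layout (centers stored as a 10 x 2 matrix, one center per row) are assumptions made here for illustration. Because both classes have prior probability 1/2, the Bayes rule reduces to predicting the class whose conditional density is larger.

```r
# Mixture density of X given Y = k.
# x: n x 2 matrix of points (or a length-2 vector); m: L x 2 matrix of centers
# for class k (one center per row); s: the within-component standard deviation.
mixture_density <- function(x, m, s) {
  x <- matrix(x, ncol = 2)
  # Squared Euclidean distance from every point to every center (n x L matrix).
  d2 <- outer(rowSums(x^2), rowSums(m^2), "+") - 2 * x %*% t(m)
  # (1 / sqrt(2 * pi * s^2))^2 = 1 / (2 * pi * s^2); rowMeans gives (1/L) * sum.
  rowMeans(exp(-d2 / (2 * s^2))) / (2 * pi * s^2)
}

# Bayes rule under equal class priors: predict 1 when the class-1 density
# exceeds the class-0 density, and 0 otherwise.
bayes_rule <- function(x, m0, m1, s) {
  as.integer(mixture_density(x, m1, s) > mixture_density(x, m0, s))
}
```

Computing all pairwise distances with `outer` and a matrix product avoids an explicit loop over the 10,000 test points.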

First, generate the twenty two-dimensional vectors as follows: for $l = 1, \dots, 10$,

$$m_{0l} \overset{\text{i.i.d.}}{\sim} N\left((0, 1)^T, \sigma^2 I_2\right), \qquad m_{1l} \overset{\text{i.i.d.}}{\sim} N\left((1, 0)^T, \sigma^2 I_2\right).$$
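One possible way to draw the twenty centers is sketched below. The seed `1234` and the value `sigma <- 1` are placeholders chosen only for illustration, and the names `m0` and `m1` are assumptions; each matrix stores one center per row.

```r
set.seed(1234)   # placeholder seed; replace with your own 4 digits
sigma <- 1       # assumed value; the assignment leaves the choice to you
n_centers <- 10

# Because the covariance is sigma^2 * I_2, the two coordinates are independent,
# so each column can be drawn separately around its own mean.
m0 <- cbind(rnorm(n_centers, mean = 0, sd = sigma),
            rnorm(n_centers, mean = 1, sd = sigma))  # centered at (0, 1)
m1 <- cbind(rnorm(n_centers, mean = 1, sd = sigma),
            rnorm(n_centers, mean = 0, sd = sigma))  # centered at (1, 0)
```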

Then, repeat the following simulation 20 times using the same set of centers generated above. In each simulation,

1. follow the data generating process to generate a training sample of size 200 and a test sample of size 10,000, and

2. calculate the training and test errors (the averaged 0/1 error¹) for each of the following four procedures:

• linear regression with cut-off value² 0.5,

• quadratic regression with cut-off value 0.5,

• kNN classification with k chosen by 10-fold cross-validation, and

• the Bayes rule (assume you know the values of the $m_{kl}$'s and $s$).
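A single run of the two regression procedures might look like the sketch below. Several choices here are assumptions rather than requirements of the assignment: the helper names `generate_data` and `zero_one_error`, the placeholder seed and parameter values, drawing the class labels at random with probability 1/2 (rather than fixing 100 per class), and reading "quadratic regression" as a full second-order polynomial including the interaction term.

```r
set.seed(1234)                 # placeholder seed
sigma <- 1; s <- sqrt(1 / 5)   # assumed values, so that s^2 = sigma^2 / 5
m0 <- cbind(rnorm(10, 0, sigma), rnorm(10, 1, sigma))
m1 <- cbind(rnorm(10, 1, sigma), rnorm(10, 0, sigma))

# Draw n points: pick a class, pick one of its 10 centers uniformly,
# then add N(0, s^2 * I_2) noise around that center.
generate_data <- function(n, m0, m1, s) {
  y <- rbinom(n, size = 1, prob = 0.5)
  z <- sample(nrow(m0), n, replace = TRUE)
  centers <- m0[z, , drop = FALSE]
  centers[y == 1, ] <- m1[z[y == 1], , drop = FALSE]
  x <- centers + matrix(rnorm(2 * n, sd = s), ncol = 2)
  data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
}

train <- generate_data(200, m0, m1, s)
test  <- generate_data(10000, m0, m1, s)

# Averaged 0/1 error of a regression fit thresholded at the cut-off value.
zero_one_error <- function(fit, data, cutoff = 0.5) {
  pred <- as.integer(predict(fit, newdata = data) > cutoff)
  mean(pred != data$y)
}

fit_lin  <- lm(y ~ x1 + x2, data = train)
fit_quad <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), data = train)

errors <- c(
  lin_train  = zero_one_error(fit_lin, train),
  lin_test   = zero_one_error(fit_lin, test),
  quad_train = zero_one_error(fit_quad, train),
  quad_test  = zero_one_error(fit_quad, test)
)
print(errors)
```

Wrapping the part from `generate_data(200, ...)` onward in a loop over 20 iterations, while keeping `m0` and `m1` fixed, would give the repeated simulation described above.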

Summarize your results on training errors and test errors graphically, e.g., using a boxplot or stripchart. Also report the mean and standard error for the chosen k values.
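As a small illustration of the last step, the sketch below computes the mean and standard error of a vector of k values. It assumes "standard error" means the sample standard deviation divided by the square root of the number of runs; the helper name `mean_and_se` is hypothetical, and the numbers in `chosen_k` are made-up placeholders, not results.

```r
# Mean and standard error of the mean (sample sd / sqrt(number of values)).
mean_and_se <- function(v) {
  c(mean = mean(v), se = sd(v) / sqrt(length(v)))
}

# Made-up placeholder values standing in for the 20 k's chosen by
# cross-validation; replace with the values from your own simulation.
chosen_k <- c(11, 15, 9, 21, 13, 11, 17, 15, 9, 13,
              11, 19, 15, 13, 11, 17, 13, 15, 9, 13)
summary_k <- mean_and_se(chosen_k)
print(summary_k)
```

For the plots, errors collected in a long-format data frame (say, with hypothetical columns `method` and `error`) can be passed to base R's `boxplot(error ~ method, data = ...)` or to `ggplot2`.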


¹ For each sample, the incurred error is 1 if there is a mistake, and 0 otherwise.

² Predict Y to be 1 if the returned estimate is bigger than the cut-off value, and 0 otherwise.

## What you need to submit

An R Markdown file in HTML format.

• You are only allowed to use two packages: class and ggplot2. In other words, you have to write your own function to select the optimal k value based on 10-fold CV.

• Set the seed at the beginning of your code to the last 4 digits of your University ID, so that we get the same results when we run your code.

• Specify the values for $s$ and $\sigma$. We suggest choosing a larger value for $\sigma$ and a smaller value for $s$; e.g., pick a value for $\sigma$ and then set $s^2 = \sigma^2/5$.

• Name your file starting with

Assignment 1 xxxx netID

where “xxxx” is the last 4 digits of your University ID, and make sure the same 4 digits are used as the seed in your code.

For example, the submission for Max Chen with UID 672757127 and netID mychen12 should be named

Assignment 1 7127 mychen12 MaxChen.html

You can add any characters after your netID.
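Regarding the first requirement above, a hand-written 10-fold CV selector built only on `knn()` from the `class` package could be sketched as follows. This is one possible approach under stated assumptions, not the expected solution: the function name `cv_knn`, the candidate grid of odd k values, the tie-breaking rule, the placeholder seed, and the toy data at the bottom are all chosen here for illustration.

```r
library(class)  # provides knn(); one of the two packages the assignment allows

# Choose k for kNN by n_folds-fold cross-validation.
# x: numeric feature matrix, y: 0/1 labels, k_grid: candidate k values.
cv_knn <- function(x, y, k_grid, n_folds = 10) {
  n <- nrow(x)
  # Randomly assign each observation to one of n_folds (nearly) equal folds.
  fold_id <- sample(rep(seq_len(n_folds), length.out = n))
  cv_error <- sapply(k_grid, function(k) {
    n_wrong <- 0
    for (f in seq_len(n_folds)) {
      held_out <- fold_id == f
      pred <- knn(train = x[!held_out, , drop = FALSE],
                  test  = x[held_out, , drop = FALSE],
                  cl    = factor(y[!held_out]),
                  k     = k)
      n_wrong <- n_wrong + sum(as.character(pred) != as.character(y[held_out]))
    }
    n_wrong / n  # averaged 0/1 error over all held-out points
  })
  # Ties go to the largest k (the smoothest fit); this is a design choice.
  best_k <- max(k_grid[cv_error == min(cv_error)])
  list(best_k = best_k, cv_error = cv_error)
}

set.seed(1234)  # placeholder seed
# Toy two-class data, used only to exercise the function.
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 2), ncol = 2))
y <- rep(c(0, 1), each = 100)
result <- cv_knn(x, y, k_grid = seq(1, 49, by = 2))
print(result$best_k)
```

The same folds are reused for every candidate k, so the CV errors are compared on identical splits; the largest candidate k must stay below the size of a training fold.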
