## Description

The purpose of this homework is to help you to be prepared to analyze datasets in your future studies and

career. Since we are learning how to analyze the dataset, this HW (and other early HWs) will provide the

detailed R codes and technical details.

Hence, besides running these R codes or their extensions, we expect

you to write your homework solution in the format of a report (in pdf or word) that summarizes your findings,

understandings, and interpretations.

In the main body of the report, please be concise (possibly 2 ∼ 8 pages) and

easy-understanding, e.g., using the descriptive tables/figures to summarize your results (instead of blindly copying

and pasting R/Python output. Of course, if you want, please feel free to include R or python codes/outputs as

an appendix (as many pages as you want).

Problem (KNN). Consider the well-known zipcode data set in the machine learning and data mining literature,

which are available from the book website: <www-stat.stanford.edu/ElemStatLearn>. You can also find it at

Canvas: the training data set is the file “zip.train.csv” and the testing dataset is “zip.test.csv”. In the zipcode

data, the first column stands for the response (Y ) and the other columns stand for the independent variables

(Xi

’s).

The detailed description can be found from

http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt

Here we consider only the classification problem between 2’s and 7’s, e.g., denote by “ziptrain27” the training

data that only includes the data when Y = 2 or when Y = 7.

(1) Exploratory Data Analysis of Training data: when playing with the training data “ziptrain27”,

e.g., report some summary information/statistics of training data that you think are important or interesting.

Please do not copy and paste the results of R or Python codes — you need to be selective, and use your own

language to write up some sentences to summarize those important or interesting results.

(2) Build the classification rule by using the training data “ziptrain27” with the following methods: (i)

linear regression; and (ii) the KNN with k = 1, 3, 5, 7, 9, 11, 13, and 15. Find the training errors of each choice.

(3) Consider the provided testing data set, and derive the testing errors of each classification rule in (3).

(4) Cross-Validation. The above steps are sufficient in many machine learning or data mining questions

when both training and testing data sets are very large. However, for relatively small data sets, one may

want to do further to assess the robustness of each approach.

One general approach is Monte Carlo Cross

Validation algorithm that splits the observed data points into training and testing subsets, and repeats the

above computation B times (B = 100 say). In the context of this homework, we can combine n1 = 1376 training

data and n2 = 345 testing data together into a larger data set.

Then for each loop b = 1, · · · , B, we randomly

select n1 = 1376 as a new training subset and use the remaining n2 = 345 data as the new testing subset.

Within each loop, we first build different models from “the training data of that specific loop” and then evaluate

their performances on “the corresponding testing data.”

Therefore, for each model or method in part (3), we will

obtain B values of the testing errors on B different subsets of testing data, denote by T Eb for b = 1, 2, · · · , B.

Then the “average” performances of each model can be summarized by the sample mean and sample variances

of these B values:

T E∗ =

1

B

X

B

b=1

T Eb and V ar ˆ (T E) = 1

B − 1

X

B

b=1

T Eb − T E∗

2

.

Compute and compare the “average” performances of each model or method mentioned in part (2). In

particular, based on your results, write some paragraphs to provide a brief summary of what you discover

in the cross-validation, including reporting the “optimal” choice of the tuning parameter k in the KNN method,

and explaining how confident you are on the usefulness of your optimal choice in real-world applications.

Appendix: Please feel free to use the following sample R codes if you want. Of course, you are free to use

Python or other softwares

## Below assume that you save the datasets in the folder ‘‘C://Temp” in your laptop

## 1. Read Training data

ziptrain <- read.table(file=”C://Temp/zip.train.csv”, sep = “,”);

ziptrain27 <- subset(ziptrain, ziptrain[,1]==2 | ziptrain[,1]==7);

## some sample Exploratory Data Analysis

dim(ziptrain27); ## 1376 257

sum(ziptrain27[,1] == 2);

summary(ziptrain27);

round(cor(ziptrain27),2);

## To see the letter picture of the 5-th row by changing the row observation to a matrix

rowindex = 5; ## You can try other “rowindex” values to see other rows

ziptrain27[rowindex,1];

Xval = t(matrix(data.matrix(ziptrain27[,-1])[rowindex,],byrow=TRUE,16,16)[16:1,]);

image(Xval,col=gray(0:1),axes=FALSE) ## Also try “col=gray(0:32/32)”

### 2. Build Classification Rules

### linear Regression

mod1 <- lm( V1 ~ . , data= ziptrain27);

pred1.train <- predict.lm(mod1, ziptrain27[,-1]);

y1pred.train <- 2 + 5*(pred1.train >= 4.5);

## Note that we predict Y1 to $2$ and $7$,

## depending on the indicator variable whether pred1.train >= 4.5 = (2+7)/2.

mean( y1pred.train != ziptrain27[,1]);

## KNN

library(class);

kk <- 1;

xnew <- ziptrain27[,-1];

ypred2.train <- knn(ziptrain27[,-1], xnew, ziptrain27[,1], k=kk);

mean( ypred2.train != ziptrain27[,1])

### 3. Testing Error

### read testing data

ziptest <- read.table(file=”C://Temp/zip.test.csv”, sep = “,”);

ziptest27 <- subset(ziptest, ziptest[,1]==2 | ziptest[,1]==7);

dim(ziptest27) ##345 257

## Testing error of KNN, and you can change the k values.

xnew2 <- ziptest27[,-1]; ## xnew2 is the X variables of the “testing” data

kk <- 1; ## below we use the training data “ziptrain27” to predict xnew2 via KNN

ypred2.test <- knn(ziptrain27[,-1], xnew2, ziptrain27[,1], k=kk);

mean( ypred2.test != ziptest27[,1]) ## Here “ziptest27[,1]” is the Y response of the “testing” data

### 4. Cross-Validation

### The following R code might be useful, but you need to modify it.

zip27full = rbind(ziptrain27, ziptest27) ### combine to a full data set

n1 = 1376; # training set sample size

n2= 345; # testing set sample size

n = dim(zip27full)[1]; ## the total sample size

set.seed(7406); ### set the seed for randomization

### Initialize the TE values for all models in all $B=100$ loops

B= 100; ### number of loops

TEALL = NULL; ### Final TE values

for (b in 1:B){

### randomly select n1 observations as a new training subset in each loop

flag <- sort(sample(1:n, n1));

zip27traintemp <- zip27full[flag,]; ## temp training set for CV

zip27testtemp <- zip27full[-flag,]; ## temp testing set for CV

### you need to write your own R code here to first fit each model to “zip27traintemp”

### then get the testing error (TE) values on the testing data “zip27testtemp”

### IMPORTANT: when copying your codes in (2) and (3), please change to

### these temp datasets, “zip27traintemp” and “zip27testtemp” !!!

###

### Suppose you save the TE values for these 9 methods (1 linear regression and 8 KNN) as

### te0, te1, te2, te3, te4, te5, te6, te7, te8 respectively, within this loop

### Then you can save these $9$ Testing Error values by using the R code

### Note that the code is not necessary the most efficient

TEALL = rbind( TEALL, cbind(te0, te1, te2, te3, te4, te5, te6, te7, et8) );

}

### Of course, you can also get the training errors if you want

dim(TEALL); ### This should be a Bx9 matrices

### if you want, you can change the column name of TEALL

colnames(TEALL) <- c(“linearRegression”, “KNN1”, “KNN3”, “KNN5”, “KNN7”,

“KNN9”, “KNN11”, “KNN13”, “KNN15”);

## You can report the sample mean/variances of the testing errors so as to compare these models

apply(TEALL, 2, mean);

apply(TEALL, 2, var);

### END ###