Description
Problem 1. An ISyE 7406 student was asked to find the weights of two balls, A and B, using a scale with
random measurement errors. When the student measured one ball at a time, the weights of A and B were 2
lbs and 1 lb, respectively. However, when the student measured both balls simultaneously, the total weight
(A + B) was 4 lbs. The poor student was confused and decided to repeat the above measurements. The
new observed weights of A, B and A + B were 2, 2 and 5 lbs, respectively. For your information, the observed
weights are summarized in the following table:
              A    B    A + B
First Time    2    1    4
Second Time   2    2    5
(a) There are 6 observations, and we can write the observed data in the matrix form
$Y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \epsilon_{n \times 1}$
with n = 6 and p = 2. From this viewpoint, use linear regression to
(i) estimate the weights of balls A and B; and
(ii) find a 70% confidence interval on the weight of ball A.
(iii) Suppose the student plans to measure the weight of ball A one more time. Find a 70% prediction
interval on the new observed weight of ball A.
(b) Another approach is to consider the average weights, and this yields the following data table:
                                 A    B      A + B
“New Observed Average Weights”   2    1.5    4.5
Repeat part (a) by writing the “new observed average weights” in the matrix form
$Y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \epsilon_{n \times 1}$ with n = 3 and p = 2.
(c) Compare your results in (a) and (b).
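One possible way to set up the two regressions of Problem 1 in R is sketched below. It is only an illustration under stated assumptions: the indicator columns `xA` and `xB` and the object names are hypothetical choices, and the model is fitted without an intercept, which matches p = 2 and amounts to assuming that the scale has no systematic offset.

```r
# Part (a): six raw observations, ordered A, B, A + B (first time), then A, B, A + B (second time).
raw <- data.frame(
  y  = c(2, 1, 4, 2, 2, 5),
  xA = c(1, 0, 1, 1, 0, 1),   # 1 if ball A is on the scale
  xB = c(0, 1, 1, 0, 1, 1)    # 1 if ball B is on the scale
)

# "0 +" drops the intercept, so beta = (weight of A, weight of B) and p = 2.
fit_raw <- lm(y ~ 0 + xA + xB, data = raw)
coef(fit_raw)                                   # (i) estimated weights
confint(fit_raw, "xA", level = 0.70)            # (ii) 70% confidence interval for A
predict(fit_raw, newdata = data.frame(xA = 1, xB = 0),
        interval = "prediction", level = 0.70)  # (iii) 70% prediction interval for A

# Part (b): three averaged observations, ordered A, B, A + B.
avg <- data.frame(
  y  = c(2, 1.5, 4.5),
  xA = c(1, 0, 1),
  xB = c(0, 1, 1)
)
fit_avg <- lm(y ~ 0 + xA + xB, data = avg)
coef(fit_avg)
confint(fit_avg, "xA", level = 0.70)
```

Comparing `summary(fit_raw)` with `summary(fit_avg)`, in particular the residual degrees of freedom and the standard errors, is one way to approach part (c).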
Problem 2. Consider a simple linear regression model $Y = \beta_0 + \beta_1 x + \epsilon$. Suppose that we choose m
different values of the independent variable, $x_1, \dots, x_m$, and each choice of $x_i$ is replicated, yielding k independent
observations $Y_{i1}, Y_{i2}, \dots, Y_{ik}$.
Is it true that the least squares estimates of the intercept and slope can be
found by doing a regression of the mean responses, $\bar{Y}_i = (Y_{i1} + Y_{i2} + \cdots + Y_{ik})/k$, on the $x_i$'s? Why or why
not? Explain.
Hints: this is a generalization of Problem 1. There are two kinds of linear regressions: one is based on
a total of n = mk “raw” observations $(Y_{ij}, x_i)$, and the other is based on the m “average” observations
$(\bar{Y}_i, x_i)$. See the hint pdf file for more details.
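Before attempting a proof, it can help to check the claim numerically. The sketch below is a sanity check on simulated data, not an argument: the values of m, k, the design points, the true coefficients and the seed are all arbitrary assumptions made for illustration.

```r
set.seed(1)
m <- 5                        # number of distinct x values (assumed)
k <- 3                        # replicates per x value (assumed)
x <- c(1, 2, 4, 7, 10)        # arbitrary design points

# n = m * k raw observations from Y = 1 + 2 x + noise
raw <- data.frame(x = rep(x, each = k))
raw$y <- 1 + 2 * raw$x + rnorm(m * k)

# m averaged observations: one mean response per distinct x
avg <- aggregate(y ~ x, data = raw, FUN = mean)

fit_raw <- lm(y ~ x, data = raw)
fit_avg <- lm(y ~ x, data = avg)

# Compare the two sets of least squares estimates
rbind(raw = coef(fit_raw), avg = coef(fit_avg))
all.equal(coef(fit_raw), coef(fit_avg))
```

Changing the design so that the number of replicates differs across the $x_i$ values is a useful follow-up experiment when writing up the explanation.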
Problem 3 (R exercise). Consider the zipcode data, which are available from the book website:
www-stat.stanford.edu/ElemStatLearn. You can also find them on Canvas. In the zipcode data, the first column
is the response (Y) and the other columns are the independent variables (the $X_i$'s).
A detailed description can be found at
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.info.txt
Here we consider only the classification problem between 2’s and 7’s.
(1) Let us first obtain the training data. The following R code yields the desired training data, named
“ziptrain27”:
ziptrain <- read.table(file="http://www.isye.gatech.edu/~ymei/7406/Handouts/zip.train.csv",
                       sep = ",");
ziptrain27 <- subset(ziptrain, ziptrain[,1]==2 | ziptrain[,1]==7);
(2) Exploratory Data Analysis. Play with the training data “ziptrain27” and report the summary
information/statistics of the training data that you think are important or interesting. Some of the following R
code might be useful, but please do not copy and paste the R results; be selective, and write
a few sentences in your own words to summarize the important or interesting results.
dim(ziptrain27);
sum(ziptrain27[,1] == 2);
summary(ziptrain27);
round(cor(ziptrain27),2);
## To see the digit image of the 5th row, reshape that row of pixel values into a 16 x 16 matrix
rowindex = 5; ## You can try other "rowindex" values to see other rows
ziptrain27[rowindex,1];
Xval = t(matrix(data.matrix(ziptrain27[,-1])[rowindex,],byrow=TRUE,16,16)[16:1,]);
image(Xval,col=gray(0:1),axes=FALSE) ## Also try "col=gray(0:32/32)"
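Beyond looking at single rows, one exploratory view that may be informative is the average image of each digit. The helper below is a hypothetical addition (the name `mean_digit_image` is not part of the assignment); it assumes the label is in column 1 and the 256 pixel values follow, and it reuses the same reshaping and orientation as the snippet above.

```r
# Average the 256 pixel columns over all rows with the given label and
# reshape the result into a 16 x 16 matrix, oriented as in the snippet above.
mean_digit_image <- function(data, digit) {
  rows <- data[data[, 1] == digit, -1, drop = FALSE]
  avg  <- colMeans(data.matrix(rows))
  t(matrix(avg, nrow = 16, ncol = 16, byrow = TRUE)[16:1, ])
}

# Usage, assuming ziptrain27 has been created as in step (1):
# par(mfrow = c(1, 2))
# image(mean_digit_image(ziptrain27, 2), col = gray(0:32/32), axes = FALSE)
# image(mean_digit_image(ziptrain27, 7), col = gray(0:32/32), axes = FALSE)
```

Pixels where the two average images differ most are natural candidates to mention in the written summary.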
(3) Use the training data “ziptrain27” to build classification rules by (i) linear regression; and (ii)
KNN with k = 1, 3, 5, 7 and 15. Find the training error of each choice.
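### Linear regression
mod1 <- lm( V1 ~ . , data= ziptrain27);
pred1.train <- predict.lm(mod1, ziptrain27[,-1]);
## Classify as 7 when the fitted value is at least 4.5 (the midpoint of 2 and 7), otherwise as 2
y1pred.train <- 2 + 5*(pred1.train >= 4.5);
## Training error: proportion of misclassified observations
mean( y1pred.train != ziptrain27[,1]);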
## KNN
library(class);
kk <- 1;
xnew <- ziptrain27[,-1];
ypred2.train <- knn(ziptrain27[,-1], xnew, ziptrain27[,1], k=kk);
mean( ypred2.train != ziptrain27[,1])
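The KNN snippet above handles a single value of k. One way to cover all the requested values is to wrap the same calls in a small helper, sketched below; the function name `knn_errors` and its arguments are hypothetical, and both data frames are assumed to hold the label in column 1.

```r
library(class)

# Misclassification rate of KNN for each k in `ks`.
# `train` supplies the neighbors; `newdata` is the set being classified.
knn_errors <- function(train, newdata, ks = c(1, 3, 5, 7, 15)) {
  errs <- sapply(ks, function(k) {
    pred <- knn(train[, -1], newdata[, -1], train[, 1], k = k)
    # knn() returns a factor, so compare labels as character strings
    mean(as.character(pred) != as.character(newdata[, 1]))
  })
  setNames(errs, paste0("k=", ks))
}

# Usage, assuming ziptrain27 has been created as in step (1):
# knn_errors(ziptrain27, ziptrain27)   # training errors
```

Note that `knn()` breaks ties at random, so results for a given k may vary slightly between runs unless a seed is set.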
(4) Let us consider the testing data set, and derive the testing errors of each classification rule in (3).
ziptest <- read.table(file="http://www.isye.gatech.edu/~ymei/7406/Handouts/zip.test.csv",
                      sep = ",");
ziptest27 <- subset(ziptest, ziptest[,1]==2 | ziptest[,1]==7);
## Testing error of KNN
kk <- 1;
xnew2 <- ziptest27[,-1];
ypred2.test <- knn(ziptrain27[,-1], xnew2, ziptrain27[,1], k=kk);
mean( ypred2.test != ziptest27[,1])
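The code above covers only the KNN testing error. A matching sketch for the linear regression rule is given below; the helper name `lm_error` is hypothetical, and it assumes both data frames use the default `read.table` column names, so that the response is `V1`.

```r
# Fit linear regression on `train`, classify `newdata` with the 4.5 cutoff
# used in step (3), and return the misclassification rate.
lm_error <- function(train, newdata) {
  fit  <- lm(V1 ~ ., data = train)
  pred <- predict(fit, newdata[, -1, drop = FALSE])
  yhat <- 2 + 5 * (pred >= 4.5)   # 7 if fitted value >= 4.5, otherwise 2
  mean(yhat != newdata[, 1])
}

# Usage, assuming ziptrain27 and ziptest27 have been created as above:
# lm_error(ziptrain27, ziptrain27)   # training error
# lm_error(ziptrain27, ziptest27)    # testing error
```

If some pixel columns turn out to be collinear, `predict()` may warn about a rank-deficient fit; the predictions are still returned.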
Based on the above analysis, write a paragraph or two briefly summarizing what you discovered.