Stat 432 Homework 8 solution

$30.00

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (1 vote)

Question 1 (logistic regression) [5 points]

We consider a logistic regression problem using the South Africa heart data as a demonstration. The goal is
to estimate the probability of chd, the indicator of coronary heart disease.

The following code is used to
prepare the data and fit the logistic regression.
library(ElemStatLearn)
data(SAheart)
heart = SAheart
heart$famhist = as.numeric(heart$famhist)-1
n = nrow(heart)
p = ncol(heart)
heart.full = glm(chd~., data=heart, family=binomial)
# fitted value
yhat = (heart.full$fitted.values>0.5)
table(yhat, SAheart$chd)
##
## yhat 0 1
## FALSE 256 77
## TRUE 46 83

The goal is to replicate the following summary matrix using your own code. You are not allowed to use any
built-in optimization functions. The only statistical function you might use is pnorm().

# the coefficients and significance
round(summary(heart.full)$coef, dig=3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.151 1.308 -4.701 0.000
## sbp 0.007 0.006 1.135 0.256
## tobacco 0.079 0.027 2.984 0.003
## ldl 0.174 0.060 2.915 0.004
## adiposity 0.019 0.029 0.635 0.526
## famhist 0.925 0.228 4.061 0.000
## typea 0.040 0.012 3.214 0.001
## obesity -0.063 0.044 -1.422 0.155
## alcohol 0.000 0.004 0.027 0.978
## age 0.045 0.012 3.728 0.000

You should proceed with the following steps:
• Write a function that computes the Hessian matrix
• Write a function that uses Newton–Raphson to iteratively update the parameter values and search for
the optimal solution. You should use rep(0, ncol(x)) as your initial value.

• From the solution, replicate the summary matrix displayed above.
• Change the initial value to rep(1, ncol(x)) and comment on your findings.

Question 2 (LDA and QDA)

Load the handwritten digit data with the following code to generate training and testing data
# a plot of some samples
findRows <- function(zip, n) {
# Find n (random) rows with zip representing 0,1,2,…,9
res <- vector(length=10, mode=”list”)
names(res) <- 0:9
ind <- zip[,1]
for (j in 0:9) {
res[[j+1]] <- sample( which(ind==j), n ) }
return(res)
}
set.seed(1)

# find 100 samples for each digit for both the training and testing data
train.id <- findRows(zip.train, 100)
train.id = unlist(train.id)
test.id <- findRows(zip.test, 100)
test.id = unlist(test.id)
X = zip.train[train.id, -1]
Y = zip.train[train.id, 1]
dim(X)
## [1] 1000 256
Xtest = zip.test[test.id, -1]
Ytest = zip.test[test.id, 1]
dim(Xtest)
## [1] 1000 256

a) [5 points]
We want to write a code to implement the LDA method. Obtain Wk and bk in the discriminant function
defined in page 29 of the lecture notes “Class”. Then use these quantities on the testing dataset to predict
the digit. Calculate the confusion matrix and the prediction accuracy. You are not allowed to use built-in
functions lda.

b) [extra 3 points]
We also demonstrated that QDA does not work on this dataset.
• Use one of the regularized approaches provided in the lecture note, implement a method that can
produce good prediction accuracy
• Since the problem was caused by p > n, how about we reduced the dimension of the dataset first, and
apply QDA? Compare this approach with the previous one.

Question 3 [extra 3 points]

On pages 42-44 of lecture class, we have a golf example using naive Bayes method for categorical variables.
Write a code to implement naiveBayes method for categorical variables. Automate the calculation in this
example and output the calssification result for the instance today=(Sunny, Hot, Normal, False). You
are not allowed to use built-in functions naiveBayes, but you can use it to check your answer.