Description
1 Recitation Problems
These problems are to be found in: Introduction to Statistical Learning,
7th Printing (Online Edition) by Gareth James, Daniela Witten, Trevor
Hastie, Robert Tibshirani.
1.1 Chapter 3
Problems: 1, 3, 4
1.2 Chapter 4
Problems: 4, 6, 7, 9
2 Practicum Problems
These problems will primarily reference the lecture materials and the examples
given in class using R and CRAN. It is suggested that an RStudio session be
used for the programmatic components.
2.1 Problem 1
Load the Boston sample dataset into R using a dataframe (it is part of the
MASS package). Use lm to fit a regression between medv and lstat – plot the
resulting fit and show a plot of fitted values vs. residuals. Is there a possible
non-linear relationship between the predictor and response? Use the predict
function to calculate response values for lstat of 5, 10, and 15 – obtain
confidence intervals as well as prediction intervals for the results – are they the
same? Why or why not? Modify the regression to include lstat^2 (as well as lstat
itself) and compare the R^2 between the linear and non-linear fit – use ggplot2
and stat_smooth to plot the relationship.
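As a starting point for Problem 1, the following is a minimal sketch of the lm and predict calls, not a complete solution. The object names (fit, new_obs, conf_int, pred_int, fit_quad, r2) are illustrative choices rather than anything prescribed in class, and the ggplot2 portion is left out.

```r
library(MASS)  # ships with R and provides the Boston data frame

# Simple linear regression of median home value on lstat
fit <- lm(medv ~ lstat, data = Boston)

# Fitted values vs. residuals (curvature here hints at non-linearity)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")

# Intervals at the three requested lstat values
new_obs  <- data.frame(lstat = c(5, 10, 15))
conf_int <- predict(fit, newdata = new_obs, interval = "confidence")
pred_int <- predict(fit, newdata = new_obs, interval = "prediction")

# Quadratic fit: I() keeps ^ as arithmetic inside the formula
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)
r2 <- c(linear    = summary(fit)$r.squared,
        quadratic = summary(fit_quad)$r.squared)

print(conf_int)
print(pred_int)
print(r2)
```

Both interval types share the same point estimate (the fit column); comparing the lwr and upr columns of the two matrices is one way to approach the "are they the same?" question.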
2.2 Problem 2
Load the abalone sample dataset from the UCI Machine Learning Repository
(abalone.data) into R using a dataframe. Remove all observations in the
Infant category, keeping the Male/Female classes. Using the caret package, use
createDataPartition to perform an 80/20 train-test split (80% training and 20%
testing). Fit a logistic regression using all feature variables via glm, and observe
which predictors are relevant. Do the confidence intervals for the predictor
coefficients contain 0? How does this relate to the null hypothesis? Use the
confusionMatrix function in caret to observe testing results (use a 50% cutoff to
tag Male/Female) – how does the accuracy compare to a random classifier ROC
curve? Use the corrplot package to plot correlations between the predictors.
How does this help explain the classifier performance?
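For Problem 2, the data preparation and split could be sketched roughly as follows. Several things here are assumptions rather than requirements: abalone.data is taken to be already downloaded into the working directory, the column names in abalone_cols are illustrative labels (the raw file has no header row), drop_infants is a hypothetical helper, and Wald intervals from confint.default are just one of the possible interval choices. The corrplot step is not covered.

```r
# Assumed column labels; the raw file has no header row.
abalone_cols <- c("Sex", "Length", "Diameter", "Height", "WholeWeight",
                  "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")

# Hypothetical helper: drop the Infant class and keep Sex as a
# two-level factor with levels F and M.
drop_infants <- function(df) {
  df <- df[df$Sex != "I", , drop = FALSE]
  df$Sex <- factor(df$Sex, levels = c("F", "M"))
  df
}

# Runs only if the file is present in the working directory (assumption).
if (file.exists("abalone.data")) {
  abalone <- read.csv("abalone.data", header = FALSE, col.names = abalone_cols)
  abalone <- drop_infants(abalone)

  set.seed(1)  # arbitrary seed, only for reproducibility
  train_idx <- caret::createDataPartition(abalone$Sex, p = 0.8, list = FALSE)
  train <- abalone[train_idx, ]
  test  <- abalone[-train_idx, ]

  # With a factor response, glm models the probability of the second level (M)
  fit <- glm(Sex ~ ., data = train, family = binomial)
  print(summary(fit))
  print(confint.default(fit))  # Wald confidence intervals for the coefficients

  # 50% cutoff to tag Male/Female
  prob_male <- predict(fit, newdata = test, type = "response")
  pred <- factor(ifelse(prob_male > 0.5, "M", "F"), levels = c("F", "M"))
  print(caret::confusionMatrix(pred, test$Sex))
}
```

createDataPartition samples within each class, so the Male/Female proportions should be roughly preserved in both partitions.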
2.3 Problem 3
Load the mushroom sample dataset from the UCI Machine Learning Repository
(agaricus-lepiota.data) into R using a dataframe (Note: there are missing
values marked with a ? character; you will have to explain your handling of these).
Create a Naive Bayes classifier using the e1071 package, using the sample
function to split the data between 80% for training and 20% for testing. With the
target class of interest being edible mushrooms, calculate the accuracy of the
classifier both in-training and in-test. Use the table function to create a
confusion matrix of predicted vs. actual classes – how many false positives did the
model produce?
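A rough sketch of the Problem 3 workflow is given below, under several assumptions: agaricus-lepiota.data is taken to be in the working directory, the first column is assumed to hold the class label (e for edible, p for poisonous), and the ? entries are simply read as NA, which is only one possible way of handling them and still needs to be justified in your write-up. The helpers split_rows and count_false_positives are hypothetical names introduced for illustration.

```r
# Hypothetical helper: row indices for the training portion, drawn with sample().
split_rows <- function(n, train_frac = 0.8) {
  sample(seq_len(n), size = floor(train_frac * n))
}

# Hypothetical helper: with edible ("e") as the positive class, a false
# positive is a mushroom predicted "e" whose actual class is "p".
count_false_positives <- function(predicted, actual) {
  cm <- table(predicted = factor(predicted, levels = c("e", "p")),
              actual    = factor(actual,    levels = c("e", "p")))
  cm["e", "p"]
}

# Runs only if the file is present in the working directory (assumption).
if (file.exists("agaricus-lepiota.data")) {
  mush <- read.csv("agaricus-lepiota.data", header = FALSE,
                   na.strings = "?", stringsAsFactors = TRUE)
  names(mush)[1] <- "class"  # assumed: first column is the e/p label

  set.seed(1)  # arbitrary seed, only for reproducibility
  train_idx <- split_rows(nrow(mush))
  train <- mush[train_idx, ]
  test  <- mush[-train_idx, ]

  nb <- e1071::naiveBayes(class ~ ., data = train)
  train_pred <- predict(nb, train)
  test_pred  <- predict(nb, test)

  cat("In-training accuracy:", mean(train_pred == train$class), "\n")
  cat("In-test accuracy:    ", mean(test_pred == test$class), "\n")
  print(table(predicted = test_pred, actual = test$class))
  cat("False positives:", count_false_positives(test_pred, test$class), "\n")
}
```

Fixing the factor levels inside count_false_positives keeps the table 2x2 even if one class never appears among the predictions.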