## Description

Problem 1: (25 points) This problem will involve linear regression on the dataset midterm data 1.csv.

The response column is response and all other columns are features.

(a) (5 points) Load the dataset. Remove any unnecessary columns. Remove any rows that have an NA value.

Format the columns for feat.c and feat.g as categorical variables. Make pairwise plots showing the

relations between all columns. Compute the pairwise correlations between all numerical columns. Split

the dataset into a training set (75% of observations) and validation set (25% of observations).

(b) (5 points) Make a linear model using all features. How can you interpret the coefficients of feat.c?

What does R2

signify? How can you interpret the value of the residual standard error? What does the

F-statistic say about your model? Make a residual plot of the residuals vs. fitted values and comment

on what this says about validity of the linearity and constant-variance error assumptions of the model.

(c) (5 points) Make a linear model that includes all interactions between features and all quadratic terms

for numerical features. From this model, identify a reduced set of coefficients that are the most relevant

predictors. Look at the residual plot of your reduced model and comment on any observed differences

between this plot and the residual plot from part b).

(d) (5 points) Calculate the MSE value on the validation set using your full quadratic model and your

reduced model. Comment on the degree of overfitting compared to the model performance on training

data and the adequacy of your reduced model compared to your full model.

(e) (5 points) Using your reduced model, calculate a 95% confidence interval for each validation set prediction (you can do this using the predict function in R). Calculate the percentage of true observations

from your validation set that fall within your prediction interval. (For this problem, you don’t need to

print all of the confidence intervals. Please only print the final value of the number of true observations

that fall within your confidence interval).

1