## Description

Exercise 1 (EPA Emissions Data)

For this exercise, we will use the data stored in epa2017.csv. It contains detailed descriptions of vehicles

manufactured in 2017 that were used for fuel economy testing as performed by the Environment Protection

Agency. The variables in the dataset are:

• Make – Manufacturer

• Model – Model of vehicle

• ID – Manufacturer defined vehicle identification number within EPA’s computer system (not a VIN

number)

• disp – Cubic inch displacement of test vehicle

• type – Car, truck, or both (for vehicles that meet specifications of both car and truck, like smaller

SUVs or crossovers)

• horse – Rated horsepower, in foot-pounds per second

• cyl – Number of cylinders

• lockup – Vehicle has transmission lockup; N or Y

• drive – Drivetrain system code

– A = All-wheel drive

– F = Front-wheel drive

– P = Part-time 4-wheel drive

– R = Rear-wheel drive

– 4 = 4-wheel drive

• weight – Test weight, in pounds

• axleratio – Axle ratio

• nvratio – n/v ratio (engine speed versus vehicle speed at 50 mph)

• THC – Total hydrocarbons, in grams per mile (g/mi)

• CO – Carbon monoxide (a regulated pollutant), in g/mi

• CO2 – Carbon dioxide (the primary byproduct of all fossil fuel combustion), in g/mi

• mpg – Fuel economy, in miles per gallon

1

We will attempt to model CO2 using both horse and type. In practice, we would use many more predictors,

but limiting ourselves to these two, one numeric and one factor, will allow us to create a number of plots.

Load the data, and check its structure using str(). Verify that type is a factor; if not, coerce it to be a

factor.

(a) Do the following:

• Make a scatterplot of CO2 versus horse. Use a different color point for each vehicle type.

• Fit a simple linear regression model with CO2 as the response and only horse as the predictor.

• Add the fitted regression line to the scatterplot. Comment on how well this line models the data.

• Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for

a vehicle of type car.

• Give a 90% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a

vehicle with 148 horsepower and is considered type Both. (Interestingly, the dataset gives the wrong

drivetrain for most Subarus in this dataset, as they are almost all listed as F, when they are in fact

all-wheel drive.)

(b) Do the following:

• Make a scatterplot of CO2 versus horse. Use a different color point for each vehicle type.

• Fit an additive multiple regression model with CO2 as the response and horse and type as the predictors.

• Add the fitted regression “lines” to the scatterplot with the same colors as their respective points (one

line for each vehicle type). Comment on how well this line models the data.

• Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for

a vehicle of type car.

• Give a 90% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a

vehicle with 148 horsepower and is considered type Both.

(c) Do the following:

• Make a scatterplot of CO2 versus horse. Use a different color point for each vehicle type.

• Fit an interaction multiple regression model with CO2 as the response and horse and type as the

predictors.

• Add the fitted regression “lines” to the scatterplot with the same colors as their respective points (one

line for each vehicle type). Comment on how well this line models the data.

• Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for

a vehicle of type car.

• Give a 90% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a

vehicle with 148 horsepower and is considered type Both.

(d) Based on the previous plots, you probably already have an opinion on the best model. Now use an

ANOVA F-test to compare the additive and interaction models. Based on this test and a significance level

of α = 0.10, which model is preferred?

Exercise 2 (Hospital SUPPORT Data, White Blood Cells)

For this exercise, we will use the data stored in hospital.csv. It contains a random sample of 580 seriously ill

hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses Preferences

Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what

factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables

in the dataset are:

2

• Days – Days to death or hospital discharge

• Age – Age on day of hospital admission

• Sex – Female or male

• Comorbidity – Patient diagnosed with more than one chronic disease

• EdYears – Years of education

• Education – Education level; high or low

• Income – Income level; high or low

• Charges – Hospital charges, in dollars

• Care – Level of care required; high or low

• Race – Non-white or white

• Pressure – Blood pressure, in mmHg

• Blood – White blood cell count, in gm/dL

• Rate – Heart rate, in bpm

For this exercise, we will use Age, Education, Income, and Sex in an attempt to model Blood. Essentially,

we are attempting to model white blood cell count using only demographic information.

(a) Load the data, and check its structure using str(). Verify that Education, Income, and Sex are factors;

if not, coerce them to be factors. What are the levels of Education, Income, and Sex?

(b) Fit an additive multiple regression model with Blood as the response using Age, Education, Income,

and Sex as predictors. What does R choose as the reference level for Education, Income, and Sex?

(c) Fit a multiple regression model with Blood as the response. Use the main effects of Age, Education,

Income, and Sex, as well as the interaction of Sex with Age and the interaction of Sex and Income. Use a

statistical test to compare this model to the additive model using a significance level of α = 0.10. Which do

you prefer?

(d) Fit a model similar to that in (c), but additionally add the interaction between Income and Age as well

as a three-way interaction between Age, Income, and Sex. Use a statistical test to compare this model to

the preferred model from (c) using a significance level of α = 0.10. Which do you prefer?

(e) Using the model in (d), give an estimate of the change in average Blood for a one-unit increase in Age

for a highly educated, low income, male patient.

Exercise 3 (Hospital SUPPORT Data, Stay Duration)

For this exercise, we will again use the data stored in hospital.csv. It contains a random sample of 580

seriously ill hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses

Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to

determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital.

The variables in the dataset are:

• Days – Days to death or hospital discharge

• Age – Age on day of hospital admission

• Sex – Female or male

• Comorbidity – Patient diagnosed with more than one chronic disease

• EdYears – Years of education

• Education – Education level; high or low

• Income – Income level; high or low

• Charges – Hospital charges, in dollars

• Care – Level of care required; high or low

• Race – Non-white or white

3

• Pressure – Blood pressure, in mmHg

• Blood – White blood cell count, in gm/dL

• Rate – Heart rate, in bpm

For this exercise, we will use Blood, Pressure, and Rate in an attempt to model Days. Essentially, we are

attempting to model the time spent in the hospital using only health metrics measured at the hospital.

Consider the model

Y = β0 + β1×1 + β2×2 + β3×3 + β4x1x2 + β5x1x3 + β6x2x3 + β7x1x2x3 + ,

where

• Y is Days

• x1 is Blood

• x2 is Pressure

• x3 is Rate.

(a) Fit the model above. Also fit a smaller model using the provided R code.

days_add = lm(Days ~ Pressure + Blood + Rate, data = hospital)

Use a statistical test to compare the two models. Report the following:

• The null and alternative hypotheses in terms of the model given in the exercise description

• The value of the test statistic

• The p-value of the test

• A statistical decision using a significance level of α = 0.10

• Which model you prefer

(b) Give an expression based on the model in the exercise description for the true change in length of hospital

stay in days for a 1 bpm increase in Rate for a patient with a Pressure of 139 mmHg and a Blood of 10

gm/dL. Your answer should be a linear function of the βs.

(c) Give an expression based on the additive model in part (a) for the true change in length of hospital stay

in days for a 1 bpm increase in Rate for a patient with a Pressure of 139 mmHg and a Blood of 10 gm/dL.

Your answer should be a linear function of the βs.

Exercise 4 (t-test Is a Linear Model)

In this exercise, we will try to convince ourselves that a two-sample t-test assuming equal variance is the

same as a t-test for the coefficient in front of a single two-level factor variable (dummy variable) in a linear

model.

First, we set up the data frame that we will use throughout.

n = 30

sim_data = data.frame(

groups = c(rep(“A”, n / 2), rep(“B”, n / 2)),

values = rep(0, n))

str(sim_data)

4

## ‘data.frame’: 30 obs. of 2 variables:

## $ groups: chr “A” “A” “A” “A” …

## $ values: num 0 0 0 0 0 0 0 0 0 0 …

We will use a total sample size of 30, 15 for each group. The groups variable splits the data into two groups,

A and B, which will be the grouping variable for the t-test and a factor variable in a regression. The values

variable will store simulated data.

We will repeat the following process a number of times.

set.seed(20)

sim_data$values = rnorm(n, mean = 42, sd = 3.5) # simulate response data

summary(lm(values ~ groups, data = sim_data))

##

## Call:

## lm(formula = values ~ groups, data = sim_data)

##

## Residuals:

## Min 1Q Median 3Q Max

## -9.04 -1.11 -0.14 2.23 7.33

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 40.922 0.950 43.07 <2e-16 ***

## groupsB 0.029 1.344 0.02 0.98

## —

## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

##

## Residual standard error: 3.68 on 28 degrees of freedom

## Multiple R-squared: 1.66e-05, Adjusted R-squared: -0.0357

## F-statistic: 0.000465 on 1 and 28 DF, p-value: 0.983

t.test(values ~ groups, data = sim_data, var.equal = TRUE)

##

## Two Sample t-test

##

## data: values by groups

## t = -0.022, df = 28, p-value = 1

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## -2.781 2.723

## sample estimates:

## mean in group A mean in group B

## 40.92 40.95

We use lm() to test

H0 : β1 = 0

for the model

5

Y = β0 + β1×1 +

where Y is the values of interest, and x1 is a dummy variable that splits the data in two. We will let R take

care of the dummy variable.

We use t.test() to test

H0 : µA = µB

where µA is the mean for the A group, and µB is the mean for the B group.

The following code sets up some variables for storage.

num_sims = 300

lm_t = rep(0, num_sims)

lm_p = rep(0, num_sims)

tt_t = rep(0, num_sims)

tt_p = rep(0, num_sims)

• lm_t will store the test statistic for the test H0 : β1 = 0.

• lm_p will store the p-value for the test H0 : β1 = 0.

• tt_t will store the test statistic for the test H0 : µA = µB.

• tt_p will store the p-value for the test H0 : µA = µB.

The variable num_sims controls how many times we will repeat this process, which we have chosen to be

300.

(a) Set a seed equal to your birthday. Then write code that repeats the above process 300 times. Each

time, store the appropriate values in lm_t, lm_p, tt_t, and tt_p. Specifically, each time you should use

sim_data$values = rnorm(n, mean = 42, sd = 3.5) to update the data. The grouping will always stay

the same.

(b) Report the value obtained by running mean(lm_t == tt_t), which tells us what proportion of the test

statistics is equal. The result may be extremely surprising!

(c) Report the value obtained by running mean(lm_p == tt_p), which tells us what proportion of the

p-values is equal. The result may be extremely surprising!

(d) If you have done everything correctly so far, your answers to the last two parts won’t indicate the

equivalence we want to show! What the heck is going on here? The first issue is one of using a computer

to do calculations. When a computer checks for equality, it demands equality; nothing can be different.

However, when a computer performs calculations, it can only do so with a certain level of precision. So, if we

calculate two quantities we know to be analytically equal, they can differ numerically. Instead of mean(lm_p

== tt_p) run all.equal(lm_p, tt_p). This will perform a similar calculation, but with a very small error

tolerance for each equality. What is the result of running this code? What does it mean?

(e) Your answer in (d) should now make much more sense. Then what is going on with the test statistics?

Look at the values stored in lm_t and tt_t. What do you notice? Is there a relationship between the two?

Can you explain why this is happening?

6