# DATA 303 Assignment 4 solution


## Background and Data

Heart disease is the leading cause of death worldwide each year, accounting for more than 25% of deaths in 2016 (World Health Organization 2018). It is also a significant economic burden on the healthcare system: Nichols et al. (2010) estimated that heart disease and other cardiovascular diseases cost an average of roughly USD \$19,000 per patient, based on a study in the United States over the period 2000-2005. Early detection of heart disease (along with many other diseases) is important in terms of reducing both mortality and costs to the healthcare system.

We will examine data on 4,240 participants in the Framingham Heart Study (Boston University and the National Heart, Lung, & Blood Institute 2020), an ongoing study that began in 1948 and has been instrumental in the identification of a number of risk factors for heart disease and other cardiovascular diseases. The data are available in the file Framingham Heart Study.xlsx, which can be read into R using the code below, with the path changed to point to the location of the file on your computer. A full list of variables contained in the dataset, with descriptions of these variables, is also provided, both here and in the Excel file.

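    # Read in the heart disease dataset.
    # NOTE: the opening of this call is missing from the source. The function
    # (readxl::read_excel), the object name (hd), and the placeholder path
    # below are assumptions; only the sheet and na arguments are original.
    hd <- readxl::read_excel("path/to/Framingham Heart Study.xlsx",
                             sheet = "Data", na = "NA")
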
Table 1: Variables and their descriptions for data contained in the file Framingham Heart Study.xlsx.

| Variable | Description |
|---|---|
| SEX | Sex of the individual (0 = “Female”, 1 = “Male”). |
| AGE | Age (in years) of the individual at the time of the health exam. |
| EDUC | Highest level of education of the individual (1 = “Some high school”, 2 = “High school or Graduate Equivalency Diploma”, 3 = “Some university or vocational school”, 4 = “University”). |
| SMOKER | Indicator of whether or not the individual is a current smoker (0 = “No”, 1 = “Yes”). |
| CIG | Average number of cigarettes that the individual smokes each day. |
| BP_MED | Indicator of whether or not the individual is on blood pressure medication (0 = “No”, 1 = “Yes”). |
| STROKE | Indicator of whether or not the individual previously had a stroke (0 = “No”, 1 = “Yes”). |
| HYPER | Indicator of whether or not the individual was hypertensive (0 = “No”, 1 = “Yes”). |
| DIAB | Indicator of whether or not the individual is diabetic (0 = “No”, 1 = “Yes”). |
| CHOL | Total cholesterol level (in mg/dL). |
| SBP | Systolic blood pressure (in mmHg). |
| DBP | Diastolic blood pressure (in mmHg). |
| BMI | Body mass index. |
| HR | Resting heart rate (in beats per minute). |
| GLUC | Glucose level (in mg/dL). |
| HD_RISK | Indicator of whether the individual has 10-year risk of future coronary heart disease (0 = “No”, 1 = “Yes”). |

Our focus will be on 10-year risk of coronary heart disease (CHD). Ten-year risk of CHD is a predicted risk (i.e., a probability ranging between 0 and 1) of developing CHD within the next 10 years. Although this is not an observed outcome but rather an estimated value, 10-year risk of CHD is a well-established measure in the medical community. We will consider a binary version of this variable which indicates whether or not a person would be considered as at risk of developing CHD within the next 10 years.

## 1. Missing data and variable recode: (10 marks)

Although our objective will be to consider inferential and predictive models for 10-year risk of CHD, we
will first ensure that we understand aspects of the underlying data as well as create a new variable
that may prove useful in producing comparisons of 10-year risk of CHD for medically-meaningful blood
pressure ranges. (In practice, we would want to examine each relevant variable to identify extreme
observations and be sure that there are not any erroneous values. As this dataset has already been
cleaned, we will not do so for this assignment.)

a. (2 marks) First, perform an analysis of the level of missing data for each variable. For only those variables for which there are missing data, produce a table of the form shown below, where VARIABLE_i is the name of the variable with missing data, $n_i$ is the count of missing observations for that variable, and $p_i$ is the proportion (to 5dp) of missing observations for that variable. Which variable has the highest level of missing data?

Table 2: Frequency and proportion of missing values for variables with missing data.

| Variable | VARIABLE_1 | VARIABLE_2 | . . . | VARIABLE_k |
|---|---|---|---|---|
| Frequency ($n$) | $n_1$ | $n_2$ | . . . | $n_k$ |
| Proportion ($p$) | $p_1$ | $p_2$ | . . . | $p_k$ |
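
As a rough illustration of the layout of Table 2, the sketch below tabulates missing values for a small made-up data frame. The object `toy` and its values are hypothetical stand-ins, not the Framingham data, and this is only one of several reasonable ways to build such a table in base R.

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  AGE  = c(39, 46, NA, 61, 46),
  CHOL = c(195, NA, 245, NA, 285),
  SBP  = c(106, 121, 127.5, 150, 130)
)

n_missing <- colSums(is.na(toy))        # count of NAs in each column
n_missing <- n_missing[n_missing > 0]   # keep only variables with missing data
p_missing <- round(n_missing / nrow(toy), 5)

# Variables as columns, matching the layout of Table 2.
missing_table <- rbind("Frequency (n)"  = n_missing,
                       "Proportion (p)" = p_missing)
print(missing_table)

# Name of the variable with the most missing values.
worst <- names(which.max(n_missing))
print(worst)
```

Filtering on `n_missing > 0` before building the table is what restricts the output to variables that actually have missing data.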

b. (3 marks) Create a new data frame called hd.complete, which only keeps people/observations
that have no missing data. In total, what proportion (to 5dp) of people have been removed from
the original dataset to produce this final data frame?
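
A minimal sketch of complete-case filtering, again on a hypothetical toy data frame rather than the real file: `complete.cases()` flags rows with no missing values, and the proportion removed follows from the row counts before and after.

```r
# Toy data frame (made-up values) with some missing entries.
toy <- data.frame(
  AGE  = c(39, 46, NA, 61, 46),
  CHOL = c(195, NA, 245, NA, 285),
  SBP  = c(106, 121, 127.5, 150, 130)
)

# Keep only rows in which every variable is observed.
toy_complete <- toy[complete.cases(toy), ]

# Proportion of rows dropped, to 5 decimal places.
prop_removed <- round(1 - nrow(toy_complete) / nrow(toy), 5)
print(prop_removed)
```

Note that this proportion is generally not the sum of the per-variable proportions, because a single row can be missing more than one variable.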

c. (3 marks) Add a variable to the data frame hd.complete called SBP_CAT, which converts systolic blood pressure (SBP) from a numeric variable to a categorical variable according to the blood pressure ranges specified by Madell and Cherney (2018). (See references listed at the end of the assignment.) For the purposes of coding SBP_CAT, you can assume that the values for each blood pressure category go to just below that of the next category, as our dataset does not consist of blood pressures that are rounded to the nearest whole number. This means that, for instance, the systolic blood pressure range of 120 – 129 should in fact be interpreted as 120 – < 130. This should produce five levels (i.e., blood pressure ranges) for SBP_CAT. (Note that the final level corresponds to systolic blood pressure above 180 mmHg.) Produce a table for SBP_CAT which shows how many observations fall into each blood pressure range.
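
The "just below the next category" convention can be sketched with `cut()` and `right = FALSE`, which makes each interval closed on the left and open on the right. In the sketch below only the 120 to < 130 band and the 180 boundary come from the text above; the cut points at 130 and 140, the labels, and the vector `sbp` are assumptions to be checked against Madell and Cherney (2018).

```r
# Made-up systolic blood pressures, chosen to sit on and around the boundaries.
sbp <- c(106, 119.5, 120, 129.5, 130, 139.9, 140, 179, 180, 195)

# Assumed cut points; verify against the cited reference before use.
breaks <- c(-Inf, 120, 130, 140, 180, Inf)
labels <- c("<120", "120-<130", "130-<140", "140-<180", ">=180")

# right = FALSE gives intervals of the form [lower, upper).
sbp_cat <- cut(sbp, breaks = breaks, labels = labels, right = FALSE)

counts <- table(sbp_cat)
print(counts)
```

With `right = FALSE`, a reading of exactly 180 lands in the top level. If the top level is meant to be strictly above 180, that one boundary would need separate handling.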

d. (2 marks) Explain when we would expect that using the categorical variable SBP_CAT rather
than the numeric variable SBP would lead to a better fit for a regression model (whether logistic
regression, linear regression, or Poisson regression).

## 2. Inferential analysis: (25 marks)

Now we will focus on 10-year risk of CHD and look at the role that blood pressure may play in whether
or not someone is considered to be at risk of developing CHD within the next 10 years.

a. (3 marks) We will first consider a logistic regression model of 10-year risk of CHD (HD_RISK) on systolic blood pressure (SBP) and diastolic blood pressure (DBP). Previous research suggests that the following variables are potential confounders for the true relationship between blood pressure and 10-year risk of CHD and should also be included in the logistic regression model:
• sex of the individual (SEX)
• age of the individual (AGE)
• highest level of education of the individual (EDUC)
• average number of cigarettes smoked per day (CIG)
• total cholesterol level (CHOL)
• body mass index (BMI)
• glucose level (GLUC)

For this logistic regression model, calculate the variance inflation factors for predictors (to 3dp) to
determine whether or not there is evidence of significant multicollinearity among the predictors
in the model. If so, comment on which predictor(s) should be removed, and use this model for
subsequent parts of this question.
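
To show what a variance inflation factor measures, the sketch below computes it by hand on simulated predictors: each predictor is regressed on the others, and VIF = 1 / (1 - R²). The data and coefficients are invented for illustration. For a fitted `glm` with factor predictors such as EDUC, a packaged routine like `car::vif()` (which reports generalised VIFs) is the more usual choice, so treat this as a conceptual sketch rather than the required method.

```r
set.seed(303)
n   <- 500
sbp <- rnorm(n, mean = 130, sd = 20)
dbp <- 0.5 * sbp + rnorm(n, mean = 18, sd = 8)  # deliberately correlated with sbp
age <- rnorm(n, mean = 50, sd = 8)              # independent of both
X   <- data.frame(SBP = sbp, DBP = dbp, AGE = age)

# VIF for predictor v: regress v on all other predictors, then 1 / (1 - R^2).
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
```

Here the correlated pair (SBP, DBP) shows inflated values while AGE stays close to 1, which is the pattern to look for when judging multicollinearity.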

b. (3 marks) Using your model from part (a), produce a table of logistic regression model output and write out the estimated logistic regression equation using the form

$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k,$$

where you clearly define the variables $X_1, X_2, \ldots, X_k$ and replace $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ with their estimated values (to 4dp).

c. (6 marks) Carry out Wald tests for the coefficients for
• systolic blood pressure and
• diastolic blood pressure.
For each coefficient, clearly state
i. the hypotheses you are testing,
ii. the value of the test statistic,
iii. the p-value, and
iv. your conclusion in terms of whether the “effect” of the predictor on the response is statistically
significant.
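
For reference, a Wald test for a single coefficient tests H0: β = 0 against H1: β ≠ 0 using z = estimate / standard error, with a two-sided p-value from the standard normal distribution. The sketch below reproduces this by hand on simulated data (the predictor `x` and the true coefficients are hypothetical), so the pieces reported by `summary()` can be matched to the formula.

```r
set.seed(303)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))   # simulated binary response

fit      <- glm(y ~ x, family = binomial)
coef_tab <- coef(summary(fit))            # columns: Estimate, Std. Error, z value, Pr(>|z|)

est <- coef_tab["x", "Estimate"]
se  <- coef_tab["x", "Std. Error"]

z_stat  <- est / se                       # Wald test statistic
p_value <- 2 * pnorm(-abs(z_stat))        # two-sided p-value under N(0, 1)

print(c(z = z_stat, p = p_value))
```

The hand-computed values agree with the `z value` and `Pr(>|z|)` columns of the coefficient table.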

d. (3 marks) For any significant Wald tests in part (c), provide a precise interpretation of what the estimated coefficient suggests about the “effect” of the predictor on the response, and calculate a corresponding 95% confidence interval (to 3dp) for the estimated “effect”.
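
One common way to express the "effect" in a logistic regression is as an odds ratio, exp(β̂), with a Wald interval obtained by exponentiating β̂ ± 1.96 × SE. The sketch below does this on simulated data. The variable `sbp` and the generating coefficients are invented, and whether a Wald or profile-likelihood interval is expected here is an assumption.

```r
set.seed(303)
n   <- 1000
sbp <- rnorm(n, mean = 130, sd = 20)
y   <- rbinom(n, 1, plogis(-4 + 0.025 * sbp))   # simulated binary response

fit <- glm(y ~ sbp, family = binomial)

est <- unname(coef(fit)["sbp"])
se  <- unname(sqrt(diag(vcov(fit)))["sbp"])

# Odds ratio for a one-unit increase, with a 95% Wald interval.
odds_ratio <- exp(est)
ci_95      <- exp(est + c(-1, 1) * qnorm(0.975) * se)

print(round(c(OR = odds_ratio, lower = ci_95[1], upper = ci_95[2]), 3))
```

The interval is built on the log-odds scale and then exponentiated, so it is not symmetric about the odds ratio. It matches `exp(confint.default(fit))`, which also uses the normal approximation.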

e. (4 marks) Wu et al. (2015) found that “cardiovascular and expanded-cardiovascular mortality risks were lowest when systolic blood pressures were 120 to 129 mm Hg, and increased significantly when systolic blood pressures (SBPs) were ≥ 160 mm Hg. . . .”

Although Wu et al. (2015) considered different ranges of systolic blood pressure (< 120, 120–129, 130–139, 140–149, 150–159, ≥ 160 mmHg) than Madell and Cherney (2018), we will use those specified by Madell and Cherney (2018) to investigate whether blood pressure ranges differ in terms of associated 10-year risk of CHD. Fit the same model as before, but replace SBP with SBP_CAT.

i. Produce a table of logistic regression model output for this model.

ii. Based strictly on p-values, comment on what conclusions you would make for Wald tests
based on coefficients for SBP_CAT. (Note that you do not need to state hypotheses or values
for test statistics. You simply need to use the p-values to explain what these results mean
about comparisons of systolic blood pressure ranges.)

iii. Do your results agree with the findings of Wu et al. (2015)?

f. (3 marks) Does the model that uses SBP_CAT (i.e., the model fit in part (e)) provide a better fit
than the model that uses SBP (i.e., the model from part (a))?
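
Because a model with numeric SBP and a model with categorical SBP are not nested, a likelihood ratio test does not apply directly. Information criteria such as AIC and BIC are one way to compare them. The sketch below uses simulated data in which risk is (by construction, as an assumption of the example) lower in one band than elsewhere, a pattern a single linear term cannot capture.

```r
set.seed(303)
n   <- 2000
sbp <- runif(n, min = 100, max = 180)
sbp_cat <- cut(sbp, breaks = c(-Inf, 120, 130, 140, Inf), right = FALSE)

# Assumed truth for this toy example: lower risk in the 120-<130 band only.
eta <- ifelse(sbp >= 120 & sbp < 130, -2, 0.5)
y   <- rbinom(n, 1, plogis(eta))

fit_num <- glm(y ~ sbp,     family = binomial)  # linear in SBP on the logit scale
fit_cat <- glm(y ~ sbp_cat, family = binomial)  # separate level per band

comparison <- data.frame(
  model  = c("numeric SBP", "categorical SBP"),
  n_coef = c(length(coef(fit_num)), length(coef(fit_cat))),
  AIC    = c(AIC(fit_num), AIC(fit_cat)),
  BIC    = c(BIC(fit_num), BIC(fit_cat))
)
print(comparison)
```

In this constructed case the categorical model wins despite its extra parameters. When the relationship really is close to linear on the logit scale, the numeric model would typically be preferred.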

g. (3 marks) Finally, for the best model of the two you fit (in parts (a) and (e)), perform a
Hosmer-Lemeshow test for g = 10, 20, and 30 groups, and comment on what these suggest about
the goodness-of-fit of this model to the 10-year risk of CHD data.
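
The Hosmer-Lemeshow statistic groups observations by fitted probability and compares observed with expected counts in each group, referring the result to a chi-squared distribution with g - 2 degrees of freedom. The function `hl_test()` below is a hand-written sketch of that idea on simulated data, not a packaged routine. `ResourceSelection::hoslem.test()` is one existing implementation, and its grouping rule may differ slightly from the one used here.

```r
# Hand-rolled Hosmer-Lemeshow test (hypothetical helper for illustration).
hl_test <- function(y, phat, g = 10) {
  # Group observations using quantiles of the fitted probabilities.
  cuts <- quantile(phat, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(phat, breaks = unique(cuts), include.lowest = TRUE)

  n_g  <- tapply(y, grp, length)   # group sizes
  obs1 <- tapply(y, grp, sum)      # observed events
  exp1 <- tapply(phat, grp, sum)   # expected events
  obs0 <- n_g - obs1               # observed non-events
  exp0 <- n_g - exp1               # expected non-events

  stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
  df   <- nlevels(grp) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(303)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

groups  <- c(10, 20, 30)
results <- t(sapply(groups, function(g) hl_test(y, fitted(fit), g)))
rownames(results) <- paste0("g = ", groups)
print(round(results, 4))
```

Running the test for several values of g is a useful check because the conclusion can be sensitive to the number of groups. A large p-value indicates no evidence of lack of fit, not proof of a good fit.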

## 3. Statistical learning: (15 marks)

Now we perform an exploratory analysis to try to identify the best set of predictors in predicting 10-year
risk of CHD. Consider as predictors all variables other than the new variable that you constructed in
Question 1 (SBP_CAT).

a. (4 marks) Find the optimal models identified by forward and backward selection algorithms.
Report the predictors included in these optimal models. If these models are different, highlight
how they differ, and explain why forward and backward selection algorithms may not arrive at the
same optimal model.
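
Forward and backward selection can both be run with `step()` from base R, which by default adds or drops terms to reduce AIC. The sketch below uses simulated predictors `x1` to `x4` (hypothetical names, with only `x1` and `x2` truly related to the response). It illustrates the mechanics only and is not necessarily the selection routine intended for the assignment.

```r
set.seed(303)
n <- 800
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$x1 - 0.8 * d$x2))

null_fit <- glm(y ~ 1, family = binomial, data = d)  # intercept only
full_fit <- glm(y ~ ., family = binomial, data = d)  # all predictors

# Forward: start from the null model and add terms up to the full set.
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
            direction = "forward", trace = 0)

# Backward: start from the full model and drop terms.
bwd <- step(full_fit, direction = "backward", trace = 0)

print(sort(names(coef(fwd))))
print(sort(names(coef(bwd))))
```

Both searches are greedy: each step considers only single additions or single deletions from the current model, so the two paths visit different models and can stop at different places, particularly when predictors are correlated.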

b. (5 marks) Find the optimal models identified by best subset selection using AIC and BIC as
selection criteria. Report the predictors included in these optimal models. If these models are
different, highlight how they differ, and explain why the criteria of AIC and BIC may lead to
different “best” models. If these models differ from those identified as “best” by forward and
backward selection, explain why that may be the case.
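
Best subset selection fits every combination of predictors and keeps the one with the smallest criterion value. With a handful of predictors this can be sketched by brute force in base R, as below on simulated data (predictor names `x1` to `x4` are hypothetical, and the intercept-only model is omitted for brevity). With 15 predictors there are 2^15 - 1 = 32,767 non-empty subsets, so a dedicated package would normally be used instead.

```r
set.seed(303)
n <- 800
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$x1 - 0.8 * d$x2))

predictors <- c("x1", "x2", "x3", "x4")

# Every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one logistic regression per subset.
fits <- lapply(subsets, function(vars) {
  glm(reformulate(vars, response = "y"), family = binomial, data = d)
})

aic <- sapply(fits, AIC)   # penalty of 2 per parameter
bic <- sapply(fits, BIC)   # penalty of log(n) per parameter

best_aic <- subsets[[which.min(aic)]]
best_bic <- subsets[[which.min(bic)]]
print(best_aic)
print(best_bic)
```

Since log(n) exceeds 2 once n is at least 8, BIC penalises each extra parameter more heavily than AIC, so the BIC-optimal model is never larger than the AIC-optimal one.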

c. (6 marks) Although it would be most appropriate to consider all possible combinations of the 15 predictor variables in a cross-validation routine that selects a model by maximising accuracy or maximising area under the receiver operating characteristic curve (AUC), it is not feasible to do so on home computers in a reasonable amount of time. Consequently, use the predictors identified by best subset selection according to the criterion of minimising AIC from part (b). (If unable to perform the required subset selection in part (b), make note of that here and use the predictors in the optimal model identified by backward selection in part (a).) For this set of predictors, use 20 repetitions of 10-fold cross-validation to identify the optimal model(s) according to the criteria of
i. maximising accuracy and
ii. maximising AUC.

If the optimal model(s) identified according to these criteria are different, highlight how they differ,
and explain why the criteria of maximising accuracy and maximising AUC may lead to different
“best” models. If these models differ from those identified as “best” in parts (a) and (b), explain
why this may be the case.
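
A minimal base-R sketch of repeated k-fold cross-validation scored by accuracy and AUC is given below. The helpers `auc_rank()`, `cv_once()`, and `repeated_cv()` are hypothetical names written for this illustration, as are the simulated data and the two candidate formulas. Accuracy uses an assumed 0.5 classification threshold, and AUC is computed from the pooled out-of-fold predictions using the rank (Mann-Whitney) formula.

```r
# AUC via the rank-sum formula: P(score of a random 1 > score of a random 0).
auc_rank <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# One pass of k-fold cross-validation for a logistic regression formula.
cv_once <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred  <- numeric(nrow(data))
  for (j in seq_len(k)) {
    fit <- glm(formula, family = binomial, data = data[folds != j, ])
    pred[folds == j] <- predict(fit, newdata = data[folds == j, ],
                                type = "response")
  }
  y <- data[[all.vars(formula)[1]]]
  c(accuracy = mean((pred > 0.5) == y),   # assumed 0.5 threshold
    auc      = auc_rank(y, pred))
}

# Average the two metrics over several random fold assignments.
repeated_cv <- function(formula, data, k = 10, reps = 20) {
  rowMeans(replicate(reps, cv_once(formula, data, k)))
}

set.seed(303)
n <- 800
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * d$x1 - 0.8 * d$x2))

candidates <- list(small = y ~ x1 + x2,
                   large = y ~ x1 + x2 + x3 + x4)
cv_results <- sapply(candidates, repeated_cv, data = d, k = 10, reps = 20)
print(round(cv_results, 4))
```

Accuracy depends on a single threshold, whereas AUC summarises ranking quality across all thresholds, which is one reason the two criteria can favour different models.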

Assignment total: 50 marks

## References

Boston University and the National Heart, Lung, & Blood Institute. 2020. “The Framingham Heart Study.” https://framinghamheartstudy.org/.

Madell, R., and K. Cherney. 2018. “Blood Pressure Readings Explained.” Healthline. https://www.healthli