BIST P8157: Analysis of Longitudinal Data Homework 1 to 4 Solutions


BIST P8157: Analysis of Longitudinal Data Homework 1

Question 1:
Suppose interest lies in characterizing the efficacy of treatment A versus treatment B with respect to some continuous outcome Y. Let Y_ki denote the response of the kth study participant at the ith time, where k = 1, . . . , K and i = 1, 2. Furthermore, suppose that the variance of the response is σ² for i = 1, 2, and that the correlation between the repeated measurements (within a study participant) is ρ.

(a) For each of the following designs derive the variance for the given estimate of the treatment
effect:
(i) Cross-sectional design: A total of K/2 study participants are randomized to treatment A and K/2 study participants are randomized to treatment B. All study participants are measured after they received treatment, and the treatment effect is estimated with γ̂_a = Ȳ^A_1 − Ȳ^B_1. Note, with this study design we have K total study participants who are each measured once.

(ii) Longitudinal comparison of change from baseline: A total of K/2 study participants are randomized to treatment A and K/2 study participants are randomized to treatment B. All study participants are measured at baseline (time=1) prior to receiving treatment and after receiving treatment (time=2), and the treatment effect is estimated with γ̂_b = (Ȳ^A_2 − Ȳ^A_1) − (Ȳ^B_2 − Ȳ^B_1). Note, with this study design we have K total study participants who are each measured twice.

(iii) Longitudinal comparison of treatment A and B (crossover study): All K study participants are observed on treatment A (time=1) AND treatment B (time=2), and the treatment effect is estimated with γ̂_c = Ȳ^A_1 − Ȳ^B_2. Note, with this study design we have K total study participants who are each measured twice.

(iv) Longitudinal comparison of averages: A total of K/2 study participants are randomized to treatment A and K/2 study participants are randomized to treatment B. All study participants are measured twice on the randomized treatment assignment, and the treatment effect is estimated with γ̂_d = Ȳ^A − Ȳ^B, where Ȳ^tx is the average of the K/2 study participant-specific averages under treatment tx. Note, with this study design we have K total study participants who are each measured twice.

(b) Assume we have a budget of $300,000, and it costs $500 each time the response is measured. How many people can be enrolled under each design? Calculate and compare the variances of the estimators, and discuss which you would choose for each of ρ = {0.2, 0.5, 0.8} in order to minimize uncertainty (i.e. variance).

(c) Assume we have a budget of $300,000, it costs $250 to enroll someone into the study, and then $250 each time the response is measured. How many people can be enrolled under each design? Calculate and compare the variances of the estimators, and discuss which you would choose for each of ρ = {0.2, 0.5, 0.8} in order to minimize uncertainty (i.e. variance).
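The variance and budget comparisons above are straightforward to tabulate. The following is a stdlib-only Python sketch (the course uses R, but the arithmetic is language-agnostic); the closed forms coded here are one standard derivation under the stated variance/correlation assumptions and should be checked against your own work:

```python
# Sketch: variances of the four design-specific estimators of the
# treatment effect, assuming Var(Y_ki) = sigma2 and within-person
# correlation rho. These closed forms are one standard derivation --
# verify them against your own algebra before relying on them.

def design_variances(sigma2, rho, K):
    return {
        # (i) cross-sectional: difference of two independent means of K/2
        "i_cross_sectional": 4 * sigma2 / K,
        # (ii) change from baseline: each change has variance 2*sigma2*(1-rho)
        "ii_change": 8 * sigma2 * (1 - rho) / K,
        # (iii) crossover: within-person difference, averaged over K people
        "iii_crossover": 2 * sigma2 * (1 - rho) / K,
        # (iv) averages: each person-average has variance sigma2*(1+rho)/2
        "iv_averages": 2 * sigma2 * (1 + rho) / K,
    }

def enrollment(budget, cost_enroll, cost_measure, n_measures):
    # people affordable when each person is enrolled once and
    # measured n_measures times
    return budget // (cost_enroll + cost_measure * n_measures)

# part (b): $500 per measurement, no separate enrollment cost
K_b_once = enrollment(300_000, 0, 500, 1)    # design (i)
K_b_twice = enrollment(300_000, 0, 500, 2)   # designs (ii)-(iv)

# part (c): $250 to enroll plus $250 per measurement
K_c_once = enrollment(300_000, 250, 250, 1)   # design (i)
K_c_twice = enrollment(300_000, 250, 250, 2)  # designs (ii)-(iv)

for rho in (0.2, 0.5, 0.8):
    print(rho, design_variances(1.0, rho, 100))
```

Plugging each design's affordable K into its variance formula then lets you compare the designs at each value of ρ.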

Question 2:
The Six Cities Study of Air Pollution and Health was a longitudinal study designed to characterize lung growth as measured by changes in pulmonary function in children and adolescents,
and the factors that influence lung function growth. A cohort of 13,379 children born on or after
1967 was enrolled in six communities across the U.S.: Watertown (Massachusetts), Kingston and
Harriman (Tennessee), a section of St. Louis (Missouri), Steubenville (Ohio), Portage (Wisconsin),
and Topeka (Kansas).

Most children were enrolled in the first or second grade (between the ages
of six and seven) and measurements of study participants were obtained annually until graduation
from high school or loss to follow-up. At each annual examination, spirometry, the measurement
of pulmonary function, was performed and a respiratory health questionnaire was completed by a
parent or guardian.

On the course website you’ll find a dataset that contains a subset of the pulmonary function data
collected in the Six Cities Study. The data consist of all measurements of FEV1, height and age
obtained from a randomly selected subset of the female participants living in Topeka, Kansas. The
random sample consists of 300 girls, with a minimum of one and a maximum of twelve observations
over time.

(a) Conduct an initial exploratory data analysis (EDA) for the Topeka data. In particular, consider the extent to which there are any unusual observations/outliers, as well as an initial
exploration of the mean and dependence structure. For each component of your EDA, comment on how it would inform how you move forward. Report your results in a concise manner,
using tables and/or figures. Note, what you submit for this may not be all of the EDA you
conduct.

(b) Similar to what we did in class, consider the types of questions that one might be able to
address with the Topeka data.

(c) Suppose that, instead of repeated measurements on each of the 300 girls, only a single measurement was obtained (say, at the start of the study). For any question that you considered in part (b), discuss the extent to which the question could be addressed using cross-sectional data, albeit possibly with additional assumptions.

Question 3:
Consider the CD4+ cell count data we have been looking at in the notes. Specifically, consider the K*=266 participants with at least one pre- and one post-seroconversion measurement (see slide 41 of the notes). As in the notes, restrict attention to those patients for whom the pre-seroconversion measurement was within 6 months of seroconversion. For the purposes of this analysis, take that measurement to be the measurement at time 0 (i.e. baseline).

(a) Construct a ‘Table 1’ summarizing the sample on the basis of their covariates at baseline.

(b) Conduct a two-stage least squares analysis of CD4+ cell count progression post-seroconversion. Towards this, at the first stage model each patient's trajectory as a function of time since seroconversion; for these models you may consider the relationship to be linear or some other, more flexible, form. At the second stage, model the coefficients you obtained at the first stage as a function of baseline covariates. Report your results succinctly in the form of tables and/or figures. In addition, provide a brief summary of the results using language that would be suitable for a non-biostatistician collaborator.
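The two-stage procedure is simply a per-subject regression followed by a regression of the fitted coefficients. A minimal stdlib-only Python sketch (the course uses R; the toy data and linear first-stage model are assumptions made purely for illustration):

```python
# Two-stage least squares sketch: stage 1 fits a per-subject straight
# line in time; stage 2 regresses the subject-specific slopes on a
# baseline covariate. Toy, noise-free data -- illustrative only.

def ols(x, y):
    # simple least squares; returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# toy data: subject k has baseline covariate x_k and outcomes that
# decline at rate 2 - 0.5 * x_k per unit time (no noise)
subjects = []
for xk in [0.0, 1.0, 2.0, 3.0, 4.0]:
    times = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0 + (2.0 - 0.5 * xk) * t for t in times]
    subjects.append((xk, times, ys))

# stage 1: subject-specific slopes
stage1 = [(xk, ols(t, y)[1]) for xk, t, y in subjects]

# stage 2: regress the slopes on the baseline covariate
xs = [s[0] for s in stage1]
slopes = [s[1] for s in stage1]
b0, b1 = ols(xs, slopes)
print(b0, b1)  # recovers (2.0, -0.5) on this noise-free toy data
```

With real, noisy, unbalanced data the second stage would typically also account for the differing precision of the stage-1 estimates.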

BIO 245: Analysis of Longitudinal Data Homework 2

Question 1:
Continuing our investigations with the MACS data, the MACS-VL.RData dataset on the course
website has longitudinal information on CD4+ cell counts for K=225 MACS participants with
baseline viral load data. In this question we are going to consider the relationship between baseline
viral load and the rate of decline of CD4 count.

(a) Summarize the key variables using simple numerical and/or graphical summaries as relevant
to the scientific question of interest.

(b) Use appropriate exploratory methods to characterize the covariance structure of the data.
What structured covariance model(s) appear plausible/reasonable?
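For balanced (or artificially balanced) data, the empirical covariance and correlation matrices across time points are the basic exploratory tool for this. A stdlib-only Python sketch of the computation on a tiny made-up subjects-by-times layout (the data values are invented for illustration):

```python
# Sketch: empirical covariance and correlation matrices across time
# points -- the kind of summary used to judge whether, e.g., a compound
# symmetric structure (equal variances, equal correlations) looks
# plausible. Rows are subjects; columns are the same 3 time points.
import math

Y = [
    [10.0, 11.0, 12.0],
    [ 8.0,  9.5, 10.0],
    [12.0, 12.5, 14.0],
    [ 9.0, 10.0, 11.0],
]

n, p = len(Y), len(Y[0])
means = [sum(row[j] for row in Y) / n for j in range(p)]

# sample covariance matrix (divisor n - 1)
cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in Y) / (n - 1)
        for j in range(p)] for i in range(p)]

# correlation matrix
corr = [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
         for j in range(p)] for i in range(p)]

for row in corr:
    print([round(v, 2) for v in row])
```

Roughly constant diagonals of `cov` and roughly constant off-diagonals of `corr` would support a compound symmetric working model; correlations that decay with lag would point towards an AR-type structure instead.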

(c) Use the gls() command in the nlme library to fit the model:
E[Y_ki] = β_0 + β_1 t_ki + β_2 x_k + β_3 t_ki x_k,
where x_k is the (possibly transformed) baseline viral load and t_ki is time since seroconversion in months. Use a compound symmetric correlation structure, but consider both maximum likelihood and restricted maximum likelihood for estimation. Present your results in a concise manner that would be suitable for a journal and provide a precise interpretation of the estimates for the mean model. Comment on whether there is a significant association between baseline viral load and the rate of decline in CD4+ based on the estimates from this model.

(d) The model in part (c) restricts the analysis in that it estimates a linear relationship between (possibly transformed) baseline viral load and CD4 count over time. As a way of relaxing this restriction, consider categorizing baseline viral load. Given a categorization with J levels, one alternative to the model in part (c) is one in which the slope for time depends on the category of the covariate. Specifically, while the model in (c) assumes that γ_1(x_k) = β_1 + β_3 x_k, the model in (d) assumes a discrete function where γ_1(x_k) = β_1 + Σ_j β_3j x_k(j), with x_k(j) an indicator of membership in category j. Hence, the model in (c) utilizes viral load in its continuous form but is restrictive in the nature of the relationship (i.e. linearity), while the model in (d) utilizes a categorical version of viral load but makes no assumptions regarding the functional form of how the rate of decline differs across the viral load categories.

Beyond these two special cases, allowing γ_0(x_k) and γ_1(x_k) to take richer functional forms than the linear form used in the model in (c) provides a more flexible description of how the rate of decline differs for different values of baseline viral load. With this in mind, use a varying coefficient model for the rate of decline in CD4+ that characterizes how the rate of decline depends on baseline viral load. I recommend that you use natural or restricted cubic splines for the coefficient functions and simply choose two knots. Fit the model and present your results in a concise manner that would be suitable for a journal. Provide a precise interpretation of the estimates in this regression model, and comment on whether there is a significant association between baseline viral load and the rate of decline in CD4+ based on the estimates from this model. Plot the estimated coefficient function γ̂_1(x_k) with pointwise 95% confidence bands, and interpret specific values. What does this plot suggest about the adequacy of the model in (c)?
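To make the varying-coefficient construction concrete, here is a stdlib-only Python sketch that builds the γ_1(x)·t interaction columns from a cubic spline basis with two knots. Note it uses the simple truncated power basis rather than the natural/restricted splines recommended above (in R one would typically use splines::ns()), so it is illustrative only, and all names and knot values are assumptions:

```python
# Sketch: design-matrix columns for a varying-coefficient model
# E[Y] = gamma0(x) + gamma1(x) * t, with gamma1(x) expanded in a
# truncated power cubic spline basis with two knots. Illustrative
# only -- not the natural spline basis the assignment recommends.

def tp_cubic_basis(x, knots):
    # basis functions: x, x^2, x^3, and (x - k)_+^3 for each knot k
    cols = [x, x ** 2, x ** 3]
    cols += [max(0.0, x - k) ** 3 for k in knots]
    return cols

def design_row(x, t, knots):
    bx = tp_cubic_basis(x, knots)
    # gamma0(x): intercept plus spline in x
    # gamma1(x) * t: t plus each spline term interacted with t
    return [1.0] + bx + [t] + [b * t for b in bx]

row = design_row(x=2.5, t=6.0, knots=[1.0, 3.0])
print(len(row))  # 1 + 5 + 1 + 5 = 12 columns
```

Once the design matrix is built this way, γ̂_1(x) is recovered by summing the fitted coefficients of the time-interaction columns, each evaluated at the basis functions for x.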

(e) (optional) The models in parts (c) and (d) can be viewed as special cases of a 'varying coefficient' model:

E[Y_ki] = γ_0(x_k) + γ_1(x_k) t_ki.

Question 2 (Optional):
Consider the one-way analysis of variance model:

Y_ki = μ + γ_k + ε_ki,

with i = 1, . . . , n replicates on k = 1, . . . , K units and

γ_k ~ Normal(0, τ²), ε_ki ~ Normal(0, σ²), γ_k ⊥ ε_ki.

The following may be useful: Let I_m denote the m × m identity matrix and 1_m denote the m × m matrix of 1's. Then:

(a I_m + b 1_m)^{−1} = (1/a) [ I_m − b/(a + mb) · 1_m ]

for a ≠ 0 and a ≠ −mb, and:

|a I_m + b 1_m| = a^{m−1} (a + mb).

(a) Derive the likelihood and log-likelihood as a function of (μ, σ², τ²).

(b) Show that the MLEs for μ, σ², and τ² are given by:

μ̂ = Ȳ_·· , σ̂² = MSE , τ̂² = [(1 − 1/K) MSA − MSE] / n,

where MSA = n Σ_k (Ȳ_k· − Ȳ_··)² / (K − 1) and MSE = Σ_k Σ_i (Y_ki − Ȳ_k·)² / [K(n − 1)]. Hint: It may be helpful to write λ = σ² + nτ².

(c) Obtain the form for Var[μ̂] and hence an estimate of this quantity.

(d) Find the REML estimators for σ² and τ² by integrating μ out of the likelihood in part (a).

(e) In the one-way random effects model with balanced data, it can be shown that:

[ MSA/(σ² + nτ²) ] / [ MSE/σ² ] ~ F_{K−1, K(n−1)},

where F_{K−1, K(n−1)} denotes the F distribution with K − 1 and K(n − 1) degrees of freedom. Hence explain why F* = MSA/MSE may be compared to an F_{K−1, K(n−1)} distribution to test the hypothesis H: τ² = 0.
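A derivation like the one asked for in part (b) is easy to sanity-check numerically. Below is a tiny stdlib-only Python sketch on a made-up balanced table (K=3 units, n=2 replicates); the estimator forms coded here are the standard balanced-ANOVA ones and should be verified against your own algebra:

```python
# Sketch: numeric check of the balanced one-way random effects
# estimators on a tiny made-up table, K = 3 units, n = 2 replicates.
# Uses the standard balanced-ANOVA forms -- check them against your
# own derivation.

Y = [[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]]
K, n = len(Y), len(Y[0])

unit_means = [sum(row) / n for row in Y]
grand_mean = sum(sum(row) for row in Y) / (K * n)

MSA = n * sum((m - grand_mean) ** 2 for m in unit_means) / (K - 1)
MSE = sum((y - m) ** 2
          for row, m in zip(Y, unit_means) for y in row) / (K * (n - 1))

mu_hat = grand_mean
sigma2_hat = MSE
tau2_hat = ((1 - 1 / K) * MSA - MSE) / n  # ML form; can be negative
F_star = MSA / MSE                        # compare to F_{K-1, K(n-1)}

print(mu_hat, MSA, MSE, tau2_hat, F_star)
```

Simulating many such tables with known (σ², τ²) and averaging the estimates is a quick way to convince yourself of (or refute) an analytic result.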

P8157: Analysis of Longitudinal Data Homework 3

Question 1:
In this question you are going to analyze the ‘wtloss’ data, available on the course website.
Briefly, the data come from a weight loss trial in which K=120 patients were randomized to three treatment arms: dietary counseling at baseline (diet=0), dietary counseling at all study visits (diet=1), and dietary counseling at all visits plus free access to an exercise facility (diet=2).
Each patient visited the study clinic monthly, for up to 12 months; at each visit their weight was
measured.

(a) Use lme() to fit two linear mixed effects models, both including a main effect for diet, a main
effect for time, and a diet by time interaction:

(i) random intercepts only
(ii) random intercepts and random slopes
Report the results in a table that would be suitable for a clinical journal, and provide precise interpretations of the fixed effects and variance components from model (ii).

(b) Consider conducting a test for whether the random intercepts/slopes model provides a significantly better fit to the data than the random intercepts model. Write down the null and alternative hypotheses. In class we learned that this is a non-standard testing scenario, and that the likelihood ratio test statistic under the null is a mixture of χ²₁ and χ²₂ distributions.

In the lme help file, look up simulate.lme. Use this function to simulate the null distribution, setting n.sim=1000 and seed=1504. Plot this distribution, along with the χ²₁ distribution and the χ²₂ distribution, highlighting the 95th percentiles. What do you conclude about the adequacy of the random intercepts model as compared to the random intercepts/slopes model?

(c) Conduct a residual analysis of model (i) from part (a). Report your results in a concise
manner and briefly summarize what you conclude, including whether the results from this
analysis are consistent with what you concluded from part (b).

(d) Use geeglm() to fit the same mean model from part (a) using GEE 1.5, based on: (i) working independence (GEE-I), (ii) working exchangeable (GEE-E), and (iii) working AR-1 (GEE-AR1). Report the results in a table that would be suitable for a clinical journal, and provide precise interpretations of the regression parameter estimates from GEE-E.

(e) State the null and alternative hypotheses for the test of whether the rate of weight loss differs across the treatment groups. Conduct the test for the GEE-E estimator and describe the results using language that would be suitable for a non-biostatistician collaborator.

(f) Assuming the mean model is correctly specified, comment on the consistency of the point estimates reported in parts (a) and (d), as well as on the validity of the standard error estimates.

Question 2 (Optional):
In this question we are going to try to understand how much each cluster (or subject) and each observation per cluster (or subject) is weighted by GEE. That is, even though V_k is called a 'working covariance matrix,' it might be more natural to think of it as a working weighting matrix, where the weight matrix W_k is equal to V_k^{−1}. For simplicity, we'll consider the special case where the response is continuous and the variance is constant (i.e. homogeneous). In that case, the GEE estimating equation is given by:

Σ_{k=1}^{K} D_k^T V_k^{−1} (Y_k − μ_k) = Σ_{k=1}^{K} X_k^T W_k (Y_k − μ_k).

Define the total weight given to cluster k as

W_tot,k = Σ_{i=1}^{n_k} Σ_{j=1}^{n_k} w_{k,ij},

where w_{k,ij} is the (i, j)th element of W_k.

(a) Assuming an exchangeable correlation structure with correlation parameter ρ, calculate W_tot,k as a function of n_k and ρ using the identity:

(a I_m + b 1_m)^{−1} = (1/a) [ I_m − b/(a + mb) · 1_m ],

where I_m denotes the m × m identity matrix and 1_m denotes the m × m matrix of 1's.

(b) From this, derive the form of the relative weight of a person with n_k=10 to one with n_k=5. Calculate this value for ρ=0.9, ρ=0.5, and ρ=0.1, and comment on the trend that you observe.

(c) What do the results in part (b) say about the weight for each subject when working independence is used?

(d) The per-observation weight (per single observation within a person) can be thought of as W_tot,k/n_k. Using the results from part (b), comment on the trend in the per-observation weight received.
BIST 8157: Analysis of Longitudinal Data Homework 4

Question 1:
In the course notes, glmer() in the lme4 package is reviewed as a means of fitting generalized linear mixed models. In this question you are going to create your own function in R to fit a logistic random intercepts GLMM for binary response data, using Gauss-Hermite quadrature. Your function should have the same inputs as those given on slide 367, with the exception of the 'family' input. Together with the primary function for fitting the model, create a print method that outputs the results in a way that is similar to the output from the summary method for glmer().

As part of this output include, at least, a title for the fit, information on the overall fit (i.e. the maximized log-likelihood), results regarding the variance components, and results regarding the fixed effects. Finally, apply the function to the ICHS data, specifically to replicate the results presented on slide 369 of the notes. When you hand in your solution, send your code to the TAs. Please make sure to clean and annotate your code in a way that makes it easy for the TAs (or any reader) to understand the various steps.
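Gauss-Hermite quadrature itself is compact enough to hand-roll. Below is a stdlib-only Python sketch of the two numerical cores (node/weight computation, and the per-subject marginal likelihood of a random-intercept logistic model); the assignment asks for an R implementation with the slide-367 interface, which is not reproduced here, and every function name below is my own invention:

```python
# Sketch: Gauss-Hermite quadrature from scratch, plus the per-subject
# marginal log-likelihood of a logistic random-intercepts GLMM.
import math

def hermite(n, x):
    # physicists' Hermite polynomial H_n(x) via the 3-term recurrence
    if n == 0:
        return 1.0
    h0, h1 = 1.0, 2.0 * x
    for k in range(1, n):
        h0, h1 = h1, 2.0 * x * h1 - 2.0 * k * h0
    return h1

def hermgauss(n):
    # nodes: roots of H_n, bracketed on a fine grid then bisected;
    # weights: 2^(n-1) n! sqrt(pi) / (n^2 * H_{n-1}(x_i)^2)
    lim = math.sqrt(2.0 * n + 1.0) + 1.0
    grid = [-lim + 2.0 * lim * i / 4000 for i in range(4001)]
    nodes = []
    for a, b in zip(grid, grid[1:]):
        fa = hermite(n, a)
        if fa * hermite(n, b) < 0.0:
            for _ in range(100):
                m = 0.5 * (a + b)
                if fa * hermite(n, m) <= 0.0:
                    b = m
                else:
                    a, fa = m, hermite(n, m)
            nodes.append(0.5 * (a + b))
    c = 2.0 ** (n - 1) * math.factorial(n) * math.sqrt(math.pi) / n ** 2
    weights = [c / hermite(n - 1, x) ** 2 for x in nodes]
    return nodes, weights

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def subject_loglik(y, eta_fixed, tau, nodes, weights):
    # log of the integral over b ~ N(0, tau^2) of the Bernoulli
    # likelihood, via the substitution b = sqrt(2) * tau * x
    total = 0.0
    for x, w in zip(nodes, weights):
        b = math.sqrt(2.0) * tau * x
        prod = 1.0
        for yi, ei in zip(y, eta_fixed):
            p = expit(ei + b)
            prod *= p if yi == 1 else 1.0 - p
        total += w * prod
    return math.log(total / math.sqrt(math.pi))
```

Summing `subject_loglik` over subjects gives the objective to maximize over (β, τ); in the R version the same two pieces would sit inside the optimizer call, and a production fit would work on the log scale inside the product for numerical stability.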

Question 2 (optional):
In collaborative settings in which the data are either cluster-correlated or longitudinal, a very common question is whether one should proceed using a marginal model, with estimation/inference via GEE, or a GLMM, with likelihood-based estimation/inference. In this question you are going to consider a series of questions that can help guide those decisions.

For each of the following,
create a series of bullet points that could be folded into a talk that you give on the topic or into a
set of slides that you could use with your collaborator:
Q: Features shared by both frameworks?
Q: Reasons to use marginal models?
Q: Reasons not to use marginal models?
Q: Reasons to use mixed models?
Q: Reasons not to use mixed models?