Description
Problem 1: (10pts)
Suppose we are given n data points {(X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n)}.
(a) We are interested in fitting the linear regression model Y_i = β_1 X_i + ε_i,
where the ε_i are independent and identically distributed N(0, σ²) variables.
Derive the least squares estimate β̂_1 of β_1. Find the distribution of β̂_1 and
propose an estimate for its variance.
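As a starting-point sketch for part (a) (standard calculus, assuming Σ_{i=1}^n X_i² > 0, and not a substitute for the full derivation), the objective and its stationarity condition are:

```latex
Q(\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_1 X_i)^2,
\qquad
\frac{dQ}{d\beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \beta_1 X_i) = 0
\quad\Longrightarrow\quad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}.
```

The distribution of β̂_1 and the variance estimate are left to be worked out from this form.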
(b) We are also interested in fitting the linear regression model Y_i = β_0 +
β_1 X_i + ε_i, where the ε_i are again independent and identically distributed
N(0, σ²) variables. It turns out, incidentally, that the {X_i} satisfy
Σ_{i=1}^n X_i = 0. What are the least squares estimates of β_0 and β_1 in this
case? Do you observe any interesting aspect of the least squares estimation
due to the fact that Σ_{i=1}^n X_i = 0? When doing simple linear regression,
can you always assume, without loss of generality, that Σ_{i=1}^n X_i = 0?
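As a hint-level sketch for part (b) (not the full answer), writing out the two normal equations shows where the condition Σ_{i=1}^n X_i = 0 enters:

```latex
\frac{\partial Q}{\partial \beta_0}
  = -2\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0,
\qquad
\frac{\partial Q}{\partial \beta_1}
  = -2\sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i) = 0.
```

Substituting Σ_{i=1}^n X_i = 0 eliminates β_1 from the first equation and β_0 from the second, so the two equations decouple.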
Problem 2: (10pts)
Suppose we are given n data points {(X_1, Y_1, Z_1), (X_2, Y_2, Z_2), . . . , (X_n, Y_n, Z_n)}.
We are interested in fitting the linear regression models Y_i = α + βX_i + ε_i and
Z_i = γ + βX_i + η_i for i = 1, 2, . . . , n, where the {ε_i} and the {η_i} are
independent random variables with zero mean and common variance σ². Derive the
least squares estimates α̂, β̂ and γ̂ of α, β and γ algebraically. Note that we
require the linear coefficient β to be the same in both the regression model for
Y_i on X_i and that for Z_i on X_i.
Hint: The least squares objective function can be written as

Q = Σ_{i=1}^n (Y_i − α − βX_i)² + Σ_{i=1}^n (Z_i − γ − βX_i)²

We can then estimate α, β and γ by taking the partial derivatives of Q with
respect to α, β and γ, setting the resulting partial derivatives to 0, and
solving for the estimates α̂, β̂ and γ̂.
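Carrying the hint one step further, the three partial derivatives yield the normal equations below; solving this linear system for α̂, β̂ and γ̂ is the exercise.

```latex
\frac{\partial Q}{\partial \alpha} = -2\sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) = 0,
\qquad
\frac{\partial Q}{\partial \gamma} = -2\sum_{i=1}^{n} (Z_i - \gamma - \beta X_i) = 0,
\qquad
\frac{\partial Q}{\partial \beta} = -2\sum_{i=1}^{n} X_i (Y_i - \alpha - \beta X_i)
  - 2\sum_{i=1}^{n} X_i (Z_i - \gamma - \beta X_i) = 0.
```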
Problem 3: (20pts)
Install the R package SemiPar to get access to the sausage data set on the
calorie vs. sodium content of several sausage types. Using R, but not the
lm command in R, do the following.
(a) Perform a simple linear regression using calories as the response variable
and sodium level as the predictor variable. Find the least squares estimates
of the coefficients.
(b) Set up a hypothesis test of whether the calorie count is associated with
the sodium content. What is your conclusion?
(c) Predict the calorie count (that is, obtain the prediction intervals) when
the sodium content is 350 mg, 520 mg, and 441 mg.
(d) Find the equation defining the 95% confidence band for the regression line
using the Working-Hotelling approach.
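Since lm is off limits, the coefficients can be computed directly from the closed-form least squares formulas. A minimal sketch on toy data: the vectors x and y below are placeholders, not the sausage data. After library(SemiPar) and loading the sausage data set, substitute its sodium and calorie columns, whose exact names you can check with names(sausage).

```r
## Toy stand-in data; replace with the sodium (x) and calories (y)
## columns from the SemiPar sausage data set.
x <- c(350, 440, 520, 300, 480)
y <- c(180, 210, 240, 150, 230)

n <- length(x)
## Closed-form least squares estimates for y = b0 + b1 * x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

## Residual-based estimate of sigma^2 (n - 2 degrees of freedom)
resid <- y - (b0 + b1 * x)
sigma2.hat <- sum(resid^2) / (n - 2)
```

The same quantities (slope, intercept, and σ̂²) are the building blocks for the test statistic in (b), the prediction intervals in (c), and the confidence band in (d).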
Problem 4: (20pts)
The data for this problem is available from the link https://us.sagepub.com/sites/default/files/upm-binaries/26929_lordex.txt
The data records a study about misinformation and facilitation effects in
children. The data consists of 51 observations, each corresponding to a child
between 4 and 9 years old. The children saw a magic show and were then asked
questions, in two separate sessions, about the events that happened during the
show. The first session took place one week after the show, while the second
session took place roughly 10 months after the show. The children were scored
in each session based on how much of the magic show's events they managed to
recall. The study is described in more detail in the paper "Post-Event
Information Affects Children's Autobiographical Memory after One Year" by
K. London, M. Bruck and L. Melnyk, Law and Human Behavior, Volume 33, 2009.
A snippet of the data is given below; here AGEMOS refers to the age of the
child in months at the start of the first session, and Initial and Final are
the scores of the child in the first and second sessions, respectively.
AGEMOS Final Initial
1 55 0 0
2 82 6 8
3 81 3 6
4 71 0 3
5 84 2 15
6 76 2 8
Once you download the above data file, you can read it into R using the
following commands [1]:
df <- read.table("26929_lordex.txt", sep = "", header = T)
## Now make a new column called age.binarize
df$age.binarize <- (df$AGEMOS <= 78)
df$age.binarize <- factor(df$age.binarize, levels = c(T, F), labels = c("younger", "older"))
## Now make a new column called score.difference
df$score.difference <- df$Final - df$Initial
We have decided to binarize the age of the children into two categories:
those for which the child is 78 months or younger, and those for which
the child is 79 months or older. After adding the above columns, the
snippet of data above becomes
AGEMOS Final Initial age.binarize score.difference
1 55 0 0 younger 0
2 82 6 8 older -2
3 81 3 6 older -3
4 71 0 3 younger -3
5 84 2 15 older -13
6 76 2 8 younger -6
Using the above data, answer the following questions.
(a) A scientist wants to inquire whether or not older children remember events
longer than younger children. He thinks that the way to do this is by
performing a regression with score.difference as the response variable and
age.binarize as the predictor variable, i.e., he considers the model

score.difference_i = β_0 + β_1 × 1{age.binarize_i = "older"} + ε_i

where 1{age.binarize_i = "older"} is 1 if the i-th child is 79 months or
older and 0 otherwise. Without using the lm command in R, find the least
squares estimates of β_0 and β_1 under this model. What is the estimated
coefficient β̂_1? Assuming the normal error regression model, comment on the
output of this regression, e.g., is the estimated coefficient β̂_1
statistically significant? Under this model, what does the estimated
coefficient β̂_1 say about the scores of the older children compared to the
scores of the younger children?

[1] R might warn you about an EOF in the downloaded file, but you can safely
ignore this warning.
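For a single 0/1 predictor, the least squares estimates reduce to group means: β̂_0 is the mean of score.difference among the younger group, and β̂_1 is the older-minus-younger difference in means. A sketch using only the six snippet rows shown above; the full data set has 51 rows, so these numbers are for illustration only, not the answer.

```r
## The six rows from the snippet above (illustration only; run this on
## the full 51-row data frame df for the actual answer).
score.difference <- c(0, -2, -3, -3, -13, -6)
age.binarize <- c("younger", "older", "older", "younger", "older", "younger")

## With a 0/1 dummy, least squares gives the two group means:
b0.hat <- mean(score.difference[age.binarize == "younger"])  # -3 on these rows
b1.hat <- mean(score.difference[age.binarize == "older"]) - b0.hat  # -3 on these rows
```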
(b) Another scientist also wants to inquire whether or not older children
remember events longer than younger children. She thinks that the way
to do this is by performing a regression with Final as the response variable
and Initial and age.binarize as the predictor variables, i.e., she considers
the model

Final_i = β_0 + β_2 × Initial_i + β_1 × 1{age.binarize_i = "older"} + ε_i

Without using the lm command, compute the least squares estimate of β_1
under this model [2]. Under this model, what does the estimated coefficient
β̂_1 say about the scores of the older children compared to the scores of the
younger children?
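Under the footnote's simplification that β̂_2 = 0.12 is known, minimizing the objective over β_0 and β_1 alone is the same as regressing the adjusted response Final − 0.12 × Initial on the binary dummy, which again reduces to group means. A sketch on the six snippet rows only (illustration, not the answer for the full 51-row data set):

```r
## Six snippet rows (illustration only; use the full data frame df in practice).
final   <- c(0, 6, 3, 0, 2, 2)
initial <- c(0, 8, 6, 3, 15, 8)
older   <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

## With beta2 fixed at 0.12, minimizing
## sum((final - b0 - 0.12*initial - b1*older)^2) over b0 and b1 is a
## binary-dummy regression of the adjusted response, so the estimates
## are again group means of that adjusted response.
adj <- final - 0.12 * initial
b0.hat <- mean(adj[!older])
b1.hat <- mean(adj[older]) - b0.hat
```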
(c) (Bonus: 10pts) Comment on the discrepancy in the estimates of β_1 between
the above two regression models. What do you think is the reason behind this
discrepancy?
[2] Once again, write down the least squares objective function in terms of the
parameters β_0, β_1 and β_2, and then find β̂_0, β̂_1 and β̂_2 by setting the
partial derivatives with respect to β_0, β_1 and β_2 to 0. For simplicity, you
can also assume that the least squares estimate for β_2 is known to be
β̂_2 = 0.12.