Week 6 – Simulation Project STAT 420 solution

Simulation Study 1: Significance of Regression

In this simulation study we will investigate the significance of regression test. We will simulate from two different models:

1. The “significant” model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$ and

• $\beta_0 = 3$,
• $\beta_1 = 1$,
• $\beta_2 = 1$,
• $\beta_3 = 1$.

2. The “non-significant” model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$ and

• $\beta_0 = 3$,
• $\beta_1 = 0$,
• $\beta_2 = 0$,
• $\beta_3 = 0$.

For both, we will consider a sample size of 25 and three possible levels of noise. That is, three values of σ.

• $n = 25$
• $\sigma \in (1, 5, 10)$

Use simulation to obtain an empirical distribution for each of the following values, for each of the three values of σ, for both models.

• The F statistic for the significance of regression test.
• The p-value for the significance of regression test.
• $R^2$.

For each model and σ combination, use 2000 simulations. For each simulation, fit a regression model of the same form used to perform the simulation.

Use the data found in study_1.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.

Done correctly, you will have simulated the y vector $2 \text{ (models)} \times 3 \text{ (sigmas)} \times 2000 \text{ (sims)} = 12000$ times.
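One way the procedure above could be organized is sketched below. This is only an illustration under stated assumptions: the predictors are random stand-ins rather than the fixed values in study_1.csv, the column names `y`, `x1`, `x2`, `x3` are assumed, and `simulate_sig_test` is a hypothetical helper name.

```r
set.seed(420)
n <- 25

# Stand-in predictors (assumption). In the actual study these would be read
# once from study_1.csv and held fixed, e.g. study_1 <- read.csv("study_1.csv").
study_1 <- data.frame(
  y  = rep(0, n),            # placeholder response, overwritten each simulation
  x1 = runif(n, 0, 10),
  x2 = runif(n, 0, 10),
  x3 = runif(n, 0, 10)
)

# Simulate one (model, sigma) combination and record the F statistic,
# p-value, and R^2 from each fitted model.
simulate_sig_test <- function(data, beta, sigma, num_sims = 2000) {
  results <- data.frame(f_stat = numeric(num_sims),
                        p_val  = numeric(num_sims),
                        r_sq   = numeric(num_sims))
  # True mean of Y at the fixed predictor values
  mu <- beta[1] + beta[2] * data$x1 + beta[3] * data$x2 + beta[4] * data$x3
  for (i in seq_len(num_sims)) {
    data$y <- mu + rnorm(nrow(data), mean = 0, sd = sigma)
    fit_summary <- summary(lm(y ~ x1 + x2 + x3, data = data))
    f <- fit_summary$fstatistic          # named vector: value, numdf, dendf
    results$f_stat[i] <- f[["value"]]
    results$p_val[i]  <- pf(f[["value"]], f[["numdf"]], f[["dendf"]],
                            lower.tail = FALSE)
    results$r_sq[i]   <- fit_summary$r.squared
  }
  results
}

sigmas <- c(1, 5, 10)
# One data frame of 2000 rows per sigma, for each of the two models
sig_results    <- lapply(sigmas, function(s) simulate_sig_test(study_1, c(3, 1, 1, 1), s))
nonsig_results <- lapply(sigmas, function(s) simulate_sig_test(study_1, c(3, 0, 0, 0), s))
```

Each list element can then be passed to `hist()` to draw the empirical distributions for one σ.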

Potential discussions:

• Do we know the true distribution of any of these values?
• How do the empirical distributions from the simulations compare to the true distributions? (You could consider adding a curve for the true distributions if you know them.)
• How are the F statistic, the p-value, and $R^2$ each related to σ? Are any of those relationships the same for the significant and non-significant models?

• Organize the plots in a grid for easy comparison.
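As a starting point for the question about true distributions, the standard results for the non-significant model are written out here. They assume the usual normal-error linear model with $p = 4$ parameters and $n = 25$ observations, so that the null hypothesis of the significance of regression test holds exactly:

$$F \sim F_{p-1,\, n-p} = F_{3,\, 21}, \qquad \text{p-value} \sim \text{Uniform}(0, 1), \qquad R^2 \sim \text{Beta}\left(\tfrac{p-1}{2}, \tfrac{n-p}{2}\right) = \text{Beta}\left(\tfrac{3}{2}, \tfrac{21}{2}\right)$$

The result for $R^2$ follows from the identity $R^2 = \dfrac{(p-1)F}{(p-1)F + (n-p)}$. None of these depend on σ. For the significant model the F statistic instead follows a noncentral F distribution, whose noncentrality parameter depends on the predictors, the coefficients, and σ, so these curves would not be expected to match its histograms.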

Simulation Study 2: Using RMSE for Selection?

In homework we saw how Test RMSE can be used to select the “best” model. In this simulation study we will investigate how well this procedure works. Since splitting the data is random, we don’t expect it to work correctly each time. We could get unlucky. But averaged over many attempts, we should expect it to select the appropriate model.

We will simulate from the model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$ and

• $\beta_0 = 0$,
• $\beta_1 = 3$,
• $\beta_2 = -4$,
• $\beta_3 = 1.6$,
• $\beta_4 = -1.1$,
• $\beta_5 = 0.7$,
• $\beta_6 = 0.5$.

We will consider a sample size of 500 and three possible levels of noise. That is, three values of σ.

• $n = 500$
• $\sigma \in (1, 2, 4)$

Use the data found in study_2.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.

Each time you simulate the data, randomly split the data into train and test sets of equal sizes (250 observations for training, 250 observations for testing).

For each, fit nine models, with forms:

• y ~ x1
• y ~ x1 + x2
• y ~ x1 + x2 + x3
• y ~ x1 + x2 + x3 + x4
• y ~ x1 + x2 + x3 + x4 + x5
• y ~ x1 + x2 + x3 + x4 + x5 + x6, the correct form of the model as noted above
• y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
• y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
• y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

For each model, calculate Train and Test RMSE.

$$\text{RMSE}(\text{model, data}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
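The formula translates almost directly into R. The sketch below is one possible version. The function names `rmse` and `get_rmse` are hypothetical, and `get_rmse` assumes the response column is named `y`.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Wrapper in the RMSE(model, data) form; assumes the response column is `y`
get_rmse <- function(model, data) {
  rmse(actual = data$y, predicted = predict(model, newdata = data))
}

rmse(c(1, 2, 3), c(2, 3, 4))  # every prediction is off by 1, so this returns 1
```

Passing the training data gives Train RMSE, and passing the held-out data gives Test RMSE.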

Repeat this process with 1000 simulations for each of the 3 values of σ. For each value of σ, create a plot that shows how average Train RMSE and average Test RMSE change as a function of model size. Also show the number of times the model of each size was chosen for each value of σ.

Done correctly, you will have simulated the y vector $3 \times 1000 = 3000$ times. You will have fit $9 \times 3 \times 1000 = 27000$ models. A minimal result would use 3 plots. Additional plots may also be useful.
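The whole procedure for a single value of σ might be sketched as follows. Several things here are assumptions for illustration: the predictors are random stand-ins for the fixed values in study_2.csv, the column names `y`, `x1`, ..., `x9` are assumed, `run_study_2` is a hypothetical helper name, and the demonstration call uses only 100 simulations to keep the run time short, where the assignment asks for 1000.

```r
set.seed(420)
n <- 500

# Stand-in predictors (assumption); the assignment uses study_2.csv instead.
study_2 <- data.frame(y = rep(0, n), matrix(rnorm(n * 9), nrow = n))
names(study_2) <- c("y", paste0("x", 1:9))

beta <- c(0, 3, -4, 1.6, -1.1, 0.7, 0.5)   # beta_0 through beta_6
# True mean of Y; only x1 through x6 enter the true model
mu <- as.vector(beta[1] + as.matrix(study_2[, paste0("x", 1:6)]) %*% beta[-1])

rmse <- function(actual, predicted) sqrt(mean((actual - predicted) ^ 2))

# Nine nested formulas: y ~ x1, y ~ x1 + x2, ..., y ~ x1 + ... + x9
formulas <- lapply(1:9, function(k) reformulate(paste0("x", 1:k), response = "y"))

run_study_2 <- function(data, mu, sigma, num_sims = 1000) {
  train_rmse <- matrix(0, nrow = num_sims, ncol = 9)
  test_rmse  <- matrix(0, nrow = num_sims, ncol = 9)
  for (i in seq_len(num_sims)) {
    data$y <- mu + rnorm(nrow(data), mean = 0, sd = sigma)
    # Random 50/50 split into train and test sets
    train_idx <- sample(nrow(data), size = nrow(data) / 2)
    train <- data[train_idx, ]
    test  <- data[-train_idx, ]
    for (k in 1:9) {
      fit <- lm(formulas[[k]], data = train)
      train_rmse[i, k] <- rmse(train$y, predict(fit, newdata = train))
      test_rmse[i, k]  <- rmse(test$y, predict(fit, newdata = test))
    }
  }
  list(
    avg_train = colMeans(train_rmse),
    avg_test  = colMeans(test_rmse),
    # How often each model size had the lowest Test RMSE
    chosen    = table(factor(apply(test_rmse, 1, which.min), levels = 1:9))
  )
}

result <- run_study_2(study_2, mu, sigma = 1, num_sims = 100)
```

Plotting `result$avg_train` and `result$avg_test` against model sizes 1 through 9, together with a bar plot of `result$chosen`, would cover the requested output for one σ.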

Potential discussions:

• Does the method always select the correct model? On average, does it select the correct model?
• How does the level of noise affect the results?

Simulation Study 3: Power

In this simulation study we will investigate the power of the significance of regression test for simple linear regression.

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$$

Recall, we had defined the significance level, α, to be the probability of a Type I error.

$$\alpha = P[\text{Reject } H_0 \mid H_0 \text{ True}] = P[\text{Type I Error}]$$

Similarly, the probability of a Type II error is often denoted using β; however, this should not be confused with a regression parameter.

$$\beta = P[\text{Fail to Reject } H_0 \mid H_1 \text{ True}] = P[\text{Type II Error}]$$

Power is the probability of rejecting the null hypothesis when the null is not true, that is, the alternative is true and $\beta_1$ is non-zero.

$$\text{Power} = 1 - \beta = P[\text{Reject } H_0 \mid H_1 \text{ True}]$$

Essentially, power is the probability that a signal of a particular strength will be detected. Many things affect the power of a test. In this case, some of those are:

• Sample Size, n
• Signal Strength, $\beta_1$
• Noise Level, σ
• Significance Level, α

We’ll investigate the first three.

To do so we will simulate from the model

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$.

For simplicity, we will let $\beta_0 = 0$, thus $\beta_1$ is essentially controlling the amount of “signal.” We will then consider different signals, noises, and sample sizes:

• $\beta_1 \in (-2, -1.9, -1.8, \ldots, -0.1, 0, 0.1, 0.2, 0.3, \ldots, 1.9, 2)$
• $\sigma \in (1, 2, 4)$
• $n \in (10, 20, 30)$
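The three sets of values above can be laid out as a grid of combinations. This is a small sketch, and the variable names are hypothetical:

```r
beta_1_values <- seq(-2, 2, by = 0.1)   # 41 signal strengths
sigma_values  <- c(1, 2, 4)
n_values      <- c(10, 20, 30)

# One row per (beta_1, sigma, n) combination
param_grid <- expand.grid(beta_1 = beta_1_values,
                          sigma  = sigma_values,
                          n      = n_values)
nrow(param_grid)  # 41 * 3 * 3 = 369 combinations
```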

We will hold the significance level constant at $\alpha = 0.05$.

Use the following code to generate the predictor values, x, for each sample size n.

x_values = seq(0, 5, length = n)

For each possible $\beta_1$ and σ combination, simulate from the true model at least 1000 times. Each time, perform the significance of the regression test. To estimate the power with these simulations, and some α, use

$$\hat{\text{Power}} = \hat{P}[\text{Reject } H_0 \mid H_1 \text{ True}] = \frac{\text{\# Tests Rejected}}{\text{\# Simulations}}$$

It is possible to derive an expression for power mathematically, but often this is difficult, so instead, we rely on simulation.
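A minimal sketch of the simulation-based estimate for a single combination of n, $\beta_1$, and σ is given below, with `estimate_power` as a hypothetical helper name. In simple linear regression the t test for the slope and the significance of regression F test give the same p-value, so the sketch reads the p-value from the coefficient table. For this particular model the power can also be written down using the noncentral F distribution, and `exact_power` (also a hypothetical name) is included only as a sanity check on the simulation.

```r
set.seed(420)

# Estimate power by simulation for one (n, beta_1, sigma) combination
estimate_power <- function(n, beta_1, sigma, alpha = 0.05, num_sims = 1000) {
  x_values <- seq(0, 5, length = n)
  rejected <- logical(num_sims)
  for (i in seq_len(num_sims)) {
    # beta_0 = 0, so the mean of Y is beta_1 * x
    y <- beta_1 * x_values + rnorm(n, mean = 0, sd = sigma)
    fit <- lm(y ~ x_values)
    p_value <- summary(fit)$coefficients["x_values", "Pr(>|t|)"]
    rejected[i] <- p_value < alpha
  }
  mean(rejected)   # proportion of tests that rejected H0
}

# Analytic power via the noncentral F distribution, for comparison
exact_power <- function(n, beta_1, sigma, alpha = 0.05) {
  x_values <- seq(0, 5, length = n)
  s_xx <- sum((x_values - mean(x_values)) ^ 2)
  ncp  <- beta_1 ^ 2 * s_xx / sigma ^ 2        # noncentrality parameter
  crit <- qf(1 - alpha, df1 = 1, df2 = n - 2)  # rejection threshold
  pf(crit, df1 = 1, df2 = n - 2, ncp = ncp, lower.tail = FALSE)
}

estimate_power(n = 20, beta_1 = 0.5, sigma = 1)
exact_power(n = 20, beta_1 = 0.5, sigma = 1)
```

The two values should agree up to simulation error. Calling `estimate_power` over every combination of n, $\beta_1$, and σ gives the points for the power curves.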

Create three plots, one for each value of σ. Within each of these plots, add a “power curve” for each value of n that shows how power is affected by signal strength, $\beta_1$.

Potential discussions:

• How do $n$, $\beta_1$, and σ affect power? Consider additional plots to demonstrate these effects.
• Are 1000 simulations sufficient?
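One way to approach the last question is through the Monte Carlo standard error of the estimate. Treating each simulated test as an independent rejection with probability equal to the true power, and writing $N$ for the number of simulations (notation introduced here), the estimated power is a binomial proportion, so

$$\text{SE}\left(\hat{\text{Power}}\right) = \sqrt{\frac{\text{Power}\,(1 - \text{Power})}{N}} \le \frac{1}{2\sqrt{N}}, \qquad \frac{1}{2\sqrt{1000}} \approx 0.016$$

Under this assumption, each point on a power curve built from 1000 simulations is accurate to within roughly ±0.03 (about two standard errors) in the worst case, which is where power is near 0.5.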