## Description

# Simulation Study 1: Significance of Regression

In this simulation study we will investigate the significance of regression test. We will simulate from two different models:

- The **"significant"** model

  $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$

  where $\epsilon_i \sim N(0, \sigma^2)$ and

  - $\beta_0 = 3$,
  - $\beta_1 = 1$,
  - $\beta_2 = 1$,
  - $\beta_3 = 1$.

- The **"non-significant"** model

  $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$

  where $\epsilon_i \sim N(0, \sigma^2)$ and

  - $\beta_0 = 3$,
  - $\beta_1 = 0$,
  - $\beta_2 = 0$,
  - $\beta_3 = 0$.

For both, we will consider a sample size of 25 and three possible levels of noise. That is, three values of $\sigma$.

- $n = 25$
- $\sigma \in (1, 5, 10)$

Use simulation to obtain an empirical distribution for each of the following values, for each of the three values of $\sigma$, for both models.

- The **$F$ statistic** for the significance of regression test
- The **p-value** for the significance of regression test
- **$R^2$**

For each model and $\sigma$ combination, use 2000 simulations. For each simulation, fit a regression model of the same form used to perform the simulation.

Use the data found in `study_1.csv` for the values of the predictors. These should be kept constant for the entirety of this study. The `y` values in this data are a blank placeholder.

Done correctly, you will have simulated the `y` vector $2 \ (\text{models}) \times 3 \ (\text{sigmas}) \times 2000 \ (\text{sims}) = 12000$ times.
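As a starting point, the loop for a single model and noise level might be sketched as follows. This is a minimal sketch, not a full solution: it assumes the predictors in `study_1.csv` are named `x1`, `x2`, and `x3`, and shows only the significant model with $\sigma = 1$.

```r
# Sketch: one model / sigma combination (significant model, sigma = 1).
# The y column of study_1.csv is a placeholder; only the predictors are used.
sim_data = read.csv("study_1.csv")
n        = 25
sigma    = 1
num_sims = 2000
beta_0 = 3; beta_1 = 1; beta_2 = 1; beta_3 = 1

f_stat = rep(0, num_sims)
p_val  = rep(0, num_sims)
r_2    = rep(0, num_sims)

for (i in 1:num_sims) {
  eps = rnorm(n, mean = 0, sd = sigma)
  sim_data$y = beta_0 + beta_1 * sim_data$x1 + beta_2 * sim_data$x2 +
               beta_3 * sim_data$x3 + eps
  # Fit a model of the same form used to simulate.
  fit = lm(y ~ x1 + x2 + x3, data = sim_data)
  f         = summary(fit)$fstatistic
  f_stat[i] = f[1]
  p_val[i]  = pf(f[1], df1 = f[2], df2 = f[3], lower.tail = FALSE)
  r_2[i]    = summary(fit)$r.squared
}
```

Repeating this over both models and all three values of $\sigma$ yields the six empirical distributions for each statistic.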

Potential discussions:

- Do we know the true distribution of any of these values?
- How do the empirical distributions from the simulations compare to the true distributions? (You could consider adding a curve for the true distributions if you know them.)
- How are each of the $F$ statistic, the p-value, and $R^2$ related to $\sigma$? Are any of those relationships the same for the significant and non-significant models?

Additional things to consider:

- Organize the plots in a grid for easy comparison.

# Simulation Study 2: Using RMSE for Selection?

In homework we saw how Test RMSE can be used to select the “best” model. In this simulation study we will investigate how well this procedure works. Since splitting the data is random, we don’t expect it to work correctly each time. We could get unlucky. But averaged over many attempts, we should expect it to select the appropriate model.

We will simulate from the model

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \epsilon_i$

where $\epsilon_i \sim N(0, \sigma^2)$ and

- $\beta_0 = 0$,
- $\beta_1 = 3$,
- $\beta_2 = -4$,
- $\beta_3 = 1.6$,
- $\beta_4 = -1.1$,
- $\beta_5 = 0.7$,
- $\beta_6 = 0.5$.

We will consider a sample size of 500 and three possible levels of noise. That is, three values of $\sigma$.

- $n = 500$
- $\sigma \in (1, 2, 4)$

Use the data found in `study_2.csv` for the values of the predictors. These should be kept constant for the entirety of this study. The `y` values in this data are a blank placeholder.

Each time you simulate the data, randomly split the data into train and test sets of equal sizes (250 observations for training, 250 observations for testing).
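In R, one such split can be made by sampling row indices. This is a sketch that assumes the simulated data sit in a data frame `sim_data` with 500 rows:

```r
# Randomly split 500 rows into equal-sized train / test halves.
trn_idx = sample(1:500, 250)
trn = sim_data[trn_idx, ]   # 250 observations for training
tst = sim_data[-trn_idx, ]  # 250 observations for testing
```

A fresh split should be drawn inside every simulation iteration, not reused across iterations.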

For each, fit **nine** models, with forms:

`y ~ x1`

`y ~ x1 + x2`

`y ~ x1 + x2 + x3`

`y ~ x1 + x2 + x3 + x4`

`y ~ x1 + x2 + x3 + x4 + x5`

`y ~ x1 + x2 + x3 + x4 + x5 + x6` (the correct form of the model, as noted above)

`y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7`

`y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8`

`y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9`

For each model, calculate Train and Test RMSE.
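Here RMSE is the usual root-mean-square error,

$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$

which can be written as a small R helper (the argument names are illustrative):

```r
# Root-mean-square error between observed and predicted responses.
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
# Train RMSE: rmse(trn$y, predict(fit, trn))
# Test RMSE:  rmse(tst$y, predict(fit, tst))
```

Train RMSE uses the training data the model was fit to; Test RMSE uses predictions on the held-out test set.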

Repeat this process with 1000 simulations for each of the 3 values of $\sigma$. For each value of $\sigma$, create a plot that shows how average Train RMSE and average Test RMSE change as a function of model size. Also show the number of times the model of each size was chosen for each value of $\sigma$.

Done correctly, you will have simulated the $y$ vector $3 \times 1000 = 3000$ times. You will have fit $9 \times 3 \times 1000 = 27000$ models. A minimal result would use 3 plots. Additional plots may also be useful.

Potential discussions:

- Does the method **always** select the correct model? On average, does it select the correct model?
- How does the level of noise affect the results?

# Simulation Study 3: Power

In this simulation study we will investigate the **power** of the significance of regression test for simple linear regression.

Recall, we had defined the *significance* level, $\alpha$, to be the probability of a Type I error.

Similarly, the probability of a Type II error is often denoted using $\beta$; however, this should not be confused with a regression parameter.

*Power* is the probability of rejecting the null hypothesis when the null is not true, that is, the alternative is true and $\beta_1$ is non-zero.

Essentially, power is the probability that a signal of a particular strength will be detected. Many things affect the power of a test. In this case, some of those are:

- Sample Size, $n$
- Signal Strength, $\beta_1$
- Noise Level, $\sigma$
- Significance Level, $\alpha$

We’ll investigate the first three.

To do so we will simulate from the model

$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

where $\epsilon_i \sim N(0, \sigma^2)$.

For simplicity, we will let $\beta_0 = 0$; thus $\beta_1$ is essentially controlling the amount of "signal." We will then consider different signals, noises, and sample sizes:

- $\beta_1 \in (-2, -1.9, -1.8, \ldots, -0.1, 0, 0.1, 0.2, 0.3, \ldots, 1.9, 2)$
- $\sigma \in (1, 2, 4)$
- $n \in (10, 20, 30)$

We will hold the significance level constant at $\alpha = 0.05$.

Use the following code to generate the predictor values, `x`, for the different sample sizes:

`x_values = seq(0, 5, length = n)`

For each possible $\beta_1$ and $\sigma$ combination, simulate from the true model at least 1000 times. Each time, perform the significance of the regression test. To estimate the power with these simulations, and some $\alpha$, use

$\hat{\text{Power}} = \hat{P}[\text{Reject } H_0 \mid H_1 \text{ True}] = \frac{\#\text{Tests Rejected}}{\#\text{Simulations}}$

It is *possible* to derive an expression for power mathematically, but often this is difficult, so instead, we rely on simulation.
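A single $(\beta_1, \sigma, n)$ combination might be simulated as in the following sketch. The parameter values shown are placeholders for one cell of the grid; for simple linear regression, the significance of regression test is equivalent to the $t$-test on $\beta_1$, which is what is extracted here.

```r
# Sketch: estimated power for one (beta_1, sigma, n) combination.
alpha    = 0.05
beta_1   = 1    # one value from the beta_1 grid
sigma    = 2    # one value from (1, 2, 4)
n        = 20   # one value from (10, 20, 30)
x_values = seq(0, 5, length = n)
num_sims = 1000

reject = rep(0, num_sims)
for (i in 1:num_sims) {
  y   = 0 + beta_1 * x_values + rnorm(n, mean = 0, sd = sigma)  # beta_0 = 0
  fit = lm(y ~ x_values)
  p   = summary(fit)$coefficients["x_values", "Pr(>|t|)"]
  reject[i] = (p < alpha)
}
power_hat = mean(reject)  # proportion of tests rejected
```

Looping this over the full $\beta_1$, $\sigma$, and $n$ grids produces the power curves requested below.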

Create three plots, one for each value of $\sigma$. Within each of these plots, add a "power curve" for each value of $n$ that shows how power is affected by signal strength, $\beta_1$.

Potential discussions:

- How do $n$, $\beta_1$, and $\sigma$ affect power? Consider additional plots to demonstrate these effects.
- Are 1000 simulations sufficient?