Simulation Study 1: Significance of Regression

In this simulation study we will investigate the significance of regression test. We will simulate from two different models:

The “significant” model

$Y_{i} = β_{0} + β_{1} x_{i1} + β_{2} x_{i2} + β_{3} x_{i3} + ϵ_{i}$

where $ϵ_{i} \sim N (0, σ^{2})$ and

$β_{0} = 3$ ,
$β_{1} = 1$ ,
$β_{2} = 1$ ,
$β_{3} = 1$ .

The “non-significant” model

$Y_{i} = β_{0} + β_{1} x_{i1} + β_{2} x_{i2} + β_{3} x_{i3} + ϵ_{i}$

where $ϵ_{i} \sim N (0, σ^{2})$ and

$β_{0} = 3$ ,
$β_{1} = 0$ ,
$β_{2} = 0$ ,
$β_{3} = 0$ .

For both, we will consider a sample size of $25$ and three possible levels of noise. That is, three values of $σ$ .

$n = 25$
$σ \in (1, 5, 10)$

Use simulation to obtain an empirical distribution for each of the following values, for each of the three values of $σ$ , for both models.

The $F$ statistic for the significance of regression test.
The p-value for the significance of regression test
$R^{2}$

For each model and $σ$ combination, use $2000$ simulations. For each simulation, fit a regression model of the same form used to perform the simulation.

Use the data found in study_1.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.

Done correctly, you will have simulated the y vector $2 \text{ (models)} \times 3 \text{ (sigmas)} \times 2000 \text{ (sims)} = 12000$ times.
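The setup above could be organized roughly as in the following sketch. It is only one possible layout, not a reference solution. The predictor values here are random stand-ins (an assumption made so the code runs on its own); in the study they come from study_1.csv. The names `sim_once`, `betas`, and `results` are hypothetical.

```r
set.seed(420)

n <- 25
# ASSUMPTION: stand-in predictors; in the study these come from study_1.csv
# and stay fixed across all simulations.
study_1 <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))

# Simulate y from the model, refit, and return F statistic, p-value, and R^2.
sim_once <- function(preds, beta, sigma) {
  X <- cbind(1, as.matrix(preds))              # design matrix with intercept
  sim_data <- preds
  sim_data$y <- as.vector(X %*% beta) + rnorm(nrow(preds), mean = 0, sd = sigma)
  fit <- summary(lm(y ~ x1 + x2 + x3, data = sim_data))
  f <- fit$fstatistic                          # named: value, numdf, dendf
  c(f_stat  = unname(f["value"]),
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    r2      = fit$r.squared)
}

betas  <- list(significant = c(3, 1, 1, 1), non_significant = c(3, 0, 0, 0))
sigmas <- c(1, 5, 10)
n_sims <- 2000

# One 3 x n_sims matrix per model and sigma combination (6 in total).
results <- list()
for (model in names(betas)) {
  for (sigma in sigmas) {
    key <- paste(model, sigma, sep = "_")
    results[[key]] <- replicate(n_sims, sim_once(study_1, betas[[model]], sigma))
  }
}
```

Each entry of `results` can then be plotted row by row, for example `hist(results[["significant_1"]]["f_stat", ])`.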

Potential discussions:

Do we know the true distribution of any of these values?
How do the empirical distributions from the simulations compare to the true distributions? (You could consider adding a curve for the true distributions if you know them.)
How is each of the $F$ statistic, the p-value, and $R^{2}$ related to $σ$ ? Are any of those relationships the same for the significant and non-significant models?
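As background for the first two questions, here are the standard results for the non-significant model, where all slope parameters are zero. The symbol $p$ is introduced here for the number of $β$ parameters, so $p = 4$ and $n = 25$.

$F \sim F(p - 1, n - p) = F(3, 21)$
$\text{p-value} \sim \text{Uniform}(0, 1)$
$R^{2} \sim \text{Beta}\left(\frac{p - 1}{2}, \frac{n - p}{2}\right) = \text{Beta}(1.5, 10.5)$

None of these depend on $σ$ . Under the significant model, the $F$ statistic instead follows a noncentral $F$ distribution whose noncentrality parameter depends on the predictor values and on $σ$ .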

Additional things to consider:

Organize the plots in a grid for easy comparison.

Simulation Study 2: Using RMSE for Selection?


In homework we saw how Test RMSE can be used to select the “best” model. In this simulation study we will investigate how well this procedure works. Since splitting the data is random, we don’t expect it to work correctly each time. We could get unlucky. But averaged over many attempts, we should expect it to select the appropriate model.

We will simulate from the model

$Y_{i} = β_{0} + β_{1} x_{i1} + β_{2} x_{i2} + β_{3} x_{i3} + β_{4} x_{i4} + β_{5} x_{i5} + β_{6} x_{i6} + ϵ_{i}$

where $ϵ_{i} \sim N (0, σ^{2})$ and

$β_{0} = 0$ ,
$β_{1} = 3$ ,
$β_{2} = - 4$ ,
$β_{3} = 1.6$ ,
$β_{4} = - 1.1$ ,
$β_{5} = 0.7$ ,
$β_{6} = 0.5$ .

We will consider a sample size of $500$ and three possible levels of noise. That is, three values of $σ$ .

$n = 500$
$σ \in (1, 2, 4)$

Use the data found in study_2.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.

Each time you simulate the data, randomly split the data into train and test sets of equal sizes (250 observations for training, 250 observations for testing).

For each, fit nine models, with forms:

y ~ x1
y ~ x1 + x2
y ~ x1 + x2 + x3
y ~ x1 + x2 + x3 + x4
y ~ x1 + x2 + x3 + x4 + x5
y ~ x1 + x2 + x3 + x4 + x5 + x6, the correct form of the model as noted above
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9

For each model, calculate Train and Test RMSE.

$\text{RMSE}(\text{model}, \text{data}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}$
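The RMSE formula translates almost directly into R. The helper below is a minimal sketch: the name `rmse` is hypothetical, and it assumes the response column is called `y`, as in the model formulas above. The toy data is made up for illustration and is not from study_2.csv.

```r
# Root mean squared error of a fitted model on a given data set.
# Assumes the response column is named `y`.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Tiny usage example on made-up data.
toy <- data.frame(x1 = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
fit <- lm(y ~ x1, data = toy)
rmse(fit, toy)
```

Passing the training data gives Train RMSE; passing the held-out data to the same fitted model gives Test RMSE.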

Repeat this process with $1000$ simulations for each of the $3$ values of $σ$ . For each value of $σ$ , create a plot that shows how average Train RMSE and average Test RMSE change as a function of model size. Also show the number of times the model of each size was chosen for each value of $σ$ .

Done correctly, you will have simulated the $y$ vector $3 \times 1000 = 3000$ times. You will have fit $9 \times 3 \times 1000 = 27000$ models. A minimal result would use $3$ plots. Additional plots may also be useful.
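A rough sketch of the procedure for a single value of $σ$ follows, under stated assumptions. The predictors are random stand-ins so the code is self-contained (the study uses study_2.csv), and `sim_once`, `forms`, `avg_rmse`, and `chosen` are hypothetical names. The model with the lowest Test RMSE in each simulation is treated as the selected one.

```r
set.seed(420)

n <- 500
# ASSUMPTION: stand-in predictors x1..x9; the study uses study_2.csv instead.
study_2 <- as.data.frame(matrix(rnorm(n * 9), nrow = n))
names(study_2) <- paste0("x", 1:9)

beta <- c(0, 3, -4, 1.6, -1.1, 0.7, 0.5)     # beta_0 through beta_6

rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Formulas for model sizes 1 through 9: y ~ x1, y ~ x1 + x2, ...
forms <- lapply(1:9, function(k) reformulate(paste0("x", 1:k), response = "y"))

# One simulation: new y, new random split, nine fits, Train/Test RMSE per size.
sim_once <- function(preds, sigma) {
  sim_data <- preds
  mu <- as.vector(cbind(1, as.matrix(preds[, paste0("x", 1:6)])) %*% beta)
  sim_data$y <- mu + rnorm(nrow(preds), mean = 0, sd = sigma)
  trn_idx <- sample(nrow(sim_data), size = nrow(sim_data) / 2)
  trn <- sim_data[trn_idx, ]
  tst <- sim_data[-trn_idx, ]
  fits <- lapply(forms, function(f) lm(f, data = trn))
  rbind(train = sapply(fits, rmse, data = trn),
        test  = sapply(fits, rmse, data = tst))
}

n_sims <- 1000
sims <- replicate(n_sims, sim_once(study_2, sigma = 1))   # 2 x 9 x n_sims array

avg_rmse <- apply(sims, c(1, 2), mean)       # average RMSE by model size
# Size with the smallest Test RMSE in each simulation, tabulated over sizes 1..9.
chosen <- table(factor(apply(sims["test", , ], 2, which.min), levels = 1:9))
```

Repeating the last block for $σ = 2$ and $σ = 4$ gives the inputs for the three plots and the selection counts.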

Potential discussions:

Does the method always select the correct model? On average, does it select the correct model?
How does the level of noise affect the results?

Simulation Study 3: Power

In this simulation study we will investigate the power of the significance of regression test for simple linear regression.

$H_{0}: β_{1} = 0 \quad \text{vs} \quad H_{1}: β_{1} \neq 0$

Recall, we had defined the significance level, $α$ , to be the probability of a Type I error.

$α = P[\text{Reject } H_{0} \mid H_{0} \text{ True}] = P[\text{Type I Error}]$

Similarly, the probability of a Type II error is often denoted using $β$ ; however, this should not be confused with a regression parameter.

$β = P[\text{Fail to Reject } H_{0} \mid H_{1} \text{ True}] = P[\text{Type II Error}]$

Power is the probability of rejecting the null hypothesis when the null is not true, that is, the alternative is true and $β_{1}$ is non-zero.

$\text{Power} = 1 - β = P[\text{Reject } H_{0} \mid H_{1} \text{ True}]$

Essentially, power is the probability that a signal of a particular strength will be detected. Many things affect the power of a test. In this case, some of those are:

Sample Size, $n$
Signal Strength, $β_{1}$
Noise Level, $σ$
Significance Level, $α$

We’ll investigate the first three.

To do so we will simulate from the model

$Y_{i} = β_{0} + β_{1} x_{i} + ϵ_{i}$

where $ϵ_{i} \sim N (0, σ^{2})$ .

For simplicity, we will let $β_{0} = 0$ , thus $β_{1}$ is essentially controlling the amount of “signal.” We will then consider different signals, noises, and sample sizes:

$β_{1} \in (- 2, - 1.9, - 1.8, \dots, - 0.1, 0, 0.1, 0.2, 0.3, \dots, 1.9, 2)$
$σ \in (1, 2, 4)$
$n \in (10, 20, 30)$

We will hold the significance level constant at $α = 0.05$ .

Use the following code to generate the predictor values, x, for each sample size $n$ .

x_values = seq(0, 5, length = n)

For each possible $β_{1}$ and $σ$ combination, simulate from the true model at least $1000$ times. Each time, perform the significance of the regression test. To estimate the power with these simulations, and some $α$ , use

$\hat{\text{Power}} = \hat{P}[\text{Reject } H_{0} \mid H_{1} \text{ True}] = \frac{\text{\# Tests Rejected}}{\text{\# Simulations}}$
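The estimate above might be computed along the following lines. This is a sketch, with `estimate_power` and `power_curve` as hypothetical names. It uses the p-value of the slope's t test, which in simple linear regression is equivalent to the significance of regression F test. To keep the run time short, the example evaluates only a coarse subset of the $β_{1}$ grid for one $σ$ and one $n$ .

```r
set.seed(420)

alpha <- 0.05

# Estimate power for one (beta_1, sigma, n) combination by simulation.
estimate_power <- function(beta_1, sigma, n, n_sims = 1000) {
  x_values <- seq(0, 5, length = n)
  rejected <- replicate(n_sims, {
    y <- 0 + beta_1 * x_values + rnorm(n, mean = 0, sd = sigma)  # beta_0 = 0
    fit <- summary(lm(y ~ x_values))
    # Reject H_0 when the slope's p-value falls below alpha.
    fit$coefficients["x_values", "Pr(>|t|)"] < alpha
  })
  mean(rejected)                       # proportion of tests rejected
}

# Coarse subset of the grid for speed; the study uses seq(-2, 2, by = 0.1).
beta_1s <- c(-2, -1, 0, 1, 2)
power_curve <- sapply(beta_1s, estimate_power, sigma = 1, n = 10)
```

At $β_{1} = 0$ the null hypothesis is true, so the rejection rate there estimates $α$ rather than power.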

It is possible to derive an expression for power mathematically, but often this is difficult, so instead, we rely on simulation.
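For this particular model an exact expression happens to be tractable, and it can serve as a check on the simulated curves. Under the model's assumptions, the t statistic for the slope follows a noncentral t distribution with $n - 2$ degrees of freedom and noncentrality parameter $β_{1} \sqrt{S_{xx}} / σ$ , where $S_{xx} = \sum (x_{i} - \bar{x})^{2}$ . The function name `exact_power` is hypothetical.

```r
# Exact power of the two-sided t test for the slope in simple linear regression.
exact_power <- function(beta_1, sigma, n, alpha = 0.05) {
  x_values <- seq(0, 5, length = n)
  sxx  <- sum((x_values - mean(x_values))^2)
  ncp  <- beta_1 * sqrt(sxx) / sigma           # noncentrality parameter
  crit <- qt(1 - alpha / 2, df = n - 2)        # two-sided critical value
  # P(T < -crit) + P(T > crit) under the noncentral t distribution.
  pt(-crit, df = n - 2, ncp = ncp) +
    pt(crit, df = n - 2, ncp = ncp, lower.tail = FALSE)
}

exact_power(beta_1 = 0, sigma = 1, n = 10)     # equals alpha when beta_1 = 0
```

Because the function is vectorized over `beta_1`, a call such as `exact_power(seq(-2, 2, by = 0.1), sigma = 2, n = 20)` gives a full curve that could be overlaid on a simulated one.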

Create three plots, one for each value of $σ$ . Within each of these plots, add a “power curve” for each value of $n$ that shows how power is affected by signal strength, $β_{1}$ .

Potential discussions:

How do $n$ , $β_{1}$ , and $σ$ affect power? Consider additional plots to demonstrate these effects.
Are $1000$ simulations sufficient?
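One way to approach the last question is through the Monte Carlo standard error of an estimated proportion. Each simulated test is a rejection or not, so the estimated power behaves like a binomial proportion. The helper name `mc_se` is hypothetical.

```r
# Standard error of a simulated proportion (binomial sampling variability).
mc_se <- function(power, n_sims) sqrt(power * (1 - power) / n_sims)

mc_se(0.5, 1000)     # worst case, about 0.016
mc_se(0.5, 10000)    # ten times the simulations, about 0.005
```

Whether a standard error of this size is acceptable depends on how smooth the power curves need to look.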

Week 6 – Simulation Project STAT 420 solution

