EE219 Project 1 Regression Analysis solution




5/5 - (4 votes)

Introduction: Regression analysis is a statistical procedure for estimating the relationship between
a target variable and a set of potentially relevant variables. Usually, in a stochastic setting, regression
analysis estimates the conditional expectation of the response variable given the other variables;
roughly speaking, the average value of the response variable for a realization of the other variables.
Such an analysis is highly dependent on the underlying data generating process, the assumptions on
which guide the choice of the regression function and the constraints we impose on the relationship
that we want to estimate. If the assumed model is excessively complex, over-fitting occurs, which
diminishes the predictive performance of the model.
In this project, we explore basic regression models on two datasets, along with basic techniques to
handle over-fitting; namely cross-validation, and regularization. With cross-validation, we test how the
model generalizes to unseen data by evaluating its performance on a set of data not used for training,
while with regularization we penalize overly complex models.
Network backup Dataset
1) Load the dataset. You can download the dataset from this link. The dataset is comprised of simulated
traffic data on a backup system in a network. The system monitors the files residing in a destination
machine and copies their changes in four hours cycles. At the end of each backup process, the size of
the data moved to the destination as well as the duration it took are logged, to be used for
developing prediction models. We define a workflow as a task that backs up data from a group of
files, which have similar patterns of change in terms of size over time. In other words, how the files
are changing varies among different workflows and it depends on different factors like the day of the
week it happens and the time of the day. The dataset has around 18000 data points with the
following columns:
 Week index
 Day of the week at which the file back up has started
 Backup start time-Hour of the day: the exact time that the backup process is completed
 Workflow ID
 File name
 Backup size: the size of the file that is backed up in that cycle in GB
 Backup time: the duration of the backup procedure in hour
Given this dataset, we want to develop prediction models for predicting the size of the data being
backed up as well as the time a backup process may take (refer to “Size of Backup” and “Backup Time”
columns in the dataset). To get an idea on the type of relationships in your dataset, for each workflow,
plot the actual copy sizes of all the files on a time period of 20 days. Can you identify any repeating
2) Let us now predict the copy size of a file given the other attributes.
a) Fit a linear regression model with copy size as the target variable and the other attributes as the
features. We use ordinary least square as the penalty function. That is
min ∥ 𝑌 − 𝑋𝛽 ∥2,
where the minimization is on the coefficient vector 𝛽.
Perform a 10-fold cross validation. That is, split the data randomly into 10 parts and each time take 90%
of the data for training and intentionally regard the other 10% to have an unknown response variable
for testing. After training the model compare the predicted value of the 10% testing data with their
actual values. If we split the data into 10 equally sized parts and test 10 times, each time testing for one
of these 10 parts while training on the other 9 parts, we would achieve “10-fold Cross-validation”.
Analyze the significance of different variables with the statistics obtained from the model you have
trained (e.g. p-value with complete description) and report your obtained Root Mean Squared Error
(RMSE). Evaluate how well your model fits the data by providing “Fitted values and actual values
scattered plot over time”, and “residuals versus fitted values plot”.
b) Use a random forest regression model for this same task. Set the parameters of your model with
the following initial values
 Number of trees: 20
 Depth of each tree: 4
And you can initialize the maximum number of features at each node to be the number of features you
have. By tuning the parameters you can improve the performance of the model. Using more trees
reduces the variance. Tune the parameters of your model and report the best RMSE you can get.
Compare the performance in RMSE with the linear regression model developed earlier.
The output of the random forest algorithm gives a lot of insight about the data. Interpret the output of
your random forest model. Which features are more important? Can you identify the patterns you
observed in part 1 in your fitted model?
c) Now use a neural network regression model. Explain the major parameters of your model and how
they affect the performance in RMSE.
3) Predict the Backup size for each of the workflows separately. Explain if the fit is improved? Note
that in this case, you are fitting a piece-wise linear regression model.
Now, try fitting a more complex regression function to your data. You can try a polynomial function of
your variables? Try increasing the degree of the polynomial to improve your fit. Again, use a 10 fold
cross validation to evaluate your results. Plot the RMSE of the trained model against the degree of the
polynomial you fit first for a fixed training and test set, and then for the average RMSE using cross
validation. Can you find a threshold on the degree of the fitted polynomial beyond which the
generalization error of your model gets worse?
Can you explain how cross validation helps controlling the complexity of your model?
Boston Housing Dataset
Load the dataset. You can download the dataset from this link. This dataset concerns housing values in
the suburbs of the greater Boston area and is taken from the StatLib library which is maintained at
Carnegie Mellon University. There are around 500 data points with the following features
 CRIM: per capita crime rate by town
 ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
 INDUS: proportion of non-retail business acres per town
 CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 NOX: nitric oxides concentration (parts per 10 million)
 RM: average number of rooms per dwelling
 AGE: proportion of owner-occupied units built prior to 1940
 DIS: weighted distances to five Boston employment centers
 RAD: index of accessibility to radial highways
 TAX: full-value property-tax rate per $10,000
 PTRATIO: pupil-teacher ratio by town
 B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
 LSTAT: % lower status of the population
 MEDV: Median value of owner-occupied homes in $1000’s
4) Fit a linear regression model with MEDV as the target variable and the other attributes as the
features and ordinary least square as the penalty function. Perform a 10 fold cross validation, analyze
the significance of different variables with the statistics obtained from the model you have trained, and
the averaged Root Mean Squared Error (RMSE), and plot the same curves as in part 2. Repeat the same
steps for a polynomial regression function and find the optimal degree of fit as in part 3.
5) In this part, we try to control over fitting via regularization of the parameters. The idea behind
regularization is to constrain the coefficient vector to lie in a less complex manifold rather than Rp
, with
p being the number of features. In this part we explore common regularization techniques that impose
a further penalty on the size of the regression coefficients along with the sum of residuals. Namely we
consider ridge and lasso regression techniques, which correspond to ℓ1 and ℓ2 regularizations
a) Tune the complexity parameter 𝛼 of the ridge regression below in the range {1, 0.1,0.01,0.001} and
report the best RMSE obtained via 10-fold cross validation.
min ∥ 𝑌 − 𝑋𝛽 ∥2
2+ 𝛼 ∥ 𝛽 ∥2
Compare the value of the optimal coefficients obtained this way with the un-regularized model and the
ℓ1 regularized model below.
b) Repeat the previous part for Lasso regularization as formulated below
min ∥ 𝑌 − 𝑋𝛽 ∥2
2+ 𝛼 ∥ 𝛽 ∥1.
Use an appropriate normalization for the range of 𝛼, if needed.
Submission: Please submit a zip file containing your report, and your codes with a readme file on
how to run your code to The zip file should be named as
“Project1_UID1_UID2_…” where UIDx are student ID numbers of the team members. If you
had any questions you can send an email to the same address.