Assignment 2 Linear and Ridge Regression solution


1 Pre-requisites
• Plotting: In this assignment we will be generating a number of plots to visualize the results.
Visualization plays a very important role in ML, and most of the time you will be looking at
different kinds of graphs/charts to get some key insights about your data and the performance
of different ML algorithms. There are a number of third party plotting libraries and toolkits
available for different platforms and you can select any of them you wish to work with.
• Normalization: In the previous assignment we normalized the entire dataset prior to learning, and then used ten-fold cross-validation to evaluate the performance of the learning
algorithm. However, the correct way to normalize data is to normalize the training
data and record the normalization parameters, i.e., the mean and standard deviation for z-score
normalization, or the min and max feature values for feature re-scaling. This minimizes the
chances of introducing bias during the performance evaluation of the learning algorithm;
in a real-world scenario you would not have access to the test data during the training phase. To
summarize: normalize only your training data, then use the recorded normalization parameters
to normalize your test data, and then estimate the accuracy/error.
• Adding the Constant Feature: For every regression problem remember to add a column
of ones to your dataset.
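The normalization workflow above can be sketched as follows. This is a minimal illustration with NumPy (the function names are my own, not part of the assignment): parameters are fitted on the training split only and then re-applied to any other split, and a constant column is prepended once.

```python
import numpy as np

def zscore_fit(X_train):
    """Compute z-score parameters (mean, std) from the training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def zscore_apply(X, mu, sigma):
    """Apply previously fitted normalization parameters to any split."""
    return (X - mu) / sigma

def add_constant(X):
    """Prepend the column of ones required for the intercept term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])
```

At test time, call `zscore_apply(X_test, mu, sigma)` with the parameters fitted on the training fold; the test data never influences `mu` or `sigma`.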
2 Gradient Descent for Linear Regression
In this problem you will be working with three datasets for regression:
• Housing: This is a regression dataset where the task is to predict the value of houses in the
suburbs of Boston based on thirteen features that describe different aspects that are relevant
to determining the value of a house, such as the number of rooms, levels of pollution in the
area, etc.
• Yacht: This is a regression dataset where the task is to predict the resistance of a sailing yacht’s structure based on six different features that describe structural and buoyancy
properties.
• Concrete: This is a regression dataset where the task is to predict the compressive strength
of concrete based on eight different features. There are a total of 1030 instances and all the features
are numeric.
2.1 Learning Regression Coefficients using Gradient Descent (60 points)
Recall the model for linear regression:

$$y = f_w(x) = \sum_{i=1}^{m} w_i x_i$$

The task is to estimate the least squares error (LSE) regression coefficients $w_i$ for predicting
the output $y$ based on the observed data $x$, using gradient descent.
1. Normalize the features in the training data using z-score normalization.
2. Initialize the weights for the gradient descent algorithm to all zeros, i.e., $w = [0, 0, \ldots, 0]^T$.
3. Use the following set of parameters:
(a) Housing: learning rate $= 0.4 \times 10^{-3}$, tolerance $= 0.5 \times 10^{-2}$
(b) Yacht: learning rate $= 0.1 \times 10^{-2}$, tolerance $= 0.1 \times 10^{-2}$
(c) Concrete: learning rate $= 0.7 \times 10^{-3}$, tolerance $= 0.1 \times 10^{-3}$
NOTE: Here tolerance is defined based on the difference in root mean squared error (RMSE)
measured on the training set between successive iterations, where RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{SSE}{N}}$$

and $N$ is the number of training instances.
4. Be sure to set a maximum number of iterations (recommended maximum iterations = 1000)
so that the algorithm does not run forever.
5. For all three datasets use ten-fold cross-validation to calculate the RMSE (both training and
test) for each fold and the overall mean RMSE. Summarize your results for each dataset as a
table where you report the RMSE (both training and test) for each fold, as well as the average
RMSE and its standard deviation across the folds.
6. Select any fold, and plot the progress of the gradient descent algorithm for each dataset
separately in two different plots. To this end plot the training RMSE for each iteration.
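The training loop described in steps 1–4 above can be sketched as follows; this is a minimal NumPy illustration (the function names and exact gradient scaling are my own choices, not mandated by the assignment), using the RMSE-difference stopping rule and the iteration cap.

```python
import numpy as np

def rmse(X, y, w):
    """Root mean squared error: sqrt(SSE / N)."""
    r = X @ w - y
    return np.sqrt((r @ r) / len(y))

def gradient_descent(X, y, lr, tol, max_iter=1000):
    """Batch gradient descent for least-squares linear regression.

    Stops when the change in training RMSE between successive
    iterations falls below `tol`, or after `max_iter` iterations.
    Returns the weights and the per-iteration RMSE history
    (the history is what step 6 asks you to plot).
    """
    w = np.zeros(X.shape[1])          # step 2: initialize weights to zero
    history = [rmse(X, y, w)]
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)      # gradient of (1/2)*SSE
        w -= lr * grad
        history.append(rmse(X, y, w))
        if abs(history[-2] - history[-1]) < tol:
            break
    return w, history
```

The `history` list directly gives the training RMSE per iteration for the convergence plots in step 6.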
2.2 Interpreting the results (Extra-credit)
For this part you can briefly discuss the impact of the different parameters on the convergence
of gradient descent and the performance of the linear regression model. Please, provide specific
parameter values and empirical evidence to support your claims. You can also include plots to
further elucidate your claims/hypotheses.
1. How sensitive were the results to different starting weights? (Hint: you can do multiple runs
with the weights initialized randomly in [0, 1] or [−1, 1]).
2. Does the tolerance parameter highly affect the results?
3. How was the convergence of the gradient descent algorithm affected by different values of
learning rates?
3 Least Squares Regression using Normal Equations
Use the housing and yacht datasets to estimate the regression weights using the normal equations.
Contrast the performance (measured through RMSE) with the results obtained using the gradient
descent algorithm, based on a ten-fold cross-validation scheme. In this problem you will use the
analytical solution obtained through the normal equations to learn your weight vector, and
contrast the performance (training and test RMSE) for the same folds with your gradient-descent-based
implementation from problem-1.
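The closed-form fit can be sketched in a few lines; a minimal illustration (the function name is my own), solving the normal equations $X^T X w = X^T y$ with a linear solver rather than an explicit matrix inverse for numerical stability:

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least-squares weights, w = (X^T X)^{-1} X^T y.

    Solves the linear system X^T X w = X^T y directly, which is
    more stable than forming the inverse of X^T X explicitly.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

The same per-fold normalization and constant-feature handling as in the gradient descent experiments applies before calling this.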
4 Deriving Normal Equations for Univariate Regression
Consider the linear regression model for the case of a single input and output variable:
$$y = f_w(x) = w_0 + w_1 x$$
Assume that you have a training dataset consisting of N observations $(x_i, y_i)\ \forall i \in \{1, 2, \ldots, N\}$.
Find the values of $w_0$ and $w_1$ that minimize the sum of squares error on the dataset.
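One standard route is to write down the sum of squares error and set its partial derivatives to zero; the sketch below shows the stationarity conditions and the closed-form solution they lead to (with $\bar{x}$, $\bar{y}$ denoting the sample means):

```latex
E(w_0, w_1) = \sum_{i=1}^{N} \left(y_i - w_0 - w_1 x_i\right)^2

% Setting the partial derivatives to zero:
\frac{\partial E}{\partial w_0} = -2\sum_{i=1}^{N}\left(y_i - w_0 - w_1 x_i\right) = 0
\quad\Rightarrow\quad w_0 = \bar{y} - w_1 \bar{x}

\frac{\partial E}{\partial w_1} = -2\sum_{i=1}^{N} x_i\left(y_i - w_0 - w_1 x_i\right) = 0
\quad\Rightarrow\quad w_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}
```

Substituting the expression for $w_0$ into the second condition and rearranging yields the centered form of $w_1$ shown above.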
5 Polynomial Regression
In this problem you will be working with the yacht dataset from the first problem and in addition
we have a new dataset:
• Sinusoid Dataset: This is an artificial dataset created by randomly sampling the function
y = f(x) = 7.5 sin(2.25πx). You have a total of 100 samples of the input feature x and the
corresponding noise-corrupted output/function values y. In addition, you are also given
a validation set that has 50 samples.
5.1 Polynomial Regression using Normal Equations
You will be using your implementation of the normal equations from Problem-2. For ten-fold
cross-validation, calculate the RMSE for each fold (for the training data use the number of training
instances as N when calculating the RMSE, and for the test data use the number of test instances as N).
• Sinusoid dataset: Since there is only one input feature, add higher powers $x^p$ of the input
feature, $p \in \{1, 2, 3, \ldots, 15\}$, and calculate the RMSE on the validation set. Remember that
you are not required to do cross-validation, and you do not need to normalize the input
feature values. Summarize your results by plotting the mean SSE on both the training and
the validation set (on the same plot) versus max(p).
• Yacht dataset: In this problem you add higher powers $x^p$ of the six input features, $p \in
\{1, 2, 3, \ldots, 7\}$, and calculate the mean RMSE using ten-fold cross-validation. For example,
if p = 2 you will have twelve features corresponding to the original six input features and
six new features obtained by squaring the original feature values. Remember to employ the
normalization scheme outlined previously. Summarize your results by plotting the mean
RMSE across folds on both the training and the test set (on the same plot) versus
max(p). Note: During every iteration of the cross-validation you will calculate the RMSE
for both the training and test data. The final quantity that you will plot is the average
of the RMSE across the ten folds (divide the sum of the RMSE for each fold by 10).
NOTE: Do not duplicate the constant feature as you add new sets of features. Remember that
the final design matrix should only have a single column of ones (preferably the first column).
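Building the expanded design matrix while keeping a single column of ones can be sketched as follows (a minimal illustration; the function name is my own):

```python
import numpy as np

def polynomial_design(X, max_p):
    """Build a design matrix with powers 1..max_p of every input feature.

    A single leading column of ones is added exactly once; it is
    never duplicated as new powers are appended.
    """
    cols = [np.ones((X.shape[0], 1))]   # the one constant column
    for p in range(1, max_p + 1):
        cols.append(X ** p)             # elementwise power of all features
    return np.hstack(cols)
```

For the yacht dataset with six features and max_p = 2 this yields 1 + 6 + 6 = 13 columns, matching the twelve features plus the constant described above.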
5.2 Interpreting the results (Extra-credit)
1. Does the addition of new features reduce the RMSE for both datasets? Is the impact of
adding new features on the RMSE identical for both training data and test (validation) data?
2. How can we scale this approach to include higher order polynomials and also cross-terms
between features? Do you think that this is an efficient approach?
6 The Hat Matrix
Using the normal equations we found that the least squares solution for linear regression is:

$$w = (X^T X)^{-1} X^T y$$

We can calculate the output values for our training data using the least squares solution as:

$$\hat{y} = Xw = X(X^T X)^{-1} X^T y$$

The matrix $H = X(X^T X)^{-1} X^T$ is also known as the “hat” matrix, since it puts the hat on y.
Show that the hat matrix H is:
1. Symmetric, i.e., $H^T = H$, and
2. Idempotent, i.e., $HH = H$.
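A quick numerical sanity check of both properties (this is not a substitute for the algebraic proof the problem asks for, just a way to convince yourself they hold):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))                 # random full-rank design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T         # the hat matrix

assert np.allclose(H.T, H)                   # symmetric: H^T = H
assert np.allclose(H @ H, H)                 # idempotent: HH = H
```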
7 Programming Ridge Regression
In this question you will implement ridge regression using the analytical closed form solution that
we obtained in Module 2: Ridge Regression. To this end remember to center your data (subtract
the mean value for all instances for both features and label) and do not include the constant
feature in your design matrix (no column of ones should be added to the data matrix), and set w0
to the mean value of the target variable y. We will be working with the Sinusoid dataset from
Assignment-2 (do not use the validation set). Estimate the training and test RMSE using ten-fold
cross validation for the following settings:
1. Add four new features to the dataset consisting of higher powers $x^p$ of the input feature, $p \in \{1, 2, \ldots, 5\}$. Use
51 values of λ in [0, 10], so that λ ∈ {0, 0.2, 0.4, . . . , 10}. Plot the train and test RMSE vs. λ
on separate plots.
2. Add nine new features to the dataset consisting of higher powers $x^p$ of the input feature, $p \in \{1, 2, \ldots, 9\}$. Use
51 values of λ in [0, 10], so that λ ∈ {0, 0.2, 0.4, . . . , 10}. Plot the train and test RMSE vs. λ
on separate plots.
NOTE: The quantity that you will be plotting is the average RMSE across the ten folds for the
test/train sets, for each value of λ.
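The centered closed-form ridge fit described above can be sketched as follows; a minimal NumPy illustration (the function names are my own): the data is centered with training means, no column of ones is added, and the intercept is the training mean of y.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression on centered data.

    Centers X and y with the training means (no column of ones),
    solves w = (X^T X + lambda I)^{-1} X^T y on the centered data,
    and uses the training mean of y as the intercept w0.
    Returns (w, w0, x_mean) so predictions can re-center new data.
    """
    x_mean = X.mean(axis=0)
    w0 = y.mean()                         # intercept = mean of the target
    Xc = X - x_mean
    yc = y - w0
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    return w, w0, x_mean

def ridge_predict(X, w, w0, x_mean):
    """Predict with centered-data weights and the stored training means."""
    return (X - x_mean) @ w + w0
```

With λ = 0 this reduces to ordinary least squares on the centered data; larger λ shrinks the weight vector, which is the behavior the plots above should make visible.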
7.1 Interpretation:
Do you see a difference in the behavior of regularization between the two expanded datasets created
in the two settings above? Explain your results in detail, keeping in mind that a zero value for λ
corresponds to the solution without regularization, i.e., ordinary linear regression.
8 Maximum Likelihood For Univariate Normal
Consider N samples $\{x_1, x_2, \ldots, x_N\}$ generated from a univariate normal distribution:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \qquad (1)$$

Formulate the log-likelihood for the N samples, and derive the maximum likelihood estimate for
the mean of the distribution, $\mu_{ML}$.
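One standard way to set this up (assuming the samples are independent, so the likelihood is a product of the densities in equation (1)) is:

```latex
\ell(\mu, \sigma) = \sum_{i=1}^{N} \ln p(x_i; \mu, \sigma)
                  = -N \ln\!\left(\sqrt{2\pi}\,\sigma\right)
                    - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2

% Setting the derivative with respect to \mu to zero:
\frac{\partial \ell}{\partial \mu}
  = \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu) = 0
\quad\Rightarrow\quad
\mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i
```

That is, the maximum likelihood estimate of the mean is simply the sample mean.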
8.1 Extra-credit:
Derive the maximum likelihood estimate for the standard deviation σML.