CSci 5521 Homework 1 least squares regression solution

$30.00

Category:

Description

5/5 - (4 votes)

1. (25 points) Consider doing least squares regression based on a training set Ztrain = {(x
t
, rt
), t =
1, . . . , N}, where x
t ∈ R and r
t ∈ R.
(i) (10 points) Consider fitting a linear model of the form
g1(x) = w1x + w0 ,
with unknown parameters w1, w0 ∈ R, which are selected so as to minimize the following
empirical loss:
E(w1, w0|Ztrain) = 1
N
X
N
t=1
(r
t − (w1x
t + w0))2
.
Derive the optimal values of (w1, w0) clearly showing all steps of the derivation.
(ii) (10 points) Consider fitting a polynomial model of the form
g2(x) = v2x
2020 + v1x + v0 ,
with unknown parameters v2, v1, v0 ∈ R, which are selected so as to minimize the following empirical loss:
E(v2, v1, v0|Ztrain) = 1
N
X
N
t=1
(r
t − (v2(x
t
)
2020 + v1x
t + v0))2
.
Derive the optimal values of v2, v1, v0 clearly showing all steps of the derivation.1
(iii) (5 points) For a given training set Ztrain, let (w

1
, w∗
0
) be the optimal values of (w1, w0) in
(i) above, and let (v

2
, v∗
1
, v∗
0
) be the optimal values of (v2, v1, v0) in (ii) above. Professor
Gopher claims that the following is true for any given Ztrain:
E(v

2
, v∗
1
, v∗
0
|Ztrain) ≤ E(w

1
, w∗
0
|Ztrain)
Is Professor Gopher’s claim correct? Clearly explain your answer.2
2. (15 points) Consider the following 4 × 4 matrix:
A =




1 1 1 1
1 2 4 8
1 3 9 27
1 4 16 64




.
(i) (5 points) What are the values of tr(A),tr(AT
),tr(AT A), and tr(AAT
).3
1
It is ok to leave the solution in terms of a linear system, say Av = b, where A ∈ R
3×3
, b ∈ R
3
are known, and
v = [v1 v2 v2]
T ∈ R
3
is a vector of the unknown parameters. If you choose to do this, please also mention your
preferred approach to solve such a linear system.
2 A correct answer with insufficient or incorrect explanation will not get any credit.
3For this problem, you can use python libraries for the computations.
1
(ii) (5 points) From a geometric perspective, explain how the absolute value of |A| (determinant of A) can be computed.
(iii) (5 points) Are the rows of A linearly independent? Clearly explain your answer.2
(For this problem, you can use python libraries to arrive at your answer. If you do that,
clearly explain what you did and why. There is a way to arrive at the answer without
using python libraries.)
Programming assignments: The next two problems involve programming. We will be considering three datasets (derived from two available datasets) for these assignments:
(a) Boston: The Boston housing dataset comes pre-packaged with scikit-learn. The dataset
has 506 points, 13 features, and 1 target (response) variable. You can find more information
about the dataset here:
https://github.com/rupakc/UCI-Data-Analysis/tree/master/Boston Housing Dataset/Boston Housing
While the original dataset is for a regression problem, we will create two classification datasets
for the homework. Note that you only need to work with the response r to create these
classification datasets.
i. Boston50: Let τ50 be the median (50th percentile) over all r (response) values. Create
a 2-class classification problem such that y = 1 if r ≥ τ50 and y = 0 if r < τ50. By
construction, note that the class priors will be p(y = 1) ≈
1
2
, p(y = 0) ≈
1
2
.
ii. Boston75: Let τ75 be the 75th percentile over all r (response) values. Create a 2-class
classification problem such that y = 1 if r ≥ τ75 and y = 0 if r < τ75. By construction,
note that the class priors will be p(y = 1) ≈
1
4
, p(y = 0) ≈
3
4
.
(b) Digits: The Digits dataset comes prepackaged with scikit-learn. The dataset has 1797
points, 64 features, and 10 classes corresponding to ten numbers 0, 1, . . . , 9. The dataset was
(likely) created from the following dataset:
http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
The 2-class classification datasets from Boston50, Boston75, and the 10-class classification dataset
from Digits will be used in the following two problems.
3. (30 points) We will consider three methods from scikit-learn: LinearSVC, SVC, and
LogisticRegression. Use the following parameters for the different methods mentioned:
LinearSVC: max iter=2000
SVC: gamma=‘scale’, C=10
LogisticRegression: penalty=‘l2’, solver=‘lbfgs’, multi class=‘multinomial’,
max iter=5000
(i) (15 points) Develop code for my cross val(method,X,y,k), which performs k-fold crossvalidation on (X, y) using method, and returns the error rate in each fold. Using
my cross val, report the error rates in each fold as well as the mean and standard deviation of error rates across folds for the three methods: LinearSVC, SVC, and LogisticRegression,
applied to the three classification datasets: Boston50, Boston75, and Digits.
You will have to submit (a) code and (b) summary of results for my cross val:
2
(a) Code: You will have to submit code for my cross val(method,X,y,k) (main file)
as well as a wrapper code q3i().
The main file has input: (1) method, which specifies the (class) name of one
of the three classification methods under consideration, (2) X,y, which is data for
the 2-class or 10-class classification problem, (3) k, the number of folds for crossvalidation, and output: (1) the test set error rates for each of the k folds.
The wrapper code has no input and is used to prepare the datasets, and make
calls to my cross val(method,X,y,k) to generate the results for each dataset and
each method. Make sure the calls to my cross val(method,X,y,k) are made in
the following order and add a print to the terminal before each call to show which
method and dataset is being used:
1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with
Digits,
4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits,
7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75;
9. LogisticRegression with Digits.
For example, the first call to my cross val(method,X,y,k) with k = 10 should result
in the following output:
Error rates for LinearSVC with Boston50:
Fold 1: ###
Fold 2: ###

Fold 10: ###
Mean: ###
Standard Deviation: ###
(b) Summary of results: For each dataset and each method, report the test set error
rates for each of the k = 10 folds, the mean error rate over the k folds, and the
standard deviation of the error rates over the k folds. Make a table to present the
results for each method and each dataset (9 tables in total). Include a column in
the table for each fold, and add two columns at the end to show the overall mean
error rate and standard deviation over the k folds. For example:
Error rates for LinearSVC with Boston50
F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 Mean SD
# # # # # # # # # # # #
(ii) (15 points) Develop code for my train test(method,X,y,π,k), which performs random
splits on the data (X, y) so that π ∈ [0, 1] fraction of the data is used for training using
method, rest is used for testing, and the process is repeated k times, after which the
code returns the error rate for each such train-test split. Using my train test, with
π = 0.75 and k = 10, report the mean and standard deviation of error rate for the three
methods: LinearSVC, SVC, and LogisticRegression, applied to the three classification
datasets: Boston50, Boston75, and Digits.
You will have to submit (a) code and (b) summary of results for my train test:
(a) Code: You will have to submit code for my train test(method,X,y,π,k) (main
file) as well as a wrapper code q3ii().
This main file has input: (1) method, which specifies the (class) name of one
3
of the three classification methods under consideration, (2) X,y, which is data for
the 2-class or 10-class classification problem, (3) π, the fraction of data chosen
randomly to be used for training, (4) k, the number of times the train-test split will
be repeated, and output: (1) the test set error rates for each of the k folds printed
to the terminal.
The wrapper code has no input and is used to prepare the datasets, and make calls
to my train test(method,X,y,π,k) to generate the results for each dataset and each
method (9 combinations in total). Make sure the calls to my train test(method,X,y,π,k)
are made in the following order and add a print to the terminal before each call to
show which method and dataset is being used:
1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with
Digits,
4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits,
7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75;
9. LogisticRegression with Digits.
(b) Summary of results: For each dataset and each method, report the test set error
rates for each of the k = 10 runs with π = 0.75, the mean error rate over the k
folds, and the standard deviation of the error rates over the k folds. Make a table
to present the results for each method and each dataset (9 tables in total). Include
a column in the table for each run, and add two columns at the end to show the
overall mean error rate and standard deviation over the k runs.
4. (30 points) The problem considers a preliminary exercise in ‘feature engineering’ with focus
on the Digits dataset. Represented as (X, y), the Digits dataset has X ∈ R
1797×64, i.e.,
1797 training points, each having 64 features, and y ∈ {0, 1, . . . , 9}
1797, i.e., 1797 training
labels with each yi ∈ {0, 1, . . . , 9}. We will consider three methods from scikit-learn:
LinearSVC, SVC, and LogisticRegression for this problem. Use the following parameters
for the different methods mentioned:
LinearSVC: max iter=2000
SVC: gamma=‘scale’, C=10
LogisticRegression: penalty=‘l2’, solver=‘lbfgs’, multi class=‘multinomial’,
max iter=5000
(i) (15 points) For the Digits dataset, starting with X ∈ R
1797×64, you will create a new
feature representation X˜
1 ∈ R
1797×32 as follows: Construct a (random) matrix G ∈
R
64×32 where each element gij ∼ N(0, 1), i.e., sampled independently from a univariate
normal distribution, and then compute X˜
1 = XG. Using (X˜
1, y), perform 10-fold crossvalidation4 using the three methods: LinearSVC, SVC, and LogisticRegression, and
report the mean and the standard deviation of the 10-fold test set error rate.5 The
creation of X˜
1 will be done based on a function rand proj(X, d), where d = 32 for this
problem, and the function will return X˜
1.
4Please use your own code my cross val for this problem.
5Since G is a random matrix, every time you generate G and repeat the procedure, your results will be a bit
different.
4
(ii) (15 points) For the Digits dataset, starting with X ∈ R
1797×64, you will create a new
feature representation X˜
2 ∈ R
1797×2144 as follows: For any training data xi ∈ R
64, let the
elements be xij , j = 1, . . . , 64. The new feature set ˜xi ∈ R
2144 will include all the original
features xij , j = 1, . . . , 64, squares of the original features x
2
ij , j = 1, . . . , 64, and products
of all the original features xijxij0, j < j0
, j = 1, . . . , 64, j0 = j+1, . . . , 64. You should verify that the new ˜xi ∈ R
2144 and hence X˜
2 ∈ R
1797×2144. Using (X˜
2, y), perform 10-fold
cross-validation4 using the three methods: LinearSVC, SVC, and LogisticRegression,
and report the mean and the standard deviation of the 10-fold test set error rate. The
creation of X˜
2 will be done based on a function quad proj(X), and the function will
return X˜
2.
You will have to submit (a) code and (b) summary of results for all three parts:
(a) Code: You will have to submit code for rand proj(X,d), quad proj(X) as well as a
wrapper code q4().
rand proj(X,d) has input: (1) X, which is data (features) for the classification problem,
(2) d, the dimensionality of the projected features, and output: (1) X˜ ∈ R
1797×d
, the
new data for the problem. This output array does not need to be printed to the terminal.
quad proj(X) has input: X, which is data (features) for the classification problem, and
output: (1) X˜
2, the new data with all linear and quadratic combinations of features as
described above. This output array does not need to be printed to the terminal.
The wrapper code has no input and uses these above functions to execute all the
classification exercises outlined in (i) and (ii) above and print the test set error rates for
each of the k folds to the terminal. Make sure the exercises are executed in the following
order and add a print to the terminal before each execution to show which method and
dataset is being used:
1. LinearSVC with X˜
1; 2. LinearSVC with X˜
2,
3. SVC with X˜
1; 4. SVC with X˜
2,
5. LogisticRegression with X˜
1; 6. LogisticRegression with X˜
2.
(b) Summary of results: For each dataset, i.e., X˜
1 and X˜
2, and each method, report the
mean error rate over the k folds, and the standard deviation of the error rates over the k
folds. Make a table to present the results for each method and each dataset (6 tables in
total). Include a column in the table for each fold, and add two columns at the end to
show the overall mean error rate and standard deviation over the k folds.
Additional instructions: Code can only be written in Python (not IPython notebook); no
other programming languages will be accepted. One should be able to execute all programs from
python command prompt. Your code must be run on a CSE lab machine (e.g., csel-kh1260-
01.cselabs.umn.edu). Please make sure you specify the version of Python you are using as well
as instructions on how to run your program in the README file. Information on the size of the
datasets, including number of data points and dimensionality of features, as well as number of
classes can be readily extracted from the datasets in scikit-learn. Each function must take the
inputs in the order specified in the problem and display the output via the terminal or as specified.
For each part, you can submit additional files/functions (as needed) which will be used by the
main file. Please put comments in your code so that one can follow the key parts and steps in your
code.
5
Follow the rules strictly. If we cannot run your code, you will not get any credit.
• Things to submit
1. hw1.pdf: A document which contains the solution to Problems 1, 2, 3, and 4 including
the summary of results for 3 and 4. This document must be in PDF format (no word,
photo, etc. is accepted). If you submit a scanned copy of a hand-written document,
make sure the copy is clearly readable, otherwise no credit may be given.
2. Python code (main.py and other necessary function files) for Problems 3 and 4.
3. README.txt: README file that contains your name, student ID, email, instructions
on how to run your code, any assumptions you are making, and any other necessary
details.
4. Any other files, except the data, which are necessary for your code.
6