## Description

1. (25 points) Consider doing least squares regression based on a training set $Z_{train} = \{(x^t, r^t),\ t = 1, \ldots, N\}$, where $x^t \in \mathbb{R}$ and $r^t \in \mathbb{R}$.

(i) (10 points) Consider fitting a linear model of the form

$$g_1(x) = w_1 x + w_0,$$

with unknown parameters $w_1, w_0 \in \mathbb{R}$, which are selected so as to minimize the following empirical loss:

$$E(w_1, w_0 \mid Z_{train}) = \frac{1}{N} \sum_{t=1}^{N} \left( r^t - (w_1 x^t + w_0) \right)^2.$$

Derive the optimal values of $(w_1, w_0)$, clearly showing all steps of the derivation.
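A derived closed form for $(w_1, w_0)$ can be sanity-checked numerically. The sketch below uses a hypothetical toy training set (invented here for illustration) and `numpy.polyfit`, which minimizes the same sum-of-squares criterion; the $1/N$ factor does not change the minimizer.

```python
import numpy as np

# Hypothetical toy training set (x^t, r^t) generated from r = 2x + 1,
# so the least-squares fit should recover w1 = 2, w0 = 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.0, 3.0, 5.0, 7.0])

# polyfit(deg=1) minimizes sum_t (r^t - (w1 x^t + w0))^2, the same
# criterion as E up to the constant 1/N factor.
w1, w0 = np.polyfit(x, r, deg=1)
print(w1, w0)  # approximately 2.0 and 1.0
```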

(ii) (10 points) Consider fitting a polynomial model of the form

$$g_2(x) = v_2 x^{2020} + v_1 x + v_0,$$

with unknown parameters $v_2, v_1, v_0 \in \mathbb{R}$, which are selected so as to minimize the following empirical loss:

$$E(v_2, v_1, v_0 \mid Z_{train}) = \frac{1}{N} \sum_{t=1}^{N} \left( r^t - (v_2 (x^t)^{2020} + v_1 x^t + v_0) \right)^2.$$

Derive the optimal values of $(v_2, v_1, v_0)$, clearly showing all steps of the derivation.[1]
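Footnote [1] permits leaving the answer as a $3 \times 3$ linear system. As a hedged sketch (toy data invented here, with $x$ kept in $[0, 1]$ so that $x^{2020}$ stays finite), the system can be formed from the design matrix and solved numerically:

```python
import numpy as np

# Hypothetical toy data; x is kept in [0, 1] so x**2020 does not overflow.
x = np.array([0.1, 0.3, 0.5, 0.9, 1.0])
r = np.array([1.0, 0.5, 0.2, 0.4, 2.0])

# Design matrix with rows [ (x^t)^2020, x^t, 1 ]; setting the gradient of
# the loss to zero reduces to the 3x3 normal equations A v = b.
Phi = np.column_stack([x**2020, x, np.ones_like(x)])
A = Phi.T @ Phi / len(x)   # known 3x3 matrix
b = Phi.T @ r / len(x)     # known 3-vector
# lstsq is preferred over an explicit inverse since A can be ill-conditioned
v2, v1, v0 = np.linalg.lstsq(A, b, rcond=None)[0]
```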

(iii) (5 points) For a given training set $Z_{train}$, let $(w_1^*, w_0^*)$ be the optimal values of $(w_1, w_0)$ in (i) above, and let $(v_2^*, v_1^*, v_0^*)$ be the optimal values of $(v_2, v_1, v_0)$ in (ii) above. Professor Gopher claims that the following is true for any given $Z_{train}$:

$$E(v_2^*, v_1^*, v_0^* \mid Z_{train}) \le E(w_1^*, w_0^* \mid Z_{train}).$$

Is Professor Gopher's claim correct? Clearly explain your answer.[2]

2. (15 points) Consider the following $4 \times 4$ matrix:

$$A = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{bmatrix}.$$

(i) (5 points) What are the values of $tr(A)$, $tr(A^T)$, $tr(A^T A)$, and $tr(A A^T)$?[3]
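Footnote [3] allows Python libraries here; one possible numerical check with NumPy:

```python
import numpy as np

A = np.array([[1, 1, 1, 1],
              [1, 2, 4, 8],
              [1, 3, 9, 27],
              [1, 4, 16, 64]])

# tr(A) = tr(A^T) always holds, since transposing leaves the diagonal fixed.
print(np.trace(A), np.trace(A.T))
# tr(A^T A) = tr(A A^T) = sum of squared entries (Frobenius norm squared).
print(np.trace(A.T @ A), np.trace(A @ A.T))
```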

[1] It is ok to leave the solution in terms of a linear system, say $Av = b$, where $A \in \mathbb{R}^{3 \times 3}$ and $b \in \mathbb{R}^3$ are known, and $v = [v_2\ v_1\ v_0]^T \in \mathbb{R}^3$ is the vector of unknown parameters. If you choose to do this, please also mention your preferred approach to solve such a linear system.

[2] A correct answer with insufficient or incorrect explanation will not get any credit.

[3] For this problem, you can use Python libraries for the computations.

(ii) (5 points) From a geometric perspective, explain how $|\det(A)|$, the absolute value of the determinant of $A$, can be computed.

(iii) (5 points) Are the rows of $A$ linearly independent? Clearly explain your answer.[2] (For this problem, you can use Python libraries to arrive at your answer. If you do that, clearly explain what you did and why. There is a way to arrive at the answer without using Python libraries.)
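Since the problem permits Python libraries, one way to check independence numerically is via the matrix rank: the four rows are linearly independent if and only if $A$ has full rank. This is only a sketch of the numerical route, not the required explanation:

```python
import numpy as np

A = np.array([[1, 1, 1, 1],
              [1, 2, 4, 8],
              [1, 3, 9, 27],
              [1, 4, 16, 64]])

# Rows are linearly independent iff rank(A) equals the number of rows (4).
print(np.linalg.matrix_rank(A))
```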

Programming assignments: The next two problems involve programming. We will be considering three datasets (derived from two available datasets) for these assignments:

(a) Boston: The Boston housing dataset comes pre-packaged with scikit-learn. The dataset has 506 points, 13 features, and 1 target (response) variable. You can find more information about the dataset here:

https://github.com/rupakc/UCI-Data-Analysis/tree/master/Boston%20Housing%20Dataset/Boston%20Housing

While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the response r to create these classification datasets.

i. Boston50: Let $\tau_{50}$ be the median (50th percentile) over all r (response) values. Create a 2-class classification problem such that $y = 1$ if $r \ge \tau_{50}$ and $y = 0$ if $r < \tau_{50}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{2}$, $p(y = 0) \approx \frac{1}{2}$.

ii. Boston75: Let $\tau_{75}$ be the 75th percentile over all r (response) values. Create a 2-class classification problem such that $y = 1$ if $r \ge \tau_{75}$ and $y = 0$ if $r < \tau_{75}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{4}$, $p(y = 0) \approx \frac{3}{4}$.
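The thresholding above can be sketched as follows. Note that recent scikit-learn releases removed the Boston loader, so the response vector `r` is left as an input here; the helper name `threshold_labels` is invented for illustration.

```python
import numpy as np

# `r` is the vector of response values from the Boston dataset;
# pct=50 gives the Boston50 labels, pct=75 gives the Boston75 labels.
def threshold_labels(r, pct):
    tau = np.percentile(r, pct)      # tau_50 or tau_75
    return (r >= tau).astype(int)    # y = 1 iff r >= tau

# y50 = threshold_labels(r, 50)   # Boston50
# y75 = threshold_labels(r, 75)   # Boston75
```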

(b) Digits: The Digits dataset comes pre-packaged with scikit-learn. The dataset has 1797 points, 64 features, and 10 classes corresponding to the ten digits 0, 1, ..., 9. The dataset was (likely) created from the following dataset:

http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

The 2-class classification datasets Boston50 and Boston75 and the 10-class classification dataset Digits will be used in the following two problems.

3. (30 points) We will consider three methods from scikit-learn: `LinearSVC`, `SVC`, and `LogisticRegression`. Use the following parameters for the different methods mentioned:

- `LinearSVC`: `max_iter=2000`
- `SVC`: `gamma='scale', C=10`
- `LogisticRegression`: `penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=5000`

(i) (15 points) Develop code for `my_cross_val(method, X, y, k)`, which performs k-fold cross-validation on (X, y) using `method`, and returns the error rate in each fold. Using `my_cross_val`, report the error rates in each fold as well as the mean and standard deviation of the error rates across folds for the three methods: `LinearSVC`, `SVC`, and `LogisticRegression`, applied to the three classification datasets: Boston50, Boston75, and Digits.

You will have to submit (a) code and (b) a summary of results for `my_cross_val`:


(a) Code: You will have to submit code for `my_cross_val(method, X, y, k)` (main file) as well as wrapper code `q3i()`.

The main file has input: (1) `method`, which specifies the (class) name of one of the three classification methods under consideration, (2) `X, y`, which is the data for the 2-class or 10-class classification problem, (3) `k`, the number of folds for cross-validation; and output: (1) the test set error rates for each of the k folds.

The wrapper code has no input and is used to prepare the datasets and make calls to `my_cross_val(method, X, y, k)` to generate the results for each dataset and each method. Make sure the calls to `my_cross_val(method, X, y, k)` are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:

1. `LinearSVC` with Boston50; 2. `LinearSVC` with Boston75; 3. `LinearSVC` with Digits;
4. `SVC` with Boston50; 5. `SVC` with Boston75; 6. `SVC` with Digits;
7. `LogisticRegression` with Boston50; 8. `LogisticRegression` with Boston75; 9. `LogisticRegression` with Digits.

For example, the first call to `my_cross_val(method, X, y, k)` with k = 10 should result in the following output:

```
Error rates for LinearSVC with Boston50:
Fold 1: ###
Fold 2: ###
...
Fold 10: ###
Mean: ###
Standard Deviation: ###
```

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Include a column in the table for each fold, and add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:

Error rates for LinearSVC with Boston50

| F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | Mean | SD |
|----|----|----|----|----|----|----|----|----|-----|------|----|
| #  | #  | #  | #  | #  | #  | #  | #  | #  | #   | #    | #  |
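A minimal sketch of what `my_cross_val` might look like, assuming `method` is passed as the classifier class and instantiated here with default arguments (the real code would pass the parameters listed above):

```python
import numpy as np
from sklearn.model_selection import KFold

def my_cross_val(method, X, y, k):
    """Return the test set error rate for each of the k folds."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        clf = method()                      # e.g., LinearSVC, SVC, ...
        clf.fit(X[train_idx], y[train_idx])
        # error rate = 1 - accuracy on the held-out fold
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return np.array(errors)
```

The wrapper would then print each entry of the returned array along with its mean and standard deviation.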

(ii) (15 points) Develop code for `my_train_test(method, X, y, pi, k)`, which performs random splits on the data (X, y) so that a fraction $\pi \in [0, 1]$ of the data is used for training with `method` and the rest is used for testing; the process is repeated k times, after which the code returns the error rate for each such train-test split. Using `my_train_test` with $\pi = 0.75$ and k = 10, report the mean and standard deviation of the error rates for the three methods: `LinearSVC`, `SVC`, and `LogisticRegression`, applied to the three classification datasets: Boston50, Boston75, and Digits.

You will have to submit (a) code and (b) a summary of results for `my_train_test`:

(a) Code: You will have to submit code for `my_train_test(method, X, y, pi, k)` (main file) as well as wrapper code `q3ii()`.

This main file has input: (1) `method`, which specifies the (class) name of one


of the three classification methods under consideration, (2) `X, y`, which is the data for the 2-class or 10-class classification problem, (3) `pi`, the fraction of data chosen randomly to be used for training, (4) `k`, the number of times the train-test split will be repeated; and output: (1) the test set error rates for each of the k splits, printed to the terminal.

The wrapper code has no input and is used to prepare the datasets and make calls to `my_train_test(method, X, y, pi, k)` to generate the results for each dataset and each method (9 combinations in total). Make sure the calls to `my_train_test(method, X, y, pi, k)` are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:

1. `LinearSVC` with Boston50; 2. `LinearSVC` with Boston75; 3. `LinearSVC` with Digits;
4. `SVC` with Boston50; 5. `SVC` with Boston75; 6. `SVC` with Digits;
7. `LogisticRegression` with Boston50; 8. `LogisticRegression` with Boston75; 9. `LogisticRegression` with Digits.

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 runs with $\pi = 0.75$, the mean error rate over the k runs, and the standard deviation of the error rates over the k runs. Make a table to present the results for each method and each dataset (9 tables in total). Include a column in the table for each run, and add two columns at the end to show the overall mean error rate and standard deviation over the k runs.
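Under the same assumptions as the `my_cross_val` sketch above (classifier class passed in and constructed with default arguments), `my_train_test` might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def my_train_test(method, X, y, pi, k):
    """Repeat a random pi/(1-pi) train/test split k times; return error rates."""
    errors = []
    for _ in range(k):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=pi)
        clf = method()
        clf.fit(X_tr, y_tr)
        errors.append(1.0 - clf.score(X_te, y_te))
    return np.array(errors)
```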

4. (30 points) This problem considers a preliminary exercise in 'feature engineering' with focus on the Digits dataset. Represented as (X, y), the Digits dataset has $X \in \mathbb{R}^{1797 \times 64}$, i.e., 1797 training points, each having 64 features, and $y \in \{0, 1, \ldots, 9\}^{1797}$, i.e., 1797 training labels with each $y_i \in \{0, 1, \ldots, 9\}$. We will consider three methods from scikit-learn: `LinearSVC`, `SVC`, and `LogisticRegression` for this problem. Use the following parameters for the different methods mentioned:

- `LinearSVC`: `max_iter=2000`
- `SVC`: `gamma='scale', C=10`
- `LogisticRegression`: `penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=5000`

(i) (15 points) For the Digits dataset, starting with $X \in \mathbb{R}^{1797 \times 64}$, you will create a new feature representation $\tilde{X}_1 \in \mathbb{R}^{1797 \times 32}$ as follows: Construct a (random) matrix $G \in \mathbb{R}^{64 \times 32}$ where each element $g_{ij} \sim N(0, 1)$, i.e., sampled independently from a univariate normal distribution, and then compute $\tilde{X}_1 = XG$. Using $(\tilde{X}_1, y)$, perform 10-fold cross-validation[4] using the three methods: `LinearSVC`, `SVC`, and `LogisticRegression`, and report the mean and the standard deviation of the 10-fold test set error rate.[5] The creation of $\tilde{X}_1$ will be done by a function `rand_proj(X, d)`, where d = 32 for this problem, and the function will return $\tilde{X}_1$.
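A sketch of `rand_proj` under the stated construction (each entry of $G$ drawn i.i.d. from $N(0, 1)$):

```python
import numpy as np

def rand_proj(X, d):
    """Project X onto d random directions with i.i.d. N(0,1) entries."""
    G = np.random.randn(X.shape[1], d)  # 64 x 32 for this problem
    return X @ G                        # X_tilde_1 = X G
```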

[4] Please use your own code `my_cross_val` for this problem.

[5] Since G is a random matrix, every time you generate G and repeat the procedure, your results will be a bit different.


(ii) (15 points) For the Digits dataset, starting with $X \in \mathbb{R}^{1797 \times 64}$, you will create a new feature representation $\tilde{X}_2 \in \mathbb{R}^{1797 \times 2144}$ as follows: For any training point $x_i \in \mathbb{R}^{64}$, let the elements be $x_{ij}$, $j = 1, \ldots, 64$. The new feature vector $\tilde{x}_i \in \mathbb{R}^{2144}$ will include all the original features $x_{ij}$, $j = 1, \ldots, 64$, squares of the original features $x_{ij}^2$, $j = 1, \ldots, 64$, and products of all pairs of original features $x_{ij} x_{ij'}$, $j < j'$, $j = 1, \ldots, 64$, $j' = j+1, \ldots, 64$. You should verify that the new $\tilde{x}_i \in \mathbb{R}^{2144}$ and hence $\tilde{X}_2 \in \mathbb{R}^{1797 \times 2144}$. Using $(\tilde{X}_2, y)$, perform 10-fold cross-validation[4] using the three methods: `LinearSVC`, `SVC`, and `LogisticRegression`, and report the mean and the standard deviation of the 10-fold test set error rate. The creation of $\tilde{X}_2$ will be done by a function `quad_proj(X)`, and the function will return $\tilde{X}_2$.
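A sketch of `quad_proj`; for $p = 64$ input features this yields $64 + 64 + \binom{64}{2} = 2144$ columns:

```python
import numpy as np

def quad_proj(X):
    """Append squares and all pairwise products (j < j') to the features."""
    n, p = X.shape
    cols = [X, X**2]                    # originals and their squares
    for j in range(p):
        for jp in range(j + 1, p):      # products x_ij * x_ij' with j < j'
            cols.append((X[:, j] * X[:, jp]).reshape(n, 1))
    return np.hstack(cols)
```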

You will have to submit (a) code and (b) a summary of results for both parts:

(a) Code: You will have to submit code for `rand_proj(X, d)` and `quad_proj(X)` as well as wrapper code `q4()`.

`rand_proj(X, d)` has input: (1) `X`, which is the data (features) for the classification problem, (2) `d`, the dimensionality of the projected features; and output: (1) $\tilde{X} \in \mathbb{R}^{1797 \times d}$, the new data for the problem. This output array does not need to be printed to the terminal.

`quad_proj(X)` has input: `X`, which is the data (features) for the classification problem; and output: (1) $\tilde{X}_2$, the new data with all linear and quadratic combinations of features as described above. This output array does not need to be printed to the terminal.

The wrapper code has no input and uses the above functions to execute all the classification exercises outlined in (i) and (ii) above and print the test set error rates for each of the k folds to the terminal. Make sure the exercises are executed in the following order, and add a print to the terminal before each execution to show which method and dataset is being used:

1. `LinearSVC` with $\tilde{X}_1$; 2. `LinearSVC` with $\tilde{X}_2$;
3. `SVC` with $\tilde{X}_1$; 4. `SVC` with $\tilde{X}_2$;
5. `LogisticRegression` with $\tilde{X}_1$; 6. `LogisticRegression` with $\tilde{X}_2$.

(b) Summary of results: For each dataset, i.e., $\tilde{X}_1$ and $\tilde{X}_2$, and each method, report the mean error rate over the k folds and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (6 tables in total). Include a column in the table for each fold, and add two columns at the end to show the overall mean error rate and standard deviation over the k folds.

Additional instructions: Code can only be written in Python (not IPython notebook); no other programming languages will be accepted. One should be able to execute all programs from the Python command prompt. Your code must run on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu). Please make sure you specify the version of Python you are using, as well as instructions on how to run your program, in the README file. Information on the size of the datasets, including the number of data points and dimensionality of features, as well as the number of classes, can be readily extracted from the datasets in scikit-learn. Each function must take the inputs in the order specified in the problem and display the output via the terminal or as specified. For each part, you can submit additional files/functions (as needed) which will be used by the main file. Please put comments in your code so that one can follow the key parts and steps in your code.


Follow the rules strictly. If we cannot run your code, you will not get any credit.

• Things to submit:

1. hw1.pdf: A document which contains the solution to Problems 1, 2, 3, and 4, including the summary of results for Problems 3 and 4. This document must be in PDF format (no Word documents, photos, etc. will be accepted). If you submit a scanned copy of a hand-written document, make sure the copy is clearly readable; otherwise no credit may be given.

2. Python code (main.py and other necessary function files) for Problems 3 and 4.

3. README.txt: A README file that contains your name, student ID, email, instructions on how to run your code, any assumptions you are making, and any other necessary details.

4. Any other files, except the data, which are necessary for your code.
