## Description

Problem 1: Linear regression learns a linear function of the feature variables X to fit the responses y. In this problem, you will derive the closed-form solutions for two linear regression formulations.

1. The standard linear regression can be formulated as solving a least-squares problem:

   $$\min_{w} \; \|Xw - y\|^2,$$

   where $X \in \mathbb{R}^{n \times m}$ ($n \ge m$) is the feature matrix, $y \in \mathbb{R}^{n \times 1}$ is the response vector, and $w \in \mathbb{R}^{m \times 1}$ is the vector of linear coefficients. This is a convex objective function of $w$. Derive the optimal $w$ by setting the derivative of the objective with respect to $w$ to zero.
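The closed-form answer your derivation arrives at can be sanity-checked numerically. The sketch below (the random data is purely illustrative) compares the solution of the normal equations, obtained by setting the gradient $2X^T(Xw - y)$ to zero, against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 3
X = rng.normal(size=(n, m))   # illustrative feature matrix
y = rng.normal(size=n)        # illustrative response vector

# Solve the normal equations X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's built-in least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))  # True
```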

2. In practice, an L2-norm regularizer is often added to the least-squares objective, a method called ridge regression, to overcome ill-posed problems where the Hessian matrix is not positive definite. The objective function of ridge regression is defined as

   $$\min_{w} \; \|Xw - y\|^2 + \lambda \|w\|^2,$$

   where $\lambda > 0$. This objective function is strictly convex. Derive the solution of the ridge regression problem to find the optimal $w$.
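As with the unregularized case, the ridge solution can be checked numerically. The sketch below (random data is illustrative) verifies that the regularized gradient vanishes at the candidate solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 3
X = rng.normal(size=(n, m))   # illustrative feature matrix
y = rng.normal(size=n)        # illustrative response vector
lam = 0.1                     # regularization strength lambda > 0

# Candidate ridge solution: solve (X^T X + lam I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# The gradient of the ridge objective should vanish at the solution
grad = 2 * X.T @ (X @ w_ridge - y) + 2 * lam * w_ridge
print(np.allclose(grad, 0))  # True
```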

Problem 2: Consider a coin with probability of heads equal to Pr(H) = p and probability

of tails Pr(T) = 1 − p. You toss it 5 times and get outcomes H,H,T,T,H.

1. What is the probability of observing the sequence H,H,T,T,H in five tosses? Also give the formula for the natural logarithm of this probability. Your formulas should be functions of $p$.
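As an illustration (not a substitute for the written formulas), the sequence probability and its natural log can be expressed as Python functions of p; the function names are ours, and the fair-coin value is used only as a spot check:

```python
import numpy as np

def seq_prob(p):
    # probability of the exact sequence H,H,T,T,H: three heads, two tails
    return p**3 * (1 - p)**2

def log_seq_prob(p):
    # natural log of the above: 3*ln(p) + 2*ln(1-p)
    return 3 * np.log(p) + 2 * np.log(1 - p)

print(seq_prob(0.5))  # 0.03125, i.e. (1/2)^5 for a fair coin
print(np.isclose(np.log(seq_prob(0.5)), log_seq_prob(0.5)))  # True
```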

2. You have a box containing exactly 2 coins, one fair with p = 1/2 and one biased with

p = 2/3. You choose one of these two coins at random with equal probability, toss it

5 times and get the outcome H,H,T,T,H.

(a) Give the joint probability that the coin chosen was the fair coin (p = 1/2) and

the outcome was H,H,T,T,H.

(b) Give the joint probability that the coin chosen was the biased coin (p = 2/3) and

the outcome was H,H,T,T,H.

Instructors: Rui Kuang (kuang@umn.edu). TAs: Tianci Song (song0309@umn.edu)

3. What should the bias p = Pr(H) be to maximize the probability of observing H,H,T,T,H, and what is the corresponding probability of observing H,H,T,T,H (i.e., what is the maximum likelihood estimate for p), assuming p is unknown? Show the derivation. Hint: maximize the log of the likelihood function.
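As a numerical sanity check of the hint (not the requested derivation), maximizing the log-likelihood over a grid of p values places the maximum at the fraction of heads in the observed sequence:

```python
import numpy as np

p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = 3 * np.log(p_grid) + 2 * np.log(1 - p_grid)  # log of p^3 (1-p)^2
p_hat = p_grid[np.argmax(log_lik)]
print(round(p_hat, 2))  # 0.6, the fraction of heads (3 out of 5)
```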

Problem 3: Below is the pseudocode of the perceptron algorithm for binary classification:

```
w = w0
repeat until convergence:
    for each sample (x^t, r^t):
        if <w, x^t> * r^t <= 0:
            w = w + r^t * x^t
```

1. Implement the perceptron algorithm and test it on the provided data. To begin, load the file data1.mat into Python. $X \in \mathbb{R}^{40 \times 2}$ is the feature matrix of 40 samples in 2 dimensions and $y \in \mathbb{R}^{40 \times 1}$ is the label vector (+1/−1). Initialize $w$ to be the vector $[1, -1]$. Visualize all the samples (use different colors for the two classes) and plot the decision boundary defined by the initial $w$.

   Now, run your perceptron algorithm on the given data. How many iterations does it take to converge? Plot the decision boundary defined by the $w$ returned by the perceptron program.

Hint: To load data in MATLAB format, you may use the function loadmat from the io module of the scipy package. To visualize the samples you could use the function scatter() from the pyplot module of the matplotlib package. Plotting the boundary is equivalent to plotting the line $w^T x = 0$. Therefore, you could first generate a vector $a$ to be your x-axis, then compute the y-axis vector $b$ as $b = -\frac{w(1)\,a}{w(2)}$. Once the plot is generated, you could use the xlim() and ylim() functions in the pyplot module of the matplotlib package to make sure your axes cover the right range.

When you are done, your plots will look like the following figures:

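Putting the pseudocode and the hints together, a minimal sketch might look like the following. The stand-in data is illustrative; in the assignment, X and y come from data1.mat via loadmat, and the plotting calls from the hint are omitted here to keep the sketch self-contained:

```python
import numpy as np

def MyPerceptron(X, y, w0, max_steps=1000):
    """Perceptron updates as in the pseudocode above.

    Returns the learned w and the number of full passes over the data.
    """
    w = np.asarray(w0, dtype=float)
    for step in range(1, max_steps + 1):
        mistakes = 0
        for xt, rt in zip(X, y):
            if rt * (w @ xt) <= 0:   # misclassified (or on the boundary)
                w = w + rt * xt      # perceptron update
                mistakes += 1
        if mistakes == 0:            # one clean pass: converged
            return w, step
    return w, max_steps

# Illustrative linearly separable stand-in for data1.mat
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, step = MyPerceptron(X, y, [1.0, -1.0])
print(np.all(y * (X @ w) > 0))  # True: all samples correctly classified
```

In the assignment itself, X and y would be loaded with scipy.io.loadmat and the samples and boundary drawn with matplotlib's scatter() and plot() as described in the hint.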

2. In the previous question the samples from the two classes are linearly separable. Now let's look at a linearly non-separable case. Load the file data2.mat into Python and run your perceptron algorithm with $w = [1, -1]$. Can the perceptron algorithm converge? Explain why. To improve the algorithm, we can introduce a "soft" linear classifier that tolerates errors. It turns out we can solve the following LP problem:

   $$\begin{aligned} \underset{w,\;\xi^t}{\text{minimize}} \quad & \sum_t \xi^t \\ \text{subject to} \quad & r^t (w^T x^t) \ge 1 - \xi^t \\ & \xi^t \ge 0. \end{aligned}$$

   Here $\xi^t$ is the error, which needs to be minimized. The function linprog() from the optimize module of the scipy package can be used for this problem. Now, run the following Python code to solve the LP problem on data2.mat.

```python
import numpy as np
from scipy.optimize import linprog

y = np.ravel(y)                              # ensure y is a flat vector of length m
m, n = np.shape(X)
X = np.hstack((X, np.ones((m, 1))))          # append a constant column for the bias
n = n + 1
# objective: minimize the sum of the slack variables xi (w has zero cost)
f = np.append(np.zeros(n), np.ones(m))
# constraints r^t (w^T x^t) + xi^t >= 1 and xi^t >= 0,
# rewritten in the A x <= b form expected by linprog
A1 = np.hstack((X * np.tile(y, (n, 1)).T, np.eye(m)))
A2 = np.hstack((np.zeros((m, n)), np.eye(m)))
A = -np.vstack((A1, A2))
b = np.append(-np.ones(m), np.zeros(m))
# bounds=(None, None): w may be negative; xi >= 0 is already enforced via A2
res = linprog(f, A_ub=A, b_ub=b, bounds=(None, None))
w = res.x[0:n]
```

Apply this algorithm to data2.mat, visualize the data, and plot the boundary defined by the w returned by the LP.
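Note that because the LP code appends a constant column, the returned w carries a bias term, so the boundary line differs slightly from the one in the Part 1 hint. A sketch (the w values below are illustrative, not output from the LP):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])      # illustrative [w1, w2, bias]
a = np.linspace(-3, 3, 50)          # x-axis values
b = -(w[0] * a + w[2]) / w[1]       # solve w1*x + w2*y + bias = 0 for y
# plt.plot(a, b) would overlay this boundary on the scatter plot
print(np.allclose(w[0] * a + w[1] * b + w[2], 0))  # True
```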

Submission

• Things to submit:

1. hw0_sol.pdf: a document containing all the derivations for Problems 1 and 2 and the three plots requested in Problem 3.

2. MyPerceptron.py: a Python function defined as MyPerceptron(X, y, w0) that returns w and step, where X is the feature matrix, y is a label vector, and w0 is the initialization of the parameter w. In the output, w is the parameter found by the perceptron and step is the number of steps the algorithm takes to converge. The function should also display the plot of the samples and the boundary.


Note that only the numpy package and the functions mentioned above are allowed in this assignment.

3. Zip all the files into a single zipped file named after your name.

• Submit: All material must be submitted electronically via Canvas. This homework will not be graded but is required as proof of satisfying the prerequisites for taking the class.
