# CSci 5521–002 Homework 0 Linear regression solution

\$30.00

Category:

## Description

Problem 1: Linear regression learns a linear function of feature variables X to fit the
responses y. In this problem, you will derive the closed-form solution for linear regression
formulations.
1. The standard linear regression can be formulated as solving a least square problem
minimize
w
φ(w) = kXw − yk
2
2 = hXw − y, Xw − yi
where X ∈ R
n×m (n ≥ m) represents the feature matrix, y ∈ R
n×1
represents the
response vector and w ∈ R
m×1
is the vector variable of the linear coefficients. Here the
i − j-th element of X, denoted xij , is the j-th attribute value for the i-th data sample
(observation) and yi
is the true response for the i-th data sample. This is a convex
objective function of w. Derive the optimal w by setting the gradient of the function
wrt w to zero to minimize the objective function. To find the gradient, you can use
the following formula
φ(w + δ) = [X(w + δ)]T X(w + δ) − 2 [X(w + δ)]T
y + y
T y
= φ(w) + 2 [Xδ]
T
[Xw − y] + (Xδ)
T Xδ,
and note that w must be determined so that φ(w + δ) ≥ φ(w) for any possible vector
δ (why?). Here XT denotes the transpose of X.
2. In practice, a L2-norm regularizer is often introduced with the least squares, called
Ridge Regression, to overcome ill-posed problems where the hessian matrix is not
positive definite. The objective function of ridge regression is defined as
minimize
w
φ˜(w) = kXw − yk
2 + λkwk
2 =


X

λI
w −

y
0


2
where λ > 0 and I is an m × m identity matrix. This objective function is strictly
convex. Derive the solution of the ridge regression problem to find the optimal w.
1
Problem 2: Consider a coin with probability of heads equal to Pr(H) = p and probability
of tails Pr(T) = 1 − p. You toss it 5 times and get outcomes H,H,T,T,H.
1. What is the probability of observing the sequence H,H,T,T,H in five tosses. Also give
the formula for the natural logarithm of this probability. Your formulas should be a
function of p.
2. You have a box containing exactly 2 coins, one fair with p = 1/2 and one biased with
p = 2/3. You choose one of these two coins at random with equal probability, toss it
5 times and get the outcome H,H,T,T,H.
(a) Give the joint probability that the coin chosen was the fair coin (p = 1/2) and
the outcome was H,H,T,T,H.
(b) Give the joint probability that the coin chosen was the biased coin (p = 2/3) and
the outcome was H,H,T,T,H.
3. What should the bias p = Pr(H) be to maximize the probability of observing H,H,T,T,H,
and what is the corresponding probability of observing H,H,T,T,H (i.e., what is the
maximum likelihood estimate for p), assuming p were unknown? Show the derivation.
Hint: maximize the log of the function.
Problem 3: Below is the pseudo-code of perceptron algorithm for binary classification,
where (x
t
, rt
) is the t-th data sample: x
t
is the vector of attribute values (real numbers) and
r
t = ±1 is the class label for the t-th sample:
1. w = w0.
2. Do Iterate until convergence
3. For each sample (x
t
, rt
), t = 1, 2, · · ·
4. If r
t
hw, x
t
i ≤ 0
5. w = w + r
tx
t
Here “convergence” means w does not change at all over one pass through the entire training
dataset in the loop starting in step 3.
1. Implement the perceptron algorithm and test it on the provided data. To begin, do
“load data1.mat” to load the file the data file into MATLAB. X ∈ R
40×2
is the
feature matrix of 40 samples in 2 dimensions and r ∈ R
40×1
is the label vector (±1).
Use initial value w0 = [1; −1]T
. Now, run your perceptron algorithm on the given data.
How many iterations does it take to converge?
2. Visualize all the samples (use 2 different colors for the 2 different classes) and plot the
decision boundary defined by the initial w0. Plot the decision boundary defined by the
w returned by the perceptron program.
Hint: To visualize the samples you could use the MATLAB function call
scatter(X(:,1), X(:,2), 50, y, ’*’);
2
Type help scatter for more information. Plotting the boundary is equivalent to plotting the line wT x = w1x1 + w2x2 = 0. Since all the sample points are located within
the square {(x1, x2), −1 ≤ x1, x2 ≤ +1}, choose two points (a, b) and (c, d) by setting
a = −1, c = +1 and solving for b, d, or else set b = −1, d = +1 and solving for a, c, and
then draw the line between the two points (a, b) and (c, d) with the command
hold on; plot([a,c],[b,d]); hold off;
Use the hold function to add the line to the existing scatter plot, and axis to adjust
the axes, if needed. Draw both the initial boundary and the final boundary on the
same plot.
Submission
• Things to submit:
1. hw0 sol.pdf: a document contains all the derivations of Problem 1 & 2 and the