Description
1. [Linear Regression] We are given a dataset D = {(−1, 0), (1, 1)} containing two pairs
   (x, y) with x ∈ R and y ∈ R. We want to find the parameters w = (w_1, w_0)^T ∈ R^2 of a
   linear regression model y = w_1 x + w_0 using

       \min_{w} \; \frac{1}{2} \sum_{(x,y) \in D} \left( y - w^\top \begin{bmatrix} x \\ 1 \end{bmatrix} \right)^2    (1)
(a) [2 pt] Plot the given dataset and find the optimal w∗ by inspection.
(b) [4 pt] Write down y ∈ R^2 and X ∈ R^{2×2} which make the following optimization
    equivalent to (1):

        \min_{w} \; \frac{1}{2} \lVert y - Xw \rVert_2^2    (2)
(c) [3 pt] Derive the general analytical solution for (2). Also plug in the values for
the given dataset D and compute the solution numerically.
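For reference, the derivation should arrive at the standard normal-equation form (assuming X^T X is invertible):

    \nabla_w \left( \tfrac{1}{2} \lVert y - Xw \rVert_2^2 \right) = X^\top (Xw - y) = 0
    \quad \Longrightarrow \quad
    w^* = (X^\top X)^{-1} X^\top y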
(d) [3 pt] There are several ways to compute this solution via PyTorch. Read the docs
for the functions torch.lstsq, torch.solve, torch.inverse. Use all three approaches
when completing the file LinearRegression.py and verify your answer.
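For orientation, here is a minimal sketch of the three routes on made-up data (not the dataset above). Note that recent PyTorch releases have replaced torch.lstsq and torch.solve with torch.linalg.lstsq and torch.linalg.solve, so the sketch uses the current names:

    import torch

    torch.manual_seed(0)
    # Hypothetical example data (NOT the dataset D from the problem).
    X = torch.randn(5, 2)   # design matrix, one row per sample
    y = torch.randn(5, 1)   # targets

    # 1) Explicit inverse of the normal equations (torch.inverse).
    w_inverse = torch.inverse(X.T @ X) @ X.T @ y

    # 2) Solve the normal equations as a linear system
    #    (torch.linalg.solve; the deprecated torch.solve used a different argument order).
    w_solve = torch.linalg.solve(X.T @ X, X.T @ y)

    # 3) Least squares directly
    #    (torch.linalg.lstsq replaces the deprecated torch.lstsq).
    w_lstsq = torch.linalg.lstsq(X, y).solution

    print(w_inverse.squeeze(), w_solve.squeeze(), w_lstsq.squeeze())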
(e) [6 pt] We are now given a dataset D′ = {(−2, −2), (−1, 0), (1, 1), (2, 1)} of pairs
    (x, y) with x ∈ R and y ∈ R. We want to fit a quadratic model ŷ = w_2 x^2 + w_1 x + w_0
    using (2). Specify the dimensions of the matrix X and the vector y. Also write down
    the matrix and the vector explicitly using the values in D′. Find the optimal
    solution w∗ and draw it together with the dataset into a plot.
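As a hint for the implementation, the following sketch shows one way to assemble a quadratic design matrix in PyTorch (on hypothetical values, not D′):

    import torch

    # Hypothetical inputs (NOT the dataset D′ from the problem).
    x = torch.tensor([-1.5, 0.0, 0.5, 2.0])
    y = torch.tensor([ 2.0, 1.0, 0.5, 3.0])

    # Design matrix with columns [x^2, x, 1] for the model y_hat = w_2 x^2 + w_1 x + w_0.
    X = torch.stack([x**2, x, torch.ones_like(x)], dim=1)

    # Least-squares fit, as in (2).
    w = torch.linalg.lstsq(X, y.unsqueeze(1)).solution
    print(X)
    print(w.squeeze())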
(f) [2 pt] Specify y and X in LinearRegression2.py to verify your answer for
Problem 1e.
2. [Binary Logistic Regression] We are given a dataset D = {(−2, −1),(1, 1),(2, 1)}
containing three pairs (x, y), where each x ∈ R denotes a real-valued point and y ∈
{−1, +1} is the point’s class label.
Assuming the samples in the dataset D are i.i.d. and using maximum likelihood, we want
to train a logistic regression model parameterized by w = (w_1, w_2)^T ∈ R^2:

    p(y \mid x) = \frac{1}{1 + \exp\left( -y \, w^\top \begin{bmatrix} x \\ 1 \end{bmatrix} \right)}    (3)
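Note that with the logistic sigmoid σ(z) = 1/(1 + e^{−z}), the model (3) can equivalently be written as

    p(y = +1 \mid x) = \sigma\!\left( w^\top \begin{bmatrix} x \\ 1 \end{bmatrix} \right),
    \qquad
    p(y = -1 \mid x) = 1 - p(y = +1 \mid x).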
(a) [1pt] Instead of maximizing the likelihood, we commonly minimize the negative
    log-likelihood. Write down the minimization problem for the model given in (3)
    (don't plug in the instances of D; in other words, write an optimization problem
    in the style of (1)).
(b) [3pt] Compute the derivative of the negative log-likelihood objective which you
    specified in Problem 2a (don't plug in the instances of D). Sketch a simple
    gradient-descent algorithm using pseudo-code (use w for the parameters, α for
    the learning rate, f for the objective function, and g = ∇_w f for the gradient).
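For reference, a generic gradient-descent loop has the following shape (a minimal sketch on a toy objective, not the negative log-likelihood from Problem 2a; the fixed iteration count stands in for whatever stopping rule you choose):

    import torch

    def gradient_descent(grad_f, w0, alpha=0.1, num_iters=100):
        """Generic gradient descent: repeat w <- w - alpha * g with g = grad_f(w)."""
        w = w0.clone()
        for _ in range(num_iters):
            g = grad_f(w)       # gradient of the objective at the current parameters
            w = w - alpha * g   # step against the gradient
        return w

    # Hypothetical usage on the toy objective f(w) = 0.5 * ||w - c||^2, whose gradient is w - c.
    c = torch.tensor([1.0, -2.0])
    w_star = gradient_descent(lambda w: w - c, torch.zeros(2))
    print(w_star)   # approaches c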
(c) [5pt] Implement the algorithm by completing LogisticRegression.py. What
is the optimal solution w∗
that your program found?
(d) [3pt] If we had a fourth data point (100, 1), would it influence the maximum-likelihood
    estimate w much? What if we had used linear regression instead of logistic regression
    to fit D? Give a reason for your answer.
(e) [3pt] Instead of manually deriving and implementing the gradient, we now want to
    take advantage of PyTorch auto-differentiation. Investigate LogisticRegression2.py
    and complete the update step using the instance named optimizer. What code did you
    add? If you compare the result of LogisticRegression.py with that of
    LogisticRegression2.py after an equal number of iterations, what do you observe?
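The update step follows the standard PyTorch pattern sketched below (the parameter, optimizer, and stand-in loss here are hypothetical; LogisticRegression2.py defines its own objects):

    import torch

    # Hypothetical setup; LogisticRegression2.py defines its own parameters, data, and optimizer.
    w = torch.zeros(2, requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=0.1)

    for _ in range(100):
        loss = ((w - torch.tensor([1.0, -1.0])) ** 2).sum()  # stand-in objective, not the NLL from 2a
        optimizer.zero_grad()   # clear gradients accumulated in the previous iteration
        loss.backward()         # autograd fills w.grad with d(loss)/dw
        optimizer.step()        # apply the update (w <- w - lr * w.grad for plain SGD)

    print(w)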
(f) [5pt] Instead of manually implementing the cost function, we now want to take advantage
    of the loss functions available in PyTorch, specifically torch.nn.BCEWithLogitsLoss,
    which expects targets y ∈ {0, 1}. Consequently, you need to translate the dataset D
    with y ∈ {−1, 1} into a dataset D′ with y ∈ {0, 1}. Write down the probabilities
    p(y = 1 | x), p(y = 0 | x), and p(y | x) if we use torch.nn.BCEWithLogitsLoss.
    Complete LogisticRegression3.py and compare the result obtained after 100 iterations
    to the results of the previous parts.
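A minimal sketch of calling torch.nn.BCEWithLogitsLoss on {0, 1} targets (the logits and targets below are hypothetical placeholders; LogisticRegression3.py supplies the real ones):

    import torch

    criterion = torch.nn.BCEWithLogitsLoss()   # combines a sigmoid with binary cross-entropy

    # Hypothetical placeholders, not the translated dataset D′.
    logits = torch.tensor([-1.3, 0.2, 2.5])    # raw scores w^T [x, 1]^T
    targets = torch.tensor([0.0, 1.0, 1.0])    # float labels already translated to {0, 1}

    loss = criterion(logits, targets)
    print(loss)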
3. [Support Vector Machine] We are given the following dataset D of four samples:
       x1   x2   label
        1    0     +
        0    1     +
        0    0     −
       -1    0     −
Consider the binary classification problem of learning the decision boundary. We will
use the hard-margin SVM to find the max-margin hyperplane:

    \min_{w, b} \; \frac{1}{2} \lVert w \rVert_2^2    (4)

    \text{s.t.} \quad y^{(i)} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} + b \right) \ge 1 \quad \forall i \in \{1, \dots, 4\}    (5)
(a) [5pt] Draw the dataset in R^2 using circles (◦) for the points belonging to the
    positive label (+) and crosses (×) for the points belonging to the negative label (−).
    Find the support vectors by inspection and mark them with stars. How many support
    vectors are there?
(b) [5pt] Using your drawing in Problem 3a, compute the values of w_1, w_2, and b for the
    hard-margin SVM in (4) and (5). (Hint: consider the dual problem and use the support
    vectors that you found in the previous part.)
(c) The L1 soft-margin SVM formulation is defined as:

        \min_{w, b, \zeta} \; \frac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i    (6)

        \text{s.t.} \quad y^{(i)} \left( w^\top x^{(i)} + b \right) \ge 1 - \zeta_i    (7)

        \zeta_i \ge 0 \quad \forall i \in \{1, \dots, N\}    (8)

    Write the Lagrangian of the L1 soft-margin SVM using Lagrange multipliers α_i for the
    constraints (7) and β_i for the constraints (8). (Do not plug in D.)
(d) [3pt] Write the KKT stationarity condition for the soft-margin SVM in Problem 3c.
(e) [3pt] Using the KKT stationarity condition obtained in Problem 3d, write the dual of
    the soft-margin SVM as an optimization problem with respect to the α_i.