Description

5/5 - (4 votes)

• You can work in groups of 1-3 people on the assignments, but please only hand in one assignment. • Place the names and student numbers of all group members on the ﬁrst page, and submit all answers as a single PDF ﬁle to handin. • Organize your PDF ﬁle sequentially according to the section numbers in this document. Place answers (including code/ﬁgures) in the appropriate section. • You will lose marks if answers are unclear or if the TA can’t easily ﬁnd the answers. • At the start of each question, acknowledge sources of help from outside of class/textbook (classmates outside the group, online material, papers, etc.). • All Sections are equally weighted, except for Section 0. • The code and data referred to in the assignment is available in a1.zip. • Any modiﬁcations/updates/clariﬁcations after the assignment is ﬁrst put online will be marked in red. • Some of the questions require a bit of work, but no individual question is intended to take a lot of time. If you are really stuck, try coming up with a good question and posting on Piazza (other people are likely stuck on the same issue).
0 Course Prerequisite Form
Graduate students in CPSC or EECE must submit the prerequisite form: https://www.cs.ubc.ca/~schmidtm/Courses/540_prereqs.pdf Hand this in at the start of one of the ﬁrst 3 lectures, or at the start of the ﬁrst tutorial on January 12th. Students who don’t submit the form may be dropped from the course.
1 Fundamentals
The purpose of this question is to give you practice using the mathematical and coding notation that we will adopt in the course.
1.1 Matrix Notation
For this question we’ll use the following Householder-like notation:
1. α is a scalar.
2. w, a, and b are d by 1 column-vectors.
1
3. y and v are n by 1 column-vectors (with elements yi and vi). 4. A is a d by d matrix, not necessarily symmetric (with elements aij).
5. V is a diagonal matrix with v along the diagonal.
6. B is a diagonal matrix with b along the diagonal. 7. X is a n by d matrix (with rows (xi)T). Express the gradient ∇f(w) and Hessian ∇2f(w) of the following functions in matrix notation, simplifying as much as possible.
1. The linear function
f(w) = wTa + α +
d X j=1
wjaj.
2. The linear function
f(w) = aTw + aTAw + wTATb.
3. The quadratic function
f(w) = wTw + wTXTXw +
d X i=1
d X j=1
wiwjaij.
4. L2-regularized weighted least squares,
f(w) =
1 2
n X i=1
vi(wTxi −yi)2 + λ 2kwk2.
5. Weighted L2-regularized probit regression,
f(w) = −
n X i=1
logp(yi|xiw) + 1 2
d X j=1
bjw2 j.
where yi ∈{−1,+1} and the likelihood of a single example i is given by p(yi|xi,w) = Φ(yiwTxi). where Φ is the cumulative distribution function (CDF) of the standard normal distribution.
Hint: You can use the results we showed in class to simplify the derivations. You can use 0 to represent the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the fourth example to matrix notation ﬁrst. For the ﬁfth example, it is useful to deﬁne a vector c containing the CDF Φ(yiwTxi) as element ci and a vector p containing the corresponding PDF as element pi. For the ﬁfth one you’ll need to deﬁne new vectors to express the gradient and Hessian in matrix notation (and remember the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right dimension.
1.2 Regularization and Cross-Validation
Download a1.zip from the course webpage, and start Matlab in a directory containing the extracted ﬁles. If you run the script example nonLinear, it will:
2
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the test error.
4. Draw a ﬁgure showing the training/testing data and what the model looks like.
Unfortunately, this is not a great model of the data, and the ﬁgure shows that a linear model is probably not suitable.
1. Write a function called leastSquaresRBFL2 that implements least squares using Gaussian radial basis functions (RBFs) and L2-regularization. You should start from the leastSquares function and use the same conventions: n refers to the number of training examples, d refers to the number of features, X refers to the data matrix, y refers to the targets, Z refers to the data matrix after the change of basis, and so on. Note that you’ll have to add two additional input arguments (λ for the regularization parameter and σ for the Gaussian RBF variance) compared to the leastSquares function. To make your code easier to understand/debug, you may want to deﬁne a new function rbfBasis which computes the Gaussian RBFs for a given training set, testing set, and σ value. Hand in your function and the plot generated with λ = 1 and σ = 1.
2. When dealing with larger datasets, an important issue is the dependence of the computational cost on the number of training examples n and the number of features d. What is the cost in big-O notation of training the model on n training examples with d features under (a) the linear basis, and (b) Gaussian RBFs (for a ﬁxed σ)? What is the cost of classifying t new examples under these two bases? Assume that multiplication by an n by d matrix costs O(nd) and that inverting a d by d linear system costs O(d3).
3. Modify the training/validation procedure to use 10-fold cross-validation on the training set to select λ and σ. Hand in your cross-validation procedure and the plot you obtain with the best values of λ and σ
Note: If you ﬁnd that calculating the Euclidean distances between all pairs of points takes too long, the following code will form a matrix containing the squared Euclidean distances between all training and test points:
[n,d] = size(X); [t,d] = size(Xtest); D = X.^2*ones(d,t) + ones(n,d)*(Xtest’).^2 – 2*X*Xtest’;
Element D(i,j) gives the squared Euclidean distance between training point i and testing point j.
1.3 MAP Estimation
In class, we showed that under the assumptions yi ∼N(wTxi,1), wj ∼N0, 1 λ, the MAP estimate is equivalent to solving the L2-regularized least squares problem
f(w) =
1 2kXw−yk2 + λ 2kwk2,
in the “loss plus regularizer” framework. For each of the alternate assumptions below, write it in the “loss plus regularizer” framework (simplifying as much as possible, including converting to matrix notation):
3
1. Laplace distribution likelihoods and priors, yi ∼L(wTxi,1), wj ∼L0, 1 λ.
2. Gaussians with separate variance for each training example and variable,
yi ∼N(wTxi,σ2 i ), wj ∼N0, 1 λj.
3. Poisson-distributed likelihood (for the case where yi represents discrete counts) and Gaussian prior, yi ∼P(exp(wTxi)), wj ∼N0, 1 λ,
2 Convex Functions
2.1 Minimizing Strictly-Convex Quadratic Functions
Solve for the minimizer of the below strictly-convex quadratic functions: 1. f(w) = 1 2kw−vk2 (projection of v onto real space). 2. f(w) = 1 2kXw−yk2 + 1 2wTΛw (least squares with quadratic-norm regularization). 3. f(w) = 1 2Pn i=1 vi(wTxi −yi)2 + λ 2kw−w0k2 (weighted least squares shrunk towards non-zero w0). Above we assume that v and w0 are d by 1 vectors, and Λ is a d by d positive-deﬁnite matrix. You can use V as a diagonal matrix containing the vi values along the diagonal.
2.2 Proving Convexity
Show that the following functions are convex using one of the deﬁnitions of convexity (without using the “operations that preserve convexity” or using convexity results stated in class): 1. Negative logarithm f(w) = −log(aw) w 0 2. Quadratic with positive semi-deﬁnite A f(w) = 1 2wTAw + bTw + γ w ∈Rd,A 03. Any norm f(w) = kwkp w ∈Rd4. Logistic regression f(w) =Pn i=1 log(1 + exp(−yiwTxi)) w ∈Rd Hint: Norms are not diﬀerentiable in general, so you cannot use the Hessian for the third one. For the last one, you can use the Hessian structure we derived in class.
Use the results above and from class, along with the operations that preserve convexity, to show that the following functions are convex (with λ ≥ 0): 5. f(w) = kXw−ykp + λkAwkq (regularized regression with arbitrary p-norm and weighted q-norm) 6. f(w) =PN i=1 max{0,|wTxi −yi|−}+ λ 2kwk2 2 (support vector regression) 7. f(x) = maxijk{|xi|+|xj|+|xk|} (3 largest-magnitude elements)
4
2.3 Robust Regression
The script example outliers loads a one-dimensional regression dataset that has a non-trivial number of ‘outlier’ data points. These points do not ﬁt the general trend of the rest of the data, and pull the least squares model away from the main cluster of points. One way to improve the performance in this setting is simply to remove or downweight the outliers. However, in high-dimensions it may be diﬃcult to determine whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable to replace the squared error with an error that is more robust to outliers.
1. Write a new function, robustRegression(X,y), that adds a bias variable and ﬁts a linear regression model by minimizing the absolute error instead of the square error, f(w) = kXw−yk1. You should turn this into a linear program as shown in class, and you can solve this linear program using Matlab’s linprog function. Hand in the new function and report the minimum absolute training error that is possible on this dataset.
2. There have been several attempts to adapt SVMs to the regression problem. The most common method for support vector regression uses what iscalled the -insensitive loss,
f(w) =
n X i=1
max{0,|wTxi −yi|−}.
Here, is a parameter and notice that the model only penalizes errors larger than . Show how to write minimizing this objective function as a linear program.
3. Write a new function, svRegression(X,y,epsilon), that minimizes the -insensitive objective. Hand in the new function and report the absolute training error achieved with = 1..
3 Numerical Optimization
3.1 Gradient Descent and Newton’s Method
The function example gradient loads a simple binary classiﬁcation dataset, and ﬁts an `2-regularized logistic regression model to it using the ﬁndMin function which implements a simple gradient descent algorithm. The ﬁndMin function is generic in the sense that it only needs an anonymous function which computes the objective value and gradient given a parameter vector. On each iteration ﬁndMin uses a backtracking line-search to ﬁnd a step-size α that satisﬁes the Armijo “suﬃcient decrease” condition. It always tries α = 1 ﬁrst, and whenever the condition is not satisﬁed it uses “cubic Hermite polynomial” interpolation to ﬁnd a smaller value of α to try. It continues running the algorithm until a pre-speciﬁed number of iterations are reached or until the norm of the gradient is suﬃciently small. On this dataset, the method only requires 9 iterations before it satisﬁes its optimality condition, although it backtracks 13 times and evaluates the function and gradient a total of 23 times.
Report the eﬀect on performance (in terms of number of backtracking iterations and total number of iterations) of making the following changes to ﬁndMin:
1. When backtracking, replacing the cubic-Hermite interpolation with the simpler strategy of dividing α in half (as suggested in the Boyd & Vandenberghe book).
2. Instead of resetting α to one after the line-search, set it using the Barzilai-Borwein step-size, given by
α ←−α
vT∇f(w) vTv
,
5
where v is the new gradient value minus the old gradient value.
3. Fix the step-size α to 1/L, where L is given by
L =
1 4
max{eig(XTX)}+ λ,
which is the Lipschitz constant of the gradient.
4. Instead of using the gradient direction, set d to the Newton direction which is given by d = [∇2f(w)]−1∇f(w).
For the Newton direction, you’ll need to make a new objective function that returns the Hessian in addition to the function and gradient, and modify ﬁndMin to use the Hessian.
3.2 Hessian-Free Newton
The dataset used in the previous question leads to an easy optimization problem. If you apply the variations in the previous question to the larger dataset contained in rcv1 train binary.mat then the performance is very diﬀerent: 1. Without modiﬁcations, ﬁndMin doesn’t work since the objective overﬂows.1
2. Step-size halving avoids the overﬂow, but the method doesn’t reaches a reasonable accuracy even after 500 iterations.
3. With the Barzilai-Borwein step-size, it reaches a solution in a reasonable number of iterations. 4. Computing L is slower than solving the original problem because d is so large.2
5. Computing the Hessian is slower than solving the original problem because d is so large.
In Hessian-free Newton methods, we compute an approximate Newton direction d without ever forming the Hessian. The standard way to do this is to use conjugate gradient to solve for d, which only requires Hessian-vector products of the form ∇2f(w)v for a vector v. Note that Hessian-vector products can always be computed at a similar cost to computing the gradient. For example, for logistic regression we can use ∇2f(w)v = XTDXv = XT(D(Xv)), and the order of operations leads to a cost of O(nd). This is cheaper than the O(nd2) cost of forming the Hessian. Use Matlab’s pcg function to implement a “Hessian-free” Newton’s method, where you use conjugate gradient to solve the Newton system. Report the output of ﬁndMin on rcv1 train binary.mat when using this strategy and using optTol as the tolerance for pcg.
Hint: In logisticL2, deﬁne a function that computes the Hessian-vector product and has the following header:
function [Hv] = Hvfunc(w,v,X,y,lambda)
To deﬁne the AFUN argument needed by pcg (which must be a function of v for a ﬁxed w) you can use an anonymous function:
Hv = @(v)Hvfunc(w,v,varargin{:});
1There are two standard solutions to this problem: we could detect the overﬂow and backtrack out of regions where it overﬂows, or we could use the log-sum-exp trick to evaluate the objective without overﬂowing. 2An often-faster way to compute the largest eigenvalue is the “power method”.
6
3.3 Multi-Class Logistic Regression
The function example multiClass loads a multi-class classiﬁcation dataset and ﬁts a “one-vs-all” logistic regression classiﬁer, then reports the validation error and shows a plot of the data/classiﬁer. The performance on the validation set is ok, but could be much better. For example, this classiﬁer never even predicts some of the classes.
Using a one-vs-all classiﬁer hurts performance because the classiﬁers are ﬁt independently, so there is no attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the softmax probability, p(yi|W,xi) = exp(wT yixi) Pk c=1 exp(wT c xi) . Here c is a possible label and wc is column c of W. Similarly, yi is the training label, wyi is column yi of W. The loss function corresponding to the negative logarithm of the softmax probability is given by
f(W) =
n X i=1″−wT yixi + log k X c0=1
exp(wT c0xi)!#.
Make a new function, softmaxClassiﬁer, which ﬁts W using the softmax loss from the previous section instead of ﬁtting k independent classiﬁers. Hand in the code and report the validation error.
Hint: you can use the derivativeCheck function to help you debug the gradient of the softmax loss. It can also help you numerically check your answer to several more of this assignment’s questions.

CPSC 540 Assignment 1 Fundamentals, Convex Functions, Numerical Optimization solved

Download Details:

Description

CPSC 540 Assignment 1 Fundamentals, Convex Functions, Numerical Optimization solved

Download Details:

Description

Related products

CPSC 540 Assignment 3 Density Estimation and Project Proposal solved

CPSC 540 Assignment 5 UGMs, Bayes, and Literature Survey solved

CPSC 540 Assignment 2 Large-Scale Machine Learning solved