CPSC 540 Assignment 1  Fundamentals, Convex Functions, Numerical Optimization solved

$24.99

Original Work ?

Download Details:

  • Name: assn1.zip
  • Type: zip
  • Size: 14.08 MB

Category: You will Instantly receive a download link upon Payment||Click Original Work Button for Custom work

Description

5/5 - (4 votes)

• You can work in groups of 1-3 people on the assignments, but please only hand in one assignment. • Place the names and student numbers of all group members on the first page, and submit all answers as a single PDF file to handin. • Organize your PDF file sequentially according to the section numbers in this document. Place answers (including code/figures) in the appropriate section. • You will lose marks if answers are unclear or if the TA can’t easily find the answers. • At the start of each question, acknowledge sources of help from outside of class/textbook (classmates outside the group, online material, papers, etc.). • All Sections are equally weighted, except for Section 0. • The code and data referred to in the assignment is available in a1.zip. • Any modifications/updates/clarifications after the assignment is first put online will be marked in red. • Some of the questions require a bit of work, but no individual question is intended to take a lot of time. If you are really stuck, try coming up with a good question and posting on Piazza (other people are likely stuck on the same issue).
0 Course Prerequisite Form
Graduate students in CPSC or EECE must submit the prerequisite form: https://www.cs.ubc.ca/~schmidtm/Courses/540_prereqs.pdf Hand this in at the start of one of the first 3 lectures, or at the start of the first tutorial on January 12th. Students who don’t submit the form may be dropped from the course.
1 Fundamentals
The purpose of this question is to give you practice using the mathematical and coding notation that we will adopt in the course.
1.1 Matrix Notation
For this question we’ll use the following Householder-like notation:
1. α is a scalar.
2. w, a, and b are d by 1 column-vectors.
1
3. y and v are n by 1 column-vectors (with elements yi and vi). 4. A is a d by d matrix, not necessarily symmetric (with elements aij).
5. V is a diagonal matrix with v along the diagonal.
6. B is a diagonal matrix with b along the diagonal. 7. X is a n by d matrix (with rows (xi)T). Express the gradient ∇f(w) and Hessian ∇2f(w) of the following functions in matrix notation, simplifying as much as possible.
1. The linear function
f(w) = wTa + α +
d X j=1
wjaj.
2. The linear function
f(w) = aTw + aTAw + wTATb.
3. The quadratic function
f(w) = wTw + wTXTXw +
d X i=1
d X j=1
wiwjaij.
4. L2-regularized weighted least squares,
f(w) =
1 2
n X i=1
vi(wTxi −yi)2 + λ 2kwk2.
5. Weighted L2-regularized probit regression,
f(w) = −
n X i=1
logp(yi|xiw) + 1 2
d X j=1
bjw2 j.
where yi ∈{−1,+1} and the likelihood of a single example i is given by p(yi|xi,w) = Φ(yiwTxi). where Φ is the cumulative distribution function (CDF) of the standard normal distribution.
Hint: You can use the results we showed in class to simplify the derivations. You can use 0 to represent the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the fourth example to matrix notation first. For the fifth example, it is useful to define a vector c containing the CDF Φ(yiwTxi) as element ci and a vector p containing the corresponding PDF as element pi. For the fifth one you’ll need to define new vectors to express the gradient and Hessian in matrix notation (and remember the relationship between the PDF and CDF). As a sanity check, make sure that your results have the right dimension.
1.2 Regularization and Cross-Validation
Download a1.zip from the course webpage, and start Matlab in a directory containing the extracted files. If you run the script example nonLinear, it will:
2
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the test error.
4. Draw a figure showing the training/testing data and what the model looks like.
Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably not suitable.
1. Write a function called leastSquaresRBFL2 that implements least squares using Gaussian radial basis functions (RBFs) and L2-regularization. You should start from the leastSquares function and use the same conventions: n refers to the number of training examples, d refers to the number of features, X refers to the data matrix, y refers to the targets, Z refers to the data matrix after the change of basis, and so on. Note that you’ll have to add two additional input arguments (λ for the regularization parameter and σ for the Gaussian RBF variance) compared to the leastSquares function. To make your code easier to understand/debug, you may want to define a new function rbfBasis which computes the Gaussian RBFs for a given training set, testing set, and σ value. Hand in your function and the plot generated with λ = 1 and σ = 1.
2. When dealing with larger datasets, an important issue is the dependence of the computational cost on the number of training examples n and the number of features d. What is the cost in big-O notation of training the model on n training examples with d features under (a) the linear basis, and (b) Gaussian RBFs (for a fixed σ)? What is the cost of classifying t new examples under these two bases? Assume that multiplication by an n by d matrix costs O(nd) and that inverting a d by d linear system costs O(d3).
3. Modify the training/validation procedure to use 10-fold cross-validation on the training set to select λ and σ. Hand in your cross-validation procedure and the plot you obtain with the best values of λ and σ
Note: If you find that calculating the Euclidean distances between all pairs of points takes too long, the following code will form a matrix containing the squared Euclidean distances between all training and test points:
[n,d] = size(X); [t,d] = size(Xtest); D = X.^2*ones(d,t) + ones(n,d)*(Xtest’).^2 – 2*X*Xtest’;
Element D(i,j) gives the squared Euclidean distance between training point i and testing point j.
1.3 MAP Estimation
In class, we showed that under the assumptions yi ∼N(wTxi,1), wj ∼N0, 1 λ, the MAP estimate is equivalent to solving the L2-regularized least squares problem
f(w) =
1 2kXw−yk2 + λ 2kwk2,
in the “loss plus regularizer” framework. For each of the alternate assumptions below, write it in the “loss plus regularizer” framework (simplifying as much as possible, including converting to matrix notation):
3
1. Laplace distribution likelihoods and priors, yi ∼L(wTxi,1), wj ∼L0, 1 λ.
2. Gaussians with separate variance for each training example and variable,
yi ∼N(wTxi,σ2 i ), wj ∼N0, 1 λj.
3. Poisson-distributed likelihood (for the case where yi represents discrete counts) and Gaussian prior, yi ∼P(exp(wTxi)), wj ∼N0, 1 λ,
2 Convex Functions
2.1 Minimizing Strictly-Convex Quadratic Functions
Solve for the minimizer of the below strictly-convex quadratic functions: 1. f(w) = 1 2kw−vk2 (projection of v onto real space). 2. f(w) = 1 2kXw−yk2 + 1 2wTΛw (least squares with quadratic-norm regularization). 3. f(w) = 1 2Pn i=1 vi(wTxi −yi)2 + λ 2kw−w0k2 (weighted least squares shrunk towards non-zero w0). Above we assume that v and w0 are d by 1 vectors, and Λ is a d by d positive-definite matrix. You can use V as a diagonal matrix containing the vi values along the diagonal.
2.2 Proving Convexity
Show that the following functions are convex using one of the definitions of convexity (without using the “operations that preserve convexity” or using convexity results stated in class): 1. Negative logarithm f(w) = −log(aw) w 0 2. Quadratic with positive semi-definite A f(w) = 1 2wTAw + bTw + γ w ∈Rd,A  03. Any norm f(w) = kwkp w ∈Rd4. Logistic regression f(w) =Pn i=1 log(1 + exp(−yiwTxi)) w ∈Rd Hint: Norms are not differentiable in general, so you cannot use the Hessian for the third one. For the last one, you can use the Hessian structure we derived in class.
Use the results above and from class, along with the operations that preserve convexity, to show that the following functions are convex (with λ ≥ 0): 5. f(w) = kXw−ykp + λkAwkq (regularized regression with arbitrary p-norm and weighted q-norm) 6. f(w) =PN i=1 max{0,|wTxi −yi|−}+ λ 2kwk2 2 (support vector regression) 7. f(x) = maxijk{|xi|+|xj|+|xk|} (3 largest-magnitude elements)
4
2.3 Robust Regression
The script example outliers loads a one-dimensional regression dataset that has a non-trivial number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main cluster of points. One way to improve the performance in this setting is simply to remove or downweight the outliers. However, in high-dimensions it may be difficult to determine whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable to replace the squared error with an error that is more robust to outliers.
1. Write a new function, robustRegression(X,y), that adds a bias variable and fits a linear regression model by minimizing the absolute error instead of the square error, f(w) = kXw−yk1. You should turn this into a linear program as shown in class, and you can solve this linear program using Matlab’s linprog function. Hand in the new function and report the minimum absolute training error that is possible on this dataset.
2. There have been several attempts to adapt SVMs to the regression problem. The most common method for support vector regression uses what iscalled the -insensitive loss,
f(w) =
n X i=1
max{0,|wTxi −yi|−}.
Here,  is a parameter and notice that the model only penalizes errors larger than . Show how to write minimizing this objective function as a linear program.
3. Write a new function, svRegression(X,y,epsilon), that minimizes the -insensitive objective. Hand in the new function and report the absolute training error achieved with  = 1..
3 Numerical Optimization
3.1 Gradient Descent and Newton’s Method
The function example gradient loads a simple binary classification dataset, and fits an `2-regularized logistic regression model to it using the findMin function which implements a simple gradient descent algorithm. The findMin function is generic in the sense that it only needs an anonymous function which computes the objective value and gradient given a parameter vector. On each iteration findMin uses a backtracking line-search to find a step-size α that satisfies the Armijo “sufficient decrease” condition. It always tries α = 1 first, and whenever the condition is not satisfied it uses “cubic Hermite polynomial” interpolation to find a smaller value of α to try. It continues running the algorithm until a pre-specified number of iterations are reached or until the norm of the gradient is sufficiently small. On this dataset, the method only requires 9 iterations before it satisfies its optimality condition, although it backtracks 13 times and evaluates the function and gradient a total of 23 times.
Report the effect on performance (in terms of number of backtracking iterations and total number of iterations) of making the following changes to findMin:
1. When backtracking, replacing the cubic-Hermite interpolation with the simpler strategy of dividing α in half (as suggested in the Boyd & Vandenberghe book).
2. Instead of resetting α to one after the line-search, set it using the Barzilai-Borwein step-size, given by
α ←−α
vT∇f(w) vTv
,
5
where v is the new gradient value minus the old gradient value.
3. Fix the step-size α to 1/L, where L is given by
L =
1 4
max{eig(XTX)}+ λ,
which is the Lipschitz constant of the gradient.
4. Instead of using the gradient direction, set d to the Newton direction which is given by d = [∇2f(w)]−1∇f(w).
For the Newton direction, you’ll need to make a new objective function that returns the Hessian in addition to the function and gradient, and modify findMin to use the Hessian.
3.2 Hessian-Free Newton
The dataset used in the previous question leads to an easy optimization problem. If you apply the variations in the previous question to the larger dataset contained in rcv1 train binary.mat then the performance is very different: 1. Without modifications, findMin doesn’t work since the objective overflows.1
2. Step-size halving avoids the overflow, but the method doesn’t reaches a reasonable accuracy even after 500 iterations.
3. With the Barzilai-Borwein step-size, it reaches a solution in a reasonable number of iterations. 4. Computing L is slower than solving the original problem because d is so large.2
5. Computing the Hessian is slower than solving the original problem because d is so large.
In Hessian-free Newton methods, we compute an approximate Newton direction d without ever forming the Hessian. The standard way to do this is to use conjugate gradient to solve for d, which only requires Hessian-vector products of the form ∇2f(w)v for a vector v. Note that Hessian-vector products can always be computed at a similar cost to computing the gradient. For example, for logistic regression we can use ∇2f(w)v = XTDXv = XT(D(Xv)), and the order of operations leads to a cost of O(nd). This is cheaper than the O(nd2) cost of forming the Hessian. Use Matlab’s pcg function to implement a “Hessian-free” Newton’s method, where you use conjugate gradient to solve the Newton system. Report the output of findMin on rcv1 train binary.mat when using this strategy and using optTol as the tolerance for pcg.
Hint: In logisticL2, define a function that computes the Hessian-vector product and has the following header:
function [Hv] = Hvfunc(w,v,X,y,lambda)
To define the AFUN argument needed by pcg (which must be a function of v for a fixed w) you can use an anonymous function:
Hv = @(v)Hvfunc(w,v,varargin{:});
1There are two standard solutions to this problem: we could detect the overflow and backtrack out of regions where it overflows, or we could use the log-sum-exp trick to evaluate the objective without overflowing. 2An often-faster way to compute the largest eigenvalue is the “power method”.
6
3.3 Multi-Class Logistic Regression
The function example multiClass loads a multi-class classification dataset and fits a “one-vs-all” logistic regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts some of the classes.
Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the softmax probability, p(yi|W,xi) = exp(wT yixi) Pk c=1 exp(wT c xi) . Here c is a possible label and wc is column c of W. Similarly, yi is the training label, wyi is column yi of W. The loss function corresponding to the negative logarithm of the softmax probability is given by
f(W) =
n X i=1″−wT yixi + log k X c0=1
exp(wT c0xi)!#.
Make a new function, softmaxClassifier, which fits W using the softmax loss from the previous section instead of fitting k independent classifiers. Hand in the code and report the validation error.
Hint: you can use the derivativeCheck function to help you debug the gradient of the softmax loss. It can also help you numerically check your answer to several more of this assignment’s questions.