## Description

CSCI 4100/6100 RPI Machine Learning From Data

ASSIGNMENT 12

LFD is the class textbook

1. (300) Neural Networks and Backpropagation

Write a program to implement gradient descent for a 2 input (d(0) = 2), m-hidden unit (d(1) = m), 1 output sigmoidal neural network (L = 2). For the output node transformation, allow for both S(s) = s and θ(s) = tanh(s). Implement gradient decent on the squared error Ein(w) = 1 4N PN n=1 (h(xn,w) − yn)2 , and check your gradient calculation as follows: (a) Use a network with m = 2. Set all the weights to 0.25 and consider a data set with 1 point: x1 =1 1; y = 1. For both the identity and tanh(·) output node transformation functions, obtain the gradient of Ein(w) using the backpropagation algorithm. Report this result there should be as many numbers as parameters in this network. (b) Now, obtain the gradient numerically by peturbing each weight in turn by 0.0001. Report this result. (Hint: If this result is not similar to your previous result then there is something wrong with your backpropagtion gradient calculation.)

2. (600) Neural Network for Digits

Use your neural network implementation with 10 hidden units to build a classiﬁer for separating the digit ‘1’ from ’not 1’ . Use the two features and data you developed in a previous assignment and the 300 training data points you selected in assignment #9. Randomly initialize the weights to small values. For this problem, train with the linear output transformation function but use the sign threshold on the output node for actually classifying the data.

(a) Plot Ein(w) versus iteration for the variable learning rate gradient descent heuristic and 2 × 106 iterations, with S(x) = x. Show the decision boundary for the resulting classiﬁer. (b) Now use weight decay with λ = 0.01/N and use variable learning rate gradient descent to minimize the augmented error. Show the decision boundary for the resulting classiﬁer. (c) Now use early stopping with a validation set of size 50 and training set of size 250, and show the decision boundary for the resulting classiﬁer that had minimum validation error.

3. (300) Support Vector Machines

For this problem, we will use the data set consisting of the 2 points x1 = (1,0), y1 = +1 and x2 = (−1,0), y2 = −1.

(a) Show that for these 2 data points, the optimal separating hyperplane (with maximum cushion) is just the ‘plane’ that is the perpendicular bisector of the line segment joining the two points. In our case, what is the equation of the optimal hyperplane? (b) Now consider a transformation to a more “complicated” Z–space. The transformation is given by z =z1 z2=x3 1 − x2 x1x2 (1) i. What are the data points in this space?

1

ii. Construct the optimal hyperplane in this space (i.e., what is the equation for the hyperplane in terms of the Z–coordinates). To classify a point in X-space, one ﬁrst transforms the point into Z-space. One then classiﬁes the point in Z-space using the optimal hyperplane in Z-space. (c) Plot (in X–space) the decision boundary for the optimal hyperplane constructed using the data in X–space (from part (a)). On the same plot, plot the decision boundary you would observe in X-space if you classiﬁed X-space points by ﬁrst transforming to Z-space, and then classifying according to the optimal hyperplane constructed using the data in Z–space (this decision boundary will not be a line!). (d) A kernel function, K(x,y), is a function of two vectors in X-space deﬁned by K(x,y) = z(x) · z(y), where z(x) and z(y) are the transformed x and y into Z-space. In other words, the kernel function computes the dot product of the transformed vectors. Give an expression for the kernel function in terms of the components of x and y. (e) Using this kernel function, give an explicit functional form for the classiﬁer in the X-space.

4. (600) SVM with digits data

Implement the non-separable SVM using your two features for the 300 training points you selected in assignment #9. Use the 8th order polynomial kernel,

K(x,x′) = (1 + xtx′)8.

(a) In the non-separable case, you need to choose the value for C > 0 (the ’regularization’ parameter). Show the decision boundary for a small and large choice of C. Use your own judgement to determine small and large. (b) Explain the ‘complexity’ of the decision boundaries you observe in part (a) with respect to the choice of C. (c) Using a grid of values for C between your small and large values, use cross validation to pick a good value of C, one that achieves minimum cross validation error. Show the decision boundary for the resulting classiﬁer and give its test error.

5. (200) Compare Methods: Linear, k-NN, RBF-network, Neural Network, SVM

Compare the ﬁnal test error from your attempts to solve the digits problem: (i) Regularized linear model with 8th order polynomial transform and λ selected by CV. (ii) k-NN rule with k selected by CV. (iii) RBF-network with number of centers selected by CV. (iv) Neural network with early stopping. (v) SVM with 8th order polynomial kernel and C selected by CV. Make some intelligent comments.