CptS 570 Machine Learning Homework #2 solution


Original Work


5/5 - (3 votes)

1 Analytical Part (3 percent grade)
This part will be graded as a PASS or FAIL.
1. Suppose we have n+ positive training examples and n− negative training examples. Let C+
be the center of the positive examples and C− be the center of the negative examples, i.e.,
C+ =
i: yi=+1 xi and C− =
i: yi=−1
. Consider a simple classifier called CLOSE
that classifies a test example x by assigning it to the class whose center is closest.
• Show that the decision boundary of the CLOSE classifier is a linear hyperplane of the
form sign(w · x + b). Compute the values of w and b in terms of C+ and C−.
• Recall that the weight vector can be written as a linear combination of all the training
examples: w =
i=1 αi
· yi
· xi
. Compute the dual weights (α’s). How many of the
training examples are support vectors?
2. Suppose we use the following radial basis function (RBF) kernel: K(xi
, xj ) = exp(−
kxi − xjk
which has some implicit unknown mapping φ(x).
• Prove that the mapping φ(x) corresponding to RBF kernel has infinite dimensions.
• Prove that for any two input examples xi and xj , the squared Euclidean distance of their
corresponding points in the higher-dimensional space defined by φ is less than 2, i.e.,
kφ(xi) − φ(xj )k
2 ≤ 2.
3. The decision boundary of a SVM with a kernel function (via implicit feature mapping φ(.))
is defined as follows:
w · φ(x) + b =
i∈SV yiαiK(xi
, x) + b = f(x; α, b)
, where w and b are parameters of the decision boundary in the feature space phi defined by
the kernel function K, SV is the set of support vectors, and αi
is the dual weight of the i
support vector.
Let us assume that we use the radial basis function (RBF) kernel K(xi
, xj ) = exp(−
kxi − xjk
also assume that the training examples are linearly separable in the feature space φ and SVM
finds a decision boundary that perfectly separates the training examples.
If we choose a testing example xfar that is far away from any training instance xi (distance
here is measured in the original feature space <
). Prove that f(xfar; α, b) ≈ b.
4. The function K(xi
, xj ) = − hxi
, xj i is a valid kernel. Prove or Disprove it.
5. You are provided with n training examples: (x1, y1),(x2, y2), · · · ,(xn, yn,), where xi
is the
input example, yi
is the class label (+1 or -1). The teacher gave you some additional information by specifying the costs for different mistakes C+ and C−, where C+ and C− stand for
the cost of misclassifying a positive and negative example respectively.
a. How will you modify the Soft-margin SVM formulation to be able to leverage this extra
information? Please justify your answer.
6. You are provided with a set of n training examples: (x1, y1),(x2, y2), · · · ,(xn, yn,), where xi
the input example, yi
is the class label (+1 or -1). Suppose n is very large (say in the order of
millions). In this case, standard SVM training algorithms will not scale due to large training
Tom wants to devise a solution based on “Coarse-to-Fine” framework of problem solving.
The basic idea is to cluster the training data; train a SVM classifier based on the clusters
(coarse problem); refine the clusters as needed (fine problem); perform training on the finer
problem; and repeat until convergence. Suppose we start with k+ positive clusters and k−
negative clusters to begin with (a cluster is defined as a set of examples). Please specify
the mathematical formulation (define all the variables used in your formulation) and concrete
algorithm for each of the following steps to instantiate this idea:
a) How to define the SVM training formulation for a given level of coarseness: a set of k+
positive clusters and a set of k− negative clusters?
b) How to refine the clusters based on the resulting SVM classifier?
c) What is the stopping criteria?
Optional question: For what kind of problems will this solution fail?
7. You are provided with a set of n training examples: (x1, y1),(x2, y2), · · · ,(xn, yn,), where xi
is the input example, yi
is the class label (+1 or -1). Suppose n is very large (say in the order
of millions). In this case, online kernelized Perceptron algorithms will not scale if the number
of allowed support vectors are unbounded.
a) Suppose you have trained using kernelized Perceptron algorithm (without any bounds on
support vectors) and got a set of support vectors SV . Tom wants to use this classifier for
real-time prediction and cannot afford more than B kernel evaluations for each classification
decision. Please give an algorithm to select B support vectors from SV . You need to motivate
your design choices in order to convince Tom to use your solution.
b) Tom wants to train using kernelized Perceptron algorithm, but wants to use at most B
support vectors during the training process. Please modify the standard kernelized Perceptron
training algorithm (from class slides) for this new setting. You need to motivate your design
choices in order to convince Tom to use your solution.
2 Programming and Empirical Analysis Part (5 percent grade)
1. Empirical analysis question. You can use a publicly available SVM classifier implementation
(e.g., scikit-learn) for SVM related experiments. scikit-learn (http://scikit-learn.org/
You will use the Fashion MNIST data (https://github.com/zalandoresearch/fashion-mnist).
There is a fixed training and testing set. From training data, use first 80 percent for “training”
and last 20 percent as “validation data”.
Each example is a 28×28 grayscale image, associated with a label from 10 classes: 0 Tshirt/top, 1 Trouser, 2 Pullover, 3 Dress, 4 Coat, 5 Sandal, 6 Shirt, 7 Sneaker, 8 Bag, 9 Ankle
You will use ONLY training data for training and testing data for evaluation.
(a) Using a linear kernel, train the SVM on the training data for different values of C parameter: 10−4
, 10−3
, 10−2
, 10−1
, 100
, 101
, 102
, 103
, 104
. Compute the training accuracy, validation
accuracy, and testing accuracy for the SVM obtained with different values of the C parameter.
Plot the training accuracy, validation accuracy, and testing accuracy as a function of C (C
value on x-axis and Accuracy on y-axis) – one curve each for training, validation, and testing
data. Also, plot the number of support vectors (if applicable for the SVM toolkit you are
using) as a function of C. List your observations.
(b) Select the best value of hyper-parameter C based on the accuracy on validation set and
train a linear SVM on the combined set of training and validation examples. Compute the
testing accuracy and the corresponding confusion matrix: a 10 × 10 matrix.
(c) Repeat the experiment (a) with the best C value from (a) with polynomial kernel of degree
2, 3, and 4. Compare the training, validation, testing accuracies, and the number of support
vectors for differnt kernels (linear, polynomial kernel of degree 2, polynomial kernel of degree
3, and polynomial kernel of degree 4). List your observations.
2. Programming question. You will implement the kernelized Perceptron training algorithm
(discussed in the class) for multi-class classification.
(a) You will use the Fashion MNIST data. Train the kernelized Perceptron classifier for 5
iterations with polynomial kernel (pick the best degree out of 2, 3, and 4 from the above
experiment). Plot the number of mistakes as a function of training iterations. Compute the
training, validation, and testing accuracy at the end of 5 iterations.
3. Programming question. “Breast Cancer” Classifier using Decision Trees. You will use the
following dataset for this question: https://archive.ics.uci.edu/ml/datasets/Breast+
Cancer+Wisconsin+%28Diagnostic%29. You will use the first 70 percent examples for training, next 10 percent examples for validation, and last 20 percent examples for testing.
(a) Implement the ID3 decision tree learning algorithm that we discussed in the class. The
key step in the decision tree learning is choosing the next feature to split on. Implement
the information gain heuristic for selecting the next feature. Please see lecture notes or
https://en.wikipedia.org/wiki/ID3_algorithm for more details. Do the following to select candidate thresholds for continuous features: Sort all candidate values for feature f from
training data. Suppose f1, f2, · · · , fn is the sorted list. The candidate thresholds are chosen
as fi + (fi+1 − fi)/2 for i=1 to n.
(b) Run the decision tree construction algorithm on the training examples. Compute the
accuracy on validation examples and testing examples.
(c) Implement the decision tree pruning algorithm discussed in the class (via validation data).
(d) Run the pruning algorithm on the decision tree constructed using training examples.
Compute the accuracy on validation examples and testing examples. List your observations
by comparing the performance of decision tree with and without pruning.
To debug and test your implementation, you can employ scikit-learn (http://scikit-learn.
3 Instructions for Submission
Please follow the below instructions. If you do not follow them, your homework will not be graded.
We will provide a dropbox folder link for the homework submission.
PDF submission. One PDF file with both answers for analytical part (Part I) and empirical
analysis questions with results/analysis (Part-II). Please label x-axis, y-axis, and name of the graphs
appropriately. Please name this file as WWSUID-LASTNAME.pdf (e.g., 111222-Fern.pdf).
Code submission. You will submit one zip file for your code as per the instructions below. If
your script and/or code does not execute, we will try to give some partial credit by looking at the
overall code contents.
• Mention the programming language and version (e.g., Python 2.5) that you used.
• Submit one folder with name WSUID-LASTNAME.zip (e.g., 111222-Fern.zip) and include a
README file.
• Include a script to run the code and it should be referred in the README file. Please make
sure that your script runs on a standard linux machine.
• Don’t submit the data folder. Assume there is a folder “data” with all the files.
• Output of your programs should be well-formatted in order to answer the empirical analysis
• Structure your code meaningfully and add comments to make it readable.
If you have collaborated or discussed the homework with some student, please provide this information with all the relevant details. If we find that the code of two different students has traces of
plagiarism, both students will be penalized and we will report the academic dishonesty case to graduate school (see https://communitystandards.wsu.edu/policies-and-reporting/academic-integritypolicy/). The bottom line is please DO NOT even think of going this route. It is very unpleasant
to deal with these things for both faculty, TA, and students involved.
4 Grading Rubric
Each question in the students work will be assigned a letter grade of either A,B,C,D, or F by the
Instructor and TAs. This five-point (discrete) scale is described as follows:
• A) Exemplary (=100%).
Solution presented solves the problem stated correctly and meets all requirements of the problem.
Solution is clearly presented.
Assumptions made are reasonable and are explicitly stated in the solution.
Solution represents an elegant and effective way to solve the problem and is not overly complicated than is necessary.
• B) Capable (=75%).
Solution is mostly correct, satisfying most of the above criteria under the exemplary category,
but contains some minor pitfalls, errors/flaws or limitations.
• C) Needs Improvement (=50%).
Solution demonstrates a viable approach toward solving the problem but contains some major
pitfalls, errors/flaws or limitations.
• D) Unsatisfactory (=25%)
Critical elements of the solution are missing or significantly flawed.
Solution does not demonstrate sufficient understanding of the problem and/or any reasonable
directions to solve the problem.
• F) Not attempted (=0%)
No solution provided.
The points on a given homework question will be equal to the percentage assigned (given by the
letter grades shown above) multiplied by the maximum number of possible points worth for that
question. For example, if a question is worth 6 points and the answer is awarded a B grade, then
that implies 4.5 points out of 6.