Description
Problem 1 – Linear Separability 10 points
Consider a dataset with two features x1 and x2 in which the points (−1, −1),(1, 1),(−3, −3),(4, 4) belong to
one class and (−1, 1),(1, −1),(−5, 2),(4, −8) belong to the other.
1. Is this dataset linearly separable ? Can a linear classifier be trained using features x1 and x2 to classify
this data set ? You can plot the dataset points and argue. (2)
2. Can you define a new 1-dimensional representation z in terms of x1 and x2 such that the dataset is
linearly separable in terms of 1-dimensional representation corresponding to z ? (4)
3. What does the separating hyperplane looks like ? (2)
4. Explain the importance of nonlinear transformations in classification problems. (2)
Problem 2 – Bias Variance Tradeoff, Regularization 40 points
1. Derive the bias-variance decomposition for a regression problem, i.e., prove that the expected mean
squared error of a regression problem can be written as
E[MSE] = Bias2 + V ariance + Noise
Hint: Let y(x) = f(x) + be the true (unknown) relationship and ˆy = g(x) be the model predicted
value of y. Then MSE over test instance xi
, i = 1, . . . , t, is given by:
MSE =
1
t
Xt
i=1
(f(xi) + − g(xi))2
(5)
2. Consider the case when y(x) = x + sin(1.5x) + N (0, 0.3), here f(x) = x + sin(1.5x) and = N (0, 0.3).
Create a dataset of size 20 points by randomly generating samples from y. Display the dataset and
f(x). Use scatter plot for y and smooth line plot for f(x). (5)
3. Use weighted sum of polynomials as an estimator function for f(x), in particular, let the form of
estimator function be:
gn(x) = β0 + β1x + β2x
2 + ….. + βnx
n
Consider three candidate estimators, g1, g3, and g10. Estimate the coefficients of each of the three
estimators using the sampled dataset and plot f(x), g1(x), g3(x), g10(x). Which estimator is underfitting
? Which one is overfitting ? (10)
4. Generate 100 datasets (each of size 50) by randomly sampling from y. Partition each dataset into
training and test set (80/20 split). Next fit the estimators of varying complexity, i.e., g1, g2, ….g15 using
the training set for each dataset. Then calculate and display the squared bias, variance, and error
on testing set for each of the estimators showing the tradeoff between bias and variance with model
complexity. Can you identify the best model ? (10)
5. One way to increase model bias is by using regularization. Lets take the order 10 polynomial and
apply L2 regularization. Compare the bias, variance, and MSE of the regularized model with the
unregularized order 10 polynomial model ? Does the regularized model have a higher or lower bias ?
What about MSE ? Explain. (10)
Problem 3 – OpenML, Algorithmic Performance Scaling 25 points
OpenML (https://www.openml.org) has thousands of datasets for classification tasks. Select any 2 datasets
from OpenML with different number of output classes.
1. Summarize the attributes of each dataset: number of features, number of instances, number of classes,
number of numerical features, number of categorical features. (5)
2. For each dataset, select 80% of data as training set and remaining 20% as test set. Generate 10 different
subsets of the training set by randomly subsampling 10%, 20%, . . . , 100% of the training set. Use each
of these subsets to train two different classifiers: Random forest and Gradient boosting. When training
a classifier also measure the wall clock time to train. After each training, evaluate the accuracy of
trained models on the test set.
Report model accuracy and training time for each of the 10 subsets of
the training set. Generate learning curve for each classifier. A learning curve shows how the accuracy
changes with increasing size of training data. Also create a curve showing the training time of classifiers
with increasing size of training data. So, for each dataset you will have two figures: First figure showing
learning curves (x-axis being training data size and y-axis accuracy) for the two classifiers and second
Figure showing training time for the two classifiers as a function of training data size. (15)
3. Study the scaling of training time and accuracy of classifiers with training data size using the two
figures generated in part 2 of the question. Compare the performance of classifiers in terms of training
time and accuracy and write 3 main observations. Which gives better accuracy ? Which has shorter
training time ? (5)
Problem 4 – Precision, Recall, ROC 25 points
This question is based on two papers, one from ICML 2006 and other from NIPS 2015 (details below). ICML
paper talks about the relationship between ROC and Precision-Recall (PR) curves and shows a one-to-one
correspondence between them. NIPS paper introduces Precision-Recall-Gain (PRG) curves. You need to
refer to the two papers to answer the following questions.
1. Does true negative matter for both ROC and PR curve ? Argue why each point on ROC curve
corresponds to a unique point on PR curve ? (5)
2. Select one OpenML dataset with 2 output classes. Use two binary classifiers (Adaboost and Logistic
regression) and create ROC and PR curves for each of them. You will have two figures: one containing
two ROC and other containing two PR curves. Show the point where an all positive classifier lies in
the ROC and PR curves. (10)
3. NIPS paper defined PR Gain curve. Calculate AUROC (Area under ROC), AUPR (Area under PR),
and AUPRG (Area under PRG) for two classifiers and compare. Do you agree with the conclusion of
NIPS paper that practitioners should use PR gain curves rather than PR curves. (10)
Related papers:
• Jesse Davis, Mark Goadrich, The Relationship Between Precision-Recall and ROC Curves, ICML 2006.
https://www.biostat.wisc.edu/ page/rocpr.pdf
• Peter A. Flach and Meelis Kull, Precision-Recall-Gain Curves: PR Analysis Done Right, NIPS 2015.
https://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right