Description
Problem 1 (Non-linear classifiers; 20 points). For this problem, you will need to learn to use
software libraries for at least two of the following non-linear classifier types:
• neural networks;
• boosted decision trees (i.e., boosting, with the "weak learner" being a decision tree learner);
• random forests.
All of these are available in scikit-learn and the MATLAB Statistics and Machine Learning toolbox,
although you may also use other external libraries (e.g., TensorFlow[1] for neural networks, XGBoost[2]
for boosted decision trees). You are welcome to implement learning algorithms for these classifiers
yourself, but this is neither required nor recommended.
Learn two different types of non-linear classifiers from above for the HW3 restaurant review
data, with any feature representation you like. You can use the software libraries mentioned above,
as well as other libraries & tools (or your old code from HW3) to perform the feature processing.
You should try to match (and hopefully surpass) the results obtained by the "modified" Averaged-
Perceptron with bigram features from HW3, which had five-fold cross-validation error rate around
8.8%. You may sub-sample the training data if you like, but note that using more training data
usually improves performance (up to a point).
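For concreteness, a minimal sketch of one possible bigram bag-of-words representation in Python with scikit-learn is given below; the names train_texts and test_texts are placeholders for however you load the HW3 review text (none of this is required, and your HW3 feature code is equally fine).

    # Minimal sketch of a binary unigram+bigram bag-of-words representation.
    # `train_texts` and `test_texts` are assumed to be lists of review strings;
    # adapt the loading step to however the HW3 data is stored.
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
    X_train = vectorizer.fit_transform(train_texts)  # fit the vocabulary on training data only
    X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary on the test data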
For many of these learning algorithms, you will need to set various hyperparameters (including
classifier "architecture", like the number of layers and width of each layer in a neural network,
and the number of trees in a random forest). Often there are defaults that make a good starting
point, but you may need to adjust at least some of them to get good performance. Use hold-out
validation or K-fold cross-validation to do this. Do not make any hyperparameter choices
(or any other similar choices) based on the test set! You should only compute the test error
rates after you have settled on hyperparameter settings and trained your two final classifiers.
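As a rough illustration (one possible workflow, not a prescribed recipe), the sketch below tunes a random forest with 5-fold cross-validation in scikit-learn; it reuses the placeholder X_train and X_test from the earlier sketch, assumes label arrays train_labels and test_labels, and the grid values shown are only starting points.

    # Hypothetical hyperparameter search for a random forest via 5-fold cross-validation.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 300, 1000],   # number of trees
        "max_features": ["sqrt", 0.1],      # features considered per split
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=5,                # 5-fold cross-validation on the training set
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X_train, train_labels)
    print("chosen hyperparameters:", search.best_params_)
    print("cross-validation error:", 1 - search.best_score_)

    # Only after the hyperparameters are fixed: compute the test error once.
    print("test error:", 1 - search.best_estimator_.score(X_test, test_labels))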
What to submit:
1. Names of the two types of classifiers you opt to learn.
2. Proper citations for any external code you use. See https://integrity.mit.edu/handbook/
writing-code for guidelines.
3. Description of your training methodology, with enough details so that another machine learning
enthusiast can reproduce your results.
4. The final hyperparameter settings you use.
5. Training error rates, hold-out or cross-validation error rates, and test error rates for your two
final classifiers.
No need to submit any code.
[1] https://www.tensorflow.org
[2] https://github.com/dmlc/xgboost/tree/master/python-package
Problem 2 (Conditional probability estimation; 15 points).
(a) Suppose $(X, Y)$ is a random pair taking values in $\mathcal{X} \times \{-1, 1\}$. Let $\eta \colon \mathcal{X} \to [0, 1]$ be the
conditional probability function, given by $\eta(x) := P(Y = 1 \mid X = x)$. What function
$f \colon \mathcal{X} \to \mathbb{R}$ minimizes $E[\ell(Y f(X))]$, where $\ell(z) := e^{-z}$? Give your answer in terms of $\eta$ (e.g.,
$f(x) = \eta(x)$).
(b) Download the data set hw5data.mat from Courseworks (which has training feature vectors
and labels, stored as data and labels, respectively). Compute the maximum likelihood
estimates of the logistic regression model parameters $(\beta_0, \beta) \in \mathbb{R} \times \mathbb{R}^d$. You may use your
code from the previous homework assignment, or code from an existing software library for
logistic regression. It should be possible to obtain estimates with objective value (from the
previous homework) not larger than 0.50317.
The data set hw5data.mat also contains test feature vectors and labels, stored as testdata
and testlabels, respectively. With your estimated parameters $(\hat\beta_0, \hat\beta) \in \mathbb{R} \times \mathbb{R}^d$ in hand,
you can compute the conditional probability
\[
P_{(\hat\beta_0, \hat\beta)}(Y = 1 \mid X = x) = \frac{1}{1 + \exp(-\hat\beta_0 - \langle \hat\beta, x \rangle)}
\]
for each row $x^\top$ in testdata.
It turns out each row of testdata actually appears 128 times: for each $i = 1, 2, \ldots, 1024$,
the testdata rows $\{x^\top_{1024k+i} : k = 0, 1, \ldots, 127\}$ are identical. However,
the corresponding entries in testlabels are not repeated in the same way: for each
$i = 1, 2, \ldots, 1024$, the actual fraction of labels equal to one is $p_i := \frac{1}{128} \sum_{k=0}^{127} y_{1024k+i} \notin \{0, 1\}$.
Compare $P_{(\hat\beta_0, \hat\beta)}(Y = 1 \mid X = x_i)$ to $p_i$ by computing the mean absolute error (MAE)
\[
\mathrm{MAE} := \frac{1}{1024} \sum_{i=1}^{1024} \left| P_{(\hat\beta_0, \hat\beta)}(Y = 1 \mid X = x_i) - p_i \right| .
\]
(This is one way to evaluate conditional probability predictions; there are many others.)
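For reference, here is one possible way to carry out this computation in Python (scipy plus scikit-learn); it assumes the .mat variable names given above, treats a very weakly regularized LogisticRegression fit as a stand-in for the maximum-likelihood estimate, and uses the repetition structure described above to form the $p_i$.

    # Sketch of part (b): fit logistic regression, then compare predicted
    # probabilities to the empirical label fractions p_i on the repeated test rows.
    import numpy as np
    from scipy.io import loadmat
    from sklearn.linear_model import LogisticRegression

    mat = loadmat("hw5data.mat")
    X, y = mat["data"], mat["labels"].ravel()
    X_test, y_test = mat["testdata"], mat["testlabels"].ravel()

    # Very large C means negligible regularization, so this approximates the MLE.
    clf = LogisticRegression(C=1e10, max_iter=10000).fit(X, y)
    probs = clf.predict_proba(X_test)[:, 1]     # estimated P(Y = 1 | X = x) per test row

    # Rows 1024*k + i (k = 0, ..., 127) are identical, so the first 1024 predictions
    # cover all distinct points; p_i is the fraction of ones among the 128 labels for point i.
    probs_unique = probs[:1024]
    p = (y_test.reshape(128, 1024) == 1).mean(axis=0)
    print("MAE:", np.abs(probs_unique - p).mean())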
What to submit in your write-up:
1. MAE of conditional probability predictions.
2. Proper citations for any external code you use.
No need to submit any code.
(c) Optional. The data from part (b) was generated by a distribution for which the Bayes classifier
is a linear classifier, but which is not from the logistic regression model. Try a different approach
to get better conditional probability predictions on the data from part (b).
What to submit in your write-up:
1. MAE of conditional probability predictions.
2. Proper citations for any external code you use.
3. A description of the approach you tried.
No need to submit any code.
Problem 3 (Linear regression; 15 points).
(a) Let $S := \{(x_i, y_i)\}_{i=1}^n$ be a data set from $\mathbb{R}^d \times \mathbb{R}$, and let $\mathcal{P} = \{P_{(w, \sigma^2)} : w \in \mathbb{R}^d, \sigma^2 > 0\}$ be
the "classical statistical model for linear regression" discussed in lecture.
Let $A$ be the $n \times d$ matrix whose $i$-th row is $x_i^\top$, and let $y$ be the vector whose $i$-th entry is $y_i$.
Suppose $S$ really is an iid sample from a distribution $P_{(w_\star, \sigma_\star^2)} \in \mathcal{P}$. Assume that $A$ has rank
$d$, so $A^\top A$ is invertible and $\hat{w}_{\mathrm{ols}}$ exists; define $\hat{y} := A \hat{w}_{\mathrm{ols}}$.
For each of the following assertions, state whether it is true or false, and briefly justify your
answer.
1. $\operatorname{var}(y_i \mid x_i) = \sigma_\star^2$.
2. $\operatorname{cov}(w_\star \mid A) = \sigma_\star^2 (A^\top A)^{-1}$.
3. $\operatorname{cov}(\hat{w}_{\mathrm{ols}} \mid A) = \sigma_\star^2 (A^\top A)^{-1}$.
4. $E(\hat{y} \mid A) = A w_\star$.
5. $r := y - A \hat{w}_{\mathrm{ols}}$ is the orthogonal projection of $y$ onto the null space of $A^\top$.
6. If every entry in the first column of $A$ is 1, then $\sum_{i=1}^n r_i = 0$ (where $r_i$ is the $i$-th entry
of $r$, defined above).
(b) Download the Boston housing data set housing.mat from Courseworks. The first feature
is a constant feature, always equal to one: this is just here to simplify estimation of the
"intercept" term. The next 13 features are CRIM, ZN, . . . , LSTAT, as described in
https://archive.ics.uci.edu/ml/datasets/Housing; note that standardization (based on the
training data) has been applied (to both the training data and test data). The output (label)
is MEDV, the median value of owner-occupied homes (in units of $1000).
Compute the ordinary least squares (OLS) estimator based on the training data (data and
labels). You can use whatever software libraries you like to do this. Compute the average
squared loss of the OLS estimator on the test data (testdata and testlabels).
Use some existing software package to compute a sparse weight vector with at most three non-zero
entries (not including the "intercept") based on the training data. Compute the average
squared loss of this sparse linear predictor on the test data (testdata and testlabels).
Some suggested methods to use: Lasso, LARS (which is actually an algorithm for solving the
Lasso optimization problem with some additional convenient properties), stepwise regression,
orthogonal matching pursuit.
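To make the workflow concrete, here is one possible sketch in Python (numpy/scipy/scikit-learn) using OLS and the Lasso; the .mat variable names, the assumed CRIM-through-LSTAT column order, and the crude alpha search are placeholders you may need to adjust, and any of the suggested methods above works equally well.

    # Sketch of part (b): OLS and a Lasso-based sparse predictor on housing.mat.
    import numpy as np
    from scipy.io import loadmat
    from sklearn.linear_model import Lasso

    mat = loadmat("housing.mat")
    A, y = mat["data"], mat["labels"].ravel()
    A_test, y_test = mat["testdata"], mat["testlabels"].ravel()

    # Ordinary least squares via a least-squares solver.
    w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)
    print("OLS test avg squared loss:", np.mean((A_test @ w_ols - y_test) ** 2))

    # Sparse predictor: drop the constant column (Lasso fits its own intercept)
    # and increase alpha until at most three coefficients remain non-zero.
    names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
             "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    for alpha in np.geomspace(0.01, 10, 50):
        lasso = Lasso(alpha=alpha).fit(A[:, 1:], y)
        if np.count_nonzero(lasso.coef_) <= 3:
            break
    print("Lasso test avg squared loss:",
          np.mean((lasso.predict(A_test[:, 1:]) - y_test) ** 2))
    print("selected variables:", [names[j] for j in np.flatnonzero(lasso.coef_)])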
What to submit in your write-up:
1. Test average squared loss of OLS estimator.
2. Test average squared loss of the sparse linear predictor.
3. Names of the variables with non-zero entries in the sparse linear predictor. Report the
actual variable names[3] (e.g., CRIM, ZN, INDUS).
4. Proper citations for any external code you use.
No need to submit any code.
[3] The names of the variables are given in https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.