Homework 5 COMS 4771


Problem 1 (Non-linear classifiers; 20 points). For this problem, you will need to learn to use software libraries for at least two of the following non-linear classifier types:
• neural networks;
• boosted decision trees (i.e., boosting, with the "weak learner" being a decision tree learner);
• random forests.
All of these are available in scikit-learn and the MATLAB Statistics and Machine Learning Toolbox, although you may also use other external libraries (e.g., TensorFlow (https://www.tensorflow.org) for neural networks, XGBoost (https://github.com/dmlc/xgboost/tree/master/python-package) for boosted decision trees). You are welcome to implement learning algorithms for these classifiers yourself, but this is neither required nor recommended.
Learn two different types of non-linear classifiers from above for the HW3 restaurant review data, with any feature representation you like. You can use the software libraries mentioned above, as well as other libraries & tools (or your old code from HW3) to perform the feature processing. You should try to match (and hopefully surpass) the results obtained by the "modified" Averaged-Perceptron with bigram features from HW3, which had five-fold cross-validation error rate around 8.8%. You may sub-sample the training data if you like, but note that using more training data usually improves performance (up to a point).
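As a rough illustration of one possible setup (not a required or recommended configuration), the following Python sketch trains a random forest and a small neural network on bag-of-bigram counts with scikit-learn. The variables train_texts, train_labels, test_texts, and test_labels are hypothetical placeholders for however you load the HW3 review data.

    # Hypothetical sketch: two non-linear classifiers on bigram count features.
    # train_texts/train_labels and test_texts/test_labels are placeholders for
    # the HW3 restaurant review data (loading not shown).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neural_network import MLPClassifier

    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)  # unigrams + bigrams
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    forest = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    forest.fit(X_train, train_labels)

    net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=0)
    net.fit(X_train, train_labels)

    for name, clf in [("random forest", forest), ("neural network", net)]:
        print(name, "test error rate:", 1.0 - clf.score(X_test, test_labels))

The hyperparameter values above are only starting points; as the next paragraph explains, they should be chosen with hold-out or cross-validation, and the test error should be computed only once, at the very end.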
For many of these learning algorithms, you will need to set various hyperparameters (including classifier "architecture", like the number of layers and width of each layer in a neural network, and the number of trees in a random forest). Often there are defaults that make a good starting point, but you may need to adjust at least some of them to get good performance. Use hold-out validation or K-fold cross-validation to do this. Do not make any hyperparameter choices (or any other similar choices) based on the test set! You should only compute the test error rates after you have settled on hyperparameter settings and trained your two final classifiers.
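One common way to do such a search in scikit-learn is GridSearchCV, which performs K-fold cross-validation over a small hyperparameter grid. The sketch below reuses the X_train and train_labels placeholders from the earlier snippet and tunes only the random forest; it is one of many reasonable setups, not the required one.

    # Hypothetical sketch: 5-fold cross-validation over a small random-forest grid,
    # using only the training data (never the test set).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_features": ["sqrt", "log2"],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=0, n_jobs=-1),
        param_grid,
        cv=5,                 # 5-fold cross-validation
        scoring="accuracy",
    )
    search.fit(X_train, train_labels)
    print("best hyperparameters:", search.best_params_)
    print("cross-validation error rate:", 1.0 - search.best_score_)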
What to submit:
1. Names of the two types of classifiers you opt to learn.
2. Proper citations for any external code you use. See https://integrity.mit.edu/handbook/writing-code for guidelines.
3. Description of your training methodology, with enough detail so that another machine learning enthusiast can reproduce your results.
4. The final hyperparameter settings you use.
5. Training error rates, hold-out or cross-validation error rates, and test error rates for your two final classifiers.
No need to submit any code.
Problem 2 (Conditional probability estimation; 15 points).
(a) Suppose $(X, Y)$ is a random pair taking values in $\mathcal{X} \times \{-1, 1\}$. Let $\eta \colon \mathcal{X} \to [0, 1]$ be the conditional probability function, given by $\eta(x) := P(Y = 1 \mid X = x)$. What function $f \colon \mathcal{X} \to \mathbb{R}$ minimizes $E[\ell(Y f(X))]$, where $\ell(z) := e^{-z}$? Give your answer in terms of $\eta$ (e.g., $f(x) = \eta(x)$).
(b) Download the data set hw5data.mat from Courseworks (which has training feature vectors and labels, stored as data and labels, respectively). Compute the maximum likelihood estimates of the logistic regression model parameters $(\beta_0, \beta) \in \mathbb{R} \times \mathbb{R}^d$. You may use your code from the previous homework assignment, or code from an existing software library for logistic regression. It should be possible to obtain estimates with objective value (from the previous homework) not larger than 0.50317.
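For instance, a minimal Python sketch (assuming the field names data and labels given above, and approximating the unregularized maximum-likelihood fit by making the penalty negligibly weak) could look like the following; the "objective value" computed at the end is assumed to be the average logistic loss, so adjust it to match the previous homework's exact definition.

    # Hypothetical sketch: (approximate) maximum-likelihood logistic regression.
    import numpy as np
    from scipy.io import loadmat
    from sklearn.linear_model import LogisticRegression

    mat = loadmat("hw5data.mat")
    X, y = mat["data"], mat["labels"].ravel()

    clf = LogisticRegression(C=1e10, max_iter=10000)  # huge C ~ negligible regularization
    clf.fit(X, y)
    beta0_hat, beta_hat = clf.intercept_[0], clf.coef_.ravel()

    # Average logistic loss, treating the larger label value as the "+1" class
    # (assumed to match the previous homework's objective; adjust if it differs).
    signs = np.where(y == y.max(), 1.0, -1.0)
    scores = X @ beta_hat + beta0_hat
    print("objective value:", np.mean(np.logaddexp(0.0, -signs * scores)))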
The data set hw5data.mat also contains test feature vectors and labels, stored as testdata and testlabels, respectively. With your estimated parameters $(\hat{\beta}_0, \hat{\beta}) \in \mathbb{R} \times \mathbb{R}^d$ in hand, you can compute the conditional probability
$$P_{(\hat{\beta}_0, \hat{\beta})}(Y = 1 \mid X = x) = \frac{1}{1 + \exp(-\hat{\beta}_0 - \langle \hat{\beta}, x \rangle)}$$
for each row $x^\top$ in testdata.
It turns out each row of testdata actually appears 128 times in testdata: for each $i = 1, 2, \ldots, 1024$, the testdata rows $\{x_{1024k+i}^\top : k = 0, 1, \ldots, 127\}$ are identical. However, the corresponding entries in testlabels are not repeated in the same way: for each $i = 1, 2, \ldots, 1024$, the actual fraction of labels equal to one is $p_i := \frac{1}{128} \sum_{k=0}^{127} y_{1024k+i} \notin \{0, 1\}$.
Compare $P_{(\hat{\beta}_0, \hat{\beta})}(Y = 1 \mid X = x_i)$ to $p_i$ by computing the mean absolute error (MAE)
$$\mathrm{MAE} := \frac{1}{1024} \sum_{i=1}^{1024} \left| P_{(\hat{\beta}_0, \hat{\beta})}(Y = 1 \mid X = x_i) - p_i \right| .$$
(This is one way to evaluate conditional probability predictions; there are many others.)
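Continuing the sketch above (and assuming testlabels really has 128 × 1024 entries laid out exactly as just described), the MAE could be computed as follows; indexing is 0-based in the code, versus $i = 1, \ldots, 1024$ in the text.

    # Hypothetical sketch: predicted probabilities on the 1024 distinct test rows
    # versus the empirical label fractions p_i.
    Xt, yt = mat["testdata"], mat["testlabels"].ravel()

    probs = 1.0 / (1.0 + np.exp(-(Xt[:1024] @ beta_hat + beta0_hat)))

    # Row k of the reshaped array holds the labels y_{1024k+1}, ..., y_{1024k+1024}.
    p = (yt.reshape(128, 1024) == yt.max()).mean(axis=0)

    print("MAE:", np.abs(probs - p).mean())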
What to submit in your write-up:
1. MAE of conditional probability predictions.
2. Proper citations for any external code you use.
No need to submit any code.
(c) Optional. The data from part (b) was generated by a distribution for which the Bayes classifier is a linear classifier, but is not one from the logistic regression model. Try a different approach to get better conditional probability predictions on the data from part (b).
What to submit in your write-up:
1. MAE of conditional probability predictions.
2. Proper citations for any external code you use.
3. A description of the approach you tried.
No need to submit any code.
Problem 3 (Linear regression; 15 points).
(a) Let $S := \{(x_i, y_i)\}_{i=1}^{n}$ be a data set from $\mathbb{R}^d \times \mathbb{R}$, and let $\mathcal{P} = \{P_{(w, \sigma^2)} : w \in \mathbb{R}^d, \sigma^2 > 0\}$ be the "classical statistical model for linear regression" discussed in lecture.
Let $A$ be the $n \times d$ matrix whose $i$-th row is $x_i^\top$, and let $y$ be the vector whose $i$-th entry is $y_i$.
Suppose $S$ really is an iid sample from a distribution $P_{(w_\star, \sigma_\star^2)} \in \mathcal{P}$. Assume that $A$ has rank $d$, so $A^\top A$ is invertible and $\hat{w}_{\mathrm{ols}}$ exists; define $\hat{y} := A \hat{w}_{\mathrm{ols}}$.
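For reference, the standard definition of the ordinary least squares estimator is assumed here (the lecture's exact notation may differ):
$$\hat{w}_{\mathrm{ols}} := \arg\min_{w \in \mathbb{R}^d} \|Aw - y\|_2^2 = (A^\top A)^{-1} A^\top y \quad \text{(when $A^\top A$ is invertible).}$$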
For each of the following assertions, state whether it is true or false, and briefly justify your answer.
1. $\mathrm{var}(y_i \mid x_i) = \sigma_\star^2$.
2. $\mathrm{cov}(w_\star \mid A) = \sigma_\star^2 (A^\top A)^{-1}$.
3. $\mathrm{cov}(\hat{w}_{\mathrm{ols}} \mid A) = \sigma_\star^2 (A^\top A)^{-1}$.
4. $E(\hat{y} \mid A) = A w_\star$.
5. $r := y - A \hat{w}_{\mathrm{ols}}$ is the orthogonal projection of $y$ onto the null space of $A^\top$.
6. If every entry in the first column of $A$ is 1, then $\sum_{i=1}^{n} r_i = 0$ (where $r_i$ is the $i$-th entry of $r$, defined above).
(b) Download the Boston housing data set housing.mat from Courseworks. The first feature is a constant feature, always equal to one: this is just here to simplify estimation of the "intercept" term. The next 13 features are CRIM, ZN, . . . , LSTAT, as described in https://archive.ics.uci.edu/ml/datasets/Housing; note that standardization (based on the training data) has been applied (to both the training data and test data). The output (label) is MEDV, the median value of owner-occupied homes (in units of $1000).
Compute the ordinary least squares (OLS) estimator based on the training data (data and
labels). You can use whatever software libraries you like to do this. Compute the average
squared loss of the OLS estimator on the test data (testdata and testlabels).
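A minimal Python sketch of this computation (assuming the field names data, labels, testdata, and testlabels stated above) might be:

    # Hypothetical sketch: OLS fit on the training data, average squared loss on test.
    import numpy as np
    from scipy.io import loadmat

    housing = loadmat("housing.mat")
    A, y = housing["data"], housing["labels"].ravel()
    A_test, y_test = housing["testdata"], housing["testlabels"].ravel()

    # The constant first column of A plays the role of the intercept.
    w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)
    print("OLS test average squared loss:", np.mean((A_test @ w_ols - y_test) ** 2))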
Use some existing software package to compute a sparse weight vector with at most three non-zero entries (not including the "intercept") based on the training data. Compute the average squared loss of this sparse linear predictor on the test data (testdata and testlabels). Some suggested methods to use: Lasso, LARS (which is actually an algorithm for solving the Lasso optimization problem with some additional convenient properties), stepwise regression, orthogonal matching pursuit.
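As one concrete example of the orthogonal matching pursuit route (continuing the snippet above; OMP is just one of the suggested options, and the variable-name list is an assumption that the 13 columns appear in the order given in housing.names):

    # Hypothetical sketch: a sparse predictor with at most 3 non-zero coefficients
    # (excluding the intercept) via orthogonal matching pursuit.
    from sklearn.linear_model import OrthogonalMatchingPursuit

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=True)
    omp.fit(A[:, 1:], y)  # drop the constant column; fit_intercept supplies the intercept
    print("sparse test average squared loss:",
          np.mean((omp.predict(A_test[:, 1:]) - y_test) ** 2))

    # Assumed column order (check against housing.names):
    names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
             "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    print("selected variables:", [n for n, c in zip(names, omp.coef_) if c != 0])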
What to submit in your write-up:
1. Test average squared loss of OLS estimator.
2. Test average squared loss of the sparse linear predictor.
3. Names of the variables with non-zero entries in the sparse linear predictor. Report the actual variable names (e.g., CRIM, ZN, INDUS), as given in https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.
4. Proper citations for any external code you use.
No need to submit any code.