CS178 Homework #4

Problem 1: Decision Trees
We’ll use the same data as in our earlier homework: In order to reduce my email load, I decide to
implement a machine learning algorithm to decide whether or not I should read an email, or simply
file it away instead. To train my model, I obtain the following data set of binary-valued features
about each email, including whether I know the author or not, whether the email is long or short,
and whether it has any of several key words, along with my final decision about whether to read it
(y = +1 for read, y = −1 for discard).
x1              x2          x3                x4             x5               y
know author?    is long?    has 'research'    has 'grade'    has 'lottery'    ⇒ read?
0               0           1                 1              0                -1
1               1           0                 1              0                -1
0               1           1                 1              1                -1
1               1           1                 1              0                -1
0               1           0                 0              0                -1
1               0           1                 1              1                +1
0               0           1                 0              0                +1
1               0           0                 0              0                +1
1               0           1                 1              0                +1
1               1           1                 1              1                -1
In the case of any ties, we will prefer to predict class +1.
(a) Calculate the entropy of the class variable y.
(b) Calculate the information gain for each feature x_i. Which feature should I split on first?
(c) Draw the complete decision tree that will be learned from these data.
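If you want to sanity-check your hand calculations for parts (a) and (b), a small numpy sketch along the following lines could compute the same quantities from the table above (the code and variable names here are purely illustrative, not part of the assignment):

import numpy as np

# The data table from the problem: columns x1..x5, last column is y
data = np.array([
    [0, 0, 1, 1, 0, -1],
    [1, 1, 0, 1, 0, -1],
    [0, 1, 1, 1, 1, -1],
    [1, 1, 1, 1, 0, -1],
    [0, 1, 0, 0, 0, -1],
    [1, 0, 1, 1, 1,  1],
    [0, 0, 1, 0, 0,  1],
    [1, 0, 0, 0, 0,  1],
    [1, 0, 1, 1, 0,  1],
    [1, 1, 1, 1, 1, -1],
])
X, y = data[:, :5], data[:, 5]

def entropy(labels):
    # Shannon entropy (base 2) of a vector of class labels
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

H_y = entropy(y)                         # part (a): entropy of the class variable
print("H(y) =", H_y)
for i in range(X.shape[1]):              # part (b): information gain of each feature
    on = (X[:, i] == 1)
    H_cond = on.mean() * entropy(y[on]) + (~on).mean() * entropy(y[~on])
    print("x%d: information gain = %.4f" % (i + 1, H_y - H_cond))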
Problem 2: Decision Trees on Kaggle
In this problem, we will use our Kaggle in-class competition data to test decision trees on real data.
Kaggle is a website designed to host data prediction competitions; we will use it to give some experience with more realistic machine learning problems, and some opportunity to compare methods
and ideas amongst ourselves. Create an account linked to (or using) your UCI-affiliated email address,
log into inclass.kaggle.com, and search for the UC Irvine CS178 2017 competition:
https://kaggle.com/c/uc-irvine-2017w-cs178
Download the training data, X_train.txt, Y_train.txt, as well as the test data point features
(used to make your test predictions), X_test.txt, from the Kaggle competition site (under Data).
(Note: this is a private class competition; if you do not use a UCI email, you will not be able to
access the data or submit.)
In this problem, we will build a simple decision tree model to make predictions on the data. You
can use the treeClassify class provided to build your decision trees.
Note: Kaggle competitions only let you submit a fixed number of predictions per day, ≈ 5 in
our case, so be careful. We’ll use a validation split to decide what hyperparameter choices we think
are most promising, and upload only one model.
(a) Load the training data, X_train.txt and Y_train.txt. There are quite a lot of data available; for the homework you can just use the first 10,000 samples. Split out a validation set as
well, say samples 10001–20000.
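For example (a minimal sketch; the file paths and the use of np.genfromtxt are assumptions about your local setup):

import numpy as np
import mltools as ml

# Load the Kaggle training data (adjust the paths to wherever you saved the files)
X = np.genfromtxt('X_train.txt', delimiter=None)
Y = np.genfromtxt('Y_train.txt', delimiter=None)

# First 10,000 samples for training, the next 10,000 as a validation split
Xt,  Yt  = X[:10000],       Y[:10000]
Xva, Yva = X[10000:20000],  Y[10000:20000]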
(b) Learn a decision tree classifier on the data. To avoid any potential recursion limits, specify a
max depth of 50, e.g.,
learner = ml.dtree.treeClassify(Xt,Yt, maxDepth=50)
(This may take a minute or two.) Compute your model’s training and validation error rates.
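For example, assuming the mltools classifiers provide an err(X, Y) member that returns the error rate (a sketch, using the split from part (a)):

print("Training error:  ", learner.err(Xt, Yt))
print("Validation error:", learner.err(Xva, Yva))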
(c) Now, try varying the maximum depth parameter (maxDepth), which forces the tree to stop
after at most that many levels. Test values 0, 1, …, 15 and compare their performance
(both training and validation) against the full depth. Is complexity increasing or decreasing with
the depth cutoff? Identify whether you think the model begins overfitting, and if so, when. If
you use this parameter for complexity control, what depth would you select as best?
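One possible way to run this sweep, using the validation split from part (a) as a stand-in for test performance (matplotlib and the variable names are illustrative):

import matplotlib.pyplot as plt

depths = range(0, 16)
tr_err, va_err = [], []
for d in depths:
    dt = ml.dtree.treeClassify(Xt, Yt, maxDepth=d)
    tr_err.append(dt.err(Xt, Yt))
    va_err.append(dt.err(Xva, Yva))

plt.plot(depths, tr_err, 'b-', label='training error')
plt.plot(depths, va_err, 'g-', label='validation error')
plt.xlabel('maxDepth'); plt.ylabel('error rate'); plt.legend(); plt.show()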
(d) Now, using high maximum depth (d = 50), use minLeaf to control complexity. Try values
2^[2:12] = [4, 8, 16, …, 4096]. Is complexity increasing or decreasing as minLeaf grows?
Identify when (if) the model is starting to overfit, and what value you would use for this type
of complexity control.
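A similar sketch for the minLeaf sweep (again illustrative; it reuses the training/validation split from part (a)):

import matplotlib.pyplot as plt

leaves = [2**k for k in range(2, 13)]          # 4, 8, 16, ..., 4096
tr_err, va_err = [], []
for nl in leaves:
    dt = ml.dtree.treeClassify(Xt, Yt, maxDepth=50, minLeaf=nl)
    tr_err.append(dt.err(Xt, Yt))
    va_err.append(dt.err(Xva, Yva))

plt.semilogx(leaves, tr_err, 'b-', label='training error')
plt.semilogx(leaves, va_err, 'g-', label='validation error')
plt.xlabel('minLeaf'); plt.ylabel('error rate'); plt.legend(); plt.show()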
(e) (Not graded) A related control is minParent; how does complexity control with minParent
compare to minLeaf?
(f) Our Kaggle competition measures performance using the ROC curve, specifically the AUC
(area under the curve) score. Compute and plot the ROC curve for your trained model (using
e.g. the roc member function), and the area under the curve (auc).
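A minimal sketch on the validation split, assuming the roc member returns the false-positive and true-positive rates as its first two outputs (check its docstring for the exact return values) and that auc returns a scalar score:

import matplotlib.pyplot as plt

rates = learner.roc(Xva, Yva)          # assumed: (false positive rate, true positive rate, ...)
plt.plot(rates[0], rates[1])
plt.xlabel('false positive rate'); plt.ylabel('true positive rate'); plt.show()

print("Validation AUC:", learner.auc(Xva, Yva))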
(g) Using your best complexity control value (either depth or number of leaf data), re-train a model
(you may use the first 10k data, or more, as you prefer). Load the test features X_test.txt,
and make predictions on all 200k test points. Output your predictions in the format expected
by Kaggle,
Ypred = learner.predictSoft( Xte )
# Now output a file with two columns, a row ID and a confidence in class 1:
np.savetxt('Yhat_dtree.txt',
           np.vstack( (np.arange(len(Ypred)), Ypred[:,1]) ).T,
           '%d, %.2f', header='ID,Prob1', comments='', delimiter=',')
Upload your predictions and report your model's performance; compare it to the AUC score
you estimated using the validation data.
Note the use of predictSoft here; while you can upload hard predictions (class values),
your ROC score will usually be much better if you include your confidence level. Thus we
typically upload our confidence that the data is class 1; this allows Kaggle to order the data
by our confidence and give a smoother ROC curve (and thus usually a higher area under the
curve).
Problem 3: Random Forests
Random Forests are bagged collections of decision trees, which select their decision nodes from
randomly chosen subsets of the possible features (rather than all features). You can implement this
easily in treeClassify using the option nFeatures=n, where n is the number of features to select
from (e.g., n = 50 or n = 60 if there are 90-some features); you’ll write a for-loop to build the
ensemble members, and another to compute the prediction of the ensemble.
In Python, it is easy to keep a list of different learners, even of different types, for use in an
ensemble predictor:
ensemble[i] = treeClassify(Xb, Yb, ...)  # save ensemble member "i" in the list
# ...
ensemble[i].predict(Xv)                  # find the predictions for ensemble member "i"
(a) Using your validation split, learn a bagged ensemble of decision trees on the training data and
evaluate validation performance. (See the pseudocode from lecture slides.) For your individual
learners, use little complexity control (depth cutoff 15+, minLeaf 4, etc.), since the bagging
will be used to control overfitting instead. For the bootstrap process, draw the same number
of data as in your training set after the validation split (M0 = M in the pseudocode). You may
find ml.bootstrapData() helpful, although it is very easy to do yourself. Plot the training
and validation error as a function of the number of learners you include in the ensemble, for
(at least) 1, 5, 10, 25 learners. (You may find it more computationally efficient to simply learn
25 ensemble members first, and then evaluate the results using only a few of them; this will
give the same results as only learning the few that you need.)
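One possible structure for this part, assuming ml.bootstrapData(X, Y, n) returns a bootstrap sample of size n and that each mltools classifier exposes predictSoft and a classes attribute (a sketch under those assumptions, not the required implementation):

import numpy as np
import mltools as ml

nBag = 25
M = Xt.shape[0]                        # training split from Problem 2
ensemble = [None] * nBag

for i in range(nBag):
    # Bootstrap sample of the same size as the training split
    Xb, Yb = ml.bootstrapData(Xt, Yt, M)
    ensemble[i] = ml.dtree.treeClassify(Xb, Yb, maxDepth=15, minLeaf=4, nFeatures=60)

def ensemble_err(members, X, Y):
    # Error rate when averaging the members' soft predictions
    Psoft = np.mean([m.predictSoft(X) for m in members], axis=0)
    Yhat = np.asarray(members[0].classes)[np.argmax(Psoft, axis=1)]
    return np.mean(Yhat != Y)

for k in [1, 5, 10, 25]:
    print(k, "learners: train", ensemble_err(ensemble[:k], Xt, Yt),
          " valid", ensemble_err(ensemble[:k], Xva, Yva))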
(b) Now choose an ensemble size and build an ensemble using at least 10k training data points,
make predictions on the test data, and upload to Kaggle. Report your performance, along
with your estimated AUC score on your validation data.
Note: for your ensemble's soft predictions, you can either use the fraction of members of
your ensemble that predicted class 1, or take the average of your ensemble members' soft
prediction scores (confidences). You can try either, or try both and let us know which worked
better; a sketch of the averaging approach follows.
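For example, averaging the members' confidences and writing them in the same format as in Problem 2 (the test-file loading and output file name are illustrative):

import numpy as np

Xte = np.genfromtxt('X_test.txt', delimiter=None)

# Average each member's soft prediction (confidence) across the ensemble
Psoft = np.mean([m.predictSoft(Xte) for m in ensemble], axis=0)

np.savetxt('Yhat_rforest.txt',
           np.vstack((np.arange(len(Psoft)), Psoft[:, 1])).T,
           '%d, %.2f', header='ID,Prob1', comments='', delimiter=',')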