HOMEWORK 1 CSC2515 solution




Rate this product

1. Nearest Neighbours and the Curse of Dimensionality – 15 pts. In this question,
we will verify the claim from lecture that “most” points in a high-dimensional space are far away
from each other, and also have approximately the same distance.
(a) [5 pts] Consider two independent univariate random variables X and Y sampled uniformly
from the unit interval [0, 1]. Determine the expectation and variance of the random variable
Z = |X − Y |
, i.e., the squared distance between X and Y .
Note: You can either compute the integrals yourself or use the properties of certain probability distributions. In the latter case, explicitly mention what properties you have used.
(b) [5 pts] Now suppose we draw two d-dimensional points X and Y from a d-dimensional
unit cube with a uniform distribution, i.e., X, Y ∈ [0, 1]d
. Observe that each coordinate is
sampled independently and uniformly from [0, 1], that is, we can view this as drawing random
variables X1, . . . , Xd and Y1, . . . , Yd independently and uniformly from [0, 1]. The squared
Euclidean distance kX − Y k
can be written as R = Z1 + · · · + Zd, where Zi = |Xi − Yi
Using the properties of expectation and variance, determine E
kX − Y k
= E[R] and
Var[kX − Y k
] = Var[R]. You may give your answer in terms of the dimension d, and E[Z]
and Var[Z] (the answers from part (a)).
(c) [5 pts] Based on your answer to part (b), compare the mean and standard deviation of
kX − Y k
to the maximum possible squared Euclidean distance between two points within
the d-dimensional unit cube (this would be the distance between opposite corners of the
cube). Why does this support the claim that in high dimensions, “most points are far away,
and approximately have the same distance”?
2. Limiting Properties of the Nearest Neighbour Algorithm – 10 pts. In this question, we will study the limiting properties of the k-nearest neighbour algorithm.
Suppose that we are given n data points X1, . . . , Xn, sampled uniformly and independently
from the interval [0, 2] and one test point, y = 1.
Let Zi = |Xi − y| denote the distance of y to the data point Xi and Z(i) = |X(i) − y| denote
the distance of y to its i-th nearest neighbour. Then, Z(1) < · · · < Z(n)
(a) [2 pts] Show that the random variable Zi
is uniformly distributed between the unit interval
[0, 1].
(b) [4 pts] Show that E

n+1 , i.e., the expected distance to the 1st nearest neighbour
is 1
n+1 .
Hint: It may help to first compute P

Z(1) > t
. Note that Z(1) has the smallest value among
Z1, . . . , Zn, that is, Z(1) = mini=1,…,n Zi. So

Z(1) > t
= P

Zi > t
If mini=1,…,n Zi > t, what can you say about the value of Z1, . . . , Zn in relation to t? Do not
forget to benefit from the independence of Zis.
(c) [2 pts] Determine the expected value of the random variable Z(k)
, that is, the expected
distance to the k-th nearest neighbour.
Hint: You can use the fact that the density function of Z(k)
(t) = n!
(k − 1)!(n − k)!t
(1 − t)
, t ∈ [0, 1].
(d) [2 pts] Based on your answer to part (c), what can you say about the expected distance to
the k-th nearest neighbour as n → ∞ and k
n → 0.
3. Information Theory – 15 pts. The goal of this question is to help you become more
familiar with the basic equalities and inequalities of information theory. They appear in many
contexts in machine learning and elsewhere, so having some experience with them is helpful. We
review some concepts from information theory, and ask you a few questions.
Recall the definition of the entropy of a discrete random variable X with probability mass
function p:
H(X) = X
p(x) log2


Here the summation is over all possible values of x ∈ X , which (for simplicity) we assume is finite.
For example, X might be {1, 2, . . . , N}.
(a) [3pt] Prove that the entropy H(X) is non-negative.
(b) [3pt] If X and Y are independent random variables, show that H(X, Y ) = H(X) + H(Y )
(c) [3pt] Prove the chain rule for entropy: H(X, Y ) = H(X) + H(Y |X).
An important concept in information theory is the relative entropy or the KL-divergence of two
distributions p and q. It is defined as
KL(p||q) = X
p(x) log2
The KL-divergence is one of the most commonly used measure of difference (or divergence) between
two distributions, and it regularly appears in information theory, machine learning, and statistics.
For this question, you may assume p(x) > 0 and q(x) > 0 for all x.
If two distributions are close to each other, their KL divergence is small. If they are exactly the
same, their KL divergence is zero. KL divergence is not a true distance metric, as it isn’t symmetric
and doesn’t satisfy the triangle inequality, but we often use it as a measure of dissimilarity between
two probability distributions.
(d) [3pt] Prove that KL(p||q) is non-negative.
Hint: You may want to use Jensen’s inequality, which is described in the Appendix.
(e) [3pt] The Information Gain or Mutual Information between X and Y is I(Y ; X) = H(Y )−
H(Y |X). Show that
I(Y ; X) = KL(p(x, y)||p(x)p(y)),
where p(x) is the marginal distribution of X and p(y) is the marginal distribution of Y .
4. Approximation Error in Decision Trees – 10 pts. We will study how we can use a
decision tree to approximate a function and how the quality of the approximation improves as
the depth increases.
Consider the function f

: [0, 1]2 → {−1, +1} as visualized below. This function takes the value
of +1 on the upper left triangle and −1 on the lower right triangle.
We would like to approximate this function f
∗ using a decision tree with the maximum depth
of d. We denote the best approximation with depth d as fd.
(a) [2 pts] Explain why fd with a finite d cannot represent f

(b) [2 pts] Show what f4 is. You should draw the tree and include all the relevant information
such as the attributes at each node, the splitting threshold, and the value of the leaves. You
also need to show the regions that it induces.
Let us define the error the decision tree fd makes in approximating f
∗ as
ed =
I{fd(x) 6= f

This is simply the area in the region [0, 1]2 where the function fd is different from f

(c) [2 pts] What is the value of e2 and e4?
(d) [2 pts] Provide the formula for ed for any even d, that is, d = 2k with k ∈ N. You need to
justify your formula, but your justification does not need to be detailed.
(e) [2 pt] What does this tell us about the quality of approximation as a function of depth d?
5. Decision Trees and K-Nearest Neighbour – 50 pts. In this question, you will use
the scikit-learn’s decision tree and KNN classifiers to classify tweets that have been evaluated
to determine whether they agree that climate change exists, or deny that it exists. The aim of this
question is for you to read the scikit-learn API and get comfortable with training/validation
We will use a dataset consisting of tweets scraped from Twitter, for which sentiment analysis has classified as either “climate change asserting” (meaning agreeing that climate change
is real) or “climate change denying” (meaning disagreeing with climate change). We will now
refer to these sets of tweets as exists and DNE, w.r.t. sentiment towards climate change.
The exists dataset contains 3088 tweets, and the DNE dataset consists of 1099 tweets from
https://data.world/xprizeai-env/sentiment-of-climate-change. The data were cleaned
by removing special characters and removing all link artefacts (“[link]”). The cleaned data are
available as exists_climate.csv and DNE_climate.csv on the course webpage. It is expected
that you use these cleaned data sources for this assignment.
You will build a decision tree and KNN to classify “climate change asserting” or “climate
change denying” tweets. Instead of coding these methods yourself, you will do what we normally
do in practice: use an existing implementation. You should use the DecisionTreeClassifier
and KNeighborsClassifier included in scikit-learn. Note that figuring out how to use this
implementation, its corresponding attributes and methods is a part of the assignment.
All code should be submitted in hw1_code.py.
(a) [10 pts] Write a function load_data which loads the data, preprocesses it using a vectorizer
extraction.text, we suggest you use CountVectorizer as it is the simplest in nature),
and splits the entire dataset randomly into 70% training, 15% validation, and 15% test
examples. You may use train_test_split function of scikit-learn within this function.
(b) [10 pts] (Decision Tree) Write a function select_tree_model that trains the decision tree
classifier using at least 5 different sensible values of max_depth, as well as two different split
criteria (Information Gain and Gini coefficient), evaluates the performance of each one on
the validation set, and prints the resulting accuracies of each model.
You should use DecisionTreeClassifier, but you should write the validation code yourself.
Include the output of this function in your solution.
(c) [10 pts] (Decision Tree) Now let’s stick with the hyperparameters which achieved the
highest validation accuracy. Report its accuracy on the test dataset. Moreover, extract and
visualize the first two layers of the tree. Your visualization may look something like what
is shown below, but it does not have to be an image; it is perfectly fine to display text. It
may also be hand-drawn. Include your visualization in your solution pdf.
(d) [10 pts] (Decision Tree) Write a function compute_information_gain which computes the
information gain of a split on the training data. That is, compute I(Y, xi), where Y is the
random variable signifying whether the tweet is climate change asserting or denying, and
is the keyword chosen for the split. Your split should be based on whether the keyword
xi exists (True) or does not exist (False). You should ignore the number of times that the
keyword appears in the sentence.
Report the outputs of this function for the topmost split from the previous part, and for
several other keywords.
(e) [10 pts] (KNN) Write a function select_knn_model that uses a KNN classifier to classify
between climate change asserting or denying tweets. Use a range of k values between 1 to 20
and compute both training and validation errors. You should generate a graph similar to the
one on slide 44 of Lecture #1, which is Figure 2.4 of the Elements of Statistical Learning.
You do not need to worry about the Bayes error or the Linear classifier in that figure. Report
the generated graph in your report. Choose the model with the best validation accuracy and
report its accuracy on the test data.
The briefly review the concept of convexity, which you may find useful for some of the questions
in this assignment. You may assume anything given here.
Convexity is an important concept in mathematics with many uses in machine learning. We
briefly define convex set and function and some of their properties here. Using these properties
are useful in solving some of the questions in this homework. If you are interested to know more
about convexity, refer to Boyd and Vandenberghe, Convex Optimization, 2004.
A set C is convex if the line segment between any two points in C lies within C, i.e., if for any
x1, x2 ∈ C and for any 0 ≤ λ ≤ 1, we have
λx1 + (1 − λ)x2 ∈ C.
For example, a cube or sphere in R
d are convex sets, but a cross (a shape like X) is not.
A function f : R
d → R is convex if its domain is a convex set and if for all x1, x2 in its domain,
and for any 0 ≤ λ ≤ 1, we have
f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).
This inequality means that the line segment between (x1, f(x1)) and (x2, f(x2)) lies above the
graph of f. A convex function looks like `. We say that f is concave if −f is convex. A concave
function looks like a.
Some examples of convex and concave functions are (you do not need to use most of them in
your homework, but knowing them is useful):
• Powers: x
is convex on the set of positive real numbers when p ≥ 1 or p ≤ 0. It is concave
for 0 ≤ p ≤ 1.
• Exponential: e
ax is convex on R, for any a ∈ R.
• Logarithm: log(x) is concave on the set of positive real numbers.
• Norms: Every norm on R
is convex.
• Max function: f(x) = max{x1, x2, . . . , xd} is convex on R
• Log-sum-exp: The function f(x) = log(e
x1 + . . . + e
xd ) is convex on R
An important property of convex and concave functions, which you may need to use in your
homework, is Jensen’s inequality. Jensen’s inequality states that if φ(x) is a convex function of x,
we have
φ(E[X]) ≤ E[φ(X)].
In words, if we apply a convex function to the expectation of a random variable, it is less than or
equal to the expected value of that convex function when its argument is the random variable. If
the function is concave, the direction of the inequality is reversed.
Jensen’s inequality has a physical interpretation: Consider a set X = {x1, . . . , xN } of points on
R. Corresponding to each point, we have a probability p(xi). If we interpret the probability as
mass, and we put an object with mass p(xi) at location (xi
, φ(xi)), then the centre of gravity of
these objects, which is in R
, is located at the point (E[X], E[φ(X)]). If φ is convex `, the centre
of gravity lies above the curve x 7→ φ(x), and vice versa for a concave function a.