Description
1. Decision Trees.
Consider the training and testing datasets as given in Figure 1a and Figure 1b for the sub-parts (a)-(c). For
the sub-parts (d) & (e), refer to the training and testing datasets as given in Figure 2a and Figure 2b.
Conventions. Axis-aligned, integral split means: test a single coordinate xj against an integer threshold
t ∈ Z; send xj ≤ t left and xj > t right (ties at equality go left). Depth-one means exactly one internal split
node and two leaves. “At most two splits” means at most two internal split nodes anywhere in the tree
(not depth), i.e., up to three leaves. For 1-nn, use Euclidean distance; if multiple training points are at the
same nearest distance, break ties by lexicographic order on coordinates: smaller x, and if still tied, smaller y.
Lastly, when describing decision trees, any unambiguous and succinct format is fine, for instance a nested list
or a Python snippet of nested if-else statements, as in the sketch below.
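For concreteness, the if-else format could look like the following minimal sketch; the tested coordinate, the threshold, and the leaf labels below are placeholders, not a solution to any sub-part.
\begin{verbatim}
# Format illustration only: a depth-one, axis-aligned, integral split.
# The tested coordinate, the threshold, and the leaf labels are placeholders.
def predict(point):
    x1, x2 = point
    if x1 <= 2:        # integral threshold; ties at equality go left
        return +1      # left leaf
    else:
        return -1      # right leaf
\end{verbatim}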
[Figure 1: (a) Training dataset 1. (b) Testing dataset 1.]
(a) Describe a decision tree of depth one with integral and axis-aligned decision boundaries which achieves
error at most 1/6 on training dataset 1 (Figure 1a).
(b) Describe a decision tree (of any depth) with integral and axis-aligned decision boundaries which achieves
zero error on training dataset 1 (Figure 1a).
(c) Describe a decision tree (of any depth) with integral and axis-aligned decision boundaries which achieves
zero error on training dataset 1 (Figure 1a) but has error at least 1/4 on testing dataset 1 (Figure 1b).
[Figure 2: (a) Training dataset 2. (b) Testing dataset 2.]
(d) Describe a decision tree with integral and axis-aligned decision boundaries with at most two splits, which
achieves zero error on training dataset 2 and calculate its error on testing dataset 2.
(e) Construct a 1-nn classifier using training dataset 2 and state its error on testing dataset 2.
Remark: Note that every decision tree (with axis-aligned, integral splits) with at most two splits has positive
test error in this problem. Therefore, 1-nn is a better choice here.
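To make the 1-nn conventions from the problem statement concrete, here is a minimal sketch of a 1-nn classifier with the stated lexicographic tie-break; the training points below are placeholders, not dataset 2.
\begin{verbatim}
# Minimal 1-nn sketch with the stated tie-break: among equidistant training
# points, prefer smaller x, then smaller y. The training set is a placeholder.
train = [((0, 0), +1), ((3, 1), -1)]   # [((x, y), label), ...]

def nn_predict(q):
    def key(item):
        (px, py), _ = item
        dist2 = (px - q[0]) ** 2 + (py - q[1]) ** 2  # squared Euclidean distance
        return (dist2, px, py)                       # lexicographic tie-break
    return min(train, key=key)[1]

# Example: nn_predict((1, 0)) returns +1 with the placeholder points above.
\end{verbatim}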
Solution.
2. k-means.
Recall the k-means problem with n data points S = (x_i)_{i=1}^n in R^d, k centers (µ_j)_{j=1}^k in R^d, and corresponding
clusters (S_j)_{j=1}^k. The objective of k-means is then to minimize the cost:
\[
\phi_S(\mu_1, \ldots, \mu_k) = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2 ,
\]
where ∥ · ∥ = ∥ · ∥2 throughout this problem. In this problem you will develop an alternate formulation of this
cost function in terms of pairwise distances.
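For reference, the cost φ_S can be evaluated directly from this definition; below is a minimal sketch using NumPy (the random data is purely illustrative and unrelated to any dataset in this assignment).
\begin{verbatim}
import numpy as np

def kmeans_cost(X, centers):
    """phi_S(mu_1, ..., mu_k) = sum_i min_j ||x_i - mu_j||^2."""
    # Pairwise squared distances between the n points and the k centers: (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

# Purely illustrative random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
centers = rng.normal(size=(3, 2))
print(kmeans_cost(X, centers))
\end{verbatim}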
(a) Let S_z ⊆ S be the cluster induced by using a particular point z ∈ R^d as a center. The cost of using
any point z as a center is then given by
\[
\phi_{S_z}(z) = \sum_{x \in S_z} \lVert x - z \rVert^2 .
\]
Let µ(S_z) be the sample mean of the cluster S_z. Prove that the cost φ_{S_z}(z) is equivalent to
\[
\phi_{S_z}(z) = \phi_{S_z}(\mu(S_z)) + \lvert S_z \rvert \, \lVert \mu(S_z) - z \rVert^2 .
\]
(b) Show that
\[
\phi_{S_j}(\mu_j) = \frac{1}{2 \lvert S_j \rvert} \sum_{a, b \in S_j} \lVert a - b \rVert^2 .
\]
Conclude that solving the k-means problem is equivalent to solving
\[
\min_{S_1, \ldots, S_k} \; \sum_{j=1}^{k} \frac{1}{2 \lvert S_j \rvert} \sum_{a, b \in S_j} \lVert a - b \rVert^2 .
\]
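As a numerical sanity check (not a substitute for the proofs), both identities in (a) and (b) can be verified on random data; a minimal sketch:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(8, 3))           # a single cluster of 8 points in R^3
z = rng.normal(size=3)                # an arbitrary candidate center
mu = S.mean(axis=0)                   # sample mean mu(S) of the cluster

def phi(c):
    # phi_S(c) = sum_{x in S} ||x - c||^2
    return ((S - c) ** 2).sum()

# Part (a): phi_S(z) = phi_S(mu(S)) + |S| * ||mu(S) - z||^2
print(np.isclose(phi(z), phi(mu) + len(S) * ((mu - z) ** 2).sum()))

# Part (b): phi_S(mu(S)) = (1 / (2|S|)) * sum_{a,b in S} ||a - b||^2
pairwise = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1).sum()
print(np.isclose(phi(mu), pairwise / (2 * len(S))))
\end{verbatim}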
Solution.
3. Robustness of the Majority Vote Classifier.
The purpose of this problem is to further investigate the behavior of the majority vote classifier using
Hoeffding’s inequality. Simplified versions of Hoeffding’s inequality are as follows.
Theorem 1. Given independent random variables (Z_1, . . . , Z_k) with Z_i ∈ [0, 1],
\[
\Pr\left[ \sum_{i=1}^{k} Z_i \ge \sum_{i=1}^{k} \mathbb{E}[Z_i] + k\epsilon \right] \le \exp\left( -2k\epsilon^2 \right), \tag{1}
\]
and
\[
\Pr\left[ \sum_{i=1}^{k} Z_i \le \sum_{i=1}^{k} \mathbb{E}[Z_i] - k\epsilon \right] \le \exp\left( -2k\epsilon^2 \right). \tag{2}
\]
In this problem we have an odd number n of classifiers {f1, . . . , fn} and only consider their behavior on a
fixed data example (x, y); by classifier we mean fi(x) ∈ {±1}. Define the majority vote classifier Maj as
\[
\mathrm{Maj}(x; f_1, \ldots, f_n) := 2 \cdot \mathbf{1}\left[ \sum_{i=1}^{n} f_i(x) \ge 0 \right] - 1 =
\begin{cases}
+1 & \sum_{i=1}^{n} f_i(x) > 0, \\
-1 & \sum_{i=1}^{n} f_i(x) < 0,
\end{cases}
\]
where we will not need to worry about ties since n is odd.
To demonstrate the utility of Theorem 1 in analyzing Maj, suppose that Pr[f_i(x) = y] = p > 1/2
independently for each i. Then, by defining a random variable Z_i := 1[f_i(x) ≠ y] and noting E[Z_i] = 1 − p,
\begin{align*}
\Pr[\mathrm{Maj}(x; f_1, \ldots, f_n) \ne y]
&= \Pr\left[ \sum_{i=1}^{n} \mathbf{1}[f_i(x) \ne y] \ge \frac{n}{2} \right] \\
&= \Pr\left[ \sum_{i=1}^{n} Z_i \ge n(1 - p) - \frac{n}{2} + np \right] \\
&= \Pr\left[ \sum_{i=1}^{n} Z_i \ge n \mathbb{E}[Z_1] + n(p - 1/2) \right] \\
&\le \exp\left( -2n(p - 1/2)^2 \right).
\end{align*}
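As an aside (not required for the problem), this bound is easy to check empirically: simulate n independent classifiers that are each correct with probability p and compare the majority-vote error rate to exp(−2n(p − 1/2)^2). A minimal sketch with arbitrary illustrative parameters:
\begin{verbatim}
import numpy as np

n, p, trials = 101, 0.6, 20000        # arbitrary illustrative parameters (n odd)
rng = np.random.default_rng(2)

# Each row holds n independent votes: correct (+1) w.p. p, wrong (-1) otherwise.
votes = np.where(rng.random((trials, n)) < p, 1, -1)
maj_wrong = (votes.sum(axis=1) < 0).mean()   # n is odd, so no ties occur

print("empirical Pr[Maj wrong]:", maj_wrong)
print("Hoeffding bound:        ", np.exp(-2 * n * (p - 0.5) ** 2))
\end{verbatim}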
The purpose of this problem is to study the behavior of Maj(x) when not all of the classifiers {f1, . . . , fn}
are independent.
(a) Assume n is divisible by 7 and 5n/7 is odd, and that of the n classifiers {f1, . . . , fn}, now only the first
5n/7 of them (i.e., {f1, . . . , f5n/7}) have independent errors on x. Specifically, Pr[fi(x) = y] = p := 4/5
for classifiers {f1, . . . , f5n/7}. By contrast, we make no assumption on the other 2n/7 classifiers (i.e.,
{f5n/7+1, . . . , fn}) and their errors. Now use Hoeffding’s inequality to show that
\[
\Pr\left[ \sum_{i=1}^{5n/7} \mathbf{1}[f_i(x) \ne y] \ge \frac{3n}{14} \right] \le \exp\left( -\frac{n}{70} \right).
\]
(b) Continuing from (a), further show that the majority vote classifier over all n classifiers is still good,
specifically showing
\[
\Pr\left[ \mathrm{Maj}(x; f_1, \ldots, f_n) \ne y \right] \le \exp\left( -\frac{n}{70} \right).
\]
Note: You need to derive the inequality Pr[Maj(x; f_1, . . . , f_n) ≠ y] ≤ exp(−n/70) rigorously for ANY
possible behavior of the 2n/7 arbitrary classifiers.
(c) Is the probability of correctly classifying x reasonably good in part (b) for large n? Do you have any
interesting observations? Any answer which contains at least one complete sentence will receive full
credit.
Solution.
4. LLM Use and Other Sources.
Please document, in detail, all your sources, including LLMs, friends, internet resources, etc. For example:
1a. I asked my friend, then I found a different way to derive the same solution.
1b. ChatGPT 5 solved the problem in one shot, but then I rewrote it once on paper, and a few days later
tried to re-derive an answer from scratch.
1c. I accidentally found this via a google search, and had trouble forgetting the answer I found, but still
typed it from scratch without copy-paste.
1d. ...
⋮
4. I used my pico-llm to write this answer.
Solution.


