## Description

1. (30 points) Consider the following 2 sets of points in the plane:

A : (0, 2)^T, (−1, 1)^T        B : (−1, 2)^T, (−2, 3)^T

[Figure: scatter plot of the points in A and B with the Origin marked; both axes range from −3 to 3.]

(a) What is the first principal component w1 (use the unbiased estimate of the covariance)? Draw the first principal component direction w1 on the plot, anchored at the origin.
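Hand calculations for part (a) can be cross-checked numerically. The sketch below is an illustration only (not the required derivation): it assumes the principal component is computed over all four points pooled, relies on NumPy's default unbiased covariance (divisor N − 1), and takes the eigenvector of the largest eigenvalue. The sign of w1 is arbitrary.

```python
import numpy as np

# The four sample points from the problem, pooled across both sets.
X = np.array([[0.0, 2.0], [-1.0, 1.0],    # set A
              [-1.0, 2.0], [-2.0, 3.0]])  # set B

S = np.cov(X.T)                  # unbiased covariance (divides by N - 1)
vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
w1 = vecs[:, -1]                 # first principal component (sign is arbitrary)
```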

(b) Consider a more general case (not specific to the aforementioned samples): PCA performs linear dimensionality reduction with z^t = W^T x^t, where x^t ∈ R^D is the original data for the t-th sample, z^t ∈ R^d is the low-dimensional projection (d < D), and W ∈ R^{D×d} is the PCA projection matrix (each column is a principal component). Professor HighLowHigh claims that we can reconstruct the original data with v^t = W z^t, so that ∀t, v^t = x^t. Is the claim correct? Explain your answer with the necessary details (you can use formulations if it helps your explanation).
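The claim in (b) can be probed numerically: with orthonormal columns in W, the reconstruction is v^t = W W^T x^t, the orthogonal projection of x^t onto the d-dimensional principal subspace, so v^t = x^t only when x^t already lies in that subspace (in particular when d = D). A minimal sketch on synthetic data (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples in D = 5 dimensions
Xc = X - X.mean(axis=0)

# Principal directions from the eigendecomposition of the covariance.
_, vecs = np.linalg.eigh(np.cov(Xc.T))
W_full = vecs[:, ::-1]                   # columns sorted by decreasing eigenvalue

W = W_full[:, :2]                        # keep d = 2 < D components
V = Xc @ W @ W.T                         # reconstruction W z^t for every sample
err_d = np.linalg.norm(V - Xc)           # nonzero: information was lost

V_full = Xc @ W_full @ W_full.T          # with d = D the map is invertible
err_full = np.linalg.norm(V_full - Xc)   # numerically zero
```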
(c) Compute the within-class scatter matrix SW and the between-class scatter matrix SB.
(d) What is the Fisher projection direction w found by Fisher Linear Discriminant Analysis (LDA)? Normalize w to have unit length, that is, ||w||2 = 1, and draw w on the plot, anchored at the origin.
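As a cross-check for (c) and (d), the two-class scatter matrices and the Fisher direction w ∝ SW^{-1}(m1 − m2) can be computed directly. This sketch assumes the usual unnormalized scatter (sums of outer products of centered points, not covariances); the sign of w is again arbitrary.

```python
import numpy as np

A = np.array([[0.0, 2.0], [-1.0, 1.0]])
B = np.array([[-1.0, 2.0], [-2.0, 3.0]])

mA, mB = A.mean(axis=0), B.mean(axis=0)
SA = (A - mA).T @ (A - mA)           # scatter of class A
SB_cls = (B - mB).T @ (B - mB)       # scatter of class B
SW = SA + SB_cls                     # within-class scatter
diff = (mA - mB)[:, None]
SB = diff @ diff.T                   # between-class scatter

w = np.linalg.solve(SW, mA - mB)     # Fisher direction SW^{-1}(mA - mB)
w = w / np.linalg.norm(w)            # normalize to unit length
```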
Instructor: Catherine Qi Zhao. TA: Prithvi Raj Botcha, Shi Chen, Suzie Hoops, James Yang, Yifeng
Zhang. Email: csci5521.s2022@gmail.com
Note: You need to show all your calculations in order to get full credit.
2. (30 points) Given the following data points in 1D: x1 = 1, x2 = 4, x3 = 5, x4 = 6, x5 = 7, x6 = 8, x7 = 10, x8 = 12, x9 = 14, perform the k-means clustering algorithm for k = 3.
(a) Start from initial cluster centers c1 = 0, c2 = 5, c3 = 10. Show your steps for all
iterations: (1) the cluster assignments y1, · · · , y9; (2) the updated cluster centers
at the end of that iteration.
(b) How many iterations does it take for the k-means algorithm to converge (count all iterations you perform, including the final one that detects convergence)? What is the reconstruction error (i.e., the distortion measure J, Equation 9.1 of Bishop's textbook) at the end of that iteration?
(c) Repeat the above steps with initial cluster centers c1 = 2, c2 = 7, c3 = 12.
(d) How many iterations does it take for the k-means algorithm to converge in this case? What is the reconstruction error at the end of that iteration?
(e) Comparing (a) with (c), which solution is better? Why?
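The hand iterations in Problem 2 can be checked with a short 1D k-means loop. The sketch below is illustrative, not the required worked solution: it records the assignments, updates each center to its cluster mean, stops when the assignments no longer change, and reports the distortion J = Σ (xi − c_{yi})².

```python
import numpy as np

def kmeans_1d(x, centers, max_iter=100):
    """Plain k-means in 1D; returns final centers, assignments, and distortion J."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(centers, dtype=float)
    y = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        y_new = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        if y is not None and np.array_equal(y, y_new):
            break                        # assignments unchanged: converged
        y = y_new
        for k in range(len(c)):          # update each center to its cluster mean
            if np.any(y == k):
                c[k] = x[y == k].mean()
    J = np.sum((x - c[y]) ** 2)          # distortion measure
    return c, y, J

x = [1, 4, 5, 6, 7, 8, 10, 12, 14]
c_a, y_a, J_a = kmeans_1d(x, [0, 5, 10])   # initialization from part (a)
c_c, y_c, J_c = kmeans_1d(x, [2, 7, 12])   # initialization from part (c)
```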
3. (40 points) In this programming exercise you will implement the k-means and Principal Component Analysis algorithms:
(a) You will first apply the k-means algorithm (K = 8) to the provided dataset Digits089.csv. The dataset contains 3000 samples, each with 784 features. The first column of the data file contains flags used to separate training and test samples, the second column contains the class labels (i.e., 0, 8, 9), and the remaining columns store the features. Your program should initialize the centers with a set of pre-selected samples (see the template for the detailed initialization), iteratively update the center of each cluster based on the training samples, and record the reconstruction error after each iteration. After reaching convergence, it should assign a class label to each cluster by majority voting (i.e., use the most dominant label in the cluster as its class label). To estimate the quality of the clustering, you will use the clusters to classify the test samples. More specifically, you should determine the cluster for each test sample and use the class label of the selected cluster as the prediction.
Report the number of iterations for convergence and the classification accuracy on the test samples. Plot the history of reconstruction errors. Does the plot's shape follow what you expect? The code for plotting and computing the classification accuracy is included in hw2.py; you do not need to modify that file.
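The majority-voting step described above can be sketched as follows (variable names are illustrative only; the actual interface is fixed by the provided templates):

```python
import numpy as np

def label_clusters(train_labels, assignments, num_clusters):
    """Assign each cluster the most dominant training label it contains."""
    cluster_labels = np.zeros(num_clusters, dtype=int)
    for k in range(num_clusters):
        members = train_labels[assignments == k]
        if members.size > 0:
            values, counts = np.unique(members, return_counts=True)
            cluster_labels[k] = values[np.argmax(counts)]  # majority vote
    return cluster_labels

# Toy check: cluster 0 is dominated by label 8, cluster 1 by label 0.
labels = np.array([8, 8, 9, 0, 0, 0])
assign = np.array([0, 0, 0, 1, 1, 0])
cluster_labels = label_clusters(labels, assign, 2)
```

At prediction time, each test sample would inherit the label of its nearest cluster center.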
(b) Repeat the above, but use the low-dimensional data obtained from PCA. Your PCA algorithm should reduce the original training samples to the number of dimensions needed to capture > 90% of the variance. Note that you should use the means and projection matrix learned from the training samples when projecting the test samples. How many dimensions are necessary in this case? Does PCA help the clustering? Explain. (Hint: consider both the classification accuracy and the runtime of the algorithm.)
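Choosing the number of dimensions that captures > 90% of the variance amounts to finding the smallest d whose leading eigenvalues sum to more than 90% of the total. A sketch on synthetic data (a stand-in for the real features, not Digits089.csv):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with three dominant directions and seven low-variance ones.
X = rng.normal(size=(200, 10)) * np.array([10, 8, 6, 1, 1, 1, 1, 1, 1, 1])

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]   # eigenvalues, descending
ratio = np.cumsum(eigvals) / np.sum(eigvals)       # cumulative explained variance
d = int(np.searchsorted(ratio, 0.90) + 1)          # smallest d capturing >= 90%
```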

(c) Repeat (b), but using only the first principal component. Are the results better? Explain.

We have provided the skeleton code Mykmeans.py and MyPCA.py for implementing the algorithms. They follow the scikit-learn convention, where you have a fit function for model training and a predict function for generating predictions on given samples. To verify your implementation, run the main script hw2.py, which automatically generates plots of the reconstruction errors and computes the classification accuracy.
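The fit/predict convention mentioned above looks like this in outline. This is a generic illustration with a toy estimator; the exact class names and signatures are dictated by the provided skeleton files, not by this sketch.

```python
import numpy as np

class NearestCenterModel:
    """Toy estimator in the fit/predict style used by the skeletons."""

    def fit(self, X, y):
        # "Training" here just stores one center per class.
        self.classes_ = np.unique(y)
        self.centers_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Predict the class whose stored center is nearest to each sample.
        d = np.linalg.norm(X[:, None, :] - self.centers_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

model = NearestCenterModel().fit(np.array([[0.0], [1.0], [10.0], [11.0]]),
                                 np.array([0, 0, 9, 9]))
pred = model.predict(np.array([[0.5], [10.5]]))
```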

## Submission

• Things to submit:

1. hw2_sol.pdf: a document containing all your answers to the written questions (including the answers and plots for Problem 3).

2. Mykmeans.py: a Python source file containing your implementation of the k-means algorithm with the header class Kmeans. Use the skeleton file Mykmeans.py found with the data on the class web site, and fill in the missing parts. The fit function should take the training samples and training labels as inputs, initialize the centers, update the model parameters, and record the reconstruction errors. It should also assign a class label to each cluster, and return the number of iterations for convergence as well as the history of reconstruction errors. The predict function should take the test samples as inputs and predict their class labels based on the clustering results. The compute_error function should take the samples and the cluster assignment (a vector specifying the cluster ID assigned to each sample) as inputs, and return the reconstruction error for the current assignment.

3. MyPCA.py: a Python source file containing your implementation of the Principal Component Analysis algorithm with the header class PCA. Given the original samples and an optional parameter num_dim as inputs, the fit function should determine the projection matrix based on the training samples, and return the low-dimensional projection as well as the dimension of the features. If num_dim is specified during initialization of the class, it should project the samples to d-dimensional data, where d = num_dim. Otherwise, it should use the number of dimensions that captures > 90% of the variance. The predict function should return the low-dimensional projection of the test samples based on the projection matrix and means learned from the training samples.

• Submit: All material must be submitted electronically via Gradescope. Note that there are two entries for the assignment, i.e., Hw2-Written (for hw2_sol.pdf) and Hw2-Programming (for a zipped file containing the Python code); please submit your files accordingly. We will grade the assignment with vanilla Python; code submitted as iPython notebooks will not be graded.
