Description
• You must implement the computation of the projections and discriminant functions yourself.
The Digits089.csv dataset consists of 3000 data points, each representing an image of one of the
three digits 0, 8, 9. Each row in the file is one data sample consisting of 786 numbers in CSV
format:
    flag      label    784 pixel values
    (1,…,5)   (0,8,9)  (values 0,…,255)
Separate the data into three parallel arrays: one with the flag values, one with the labels, and
one with the pixel values representing 28 × 28 images of the digits 0, 8, 9. Use the entries with
flags 1, 2, 3, 4 as training data and the samples with flag value 5 as the test set.
1. (10 pts) Consider the training set extracted from this dataset for unsupervised dimensionality
reduction using PCA: implement PCA and project all training points onto the first 2 principal
components. Display a plot of the projected points, distinguishing the three classes (0, 8, 9)
with different plot symbols and colors.
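One possible PCA implementation uses the eigendecomposition of the sample covariance matrix. The sketch below assumes the hypothetical Xtrain/ytrain arrays from the loading sketch above:

    % Sketch: PCA by eigendecomposition of the 784-by-784 covariance matrix.
    mu = mean(Xtrain, 1);                      % mean image, 1-by-784
    Xc = Xtrain - mu;                          % centered data
    [V, D] = eig(cov(Xc));                     % eigenvectors/eigenvalues
    [evals, order] = sort(diag(D), 'descend'); % largest variance first
    V = V(:, order);
    Z = Xc * V(:, 1:2);                        % scores on the first 2 PCs

    % One symbol/color per class.
    figure; hold on;
    digits = [0 8 9]; marks = {'ro', 'gs', 'b^'};
    for c = 1:3
        sel = (ytrain == digits(c));
        plot(Z(sel,1), Z(sel,2), marks{c});
    end
    legend('0','8','9'); xlabel('PC 1'); ylabel('PC 2');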
2. (30 pts) Consider the same dataset (training data only) for supervised dimensionality reduction
using LDA: In the steps below, consider only those points corresponding to 8 and 9.
Apply PCA to the training samples and project onto the first 2 principal components. Then,
apply LDA to project onto 1 dimension.
Use the LDA projection to build a classifier that minimizes the error rate on the 1D projection
from LDA. That is, determine the separator using only the training set, and then report the
confusion matrix for the training and test sets separately. The 2 × 2 confusion matrix contains
(number of 8’s classed as 8, number of 8’s classed as 9, number of 9’s classed as 8, number of
9’s classed as 9).
Plot the first two principal components for just the 8’s and 9’s, and include the line
separating the two classes computed from the LDA projection.
Also show the images of the “most incorrectly” classified examples. That is, show the image
of the 8 test sample whose LDA projection lies farthest in the 9 direction and the image of
the 9 test sample whose LDA projection lies farthest in the 8 direction.
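A sketch of one way to carry out the two-class Fisher LDA step; Z and y are hypothetical names for the N × 2 PCA scores of the 8/9 training points and their labels:

    % Sketch: Fisher LDA direction and a minimum-training-error threshold.
    Z8 = Z(y==8,:);  Z9 = Z(y==9,:);
    m8 = mean(Z8,1)';  m9 = mean(Z9,1)';                 % class means
    Sw = (Z8-m8')'*(Z8-m8') + (Z9-m9')'*(Z9-m9');        % within-class scatter
    w  = Sw \ (m8 - m9);  w = w / norm(w);               % Fisher direction
    p  = Z * w;                                          % 1-D projections

    % Scan midpoints of sorted projections for the best threshold, trying
    % both orientations (8's above or below); keep the winning orientation.
    ps  = sort(p);  thr = (ps(1:end-1) + ps(2:end)) / 2;
    err = arrayfun(@(t) min(mean((p > t) ~= (y==8)), ...
                            mean((p <= t) ~= (y==8))), thr);
    [~, i] = min(err);  tbest = thr(i);

In the 2D PCA plane the resulting decision boundary is the set of points z with w'z = tbest, which can be drawn over the scatter plot.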
3. (30 pts) Repeat the previous LDA classification, but this time use enough principal components
to capture 90% of the variance in the training data. You can compute the needed number of
components separately and just report the number.
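If the PCA eigenvalues are already sorted in descending order (evals in the sketch from problem 1), the needed count can be found as, for example:

    % Sketch: smallest l whose leading components explain >= 90% of the variance.
    ratio = cumsum(evals) / sum(evals);
    l90 = find(ratio >= 0.90, 1);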
4. (30 pts) Implement the K-nearest-neighbor algorithm and apply it to the same dataset as in
the previous problem, using the number of principal components needed to capture 90% of the
variance. Try k = 1, 3, 5, 7, 9 neighbors and choose the k yielding the lowest error rate on the
training set. Then report the resulting confusion matrices for both the training and test sets
using that best k. Also show the images of two misclassified test samples of your choice, one
a misclassified 8 and one a misclassified 9.
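A brute-force k-NN sketch in PCA space that avoids toolbox dependencies; Ztrain, ytrain, and the query matrix Zq are hypothetical names, and the file would be saved as knnPredict.m:

    % Sketch: brute-force k-nearest-neighbor prediction.
    function yhat = knnPredict(Ztrain, ytrain, Zq, k)
        % Squared Euclidean distances, Nq-by-Ntr, via the expansion
        % |a-b|^2 = |a|^2 + |b|^2 - 2 a.b (implicit expansion, R2016b+).
        D = sum(Zq.^2, 2) + sum(Ztrain.^2, 2)' - 2 * (Zq * Ztrain');
        [~, idx] = sort(D, 2);           % nearest neighbors first
        nn = ytrain(idx(:, 1:k));        % Nq-by-k neighbor labels
        yhat = mode(nn, 2);              % majority vote (odd k avoids ties)
    end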
5. (Extra Items – to explore further analysis)
• Use training examples from all three digits to learn a classifier and show results on the
test set.
• Adjust the parameters (number of dimensions in PCA, number of neighbors in KNN)
using 4-way cross-validation on the training set; a sketch follows this list.
• Try PCA, LDA and/or KNN on the entire data set, available at
“https://pjreddie.com/projects/mnist-in-csv/”. Note: the data at this web site
stores one label and the 784 = 28 × 28 pixel values per row; there is no flag entry.
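For the cross-validation item, here is a sketch of 4-way cross-validation jointly over the PCA dimension l and the neighbor count k; Xtr and ytr are hypothetical names for the training pixels and labels, and knnPredict is the sketch from problem 4:

    % Sketch: 4-fold cross-validation over (l, k) on the training set.
    folds = mod((0:size(Xtr,1)-1)', 4) + 1;   % deterministic folds; a random
                                              % permutation is usually better
    bestErr = inf;
    for l = [10 20 40 80]                     % candidate PCA dimensions
        for k = [1 3 5 7 9]                   % candidate neighbor counts
            errs = zeros(1, 4);
            for f = 1:4
                tr = folds ~= f;  va = folds == f;
                mu = mean(Xtr(tr,:), 1);      % fit PCA on the 3 kept folds only
                [V, D] = eig(cov(Xtr(tr,:)));
                [~, order] = sort(diag(D), 'descend');
                V = V(:, order);
                Ztr = (Xtr(tr,:) - mu) * V(:, 1:l);
                Zva = (Xtr(va,:) - mu) * V(:, 1:l);
                yhat = knnPredict(Ztr, ytr(tr), Zva, k);
                errs(f) = mean(yhat ~= ytr(va));
            end
            if mean(errs) < bestErr
                bestErr = mean(errs);  bestL = l;  bestK = k;
            end
        end
    end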
Instructions
Follow the rules strictly. All code must be written in MATLAB. If we cannot run your
code, you get 0 points.
• Things to submit
1. hw2_sol.pdf: A document which contains the solutions to Problems 1, 2, and 3, including
a summary of methods and results, the PCA plot from problem 1, and the LDA/PCA
plot from problem 2. The front page of the PDF file should have the names and UMN email
addresses of the student(s) submitting the document. Also include any experiments and
results you carry out in problem 4.
The following is to be zipped into a single ZIP file:
2. myLDA.m, starting with function [Projection, classification]=myLDA('filename',l);
where filename is the name of the file containing the data, Projection is a 2000×2 matrix
of the projections of all the 8’s and 9’s onto the first two principal components (exactly what
is plotted in the 2D plot), classification is the predicted label, 8 or 9, from the LDA
classifier, and l is the number of principal components to keep for the LDA step. You can
compute the number of principal components needed to capture 90% of the variance separately.
The PCA can be computed within the myLDA function or in a separate function,
as you prefer, but be sure to identify the code computing the PCA with comments.
3. myKNN.m, starting with function [classification]=myKNN('filename',l,k);
where classification is the predicted label, 8 or 9, from the KNN classifier, l is the
number of principal components, and k is the number of neighbors to use in the KNN
classifier.
4. Any other files, except the data, which are necessary for your code.