Description
Introduction
Machine learning techniques have been applied to a variety of image interpretation problems. In
this project, you will investigate facial recognition, which can be treated as a clustering problem
(“separate these pictures of Joe and Mary”).
For this project, we will use a small part of a huge database of faces of famous people (Labeled Faces
in the Wild [LFW] people dataset1
). The images have already been cropped out of the original
image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally,
we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of
1850 “raw” features). Our dataset has a total of 1867 images of 19 different people. You will
apply dimensionality reduction using principal component analysis (PCA) and explore clustering
methods such as k-means and k-medoids to the problem of facial recognition on this dataset.
Download the starter files from the course website. It contains the following source files:
• util.py – Utility methods for manipulating data, including through PCA.
• cluster.py – Code for the Point, Cluster, and ClusterSet classes, on which you will build
the clustering algorithms.
• faces.py – Main code for the project.
Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage
you to include your own additional methods and functions. However, you are not allowed to use
any scikit-learn classes or functions other than those already imported in the skeleton code.
1 PCA and Image Reconstruction [4 pts]
Before attempting automated facial recognition, you will investigate a general problem with images.
That is, images are typically represented as thousands (in this project) to millions (more generally)
of pixel values, and a high-dimensional vector of pixels must be reduced to a reasonably lowdimensional vector of features.
(a) As always, the first thing to do with any new dataset is to look at it. Use get_lfw_data(…)
to get the LFW dataset with labels, and plot a couple of the input images using show_image(…).
Then compute the mean of all the images, and plot it. (Remember to include all requested
images in your writeup.) Comment briefly on this “average” face.
(b) Perform PCA on the data using util.PCA(…). This function returns a matrix U whose
columns are the principal components, and a vector mu which is the mean of the data. If
you want to look at a principal component (referred to in this setting as an eigenface), run
show_image(vec_to_image(v)), where v is a column of the principal component matrix.
(This function will scale vector v appropriately for image display.) Show the top twelve
eigenfaces:
plot_gallery([vec_to_image(U[:,i]) for i in xrange(12)])
Comment briefly on your observations. Why do you think these are selected as the top
eigenfaces?
1
https://vis-www.cs.umass.edu/lfw/
2
(c) Explore the effect of using more or fewer dimensions to represent images. Do this by:
• Finding the principal components of the data
• Selecting a number l of components to use
• Reconstructing the images using only the first l principal components
• Visually comparing the images to the originals
To perform PCA, use apply_PCA_from_Eig(…) to project the original data into the lowerdimensional space, and then use reconstruct_from_PCA(…) to reconstruct high-dimensional
images out of lower dimensional ones. Then, using plotGallery(…), submit a gallery of the
first 12 images in the dataset, reconstructed with l components, for l = 1, 10, 50, 100, 500, 1288.
Comment briefly on the effectiveness of differing values of l with respect to facial recognition.
We will revisit PCA in the last section of this project.
2 K-Means and K-Medoids [16 pts]
Next, we will explore clustering algorithms in detail by applying them to a toy dataset. In particular,
we will investigate k-means and k-medoids (a slight variation on k-means).
(a) In k-means, we attempt to find k cluster centers µj ∈ R
d
, j ∈ {1, . . . , k} and n cluster
assignments c
(i) ∈ {1, . . . , k}, i ∈ {1, . . . , n}, such that the total distance between each
data point and the nearest cluster center is minimized. In other words, we attempt to find
µ1
, . . . , µk and c
(1), . . . , c(n)
that minimizes
J(c, µ) = Xn
i=1
||x
(i) − µc
(i) ||2
.
To do so, we iterate between assigning x
(i)
to the nearest cluster center c
(i) and updating
each cluster center µj
to the average of all points assigned to the j
th cluster.
Instead of holding the number of clusters k fixed, one can think of minimizing the objective
function over µ, c, and k. Show that this is a bad idea. Specifically, what is the minimum
possible value of J(c, µ, k)? What values of c, µ, and k result in this value?
(b) To implement our clustering algorithms, we will use Python classes to help us define three
abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read
through the documentation for these classes. (You will be using these classes later, so make
sure you know what functionality each class provides!) Some of the class methods are already
implemented, and other methods are described in comments. Implement all of the methods
marked TODO in the Cluster and ClusterSet classes.
(c) Next, implement random_init(…) and kMeans(…) based on the provided specifications.
3
(d) Now test the performance of k-means on a toy dataset.
Use generate_points_2d(…) to generate three clusters each containing 20 points. (You
can modify generate_points_2d(…) to test different inputs while debugging your code,
but be sure to return to the initial implementation before creating any plots for submission.)
You can plot the clusters for each iteration using the plot_clusters(…) function.
In your writeup, include plots for the k-means cluster assignments and corresponding cluster
“centers” for each iteration when using random initialization.
(e) Implement kMedoids(…) based on the provided specification.
Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code
to use a helper function kAverages(points, k, average, init=’random’, plot=True),
where average is a method that determines how to calculate the average of points in a
cluster (so it can take on values ClusterSet.centroids or ClusterSet.medoids).2
As before, include plots for k-medoids clustering for each iteration when using random initialization.
(f) Finally, we will explore the effect of initialization. Implement cheat_init(…).
Now compare clustering by initializing using cheat_init(…). Include plots for k-means
and k-medoids for each iteration.
3 Clustering Faces [12 pts]
Finally (!), we will apply clustering algorithms to the image data. To keep things simple, we will
only consider data from four individuals. Make a new image dataset by selecting 40 images each
from classes 4, 6, 13, and 16, then translate these images to (labeled) points: 3
X1, y1 = util.limit_pics(X, y, [4, 6, 13, 16], 40)
points = build_face_image_points(X1, y1)
(a) Apply k-means and k-medoids to this new dataset with k = 4 and initializing the centroids
randomly. Evaluate the performance of each clustering algorithm by computing the average
cluster purity with ClusterSet.score(…). As the performance of the algorithms can vary
widely depending upon the initialization, run both clustering methods 10 times and report
the average, minimum, and maximum performance.
average min max
k-means
k-medoids
How do the clustering methods compare in terms of clustering performance and runtime?
2
In Python, if you have a function stored to the variable func, you can apply it to parameters arg by callling
func(arg). This works even if func is a class method and arg is an object that is an instance of the class.
3There is a bug in fetch lfw version 0.18.1 where the results of the loaded images are not always in the same order.
This is not a problem for the previous parts but can affect the subset selected in this part. Thus, you may see varying
results. Results that show the correct qualitative behavior will get full credit.
4
Now construct another dataset by selecting 40 images each from two individuals 4 and 13.
(b) Explore the effect of lower-dimensional representations on clustering performance. To do
this, compute the principal components for the entire image dataset, then project the newly
generated dataset into a lower dimension (varying the number of principal components), and
compute the scores of each clustering algorithm.
So that we are only changing one thing at a time, use init=’cheat’ to generate the same initial set of clusters for k-means and k-medoids. For each value of l, the number of principal components, you will have to generate a new list of points using build_face_image_points(…).)
Let l = 1, 3, 5, . . . , 41. The number of clusters K = 2. Then, on a single plot, plot the
clustering score versus the number of components for each clustering algorithm (be sure to
label the algorithms). Discuss the results in a few sentences.
Some pairs of people are more similar to one another and some more different.
(c) Experiment with the data to find a pair that clustering can discriminate very well and another
pair that it finds very difficult (assume you have 40 images for each individual). Describe your
methodology (you may choose any of the clustering algorithms you implemented). Report the
two pairs in your writeup (display the pairs of images using plot_representative_images),
and comment briefly on the results.
5