For this project, you will build several neural networks of
varying depth, and use them for handwritten digit recognition
and face recognition. You may import things like numpy and
matplotlib, but the idea is to implement things “from
scratch”: you may not import libraries that will do your work
for you, unless otherwise specified. You should only use
PyTorch for face recognition.
The input: digits
For parts 1-6 you will work with the MNIST dataset. The dataset is available in easy-to-read-in format here, with the digits already
separated into a training and a test set. I recommend you divide the data by 255.0 (note: the .0 is important) so that it’s in the range 0..1.
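A minimal sketch of the scaling step, assuming the images arrive as a NumPy array of unsigned 8-bit integers. The array `raw` below is made-up stand-in data, not the actual dataset:

```python
import numpy as np

# Hypothetical stand-in for two flattened images with pixel values 0..255.
raw = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)

# Dividing by the float 255.0 forces floating-point division in both
# Python 2 and Python 3, so every value ends up in the range [0, 1].
scaled = raw / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```

In Python 2, dividing an integer array by the integer 255 would perform integer division and collapse almost every pixel to 0, which is why the .0 matters.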
Handout code: digits
We are providing some code here (data: snapshot50.pkl). In Python 3, you need to read the data using
pickle.load(open("snapshot50.pkl", "rb"), encoding="latin1").
The code provides a partial implementation of a single-hidden-layer neural network with tanh activation functions in the hidden layer
that classifies the digits. The code is not meant to be run as is. We are simply providing some functions and code snippets to make your job easier.
Part 1 (2 pts)
Describe the dataset of digits. In your report, include 10 images of each of the digits. You may find matplotlib’s subplot useful for
displaying multiple images at once.
Part 2 (3 pts)
Implement a function that computes the network below using NumPy (do not use PyTorch for parts 1-6).
The outputs $o_i$ here should simply be linear combinations of the inputs $x_j$ (that is, the activation function in the output layer is the identity).
Specifically, use $o_i = \sum_j w_{ji} x_j + b_i$. Include the listing of your implementation (i.e., the source code for the function; options for how
to do that in LaTeX are here) in your report for this Part.
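The following is one possible sketch of such a function, not the required implementation. It assumes that each column of `x` is one flattened image, that `W` has shape (number of inputs, number of outputs), and that the linear outputs are passed through a softmax to obtain probabilities. The array layout and the names `softmax` and `forward` are assumptions made for illustration:

```python
import numpy as np

def softmax(o):
    """Column-wise softmax; subtracting the max avoids overflow in exp."""
    e = np.exp(o - o.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(x, W, b):
    """Compute class probabilities for a batch of inputs.

    x: (n_inputs, n_cases), W: (n_inputs, n_outputs), b: (n_outputs, 1)
    """
    o = np.dot(W.T, x) + b   # linear outputs (identity activation)
    return softmax(o)        # assumed: outputs are turned into probabilities

# With all-zero weights every class is equally likely.
p = forward(np.ones((784, 5)), np.zeros((784, 10)), np.zeros((10, 1)))
print(p.shape, p[:, 0])
```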
Part 3 (10 pts)
We would like to use the sum of the negative log-probabilities of all the training cases as the cost function.
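As a rough illustration of this cost, the sketch below assumes the network's predictions are stored as a (classes x cases) array of probabilities `p` and the targets as a one-hot array `y` of the same shape. Both the layout and the name `nll_cost` are assumptions:

```python
import numpy as np

def nll_cost(p, y):
    """Sum over all cases of the negative log-probability of the correct class.

    p: (n_classes, n_cases) predicted probabilities
    y: (n_classes, n_cases) one-hot targets
    """
    return -np.sum(y * np.log(p))

# Two made-up cases: the correct classes get probability 0.5 and 0.75.
p = np.array([[0.5, 0.25],
              [0.5, 0.75]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
cost = nll_cost(p, y)
```

Here the cost equals -(log 0.5 + log 0.75), since only the entries selected by the one-hot targets contribute.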
Part 3(a) (5 pts)
Compute the gradient of the cost function with respect to the weight $w_{ij}$. Justify every step. You may refer to Slide 7 of the One-Hot
Encoding lecture, but note that you need to justify every step there, and that your cost function is the sum over all the training examples.
Part 3(b) (5 pts)
Write vectorized code that computes the gradient of the cost function with respect to the weights and biases of the network. Check that
the gradient was computed correctly by approximating the gradient at several coordinates using finite differences. Include the code for
computing the gradient in vectorized form in your report.
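One way to set up the finite-difference check is sketched below on a toy function whose gradient is known exactly. The helper name `finite_difference`, the step size `h`, and the use of a central (rather than one-sided) difference are choices made for this illustration; in your own check, `f` would be the network's cost as a function of one parameter array:

```python
import numpy as np

def finite_difference(f, theta, idx, h=1e-5):
    """Central-difference estimate of the partial derivative df/dtheta[idx]."""
    step = np.zeros_like(theta)
    step[idx] = h
    return (f(theta + step) - f(theta - step)) / (2 * h)

# Toy check: f(theta) = sum(theta**2) has the exact gradient 2 * theta.
f = lambda theta: np.sum(theta ** 2)
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
analytic = 2 * theta

for idx in [(0, 0), (0, 1), (1, 1)]:
    approx = finite_difference(f, theta, idx)
    print(idx, analytic[idx], approx)
```

Comparing a handful of coordinates like this is usually enough to catch indexing or sign errors in the vectorized gradient.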
Part 4 (10 pts)
Train the neural network you constructed using gradient descent (without momentum). Plot the learning curves. Display the weights
going into each of the output units. Describe the details of your optimization procedure – specifically, state how you initialized the
weights and what learning rate you used.
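The basic update loop might look like the sketch below, shown on a toy quadratic rather than on the network itself. The function name `grad_descent`, the learning rate, and the number of steps are illustrative assumptions; recording the cost at every step gives the data for a learning curve:

```python
import numpy as np

def grad_descent(f, grad_f, w0, alpha, n_steps):
    """Vanilla gradient descent; returns final weights and the cost history."""
    w = w0.copy()
    costs = [f(w)]                 # cost before any update
    for _ in range(n_steps):
        w -= alpha * grad_f(w)     # step against the gradient
        costs.append(f(w))         # one point of the learning curve
    return w, costs

# Toy problem (an assumption for illustration): minimize sum((w - 3)^2).
f = lambda w: np.sum((w - 3.0) ** 2)
grad_f = lambda w: 2.0 * (w - 3.0)
w_final, costs = grad_descent(f, grad_f, np.zeros(2), alpha=0.1, n_steps=100)
print(w_final, costs[0], costs[-1])
```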
Part 5 (5 pts)
In Part 4, we used “vanilla” gradient descent, without momentum. Write vectorized code that performs gradient descent with momentum,
and use it to train your network. Plot the learning curves. Describe how your new learning curves compare with gradient descent without
momentum. In your report, include the new code that you wrote in order to use momentum.
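For reference, one common form of the momentum update is sketched here on a toy quadratic. It keeps a velocity `v` that accumulates past gradients. The name `grad_descent_momentum`, the momentum coefficient `gamma`, and the specific values used are assumptions for illustration, and other formulations of momentum exist:

```python
import numpy as np

def grad_descent_momentum(grad_f, w0, alpha, gamma, n_steps):
    """Gradient descent with classical momentum."""
    w = w0.copy()
    v = np.zeros_like(w)                    # velocity starts at zero
    for _ in range(n_steps):
        v = gamma * v + alpha * grad_f(w)   # decay old velocity, add new gradient
        w -= v                              # move along the velocity
    return w

# Toy problem (an assumption for illustration): minimize sum((w - 3)^2).
grad_f = lambda w: 2.0 * (w - 3.0)
w_final = grad_descent_momentum(grad_f, np.zeros(2), alpha=0.1, gamma=0.9,
                                n_steps=500)
print(w_final)
```

Setting `gamma` to 0 recovers vanilla gradient descent, which makes it easy to compare the two from the same code path.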
Part 6 (15 pts)
In this Part, you will produce a demonstration of gradient descent working better when momentum is used. From your trained network
from Part 5, you should pick two weights. Denote them $w_1$ and $w_2$. You will be keeping all other weights constant except $w_1$ and $w_2$. To
get reasonable results, we recommend that you pick weights associated with the center of the digits, and not along the edges. This is
because pixels too close to the edge will likely be black for all digits, and therefore not interesting.
The visualizations you produce in Parts 6(b) and 6(c) should demonstrate the benefits of using momentum.
Part 6(a) (4 pts)
Produce a contour plot of the cost function as a function of the two weights $w_1$ and $w_2$, which are allowed to vary around the values
that you obtained in Part 5. Label your axes.
Part 6(b) (2 pts)
Re-initialize $w_1$ and $w_2$ to values away from the local optimum. Keeping all other weights constant, learn $w_1$ and $w_2$ by taking K steps
using vanilla gradient descent (without momentum). Plot the trajectory. You may wish to increase your learning rate from earlier parts so
that the number of steps K is not large (say, 10-20).
Part 6(c) (2 pts)
Repeat the experiment, resetting $w_1$ and $w_2$ to the same initial values you used in part (b). Now, take K steps using gradient descent with
momentum. Plot the trajectory. You do not have to use the same learning rate as in part (b).
Part 6(d) (2 pts)
Describe any differences between the two trajectories. Provide an explanation of what caused the differences.
Part 6(e) (5 pts)
As stated above, you needed to find appropriate settings and weights $w_1$ and $w_2$ such that the benefits of using momentum are clear.
Describe how you found those. Find settings that do not demonstrate the benefits of using momentum, and explain why your settings
from Parts 6(b) and 6(c) work for producing a good visualization, while the ones you found for this Part do not. You may use our hint
from the description of Part 6.
The pseudocode below demonstrates an example of how to plot trajectories on contour plots. We also included an example of what the
contour and trajectories might look like. Your visualization does not have to look like this. Base your explanation on what you see in
your own visualization.
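import numpy as np
import matplotlib.pyplot as plt

# Recorded (w1, w2) pairs, one per step; fill these in from your own runs.
gd_traj = [(init_w1, init_w2), (step1_w1, step1_w2), ...]
mo_traj = [(init_w1, init_w2), (step1_w1, step1_w2), ...]
w1s = np.arange(0, 1, 0.05)
w2s = np.arange(0, 1, 0.05)
w1z, w2z = np.meshgrid(w1s, w2s)
# meshgrid returns arrays of shape (w2s.size, w1s.size), so C is indexed [j, i].
C = np.zeros([w2s.size, w1s.size])
for i, w1 in enumerate(w1s):
    for j, w2 in enumerate(w2s):
        C[j, i] = get_loss(w1, w2)  # get_loss: your cost as a function of w1, w2
CS = plt.contour(w1z, w2z, C)
plt.plot([a for a, b in gd_traj], [b for a, b in gd_traj], 'yo-', label="No Momentum")
plt.plot([a for a, b in mo_traj], [b for a, b in mo_traj], 'go-', label="Momentum")
plt.xlabel("w1")
plt.ylabel("w2")
plt.legend()
plt.show()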
Part 7 (10 pts)
Backpropagation can be seen as a way to speed up the computation of the gradient. For a network with $N$ layers, each of which contains
$K$ neurons, determine how much faster (fully-vectorized) backpropagation is compared to computing the gradient with respect to each
weight individually, without “caching” any intermediate results. Assume that all the layers are fully-connected. Show your work. Make
any reasonable assumptions (e.g., about how fast matrix multiplication can be performed), and state what assumptions you are making.
Hint: There are two ways you can approach this question, and only one is required. You may analyze the algorithms to describe their
limiting behaviour (e.g. big O). Alternatively, you may run experiments to analyze the algorithms empirically.
Part 8 (20 pts)
We have seen PyTorch code to train a single-hidden-layer fully-connected network in this tutorial. Modify the code to classify faces of the
6 actors in Project 1. Continue to use a fully-connected neural network with a single hidden layer, but train using mini-batches using an
optimizer of your choice. In your report, include the learning curve for the training and validation sets, and the final classification
performance on the test set. Include a text description of your system. In particular, describe how you preprocessed the input and
initialized the weights, what activation function you used, and what the exact architecture of the network that you selected was. Report
the resolution (width and height in pixels) of the face images you ended up using. Experiment with different settings to produce the best
performance, and report what you did to obtain the best performance.
Use 20 images per actor in the test set.
Unlike in Project 1, you must remove non-faces from your dataset. Use the SHA-256 hashes to remove bad images. You may
additionally hardcode the indices of the images that you’d like to remove.
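A hash check of a downloaded file could be sketched as follows, using only Python's standard `hashlib` module. The helper names `sha256_of_file` and `is_valid_image`, and the idea that the expected hash is available as a hexadecimal string, are assumptions made for illustration:

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_valid_image(path, expected_hash):
    """True if the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hash.strip().lower()
```

Files whose digest does not match the expected hash can then be skipped or deleted before the dataset is built.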
Part 9 (5 pts)
Select two of the actors, and visualize the weights of the hidden units that are useful for classifying input photos as those particular
actors. Explain how you selected the hidden units.
A sample visualization is below.
Part 10 (20 pts)
PyTorch comes with an implementation of AlexNet, along with its weights. Extract the values of the activations of AlexNet on the face
images in a particular layer (e.g., Conv4). In your report, explain how you did that. Use those as features in order to perform face
classification: learn a fully-connected neural network that takes in the activations of the units in the AlexNet layer as inputs, and outputs
the name (or index, or one-hot encoding – up to you) of the actor. In your report, include a description of the system you built and its
performance, similarly to Part 8. It is possible to improve on the results of Part 8 by reducing the error rate by at least 30%. We
recommend starting out with only using the conv4 activations.
PyTorch’s implementation allows you to fairly easily access the layer just before the fully connected layer. You can modify that
implementation to access any layer. We include a slightly-modified version of PyTorch’s AlexNet implementation here:
Note that you are asked to only train the (newly-constructed) fully-connected layers on top of the layer that you extracted. In your report,
state how you accomplish that.
What to submit
The project should be implemented using Python 2 or 3 and should be runnable on the CS Teaching Labs computers. Your report should
be in PDF format. You should use LaTeX to generate the report, and submit the .tex file as well. A sample template is on the course
website. You will submit at least the following files: faces.py, digits.py, deepfaces.py, deepnn.tex, and deepnn.pdf. You may
submit more files as well. You may submit ipynb files in place of py files.
Reproducibility counts! We should be able to obtain all the graphs and figures in your report by running your code. The only exception is
that you may pre-download the images (what and how you did that, including the code you used to download the images, should be
included in your submission.) Submissions that are not reproducible will not receive full marks. If your graphs/reported numbers cannot
be reproduced by running the code, you may be docked up to 20%. (Of course, if the code is simply incomplete, you may lose even
more.) Suggestion: if you are using randomness anywhere, use numpy.random.seed().
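A small illustration of the seeding suggestion: fixing the seed before drawing random numbers makes the same sequence come out on every run. The seed value 0 is an arbitrary choice for this sketch:

```python
import numpy as np

np.random.seed(0)             # any fixed integer works; 0 is arbitrary
first = np.random.randn(3)    # e.g. a random weight initialization

np.random.seed(0)             # re-seeding replays the same sequence
second = np.random.randn(3)

print(np.array_equal(first, second))
```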
You must use LaTeX to generate the report. LaTeX is the tool used to generate virtually all technical reports and research papers in
machine learning, and students report that after they get used to writing reports in LaTeX, they start using LaTeX for all their course
reports. In addition, using LaTeX facilitates the production of reproducible results.
You are free to use any of the code available from the CSC411 course website.
Readability counts! If your code isn’t readable or your report doesn’t make sense, they are not that useful. In addition, the TA can’t read
them. You will lose marks for those things.
It is perfectly fine to discuss general ideas with other people, if you acknowledge ideas in your report that are not your own. However,
you must not look at other people’s code, or show your code to other people, and you must not look at other people’s reports and
derivations, or show your report and derivations to other people. All of those things are academic offences.