1 Collaborative Filtering [60 points]
• Read the attached paper on Empirical Analysis of Predictive Algorithms for Collaborative
Filtering. You need to read up to Section 2.1, and you are encouraged to read further if you have time.
• The dataset we will be using is a subset of the movie ratings data from the Netflix Prize.
You need to download it via Elearning. It contains a training set, a test set, a movies file,
a dataset description file, and a README file. The training and test sets are both subsets
of the Netflix training data.
You will use the ratings provided in the training set to predict
those in the test set. You will compare your predictions with the actual ratings provided in
the test set. The evaluation metrics you need to use are the Mean Absolute Error and the
Root Mean Squared Error.
The dataset description file further describes the dataset and
will help you get started. The README file is from the original set of Netflix files and has
additional information about the original data.
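As a reference for the evaluation step, here is a minimal numpy sketch of the two required metrics; the rating arrays are made up purely for illustration, and in your program they would come from the test set and your predictions:

```python
import numpy as np

# Hypothetical actual and predicted ratings, for illustration only.
actual = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
predicted = np.array([3.5, 3.0, 4.0, 2.5, 2.0])

# Mean Absolute Error: average magnitude of the prediction errors.
mae = np.mean(np.abs(predicted - actual))

# Root Mean Squared Error: penalizes large errors more heavily.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```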
• Implement (use Python 3 and numpy; the latter is a must for this part) the collaborative
filtering algorithm described in Section 2.1 of the paper (Equations 1 and 2; ignore Section
2.1.2) for making the predictions.
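The memory-based scheme of Equations 1 and 2 can be sketched as follows. Equation 1 predicts the active user's vote as their mean vote plus a normalized, weighted sum of the other users' deviations from their own means; Equation 2 defines the weights as Pearson correlations over the commonly rated items. This is an illustrative sketch only, not a required implementation, and the small ratings matrix (rows are users, columns are movies, np.nan marks missing votes) is made up:

```python
import numpy as np

# Made-up ratings matrix for illustration; load yours from the training set.
V = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])

def mean_vote(i):
    # Mean vote of user i over the items they actually rated.
    return np.nanmean(V[i])

def weight(a, i):
    # Equation 2: Pearson correlation over items both users rated.
    both = ~np.isnan(V[a]) & ~np.isnan(V[i])
    if not both.any():
        return 0.0
    da = V[a, both] - mean_vote(a)
    di = V[i, both] - mean_vote(i)
    denom = np.sqrt((da ** 2).sum() * (di ** 2).sum())
    return 0.0 if denom == 0 else (da * di).sum() / denom

def predict(a, j):
    # Equation 1: mean vote of a, plus the kappa-normalized weighted
    # sum of other users' deviations on item j.
    others = [i for i in range(V.shape[0])
              if i != a and not np.isnan(V[i, j])]
    ws = np.array([weight(a, i) for i in others])
    if np.abs(ws).sum() == 0:
        return mean_vote(a)  # no informative neighbors: fall back to the mean
    kappa = 1.0 / np.abs(ws).sum()  # makes the absolute weights sum to one
    devs = np.array([V[i, j] - mean_vote(i) for i in others])
    return mean_vote(a) + kappa * (ws * devs).sum()

print(predict(0, 2))  # predicted vote of user 0 on item 2
```

For the assignment you would loop `predict` over every (user, movie) pair in the test set and feed the results into the MAE/RMSE computation.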
2 Neural Networks, K-nearest neighbors and SVMs [40 points]
• For this part, you will use scikit-learn.
• Download the MNIST dataset via scikit-learn; see below on how to do it. (The dataset is
also available at http://yann.lecun.com/exdb/mnist/.) It has a training set of 60,000
examples and a test set of 10,000 examples, where the digits have been centered inside
28×28 pixel images. You can use scikit-learn to download and rescale the dataset using
the following code:
from sklearn.datasets import fetch_openml
# Load data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
# rescale the data
X = X / 255.
# use the traditional train/test split: (60K: Train) and (10K: Test)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
• Use the SVM classifier in scikit-learn and try different kernels and values of the penalty
parameter. Important: depending on your computer hardware, you may have to carefully select
the parameters (see the scikit-learn documentation for details) in order to speed up the
computation. Report the error rate for at least 10 parameter settings that you tried (see how
it is reported at http://yann.lecun.com/exdb/mnist/). Make sure to precisely describe
the parameters used so that your results are reproducible.
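As a sketch of what such a parameter sweep looks like, the example below varies the kernel and the penalty parameter C of scikit-learn's SVC and reports the test error rate for each setting. To keep it fast enough to run, it uses scikit-learn's small built-in digits set (8×8 images) as a stand-in for MNIST; the particular (kernel, C) pairs are arbitrary examples, and for the assignment you would substitute the MNIST split from above and your own grid of at least 10 settings:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small stand-in for MNIST so this sketch runs quickly; for the
# assignment, use the X_train/X_test MNIST split loaded above instead.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values in the digits set range over 0..16
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Arbitrary example settings; try (at least) 10 for the assignment.
results = {}
for kernel, C in [("linear", 1.0), ("rbf", 1.0), ("rbf", 10.0)]:
    clf = SVC(kernel=kernel, C=C).fit(X_train, y_train)
    results[(kernel, C)] = 1.0 - clf.score(X_test, y_test)  # error rate
    print(f"kernel={kernel:6s} C={C:5.1f} error rate={results[(kernel, C)]:.4f}")
```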
• Use the MLPClassifier in scikit-learn and try different architectures, gradient descent schemes,
etc. Depending on your computer hardware, you may have to carefully select the parameters
of MLPClassifier in order to speed up the computation. Report the error rate for at least 10
parameter settings that you tried. Make sure to precisely describe the parameters used so that
your results are reproducible.
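A sweep over MLPClassifier settings can follow the same pattern; here the hidden-layer architecture is varied while the solver is held fixed, again on the small digits set so the sketch runs quickly. The architectures shown are arbitrary examples, not recommended settings:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small stand-in for MNIST; substitute the MNIST split for the assignment.
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Arbitrary example architectures; also vary the solver, learning rate,
# etc. for the assignment.
results = {}
for hidden in [(50,), (100,), (50, 50)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, solver="adam",
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    results[hidden] = 1.0 - clf.score(X_test, y_test)  # error rate
    print(f"hidden={hidden} error rate={results[hidden]:.4f}")
```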
• Use the k-nearest neighbors classifier, called KNeighborsClassifier in scikit-learn, and try
different parameters (see the documentation for details). Again, depending on your computer
hardware, you may have to carefully select the parameters in order to speed up the computation.
Report the error rate for at least 10 parameter settings that you tried. Make sure to precisely
describe the parameters used so that your results are reproducible.
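For KNeighborsClassifier, a natural sweep varies the number of neighbors and the weighting scheme; the settings below are arbitrary examples on the small digits stand-in, and for the assignment you would use the MNIST split and your own set of at least 10 settings:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for MNIST; substitute the MNIST split for the assignment.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Arbitrary example settings varying k and the neighbor weighting.
results = {}
for k, weights in [(1, "uniform"), (3, "uniform"), (5, "distance")]:
    clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
    clf.fit(X_train, y_train)
    results[(k, weights)] = 1.0 - clf.score(X_test, y_test)  # error rate
    print(f"k={k} weights={weights:8s} error rate={results[(k, weights)]:.4f}")
```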
• What is the best error rate you were able to reach for each of the three classifiers? Note that
many parameters do not affect the error rate, and we will deduct points if you try them. It
is your duty to read the documentation and then apply your machine learning knowledge
to determine whether a particular parameter will affect the error rate. Finally, don't change
just one parameter 10 times; we want to see diversity.
What to turn in for this homework:
In a single zip file:
• A PDF report containing your write up for parts 1 and 2.
• Your source code for collaborative filtering and part 2.
Note that your programs must run, and we should be able to replicate your results; otherwise
no credit will be given.