## Description

1) Collaborative Filtering: Looking at the Data (2 points)

When doing machine learning, it is important to have an understanding of the dataset that you will be working on. On the course website under “links” there is a link to the MovieLens dataset. You will use the MovieLens 100K dataset to build a collaborative filter to predict the rating a user may give to a movie they haven’t seen. This dataset has 100,000 ratings from 943 users on 1682 movies.

A. (1 point) Go through each pair of users. For each pair of users, count how many movies both have reviewed in common. What is the mean number of movies two people have reviewed in common? What is the median number? Plot a histogram of this data where each bar is some number of movies reviewed in common and the height of the bar is the number of user pairs who have reviewed that many movies in common. Clearly label your dimensions. Explain any choices you made in the generation of this histogram.
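One possible way to count co-reviewed movies for every pair of users is sketched below. It assumes the ratings have already been loaded into a users-by-movies NumPy array in which 0 marks a missing rating; the function name `common_review_counts` and the toy matrix are illustrative only and are not part of the provided starter code.

```python
import numpy as np

def common_review_counts(ratings):
    """Return the number of co-reviewed movies for each unordered user pair.

    ratings: (n_users, n_movies) array where 0 means "not rated" (assumption).
    """
    rated = (ratings > 0).astype(np.int64)   # 1 where a rating exists
    overlap = rated @ rated.T                # overlap[i, j] = movies rated by both i and j
    upper = np.triu_indices(rated.shape[0], k=1)  # each pair once, no self-pairs
    return overlap[upper]

# Toy data: 3 users, 5 movies (hypothetical values).
toy = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [0, 3, 4, 0, 2],
])
counts = common_review_counts(toy)
print("pair counts:", counts)
print("mean:", counts.mean(), "median:", np.median(counts))
```

The resulting array of counts can be passed directly to a histogram routine; the choice of bin width is one of the decisions the question asks you to explain.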

B. (1 point) Go through the movies. For each movie, measure how many reviews it has. What movie had the most reviews? How many reviews did it have? What had the fewest? Order the movies by the number of reviews each one has. Now, plot a line where the vertical dimension is the number of reviews and the horizontal dimension is the movie’s number in order by the number of reviews. Clearly label your dimensions. Now that you have this data, do you think the number of reviews per movie follows Zipf’s law? (http://en.wikipedia.org/wiki/Zipf's_law)

2) Collaborative Filtering: Distance measures (1 point)

A. (1/2 point) Assume a user will be characterized by a vector that has that user’s ratings for all the movies in the MovieLens data. This means each user is described by a vector of size 1682 (the number of movies). Assume the distance between two users will be their Manhattan distance. Here is the problem: most users have not rated most of the movies. Therefore, you can’t use Manhattan distance until you decide how to deal with this. One approach (call it approach A) is to put a 0 in for every missing rating. Another approach is to find the average rating chosen by that user on the movies he/she DID rate. Then, substitute that value in for all missing movie ratings (call it approach B). Which approach do you think is better? Say why. Give a toy example situation (i.e. 3 users, 5 movies, a few missing ratings) that illustrates your point.
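The mechanics of the two fill-in approaches can be sketched as follows. This is only a minimal illustration, assuming missing ratings are stored as 0 in a users-by-movies array; the names `fill_missing` and `manhattan` and the toy ratings are hypothetical, and the sketch does not by itself settle which approach is better.

```python
import numpy as np

def fill_missing(ratings, strategy):
    """Return a float copy of ratings with missing entries (stored as 0) filled.

    strategy "zero"      -> approach A: leave missing ratings at 0
    strategy "user_mean" -> approach B: use each user's mean over rated movies
    """
    filled = ratings.astype(float)
    if strategy == "user_mean":
        rated = ratings > 0
        n_rated = np.maximum(rated.sum(axis=1), 1)   # guard against users with no ratings
        means = ratings.sum(axis=1) / n_rated        # zeros add nothing to the sum
        filled = np.where(rated, filled, means[:, None])
    elif strategy != "zero":
        raise ValueError("strategy must be 'zero' or 'user_mean'")
    return filled

def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return float(np.abs(u - v).sum())

# Toy data: 3 users, 5 movies, 0 = missing (hypothetical values).
toy = np.array([
    [4, 0, 2, 0, 0],
    [4, 0, 0, 2, 0],
    [1, 1, 0, 0, 1],
])
zero_filled = fill_missing(toy, "zero")
mean_filled = fill_missing(toy, "user_mean")
for name, m in [("approach A", zero_filled), ("approach B", mean_filled)]:
    print(name, "d(u0,u1) =", manhattan(m[0], m[1]), "d(u0,u2) =", manhattan(m[0], m[2]))
```

Comparing how the distances between the same pairs of users change under each fill-in strategy is one way to construct the toy example the question asks for.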

B. (1/2 point) You are trying to decide between two different distance measures for movies: Euclidean distance and one based on Pearson’s correlation. Each movie is characterized by a vector of 943 (the number of users) ratings. Assume that all missing ratings values have the value 3 filled in as a place-holder. Which distance measure do you think would be better for item-based collaborative filtering? Why do you think that? Illustrate your point with a toy example. Question removed to simplify homework.
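For reference, the two measures named above could be computed along these lines. This is a sketch under stated assumptions: missing ratings are stored as 0 and replaced by the placeholder 3, and the Pearson-based distance is taken to be 1 minus the correlation coefficient, which is one common convention and not necessarily the one intended by the course. The function names and toy vectors are illustrative.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two rating vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def pearson_distance(a, b):
    """Distance defined as 1 - Pearson correlation (assumed convention).

    Returns 1.0 when either vector is constant, since the correlation
    is undefined in that case (also an assumption of this sketch).
    """
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt(np.sum(a_centered ** 2) * np.sum(b_centered ** 2))
    if denom == 0:
        return 1.0
    return float(1.0 - np.sum(a_centered * b_centered) / denom)

# Toy data: 2 movies rated by 4 users, 0 = missing (hypothetical values).
raw = np.array([
    [5.0, 0.0, 1.0, 4.0],
    [4.0, 2.0, 0.0, 3.0],
])
filled = np.where(raw == 0, 3.0, raw)   # placeholder value 3 for missing ratings

print("Euclidean:", euclidean_distance(filled[0], filled[1]))
print("Pearson-based:", pearson_distance(filled[0], filled[1]))
```

Note that two vectors that differ only by a constant offset have a nonzero Euclidean distance but a Pearson-based distance of 0, which is the kind of behavior a toy example can highlight.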

EECS 349 (Machine Learning) Homework 4

3) Collaborative Filters (4 points)

You will now build two collaborative filters: one user-based and one item-based. We have provided two callable Python scripts (user_cf.py and item_cf.py) that contain some starter code. You will complete the functions named user_based_cf() and item_based_cf() in the starter code. The script user_cf.py must implement a user-based collaborative filter and item_cf.py must implement an item-based collaborative filter. All distances will be in the user/movie ratings space. Therefore, each movie will have a 943-element vector (one element per user), where the ith element contains the rating of that movie by user i. Each user will have a 1682-element vector, where the jth element contains that user’s rating for movie j. Missing ratings must be filled in with the value 0. The predicted rating returned will be the mode (the value that occurs most frequently) of the top K neighbors’ ratings. Below is how each script will be run from the command line.

python user_cf.py