Description
1. Assignment Overview The goal in this assignment is to let you be familiar with A-Prior, MinHash, Locality Sensitive Hashing (LSH), and different types of recommendation systems. You will keep using MovieLens dataset that has been used in Lab sessions. 2. Requirements 2.1 Programming Environment Python 3.6. You could use spark, numpy, scipy, pandas, sklearn, pytorch, tensorflow library. If you need to use additional libraries, you must have approval from TA and declare it in requirements.txt There will be a 20% penalty if we cannot run your code due to Python version inconsistency or undeclared 3rd party library. 2.2 Write your own code Do not share code with other students! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other studentsβ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism. 2.3 What you need to turn in The submission format will be the the same as homework 1. Your submission must be a zip file with name: firstname_lastname_hw2.zip (all lowercase). You need to pack the following files in the zip: a. four Python scripts, named: (all lowercase) firstname_lastname_task1.py, firstname_lastname_task2.py, firstname_lastname_task3.py, firstname_lastname_task4.py You could have additional files, but these four Python scripts are required. b. You donβt need to include your results. We will grade on your code with our testing data (data will be in the same format). 3. MovieLens Dataset Dataset homepage: https://grouplens.org/datasets/movielens/ You could use the small dataset for development. A copy of the dataset is included, along with split training and testing data. 4. Tasks 4.1 Task 1: Find interesting association rules Usersβ ratings to movies are stored in ratings.csv. Recall market-based model. The set of movies that a user gives 5.0 rating could be viewed as a basket. Task: Find association rules {π1 , π2 , β― , ππ} β π in these baskets, such that interest β₯ πΌ, and support β₯ S. π and π are movieId. Note: 1. Only consider 5.0 rating in this task. 2. You should use an efficient algorithm like A-Prior method. 3. Although interest could be positive or negative, only consider positive interest here 4. To simplify the computation, the threshold of support is applied to πΌ βͺ π. In the textbook, the support is applied to πΌ only. Grading: 1. 20 points in total 2. This is a deterministic algorithm. Your answer should be exactly the same as the solution. 3. We will test your implementation on multiple datasets (same format). If you get them all correct, you get full points. 4. If your answer doesnβt match the solution, 5 points deducted. The remaining points will be given based on an analysis of your implementation. 5. The reference implementation takes less than 10 seconds on the ratings.csv, with I=0.5, S =10. If your implementation takes more than 5 minutes, likely itβs not implemented correctly and will deduct points. Input format: (We will use the following command to execute your code) python firstname_lastname_task1.py Params: input_filename: Path to input file, e.g.: ml-latest-small/ratings.csv output_filename: Path to output file I: Threshold of the of interest, e.g: 0.5. Use the definition in week 2, slide 19. Only consider positive interest here. S: Threshold of support, e.g.: 10. Use the definition in week 2, slide 14. Support is an integer, not a percentage. The threshold is applied to πΌ βͺ π. Output format: A JSON file serialized from a nested list of association rules. Indent doesnβt matter. You need to output every association rule {π1 , π2 , β― , ππ} β π, along with interest and support for that rule. Below is an illustration of the format. [ [[π1, π2, π3], π1, interest1, π π’πππππ‘1], [[π1, π2, π4], π2, interest2, support2], β― ] The reference output on ratings_test_truth.csv with I=0.2, S=2 is attached as task1_sample.json Sorting Order: 1. In each association rule, the list of π are sorted in ascending order 2. Between association rules, rules are first sorted by interest, descending order 3. Then sorted by support, descending order 4. Then sorted by the list of π, ascending order. Python has sorting order defined on two lists. 5. Then sorted by π, ascending order Please do the sorting correctly. Otherwise your answer will not match the solution and lose points. Step 2-5 could be done by sorted function with argument: key=lambda x: (-x[2], -x[3], x[0], x[1]) You donβt need to worry about rounding errors. There is a tolerance in grading script. Optional: Convert movieId to title. Are there any findings from the association rules? 4.2 Task 2: Content-based recommendation system with LSH In the Lab session, we have built a family of memory based recommendation system: content-based, user-based, and item-based. But they are not scalable to massive datasets. For every recommendation made, the entire dataset has to be visited once. This could be optimized by using LSH algorithm to find similar items, instead of linear search. Task: Build an efficient content-based recommendation system, that using LSH to find similar items. Note: 1. Items (movies) are represented by a set of genres. Their similarity could be measured by Jaccard similarity. 2. You should implement LSH and content-based recommendation yourself. Using spark.mllib or similar library is not allowed. 3. There are many hyper-parameters like hash size, bucket size, similarity threshold, prime number. You need to decide these hyper-parameters, as long as your MSE is below a threshold, you get full points. 4. Donβt forget to clip the rating to [0.5, 5.0] 5. The submitted program shouldnβt read ground truth values. To calculate MSE, do it in another program. Grading: 1. 20 points in total 2. 0 point if your LSH or recommendation system is provided by a library 3. This algorithm has randomness in nature. We will run your program multiple times and take the average. As long as your πππΈ β€ 0.85 , you get full points. 4. If your MSE is higher than the threshold, 5 points deducted. The remaining points will be given based on an analysis of your implementation. 5. We will test your implementation on multiple datasets(same format). These datasets come from the same distribution and should yield similar MSE. If your model gets consistent MSE among different datasets, we will use the MSE on provided dataset for grading. Otherwise, we will analyze your implementation to see if itβs memorizing test data and deduct points if found. 6. The reference implementation takes about 10 seconds. If your implementation takes more than 5 minutes, likely itβs not implemented correctly and will deduct points. 7. The running time of LSH based method is similar to linear search method on the provided dataset. Thatβs because in this dataset, on average, each user only rated a small number of movies. The rating matrix is very sparse. The LSH based method could show its advantage when the rating matrix is denser. Input format: python firstname_lastname_task2.py movies_filename: Path to movies file, e.g.: ml-latest-small/movies.csv rating_train_filename: Path to train rating file, e.g.: ml-latestsmall/ratings_train.csv rating_test_filename: Path to test rating file, e.g.: ml-latest-small/ ratings_test.csv output_filename: Path to output file In ratings_test.csv, column rating is filled with 0. If you want to evaluate your modelβs performance, you need to compare your prediction with rating_test_truth.csv Output format: Similar to rating_test.csv, itβs column rating is replaced with your modelβs prediction. 4.3 Task 3: Model based recommendation system In contrast to memory based method, model based method doesnβt need to store or look up the massive training dataset during inference, which makes it scalable and widely used in the industry. Task: Build a model-based recommendation system based on matrix factorization. Note: 1. You are free to use any matrix factorization algorithm, and any library implementing them (e.g.: numpy, sklearn), or you could implement it yourself a. Singular value decomposition b. Neural network based matrix factorization 2. You should only use these libraries to do matrix factorization. Using a recommendation system API from 3rd party library is not allowed 3. There are many hyper-parameters like number of components, regularization coefficient. You need to decide on your own, maybe with grid search. As long as your MSE is below a threshold, you get full points. 4. In order to reach required MSE, itβs recommended to include overall/user/movie bias term. Grading: 1. 20 points in total 2. 0 point if your recommendation system is provided by a library 3. This algorithm has randomness in nature. We will run your program multiple times and take the average. As long as your πππΈ β€ 0.785 , you get full points. 4. The reference implementation takes less than 10 seconds. If your implementation takes more than 5 minutes, likely itβs not implemented correctly and will deduct points. 5. Same as Task 2, Grading 4 Input and output format: Same as task 2. 4.4 Bonus Task: Further improve the recommendation system Task: Build a recommendation system that gets the lowest MSE as you can. Note: 1. You are free to use any method in slides/textbook/online. But that must be your own implementation, not a functionality of an existing library. 2. Make sure your model performs well on the test set, not training set. We will evaluate your model in multiple datasets. 3. Possible ideas: a. The BellKor Solution to the Netflix Grand Prize b. Hybrid model of Item based and User based RS, may combined with LSH. Grading: 1. For every 0.01 MSE reduced over baseline πππΈ = 0.75 , get 1 bonus point. 20 bonus point max. 2. 0 point if your method is provided by a library, or copy-paste from GitHub. 3. Same as Task 2, Grading 3,4,5 Input and output format: Same as task 2. 5. Grading Criteria The perfect score for this assignment is 60 points + 20 bonus point. Assignment Submission Policy Homework assignments are due at 11:59 pm on the due date and should be submitted in Blackboard. Every student has FIVE free late days for the homework assignments. You can use these five days for any reason separately or together to avoid the late penalty. There will be no other extensions for any reason. You cannot use the free late days after the last day of the class. You can submit homework up to one week late, but you will lose 20% of the possible points for the assignment. After one week, the assignment cannot be submitted. (% penalty = % penalty of possible points you get) 1. You can use your free 5-day extension separately or together. 2. If we cannot run your programs with the command we specified, there will be 80% penalty. 3. If your program cannot run with the required Python/Spark versions, there will be 20% penalty. 4. If our grading program cannot find a specified tag, there will be no point for this question. 5. If the outputs of your program are unsorted or partially sorted, there will be 50% penalty. 6. If the header of the output file is missing, there will be 10% penalty. 7. We can regrade on your assignments within seven days once the scores are released. No argue after one week. There will be 20% penalty if our grading is correct. 8. There will be 20% penalty for late submission within a week and no point after a week. 9. There will be no point if the total execution time exceeds 15 minutes.