DSCI 553 Assignment 1 to 6 solutions


DSCI 553 Assignment 1

1. Overview of the Assignment

In assignment 1, you will work on three tasks. The goal of these tasks is to get you familiar with Spark operation types (e.g., transformations and actions) and explore a real-world dataset: Yelp dataset (https://www.yelp.com/dataset). If you have questions about the assignment, please ask on Piazza, which will also help other students. You only need to submit on Vocareum, NO NEED to submit on Blackboard.

2. Requirements

2.1 Programming Requirements a. You must use Python to implement all tasks. You can only use standard python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are required to only use Spark RDD in order to understand Spark operations. You will not get any points if you use Spark DataFrame or DataSet.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at /usr/local/bin/spark-submit (2.3.0)). We use `--executor-memory 4G --driver-memory 4G` on Vocareum for grading.

2.3 Write your own code Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code that can be found from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in We will grade all submissions on Vocareum; submissions on Blackboard will be ignored. Vocareum produces a submission report after you click the “Submit” button (it takes a while since Vocareum needs to run your code in order to generate the report). Vocareum will only grade Python scripts during the submission phase; it will grade both Python and Scala during the grading phase.
a. [REQUIRED] three Python scripts, named (all lowercase): task1.py, task2.py, task3.py
b. [OPTIONAL, REQUIRED FOR SCALA] three Scala scripts and the output jar file, named (all lowercase): hw1.jar, task1.scala, task2.scala, task3.scala
c. You don’t need to include your results or the datasets. We will grade your code with our testing data (data will be in the same format).

3. Yelp Data

In this assignment, you will explore the Yelp dataset. You can find the data on Vocareum under resource/asnlib/publicdata/. The two files business.json and test_review.json are the files you will work on for this assignment; they are subsets of the original Yelp dataset. The submission report you get from Vocareum is for these subsets.

For grading, we will use the files from the original Yelp dataset which is SIGNIFICANTLY larger (e.g. review.json can be 5GB). You should make sure your code works well on large datasets as well.

4. Tasks

4.1 Task1: Data Exploration (3 points) You will work on test_review.json, which contains the review information from users, and write a program to automatically answer the following questions:
A. The total number of reviews (0.5 point)
B. The number of reviews in 2018 (0.5 point)
C. The number of distinct users who wrote reviews (0.5 point)
D. The top 10 users who wrote the largest numbers of reviews and the number of reviews they wrote (0.5 point)
E. The number of distinct businesses that have been reviewed (0.5 point)
F. The top 10 businesses that had the largest numbers of reviews and the number of reviews they had (0.5 point)
Input format: (we will use the following command to execute your code)
Python: spark-submit --executor-memory 4G --driver-memory 4G task1.py
Scala: spark-submit --class task1 --executor-memory 4G --driver-memory 4G hw1.jar

Output format: IMPORTANT: Please strictly follow the output format since your code will be graded automatically. a. The output for Questions A/B/C/E will be a number. The output for Questions D/F will be a list, sorted by the number of reviews in descending order. If two user_ids/business_ids have the same number of reviews, sort the user_ids/business_ids in alphabetical order. b. You need to write the results to a file in JSON format.

You must use exactly the same tags (see the red boxes in the figure below) for answering each question. Figure 1: JSON output structure for task1
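
For illustration only, the sketch below shows how questions A through D might be answered with plain Spark RDD operations. It assumes each input line is a JSON review object with user_id and date fields; the file paths and the JSON tag names here are placeholders, so use exactly the tags shown in Figure 1 in your real submission.

    import json
    from pyspark import SparkContext

    # Hypothetical file paths; in the real script these come from the command line
    # or are set per the assignment instructions.
    input_path = "test_review.json"
    output_path = "task1_output.json"

    sc = SparkContext.getOrCreate()
    reviews = sc.textFile(input_path).map(json.loads).cache()

    result = {
        # A: the total number of reviews
        "n_review": reviews.count(),
        # B: the number of reviews in 2018 (assumes a "date" field like "2018-07-21 ...")
        "n_review_2018": reviews.filter(lambda r: r["date"].startswith("2018")).count(),
        # C: the number of distinct users who wrote reviews
        "n_user": reviews.map(lambda r: r["user_id"]).distinct().count(),
        # D: top 10 users by review count, ties broken alphabetically by user_id
        "top10_user": reviews.map(lambda r: (r["user_id"], 1))
                             .reduceByKey(lambda a, b: a + b)
                             .sortBy(lambda kv: (-kv[1], kv[0]))
                             .take(10),
    }

    with open(output_path, "w") as f:
        json.dump(result, f)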

4.2 Task2: Partition (2 points) Since processing large volumes of data requires performance decisions, properly partitioning the data for processing is imperative. In this task, you will show the number of partitions for the RDD used for Task 1 Question F and the number of items per partition. Then you need to use a customized partition function to improve the performance of map and reduce tasks. A time duration (for executing Task 1 Question F) comparison between the default partition and the customized partition (RDD built using the partition function) should also be shown in your results.

Hint: Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for redistributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

So, trying to design a partition function that avoids the shuffle will improve the performance a lot. Input format: (we will use the following command to execute your code)
Python: spark-submit --executor-memory 4G --driver-memory 4G task2.py
Scala: spark-submit --class task2 --executor-memory 4G --driver-memory 4G hw1.jar
Output format: A. The output for the number of partitions and execution time will be a number. The output for the number of items per partition will be a list of numbers. B. You need to write the results in a JSON file.

You must use exactly the same tags. Figure 3: JSON output structure for task2
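
For illustration only, here is a minimal sketch of a customized partition function, reusing the idea of the reviews RDD from the sketch above; the number of partitions and the hash choice are arbitrary, not required values.

    # Key the data by business_id and place all records with the same key in the
    # same partition, so the following reduceByKey avoids a full shuffle.
    n_partition = 10  # illustrative value

    def business_partitioner(business_id):
        # A simple deterministic hash (the built-in hash() of strings is not
        # stable across Python worker processes unless PYTHONHASHSEED is fixed).
        return sum(ord(c) for c in business_id) % n_partition

    pairs = reviews.map(lambda r: (r["business_id"], 1)) \
                   .partitionBy(n_partition, business_partitioner)

    # Number of partitions and number of items per partition, as task 2 asks.
    n_partitions = pairs.getNumPartitions()
    n_items = pairs.glom().map(len).collect()

    # Re-run the Task 1 Question F computation on the repartitioned RDD and time it.
    top10_business = pairs.reduceByKey(lambda a, b: a + b) \
                          .sortBy(lambda kv: (-kv[1], kv[0])) \
                          .take(10)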

4.3 Task3: Exploration on Multiple Datasets (2 points) In task3, you are asked to explore two datasets together, the review information (test_review.json) and the business information (business.json), and write a program to answer the following questions: A. What are the average stars for each city? (1 point) 1. (DO NOT use the stars information in the business file.) 2. (DO NOT discard records with an empty “city” field prior to aggregation.) B. You are required to compare the execution time of using two methods to print the top 10 cities with the highest stars. Please note that this part, Task 3(B), is not graded.

You will get full points only if you implement the logic to generate the output file required for this task. 1. You should store the execution times (starting from loading the file) in the JSON file with the tags “m1” and “m2”. 2. Additionally, add a “reason” field and provide a hard-coded explanation for the observed execution times.
Method1: Collect all the data, sort in Python, and then print the first 10 cities.
Method2: Sort in Spark, take the first 10 cities, and then print these 10 cities.
Input format: (we will use the following command to execute your code)
Python: spark-submit --executor-memory 4G --driver-memory 4G task3.py
Scala: spark-submit --class task3 --executor-memory 4G --driver-memory 4G hw1.jar
Output format: a. You need to write the results for Question A as a file.

The header (first line) of the file is “city,stars”. The outputs should be sorted by the average stars in descending order. If two cities have the same stars, sort the cities in alphabetical order (see Figure 3, left). b. You also need to write the answer for Question B in a JSON file. You must use exactly the same tags for the task. Figure 3: Question A output file structure (left) and JSON output structure (right) for task3
5. Grading Criteria (% penalty = penalty as a percentage of the points you would otherwise get)
1. You can use your free 5-day extension separately or together: https://forms.gle/h4t46LCahrtDk9rVA. This form will record the number of late days you use for each assignment.

We will not count late days if no request is submitted. 2. There will be a 10% bonus if you use both Scala and Python and get the expected results. 3. We will combine all the code we can find on the web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, there will be no points for the entire assignment and we will report all detected plagiarism. 4. All submissions will be graded on Vocareum.

Please strictly follow the format provided; otherwise you cannot get the points even if the answer is correct. You are encouraged to try out your code on the Vocareum terminal. 5. We will grade both the correctness and efficiency of your implementation. Efficiency is evaluated by processing time and memory usage. The maximum memory allowed is 4G, and the maximum processing time is 1800s for grading. The datasets used for grading are larger than the ones you use for doing the assignment. You will get a *% penalty if your implementation cannot generate correct outputs for large files using 4G memory within 1800s.

Therefore, please make sure your implementation is efficient enough to process large files. 6. Regrading policy: we can regrade your assignments within seven days once the scores are released. Regrading requests will not be accepted after one week. 7. There will be a 20% penalty for late submission within a week and no points after a week. If you use your late days, the 20% penalty does not apply. 8. The bonus for using Scala will be calculated only when your results from Python are correct.

There is no partial point for Scala. See the example below:
Example situations
Task | Score for Python | Score for Scala (10% of the Python score if correct) | Total
Task1 | Correct: 3 points | Correct: 3 * 10% | 3.3
Task1 | Wrong: 0 points | Correct: 0 * 10% | 0.0
Task1 | Partially correct: 1.5 points | Correct: 1.5 * 10% | 1.65
Task1 | Partially correct: 1.5 points | Wrong: 0 | 1.5
6. Common problems causing failed submissions on Vocareum / FAQ (if your program runs successfully on your local machine but fails on Vocareum, please check these)
1. Try your program on the Vocareum terminal.

Remember to set the Python version to python3.6, and use the latest Spark.
2. Check the input command line formats.
3. Check the output formats, for example, the headers, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py, task2.py, etc.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.
7. If you implement the core part in Python instead of Spark, or implement it with a high time complexity (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.

8. You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet. Don’t import sparksql.
9. Do not use Vocareum for debugging purposes; please debug on your local machine. Vocareum can be very slow if you use it for debugging.
10. Vocareum is reliable in helping you check the input and output formats, but its ability to check code correctness is limited. It cannot guarantee the correctness of the code even with a full score in the submission report.

7. Running Spark on Vocareum We’re going to use Spark 3.1.2 and Scala 2.12 for the assignments and the competition project. Here are the things that you need to do on Vocareum and local machine to run the latest Spark and Scala: On Vocareum: 1. Please select JDK 8 by running the command “export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64”

2. Please use the spark-submit command as “/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit” On your local machine: 1. Please download and set up spark-3.1.2-bin-hadoop3.2, the setup steps should be the same as spark-2.4.4 2. If you use Scala, please update Scala’s version to 2.12 on IntelliJ. 8. Tutorials for Spark Installation Here are some useful links here to help you get started with the Spark installation.

Tutorial for ubuntu: https://phoenixnap.com/kb/install-spark-on-ubuntu Tutorial for windows: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c Windows Installation without Anaconda (Recommended): https://phoenixnap.com/kb/install-spark-on-windows-10 Tutorial for mac: https://medium.com/beeranddiapers/installing-apache-spark-on-mac-os-ce416007d79f

Tutorial for Linux systems: https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm Tutorial for using IntelliJ: https://medium.com/@Sushil_Kumar/setting-up-spark-with-scala-development-environment-using-intellij-idea-b22644f73ef1 Tutorial for Jupyter notebook on Windows: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/

DSCI 553 Assignment 2

1. Overview of the Assignment

In this assignment, you will implement the SON Algorithm using the Spark Framework. You will develop a program to find frequent itemsets in two datasets, one simulated dataset and one real-world generated dataset. The goal of this assignment is to apply the algorithms you have learned in class on large datasets more efficiently in a distributed environment.

2. Requirements

2.1 Programming Requirements a. You must use Python to implement all tasks. You can only use standard python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are required to only use Spark RDD in order to understand Spark operations. You will not get any point if you use Spark DataFrame or DataSet.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at /usr/local/bin/spark-submit (2.3.0)). We use `--executor-memory 4G --driver-memory 4G` on Vocareum for grading.

2.3 Write your own code Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in We will grade all submissions on Vocareum; submissions on Blackboard will be ignored. Vocareum produces a submission report after you click the “Submit” button (it takes a while since Vocareum needs to run your code in order to generate the report). Vocareum will only grade Python scripts during the submission phase; it will grade both Python and Scala during the grading phase.
a. Two Python scripts, named (all lowercase): task1.py, task2.py
b. [OPTIONAL] hw2.jar and two Scala scripts, named (all lowercase): hw2.jar, task1.scala, task2.scala
c. You don’t need to include your results or the datasets. We will grade your code with our testing data (data will be in the same format).

3. Datasets

In this assignment, you will use one simulated dataset and one real-world dataset. In task 1, you will build and test your program with a small simulated CSV file that has been provided to you. Then, in task 2, you need to generate a subset of the Ta Feng dataset (https://bit.ly/2miWqFS) with a structure similar to the simulated data. Figure 1 shows the file structure of the task 1 simulated CSV: the first column is user_id and the second column is business_id. Figure 1: Input Data Format

4. Tasks

In this assignment, you will implement SON Algorithm to solve all tasks (Task 1 and 2) on top of Spark Framework. You need to find all the possible combinations of the frequent itemsets in any given input file within the required time. You can refer to the Chapter 6 from the Mining of Massive Datasets book and concentrate on section 6.4 – Limited-Pass Algorithms. (Hint: you can choose either A-Priori, MultiHash, or PCY algorithm to process each chunk of the data)
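
To make the two-pass structure concrete, here is a hedged sketch of the SON skeleton. The in-chunk routine is only stubbed through pairs (you would replace it with a full A-Priori/PCY implementation), and the RDD of baskets is assumed to exist already.

    from itertools import combinations

    def frequent_in_chunk(baskets, local_support):
        # Placeholder for the in-memory pass (A-Priori / PCY) on one chunk.
        # Only singletons and pairs are sketched here; a full implementation keeps
        # going to triples and beyond until no larger itemset is frequent.
        counts = {}
        for basket in baskets:
            for item in basket:
                counts[(item,)] = counts.get((item,), 0) + 1
        frequent_singles = {k[0] for k, c in counts.items() if c >= local_support}
        for basket in baskets:
            for pair in combinations(sorted(set(basket) & frequent_singles), 2):
                counts[pair] = counts.get(pair, 0) + 1
        return [k for k, c in counts.items() if c >= local_support]

    def son(baskets_rdd, support, total_baskets):
        # Pass 1: run the in-memory algorithm on each chunk with a proportionally
        # scaled support; the union of the local results is the candidate set.
        def first_pass(chunk):
            baskets = list(chunk)
            local_support = support * len(baskets) / total_baskets
            return frequent_in_chunk(baskets, local_support)

        candidates = baskets_rdd.mapPartitions(first_pass).distinct().collect()

        # Pass 2: count every candidate over the whole dataset and keep the
        # itemsets whose true count meets the support threshold.
        def second_pass(chunk):
            baskets = [set(b) for b in chunk]
            for cand in candidates:
                yield (cand, sum(1 for b in baskets if set(cand) <= b))

        frequent = (baskets_rdd.mapPartitions(second_pass)
                               .reduceByKey(lambda a, b: a + b)
                               .filter(lambda kv: kv[1] >= support)
                               .keys()
                               .collect())
        return candidates, frequent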

4.1 Task 1: Simulated data (3 pts) There are two CSV files (small1.csv and small2.csv) in Vocareum under ‘/resource/asnlib/publicdata’. The small1.csv is just a test file that you can use to debug your code. For task1, we will only test your code on small2.csv. In this task, you need to build two kinds of market-basket models.

Case 1 (1.5 pts): You will calculate the combinations of frequent businesses (as singletons, pairs, triples, etc.) that are qualified as frequent given a support threshold. You need to create a basket for each user containing the business ids reviewed by this user. If a business was reviewed more than once by a reviewer, we consider this product to have been rated only once.

More specifically, the business ids within each basket are unique. The generated baskets are similar to: user1: [business11, business12, business13, …] user2: [business21, business22, business23, …] user3: [business31, business32, business33, …] Case 2 (1.5 pts): You will calculate the combinations of frequent users (as singletons, pairs, triples, etc.) that are qualified as frequent given a support threshold. You need to create a basket for each business containing the user ids that commented on this business. Similar to case 1, the user ids within each basket are unique.

The generated baskets are similar to: business1: [user11, user12, user13, …] business2: [user21, user22, user23, …] business3: [user31, user32, user33, …] Input format: 1. Case number: Integer that specifies the case. 1 for Case 1 and 2 for Case 2.

2. Support: Integer that defines the minimum count to qualify as a frequent itemset. 3. Input file path: This is the path to the input file including path, file name and extension. 4. Output file path: This is the path to the output file including path, file name and extension. Output format: 1. Runtime: the total execution time from loading the file till finishing writing the output file You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”.

2. Output file: (1) Intermediate result You should use “Candidates:” as the tag. For each line you should output the candidates of frequent itemsets you found after the first pass of SON Algorithm followed by an empty line after each combination. The printed itemsets must be sorted in lexicographical order (Both user_id and business_id are type of string). (2) Final result You should use “Frequent Itemsets:”as the tag.

For each line you should output the final frequent itemsets you found after finishing the SON Algorithm. The format is the same as the intermediate results. The printed itemsets must be sorted in lexicographical order. Here is an example of the output file: Both the intermediate results and final results should be saved in ONE output result file. Execution example:
Python: spark-submit task1.py <case_number> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task1 hw2.jar <case_number> <support> <input_file_path> <output_file_path>
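
For illustration, here is a hedged sketch of how the two market-basket models could be built from the CSV with plain RDD operations; input_file_path and case_number stand in for the command-line arguments, and the file is assumed to have a header row with user_id in the first column and business_id in the second, per Figure 1.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.textFile(input_file_path)
    header = lines.first()
    pairs = (lines.filter(lambda l: l != header)
                  .map(lambda l: l.split(","))
                  .map(lambda cols: (cols[0].strip(), cols[1].strip())))

    if case_number == 1:
        # Case 1: one basket per user, holding the distinct business ids reviewed.
        baskets = pairs.groupByKey().map(lambda kv: sorted(set(kv[1])))
    else:
        # Case 2: one basket per business, holding the distinct user ids.
        baskets = (pairs.map(lambda kv: (kv[1], kv[0]))
                        .groupByKey()
                        .map(lambda kv: sorted(set(kv[1]))))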

4.2 Task 2: Ta Feng data (4 pts) In task 2, you will explore the Ta Feng dataset to find the frequent itemsets (only case 1). You will use data found here from Kaggle (https://bit.ly/2miWqFS) to find product IDs associated with a given customer ID each day. Aggregate all purchases a customer makes within a day into one basket. In other words, assume a customer purchases at once all items purchased within a day.

Note: Be careful when reading the csv file as spark can read the product id numbers with leading zeros. You can manually format Column F (PRODUCT_ID) to numbers (with zero decimal places) in the csv file before reading it using spark. SON Algorithm on Ta Feng data: You will create a data pipeline where the input is the raw Ta Feng data, and the output is the file described under “output file”. You will pre-process the data, and then from this pre-processed data, you will create the final output.

Your code is allowed to output this pre-processed data during execution, but you should NOT submit homework that includes this pre-processed data.
(1) Data preprocessing
You need to generate a dataset from the Ta Feng dataset with the following steps:
1. Find the date of the purchase (column TRANSACTION_DT), such as December 1, 2000 (12/1/00).
2. At each date, select “CUSTOMER_ID” and “PRODUCT_ID”.
3. We want to consider all items bought by a consumer each day as a separate transaction (i.e., “baskets”). For example, if consumers 1, 2, and 3 each bought oranges on December 2, 2000, and consumer 2 also bought celery on December 3, 2000, we would consider that to be 4 separate transactions. An easy way to do this is to rename each CUSTOMER_ID as “DATE-CUSTOMER_ID”. For example, if CUSTOMER_ID is 12321, and this customer bought apples on November 14, 2000, then their new ID is “11/14/00-12321”.
4. Make sure each line in the CSV file is “DATE-CUSTOMER_ID1, PRODUCT_ID1”.
5. The header of the CSV file should be “DATE-CUSTOMER_ID, PRODUCT_ID”.
You need to save the dataset in CSV format. The figure below shows an example of the output file (please note DATE-CUSTOMER_ID and PRODUCT_ID are strings and integers, respectively). Figure: customer_product file
Do NOT submit the output file of this data preprocessing step, but your code is allowed to create this file.
(2) Apply SON Algorithm
The requirements for task 2 are similar to task 1.

However, you will test your implementation with the large dataset you just generated. For this purpose, you need to report the total execution time. For this execution time, we take into account the time from reading the file till writing the results to the output file. You are asked to find the candidate and frequent itemsets (similar to the previous task) using the file you just generated.

The following are the steps you need to do: 1. Read the customer_product CSV file into an RDD and then build the case 1 market-basket model. 2. Find the qualified DATE-CUSTOMER_ID baskets that contain more than k items (k is the filter threshold). 3. Apply the SON Algorithm code to the filtered market-basket model.
Input format: 1. Filter threshold: Integer that is used to filter out qualified users. 2. Support: Integer that defines the minimum count to qualify as a frequent itemset. 3. Input file path: This is the path to the input file including path, file name and extension. 4. Output file path: This is the path to the output file including path, file name and extension.

Output format: 1. Runtime: the total execution time from loading the file till finishing writing the output file You need to print the runtime in the console with the “Duration” tag, e.g., “Duration: 100”. 2. Output file The output file format is the same with task 1. Both the intermediate results and final results should be saved in ONE output result file.

Execution example:
Python: spark-submit task2.py <filter_threshold> <support> <input_file_path> <output_file_path>
Scala: spark-submit --class task2 hw2.jar <filter_threshold> <support> <input_file_path> <output_file_path>
5. Evaluation Metric
Task 1:
Input File | Case | Support | Runtime (sec)
small2.csv | 1 | 4 | <=200
small2.csv | 2 | 9 | <=100
Task 2:
Input File | Filter Threshold | Support | Runtime (sec)
Customer_product.csv | 20 | 50 | <=500

6. Grading Criteria (% penalty = penalty as a percentage of the points you would otherwise get) 1. You can use your free 5-day extension separately or together. 2. There will be a 10% bonus if you use both Scala and Python. 3. We will combine all the code we can find on the web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, there will be no points for the entire assignment and we will report all detected plagiarism. 4. All submissions will be graded on Vocareum.

Please strictly follow the format provided; otherwise you cannot get the points even if the answer is correct. 5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty. 6. If you use Spark DataFrame, DataSet, or sparksql, there will be a 20% penalty. 7. We can regrade your assignments within seven days once the scores are released. Regrading requests will not be accepted after one week. There will be a 20% penalty if our grading turns out to be correct. 8. There will be a 20% penalty for late submission within a week and no points after a week. 9. The bonus for using Scala will be calculated only when your results from Python are correct.

There is no partial point for Scala. See the example below:
Example situations
Task | Score for Python | Score for Scala (10% of the Python score if correct) | Total
Task1 | Correct: 3 points | Correct: 3 * 10% | 3.3
Task1 | Wrong: 0 points | Correct: 0 * 10% | 0.0
Task1 | Partially correct: 1.5 points | Correct: 1.5 * 10% | 1.65
Task1 | Partially correct: 1.5 points | Wrong: 0 | 1.5
7. Common problems causing failed submissions on Vocareum / FAQ (if your program seems to run successfully on your local machine but fails on Vocareum, please check these)
1. Try your program on the Vocareum terminal.

Remember to set the Python version to python3.6, and use the latest Spark.
2. Check the input command line format.
3. Check the output format, for example, the header, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py and task2.py.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.

7. If you implement the core part in Python instead of Spark, or implement it in a high-time-complexity way (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly. 8. You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet. Don’t import sparksql.

DSCI 553 Assignment 3

1. Overview of the Assignment

In Assignment 3, you will complete two tasks. The goal is to familiarize you with Locality Sensitive Hashing (LSH), and different types of collaborative-filtering recommendation systems. The dataset you are going to use is a subset from the Yelp dataset used in the previous assignments.

2. Assignment Requirements

2.1 Programming Language and Library Requirements a. You must use Python to implement all tasks. You can only use standard python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task (or case) if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are required to only use the Spark RDD to understand Spark operations. You will not receive any points if you use Spark DataFrame or DataSet.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at `/home/local/spark/latest/bin/spark-submit` (2.4.4).)

2.3 Write your own code Do not share your code with other students!! We will combine all the code we can find from the web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.
3. Yelp Data
In this assignment, the datasets you are going to use are from: https://drive.google.com/drive/folders/1SufecRrgj1yWMOVdERmBBUnqz0EX7ARQ?usp=sharing We generated the following two datasets from the original Yelp review dataset with some filters.

We randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20% of the data as the testing dataset. a. yelp_train.csv: the training data, which only includes the columns user_id, business_id, and stars. b. yelp_val.csv: the validation data, which is in the same format as the training data. c. We are not sharing the test dataset. d. other datasets: these provide additional information (like the average stars or the location of a business).

4. Tasks

Note: This Assignment has been divided into 2 parts on Vocareum. This has been done to provide more computational resources.

4.1 Task1: Jaccard based LSH (2 points) In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using yelp_train.csv. In this task, we focus on the “0 or 1” ratings rather than the actual ratings/stars from the users. Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the user hasn’t rated the business, the contribution is 0. You need to identify similar businesses whose similarity >= 0.5.

You can define any collection of hash functions that you think would result in a consistent permutation of the row entries of the characteristic matrix. Some potential hash functions are: f(x)= (ax + b) % m or f(x) = ((ax + b) % p) % m where p is any prime number and m is the number of bins. Please carefully design your hash functions.

After you have defined all the hashing functions, you will build the signature matrix. Then you will divide the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should carefully select a good combination of b and r in your implementation (b>1 and r>1). Remember that two items are a candidate pair if their signatures are identical in at least one band.
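
To make the signature and banding steps concrete, here is a hedged sketch; the hash parameters and the choice of the number of hash functions, bands, and rows are illustrative only, and business_rows is assumed to be an RDD of (business_id, set of user row indices) built from the characteristic matrix.

    import random
    from itertools import combinations

    n_hash, bands, rows = 50, 25, 2      # illustrative: n_hash = bands * rows
    m = 1000003                          # number of bins (a prime, for convenience)
    random.seed(0)
    hash_params = [(random.randint(1, m - 1), random.randint(0, m - 1))
                   for _ in range(n_hash)]

    def minhash_signature(row_ids):
        # One minhash value per hash function: the minimum hashed row index.
        return [min((a * x + b) % m for x in row_ids) for a, b in hash_params]

    signatures = business_rows.mapValues(minhash_signature)

    def to_bands(kv):
        # Emit the business once per band, keyed by that band's signature slice.
        business_id, sig = kv
        for band in range(bands):
            yield ((band, tuple(sig[band * rows:(band + 1) * rows])), business_id)

    candidate_pairs = (signatures.flatMap(to_bands)
                                 .groupByKey()
                                 .flatMap(lambda kv: combinations(sorted(kv[1]), 2))
                                 .distinct())
    # Finally, verify each candidate pair against the original characteristic
    # matrix and keep only pairs whose true Jaccard similarity is >= 0.5.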

Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write the final results into a CSV file according to the output format below.
Example of Jaccard Similarity:
          | user1 | user2 | user3 | user4
business1 | 0     | 1     | 1     | 1
business2 | 0     | 1     | 0     | 0
Jaccard Similarity (business1, business2) = #intersection / #union = 1/3
Input format: (we will use the following command to execute your code)
Python: spark-submit task1.py <input_file_name> <output_file_name>
Scala: spark-submit --class task1 hw3.jar <input_file_name> <output_file_name>
Param: input_file_name: the name of the input file (yelp_train.csv), including the file path.
Param: output_file_name: the name of the output CSV file, including the file path.

Output format: IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We will not regrade because of formatting issues. a. The output file is a CSV file, containing all business pairs you have found. The header is “business_id_1, business_id_2, similarity”. Each pair itself must be in the alphabetical order.

The entire file also needs to be in the alphabetical order. There is no requirement for the number of decimals for the similarity value. Please refer to the format in Figure 2. Figure 2: a CSV output example for task1 Grading: We will compare your output file against the ground truth file using precision and recall metrics. Precision = true positives / (true positives + false positives) Recall = true positives / (true positives + false negatives) The ground truth file has been provided in the Google drive, named as “pure_jaccard_similarity.csv”.

You can use this file to compare your results to the ground truth as well. The ground truth dataset only contains the business pairs (from the yelp_train.csv) whose Jaccard similarity >=0.5. The business pair itself is sorted in the alphabetical order, so each pair only appears once in the file (i.e., if pair (a, b) is in the dataset, (b, a) will not be there). In order to get full credit for this task you should have precision >= 0.99 and recall >= 0.97.

If not, then you will get only partial credit based on the formula: (Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4 Your runtime should be less than 100 seconds. If your runtime is more than or equal to 100 seconds, you will not receive any point for this task.

4.2 Task2: Recommendation System (5 points) In task 2, you are going to build different types of recommendation systems using the yelp_train.csv to predict the ratings/stars for given user ids and business ids. You can make any improvement to your recommendation system in terms of speed and accuracy. You can use the validation dataset (yelp_val.csv) to evaluate the accuracy of your recommendation systems, but please don’t include it as your training data. There are two options to evaluate your recommendation systems.

You can compare your results to the corresponding ground truth and compute the absolute differences. You can divide the absolute differences into 5 levels and count the number for each level as follows:
>=0 and <1: 12345
>=1 and <2: 123
>=2 and <3: 1234
>=3 and <4: 1234
>=4: 12
This means that there are 12345 predictions with < 1 difference from the ground truth. This way you will be able to know the error distribution of your predictions and to improve the performance of your recommendation systems.

Additionally, you can compute the RMSE (Root Mean Squared Error) using the following formula:
RMSE = sqrt((1/n) * Σ_i (Pred_i - Rate_i)^2)
where Pred_i is the prediction for business i, Rate_i is the true rating for business i, and n is the total number of predictions. In this task, you are required to implement: Case 1: Item-based CF recommendation system with Pearson similarity (2 points) Case 2: Model-based recommendation system (1 point) Case 3: Hybrid recommendation system (2 points)
4.2.1. Item-based CF recommendation system
Please strictly follow the slides to implement an item-based recommendation system with Pearson similarity.
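
As a hedged sketch of the Pearson similarity between two businesses over their co-rated users (check the lecture slides for the exact formulation you are expected to follow):

    def pearson_similarity(ratings_i, ratings_j):
        # ratings_i, ratings_j: dicts mapping user_id -> stars for two businesses.
        co_rated = set(ratings_i) & set(ratings_j)
        if len(co_rated) < 2:
            return 0.0  # illustrative fallback when there are too few co-raters
        avg_i = sum(ratings_i[u] for u in co_rated) / len(co_rated)
        avg_j = sum(ratings_j[u] for u in co_rated) / len(co_rated)
        num = sum((ratings_i[u] - avg_i) * (ratings_j[u] - avg_j) for u in co_rated)
        den = (sum((ratings_i[u] - avg_i) ** 2 for u in co_rated) ** 0.5
               * sum((ratings_j[u] - avg_j) ** 2 for u in co_rated) ** 0.5)
        return num / den if den else 0.0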

4.2.2. Model-based recommendation system You need to use XGBRegressor (a regressor based on decision trees) to train a model. You need to use this API: https://xgboost.readthedocs.io/en/latest/python/python_api.html, the XGBRegressor inside the package xgboost.

Please choose your own features from the provided extra datasets; you can think about this from a customer’s perspective. For example, the average stars rated by a user and the number of reviews most likely influence the prediction result. You need to select other features and train a model based on them. Use the validation dataset to validate your result and remember not to include it in your training data.
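
A minimal sketch of the model-based part with XGBRegressor follows; the feature names and hyperparameter values are hypothetical illustrations, not required choices, and X_train, y_train, and X_val are assumed to be feature matrices you have already built from the extra datasets.

    import xgboost as xgb

    # X_train / X_val: rows of numeric features per (user, business) pair, e.g.
    # [user_avg_stars, user_review_count, business_avg_stars, business_review_count]
    # y_train: the stars column from yelp_train.csv.
    model = xgb.XGBRegressor(
        n_estimators=300,    # illustrative hyperparameters, not tuned values
        max_depth=6,
        learning_rate=0.1,
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)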

4.2.3. Hybrid recommendation system. Now that you have the results from previous models, you will need to choose a way from the slides to combine them together and design a better hybrid recommendation system. Here are two examples of hybrid systems: Example1: You can combine them together as a weighted average, which means:
final_score = α × score_item_based + (1 - α) × score_model_based
The key idea is: the CF focuses on the neighbors of the item and the model-based RS focuses on the user and items themselves.

Specifically, if the item has a smaller number of neighbors, then the weight of the CF should be smaller. Meanwhile, if two restaurants are both 4 stars but the first one has 10 reviews and the second one has 1000 reviews, the average star of the second one is more trustworthy, so the model-based RS score should weigh more.

You may need to find other features to generate your own weight function to combine them together. Example2: You can combine them together as a classification problem: Again, the key idea is: the CF focuses on the neighbors of the item and the model-based RS focuses on the user and items themselves. As a result, in our dataset, some item-user pairs are more suitable for the CF while the others are not.

You need to choose some features to classify which model you should choose for each item-user pair. If you train a classifier, you are allowed to upload the pre-trained classifier model named “model.md” to save running time on Vocareum. You can use pickle library, joblib library or others if you want. Here is an example: https://scikit-learn.org/stable/modules/model_persistence.html. You also need to upload the training script named “train.py” to let us verify your model.
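
A short sketch of the save/load pattern with pickle, using the file names required by the spec (model.md and train.py); whether the persisted object is a classifier or a regressor, the pattern is the same.

    import pickle

    # In train.py: fit the model, then persist it so Vocareum does not retrain it.
    with open("model.md", "wb") as f:
        pickle.dump(model, f)

    # In task2_3.py: load the pre-trained model back.
    with open("model.md", "rb") as f:
        model = pickle.load(f)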

Some possible features (other features may also work): -Average stars of a user, average stars of a business, the variance of the review history of a user or a business. -Number of reviews of a user or a business. -Yelp account starting date, number of fans. -The number of people who think a user’s review is useful/funny/cool.

Number of compliments (Be careful with these features. For example, sometimes when I visit a horrible restaurant, I will give full stars because I hope I am not the only one who wasted money and time here. Sometimes people are satirical. :-))
Input format: (we will use the following commands to execute your code)
Case1: spark-submit task2_1.py <train_file_name> <test_file_name> <output_file_name>
Param: train_file_name: the name of the training file (e.g., yelp_train.csv), including the file path
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case2: spark-submit task2_2.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Case3: spark-submit task2_3.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of the dataset folder, which contains exactly the same files as the Google Drive folder
Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path
Param: output_file_name: the name of the prediction result file, including the file path
Output format: a. The output file is a CSV file, containing all the prediction results for each user and business pair in the validation/testing data.

The header is “user_id, business_id, prediction”. There is no requirement for the order in this task. There is no requirement for the number of decimals for the similarity values. Please refer to the format in Figure 3. Figure 3: Output example in CSV for task2 Grading: We will compare your prediction results against the ground truth. We will grade on all the cases in Task2 based on your accuracy using RMSE. For your reference, the table below shows the RMSE baselines and running time for predicting the validation data.

The time limit of case3 is set to 30 minutes because we hope you consider this factor and try to improve on it as much as possible (hint: this will help you a lot in the competition project at the end of the semester).
             | Case 1 | Case 2 | Case 3
RMSE         | 1.09   | 1.00   | 0.99
Running Time | 130s   | 400s   | 1800s
For grading, we will use the testing data to evaluate your recommendation systems. If you can pass the RMSE baselines in the above table, you should be able to pass the RMSE baselines for the testing data. However, if your recommendation system only passes the RMSE baselines for the validation data, you will receive 50% of the points for each case.

5. Submission

You need to submit the following files on Vocareum with exactly the same names:
a. Four Python scripts: task1.py, task2_1.py, task2_2.py, task2_3.py
b. [OPTIONAL] hw3.jar and four Scala scripts: task1.scala, task2_1.scala, task2_2.scala, task2_3.scala
6. Grading Criteria (% penalty = penalty as a percentage of the points you would otherwise get)
1. You can use your free 5-day extension separately or together. (Google Forms link for extension: https://docs.google.com/forms/d/e/1FAIpQLSeSHzGWzPi3iuS-zNYyDLbhhP4ancMEZgKDiwYZLmhyYhKFw/viewform )

2. There will be a 10% bonus if you use both Scala and Python. 3. We will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, you will receive no points for the entire assignment and we will report all detected plagiarism.

4. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise you won’t receive points even though the answer is correct. 5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty. 6. Do NOT use Spark DataFrame, DataSet, or sparksql. 7. We can regrade your assignments within seven days once the scores are released. We will not accept any regrading requests after a week.

There will be a 20% penalty if our grading turns out to be correct. 8. There will be a 20% penalty for late submissions within a week and no points after a week. 9. Only if your results from Python are correct will the bonus of using Scala be calculated. There are no partial points awarded for Scala. See the example below:
Example situations
Task | Score for Python | Score for Scala (10% of the Python score if correct) | Total
Task 1 | Correct: 3 points | Correct: 3 * 10% | 3.3
Task 1 | Wrong: 0 points | Correct: 0 * 10% | 0.0
Task 1 | Partially correct: 1.5 points | Correct: 1.5 * 10% | 1.65
Task 1 | Partially correct: 1.5 points | Wrong: 0 | 1.5
7. Common problems causing failed submissions on Vocareum / FAQ (if your program seems to run successfully on your local machine but fails on Vocareum, please check these)
1. Try your program on the Vocareum terminal. Remember to set the Python version to python3.6, and use the latest Spark: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
2. Check the input command line format.
3. Check the output format, for example, the header, tags, and typos.

4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py, task2.py, etc.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.
7. If you implement the core part in Python instead of Spark, or implement it in a high-time-complexity way (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.
8. You are required to only use Spark RDD in order to understand Spark operations more deeply. You will not get any points if you use Spark DataFrame or DataSet. Don’t import sparksql.

DSCI 553 Assignment 4

1. Overview of the Assignment

In this assignment, you will explore the spark GraphFrames library as well as implement your own Girvan-Newman algorithm using the Spark Framework to detect communities in graphs. You will use the ub_sample_data.csv dataset to find users who have a similar business taste. The goal of this assignment is to help you understand how to use the Girvan-Newman algorithm to detect communities in an efficient way within a distributed environment.

2. Requirements

2.1 Programming Requirements a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. For task1, you can use the Spark DataFrame and GraphFrames library. For task2 you can ONLY use Spark RDD and standard Python or Scala libraries.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use these library versions to compile and test your code. There will be no point if we cannot run your code on Vocareum. On Vocareum, you can call `spark-submit` located at `/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (Do not use the one at `/home/local/spark/latest/bin/spark-submit` (2.4.4).)

2.3 Write your own code Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

2.4 What you need to turn in You need to submit the following files on Vocareum:
a. [REQUIRED] two Python scripts, named: task1.py, task2.py
b1. [OPTIONAL, REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala
b2. [OPTIONAL, REQUIRED FOR SCALA] one jar package, named: hw4.jar
c. [OPTIONAL] You can include other scripts called by your main program.
d. You don’t need to include your results. We will grade your code with our testing data (data will be in the same format).
3. Datasets
We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id and business_id. You can find the data on Vocareum under resource/asnlib/publicdata/.

4. Tasks

4.1 Graph Construction To construct the social network graph, assume that each node is uniquely labeled, and that links are undirected and unweighted. Each node represents a user. There should be an edge between two nodes if the number of common businesses reviewed by the two users is greater than or equal to the filter threshold. For example, suppose user1 reviewed {business1, business2, business3} and user2 reviewed {business2, business3, business4, business5}. If the threshold is 2, there will be an edge between user1 and user2. If a user node has no edge, we will not include that node in the graph. The filter threshold will be given as an input parameter when running your code.
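
A hedged sketch of this edge rule follows, assuming user_business is an RDD of (user_id, set of business_ids) built from ub_sample_data.csv and that threshold holds the filter threshold argument; collecting to the driver is only reasonable because the sample graph is small.

    from itertools import combinations

    user_sets = dict(user_business.collect())

    edges = []
    for u1, u2 in combinations(sorted(user_sets), 2):
        # Undirected, unweighted edge when the users share >= threshold businesses.
        if len(user_sets[u1] & user_sets[u2]) >= threshold:
            edges.append((u1, u2))

    # Keep only the user nodes that appear in at least one edge.
    nodes = {u for edge in edges for u in edge}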

4.2 Task1: Community Detection Based on GraphFrames (2 pts) In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you constructed in 4.1. The library provides an implementation of the Label Propagation Algorithm (LPA), which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community detection solution whereby information “flows” through the graph based on the underlying edge structure. In this task, you do not need to implement the algorithm from scratch; you can call the method provided by the library.

The following websites may help you get started with the Spark GraphFrames: https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html 4.2.1 Execution Detail The version of the GraphFrames should be 0.6.0. (For your convenience, graphframes 0.6.0 is already installed for Python on Vocareum. The corresponding jar package can also be found under the $ASNLIB/public folder.)

For Python (on a local machine):
● [Approach 1] Run “python3.6 -m pip install graphframes” in the terminal to install the package.
● [Approach 2] In PyCharm, add the line below to your code to use the jar package: os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 pyspark-shell"
● In the terminal, you need to pass the “packages” parameter to spark-submit: --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For Scala (on a local machine):
● In IntelliJ IDEA, you need to add the library dependencies to your project: “graphframes” % “graphframes” % “0.8.2-spark3.1-s_2.12” and “org.apache.spark” %% “spark-graphx” % sparkVersion
● In the terminal, you need to pass the “packages” parameter to spark-submit: --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For the parameter “maxIter” of the LPA method, you should set it to 5.
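
A minimal sketch of the GraphFrames call, assuming you have already turned the nodes and edges from 4.1 into DataFrames following the GraphFrames column conventions (id for vertices, src/dst for edges):

    from graphframes import GraphFrame

    # vertices: DataFrame with one column "id" (one row per user node)
    # edge_df:  DataFrame with columns "src" and "dst"; add each undirected edge
    #           in both directions so LPA treats the graph as undirected.
    graph = GraphFrame(vertices, edge_df)
    communities = graph.labelPropagation(maxIter=5)
    # "communities" has columns "id" and "label"; group by "label" to collect the
    # members of each community, then sort and write them per the output format.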

4.2.2 Output Result In this task, you need to save your result of communities in a txt file. Each line represents one community and the format is: ‘user_id1’, ‘user_id2’, ‘user_id3’, ‘user_id4’, … Your result should be sorted first by the size of the communities in ascending order and then by the first user_id in the community in lexicographical order (the user_id is of type string). The user_ids in each community should also be in lexicographical order. If there is only one node in the community, we still regard it as a valid community. Figure 1: community output file format

4.3 Task2: Community Detection Based on Girvan-Newman algorithm (5 pts) In task2, you will implement your own Girvan-Newman algorithm to detect the communities in the network graph. You can refer to the Chapter 10 from the Mining of Massive Datasets book for the algorithm details. Because your task1 and task2 code will be executed separately, you need to construct the graph again in this task following the rules in section 4.1.

For task2, you can ONLY use Spark RDD and standard Python or Scala libraries. Remember to delete the code that imports graphframes. Usage of Spark DataFrame is NOT allowed in this task. 4.3.1 Betweenness Calculation (2 pts) In this part, you will calculate the betweenness of each edge in the original graph you constructed in 4.1. Then you need to save your result in a txt file. The format of each line is: (‘user_id1’, ‘user_id2’), betweenness value. Your result should be sorted first by the betweenness values in descending order and then by the first user_id in the tuple in lexicographical order (the user_id is of type string).

The two user_ids in each tuple should also be in lexicographical order. For output, you should use the python built-in round() function to round the betweenness value to five digits after the decimal point. (Rounding is for output only, please do not use the rounded numbers for further calculation) IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We will not regrade because of formatting issues. Figure 2: betweenness output file format

4.3.2 Community Detection (3 pts) You are required to divide the graph into suitable communities that reach the highest global modularity. The formula of modularity is shown below:
Q = (1 / (2m)) * Σ_{i,j in the same community} [ A_ij - (k_i * k_j) / (2m) ]
According to the Girvan-Newman algorithm, after removing one edge, you should re-compute the betweenness. The “m” in the formula represents the edge number of the original graph. The “A” in the formula is the adjacency matrix of the original graph, and k_i and k_j are the degrees of nodes i and j in the original graph. (Hint: In each remove step, “m”, “A”, “k_i” and “k_j” should not be changed).
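
A hedged sketch of evaluating the modularity of one candidate partition, with m, A, k_i and k_j all taken from the original graph as the hint requires; the data structures used here (an edge set and a degree dict) are assumptions of this sketch.

    def modularity(communities, original_edges, degree, m):
        # communities: iterable of sets of nodes (one candidate partition)
        # original_edges: set of frozenset({i, j}) edges of the ORIGINAL graph
        # degree: dict node -> degree in the ORIGINAL graph; m: original edge count
        q = 0.0
        for community in communities:
            for i in community:
                for j in community:
                    a_ij = 1 if i != j and frozenset((i, j)) in original_edges else 0
                    q += a_ij - degree[i] * degree[j] / (2 * m)
        return q / (2 * m)

After each round of edge removals, recompute the connected components of the remaining graph, evaluate this quantity on the resulting partition, and keep the partition with the highest value.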

In the step of removing the edges with the highest betweenness, if two or more edges have the same (highest) betweenness, you should remove all those edges. If the community only has one user node, we still regard it as a valid community. You need to save your result in a txt file. The format is the same with the output file from task1.

4.4 Execution Format Execution example:
Python:
spark-submit --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 task1.py
spark-submit task2.py
Scala:
spark-submit --packages graphframes:graphframes:0.8.2-spark3.1-s_2.12 --class task1 hw4.jar
spark-submit --class task2 hw4.jar
Input parameters:
1. <filter_threshold>: the filter threshold to generate edges between user nodes.
2. <input_file_path>: the path to the input file including path, file name and extension.
3. <betweenness_output_file_path>: the path to the betweenness output file including path, file name and extension.
4. <community_output_file_path>: the path to the community output file including path, file name and extension.

Execution time: The overall runtime limit of your task1 (from reading the input file to finishing writing the community output file) is 400 seconds. The overall runtime limit of your task2 (from reading the input file to finishing writing the community output file) is 400 seconds. If your runtime exceeds the above limit, there will be no point for this task.

5. About Vocareum

a. Dataset is under the directory $ASNLIB/publicdata/, jar package is under $ASNLIB/public/ b. You should upload the required files under your workspace: work/, and click submit c. You should test your scripts on both the local machine and the Vocareum terminal before submission. d. During the submission period, the Vocareum will automatically test task1 and task2. e. During the grading period, the Vocareum will use another dataset that has the same format for testing. f. We do not test the Scala implementation during the submission period.

g. Vocareum will automatically run both Python and Scala implementations during the grading period. h. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade your last submission.
6. Grading Criteria (% penalty = penalty as a percentage of the points you would otherwise get)
a. You can use your free 5-day extension separately or together (https://docs.google.com/forms/d/e/1FAIpQLSf6hpYzacaV2d1CJMZfrlE-xl9N6bLkJbhi7aFlAQcObGj0Xw/viewform)

b. There will be 10% bonus for each task if your Scala implementations are correct. Only when your Python results are correct, the bonus of Scala will be calculated. There is no partial point for Scala. c. There will be no point if your submission cannot be executed on Vocareum. d. There will be a 20% penalty for late submission within one week and no point after that.

7. Common problems causing failed submissions on Vocareum / FAQ (if your program seems to run successfully on your local machine but fails on Vocareum, please check these)
1. Try your program on the Vocareum terminal. Remember to set the Python version to python3.6, and use the latest Spark: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
2. Check the input command line format.
3. Check the output format, for example, the header, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py and task2.py.
6. Check whether your local environment fits the assignment description, i.e., version and configuration.
7. If you implement the core part in Python instead of Spark, or implement it in a high-time-complexity way (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.

DSCI 553 Assignment 5

1. Overview of the Assignment

In this assignment, you are going to implement three streaming algorithms. In the first two tasks, you will generate a simulated data stream with the Yelp dataset and implement the Bloom Filtering and Flajolet-Martin algorithms. In the third task, you will do some analysis using a Fixed Size Sample (Reservoir Sampling).

2. Requirements

2.1 Programming Requirements a. You must use Python and Spark to implement all tasks. There will be a 10% bonus for each task if you also submit a Scala implementation and both your Python and Scala implementations are correct. b. You are not required to use Spark RDD in this assignment. c. You can only use standard Python libraries, which are already installed on Vocareum.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2 We will use the above library versions to compile and test your code. You are required to make sure your code works and runs on Vocareum; otherwise we won’t be able to grade it.

2.3 Important things before starting the assignment: 1. If we cannot call myhashs(s) in task1 and task2 in your script to get the hash value list, there will be a 50% penalty. 2. We will simulate your bloom filter in the grading program simultaneously based on your myhashs(s) outputs. There will be no point if the reported output is largely different from our simulation.

3. Please use integer 553 as the random seed for task3, and follow the steps mentioned below to get a random number. If you use the wrong random seed, or discard any obtained random number, or the sequence of random numbers is different from our simulation, there will be a 50% penalty.

2.4 Write your own code Do not share code with other students!! For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code! TAs will combine all the codes we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.

3. Datasets

For this assignment, you need to download users.txt as the input file. You also need a Python blackbox file to generate data from the input file. Both users.txt and blackbox.py can be found in the publicdata directory on Vocareum. We use the blackbox as a simulation of a data stream. The blackbox will return a list of user ids from the file users.txt every time we call it. Although it is very unlikely that the user ids returned from the blackbox will contain duplicates, you are required to handle that case wherever necessary. Please call the blackbox function as in the provided example; if you need to ask the blackbox multiple times, you can call it repeatedly in a loop.
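
The referenced figure and sample code are not reproduced on this page; the following is a rough sketch that assumes blackbox.py exposes a BlackBox class whose ask(file_name, stream_size) method returns a list of user ids. Please check the provided blackbox.py for the real interface.

    from blackbox import BlackBox   # provided on Vocareum; interface assumed here

    bx = BlackBox()
    stream_size = 100    # e.g., task 1 uses batches of 100 user ids
    num_of_asks = 30     # how many times the stream is requested
    # input_file_name: path to users.txt, taken from the command line
    for batch_index in range(num_of_asks):
        stream_users = bx.ask(input_file_name, stream_size)
        # ... process this batch (Bloom filter / Flajolet-Martin / sampling) ...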

4. Tasks

4.1 Task1: Bloom Filtering (2.5 pts) You will implement the Bloom Filtering algorithm to estimate whether a user_id in the data stream has been seen before. The details of the Bloom Filtering algorithm can be found in the streaming lecture slides. Please choose proper hash functions and the number of hash functions for the Bloom Filtering algorithm.

In this task, you should keep a global filter bit array of length 69997. The hash functions used in a Bloom filter should be independent and uniformly distributed. Some possible hash functions are: f(x) = (ax + b) % m or f(x) = ((ax + b) % p) % m, where p is any prime number and m is the length of the filter bit array. You can use any combination of the parameters (a, b, p). The hash functions should stay the same once you have created them.
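As an illustration only, a family of such hash functions could be generated as follows; the particular prime p, the number of hash functions (k=7), and the seed are arbitrary choices, not values required by the assignment.

    import random

    m = 69997              # length of the global filter bit array (given above)
    p = 1000000007         # an arbitrary large prime

    def make_hash_functions(k, seed=1):
        """Create k hash functions of the form f(x) = ((a*x + b) % p) % m."""
        rng = random.Random(seed)
        params = [(rng.randint(1, p - 1), rng.randint(0, p - 1)) for _ in range(k)]
        # Bind a and b as default arguments so each lambda keeps its own parameters.
        return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

    hash_function_list = make_hash_functions(k=7)   # the number of hash functions is your choice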

As the user_id is a string, you need to convert it to an integer and then apply the hash functions to it. The following code shows one possible way to convert a user_id string to an integer: import binascii; int(binascii.hexlify(s.encode('utf8')), 16) (We only treat exactly identical strings as the same user; you do not need to consider aliases.) Execution Details To calculate the false positive rate (FPR), you need to maintain a set of previously seen users.

The size of a single data stream will be 100 (stream_size), and we will test your code more than 30 times (num_of_asks). Your FPR is allowed to be larger than 0.5 at most once. The run time should be within 100s for 30 data streams.
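A rough sketch of one way to update the filter and compute the FPR for a single batch is shown below. It assumes a global bit array of length 69997 and a myhashs function as described below, and it counts a false positive whenever an unseen user is reported as seen by the filter. This is one reasonable interpretation, not the required implementation.

    bit_array = [0] * 69997     # global filter bit array
    previous_users = set()      # ground-truth set of users seen so far

    def process_batch(stream_users):
        """Return the false positive rate for one batch of user ids."""
        false_positives = 0
        true_negatives = 0
        for user_id in stream_users:
            positions = myhashs(user_id)                              # list of hash values
            predicted_seen = all(bit_array[pos] == 1 for pos in positions)
            actually_seen = user_id in previous_users
            if not actually_seen:
                if predicted_seen:
                    false_positives += 1
                else:
                    true_negatives += 1
            for pos in positions:                                     # update the filter
                bit_array[pos] = 1
            previous_users.add(user_id)
        denominator = false_positives + true_negatives
        return false_positives / denominator if denominator > 0 else 0.0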

Output Results You need to save your results in a CSV file with the header "Time,FPR". Each line stores the index of the data batch (starting from 0) and the false positive rate for that batch. You do not need to round your answer. You also need to encapsulate your hash functions in a function called myhashs. The input of the myhashs function is a user_id (string) and the output is a list of hash values. For example, if you have three hash functions, the size of the output list should be three, and each element in the list corresponds to the output value of one hash function.

A template of the myhashs function is sketched below (the original figure is not reproduced here). Our grading program will also import your Python script, call the myhashs function to test the performance of your hash functions, and track your implementation.
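A minimal sketch consistent with the description (a user_id string in, a list of hash values out); the binascii conversion and the hash_function_list from the earlier sketch are implementation choices, not requirements.

    import binascii

    def myhashs(s):
        """Map a user_id string to a list of hash values, one per hash function."""
        x = int(binascii.hexlify(s.encode('utf8')), 16)    # string -> integer
        return [f(x) for f in hash_function_list]          # one value per hash function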

4.2 Task2: Flajolet-Martin algorithm (2.5 pts) In task2, you will implement the Flajolet-Martin algorithm (including the step of combining estimations from groups of hash functions) to estimate the number of unique users within a window in the data stream. The details of the Flajolet-Martin algorithm can be found in the streaming lecture slides.

You need to choose proper hash functions and the number of hash functions for the Flajolet-Martin algorithm. Execution Details For this task, the size of the stream will be 300 (stream_size), and we will test your code more than 30 times (num_of_asks). For your final result, it must hold that 0.2 <= (sum of all your estimations / sum of all ground truths) <= 5. The run time should be within 100s for 30 data streams. Output Results You need to save your results in a CSV file with the header "Time,Ground Truth,Estimation".

Each line stores the index of the data batch (starting from 0), the actual number of unique users in the window, and the estimate from the Flajolet-Martin algorithm. You also need to encapsulate your hash functions in a function called myhashs(s). The input of the myhashs function is a user_id (string) and the output is a list of hash values.

For example, if you have three hash functions, then the size of the output list should be three, and each element in the list corresponds to the output value of one hash function. The myhashs template has the same shape as in task1 (the original figure is not reproduced here). Our grading program will also import your Python script, call the myhashs function to test the performance of your hash functions, and track your implementation.
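The sketch below shows one possible shape of the Flajolet-Martin estimation for a single window, assuming a myhashs function as in task1. The number of hash functions, the grouping, and the mean-within-group / median-across-groups combination are illustrative choices you would tune yourself.

    import statistics

    def trailing_zeros(x):
        """Number of trailing zero bits in the binary representation of x (0 by convention for x == 0)."""
        if x == 0:
            return 0
        count = 0
        while x & 1 == 0:
            x >>= 1
            count += 1
        return count

    def fm_estimate(stream_users, num_groups=3, hashes_per_group=5):
        """Estimate distinct users in one window; assumes len(myhashs(u)) == num_groups * hashes_per_group."""
        num_hashes = num_groups * hashes_per_group
        max_zeros = [0] * num_hashes
        for user_id in stream_users:
            for i, h in enumerate(myhashs(user_id)):
                max_zeros[i] = max(max_zeros[i], trailing_zeros(h))
        estimates = [2 ** r for r in max_zeros]
        # Average the estimates within each group, then take the median across groups.
        group_means = [
            sum(estimates[g * hashes_per_group:(g + 1) * hashes_per_group]) / hashes_per_group
            for g in range(num_groups)
        ]
        return int(statistics.median(group_means))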

4.3 Task3: Fixed Size Sampling (2 pts) The goal of task3 is to implement the fixed-size sampling method (Reservoir Sampling algorithm). In this task, we assume that memory can only hold 100 users, so we need to use fixed-size sampling to keep only part of the users as a sample of the stream. When the stream of users arrives, you can directly save the first 100 users in a list. After that, for the n-th (n starts from 1) user in the whole sequence, you keep the n-th user with probability 100/n and otherwise discard it. If you keep the n-th user, you randomly pick one user in the list to be replaced.

You also need to keep a global variable representing the sequence number of the users. The submission report on Vocareum will show both Python and Scala results only for this task, since the outputs generated by the Python and Scala scripts will differ. Execution Details For this task, the size of the stream will be 100 (stream_size), and we will test your code more than 30 times (num_of_asks). Be careful here: please call random.seed(553) in the main function. Please do not call random.seed(553) anywhere else.

The run time should be within 100s for 30 data streams. Output Results: Every time you receive 100 users, you should print the current state of your reservoir to a CSV file. For example, after receiving the 100th user from the stream, your code should decide whether the reservoir keeps it (replacing a user in the list) or not, then output the current state of the reservoir according to the following format and start a new line.

For each line, the first column is the sequence number (starting from 1) of the latest user in the entire stream, followed by the 1st user (index 0 in your list), the 21st user, the 41st user, the 61st user, and the 81st user in your reservoir. (The original figure showing an example of the printed stream information is not reproduced here.) Important Instructions for task3: We will compare the output of your code to the ground truth; your output must be exactly the same as the ground truth output to get full scores.

We will not provide the ground truth output, but if you follow the instructions below correctly, you will get the correct results easily. For Python: 1. Call random.seed(553) in the main function. 2. To decide whether to accept or discard a sample, use random.random(), which generates a floating-point number between 0 and 1. If this randomly generated probability is less than s/n, accept the sample. 3. If you decide to accept a sample, you need to find an index in the array for replacement.

For this purpose, use random.randint() with appropriate boundaries to generate an index into the array and use it to replace the sample. For Scala: 1. The Scala implementation is very similar to Python, but since random number generation works differently, the generated output will be different. 2. Use the scala.util.Random class for random number generation. Please instantiate only one instance of this class and use it everywhere in your task3 class. 3. Set the seed to 553 immediately after creating the object.

4. Use random_object.nextFloat() to generate the acceptance/discard probability, similar to Python. 5. Use random_object.nextInt() with appropriate boundary parameters to generate an index for replacement. 6. Sample code for the BlackBox: since we cannot import the Python blackbox.py in Scala, please use the reference Scala code provided with the assignment. You can add this code to your task3.scala file itself and make an object of the Blackbox class within your main task3 class.

You can then use this object to request a new stream array of size "num" by calling the ask function, as in the Python implementation. Here is an example: var stream = box.ask(input_file_path, stream_size) Please do not use the "mod" operator to limit the index; set the appropriate boundaries in the random number generation function itself. If you use the wrong random seed, discard any obtained random number, or generate a sequence of random numbers different from our simulation, there will be a 50% penalty.
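Putting the Python rules above together, a minimal sketch of the task3 loop could look like the following. The variable names, the example argument values, and the exact way the output lines are assembled are assumptions, while the seeding and the random calls follow the instructions above.

    import random
    from blackbox import BlackBox   # provided blackbox.py

    if __name__ == "__main__":
        random.seed(553)                                   # seed exactly once, in main
        input_file_path, stream_size, num_of_asks = "users.txt", 100, 30   # example values

        box = BlackBox()
        reservoir = []
        n = 0                                              # sequence number of the latest user
        output_lines = []
        for _ in range(num_of_asks):
            for user_id in box.ask(input_file_path, stream_size):
                n += 1
                if n <= 100:
                    reservoir.append(user_id)              # fill the first 100 slots directly
                elif random.random() < 100 / n:            # keep the n-th user with probability 100/n
                    reservoir[random.randint(0, 99)] = user_id   # replace a randomly chosen slot
            # After every 100 users, record the current state of the reservoir.
            output_lines.append(",".join([str(n), reservoir[0], reservoir[20],
                                          reservoir[40], reservoir[60], reservoir[80]]))
        # Finally, write output_lines to the output CSV in the required format.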

4.4 Execution Format
Task1:
python task1.py stream_size num_of_asks
/home/local/spark/latest/bin/spark-submit --class task1 --executor-memory 4G --driver-memory 4G hw5.jar stream_size num_of_asks
Task2:
python task2.py stream_size num_of_asks
/home/local/spark/latest/bin/spark-submit --class task2 --executor-memory 4G --driver-memory 4G hw5.jar stream_size num_of_asks
Task3:
python task3.py stream_size num_of_asks
/home/local/spark/latest/bin/spark-submit --class task3 --executor-memory 4G --driver-memory 4G hw5.jar stream_size num_of_asks
5. Submission You need to submit the following files on Vocareum with exactly the same names: task1.py, [task1.scala], task2.py, [task2.scala], task3.py, [task3.scala], [hw5.jar]

6. Grading Criteria (% penalty = % penalty of possible points you get)

1. You can use your free 5-day extension separately or together, but not later than one week after the deadline. Please request the late days using the form. 2. There will be a 10% bonus for a correct Scala implementation provided the Python implementation works correctly. 3. If we cannot run your programs with the command we specified, there will be no regrading. 4. If we cannot call myhashs(s) in your script to get the hash value list, there will be a 50% penalty.

5. While your program is running, we will simulate it in our grading program simultaneously based on your myhashs(s) outputs. There will be no points if the reported output is largely different from our simulation. 6. If you use the wrong random seed, discard any obtained random number, or generate a sequence of random numbers different from our simulation, there will be a 50% penalty. 7. There will be a 20% penalty for late submission within a week and no points after a week.

7. Common problems causing failed submissions on Vocareum / FAQ (If your program runs successfully on your local machine but fails on Vocareum, please check these) 1. Try your program in the Vocareum terminal. Remember to set the Python version to python3.6: export PYSPARK_PYTHON=python3.6 And use the latest Spark: /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit 2. Check the input command-line format.

3. Check the output format, for example, the header, tag, and typos. 4. Check the requirements for sorting the results. 5. Your program scripts should be named task1.py, task2.py, etc. 6. Check whether your local environment matches the assignment description, i.e. versions and configuration.

7. If you implement the core part in plain Python instead of Spark, or implement it in a high-time-complexity way (e.g., searching for an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.

DSCI553 Assignment 6 Clustering

1. Overview of the Assignment

In Assignment 6, you will implement the Bradley-Fayyad-Reina (BFR) algorithm. The goal is to get you familiar with the clustering process in general and with various distance measurements. The dataset you are going to use is a synthetic dataset.

2. Assignment Requirements

2.1 Programming Language and Library Requirements You must use Python to implement the algorithm. You can only use the following external Python libraries: numpy and sklearn.

2.2 Programming Environment Python 3.6, JDK 1.8, Scala 2.12 and Spark 3.1.2 We will use these library versions to compile and test your code. You are required to make sure your code works and runs on Vocareum; otherwise we won't be able to grade it.

2.3 Write your own code Do not share your code with other students!! We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.

3. Dataset

Since the BFR algorithm has a strong assumption that the clusters are normally distributed with independent dimensions, we generated synthetic datasets by initializing some random centroids and creating some data points with the centroids and some standard deviations to form the clusters. We also added some other data points as outliers in the dataset to evaluate the algorithm.

Data points which are outliers belong to clusters that are named or indexed as "-1". Figure 1 shows an example of part of the dataset. The first column is the data point index. The second column is the name/index of the cluster that the data point belongs to. The remaining columns are the features/dimensions of the data point. (Figure 1: an example of the dataset; not reproduced here.) a. hw6_clustering.txt is the synthetic clustering dataset.

The dataset is available on Vocareum (public data folder). b. We generate the testing dataset using a similar method. Note that the number of dimensions may differ from hw6_clustering.txt. We do not share the testing dataset.

4. Task

You will implement the Bradley-Fayyad-Reina (BFR) algorithm to cluster the data contained in hw6_clustering.txt. In BFR, there are three sets of points that you need to keep track of: the Discard set (DS), the Compression set (CS), and the Retained set (RS). Each cluster in the DS and CS is summarized by:
N: the number of points
SUM: the sum of the coordinates of the points
SUMSQ: the sum of squares of the coordinates
The conceptual steps of the BFR algorithm are listed below (please refer to the slides for details):

Implementation details of the BFR algorithm: (just for your reference, the number of input clusters = n_cluster parameter given as input) Step 1. Load 20% of the data randomly.

Step 2. Run K-Means (e.g., from sklearn) with a large K (e.g., 5 times the number of input clusters) on the data in memory, using the Euclidean distance as the similarity measurement.

Step 3. In the K-Means result from Step 2, move all the clusters that contain only one point to RS (outliers).

Step 4. Run K-Means again to cluster the rest of the data points with K = the number of input clusters.

Step 5. Use the K-Means result from Step 4 to generate the DS clusters (i.e., discard their points and generate statistics). The initialization of the DS is now finished; so far, you have K DS clusters (from Step 5) and some RS points (from Step 3).

Step 6. Run K-Means on the points in the RS with a large K (e.g., 5 times the number of input clusters) to generate CS (clusters with more than one point) and RS (clusters with only one point).

Step 7. Load another 20% of the data randomly.

Step 8. For the new points, compare them to each of the DS clusters using the Mahalanobis Distance and assign them to the nearest DS cluster if the distance is < 2√d (where d is the number of dimensions).

Step 9. For the new points that are not assigned to DS clusters, use the Mahalanobis Distance and assign them to the nearest CS cluster if the distance is < 2√d.

Step 10. For the new points that are not assigned to a DS cluster or a CS cluster, assign them to RS.

Step 11. Run K-Means on the RS with a large K (e.g., 5 times the number of input clusters) to generate CS (clusters with more than one point) and RS (clusters with only one point).
Step 12. Merge CS clusters that have a Mahalanobis Distance < 2√d.
Repeat Steps 7 – 12. If this is the last run (after the last chunk of data), merge CS clusters with DS clusters that have a Mahalanobis Distance < 2√d.
At each run, including the initialization step, you need to count and output the number of discard points, the number of clusters in the CS, the number of compression points, and the number of points in the retained set. A sketch of the summary statistics and the Mahalanobis test used in Steps 8, 9, and 12 is given below.
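This sketch uses numpy (allowed for this assignment); the class and function names are illustrative rather than a required interface, and the zero-variance guard is an assumption.

    import numpy as np

    class ClusterSummary:
        """N / SUM / SUMSQ summary of one DS or CS cluster."""
        def __init__(self, points):
            pts = np.asarray(points, dtype=float)
            self.N = pts.shape[0]
            self.SUM = pts.sum(axis=0)
            self.SUMSQ = (pts ** 2).sum(axis=0)

        def add_point(self, point):
            point = np.asarray(point, dtype=float)
            self.N += 1
            self.SUM += point
            self.SUMSQ += point ** 2

        def centroid(self):
            return self.SUM / self.N

        def variance(self):
            # Per-dimension variance: E[x^2] - (E[x])^2
            return self.SUMSQ / self.N - (self.SUM / self.N) ** 2

    def mahalanobis(point, summary):
        """Mahalanobis distance from a point to a cluster summary (independent dimensions)."""
        point = np.asarray(point, dtype=float)
        var = np.maximum(summary.variance(), 0.0)          # clip tiny negative values from floating point
        std = np.sqrt(var)
        std[std == 0] = 1e-9                               # guard against zero variance
        return float(np.sqrt((((point - summary.centroid()) / std) ** 2).sum()))

    def nearest_summary(point, summaries):
        """Return the nearest DS/CS summary if its distance is below 2 * sqrt(d), else None."""
        d = len(point)
        if not summaries:
            return None
        best = min(summaries, key=lambda s: mahalanobis(point, s))
        return best if mahalanobis(point, best) < 2 * (d ** 0.5) else None

For the K-Means steps themselves, something along the lines of sklearn.cluster.KMeans(n_clusters=5 * n_cluster).fit(points) can be used, with labels_ giving the cluster assignment of each point.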

Input format: (we will use the following command to execute your code) python3 task.py <input_file> <n_cluster> <output_file> Param input_file: the name of the input file (e.g., hw6_clustering.txt), including the file path. Param n_cluster: the number of clusters. Param output_file: the name of the output txt file, including the file path.
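A tiny sketch of reading these parameters, assuming they are passed in the order listed above (the exact command-line layout here is a reconstruction):

    import sys

    input_file = sys.argv[1]          # e.g., hw6_clustering.txt, including the file path
    n_cluster = int(sys.argv[2])      # number of clusters
    output_file = sys.argv[3]         # output txt file, including the file path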

Output format: The output file is a text file containing the following information (see Figure 2): a. The intermediate results (the line is named "The intermediate results"). Each subsequent line should start with "Round {i}:", where i is the count for the round (including the initialization, i.e., the initialization is "Round 1:"). You need to output the numbers in the order: "the number of the discard points", "the number of the clusters in the compression set", "the number of the compression points", and "the number of the points in the retained set".

Leave one blank line before writing out the cluster results. b. The clustering results (the line is named "The clustering results"), including the data point indexes and their cluster assignments after the BFR algorithm. The cluster labels should be in [0, the number of clusters). The cluster of outliers should be represented as -1. (Figure 2: the output example for the text file; not reproduced here.) Grading: a. We will compare the percentage of discard points after the last round to our threshold.

Dataset: hw6_clustering.txt — required percentage of discard points after the last round: 98.5%. b. We will compare the centroids of the clusters you find to those in the ground truth. c. We will compute the accuracy of your clustering results against the ground truth, i.e., the percentage of data points (excluding the outliers) that you assign to the correct cluster, using the following formula:
Accuracy = (the total number of points assigned to the correct clusters) / (the total number of points in the ground truth)
Dataset: hw6_clustering.txt — required accuracy: 98.0%. d. No points if your runtime is more than or equal to 600 seconds.

Execution format:
Task:
python3 task.py <input_file> <n_cluster> <output_file>
/home/local/spark/latest/bin/spark-submit --class task --executor-memory 4G --driver-memory 4G hw6.jar <input_file> <n_cluster> <output_file>
5. Submission You need to submit the following files on Vocareum with exactly the same names: a. One Python script, named (all lowercase): task.py b. [OPTIONAL] One Scala script and its jar, named (all lowercase): 1. task.scala 2. hw6.jar c. You don't need to include your results or the datasets. We will grade your code with our testing data (data will be in the same format).

6. Grading Criteria (% penalty = % penalty of the possible points you get) 1. You can use your free 5-day extension separately or together, but not later than one week after the deadline. [Google form link: https://forms.gle/azNRyKseppHecLYn9] 2. There will be a 10% bonus for a correct Scala implementation provided the Python implementation works correctly.

3. If we cannot run your programs with the command we specified, there will be no regrading. 4. If your program cannot run with the required Python versions, there will be a 20% penalty. 5. There will be a 20% penalty for late submissions within a week and no points after a week.