COMP9313 Project 1: C2LSH algorithm in PySpark


Task1: C2LSH (90 points)
In this question, you will implement the C2LSH algorithm in PySpark. Specifically, you are required to write a method c2lsh() in the file submission.py that takes the following four arguments as input:

data_hashes: an RDD where each element (i.e., a key-value pair) corresponds to (id, data_hash). id is an integer, and data_hash is a Python list of integers (i.e., the hash values of the data point).
query_hashes: a Python list of integers (i.e., the hash values of the query).
alpha_m: an integer indicating the minimum number of hash values that must collide between a data point and the query.
beta_n: an integer indicating the minimum number of candidates to be returned.
Note:

You don’t need to implement hash functions and generate hashed data, we will provide the data hashes for you.
Please follow the description of the algorithm provided in the lecture notes, which is slightly different from the original C2LSH paper.
One of the main purposes of this project is to use Spark to solve the problem. Therefore, it is meaningless to circumvent PySpark and solve it in other ways (e.g., collecting the data and implementing the algorithm without transformations). Any such attempt will be considered an invalid implementation and will be assigned a ZERO score. Specifically, you are not allowed to use the following PySpark functions:
aggregate, treeAggregate,aggregateByKey
collect, collectAsMap
countByKey, countByValue
foreach
reduce, treeReduce
saveAs* (e.g. saveAsTextFile)
take* (e.g. take, takeOrdered)
top
fold
Return Format
The c2lsh() method returns an RDD containing a sequence of candidate ids.

Notice: The order of the elements in the list does not matter (e.g., we will collect the elements and evaluate them as a set).
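To make the required counting step concrete, here is a minimal, plain-Python sketch of one common C2LSH collision rule: a hash position counts as a collision when the data's hash value lies within some search offset of the query's. The function name `count_collisions` and the parameter `offset` are illustrative only; the exact rule you must implement is the one in the lecture notes:

```python
def count_collisions(data_hash, query_hash, offset):
    """Count hash positions where the data value falls within
    `offset` of the query value (one common C2LSH collision rule;
    follow your lecture notes for the exact definition)."""
    return sum(
        1
        for d, q in zip(data_hash, query_hash)
        if abs(d - q) <= offset
    )

# A data point becomes a candidate once its collision count reaches alpha_m;
# the offset is widened iteratively until at least beta_n candidates are found.
data_hash = [3, 7, 2, 9]
query_hash = [3, 5, 2, 10]
print(count_collisions(data_hash, query_hash, 0))  # 2 (exact matches only)
print(count_collisions(data_hash, query_hash, 1))  # 3 (within +/-1)
```

In your actual submission this per-point logic would run inside an RDD transformation (e.g., a map or filter over data_hashes), never on collected data.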

Evaluation
Your implementation will be tested using 3 different test cases. We will be evaluating based on the following factors:

the correctness of your implemented c2lsh(). The output will be compared with the result from a correct implementation; any difference will be considered incorrect.
the efficiency of your implementation. We will calculate the running time of c2lsh() in each test case (denoted as T).
For each testcase (worth 30 points), the following marking criteria will be used:

Case 1, 0 points: the returned RDD is incorrect; or
Case 2, 10 points: the returned RDD is correct, and T ≥ T1;
Case 3, 20 points: the returned RDD is correct, and T2 ≤ T < T1;
Case 4, 30 points: the returned RDD is correct, and T < T2.
Here the thresholds T1 and T2 depend on the testing environment and the test cases.

Task 2: Report (10 points)
You are also required to submit your project report, named: report.pdf. Specifically, in the report, you are at least expected to answer the following questions:

Implementation details of your c2lsh(). Explain how your major transform function works.
Show the evaluation result of your implementation using your own test cases.
What did you do to improve the efficiency of your implementation?
Bonus
In order to encourage students to come up with efficient implementations, the project includes a bonus part worth a maximum of 20 points. The rules for the BONUS part are as follows:

Prerequisites:
You must have obtained 90 points for the implementation part.
The total running time of your implementation for the three test cases is among the top-50 smallest running times of the class.
All the submissions, satisfying the above-mentioned conditions will be tested against a more challenging dataset. Top-20 most-efficient and correct implementations will be awarded the bonus scores.
We will rank the top-20 implementations in an increasing order w.r.t the running time. We will award 20 points to the most efficient implementation (i.e., the one with smallest running time), 19 points to the 2nd most efficient one, and so on. The implementation ranked 20-th on the list will get 1 bonus point.
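The rank-to-points mapping above can be written as a one-liner. This is a sketch of our reading of the rule, not official marking code:

```python
def bonus_points(rank):
    """Bonus for a top-20 ranked implementation:
    rank 1 -> 20 points, rank 20 -> 1 point, otherwise 0
    (our reading of the rule above)."""
    return 21 - rank if 1 <= rank <= 20 else 0

print(bonus_points(1))   # 20
print(bonus_points(20))  # 1
```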
How to execute your implementation (EXAMPLE)
from pyspark import SparkContext, SparkConf
from time import time
import pickle
import submission

def createSC():
    conf = SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("C2LSH")
    sc = SparkContext(conf=conf)
    return sc

with open("toy/toy_hashed_data", "rb") as file:
    data = pickle.load(file)

with open("toy/toy_hashed_query", "rb") as file:
    query_hashes = pickle.load(file)

print(query_hashes)

alpha_m = 10
beta_n = 10

sc = createSC()
data_hashes = sc.parallelize([(index, x) for index, x in enumerate(data)])
start_time = time()
res = submission.c2lsh(data_hashes, query_hashes, alpha_m, beta_n).collect()
end_time = time()
sc.stop()

# print('running time:', end_time - start_time)
print('Number of candidate: ', len(res))
print('set of candidate: ', set(res))
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
Number of candidate: 10
set of candidate: {0, 70, 40, 10, 80, 50, 20, 90, 60, 30}
Project Submission and Feedback
For the project submission, you are required to submit the following files:

Your implementation in the python file submission.py.
The report report.pdf.
Detailed instructions on using give to submit the project files will be announced later via Piazza.