CS 6140 Asmt 2: Document Similarity and Hashing solution

$35.00

Original Work ?
Category: Tags: , , , , , , , You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (3 votes)

Overview

In this assignment you will explore the use of k-grams, Jaccard distance, min hashing, and LSH in the
context of document similarity.

You will use four text documents for this assignment:

• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D1.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D2.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D3.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D4.txt
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/

1 Creating k-Grams (40 points)

You will construct several types of k-grams for all documents. All documents only have at most 27 characters: all lower case letters and space.
[G1] Construct 2-grams based on characters, for all documents.
[G2] Construct 3-grams based on characters, for all documents.
[G3] Construct 3-grams based on words, for all documents.
Remember, that you should only store each k-gram once, duplicates are ignored.

A: (20 points) How many distinct k-grams are there for each document with each type of k-gram? You
should report 4 × 3 = 12 different numbers.

B: (20 points) Compute the Jaccard similarity between all pairs of documents for each type of k-gram.
You should report 3 × 6 = 18 different numbers.

2 Min Hashing (30 points)

We will consider a hash family H so that any hash function h ∈ H maps from h : {k-grams} → [m] for m
large enough (I suggest over m ≥ 10,000).

A: (25 points) Using grams G2, build a min-hash signature for document D1 and D2 using t = {10, 50, 100, 250, 500}
hash functions. For each value of t report the approximate Jaccard similarity between the pair of documents
D1 and D2, estimating the Jaccard similarity:
JSˆ
t(a, b) = 1
t
X
t
i=1 (
1 if ai = bi
0 if ai 6= bi
.
You should report 5 numbers.

 

B: (5 point) What seems to be a good value for t? You may run more experiments. Justify your answer in
terms of both accuracy and time.

3 LSH (30 points)

Consider computing an LSH using t = 100 hash functions. We want to find all documents which have
Jaccard similarity above τ = .4.

A: (8 points) Use the trick mentioned in class and the notes to estimate the best values of hash functions
b within each of r bands to provide the S-curve f(s) = 1 − (1 − s
r
)
b
f(s) = 1 − (1 − s
b
)
r
with good separation at τ . Report these values.

The values of r and b we mixed up before. You can report either, but please be clear which means the #
of bands (in blue = r) and the # number of hashes per band (in blue = b). This is now consistent with the
notes, but reverse of the book.

B: (24 points) Using your choice of r and b and f(·), what is the probability of each pair of the four
documents (using [G2]) for being estimated to having similarity greater that τ ? Report 6 numbers. (Show
your work.)

4 Bonus (3 points)

Describe a scheme like Min-Hashing for the Andberg Similarity, defined Andb(A, B) = |A∩B|
|A∪B|+|A4B|
. So
given two sets A and B and family of hash functions, then Prh∈H[h(A) = h(B)] = Andb(A, B). Note the
only randomness is in the choice of hash function h from the set H, and h ∈ H represents the process of
choosing a hash function (randomly) from H. The point of this question is to design this process, and show
that it has the required property.
Or show that such a process cannot be done.