Description

5/5 - (1 vote)

Summary

In this homework, you’ll write code that indexes words from multiple text les, and outputs an inverted
index that looks like this:

Description

You will need a collection of text les to index, they are. After you unzip it, you’ll get two directories
with text les in them: ‘devdata’ (with 5 les), ‘fulldata’ (with 74 les).

The input data is already cleaned, that is all the \n\r characters are removed – but one or more \t chars
might still be present (which needs to be handled). There is punctuation, and you are required to handle
this in your code: replace all the occurrences of special (punctuation) characters and numerals with the
space character, and convert all the words to lowercase. A single ‘\t’ separates the key (docID) from the
value (actual document contents). In other words, the input les are in a key-value format where docID is
the key, and the contents are the value; the key and value are separated by a tab:
here

The above format is to help you build the inverted index easily – the lename (docID) is in the le’s text so
you can simply extract it.
As you know, Google invented MapReduce, to help them do this on a massive scale. For this HW, we could
have based it on GCP/Azure/AWS/… where you would upload the les to the cloud and launch MapReduce
jobs there. To get practice doing this, please do try it own your own, after this course.

But for this HW, there is a much simpler way! It’s this [contains code I got from elsewhere, and the custom
environment I set up to run it (ie. the minimal collection of .jar les needed, including the one for
MapReduce)]: start with – it is a self-contained,
complete, minimal MapReduce example that counts words in two input documents. Study it thoroughly –
see what les are used (code, data, cong), how the code is organized (a single class called WordCount),
how it is run (we specify input and output folders). Be sure to look at the three columns: les on the left, le
contents in the center, execution on the right.

Fork it,to do your homework – in other words, do it on your
own repl.it area (you need to sign up for a free account). FYI, repl.it is built on top of GCP:
[how cool!].

For the HW, you need to create a unigram index, and a bigram one. Details are in the two sections that
follow.

Unigram index

You will need to create a le called unigram_index.txt, containing words from les in fulldata.
Modify the mapper in the repl.it link above , to output(word, docID), instead of what it currently outputs
for word counting, which is (word, count); also, in your reducer.

Bigram index

You’ll create a le called selected_bigram_index.txt, containing the inverted index for just these ve
bigrams, using les in devdata:
computer science
information retrieval
power politics
los angeles
bruce willis

Modify your mapper, to output (word1 word2, docID) pairs, rather than the (word, docID) pairs you had in
the unigram task. There is no need to change your reducer.

Extra details
If your execution takes way too long, or crashes, you can simply make the data les smaller by deleting text,
starting from the end. Each le contains 500000+ words, you can make it as small as 50000; but do keep
the le count the same, ie. use all the les in devdata as well as fulldata.

You could ‘test’ your code using the devdata collection for unigrams as well, then do it ‘for real’ (in
‘production mode’) on the fulldata set of les. Or, for both unigrams and bigrams, you can use your own set
of les, eg. a.txt, b.txt… e.txt (5 les), each with a paragraph from

https://cloud.google.com/customers/repl-it
use a HashMap data structure
https://www.gutenberg.org/les/74/74-

[:)] – that way your code will run quite fast and output results, so you can make rapid alterations in a
shortened develop-run-debug cycle. To develop+test your code, you can even simply use the two sample
les I have in the repl.

If you like, you can read/do the ‘ofcial’ MapReduce tutorial
UPDATE, 3/23: is a collection of shorter data les [each le has fewer words] – you have the choice of
using these instead of the ‘big’ les. The lenames, and the le counts are the same as for the
(‘big’) set of les, the only diff is that there are fewer words in each. For the bigrams, you’ll
get different counts compared to the originals, but that shouldn’t/doesn’t matter – if you use these smaller
les, just make sure you’re creating a bigram index that shows les and counts for ‘los angeles’ etc. Many
les end with a chopped off word, eg. ‘sabb’, but that’s ok. Also, if you’re curious, here’s how I the
fulldata/* les 🙂

Rubrics
Your (max 10) points will come from fullling these:
4 points for the unigram index entries, contained in unigram_index.txt
4 points for the index entries to the words mentioned in the bigrams section above, contained in selected_bigram_index.txt

1 point for screenshots of the output folder (the output folder is what you specify as the second argument while running the job)
for the job for unigrams, and bigrams (two screenshots)
1 point for your code, for unigrams and bigrams (two source les)

Getting help
There is a hw3 ‘forum’ on Piazza, for you to post questions/answers. You can also meet w/ the TAs, CPs, or
me.
Hope you have fun doing the HW 🙂
0.txt
here.
Here
original/previous
shortened

Solved CSCI572 HW3: Inverted-index creation Summary In this homework, you’ll write code that indexes words

Download Details:

Description

Summary

Description

Unigram index

Bigram index

Solved CSCI572 HW3: Inverted-index creation Summary In this homework, you’ll write code that indexes words

Download Details:

Description

Summary

Description

Unigram index

Bigram index

Related products

CSCI572 Homework 3: Comparing Search Engine Ranking Algorithms solution

Solved CSCI572 Web Search Engine Comparison This exercise is about comparing the search results from Google

Hadoop Exercise to Create an Inverted Index solution