CIND719 Assignment #2 Pig Programming solution

Dataset: Twitter full_text.txt

Questions:

1) Find the hour of the day when the highest number of tweets was generated by users on March 6, 2010

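-- Load the tweets; with no USING clause, load reads tab-delimited fields via PigStorage.
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Keep only tweets whose timestamp starts with 2010-03-06.
b = filter a by ts MATCHES '2010-03-06.*';
-- Extract the hour; ToDate with one argument expects an ISO 8601 string.
c = foreach b generate GetHour(ToDate(ts)) as hourofday;
-- Count tweets per hour, then keep the hour with the highest count.
d = group c by hourofday;
e = foreach d generate group, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 1;
dump g;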

2) Find the top 10 topics (#hashtags)

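-- Load the tweets; with no USING clause, load reads tab-delimited fields via PigStorage.
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Lower-case each tweet and split it into one token per row.
i = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token;
-- Keep only tokens that begin with '#'.
j = filter i by STARTSWITH(token, '#');
-- Count occurrences of each hashtag and keep the 10 most frequent.
k = group j by token;
l = foreach k generate group, COUNT(j) as cnt2;
m = order l by cnt2 desc;
n = limit m 10;
dump n;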

3) Find the top 10 mentions (@xxxxxxx)

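-- Load the tweets; with no USING clause, load reads tab-delimited fields via PigStorage.
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Lower-case each tweet and split it into one token per row.
o = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token2;
-- Keep tokens made of '@user_', 8 word characters, and any trailing characters other than ':'.
p = filter o by token2 MATCHES '@user_\\w{8}[^:]*';
-- Count occurrences of each mention and keep the 10 most frequent.
q = group p by token2;
r = foreach q generate group, COUNT(p) as cnt3;
s = order r by cnt3 desc;
t = limit s 10;
dump t;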

Submission:

Pig Latin scripts, uploaded as a PDF or text file
Output of each query