Description
Dataset: Twitter full_text.txt
Questions:
1) Find the hour of the day in which users generated the highest number of tweets on March 6, 2010
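-- Load the tweets (PigStorage default: tab-separated fields).
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Keep only tweets whose timestamp starts with 2010-03-06.
b = filter a by ts MATCHES '2010-03-06.*';
-- Extract the hour of day. ToDate(ts) assumes ts is in a format it can parse;
-- if it is not, pass an explicit format string as a second argument to ToDate.
c = foreach b generate GetHour(ToDate(ts)) as hourofday;
-- Count tweets per hour and keep the busiest hour.
d = group c by hourofday;
e = foreach d generate group, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 1;
dump g;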
2) Find the top 10 topics (#hashtags)
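-- Load the tweets (PigStorage default: tab-separated fields).
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Lowercase each tweet and split it into one token per row.
i = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token;
-- Keep only tokens that begin with '#'.
j = filter i by STARTSWITH(token, '#');
-- Count occurrences of each hashtag and keep the 10 most frequent.
k = group j by token;
l = foreach k generate group, COUNT(j) as cnt2;
m = order l by cnt2 desc;
n = limit m 10;
dump n;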
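A possible refinement, offered only as a sketch: TOKENIZE splits on a small set of delimiters (spaces, commas, and a few others), so a hashtag followed by punctuation, such as "#pig!" versus "#pig", would be counted as a separate topic by the script above. The variant below trims each token to its leading '#word' part with the built-in REGEX_EXTRACT before counting. The alias names (tokens, tags, tags_ok, by_tag, tag_counts, ranked, top10) are illustrative and not part of the assignment, and the pattern assumes hashtags consist of ASCII word characters only.

```pig
-- Sketch only: same input path and schema as the script above.
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
tokens = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token;
-- Capture '#' plus the word characters that follow it at the start of the token;
-- REGEX_EXTRACT yields null when the token does not match.
tags = foreach tokens generate REGEX_EXTRACT(token, '^(#\\w+).*', 1) as tag;
tags_ok = filter tags by tag is not null;
-- Count each normalized hashtag and keep the 10 most frequent.
by_tag = group tags_ok by tag;
tag_counts = foreach by_tag generate group as tag, COUNT(tags_ok) as cnt;
ranked = order tag_counts by cnt desc;
top10 = limit ranked 10;
dump top10;
```

Whether this normalization is wanted depends on how strictly "topic" is meant to be read; the counts will differ from the unnormalized script whenever hashtags carry trailing punctuation.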
3) Find the top 10 mentions (@xxxxxxx)
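-- Load the tweets (PigStorage default: tab-separated fields).
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
-- Lowercase each tweet and split it into one token per row.
o = foreach a generate FLATTEN(TOKENIZE(LOWER(tweet))) as token2;
-- Keep tokens of the form @user_ plus 8 word characters,
-- optionally followed by any characters other than ':'.
p = filter o by token2 MATCHES '@user_\\w{8}[^:]*';
-- Count occurrences of each mention and keep the 10 most frequent.
q = group p by token2;
r = foreach q generate group, COUNT(p) as cnt3;
s = order r by cnt3 desc;
t = limit s 10;
dump t;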
Submission:
Pig Latin scripts, uploaded as a PDF or text file
Output of each query
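One way to capture a query's output for submission, sketched here for question 1: replace the final dump with a store statement so the result is written to a file. The output directory /user/pig/output/q1 is a hypothetical path chosen for illustration; Pig expects it not to exist before the script runs.

```pig
-- Sketch only: question 1 pipeline, writing the result instead of dumping it.
a = load '/user/pig/full_text.txt' as (id: chararray, ts: chararray, location: chararray, lat: float, lon: float, tweet: chararray);
b = filter a by ts MATCHES '2010-03-06.*';
c = foreach b generate GetHour(ToDate(ts)) as hourofday;
d = group c by hourofday;
e = foreach d generate group, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 1;
-- Hypothetical output location; the result is written as tab-separated text.
store g into '/user/pig/output/q1' using PigStorage('\t');
```

The same pattern would apply to the other two queries by storing n and t into their own directories.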