Assignment #4 CSCI 5408 (Data Management, Warehousing, Analytics)





Problem #1: This problem contains two tasks.
Task 1: Cluster Setup – Apache Spark Framework on GCP

Using your GCP cloud account, configure and initialize an Apache Spark cluster. This cluster will be
used for Problem #2.
(Follow the tutorials provided in the lab sessions.)
Create a flowchart or write a ½-page explanation of how you completed the task, and include it in
your PDF file.
Task 2: Data Extraction and Preprocessing Engine: Sources – Twitter messages

Steps for Twitter Operation

Step 1: Create a Twitter developer account
Step 2: Explore the documentation of the Twitter Search and Streaming APIs and the required data formats.
In your own words, write a ½-page summary of your findings.
Winter 2022
Step 3: Write a well-formed Java program (Extraction Engine) to extract data from Twitter.
Execute/run the program on your local machine. You may use the Search API, the Streaming API, or both.
(Do not use any online program code. You may only use the API sample code given in the
official Twitter documentation.)
o The search keywords are “mask”, “cold”, “immune”, “vaccine”, “flu”, “snow”.
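As a sketch of how the Extraction Engine might call the Search API, the following minimal example assumes Java 11+, a bearer token supplied via a TWITTER_BEARER_TOKEN environment variable, and the v2 recent-search endpoint and parameter names from Twitter's official documentation; none of these choices are mandated by the assignment:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TweetExtractor {

    // Join the assignment keywords into one OR query and URL-encode it.
    static String buildQuery(List<String> keywords) {
        String joined = "(" + String.join(" OR ", keywords) + ")";
        return URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String bearerToken = System.getenv("TWITTER_BEARER_TOKEN"); // placeholder credential
        String query = buildQuery(List.of("mask", "cold", "immune", "vaccine", "flu", "snow"));
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.twitter.com/2/tweets/search/recent?query=" + query
                        + "&max_results=100&tweet.fields=created_at,geo"))
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // raw JSON; persist to files per Step 8
    }
}
```

A scheduler (Step 6) could invoke the same request on a timer, e.g. with java.util.concurrent.ScheduledExecutorService, to accumulate the 3000–5000 tweets over several intervals.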
Step 4: You need to include a flowchart/algorithm of your tweet-extraction program in your
Problem #1 PDF file.
Step 5: You need to extract the tweets and metadata related to the given keywords.
o For some keywords you may get fewer tweets, which is not a problem.
Collectively, you should get approximately 3000 to 5000 tweets.
Step 6: If you get too little data, run your method/program with a scheduler module to extract more
data points from Twitter at different time intervals. Note: working on small datasets will not consume
significant cloud resources or local cluster memory.
Step 7: You should extract tweets and retweets along with the provided metadata, such as location,
time, etc.
Step 8: The captured raw data should be stored (programmatically) in files. Each file should
contain no more than 100 tweets. These files will be needed for Problem #2.
Step 9: Your program (Filtration Engine) should automatically clean and transform the data stored
in the files, and then upload each record to a new MongoDB database, myMongoTweet.
o For cleaning and transformation: remove special characters, URLs, emoticons, etc.
o Write your own regular-expression logic. You cannot use libraries such as jsoup or JTidy.
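The cleaning step above can be sketched with plain java.util.regex, using only hand-written patterns. This is one possible whitelist-based approach, not the required one; the ASCII whitelist below is an assumption you can adjust:

```java
import java.util.regex.Pattern;

public class TweetCleaner {

    // Strip URLs first, then replace anything outside a plain ASCII whitelist
    // (letters, digits, basic punctuation) with a space; this also drops emoticons.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern NON_WHITELIST = Pattern.compile("[^A-Za-z0-9 .,!?'#@]");
    private static final Pattern MULTI_SPACE = Pattern.compile("\\s+");

    public static String clean(String raw) {
        String text = URL.matcher(raw).replaceAll(" ");
        text = NON_WHITELIST.matcher(text).replaceAll(" ");
        return MULTI_SPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

The cleaned records can then be inserted into the myMongoTweet database with the official MongoDB Java driver (e.g. MongoClients.create(...).getDatabase("myMongoTweet")), which is allowed since the restriction only covers HTML-cleaning libraries.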

Step 10: You need to include a flowchart/algorithm of your tweet cleaning/transformation program
in the PDF file.

Problem #2: This problem contains two tasks.
Task 1: Data Processing using Spark – MapReduce to perform count
Step 1: Write a MapReduce program (WordCounter Engine) that counts the frequency of the
following substrings or words. Your MapReduce job should perform the frequency count on the stored
raw tweet files.
o “flu”, “snow”, “cold”
o You need to include a flowchart/algorithm of your MapReduce program in the PDF file.
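The core map and reduce logic for the substring count can be sketched in plain Java so it runs without a cluster; in the actual Spark job the same two functions would be supplied to the RDD transformations (e.g. flatMap/mapToPair and reduceByKey). The case-insensitive, non-overlapping counting shown here is one reasonable interpretation of "substring count":

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCounter {

    // "Map" side: count non-overlapping occurrences of each target substring in one tweet.
    static Map<String, Integer> mapTweet(String tweet, List<String> targets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String text = tweet.toLowerCase();
        for (String target : targets) {
            int count = 0;
            for (int i = text.indexOf(target); i >= 0; i = text.indexOf(target, i + target.length())) {
                count++;
            }
            counts.put(target, count);
        }
        return counts;
    }

    // "Reduce" side: merge per-tweet counts into running totals.
    static void reduceInto(Map<String, Integer> totals, Map<String, Integer> partial) {
        partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
    }
}
```

Sorting the final totals then directly yields the highest- and lowest-frequency words required for Step 2.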
Step 2: In your PDF file, report the words with the highest and lowest frequencies.
Task 2: Data Visualization using Graph Database – Neo4j for graph generation
Step 3: Explore the Neo4j graph database, understand the concepts, and learn the Cypher query language.
Step 4: Using Cypher, create graph nodes named “flu”, “snow”, “cold”.
You should add properties to the nodes. To choose the properties, check the
relevant tweet collections.
o Check whether there are any relationships between the nodes.
o If there are relationships between nodes, find their direction.
o Include your Cypher queries and the generated graph in the PDF file.
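One way to keep the graph step inside the Java pipeline is to generate the Cypher statements from the computed counts and then run them via the Neo4j Java driver or paste them into Neo4j Browser. The "Keyword" label, "frequency" property, and "CO_OCCURS" relationship type below are illustrative assumptions, not names required by the assignment:

```java
public class CypherBuilder {

    // Build a MERGE statement for one keyword node; MERGE (rather than CREATE)
    // keeps reruns idempotent.
    static String mergeNode(String name, long frequency) {
        return String.format(
                "MERGE (n:Keyword {name: '%s'}) SET n.frequency = %d", name, frequency);
    }

    // Relate two keyword nodes that co-occur in the same tweets; here the chosen
    // direction runs from the first keyword to the second.
    static String relate(String from, String to, long coOccurrences) {
        return String.format(
                "MATCH (a:Keyword {name: '%s'}), (b:Keyword {name: '%s'}) "
                        + "MERGE (a)-[r:CO_OCCURS {count: %d}]->(b)",
                from, to, coOccurrences);
    }
}
```

Counting how often two keywords appear in the same tweet collection gives you both the existence of a relationship and a defensible direction to report.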

Assignment 4 Submission Format:
1) Compress all your reports/files into a single .zip file with a meaningful name.
2) Submit your reports in PDF format only.
Please avoid submitting .doc/.docx files; submit only the PDF version. You may merge all the reports into
a single PDF or keep them separate. You should also include output (if any) and test cases (if any) in the
PDF file.
3) Your Java code must be submitted on GitLab.