DSCI 551 – Homeworks 1 to 5 solutions


DSCI 551 – Homework #1: Firebase, JSON, and Data Modeling

Consider managing customer churn data in a Firebase Realtime Database. The data are stored in a CSV file with 7044 rows and 21 columns. You can find details about the data set on the Kaggle web site: https://www.kaggle.com/blastchar/telco-customer-churn. You can also download the data set from that site (archive.zip, containing WA_Fn-UseC_-Telco-Customer-Churn.csv). To reduce the amount of data to be handled, this homework considers only customers who are senior citizens (1142 of them).

Tasks:
1. [40 points] Write a Python script "load.py" that loads the rows for the senior customers above into your database. Execution format: python3 load.py. You can assume the WA_Fn-UseC_-Telco-Customer-Churn.csv file is stored in the same directory where you execute your script.
2. [30 points] Write a Python script "churn.py" that finds the first k (senior) customers who have churned. Return only the IDs of the first k customers, ordered by their IDs. Execution format: python3 churn.py. For example, python3 churn.py 10 will return the IDs of the first 10 customers who have churned.
3. [30 points] Write a Python script "tenure.py" that finds how many customers have used the service for at least k months. Execution format: python3 tenure.py. For example: python3 tenure.py 10

Requirements:
● For each query (in churn.py and tenure.py), only one round trip (sending a request and receiving a response) to the Firebase server is permitted.
● You should not download the entire database to answer the query.
● You should create indexes in the Firebase console that allow the above programs to execute without errors.

Permitted libraries: pandas, requests, json, and other common Python libraries (e.g., sys). Do not use firebase-admin or other Firebase Python libraries.

Submissions:
● The three scripts above.
● Prepend your full name to each script name, e.g., John_Smith_load.py, and so on.
● A document (Word/PDF) explaining why your program sends only one request to Firebase for each query.
● A JSON dump of your Firebase database for this app.
● A screenshot of your Firebase console, showing the structure of your database.
● Submit online. See the syllabus for the late penalty!

Checklist for Submission:
1. Name your folder and zip LASTNAME_FIRSTNAME_HWX. Your submission should be a zip (not rar) file, and unzipping it should yield all your files (no folders). Do not include CSV files. Notice that your submission should have capitalized first and last names. Example: TANEJA_DAKSH_HW1.
2. DO NOT return anything we didn't ask for. For example, no prompts like "please enter XXX: ___". We have given you the EXACT output format; please follow it.
3. Use ONLY relative paths. You should assume your scripts will be run in the directory where they are located. For example, no 'C:\homework1\…' or '/Users/blabla/….'
4. Make sure that you are able to run the code according to the execution format mentioned above in the questions.
5. Double-check your files before submitting them. Please use Python 3 to complete the homework and try to keep the Python version at 3.7. Do not use any libraries other than the ones specified in the handout. You can use EC2 to test your code; Python 3.7 is preinstalled on EC2.
6. You can submit multiple times on DEN, but only the latest attempt will be graded.

DSCI 551 (Spring 2022) – Homework #2: Exploring HDFS Metadata Using XML & XPath

In this homework, we will explore the metadata stored in the namenode of HDFS. You can obtain such metadata by using the Offline Image Viewer (oiv) tool provided by Hadoop (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html). For example,

bin/hdfs oiv -i /tmp/hadoop-ec2-user/dfs/name/current/fsimage_0000000000000000564 -o fsimage564.xml -p XML

will export the metadata stored in the specified fsimage (file system image) to an XML file called fsimage564.xml.

The fsimage has an INodeSection listing metadata about each inode and an INodeDirectorySection describing the directory structure, as shown above. Note that the id of an inode is its inumber, and directory nodes are represented by their inumbers, e.g., 16385.

Your task is to implement a Python program stats.py that takes an fsimage file in XML and outputs a JSON file that contains the following statistics about the file system:

{"number of files": 5, "number of directories": 10, "maximum depth of directory tree": 4, "file size": {"max": 3518, "min": 16}}

Note that the maximum depth of the directory tree is the number of levels of the tree; e.g., the maximum depth of the directory tree shown is 4. If the file system does not contain any files, then you should not output the "file size" statistics.

Permitted libraries: lxml.

Execution format: python3 stats.py (e.g., python3 stats.py fsimage564.xml stats.json)

Submission: submit stats.py

Checklist for Submission:
1. DO NOT return anything we didn't ask for. For example, no prompts like "please enter XXX: ___". We have given you the EXACT output format; please follow it.
2. Use ONLY relative paths. You should assume your scripts will be run in the directory where they are located. For example, no 'C:\homework1\…' or '/Users/blabla/….'
3. Make sure that you are able to run the code according to the execution format mentioned above in the questions.
4. Double-check your files before submitting them. Please use Python 3 to complete the homework and try to keep the Python version at 3.7. Do not use any libraries other than the ones specified in the handout. You can use EC2 to test your code; Python 3.7 is preinstalled on EC2.
5. You can submit multiple times on DEN, but only the latest attempt will be graded.
6. Please only submit the stats.py file; do not include your output file in the submission.
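To make the expected computation concrete, here is a minimal sketch of the statistics logic on a tiny inline fsimage-style XML. It uses the stdlib ElementTree for illustration (the assignment permits lxml, whose ElementTree-compatible API is nearly identical). The element names follow the fsimage layout (INodeSection, INodeDirectorySection), but verify them against a real oiv dump; note also that whether files count toward tree depth is a choice here and should match the assignment's figure.

```python
import json
import xml.etree.ElementTree as ET  # illustration only; the assignment permits lxml

# Tiny hand-made sample in the fsimage style (an assumption, not a real dump).
SAMPLE = """
<fsimage>
  <INodeSection>
    <inode><id>16385</id><type>DIRECTORY</type><name></name></inode>
    <inode><id>16386</id><type>DIRECTORY</type><name>user</name></inode>
    <inode><id>16387</id><type>FILE</type><name>a.txt</name>
      <blocks><block><numBytes>3518</numBytes></block></blocks></inode>
    <inode><id>16388</id><type>FILE</type><name>b.txt</name>
      <blocks><block><numBytes>16</numBytes></block></blocks></inode>
  </INodeSection>
  <INodeDirectorySection>
    <directory><parent>16385</parent><child>16386</child></directory>
    <directory><parent>16386</parent><child>16387</child><child>16388</child></directory>
  </INodeDirectorySection>
</fsimage>
"""

def stats(root):
    inodes = root.findall("./INodeSection/inode")
    files = [i for i in inodes if i.findtext("type") == "FILE"]
    dirs = [i for i in inodes if i.findtext("type") == "DIRECTORY"]
    # File size = sum of that inode's block sizes.
    sizes = [sum(int(b.text) for b in f.findall(".//numBytes")) for f in files]
    # Depth: walk parent -> child edges from the root inode (id 16385);
    # here leaves (including files) count as a level.
    children = {}
    for d in root.findall("./INodeDirectorySection/directory"):
        children[d.findtext("parent")] = [c.text for c in d.findall("child")]
    def depth(node):
        kids = children.get(node, [])
        return 1 + (max(depth(k) for k in kids) if kids else 0)
    out = {"number of files": len(files),
           "number of directories": len(dirs),
           "maximum depth of directory tree": depth("16385")}
    if files:  # omit "file size" entirely when there are no files
        out["file size"] = {"max": max(sizes), "min": min(sizes)}
    return out

result = stats(ET.fromstring(SAMPLE))
print(json.dumps(result))
```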

DSCI 551 – HW3

You will need to install MySQL Sakila database for this homework. You can either install the database as
described in https://dev.mysql.com/doc/sakila/en/; or you may follow these steps to install it on EC2.
● Download package:
○ wget https://downloads.mysql.com/docs/sakila-db.tar.gz
● Unzip it:
○ tar xvf sakila-db.tar.gz
● Install:
○ cd sakila-db
○ mysql -u root -p
■ source sakila-schema.sql
■ source sakila-data.sql
■ use sakila
Note that the two source commands above need to be executed after you log into MySQL as root.
Note that you can also download MySQL server and WorkBench from MySQL website and install them
on your laptop/PC and use them for your homework/project.
1. [70 points] Write an SQL query for each of the following questions.
1) Find out how many films are rated ‘PG-13’ and last between 100 and 200 minutes.
2) Find the first and last names of actors whose second-to-last letter of the last name is 'i'.
3) Find the title and length of the longest films.
4) Find out how many films there are in each category. Output category name and the number of
films in the category.
5) Find the ids of customers who have rented films at least 40 times. Return each id only once.
6) Find first and last names of customers whose total payment exceeds $200.
7) Find first and last names of actors who have never played in films rated R.
8) Find out how many films are not available in the inventory.
9) Find out how many actors have the same first name but a different last name as another actor.
10) Show the first name, last name, and city of the customers whose first name is either Jamie,
Jessie, or Leslie. Order the result by first name.
Submission: a text document named sql_queries.txt that contains both the queries and their results
(copy and paste your output from mysql terminal).
2. [30 points] Write a Python script search.py that searches for customers by their first name (case-insensitive). It should return the first name, last name, and city of the found customers.
For example,
python3 search.py 'john'
will find customers whose first name is john.
Permitted libraries: pandas, sqlalchemy, pymysql, mysql-connector-python
*Note:
1. You should already have the database sakila in mysql at this point.
2. In order to use the mysql.connector package, you'll need to set up mysql.connector.connect() in your code (please refer to the lecture slides for examples). If you haven't created the user 'dsci551', please do so by following the posted slides on how to set up MySQL on EC2. After setup, run the command below:
GRANT ALL PRIVILEGES ON sakila.* TO 'dsci551'@'localhost';
Submission: search.py
Checklist for Submission:
1. DO NOT return anything we didn't ask for. For example, no prompts like "please enter XXX: ___". We have given you the EXACT output format; please follow it.
2. Make sure that you are able to run the code according to the execution format mentioned above in the questions.
3. Double-check your files before submitting them. Please use Python 3 to complete the homework and try to keep the Python version at 3.7. Do not use any libraries other than the ones specified in the handout. You can use EC2 to test your code; Python 3.7 is preinstalled on EC2.
4. You can submit multiple times on D2L, but only the latest attempt will be graded.
5. You should submit TWO files to D2L this time: a text document sql_queries.txt containing your answer to Q1, and search.py for Q2.

DSCI 551 – HW4 (Indexing and Query Execution)

1. [40 points] Consider the following B+-tree for the search key "age". Suppose the degree d of the tree is 2; that is, each node (except for the root) must have at least 2 keys and at most 4 keys.
Note that sibling nodes are nodes with the same parent.
a. [10 points] Describe the process of finding keys for the query condition "age >= 10 and age <= 50". How many block I/Os are needed for the process?
b. [15 points] Draw the B+-tree after inserting 31 and 32 into the tree. Only need to show
the final tree after the insertions.
c. [15 points] Draw the tree after deleting 18 from the original tree.
2. [60 points] Consider natural-joining tables R(a, b) and S(a, c). Suppose we have the following
scenario.
i. R is a clustered relation with 1000 blocks.
ii. S is a clustered relation with 500 blocks.
iii. 102 pages available in main memory for the join.
iv. Assume the output of join is given to the next operator in the query execution plan
(instead of writing to the disk) and thus the cost of writing the output is ignored.
Describe the steps (including input, output, and their sizes at each step, e.g., the sizes of runs or buckets) for each of the following join algorithms. What is the total number of block I/Os needed for each algorithm? Which algorithm is most efficient?
a. [10 points] (Block-based) nested-loop join with R as the outer relation.
b. [10 points] (Block-based) nested-loop join with S as the outer relation.
c. [20 points] Sort-merge join (assume only 100 pages are used for sorting and 101 pages for merging). Note that if the join cannot be done using only a single merging pass, runs from one or both relations need to be further merged in order to reduce the number of runs. Select the relation with the larger number of runs for further merging first if both have too many runs.
d. [20 points] Partitioned-hash join (assume 101 pages are used in partitioning the relations and no hash table is used to look up joining tuples).
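As a sanity-check aid (not the graded derivation), the standard textbook I/O-cost formulas for this scenario can be evaluated mechanically. The numbers below assume B(R)=1000, B(S)=500, and M=102 buffer pages, as given above.

```python
from math import ceil

B_R, B_S, M = 1000, 500, 102  # blocks of R, blocks of S, memory pages

def block_nested_loop(b_outer, b_inner, m):
    # Read the outer relation once; for every (m - 2)-page chunk of the
    # outer relation, scan the inner relation in full.
    return b_outer + ceil(b_outer / (m - 2)) * b_inner

print(block_nested_loop(B_R, B_S, M))  # R as the outer relation
print(block_nested_loop(B_S, B_R, M))  # S as the outer relation

# Two-pass sort-merge and partitioned-hash both cost about 3(B(R) + B(S))
# when the runs (or partitions) fit a single merge (or build) phase:
# one read + one write in pass 1, plus one read in pass 2.
print(3 * (B_R + B_S))
```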

DSCI 551 – HW5 (Hadoop MapReduce & Spark)

In this homework, we will consider the churn data set again (as in HW1). You are given two versions of
the file: churn4hadoop.csv and churn.csv. The former has no header and is to be used for the Hadoop
question below; the latter has a header and is used in Spark.
1. [Hadoop MapReduce, 40 points] Complete the provided Churn.java by supplying the missing code as
indicated in the source file, so that it answers the following SQL query.
Select InternetService, max(tenure)
From Churn
Where churn = "Yes"
Group by InternetService
Having count(*) > 200;
Execution format: hadoop jar churn.jar Churn input output
where the input directory contains a single file: churn4hadoop.csv.
2. [40 points] For each of the following SQL queries, write a Spark script that finds the answer to the
query. Note to read a csv file with header into Spark as a dataframe, proceed as follows:
churn = spark.read.csv('churn.csv', header=True)
You will also need to import this:
import pyspark.sql.functions as fc
a) select count(*)
from churn
where gender = ‘Male’ and churn = ‘Yes’;
b) select gender, max(TotalCharges)
from churn
where churn = "Yes"
group by gender;
Note: you will need to change the data type of TotalCharges from string to double. For example,
churn = churn.withColumn('TotalCharges', fc.col('TotalCharges').cast('double'))
c) select gender, count(*)
from churn
where churn = ‘Yes’
group by gender;
d) select churn, contract, count(*) cnt
from churn
group by churn, contract
order by churn, cnt desc;
(churn is ascending)
e) select gender, churn, count(*)
from churn
group by gender, churn
having count(*) > 1000;
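A hedged sketch of the dataframe approach for queries (a) and (d) is below. The column names follow the SQL identifiers in the handout (gender, churn, contract); the actual CSV header capitalizes some of them (e.g., Churn, Contract), so adjust accordingly. The functions are not run here (they need a live SparkSession).

```python
# Sketches only: each function takes an existing SparkSession and returns
# the query result; pyspark is imported lazily inside the functions.

def query_a(spark):
    import pyspark.sql.functions as fc  # needs a pyspark installation
    churn = spark.read.csv("churn.csv", header=True)
    return (churn
            .filter((fc.col("gender") == "Male") & (fc.col("churn") == "Yes"))
            .count())

def query_d(spark):
    import pyspark.sql.functions as fc
    churn = spark.read.csv("churn.csv", header=True)
    return (churn
            .groupBy("churn", "contract")
            .agg(fc.count("*").alias("cnt"))
            .orderBy(fc.col("churn").asc(), fc.col("cnt").desc()))
```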
3. [20 points] Write a Spark RDD script for each of the following SQL queries.
a. Same as q2.a.
b. Same as q2.b.
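A minimal RDD sketch for Q3.a follows. It assumes comma-separated lines with no embedded commas, and the column positions (gender at index 1, churn at index 20 in the 21-column file) are assumptions; in practice, derive the indices from the header row.

```python
def is_male_churned(row, gender_idx=1, churn_idx=20):
    # Column indices are assumptions -- look them up from the header.
    return row[gender_idx] == "Male" and row[churn_idx] == "Yes"

def rdd_query_a(sc):
    # Skip the header line, split into fields, filter, and count.
    lines = sc.textFile("churn.csv")
    header = lines.first()
    return (lines.filter(lambda l: l != header)
                 .map(lambda l: l.split(","))
                 .filter(is_male_churned)
                 .count())
```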
Submission:
• Q1: Churn.java and churn.jar and part-r-00000 under the output directory.
• Q2: submit a text file q2-solution.txt with your scripts and the outputs from each script.
• Q3: submit a text file q3-solution.txt with your scripts and the outputs from each script.