Description
DSCI551 Homework 1 (Firebase)
Consider the air quality index provided to you as a zip file named aqi.zip which contains aqi.csv. The file
has four columns: date, country, status, aqi value.
1. Write a Python script stat.py that computes the average AQI value of each country from July
2022 to January 2023 (note some months might not have data for every day of the month).
Output the results in a JSON file in the following format:
[{"Country": "Albania", "Year": 2022, "Month": 7, "Avg AQI": 32.9}, …]
Execution format: python3 stat.py aqi.csv aqi.json
Where aqi.json is the output JSON file.
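A minimal sketch of the grouping logic, run on a tiny inline table instead of the real aqi.csv (the column names date/country/status/value are taken from the description above; adjust them if the actual header differs). The real stat.py would read sys.argv[1] with pandas, filter to July 2022 through January 2023, and write the records with json.dump:

```python
import pandas as pd

def monthly_avg(df):
    """Average AQI per (country, year, month), as a list of records."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["Year"] = df["date"].dt.year
    df["Month"] = df["date"].dt.month
    out = (df.groupby(["country", "Year", "Month"], as_index=False)["value"]
             .mean()
             .rename(columns={"country": "Country", "value": "Avg AQI"}))
    out["Avg AQI"] = out["Avg AQI"].round(1)
    return out.to_dict(orient="records")

# Tiny inline stand-in for aqi.csv; real column names may differ.
sample = pd.DataFrame({
    "date": ["2022-07-01", "2022-07-02", "2022-08-01"],
    "country": ["Albania", "Albania", "Albania"],
    "status": ["Good", "Good", "Good"],
    "value": [30, 34, 40],
})
records = monthly_avg(sample)
print(records)
```

Averaging per (country, year, month) sidesteps the "some months might not have data for every day" caveat automatically, since mean() divides by the number of rows actually present.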
2. Write a Python script load.py that loads the aqi.json produced in step 1 into your Firebase
database.
Execution format: python3 load.py aqi.json <database-url>/aqi.json
Where <database-url> is the base URL of your Firebase database, e.g.,
https://dsci551-default-rtdb.firebaseio.com/.
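One possible shape for load.py, using the Firebase REST API's PUT semantics (PUT replaces whatever is stored at the target path). The URL in the example is only a placeholder:

```python
import json
import sys

import requests

def load_to_firebase(json_path, target_url):
    """PUT the JSON array from json_path at target_url (Firebase REST)."""
    with open(json_path) as f:
        records = json.load(f)
    # PUT replaces the data at <database-url>/aqi.json with `records`.
    resp = requests.put(target_url, json=records)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__" and len(sys.argv) == 3:
    load_to_firebase(sys.argv[1], sys.argv[2])
```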
3. Write a Python script search.py that outputs countries’ names along with the corresponding
month and year if the average AQI falls in the given range.
Execution format: python3 search.py <database-url>/aqi.json 20 30
This will find all countries (with months and years) whose average AQI is between 20 and 30,
inclusive. You can assume the second argument is always less than or equal to the third
argument. You can also assume that .indexOn has been set properly.
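A sketch of the range query using the Firebase REST parameters orderBy/startAt/endAt. Note that the REST API expects the orderBy key as a JSON-quoted string, which is why json.dumps is used; the record fields assume the JSON format produced in step 1:

```python
import json

import requests

def search_range(db_url, low, high):
    """Return (Country, Month, Year) tuples whose "Avg AQI" is in [low, high].

    Relies on a server-side Firebase range query, so .indexOn for
    "Avg AQI" must be configured in the database rules.
    """
    params = {
        "orderBy": json.dumps("Avg AQI"),  # Firebase wants the key quoted
        "startAt": low,
        "endAt": high,
    }
    resp = requests.get(db_url, params=params)
    resp.raise_for_status()
    data = resp.json() or {}
    # Firebase may answer with an object or an array containing nulls.
    rows = data.values() if isinstance(data, dict) else [r for r in data if r]
    return [(r["Country"], r["Month"], r["Year"]) for r in rows]
```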
Allowed python packages: sys, pandas, requests, json
Hints: when grading, we only run the above execution commands (e.g., 'python3 stat.py aqi.csv aqi.json') in
our terminal and check whether your code gives the desired output.
Submission:
1. 3 .py files, no .ipynb files
2. Do not zip your files
3. Do not submit any other files except the 3 .py files
4. If your code does not run, you will get 0 points
5. Submit your work by the deadline. No exceptions will be made!
DSCI551 Homework 2 (HDFS)
In this homework, you are asked to emulate the file system structure of HDFS using Firebase and allow the export of its structure in the XML format. The emulated HDFS (or EDFS) should support ls, create, mkdir, rmdir, rm, and export commands as detailed below.
Your task is to write a Python script edfs.py to support the above commands. Formats:
● python edfs.py -ls
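Since the on-disk layout of the emulated file system is up to you, the sketch below assumes one possible design: each directory is stored as a Firebase object whose keys are the names of its children. The base URL is a placeholder, and shallow=true asks Firebase for just the keys:

```python
import requests

BASE = "https://dsci551-default-rtdb.firebaseio.com/edfs"  # placeholder root

def ls(path):
    """List the names stored under `path` in the emulated file system.

    Assumes each directory is a JSON object keyed by child name; other
    layouts (e.g. separate metadata nodes) would change this lookup.
    """
    resp = requests.get(f"{BASE}{path}.json", params={"shallow": "true"})
    node = resp.json()
    return sorted(node) if isinstance(node, dict) else []
```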
DSCI551 Homework 3 (SQL and XML)
In this homework, we will take a fsimage file and store its inode and inode directory information (as
illustrated below) in MySQL and write SQL queries to retrieve or analyze the content.
Remember that
XML contents are case-sensitive and need to be stored in MySQL databases as such too. For example,
“FILE” needs to be stored as “FILE” instead of “file”, “DIRECTORY” needs to be stored as “DIRECTORY”
instead of “directory”.
The homework assumes that you have created a database dsci551 on your EC2 MySQL instance, a user
dsci551@localhost (with password Dsci-551), and granted all privileges on dsci551.* (i.e., objects in
dsci551) to that user. You can log in as root and execute the following to satisfy the assumption:
create database dsci551;
create user dsci551@localhost identified by 'Dsci-551';
grant all privileges on dsci551.* to dsci551@localhost;
Tasks:
1. [25 points] Write one SQL script “create.sql” that creates the following 3 tables in a MySQL
database dsci551. The script may assume that the database dsci551 already exists.
Note that you should properly define the primary key and foreign key (if applicable) for each
table and choose a suitable data type for each attribute according to the screenshot. If the
above tables already exist in the database, your script should be able to recreate them.
Your code should run without error and satisfy all requirements stated in the question.
You are only required to define the PK, FK, and a data type for each attribute. It is up to you
whether to add UNIQUE / NOT NULL / CHECK / FK CASCADE constraints, etc.
a. A table “inode” which stores the information about the inodes in a fsimage file. The
table should have the following attributes: id, type, name, replication, mtime, atime,
preferredBlockSize, and permission.
b. A table “blocks” which stores the block information for a file. The table should have
the following attributes: id, inumber, genstamp, numBytes, where inumber is the inode
id of the file and id is a block id. Note that it is possible for a file to have multiple
blocks. You can assume that every block has a unique id. There is no need to store
storagePolicyID.
c. A table “directory” which has two attributes: parent and child, where parent is the
inumber of the parent directory and child is the inumber of a file/directory stored under
the parent directory.
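A sketch of DDL matching the attribute lists above; the concrete types are placeholder guesses and should be taken from the screenshot in the handout. It is wrapped in Python with sqlalchemy only so the statements can be smoke-tested locally against SQLite; the actual deliverable is a plain create.sql for MySQL:

```python
from sqlalchemy import create_engine, text

# Placeholder types; pick the real ones from the handout screenshot.
# Dropping dependent tables first lets a re-run recreate everything.
DDL = """
DROP TABLE IF EXISTS blocks;
DROP TABLE IF EXISTS directory;
DROP TABLE IF EXISTS inode;
CREATE TABLE inode (
    id INT PRIMARY KEY,
    type VARCHAR(16),
    name VARCHAR(255),
    replication INT,
    mtime BIGINT,
    atime BIGINT,
    preferredBlockSize BIGINT,
    permission VARCHAR(16)
);
CREATE TABLE blocks (
    id BIGINT PRIMARY KEY,
    inumber INT,
    genstamp BIGINT,
    numBytes BIGINT,
    FOREIGN KEY (inumber) REFERENCES inode(id)
);
CREATE TABLE directory (
    parent INT,
    child INT,
    PRIMARY KEY (parent, child),
    FOREIGN KEY (parent) REFERENCES inode(id),
    FOREIGN KEY (child) REFERENCES inode(id)
);
"""

def create_tables(engine):
    # create.sql itself would contain just the statements above; this
    # wrapper only exists so the DDL can be executed from Python.
    with engine.begin() as conn:
        for stmt in DDL.split(";"):
            if stmt.strip():
                conn.execute(text(stmt))

engine = create_engine("sqlite://")  # local check; the homework targets MySQL
create_tables(engine)
```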
2. [35 points] Write a Python script “load.py” that takes a fsimage file and stores its inode and
inode directory information in the tables (in the database dsci551) you created in Task 1.
Execution format:
python3 load.py <fsimage.xml>
where <fsimage.xml> is a file system image file.
Note that your load.py should access the dsci551 database (with user = dsci551 and
password = Dsci-551, as mentioned earlier) in your MySQL on your localhost. You can
test your code by first uploading it to your EC2 instance and running it there.
Note that you are provided with a sample fsimage file, but your code may be tested on
additional files. This means you should not hardcode the provided fsimage file in your
code. If there is no value for a particular attribute, use 0 for an integer and ' ' (a
single space) for a string. For example, in the sample fsimage92.xml the inode with
id=16386 does not have atime, so assign 0 to it.
You should use lxml and its xpath function to extract data from the xml file. You may use
sqlalchemy, pymysql, and pandas to work with MySQL inside Python.
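A minimal illustration of lxml + xpath extraction with the handout's missing-value defaults, run on an inline snippet shaped like a fsimage file. Only a few inode attributes are shown; the real load.py would extract all of them (plus blocks and directory entries) and insert the rows into MySQL via sqlalchemy/pymysql:

```python
from lxml import etree

def parse_inodes(xml_bytes):
    """Extract (id, type, name, atime) per inode, defaulting missing
    integers to 0 and missing strings to ' ' as the handout requires."""
    root = etree.fromstring(xml_bytes)
    out = []
    for node in root.xpath("//INodeSection/inode"):
        def grab(tag, default):
            v = node.xpath(f"./{tag}/text()")
            return v[0] if v else default
        out.append((int(grab("id", 0)), grab("type", " "),
                    grab("name", " "), int(grab("atime", 0))))
    return out

# Inline stand-in for a fsimage file (structure only; not the real sample).
sample = b"""<fsimage><INodeSection>
<inode><id>16386</id><type>DIRECTORY</type><name>user</name></inode>
<inode><id>16387</id><type>FILE</type><name>a.txt</name><atime>5</atime></inode>
</INodeSection></fsimage>"""
rows = parse_inodes(sample)
print(rows)
```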
3. [40 points] For each of the following questions, write a MySQL query that uses the tables
you created in Question 1 to answer the question. Please use queries covered during
lectures.
a. Find names of files which have the latest modification time. Note that it is possible
that multiple files have the same modification time, please return all of them.
format of output:
+------+
| name |
+------+
b. Find names of files or directories whose permissions end with 6xx. In other words,
the third to the last character of the permission is 6.
format of output:
+------+
| name |
+------+
c. For each file, output its id, name, and total size (in bytes).
format of output:
+----+------+------------+
| id | name | total_size |
+----+------+------------+
d. For each directory, output its id, name, and the number of objects (files or
directories) in the directory.
format of output:
+----+------+---------+
| id | name | num_obj |
+----+------+---------+
e. Find names of directories which contain at least two files.
format of output:
+------+
| name |
+------+
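As a pattern illustration for queries shaped like (a), here is a MAX subquery run against a toy SQLite table (attribute names follow Task 1; the homework itself targets MySQL, and this is only one way to phrase such a query):

```python
from sqlalchemy import create_engine, text

# Toy data: two files tie for the latest mtime, so both must be returned.
engine = create_engine("sqlite://")  # quick check; the homework runs on MySQL
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE inode "
                      "(id INT, type VARCHAR(16), name VARCHAR(255), mtime BIGINT)"))
    conn.execute(text("INSERT INTO inode VALUES "
                      "(1,'FILE','a.txt',100), (2,'FILE','b.txt',200), "
                      "(3,'FILE','c.txt',200), (4,'DIRECTORY','d',300)"))
    rows = conn.execute(text(
        "SELECT name FROM inode WHERE type = 'FILE' AND "
        "mtime = (SELECT MAX(mtime) FROM inode WHERE type = 'FILE')")).fetchall()
print(sorted(r[0] for r in rows))
```

Filtering on type = 'FILE' in both the outer query and the subquery matters here, since directories also carry an mtime.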
Allowed Libraries for q2: lxml, sys, sqlalchemy, pymysql, and pandas.
Submission (3 files in total):
1. For question 1: one .sql file
2. For question 2: one .py file
3. For question 3: one pdf file including all 5 queries
Notes:
1. Please strictly follow the wording of all attribute names and table names
2. Do not zip your files when submitting your work
3. Failure to follow any submission requirement will lead to a deduction of marks; please follow the
instructions carefully
DSCI551 Homework 4 (Indexing and Query Execution)
1. [40 points] Consider the following B+-tree for the search key “age”. Suppose the degree d of the
tree is 2; that is, each node (except for the root) must have at least 2 keys and at most 4 keys.
Note that sibling nodes are nodes with the same parent.
a. [10 points] Describe the process of finding keys for the query condition “age >= 15 and
age <= 50”. How many block I/Os are needed for the process?
b. [15 points] Show the B+-tree after deleting 20, 30, and 43 (in the shown order). Show
the updated tree after EACH deletion.
c. [15 points] Show the tree after inserting 14, 15, and 45 (in the shown order) into the
tree produced in sub-question b. Show the updated tree after EACH insertion.
2. [60 points] Consider natural-joining tables R(a, b) and S(a,c). Suppose we have the following
scenario.
i. R is a clustered relation with 20,000 blocks.
ii. S is a clustered relation with 50,000 blocks.
iii. 102 pages available in main memory for the join.
iv. Assume the output of join is given to the next operator in the query execution plan
(instead of writing to the disk) and thus the cost of writing the output is ignored.
Describe the steps for each of the following join algorithms. For the sorting- and hashing-based
algorithms, also indicate the size of the output from each step. What is the total number of block
I/Os needed for each algorithm? Which algorithm is the most efficient in terms of block I/Os?
a. [10 points] (Block-based) nested-loop join with R as the outer relation.
b. [10 points] (Block-based) nested-loop join with S as the outer relation.
c. [20 points] Sort-merge join (assume only 100 pages are used for sorting and 101 pages for
merging). Note that if the join cannot be done using only a single merging pass, runs from
one or both relations need to be further merged in order to reduce the number of runs.
Select the relation with the larger number of runs for further merging first if both have too
many runs.
d. [20 points] Partitioned-hash join (assume 101 pages are used in partitioning the relations and
no hash table is used for looking up tuples in the join phase).
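For the nested-loop cases, the usual cost formula can be sanity-checked with a few lines of Python. The buffering convention below reserves one page for the inner relation's input and one for output, leaving M - 2 pages for outer blocks; if your lecture uses a different reservation, adjust the divisor accordingly:

```python
import math

def bnlj_cost(b_outer, b_inner, mem):
    """Block nested-loop join I/Os: read the outer relation once, and scan
    the inner relation once per chunk of (mem - 2) outer blocks (one page
    reserved for the inner input, one for output)."""
    return b_outer + math.ceil(b_outer / (mem - 2)) * b_inner

# The scenario above: B(R) = 20000, B(S) = 50000, 102 memory pages.
print(bnlj_cost(20000, 50000, 102))  # R as outer
print(bnlj_cost(50000, 20000, 102))  # S as outer
```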
Submission Requirements:
Please read carefully before submitting your work:
Please submit one .pdf file containing the answers to both Question 1 and Question 2. The answers can be
either typed or handwritten. Name the submission file firstname_lastname_HW4.pdf
DSCI551 Homework 5 (Hadoop MapReduce & Spark)
1. [40 points] Write a Hadoop MapReduce program named SQL2MR.java to find answers to the following SQL query on the aqi.csv data set (for Air Quality). The file has 10,129 rows; each row has four values: date, country, status, value.
select status, avg(value) from aqi where date like '2022-08%' group by status having count(value) >= 100;
You are provided with a template for SQL2MR.java in which you can fill in the missing code to produce a complete program. The template also contains some hints that may help you.
Before you compile and run the program, please complete the following:
● You should remove the header of the aqi.csv file and save the file under a directory called aqi-input on EC2.
● You should comment out or remove the xxx part in core-site.xml. The path of the file is /etc/hadoop/core-site.xml.
● Remember to stop your MySQL server before running Hadoop (sudo service mysql stop).
You are reminded of the following steps to compile and run the program:
● hadoop com.sun.tools.javac.Main SQL2MR.java
● jar cf sql2mr.jar SQL2MR*.class
● hadoop jar sql2mr.jar SQL2MR aqi-input aqi-output
Submission: SQL2MR.java, sql2mr.jar, and the part-r-00000 file under aqi-output.
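Before writing the Java version, the query's map/reduce decomposition can be mocked in plain Python to check the expected output. This is only a model of the flow (map-side filter on the date, shuffle by status, reduce-side average with the HAVING filter), not a substitute for SQL2MR.java:

```python
from collections import defaultdict

def sql2mr(rows, min_count=100):
    """Pure-Python model of the MapReduce plan for:
    WHERE date LIKE '2022-08%'   -> map-side filter
    GROUP BY status              -> shuffle key
    HAVING count(value) >= 100   -> reduce-side filter
    AVG(value)                   -> reduce-side aggregate
    """
    groups = defaultdict(list)
    for date, country, status, value in rows:   # map + shuffle
        if date.startswith("2022-08"):
            groups[status].append(float(value))
    return {s: sum(v) / len(v)                  # reduce
            for s, v in groups.items() if len(v) >= min_count}

# Synthetic rows: "Good" clears the HAVING threshold, "Bad" does not.
demo = [("2022-08-01", "X", "Good", 10)] * 100 + [("2022-08-02", "X", "Bad", 50)]
print(sql2mr(demo))
```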
2. [30 points] Using the JSON files in country-db.zip and the aqi.csv file, answer the following questions using the Spark DataFrame API. You can use "import pyspark.sql.functions as fc". Note: you should not use Spark SQL in this question. Remember to stop your MySQL server before running Spark (sudo service mysql stop).
Submission: Copy your Spark DataFrame scripts and outputs into one file and generate a PDF to submit.
a. [8 points] Find countries that are in both country.json and aqi.csv.
i. Using join
ii. Using set operation
b. [8 points] Find (names of) countries that are in aqi.csv but not in country.json. Output the same countries only once.
i. Using join
ii. Using set operation
c. [6 points] Find countries that are in country.json but not in aqi.csv.
i. Using join
ii. Using set operation
d. [8 points] Find the answer to the SQL query in Task 1, copied below:
select status, avg(value) from aqi where date like '2022-08%' group by status having count(value) >= 100;
Note that if the "date" column is a timestamp, you may proceed as follows to extract the year and month of the dates:
fc.year('date') # this will get the year
fc.month('date') # this will get the month
3. [30 points] Using the JSON files in country-db.zip and the aqi.csv file, answer the same questions as in Task 2 but using the Spark RDD API. Note that you should first convert the dataframe for the entire data set (e.g., aqi for aqi.csv) to an RDD (e.g., using aqi.rdd) and work on the RDDs to solve the questions.
Hint: if date is a datetime, e.g., datetime.datetime(2022, 8, 1, 0, 0), you can use date.year and date.month to get the year and month, respectively:
[Row(date=datetime.datetime(2022, 8, 1, 0, 0), country='Albania', …]
Submission: Copy your Spark RDD scripts and outputs into one file and generate a PDF to submit.
a. [8 points] Find countries that are in both country.json and aqi.csv.
i. Using join
ii. Using set operation
b. [8 points] Find (names of) countries that are in aqi.csv but not in country.json.
i. Using join
ii. Using set operation
c. [6 points] Find countries that are in country.json but not in aqi.csv.
i. Using join
ii. Using set operation
d. [8 points] Find the answer to the SQL query in Task 1, copied below:
select status, avg(value) from aqi where date like '2022-08%' group by status having count(value) >= 100;
Note: you are required to use aggregateByKey in question d.
Submission:
1. Please ZIP your files as a whole .zip file (do not use any other compression type such as '.rar') and submit. Otherwise you may not be able to submit because of D2L security restrictions.
2. For Q1: SQL2MR.java, sql2mr.jar, and the part-r-00000 file under aqi-output
3. For Q2: Copy your Spark DataFrame scripts and outputs into one PDF file and submit.
4. For Q3: Copy your Spark RDD scripts and outputs into one PDF file and submit.
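For question 3d, the (zeroValue, seqOp, combOp) contract of aggregateByKey can be modeled in plain Python before touching Spark. The sketch below fakes two partitions to show why both operators are needed, and computes a per-key average from a (sum, count) accumulator, the same shape the RDD solution would use:

```python
def aggregate_by_key(pairs, zero, seq_op, comb_op, partitions=2):
    """Plain-Python model of RDD.aggregateByKey: seq_op folds each value
    into a per-partition accumulator starting from zero; comb_op merges
    accumulators across partitions. (Spark ships zero to each partition;
    using an immutable zero like a tuple keeps this model faithful.)"""
    accs = {}
    for i in range(partitions):
        part = {}
        for j, (k, v) in enumerate(pairs):
            if j % partitions == i:            # fake round-robin partitioning
                part[k] = seq_op(part.get(k, zero), v)
        for k, a in part.items():
            accs[k] = comb_op(accs[k], a) if k in accs else a
    return accs

# Average per key via a (sum, count) accumulator, as question d requires.
pairs = [("Good", 10.0), ("Good", 20.0), ("Bad", 30.0)]
sums = aggregate_by_key(pairs, (0.0, 0),
                        lambda a, v: (a[0] + v, a[1] + 1),
                        lambda a, b: (a[0] + b[0], a[1] + b[1]))
avgs = {k: s / c for k, (s, c) in sums.items()}
print(avgs)
```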


