## Description

1 Key-value Stores (MapReduce and Spark)

1. Consider the document “MapReduce and the New Software Stack” available in the module on MapReduce.1

In that document, you can, in Sections 2.3.3-2.3.7, find descriptions of algorithms to implement relational

algebra operations in MapReduce. (In particular, look at the mapper and

reducer functions for various RA operators.)

In the following problems, you are asked to write MapReduce programs

that implement some RA expressions in PostgreSQL. In addition, you

need to add the code which permits the PostgreSQL simulations for these

MapReduce programs. A crucial aspect of solving these problem is to

develop an appropriate data representation for the input to these problems. Recall that, in MapReduce, the input is a single binary relation of

(key,value) pairs. Take for example Problem 1c. To solve this problem,

you need to find a representation that puts both the data from R and

from S in the same binary key-value relation. And, for problem 1d, you

need to have a representation that put all the data from R, S, and T in

the same binary key-value relation.

(a) Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper

function and a reducer function, as well as a 3-phases simulation that

implements the projection πA,B(R) where R(A, B, C) is a relation.

You can assume that the domains of A, B, C are integer. (Recall

that in a projection, duplicates are eliminated.)

(b) Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper

function and a reducer function, as well as a 3-phases simulation that

implements the set difference of two unary relations R(A) and S(A),

i.e., the relation R − S. You can assume that the domain of A is

integer.

(c) Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper

function and a reducer function, as well as a 3-phases simulation that

implements the semijoin RnS of two relations R(A, B) and S(B, C).

You can assume that the domains of A, B, and C are integer.

(d) Let R(A), S(A), and T(A) be unary relations that store integers.

Write, in PostgreSQL, a MapReduce program that implements the

RA expression R − (S ∪T). Also provide a simulation that evaluates

this program.

2. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query

1This is Chapter 2 in Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman,

and Jeffrey D. Ullman.

1

SELECT r.A, array_agg(r.B), cardinality(array_agg(r.B))

FROM R r

GROUP BY (r.A)

HAVING COUNT(r.B) >= 2;

Here R is a relation with schema (A, B). You can assume that the domains

of A and B are integers.

3. Let R(K, V ) and S(K, W) be two binary key-value pair relations. Consider the cogroup transformation R.cogroup(S) introduced in the lecture

on Spark. You can assume that the domains of K, V , and W are integers.

(a) Define a PostgreSQL view that represent the co-group transformation

R.cogroup(S). Show that this view works.

(b) Write a PostgreSQL query that use this cogroup view to compute

RnS.

(c) Write a PostgreSQL query that uses this cogroup view to implement

the SQL query

SELECT distinct r.K as rK, s.K as sK

FROM R r, S s

WHERE ARRAY(SELECT r1.V

FROM R r1

WHERE r1.K = r.K) <@ ARRAY(SELECT s1.W

FROM S s1

WHERE s1.K = s.K);

4. Let A and B be two unary relations of integers. Consider the cogroup

transformation introduced in the lecture on Spark. Using an approach

analogous to the one in Problem 3 solve the following problems:

(a) Write a PostgreSQL query that uses the cogroup transformation to

compute A ∩ B.

(b) Write a PostgreSQL query that uses the cogroup operator to compute

the symmetric difference of A and B, i.e., the expression

(A − B) ∪ (B − A).

2

2 Nested Relations and Semi-structured databases

5. Consider the lecture on Nested and Semi-structured Data models. In

that lecture, we considered the studentGrades nested relation and we

constructed it using a PostgreSQL query starting from the Enroll relation.

(a) Write a PostgreSQL query that creates the nested relation courseGrades.

The type of this relation is

(cno, gradeInfo{(grade, students{(sid)})})

This relation stores for each course, the grade information of the

students enrolled in this course. In particular, for each course and

for each grade, this relation stores in a set the students who obtained

that grade in that course.

(b) Starting from this nested relation courseGrades, write a PostgreSQL

that creates the nested relation studentGrades which is as described

in the lecture.

(c) In the lecture, we defined the jstudentGrades semi-structured relation. Write a PostgreSQL query that creates a jcourseGrades semistructured relation which stores JSON objects whose structure conforms with the structure of tuples as described for the courseGrades

nested relation in question 5a.

(d) Repeat question 5b but now for the semi-structured relations jcourseGrades

and the jstudentGrades. In other words, starting from this semistructured relation courseGrades, write a PostgreSQL that creates

the semi-structured relation studentGrades.

(e) In the lecture on Nested and Semi-structured data models, we considered the query “For each student who major in ‘CS’, list his or

her sid and sname, along with the courses she is enrolled in. Furthermore, these courses should be grouped by the department in which

they are enrolled.” We formulated this query in the context of nested

relations.

Write a PostgreSQL query to solve this query, but this time for semistructured relations that store only JSON objects.

3

3 Graph Databases

6. Consider the database schema Student, Course, Buys, Major, and Cites

that we have been using throughout the assignments.

(a) Specify an Entity-Relationship Diagram that models this database

schema.

(b) Specify the node and relationship types of a Property Graph for

this database schema. In addition, specify the properties, if any,

associated with each such type.

7. Using the Property Graph model in Problem 6b, formulate the following

queries in the Cypher query language:

(a) Find the types of the relationships associated with Student nodes.

(b) Find each student (node) whose name is ‘John’ and who bought a

book whose price is at least $50.

(c) Find each student (node) who bought a book that cites a book whose

price is at least $50.

(d) Find each book (node) that is cited directly or indirectly (i.e., recursively) by a book that cost more that $50.

(e) Find for each book node, that node along with the number of students

who major in both CS and in Math and who bought that book.

4