B561 Assignment 7: Testing the Effectiveness of Query Optimization; Object-Relational Database Programming; Key-Value Databases and Graph Databases (Draft)


This assignment focuses on problems related to Lecture 9 and Lectures 18
through 22.
• Lecture 18: Algorithms for RA operations
• Lecture 19: Query processing and query plans
• Lecture 20: Object-relational database programming
• Lecture 20: Key-value stores. NoSQL in MapReduce style
• Lecture 21: Key-value stores; NoSQL in Spark style
• Lecture 22: Graph databases
Other lectures that are relevant for this assignment are Lectures 8, 9, 13, and 14:
• Lecture 8: Translating Pure SQL queries into RA expressions
• Lecture 9: Query optimization
• Lecture 13: Object-Relational databases and queries
• Lecture 14: Nested Relational, Semi-structured Databases, Document
Databases
This assignment has problems that are required to be solved. Others, identified as such, are practice problems that you should attempt, since they serve as
preparation for the final exam.
Turn in a single assignment7.sql file that contains the PostgreSQL code of
the solutions for the problems that require such code. (Do not include solutions
for the practice problems.) Also turn in an assignment7.txt file that contains
all the output associated with the problems in this assignment. For all the other
problems, submit a single assignment7.pdf file with your solutions.
1 Analysis of Queries Using Query Plans
Consider Lecture 19 on Query Processing: Query Planning in (Object) Relational Systems. Consider the analysis, using query plans, for the SOME quantifier.
1. Assume the relation schemas P(x), Q(x), R(x, y) and S(x, z).
Consider the NOT ALL generalized query
{(p.x, q.x) | P(p) ∧ Q(q) ∧ R(p.x) ⊉ S(q.x)}
where
R(p.x) = {r.y | R(r) ∧ r.x = p.x}
S(q.x) = {s.z | S(s) ∧ s.x = q.x}
In analogy with the analysis for the SOME generalized quantifier in Lecture 19,
do an analysis, using query plans, for the NOT ALL generalized quantifier.
2 Experiments to Test the Effectiveness of Query
Optimization
In the following problems, you will conduct experiments in PostgreSQL to gain
insight into whether or not query optimization can be effective. In other words,
can it be determined experimentally if optimizing an SQL or an RA expression
improves the time (and space) complexity of query evaluation? Additionally,
can it be determined whether the PostgreSQL query optimizer attains the same
(or better, or worse) optimization as optimization by hand? Recall that in SQL you
can specify each RA expression as an RA SQL query. This implies that each of
the optimization rules for RA can be applied directly to queries formulated in
RA SQL.
In the following problems you will need to generate artificial data of increasing
size and measure the time of evaluating non-optimized and optimized queries.
The size of this data can be in the tens or hundreds of thousands of tuples.
This is necessary because on very small data it is not possible to gain
sufficient insights into the quality (or lack of quality) of optimization. You can
use the data generation functions that were developed in Assignment 6. Additionally, you are advised to examine the query plans generated by PostgreSQL.
For the problems in this assignment, we will use three relations:1
P(a int)
R(a int, b int)
S(b int)
1A typical case could be where P is Person, R is Knows, and S is the set of persons with the
Databases skill. Another case could be where P is the set of persons who work for Amazon, R is
personSkill and S is the set of skills of persons who live in Bloomington. Etc.
To generate P or S, you should use the function SetOfIntegers, which generates a set of up to n randomly selected integers in the range [l, u]:
create or replace function SetOfIntegers(n int, l int, u int)
returns table (x int) as
$$
select floor(random() * (u-l+1) + l)::int as x
from generate_series(1,n)
group by (x) order by 1;
$$ language sql;
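For instance, a call such as the one below (a small illustration, not part of the
assignment) returns at most 5 distinct integers in the range [1, 10]; it may return
fewer than 5 because duplicates are removed by the group by, and the actual
output will vary from run to run since the values are drawn at random.
select * from SetOfIntegers(5, 1, 10);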
To generate R, you should use the function BinaryRelationOverIntegers
which generates up to n randomly selected pairs with first components in the
range [l1, u1] and second components in the range [l2, u2]:
create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)
returns table (x int, y int) as
$$
select floor(random() * (u_1-l_1+1) + l_1)::int as x,
floor(random() * (u_2-l_2+1) + l_2)::int as y
from generate_series(1,n)
group by (x,y) order by 1,2;
$$ language sql;
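As a sketch of one way these functions might be used to populate the three
relations (the table definitions follow the schemas given above; the sizes and
ranges below are arbitrary illustration values, not prescribed ones):
create table P(a int);
create table R(a int, b int);
create table S(b int);
insert into P select x from SetOfIntegers(10000, 1, 1000);
insert into R select x, y from BinaryRelationOverIntegers(100000, 1, 1000, 1, 1000);
insert into S select x from SetOfIntegers(10000, 1, 1000);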
Example 1 Consider the query Q1
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
This query can be translated and optimized to the query Q2
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
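Read as RA expressions (a sketch of the rewriting involved, writing π for
projection and δ for duplicate elimination), Q1 and Q2 correspond roughly to
Q1 ≈ δ(π r1.a (R r1 ⋈ r1.b=r2.a R r2))
Q2 ≈ δ(π r1.a (R r1 ⋈ r1.b=r2.a δ(π r2.a (R r2))))
That is, Q2 pushes a projection and a duplicate elimination on the second copy
of R below the join, so the right-hand join input shrinks to at most one tuple
per distinct value of a.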
Imagine that you have generated a relation R. Then when you execute
explain analyze
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;
the system will return its query plan as well as the execution time to evaluate
Q1 measured in ms. And, when you execute
explain analyze
select distinct r1.a
from R r1 natural join (select distinct r2.a as b from R r2) r2;
the system will return its query plan as well as the execution time to evaluate
Q2 measured in ms. This permits us to compare the non-optimized query Q1
with the optimized query Q2 for various differently-sized relations R. Here are
some of these comparisons for differently-sized random relations R. In
this table, R was generated with lower and upper bounds l1 = l2 = 1000 and
u1 = u2 = 1000.2
|R|      Q1 (in ms)   Q2 (in ms)
10^4     27.03        7.80
10^5     3176.53      58.36
10^6     69251.58     400.54
Notice the significant difference between the execution times of the non-optimized
query Q1 and the optimized query Q2. So clearly, optimization works on query
Q1.
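As an aside (not required by the assignment, and in addition to explain analyze),
psql's client-side timing can be used to get a quick wall-clock measurement of
each statement; timings vary from run to run, so it is worth repeating each
measurement a few times.
\timing on
select distinct r1.a
from R r1, R r2
where r1.b = r2.a;   -- psql prints a line of the form "Time: ... ms"
\timing off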
Incidentally, below are the query plans for Q1 and Q2. Examining these
query plans should reveal why Q1 runs much slower than Q2. (Why?)
QUERY PLAN for Q1
————————————
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> Seq Scan on r r2
QUERY PLAN for query Q2
——————————————
HashAggregate
Group Key: r1.a
-> Hash Join
Hash Cond: (r1.b = r2.a)
-> Seq Scan on r r1
-> Hash
-> HashAggregate
Group Key: r2.a
-> Seq Scan on r r2
2All the experiments were done on a MacMini.
We now turn to the problems for this section.
2. Consider query Q3
select distinct p.a
from P p, R r1, R r2, R r3, S s
where p.a = r1.a and r1.b = r2.a and r2.b = r3.a and r3.b = s.b;
Intuitively, if we view R as a graph, and P and S as node types (properties),
then Q3 determines each P-node in the graph from which there emanates
a path of length 3 that ends at an S-node.3 I.e., a P-node n0 is in the
answer if there exists a sequence of nodes (n0, n1, n2, n3) such that (n0, n1),
(n1, n2), and (n2, n3) are edges in R and n3 is an S-node.
(a) Translate and optimize this query and call it Q4. Then write Q4 as
an RA SQL query just as was done for query Q2 in Example 1.
(b) Compare queries Q3 and Q4 in a similar way as we did for Q1 and
Q2 in Example 1.
You should experiment with different sizes for R. Incidentally, these
relations do not need to use the same parameters as those shown in
the above table for Q1 and Q2 in Example 1.
(c) What conclusions do you draw from the results of these experiments
regarding the effectiveness of query optimization in PostgreSQL and/or
by hand?
3Such a query is typical in Graph Databases.
3. Consider the Pure SQL query Q5, which is a formulation of a variation of the
not-subset (not only) set semijoin query
{p.a | P(p) ∧ R(p.a) ⊈ S}
where
R(p.a) = {r.b | R(r) ∧ r.a = p.a}.
select p.a
from P p
where exists (select 1
from R r
where r.a = p.a and
not exists (select 1 from S s where r.b = s.b));
(a) Translate and optimize this query and call it Q6. Then write Q6 as
an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q5 is as the
object-relational query
with nestedR as (select P.a, array_agg(R.b) as bs
from P natural join R
group by (P.a)),
Ss as (select array(select b from S) as bs)
select a
from nestedR
where not (bs <@ (select bs from Ss));
Call this query Q7. (A brief note on the <@ operator used here appears after this problem.)
Compare queries Q5, Q6, and Q7 in a similar way as we did in Example 1. However, now you should experiment with different sizes
for P, R and S as well as consider how P and S interact with R.
(c) What conclusions do you draw from the results of these experiments?
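A brief aside on the array containment operator <@ used in query Q7 above (it
appears again in the next problem and in Example 2 below): x <@ y is true
exactly when every element of the array x also occurs in the array y. For
instance:
select array[2,1] <@ array[1,2,3] as contained1,   -- true
       array[1,4] <@ array[1,2,3] as contained2;   -- false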
4. Consider the Pure SQL query Q8, which is a formulation of a variation of the
not-superset (not all) set semijoin query
{p.a | P(p) ∧ R(p.a) ⊉ S}
where
R(p.a) = {r.b | R(r) ∧ r.a = p.a}.
select p.a
from P p
where exists (select 1
from S s
where not exists (select 1 from R r where p.a = r.a and r.b = s.b));
(a) Translate and optimize this query and call it Q9. Then write Q9 as
an RA SQL query just as was done for Q2 in Example 1.
(b) An alternative way to write a query equivalent with Q8 is as the
object-relational query
with nestedR as (select P.a, array_agg(R.b) as bs
from P natural join R
group by (P.a)),
Ss as (select array(select b from S) as bs)
select a
from P
where a not in (select a from nestedR) and
not((select bs from Ss) <@ '{}')
union
select a
from nestedR
where not((select bs from Ss) <@ bs);
Call this query Q10.
Compare queries Q8, Q9, and Q10 in a similar way as we did in Example 1.
However, now you should experiment with different sizes for P, R and S as
well as consider how P and S interact with R. For example, it could be that
(c) What conclusions do you draw from the results of these experiments?
5. Give a brief comparison of your results for Problem 3 and Problem 4. In
particular, where the results show significant differences, explain why you think
that is the case. And, where the results show similarities, explain why you think
that is the case.
3 Object-Relational Programming
The following problems require you to write object-relational programs. Many of
these require programs written in Postgres' plpgsql database programming language.
6. Practice Problem–not graded. Consider the relation schemas V(node int) and
E(source int, target int) representing the schema for storing a directed graph G
with nodes in V and edges in E. Now let G be a directed graph that is acyclic,
i.e., a graph without cycles.4 A topological sort of an acyclic graph G is a list of
all nodes (n1, n2, . . . , nk) in V such that for each edge (m, n) in E, node m occurs
before node n in this list. Note that a path can be stored in an array.
Write a PostgreSQL program topologicalSort() that returns a topological sort of G.
7. Consider a parent-child relation PC(parent, child). (You can assume that PC is
a rooted tree and the domain of the attributes parent and child is int.) An edge
(p, c) in PC indicates that node p is a parent of node c. Now consider a pair of
nodes (m, n) in PC (m and n may be the same node). We say that m and n are in
the same generation when the distance from m to the root of PC is the same as the
distance from n to the root of PC. Consider the following recursive query that
computes the sameGeneration relation:
WITH RECURSIVE sameGeneration(m, n) AS
((SELECT parent, parent FROM PC)
UNION
(select child, child from PC)
UNION
SELECT t1.child, t2.child
FROM sameGeneration pair, PC t1, PC t2
WHERE pair.m = t1.parent and pair.n = t2.parent)
select distinct pair.m, pair.n
from sameGeneration pair
order by m, n;
Write a non-recursive function sameGeneration() in the language plpgsql that
computes the sameGeneration relation.
4A cycle is a path (n0, . . . , nl) where n0 = nl.
8. Consider the following relational schemas. (You can assume that the domain of
each of the attributes in these relations is int.)
partSubpart(pid, sid, quantity)
basicPart(pid, weight)
A tuple (p, s, q) is in partSubpart if part s occurs q times as a direct subpart of
part p. For example, think of a car c that has 4 wheels w and 1 radio r. Then
(c, w, 4) and (c, r, 1) would be in partSubpart. Furthermore, then think of a wheel
w that has 5 bolts b. Then (w, b, 5) would be in partSubpart.
A tuple (p, w) is in basicPart if basic part p has weight w. A basic part is defined
as a part that does not have subparts. In other words, the pid of a basic part does
not occur in the pid column of partSubpart. (In the above example, a bolt and a
radio would be basic parts, but car and wheel would not be basic parts.)
We define the aggregated weight of a part inductively as follows:
(a) If p is a basic part then its aggregated weight is its weight as given in the
basicPart relation.
(b) If p is not a basic part, then its aggregated weight is the sum of the aggregated
weights of its subparts, each multiplied by the quantity with which these subparts
occur in the partSubpart relation.
Example tables: The following example is based on a desk lamp with pid 1. Suppose
a desk lamp consists of 4 bulbs (with pid 2) and a frame (with pid 3), and a frame
consists of a post (with pid 4) and 2 switches (with pid 5). Furthermore, we will
assume that the weight of a bulb is 5, that of a post is 50, and that of a switch
is 3. Then the partSubpart and basicPart relations would be as follows:
partSubpart
pid sid quantity
1   2   4
1   3   1
3   4   1
3   5   2
basicPart
pid weight
2   5
4   50
5   3
Then the aggregated weight of a lamp is 4 × 5 + 1 × (1 × 50 + 2 × 3) = 76.
(a) Write a recursive function recursiveAggregatedWeight(p integer) that returns
the aggregated weight of a part p. Test your function.
(b) Write a non-recursive function nonRecursiveAggregatedWeight(p integer) that
returns the aggregated weight of a part p. Test your function.
9. Practice problem–not graded. Consider the heap data structure. For a
description, consult https://en.wikipedia.org/wiki/Binary_heap.
(a) Implement this data structure in PostgreSQL. This implies that you need to
implement the insert and extract heap operations. In this problem, you are not
allowed to use arrays to implement this data structure! Rather, you must use
relations.
(b) Then, using the heap data structure developed in question 9a, write a
PostgreSQL program heapSort() that implements the Heapsort algorithm. For a
description of this algorithm, see https://en.wikipedia.org/wiki/Heapsort
You are not allowed to use arrays to implement the Heapsort algorithm!
The input format is a list of integers stored in a binary relation
Data(index,value). For example, Data could contain the following data.
Data
index value
1     3
2     1
3     2
4     0
5     7
The output of heapSort() should be stored in a relation sortedData(index,value).
On the Data relation above, this should be the following relation:
sortedData
index value
1     0
2     1
3     2
4     3
5     7
10. Practice problem–not graded. Suppose you have a weighted (directed) graph G
stored in a ternary table with schema
Graph(source int, target int, weight int)
A triple (s, t, w) in Graph indicates that Graph has an edge (s, t) whose edge
weight is w. (In this problem, we will assume that each edge weight is a positive
integer.) Below is an example of a graph G.
Graph G
source target weight
0      1      2
1      0      2
0      4      10
4      0      10
1      3      3
3      1      3
1      4      7
4      1      7
2      3      4
3      2      4
3      4      5
4      3      5
4      2      6
Implement Dijkstra's Algorithm as a PostgreSQL function Dijkstra(s integer) to
compute the shortest path lengths (i.e., the distances) from some input vertex s
in G to all other vertices in G. Dijkstra(s integer) should accept an argument s,
the source vertex, and return a table that represents the pairs (t, d) where d is
the shortest distance from s to t in graph G.
To test your procedure, you can use the graph shown above. When you apply
Dijkstra(0), you should obtain the following table:
target shortestDistance
0      0
1      2
2      9
3      5
4      9
11. Consider the relation schema document(doc int, words text[]) representing a
relation of pairs (d, W) where d is a unique id denoting a document and W denotes
the set of words that occur in d. Let W denote the set of all words that occur in
the documents and let t be a positive integer denoting a threshold. Let X ⊆ W. We
say that X is t-frequent if
count({d | (d, W) ∈ document and X ⊆ W}) ≥ t
In other words, X is t-frequent if there are at least t documents that contain all
the words in X. Write a PostgreSQL program frequentSets(t int) that returns the
set of all t-frequent sets.
In a good solution for this problem, you should use the following rule: if X is
not t-frequent then any set Y such that X ⊆ Y is not t-frequent either. In the
literature, this is called the Apriori rule of the frequent itemset mining problem.
This rule can be used as a pruning rule. In other words, if you have determined
that a set X is not t-frequent then you no longer have to consider any of X's
supersets. To learn more about this problem you can visit the site
https://en.wikipedia.org/wiki/Apriori_algorithm.
Test your function frequentSets for thresholds 0 through 5.
12. Consider a directed graph G stored in a relation Graph(source int, target int).
We say that G is Hamiltonian if G has a cycle (n1, . . . , nk) such that each node n
in G occurs once, but only once, as a node ni in this cycle.
(a) Write a recursive function recursiveHamiltonian() that returns true if the
graph stored in Graph is Hamiltonian, and false otherwise. Test your function.
(b) Write a non-recursive function nonRecursiveHamiltonian() that returns true if
the graph stored in Graph is Hamiltonian, and false otherwise. Test your function.
4 Key-value Stores (MapReduce and Spark)
Consider the document "MapReduce and the New Software Stack" available in the
module on MapReduce.5 In that document, you can, in Sections 2.3.3-2.3.7, find
descriptions of algorithms to implement relational algebra operations in
MapReduce. (In particular, look at the mapper and reducer functions for various
RA operators.)
Remark 1 Even though MapReduce as a top-level programming language is only rarely
used, it still serves as an underlying programming environment to which other
languages compile. Additionally, the programming techniques of applying maps to
key-value stores and reducing (accumulating, aggregating) intermediate and final
results are an important feature of parallel and distributed data processing.
Additionally, the MapReduce framework forces one to reason about modeling data
towards key-value stores. Finally, the fact that the MapReduce programming model
can be entirely simulated in the PostgreSQL object-relational system underscores
again the versatility of this system for a broad range of database programming
and application problems.
In the following problems, you are asked to write MapReduce programs that
implement some RA operations and queries with aggregation in PostgreSQL. In
addition, you need to add the code which permits the PostgreSQL simulations for
these MapReduce programs.
Discussion A crucial aspect of solving these problems is to develop an appropriate
data representation for the input to these problems. Recall that in MapReduce the
input is a single binary relation of (key, value) pairs. We will now discuss a
general method for representing (encoding) a relational database in a single
key-value store. Crucial in this representation is the utilization of json
objects.6
Consider a relation R(a,b,c). For simplicity, we will assume that the domain of
the attributes of R is integer.7
create table R (a int, b int, c int);
insert into R values (1,2,3), (4,5,6), (1,2,4);
table R;
 a | b | c
---+---+---
 1 | 2 | 3
 4 | 5 | 6
 1 | 2 | 4
5This is Chapter 2 in Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman,
and Jeffrey D. Ullman.
6Incidentally, this modeling technique is independent of MapReduce and can also be
used to map relational data to other systems and programming languages that center
around json objects.
7However, this approach can be generalized for other domains such as string,
booleans, etc.
Starting from this relation R we can, using jsonb8 functions and operations on
jsonb objects, come up with an encoding of R as a key-value store. Consider the
tuple (1, 2, 3) in R. We will represent (encode) this tuple as the key-value pair
('R', {"a": 1, "b": 2, "c": 3}). So the key of this pair is the relation name 'R'
and the jsonb object {"a": 1, "b": 2, "c": 3} represents the tuple (1, 2, 3).
Based on this idea of representing tuples of R, we can generate the entire
key-value store for R using an object-relational SQL query.9 To that end, we can
use the jsonb_build_object PostgreSQL function.
create table encodingofR (key text, value jsonb);
insert into encodingofR
select 'R' as key, jsonb_build_object('a', r.a, 'b', r.b, 'c', r.c) as value
from R r;
This gives the following encoding for R.
table encodingofR;
 key |           value
-----+-----------------------------
 R   | {"a": 1, "b": 2, "c": 3}
 R   | {"a": 4, "b": 5, "c": 6}
 R   | {"a": 1, "b": 2, "c": 4}
Note that we can also "decode" the encodingofR key-value store to recover R by
using the following object-relational SQL query. To that end, we can use the
jsonb selector function ->.
select p.value->'a' as a, p.value->'b' as b, p.value->'c' as c
from encodingofR p;
 a | b | c
---+---+---
 1 | 2 | 3
 4 | 5 | 6
 1 | 2 | 4
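A small caveat (an observation, not something the assignment asks for): the ->
selector returns jsonb values, so the columns of the decoded result above are of
type jsonb rather than int. If the original integer typing matters, one can use
the ->> selector, which returns text, and cast. For example:
select (p.value->>'a')::int as a,
       (p.value->>'b')::int as b,
       (p.value->>'c')::int as c
from encodingofR p;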
An important aspect of this encoding strategy is that it is possible to put
multiple relations, possibly with different schemas and arities, into the same
key-value store. Besides R, let us also consider a binary relation S(a,d).
create table S (a int, d int);
insert into S values (1,2), (5,6), (2,1), (2,3);
table S;
8PostgreSQL supports both json and jsonb objects. For this assignment, you should use the
jsonb object type since it comes with more functionality and offers more efficient computation.
9Notice that this strategy works in general for any relation, independent of the number of
attributes of the relation.
 a | d
---+---
 1 | 2
 5 | 6
 2 | 1
 2 | 3
(4 rows)
We can now encode both R and S into a single key-value store encodingofRandS
as follows:
create table encodingofRandS(key text, value jsonb);
insert into encodingofRandS
select 'R' as key, jsonb_build_object('a', r.a, 'b', r.b, 'c', r.c) as value
from R r
union
select 'S' as key, jsonb_build_object('a', s.a, 'd', s.d) as value
from S s
order by 1, 2;
table encodingofRandS;
 key |          value
-----+--------------------------
 R   | {"a": 1, "b": 2, "c": 3}
 R   | {"a": 1, "b": 2, "c": 4}
 R   | {"a": 4, "b": 5, "c": 6}
 S   | {"a": 1, "d": 2}
 S   | {"a": 2, "d": 1}
 S   | {"a": 2, "d": 3}
 S   | {"a": 5, "d": 6}
(7 rows)
Furthermore, we can decode this key-value store using 2 object-relational SQL
queries and recover R and S.
select p.value->'a' as a, p.value->'b' as b, p.value->'c' as c
from encodingofRandS p
where p.key = 'R';
 a | b | c
---+---+---
 1 | 2 | 3
 4 | 5 | 6
 1 | 2 | 4
(3 rows)
select p.value->'a' as a, p.value->'d' as d
from encodingofRandS p
where p.key = 'S';
 a | d
---+---
 1 | 2
 5 | 6
 2 | 1
 2 | 3
(4 rows)
Example 2 Consider the following problem. Write, in PostgreSQL, a basic
MapReduce program, i.e., a mapper function and a reducer function, as well as
a 3-phases simulation that implements the set intersection of two unary relations
R(a) and S(a), i.e., the relation R ∩ S. You can assume that the domain of the
attribute ‘a’ is integer.
-- EncodingOfRandS
drop table R; drop table S;
create table R(a int);
insert into R values (1),(2),(3),(4);
create table S(a int);
insert into S values (2),(4),(5);
drop table EncodingOfRandS;
create table EncodingOfRandS(key text, value jsonb);
insert into EncodingOfRandS
select 'R' as key, jsonb_build_object('a', r.a) as value
from R r
union
select 'S' as key, jsonb_build_object('a', s.a) as value
from S s
order by 1;
table EncodingOfRandS;
 key |  value
-----+----------
 R   | {"a": 1}
 R   | {"a": 4}
 R   | {"a": 2}
 R   | {"a": 3}
 S   | {"a": 4}
 S   | {"a": 5}
 S   | {"a": 2}
(7 rows)
-- mapper function
-- (the input parameter names are chosen so that they do not clash with the
-- output column names declared in RETURNS TABLE)
CREATE OR REPLACE FUNCTION mapper(in_key text, in_value jsonb)
RETURNS TABLE(key jsonb, value text) AS
$$
SELECT in_value, in_key;
$$ LANGUAGE SQL;
-- reducer function
-- (again, the input parameter names are chosen so that they do not clash with
-- the output column names)
CREATE OR REPLACE FUNCTION reducer(in_key jsonb, valuesArray text[])
RETURNS TABLE(key text, value jsonb) AS
$$
SELECT 'R intersect S'::text, in_key
WHERE ARRAY['R','S'] <@ valuesArray;
$$ LANGUAGE SQL;
-- 3-phases simulation of MapReduce program followed by a decoding step
WITH Map_Phase AS (
  SELECT m.key, m.value
  FROM encodingOfRandS,
       LATERAL(SELECT key, value FROM mapper(key, value)) m
), Group_Phase AS (
  SELECT key, array_agg(value) as value
  FROM Map_Phase
  GROUP BY (key)
), Reduce_Phase AS (
  SELECT r.key, r.value
  FROM Group_Phase,
       LATERAL(SELECT key, value FROM reducer(key, value)) r
)
SELECT p.value->'a' as a
FROM Reduce_Phase p
order by 1;
 a
---
 2
 4
(2 rows)
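As a quick sanity check of the simulation (an optional aside, not part of the
required output), the result can be compared against the corresponding plain SQL
query on the original relations, which also returns 2 and 4 on the data above:
select a from R
intersect
select a from S
order by 1;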
We now turn to the problems for this section.
13. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well
as a 3-phases simulation that implements the symmetric difference of two
unary relations R(a) and S(a), i.e., the relation (R−S)∪(S−R). You can
assume that the domain of the attribute ‘a’ is integer.
14. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the semijoin of two relations R(A,B) and S(A,B,C), i.e., the relation
R ⋉ S. You can assume that the domains of A, B, and C are integer.
Use the encoding and decoding methods described above.
15. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a
3-phases simulation that implements the natural join R ⋈ S of two relations R(A, B) and S(B,C). You can assume that the domains of A, B, and
C are integer. Use the encoding and decoding methods described above.
16. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query
SELECT r.A, array_agg(r.B), sum(r.B)
FROM R r
GROUP BY (r.A)
HAVING COUNT(r.B) < 3;
Here R is a relation with schema (A, B). You can assume that the domains
of A and B are integers. Use the encoding and decoding methods described
above.
We now turn to some problems that relate to query processing in Spark.
Note that in Spark it is possible to operate on multiple key-value stores.
17. Let R(K, V) and S(K, W) be two binary key-value pair relations. You
can assume that the domains of K, V , and W are integers. Consider the
cogroup transformation R.cogroup(S) introduced in the lecture on Spark.
(a) Define a PostgreSQL view coGroup that computes a complex-object
relation that represents the co-group transformation R.cogroup(S).
Show that this view works.
(b) Write a PostgreSQL query that uses this coGroup view to compute
the semijoin R ⋉ S, in other words compute the relation R ⋈ πK(S).
(c) Write a PostgreSQL query that uses this coGroup view to implement
the SQL query
SELECT distinct r.K as rK, s.K as sK
FROM R r, S s
WHERE NOT ARRAY(SELECT r1.V
FROM R r1
WHERE r1.K = r.K) && ARRAY(SELECT s1.W
FROM S s1
WHERE s1.K = s.K);
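A small aside on the && operator appearing in the query of part (c) above: for
two arrays, x && y is true exactly when x and y have at least one element in
common, so NOT x && y states that the two arrays are disjoint. For instance:
select array[1,2] && array[2,3] as overlaps1,   -- true
       array[1,2] && array[3,4] as overlaps2;   -- false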
18. Practice problem–not graded. Let A(x) and B(x) be the schemas to
represent two sets of integers A and B. Consider the cogroup transformation introduced in the lecture on Spark. Using an approach analogous to
the one in Problem 17, solve the following problems:10
(a) Write a PostgreSQL query that uses the cogroup transformation to
compute A ∩ B.
(b) Write a PostgreSQL query that uses the cogroup operator to compute
the symmetric difference of A and B, i.e., the expression
(A − B) ∪ (B − A).
10An important aspect of this problem is to represent A and B as key-value stores.
5 Graph query languages
Each of the following problems is a practice problem.
19. Practice Problem–not graded. Consider the database schema Person,
Company, companyLocation, Knows, jobSkill, and personSkill.
(a) Specify an Entity-Relationship Diagram that models this database
schema.
(b) Specify the node and relationship types of a Property Graph for
this database schema. In addition, specify the properties, if any,
associated with each such type.
20. Practice Problem–not graded. Using the Property Graph model in
Problem 19b, formulate the following queries in the Cypher query language:
(a) Find the types of the relationships associated with Person nodes.
(b) Find each person (node) whose name is ‘John’ and who has a salary that
is at least 50000.
(c) Find each jobSkill (node) that is the job skill of a person who knows
a person who works for ‘Amazon’ and who has a salary that is at
least 50000.
(d) Find each person (node) who knows directly or indirectly (i.e., recursively) another person who works for Amazon.
(e) Find for each company node, that node along with the number of
persons who work for that company and who have both the Databases
and Networks job skills.