Description

5/5 - (1 vote)

CPTS415 Assignment#1

1. [Big Data concept] (10) Give one example of Big Data application you know. Use the detailed
example to explain each of the five Big V’s. If you are required to design a database system for
this application, what are the best data models (relational, XML, RDF, among others) you would
use to represent the data and why?
2. [Relational Data Model] (30) As of January 2017, the OpenFlights Airports Database
(https://openflights.org/data.html) contains over 10,000 airports, train stations and ferry
terminals spanning the globe. Each entry in the Airport table contains the following:
——————————————————————————————————————————————-
Airport ID Unique OpenFlights identifier for this airport.
Name Name of airport. May or may not contain the City name.
City Main city served by airport. May be spelled differently from Name.
Country Country or territory where airport is located. See countries.dat to cross-reference to ISO
3166-1 codes.
IATA 3-letter IATA code. Null if not assigned/unknown.
ICAO 4-letter ICAO code.
Latitude Decimal degrees, usually to six significant digits. Negative is South, positive is North.
Longitude Decimal degrees, usually to six significant digits. Negative is West, positive is East.
Altitude In feet.
Timezone Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.
DST Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z
(New Zealand), N (None) or U (Unknown). See also: Help: Time
Tz database time zone Timezone in “tz” (Olson) format, eg. “America/Los_Angeles”.
Type Type of the airport. Value “airport” for air terminals, “station” for train stations, “port” for
ferry terminals and “unknown” if not known. In airports.csv, only type=airport is included.
Source Source of this data. “OurAirports” for data sourced from OurAirports, “Legacy” for old data
not matched to OurAirports (mostly DAFIF), “User” for unverified user contributions. In airports.csv, only
source=OurAirports is included.
——————————————————————————————————————————————-
(a) (5) Consider the following terms: relation schema, relational database schema, domain,
attribute, attribute domain, relation instance. Give what these terms are with the above Airport
database. Give one small (4-5 tuples) instance of the Airport table.
(b) (10) There are three databases in the OpenFlight dataset: Airport, Airline, and Route. Give
the schema of these three databases and mark the primary keys, foreign keys and provide
examples of functional dependencies you identified over the three tables. [You may draw a
diagram to illustrate the schema, PKs, FKs and FDs]
(c) [FD inferencing] (10)
Recall Armstrong’s axioms.
1. Reflexivity rule: if Y ⊆ X then X → Y
2. Augmentation rule: if X → Y then XZ → YZ
3. Transitivity rule: if X → Y and Y → Z then X → Z
(1) Give two examples for using Armstrong’s inference rules to induce new FDs from the set of
FDs you designed in question 2 (b).
(2) Prove the following inference rules also hold, using FD definition and Armstrong’s Axioms.
a. decomposition rule: if X → YZ then: X → Y and X → Z
b. Pseudo transitivity: if X → Y and YW → Z then: XW → Z
(d) [Normalization] (5) Given a relation R(A1, A2, A3, A4), with three FDs A2, A3 → A4 ; A3, A4 → A1;
A1, A2→ A3. Provide the 3NF and BCNF form of the schema and explain why.
3. [Relational Algebra] (20) Consider the following database schema:
Movies (Title, Director, Actor);
Location (Theater, Address, Phone number);
Schedule (Theater, Title, Time).
Express the following queries in relational algebra (select σ, project ∏ , Cartesian product X, join
(theta-join))
-Q1: which theaters feature “Zootopia”?
-Q2: List the names and address of theaters featuring a film directed by Steven Spielberg.
-Q3: What are the address and phone number of the Le Champo theater?
-Q4: List pairs of actors that acted in the same movie. (* you want to use renaming on Movies
and join the Movies with its copy Movie’).
4. [Join Operators] (40) This sets of questions test the understanding of basic database search
operators. Consider a join ⋈!.#$%.&. We ignore the cost of output the result, and measure the
cost with the number of I/Os. Given the information about relations to be joined below:
Relation � contains 20,000 tuples and has 10 tuples per block. Relation � contains 100,000
tuples and has 10 tuples per block. Attribute � is the primary key of �. In total, 52 blocks are
available in memory. Assume neither relation has any index.
a. (10) Describe a block nested join algorithm, Give the cost of joining � and � with a block
nested loops join.
b. (15) Describe a sort-merge join algorithm. Give the cost of joining � and � with a sort-merge
join.
c. (15) Describe a hash-join algorithm. Give the cost of joining � and � with a hash join.

CPTS415 Assignment#2

1. [XML and RDF] (40)
(a) (10) Consider the database instance you gave in Assignment 1 Question 2 (a). Assume now that
you don’t have any schema. Give an XML document to represent the tuples as the fact about the
airports.
(b) (10) Consider the relational schemas you gave in Assignment 1 Question 2 (b). Give an XML
schema representation of each relational schema. How do you encode keys? Foreign keys?
(c) (20) Consider a set of natural language sentences collected from Webpages.
i. A human can like another human.
ii. A human can have a sex property of a man or a woman.
iii. A man can be the father of anotherhuman.
iv. A woman can be the mother of another human.
v. A human can be married to another human.
vi. A human can have a BirthYear property of type“xs:Year”.
vii. If a human is married to another, then they like eachother.
viii. If a human is a mother or father, the human is a parent.
Write a RDF schema and give a graphical presentation to describe these relationships.
2. [Graph Algorithm] (30) The following questions test your understanding on basic graph algorithms
a. (15) Given a directed graph �(�, �, �) with � the node set, � the edge set and � a function
that assigns to each edge � ∈ � a label �(�). A label constrained reachability query �(�,�, �)
tests if there exists a path from a source node � to a target node �, which consists of edges
having a label from a label set �. Give an algorithm (pseudo-code) to answer query �.
(hint: A straightforward way is to revise BFS or DFS traversal)
b. (15) Consider a network �(�, �) of servers, where each edge � = (�, �) represents a
communication channel from a server � to another server �. Each edge has an associated
value �(�, �), which is a constant in [0,1]. The value represents the reliability of the channel,
i.e., the probability that the channel from server � to server � will not fail. Assume that these
probabilities are independent. Give an algorithm (pseudo-code) to find the most reliable path
between two given servers. Give the complexity (in Big O notation) of your algorithm.
(hint: transform the weight to non-negative numbers and the problem may become very
familiar to you).
3. [Approximate Query Processing] (25) This question continues our discussion on using data
synopsis for query processing based on data-driven approximation. You are given a vector of
numbers: [127, 71, 87, 31, 59, 3, 43, 99, 100, 42, 0, 58, 30, 88, 72, 130], each data point records the
frequency of communication of a server in a 5-minute interval. For example, in the first 5 minutes,
127 contacts are observed. In the next 5 minutes which is time interval [5, 10], 71 contacts, …
a. Give the Haar decomposition and draw a corresponding error tree for the contacts data vector.
b. Give the process and result for reconstructing the frequency during time interval [15, 20] using
Haar decomposition (explain the path in a top-down fashion).
c. Use Haar decomposition and error tree to compute the total number of communications
between time interval [15, 30] (explain the path in a top-down fashion).

CPTS415 Assignment #4

1. [noSQL] (40) This set of questions get you to engage discussion related with noSQL systems.
a. Explain the concept of noSQL
b. Describe the 4 types of noSQ systems; for each category, give an example noSQL system (try
to give different system from what we introduced in the class).
c. Pick one type of noSQL system you give in b, give a real-world application that can make
best use of the system, and an application that may not be well-suited for the system.
Explain your reasons.
2. [CAP and ACID] (20) This set of questions are related to data consistency
a. What is CAP Theory? Consider a toy example of a cluster that contains two servers S1 and S2.
Use the example cluster to explain the CAP theory. You can simply use the proof I explained in
class.
b. Consider the relation Accounts(acctNo, balance), and two SQL statements that conduct a
request “transfer $200 from account 123 to 456”. In a DBMS, this usually includes two steps:
i. Add $200 to account 456:
UPDATE Accounts SET balance=balance+200 WHERE acctNo = 456
ii. Subtract $200 from account 123:
UPDATE Accounts SET balance=balance‐200 WHERE acctNo=123
Use this example and necessary scenarios to show when Atomicity, Consistency, Isolation
and Durability can be violated.
3. [Column Store] (20) We take a closer look at Column Store
a. Explain the major features of column stores in terms of data storage and storage key.
b. We introduced three techniques to optimize column-oriented databases: compression, late
materialization and block iteration. Please explain how they work.
4. [newSQL] (20)
a. Describe the definition of NewSQL systems and design principles.
b. For in-memory DBMS, describe the set-associative cache for block placement and LRU block
replacement policy.

CPTS415 Assignment #3

1. [Parallel Data Models] (30)
a. What is speedup and scaleup? Give three reasons why we cannot do better than linear
speedup.
b. Assume a program P running on a single-processor system takes time T to complete. 40% of
P can only be executed sequentially on a single processor, and the rest is “embarrassingly
parallel” in that it can be easily divided into smaller tasks executing concurrently across
multiple processors. What are the best time costs to execute P using 2, 4, 8 machines
(expressed by T)? What are the speed-ups respectively? What are the optimal speed-ups
given an infinite number of machines?
c. Describe and compare the pros and cons of the three architecture for parallel systems.
2. [MapReduce] (40) This set of questions test the understanding and application of MapReduce
framework.
a. (20) Facebook updates the “common friends” of you and response to hundreds of millions of
requests every day. The friendship information is stored as a pair (Person, [List of Friends]) for
every user in the social network. Write a MapReduce program to return a dictionary of
common friends of the form ((User i, User j), [List of Common Friends of User i and User j]) for
all pairs of i and j who are friends. The order of i and j you returned should be the same as the
lexicographical order of their names. You need to give the pseudo-code of 1 main function,
and 1 Map() and 1 Reduce() function. Specify the key/value pair and their semantics (what are
they referring to?).
b. (20) Top-10 Keywords. Search engine companies like Google maintains hot webpages in a set
� for keyword search. Each record � ∈ � is an article, stored as a sequence of keywords. Write
a MapReduce program to report the top 10 most frequent keywords appeared in the
webpages in �. Give the pseudo-code of your MR program.
Hit: You may need two rounds of MR processes for (b)
3. [Apache Spark] (30) This set of questions relate to Apache Spark
a. Explain the definition of RDD and how the lineage retrieval works
b. List the reasons why Spark can be faster than MapReduce.
c. Explain the definitions of narrow dependencies and wide dependencies. In addition, explain
how Spark determines the boundary of each stage in a DAG and why put operators into stages
will improve the performance.

CPTS415 Assignments 1 to 4 solution

Download Details:

Description

CPTS415 Assignment#1

CPTS415 Assignment#2

CPTS415 Assignment #4

CPTS415 Assignment #3

CPTS415 Assignments 1 to 4 solution

Download Details:

Description

CPTS415 Assignment#1

CPTS415 Assignment#2

CPTS415 Assignment #4

CPTS415 Assignment #3

Related products

CPTS415 Assignment #3 solution

CPTS415 Assignment #4 solution

CPTS415 Assignment#1 solution