Description
Assignment #1 CSCI 5408 (Data Management, Warehousing, Analytics)
Problem #1: This item will be graded based on aforesaid rubric.
Hypothetical Scenario
The Halifax city universities are trying to merge their databases under one city university initiative. In this
regard, the council has decided to revisit the data and information that are already available, and rebuild a data
model. Once the data model is obtained in the conceptual model, physical design of the database will be
performed.
You are hired as a Strategist to create a conceptual model for this scenario. The council will not accept a poorly
designed data model or a data model with design flaws. Therefore, you are expected to follow some steps to
create the conceptual model. This model is independent of a specific database or hardware; therefore, while
creating the model you do not have to consider any database or tables etc. This is a higher-level conceptual
model, which must be designed using tools such as Erwin, Visio, Dia, Draw.io, or similar tools.
Conditions/Steps, you must Follow (Do not skip any point):
1. Install, access, or explore an appropriate software tool (Erwin/Visio/Draw.io etc.) for ERD/EERD creation
2. Explore these three university websites (including sub-pages/domains) given below and write down
names of possible entity sets required for the city university initiative project.
(1) https://www.dal.ca/
(2) https://www.msvu.ca/
(3) https://www.smu.ca/
3. You must identify at least 25 entity sets, and for each entity write a reason (in a single sentence/one line)
for your selection, in the tabular format given below:
Table: Entity Identification
Entity | Reason for considering | Source
4. Once you have all the entities/entity sets of your choice, start building the ERD (Chen model) using the
software tool. At this point, try to focus on the relationships and cardinality. → {Label it as ERD_Initial_P1}
5. Identify and highlight any design issues in your ERD. You need to take a systematic approach to finding
solutions for the design issues, attributes that were not considered, or entities that you newly discovered,
and document them with possible solutions. You need to write at least ½ a page on what problems you
found and how you are going to solve them in your final ERD.
6. Modify your current ERD and solve the design issues → {Label it as ERD_Final_P1}
Problem #1 Submission Requirements:
(1) A single Problem #1 PDF with list of entities,
(2) Image of ERD_Initial_P1,
(3) Image of ERD_Final_P1
Problem #2: This item will be graded based on aforesaid rubric.
Hypothetical Scenario
You are hired as a Strategist to create a conceptual model for a company working with the WHO (World Health
Organization). You are expected to create the conceptual model without flaws, and with a possibility of future
enhancements. This model is independent of a specific database or hardware; therefore, while creating the
model you do not have to consider any database or tables or data points etc. This is a higher-level conceptual
model, which must be designed using tools such as Erwin, Visio, Dia, Draw.io, or similar tools.
Conditions/Steps, you must Follow (Do not skip any point):
1. To create an Extended ERD (EERD), use the tool you already installed or accessed in the previous problem.
If your EERD is large, you can always break it in multiple parts (smaller EERDs or ERDs).
2. You need to visit https://www.who.int/ and check the different sub-pages or sub-domains that are publicly
available.
3. Without writing any script or program, just by performing some manual observations, you need to identify
at least 20 entities from the pages you have visited.
4. For writing the entities, create the same tabular format that you used for the previous problem. In
addition, add possible attributes (separated by commas) for the entities in an additional column:
Entity | Reasons for considering | Source | Possible attributes
5. Once you have all the entities/entity sets of your choice, start building the EERD (Chen model) using the
software tool. At this point, try to focus on the relationships and cardinality. → {Label it as EERD_Initial_P2}
6. Identify and highlight any design issues in your EERD. You need to take a systematic approach to finding
solutions for the design issues, attributes that were not considered, or entities that you newly discovered,
and document them with possible solutions. You need to write at least ½ a page on what problems you
found and how you are going to solve them in your final EERD.
7. Modify your current EERD and solve the design issues → {Label it as EERD_Final_P2}
Problem #2 Submission Requirements:
(1) A single Problem #2 pdf with list of entities, reasons, sources, attributes
(2) Image of EERD_Initial_P2,
(3) Image of EERD_Final_P2
Problem #3: This item will be graded based on aforesaid rubric.
Format, Clean, Store Ocean Tracking Data and Report your findings
Dalhousie Ocean Research wants you to explore the dataset they provided, and perform the following:
β’ Read the document available at
https://oceantrackingnetwork.org/about/#oceanmonitoring
β’ Write a report on what are the different datasets, and attributes you discovered.
β’ Clean and transform the dataset using a combination of manual work, spreadsheet filtration, and/or code
written in Java. Include your cleaning/transformation steps in the Problem #3 PDF file. (Note: if you write any
programming script, it must be added to gitlab as part of the submission)
Transformation and Cleaning requirements.
β’ remove NULL values
β’ rearrange the columns if they are shifted and do not match the flow
β’ transform the data in a column or attribute, if required, to fit a common format (e.g. date
format)
β’ In the clean spreadsheets/CSVs you created, is there a possibility of combining some of the
files, or columns in the files (without losing information)? If yes, please perform the task and
add your findings in the Problem #3 pdf file.
OR
β’ In the clean spreadsheets/CSVs you created, is there a possibility of further decomposing of
the files, or columns in the files (without losing information)? If yes, please perform the task
and report your findings in Problem #3 pdf file.
β’ Based on your final CSVs or spreadsheet files, create a relational schema using the MySQL DBMS.
β’ Populate the database with your transformed dataset. If the dataset is large, you can consider uploading a
random subset of at most 3000 data points to the database.
β’ Using MySQL Workbench and reverse engineering, create the possible ERD. Your report must contain the
ERD produced by the reverse engineering. In addition, you need to add the cardinality.
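The cleaning and transformation steps above can be sketched in Java before committing to a full script. This is a minimal illustration, not a required implementation: the treatment of NULLs (drop the row) and the date format conversion (an assumed MM/dd/yyyy input normalized to ISO yyyy-MM-dd) are assumptions you would adapt to the actual Ocean Tracking Network files.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvCleaner {

    // A row is kept only if no field is empty or the literal string "NULL".
    static boolean isCleanRow(String[] fields) {
        for (String f : fields) {
            if (f == null || f.trim().isEmpty() || f.trim().equalsIgnoreCase("NULL")) {
                return false;
            }
        }
        return true;
    }

    // Normalize an assumed MM/dd/yyyy date into ISO yyyy-MM-dd.
    static String normalizeDate(String raw) {
        String[] p = raw.trim().split("/");
        if (p.length != 3) return raw;          // leave unrecognized values untouched
        return String.format("%s-%02d-%02d",
                p[2], Integer.parseInt(p[0]), Integer.parseInt(p[1]));
    }

    // Drop every row that fails the NULL check (header excluded by the caller).
    static List<String[]> clean(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (isCleanRow(row)) out.add(row);
        }
        return out;
    }
}
```

The same two helpers cover the "remove NULL values" and "common format" bullets; column rearrangement is a per-file mapping and is easiest to document as part of the written steps.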
Problem #3 Submission Requirements:
(1) Any findings, logic etc. must be included in the Problem #3 pdf file.
(2) SQL Dump of Table structure and values must be submitted.
(3 – optional) If you write a script/program to clean/format data, then upload your program code to gitlab
(https://git.cs.dal.ca).
(4) You can export the ERD from workbench and include it as an image file
Assignment Submission Instructions:
All files must be added to a single .zip file before uploading to Brightspace.
Do not use any other compression format.
Rename the .zip file as Your_FirstNameB00xxxxx.zip
Assignment #2 CSCI 5408 (Data Management, Warehousing, Analytics)
Problem #1: This problem contains one reading task.
Reading Material #1: To retrieve the paper, visit the IEEE database through libraries.dal.ca
Y. Gao, X. Gao, X. Yang, J. Liu and G. Chen, "An Efficient Ring-Based Metadata Management Policy for
Large-Scale Distributed File Systems," in IEEE Transactions on Parallel and Distributed Systems, vol. 30,
no. 9, pp. 1962-1974, 1 Sept. 2019, doi: 10.1109/TPDS.2019.2901883.
→ Read the paper and perform the following:
β’ Write a summary (≈1.5 pages) on the paper in your own words. (You do not need to add
images/figures/tables from the paper; however, you can add your own block diagrams or flowcharts to
support the summary you have written.)
β’ What is the central idea of discussion?
β’ Did you find any topic of interest in this paper? If yes, what are those, and why do you think they are
interesting? If no, then, in your view, what are the shortcomings of this paper?
Problem #1 Submission Requirements: Approx. 1.5-page report (problem1.pdf) containing the summary
and analysis
Problem #2: This problem contains a series of tasks. You need to fill a worksheet and provide screenshots
to support your worksheet.
Research and Development: You need to create a distributed DBMS infrastructure and perform a
comparative study of local and distributed transactions.
Visit the website and extract the following datasets: https://www.kaggle.com/olistbr/brazilian-ecommerce
Task 1: Building a database from the given dataset
β’ Download the CSV files given in the Kaggle link, and consider those as your data points.
o Similar to the previous assignment (Assignment 1), clean and format each CSV file (replace null
values with "0" if it is a number column || replace null values with " " if it is a text/string column).
Write a ½ page summary on what type of cleaning and formatting you applied to the CSV files. A
single ½ page summary for all CSV files is sufficient. Add the summary to problem2.pdf
o You do not need to remove any columns; keep all columns as they are in the CSVs.
β’ Create a single database (name of the database: A2) in your local MySQL RDBMS.
β’ Create 9 tables within the newly created database. These tables will have a structure identical to the
CSV files you cleaned, and you will populate the tables by importing data from the CSV files. (If the files
are large, you need to import using the command prompt; the Workbench import option may not work.)
β’ Export the SQL (structure + values) of the database with all tables and data. Name the file
"MyDatabase.SQL"
β’ Create a GCP MySQL virtual machine instance and use MyDatabase.SQL to create the database and
tables, and to upload the data.
β’ If you have successfully completed the above steps, then at this point you have two identical
databases: one on your local machine and another in the cloud.
β’ In Problem2.pdf, attach screenshots of your cloud instance and your local instance running the MySQL
command "SHOW TABLES FROM A2;"
Task 2: Perform a transaction on your local machine
β’ In your local machine's MySQL, you have 9 tables in the database A2. You need to write a transaction
with at least (2 insert, 2 delete, 2 update, 3 select operations) considering all the 9 tables. For insert
and update you can create your own set of dummy data. Based on your transaction query formation,
complexity, and uniqueness you will get points for Novelty.
β’ Capture the transaction execution time (i.e. the execution time for all 9 or more statements)
Begin Transaction
-----
-----
End Transaction
Task 3: Perform a transaction on the remote machine
β’ In your GCP MySQL virtual machine, you have 9 tables in the database A2. You need to write a remote
transaction with at least (2 insert, 2 delete, 2 update, 3 select operations) considering all the 9 tables.
For insert and update you can create your own set of dummy data. Based on your transaction query
formation, complexity, and uniqueness you will get points for Novelty. You should not use the same
queries that you used in Task 2; please make some major modifications to the SQL queries.
β’ Capture the transaction execution time (i.e. the execution time for all 9 or more statements)
Begin Transaction
-----
-----
End Transaction
Task 4: Perform a distributed transaction
β’ In your GCP MySQL virtual machine and in your local MySQL, you have 9 tables in the database A2.
You need to write a distributed transaction with at least (2 insert, 2 delete, 2 update, 3 select
operations) considering all the 9 tables.
β’ This time, from the 9 unique tables, you need to use any 5 tables (your choice) from your local machine
database, and the remaining 4 tables in the GCP instance for performing the distributed transaction.
β’ For insert and update you can create your own set of dummy data.
Based on your transaction query formation, complexity, and uniqueness you will get points for Novelty.
Please make some major modifications to the SQL queries.
β’ You need to write a Java program to call the distributed transaction.
Hint: Set connection.setAutoCommit(false); in your Java program to control the execution and database
update.
o Run the Java program:
o Within the Java program, select the 1st database (local machine)
Begin Transaction
-----
-----
select the 2nd database (remote machine)
-----
-----
End Transaction
connection.setAutoCommit(true);
β’ Capture the distributed transaction execution time (i.e. the execution time for all 9 or more statements)
β’ Fill the worksheet and submit it as part of Problem2.pdf
Worksheet:
Tasks | Execution Time | Transaction Query | Your Observations
Task 2 | | Write the transaction query here for each transaction |
Task 3 | | |
Task 4 | | |
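The Task 4 flow above can be driven from Java via JDBC. This is a hedged sketch, not the required solution: the hosts, credentials, the 5/4 split of statements between the two connections, and the table/statement texts in `buildTransactionPlan()` are all placeholders illustrating the required statement mix (2 insert, 2 delete, 2 update, 3 select).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class DistributedTxn {

    // The required statement mix: 2 inserts, 2 deletes, 2 updates, 3 selects.
    // Table names are illustrative placeholders for the 9 Olist tables.
    static List<String> buildTransactionPlan() {
        return Arrays.asList(
            "INSERT INTO orders VALUES ('o1', 'c1', 'delivered')",
            "INSERT INTO customers VALUES ('c1', 'halifax')",
            "UPDATE order_items SET price = price * 1.1 WHERE order_id = 'o1'",
            "UPDATE products SET product_weight_g = 0 WHERE product_id = 'p1'",
            "DELETE FROM order_reviews WHERE review_id = 'r1'",
            "DELETE FROM order_payments WHERE order_id = 'o1'",
            "SELECT COUNT(*) FROM orders",
            "SELECT COUNT(*) FROM sellers",
            "SELECT COUNT(*) FROM geolocation");
    }

    static void run(Connection local, Connection remote, List<String> plan) throws SQLException {
        local.setAutoCommit(false);              // hold both sites open until commit
        remote.setAutoCommit(false);
        long start = System.nanoTime();
        try (Statement ls = local.createStatement(); Statement rs = remote.createStatement()) {
            for (int i = 0; i < plan.size(); i++) {
                // illustrative split: first 5 statements local, the rest on the GCP instance
                Statement target = (i < 5) ? ls : rs;
                target.execute(plan.get(i));
            }
            local.commit();
            remote.commit();
        } catch (SQLException e) {
            local.rollback();                    // undo both sites on any failure
            remote.rollback();
            throw e;
        } finally {
            System.out.printf("transaction took %d ms%n", (System.nanoTime() - start) / 1_000_000);
            local.setAutoCommit(true);
            remote.setAutoCommit(true);
        }
    }

    public static void main(String[] args) throws SQLException {
        // placeholder connection strings; substitute your own host, user, and password
        Connection local = DriverManager.getConnection("jdbc:mysql://localhost:3306/A2", "user", "pass");
        Connection remote = DriverManager.getConnection("jdbc:mysql://<gcp-vm-ip>:3306/A2", "user", "pass");
        run(local, remote, buildTransactionPlan());
    }
}
```

The nanosecond timestamps around the statement loop are one way to capture the execution time the worksheet asks for.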
Problem #2 Submission Requirements:
1. Upload your program code for Problem 2 (Task 4) to gitlab (https://git.cs.dal.ca).
2. Provide the summary needed for Task 1, and screenshots.
3. Provide the queries of all transactions you performed in Task 2, Task 3, and Task 4.
4. Provide the filled worksheet.
Assignment #3 CSCI 5408 (Data Management, Warehousing, Analytics)
Problem #1
1. Read the given paper and write a summary (maximum 2 pages, in your own words – do not copy
any content verbatim).
2. In addition, explore** and report whether the paper has any scope of improvement in terms of
technical details or concepts.
** Due to time constraints, and the limitation of formal research background, you do not need to explore
in detail the scientific equations or parameters presented in the paper.
Visit and log in to https://dal.ca.libguides.com/ieee and search for the following paper.
3. Do not forget to provide citation in proper format for the given paper and for any other
supporting materials that you may use as reference.
Database Administration and Security Reading
G. Vonitsanos, E. Dritsas, A. Kanavos, P. Mylonas and S. Sioutas, “Security and Privacy Solutions associated
with NoSQL Data Stores,” 2020 15th International Workshop on Semantic and Social Media Adaptation
and Personalization (SMAP), 2020, pp. 1-5, doi: 10.1109/SMAP49528.2020.9248442.
Paper Link: https://ieeexplore-ieee-org.ezproxy.library.dal.ca/document/9248442
Problem #1 Submission Format:
Submit one single PDF file
Problem #2 (MySQL and Java β Implementation of 2 phase locking)
1. Visit the website and download the dataset https://www.kaggle.com/johnharshith/hollywoodtheatrical-market-synopsis-1995-to-2021
2. Create a database in your local MySQL server, and import the CSV files into your tables. [Each CSV
file should represent a table in your database.]
I. You can perform cleaning and formatting of the CSVs the same as in Assignment 1 or
Assignment 2. (Using a few sentences and screenshots, exhibit what kind of cleaning or
formatting you performed.)
II. If the CSV files contain a huge volume of data, you can consider a max of 3000 data points
3. Write a Java program (without using any packages/libraries etc.) to implement concurrent
transactions with the 2-phase locking protocol – this program will perform 3 major operations
using 3 methods.
You can initiate two Java priority threads representing the two concurrent transactions
Table 2.1: Highlights the instructions and interleaving sequence
T1: Select "TICKET SOLD" from "AnnualTicketSales" where "YEAR"=2011;
T2: Update "AnnualTicketSales" set "TICKET SOLD" = 0 where "YEAR"=2011;
T2: Commit;
T1: Update "AnnualTicketSales" set "TICKET SOLD" = 0 where "YEAR"=2011;
T1: Select "TICKET SOLD" from "AnnualTicketSales" where "YEAR"=2011;
T1: Commit;
Method 1: acquire_locks() // In this method, each transaction gets the lock it requested. Depending on
arrival order, the transaction will get the lock.
Method 2: transaction() // This method will be called for each transaction that acquired the lock. In this
method, a transaction performs the execution of its SQL statement and will select or perform an update
in MySQL. Make sure to use autocommit = false in your code; this will be set to true only if commit is
reached.
Method 3: release_locks() // In this method, each transaction releases the locks it acquired before.
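A possible skeleton for the three methods is sketched below. To keep the sketch self-contained, the MySQL calls are replaced by an in-memory stand-in value so the growing and shrinking phases are visible; in the real program, transaction() would run the SELECT/UPDATE via JDBC with autocommit = false as required above. The method names follow the assignment; the lock-table design and resource naming are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class TwoPhaseLocking {

    // One lock per resource (here, the row AnnualTicketSales / YEAR = 2011).
    static final Map<String, ReentrantLock> lockTable = new LinkedHashMap<>();

    static synchronized ReentrantLock lockFor(String resource) {
        return lockTable.computeIfAbsent(resource, r -> new ReentrantLock());
    }

    // Growing phase: acquire every lock the transaction needs, in a fixed order
    // (a fixed acquisition order prevents deadlock between the two threads).
    static List<ReentrantLock> acquire_locks(List<String> resources) {
        List<ReentrantLock> held = new ArrayList<>();
        for (String r : resources) {
            ReentrantLock l = lockFor(r);
            l.lock();
            held.add(l);
        }
        return held;
    }

    // Shrinking phase: release everything; no lock may be acquired afterwards.
    static void release_locks(List<ReentrantLock> held) {
        for (ReentrantLock l : held) l.unlock();
    }

    static int ticketsSold = 100;   // in-memory stand-in for "TICKET SOLD", YEAR = 2011

    // In the real assignment this runs the SQL over JDBC with autocommit = false;
    // here it just reads and overwrites the stand-in value under the lock.
    static void transaction(int newValue) {
        List<ReentrantLock> held = acquire_locks(List.of("AnnualTicketSales:2011"));
        try {
            int read = ticketsSold;          // SELECT "TICKET SOLD" ...
            ticketsSold = newValue;          // UPDATE ... SET "TICKET SOLD" = newValue
        } finally {
            release_locks(held);             // commit point, then shrink
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> transaction(0));
        Thread t2 = new Thread(() -> transaction(42));
        t1.setPriority(Thread.MAX_PRIORITY);  // "priority threads" per the assignment
        t2.setPriority(Thread.MIN_PRIORITY);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("final TICKET SOLD = " + ticketsSold);
    }
}
```

Because every lock is taken before any work and released only after the commit point, the interleaving of Table 2.1 serializes cleanly instead of producing a lost update.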
Problem #2 Submission Format:
Submit one single PDF file; SQL Dump (Structure + Value); Java Code; screenshots
Problem #3
1. Read the given paper and write a summary within 1 page. (Do not copy any content verbatim).
2. Considering the concept discussed in the paper, you need to draw a high-level architecture
(block diagram) to present a data management system for Canada's winter storm data. (I am not
providing any dataset for winter storms; this is a scenario you need to assume. You can use
weather sites, news, articles, blogs etc. to gain some insight.) To draw the architecture, you can
use draw.io, PowerPoint, or similar drawing tools.
3. Write a ½ page description of your block diagram/architecture.
4. Do not forget to provide citation in proper format for the given paper and for any other
supporting materials that you may use as reference.
Data Management of Massive Data Reading
H. Wang, X. Tang, S. Shi and F. Ye, “Research on the Construction of Data Management System of Massive
Satellite Images,” 2018 International Workshop on Big Geospatial Data and Data Science (BGDDS), 2018, pp.
1-4, doi: 10.1109/BGDDS.2018.8626845.
Paper Link: https://ieeexplore-ieee-org.ezproxy.library.dal.ca/document/8626845
Problem #3 Submission Format:
Submit one single PDF file β must contain summary, block diagram/architecture, description
Assignment #4 CSCI 5408 (Data Management, Warehousing, Analytics)
Problem #1: This problem contains two tasks.
Task 1: Cluster Setup – Apache Spark Framework on GCP
Using your GCP cloud account, configure and initialize an Apache Spark cluster. This cluster will be
used for Problem #2.
(Follow the tutorials provided in Lab session).
Create a flowchart or write a ½ page explanation of how you completed the task; include this part in
your PDF file.
Task 2: Data Extraction and Preprocessing Engine: Sources – Twitter messages
Steps for Twitter Operation
Step 1: Create a Twitter developer account
Step 2: Explore documentation of the Twitter search and streaming APIs and required data format.
In your own words, write a ½ page summary about your findings.
Step 3: Write a well-formed program using Java to extract data (Extraction Engine) from Twitter.
Execute/Run the program on local machine. You can use Search API or Streaming API or both.
(Do not use any online program codes. You can only use API specification codes given within
official Twitter documentation)
o The search keywords are "mask", "cold", "immune", "vaccine", "flu", "snow".
Step 4: You need to include a flowchart/algorithm of your tweet extraction program in your
problem#1 PDF file.
Step 5: You need to extract the tweets and metadata related to the given keywords.
o For some keywords, you may get a smaller number of tweets, which is not a problem.
Collectively, you should get approximately 3000 to 5000 tweets.
Step 6: If you get less data, run your method/program using a scheduler module to extract more
data points from Twitter at different time intervals. Note: working on small datasets will not use
huge cloud resources or your local cluster memory.
Step 7: You should extract tweets and retweets along with the provided metadata, such as location,
time etc.
Step 8: The captured raw data should be kept (programmatically) in files. Each file should not
contain more than 100 tweets. These files will be needed for Problem #2
Step 9: Your program (Filtration Engine) should automatically clean and transform the data stored
in the files, and then upload each record to a new MongoDB database, myMongoTweet
o For cleaning and transformation: remove special characters, URLs, emoticons etc.
o Write your own regular expression logic. You cannot use libraries such as jsoup, JTidy
etc.
Step 10: You need to include a flowchart/algorithm of your tweet cleaning/transformation program
in the PDF file.
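Since Step 9 forbids libraries like jsoup, the cleaning can be done with plain java.util.regex. A minimal sketch follows; exactly which characters you keep (here letters, digits, hashtags, mentions, and basic punctuation) is a design choice you should adapt and document, not a prescription.

```java
public class TweetCleaner {

    // Strip URLs first, then everything outside an allowed character class.
    // Emoticons and other symbols fall outside the class and are removed.
    static String clean(String tweet) {
        String noUrls = tweet.replaceAll("https?://\\S+", " ");
        String noSpecials = noUrls.replaceAll("[^A-Za-z0-9#@'.,!? ]", " ");
        return noSpecials.replaceAll("\\s+", " ").trim();  // collapse leftover whitespace
    }
}
```

Running each raw file's tweets through `clean` before the MongoDB insert keeps the stored records free of URLs and emoticons while preserving hashtags and mentions.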
Problem #2: This problem contains two tasks.
Task 1: Data Processing using Spark – MapReduce to perform count
Step 1: Write a MapReduce program (WordCounter Engine) to count (frequency count) the
following substrings or words. Your MapReduce program should perform the frequency count on
the stored raw tweet files
o "flu", "snow", "cold"
o You need to include a flowchart/algorithm of your MapReduce program in the PDF file.
Step 2: In your PDF file, report the words that have highest and lowest frequencies.
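The counting logic itself is separable from the cluster. A plain-Java sketch of the two stages is shown below: `map` emits per-tweet keyword counts (counting substring occurrences, so "colder" counts toward "cold") and `reduce` sums them. This is an assumption about how you might structure the job; porting it to Spark means turning `map` into a transformation over the raw tweet files and `reduce` into a `reduceByKey`-style aggregation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCounter {

    static final List<String> KEYWORDS = List.of("flu", "snow", "cold");

    // Map stage: for one tweet, emit (keyword -> occurrence count) pairs.
    static Map<String, Integer> map(String tweet) {
        Map<String, Integer> emitted = new HashMap<>();
        String lower = tweet.toLowerCase();
        for (String k : KEYWORDS) {
            int count = 0;
            for (int i = lower.indexOf(k); i >= 0; i = lower.indexOf(k, i + 1)) count++;
            if (count > 0) emitted.put(k, count);
        }
        return emitted;
    }

    // Reduce stage: sum the per-tweet counts for each keyword.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapped) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> m : mapped)
            m.forEach((k, v) -> totals.merge(k, v, Integer::sum));
        return totals;
    }
}
```

The highest- and lowest-frequency words asked for in Step 2 fall directly out of the reduced totals.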
Task 2: Data Visualization using Graph Database – Neo4j for graph generation
Step 3: Explore Neo4j graph database, understand the concept, and learn cypher query language
Step 4: Using Cypher, create graph nodes with names: "flu", "snow", "cold"
You should add properties to the nodes. For adding properties, you should check the
relevant tweets collections.
o Check if there are any relationships between the nodes.
o If there are relationships between nodes, then find the direction
o Include your Cypher and generated graph in the PDF file.
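One possible shape for the Cypher in Step 4 is sketched below. The node label (`Keyword`), the property names, the relationship type (`CO_OCCURS`), and all the counts are illustrative placeholders; your actual properties and relationships must come from what you find in the tweet collections.

```cypher
// one node per keyword, with properties derived from the tweet collections
CREATE (:Keyword {name: 'flu',  tweetCount: 120});
CREATE (:Keyword {name: 'snow', tweetCount: 300});
CREATE (:Keyword {name: 'cold', tweetCount: 210});

// a directed relationship if two keywords co-occur in tweets
MATCH (a:Keyword {name: 'cold'}), (b:Keyword {name: 'flu'})
CREATE (a)-[:CO_OCCURS {tweets: 17}]->(b);

// inspect the resulting graph
MATCH p = (:Keyword)-[:CO_OCCURS]->(:Keyword) RETURN p;
```

The final `MATCH ... RETURN p` is one way to produce the generated graph that the PDF must include.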
Assignment 4 Submission Format:
1) Compress all your reports/files into a single .zip file and give it a meaningful name.
2) Submit your reports only in PDF format.
Please avoid submitting .doc/.docx and submit only the PDF version. You can merge all the reports into
a single PDF or keep them separate. You should also include output (if any) and test cases (if any) in the
PDF file.
3) Your Java code needs to be submitted on gitlab
Assignment #5 CSCI 5408 (Data Management, Warehousing, Analytics)
Problem #1
Business Intelligence Reporting using Cognos
1. Download the weather dataset available on https://www.kaggle.com/PROPPG-PPG/hourlyweather-surface-brazil-southeast-region?select=sudeste.csv
2. Explore the dataset and identify data field(s) that could be measured by certain factors or
dimensions. (Follow recorded lecture #18, and synchronous session #18)
Example: In a sales dataset, you may find a measurable field "total sales", which could be
analyzed by other factors such as "products", "time", "location" etc. These factors are known as
dimensions. Depending on the data, you may also find possibilities of slice and dice, i.e. analysis
may be possible at a more granular level: from total sales by city to total sales by store.
3. Write a ½ page explanation of how you selected the measurable field, i.e. the fact, and what the
possible dimensions are. Include this part in your PDF file.
4. Clean the dataset, if required perform formatting. You can perform the cleaning and formatting
using spreadsheet operation or programming script. If you use program add that in GitLab, if
you use other methods, write the steps in the PDF file.
5. Create a Cognos account and import your dataset. Identify the dimensions, and create/import the
dimension tables.
6. Based on your understanding of the domain (please read the information/metadata available
on the dataset source, i.e. Kaggle), create star schema or snowflake schema. Provide justification
of your model creation in the PDF file. At this stage build your schema (dimension modelling)
using a drawing tool like draw.io.
7. In addition to justification, attach screenshot of the model created using Cognos (star schema
or snowflake schema) in the PDF file.
8. Display a visual analysis of the data in a suitable format, e.g. a bar graph showing temperature
change in terms of a suitable dimension. Add a screenshot of the analysis to the PDF, or add a
screen recording of the analysis to your .zip folder.
Problem #2
Sentiment Analysis – Java program only, with no additional libraries. Use the parser you
developed in the previous assignment.
1. To perform this task, you need to consider the processed tweets (messages only; ignore other
fields) that you obtained and stored in MongoDB in your previous assignment. If you could not
perform/complete that task, then obtain the processed MongoDB tweets collection by contacting
your TA Manraj
2. Write a script to create a bag-of-words for each tweet message. (Code from online or other
sources is not accepted.)
e.g. tweet1 = "Canada is cold cold. I don't feel good; but is not bad"
bow1 = {"Canada":1, "is":2, "cold":2, "I":1, "don't":1, "feel":1, "good":1, "but":1, "not":1, "bad":1}
You do not need any libraries. Just implement a simple counter using loop.
Compare each bag-of-words with a list of positive and negative words. You can download lists of
positive and negative words from online source(s). You do not need any libraries; just perform a
word-by-word comparison against the lists. E.g. negative words can be found here:
https://gist.github.com/mkulakowski2/4289441
3. Tag each tweet as "positive", "negative", or "neutral" based on the overall score. You can add an
additional column to present your finding.
E.g. frequencies of the matches: "cold"=2, "not"=1, "bad"=1 (negative, -4); "good"=1 (positive, +1). Overall score = -3
Tweet # | Message | Match | Polarity
1 | Canada is cold cold. I don't feel good; but is not bad | cold, good, not, bad | negative
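The bag-of-words counter and polarity tagging described above can be sketched with a plain loop, as required. The tiny positive/negative lists below are placeholders standing in for the downloaded lexicons; the tokenization (splitting on whitespace and basic punctuation) is also an assumption you may refine.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SentimentTagger {

    // Simple counter over split tokens: no libraries needed.
    static Map<String, Integer> bagOfWords(String tweet) {
        Map<String, Integer> bow = new HashMap<>();
        for (String token : tweet.split("[\\s;.,]+")) {
            if (!token.isEmpty()) bow.merge(token, 1, Integer::sum);
        }
        return bow;
    }

    // Placeholder lexicons; the real lists come from a downloaded online source.
    static final List<String> NEGATIVE = List.of("cold", "not", "bad");
    static final List<String> POSITIVE = List.of("good", "warm", "nice");

    // Score = (+frequency per positive match) + (-frequency per negative match).
    static String polarity(Map<String, Integer> bow) {
        int score = 0;
        for (Map.Entry<String, Integer> e : bow.entrySet()) {
            if (POSITIVE.contains(e.getKey())) score += e.getValue();
            if (NEGATIVE.contains(e.getKey())) score -= e.getValue();
        }
        return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
    }
}
```

On the worked example above ("cold"=2, "not"=1, "bad"=1, "good"=1), this scoring yields -2 - 1 - 1 + 1 = -3, i.e. "negative", matching the table.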
Problem #3
Semantic Analysis – Java program only, with no additional libraries. Use the parser you
developed in the previous assignment.
1. For this task, consider the processed tweets collection that you created in Assignment 4.
2. Use the following steps to compute TF-IDF (term frequency-inverse document frequency)
a. Suppose, you have 50 tweets (messages only) that are stored in 50 JSON arrays. You
need to consider these data points as the total number of documents (N). In this case
N=50
Now, use the search queries "people", "condition", "weather" and check in how many
documents these words have appeared.
Table 3.1: tf-idf table
Total Documents: 50
Search Query | Documents containing term (df) | Total Documents (N) / documents containing term (df) | log10(N/df)
weather | 30 | 50/30 | 0.221848749
people | 5 | 50/5 | 1
condition | 10 | 50/10 | 0.698970004
b. Once you build the above table, you need to find which document has the highest
occurrence of the word "weather". You can find this by performing a frequency count
of the word per document.
Table 3.2: Term frequency table
Term: weather (appeared in 30 documents)
Document | Total Words (m) | Frequency (f)
tweet #1 | 6 | 2
tweet #2 | 10 | 1
: | : | :
tweet #30 | 8 | 1
c. You should print, programmatically, the tweet that has the highest relative
frequency. You can find this by computing (f/m).
*** Just to create the tabular structures (tables 3.1, 3.2), you can use a 3rd-party library
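The two tables map onto a few lines of Java: df and idf = log10(N/df) for Table 3.1, then relative frequency f/m for Table 3.2. Documents are plain strings in this sketch; in the assignment they come from the MongoDB tweet collection. The matching rules (substring match for df, exact word match for f) are assumptions to adapt.

```java
import java.util.List;

public class TfIdf {

    // df: in how many documents does the term appear (case-insensitive substring match)?
    static long documentFrequency(List<String> docs, String term) {
        return docs.stream().filter(d -> d.toLowerCase().contains(term)).count();
    }

    // Table 3.1 column: log10(N / df)
    static double idf(int totalDocs, long df) {
        return Math.log10((double) totalDocs / df);
    }

    // Table 3.2: relative frequency f/m for a single document,
    // where m is the total word count and f the exact-word frequency.
    static double relativeFrequency(String doc, String term) {
        String[] words = doc.toLowerCase().split("\\s+");
        int f = 0;
        for (String w : words) if (w.equals(term)) f++;
        return (double) f / words.length;
    }
}
```

Printing the tweet that maximizes `relativeFrequency(doc, "weather")` over the 30 matching documents completes step (c).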
Assignment 5 Submission Format:
1) Compress all your reports/files into a single .zip file and give it a meaningful name.
You are free to choose any meaningful file name, preferably – BannerId_Lastname_firstname_5408_A5
but avoid generic names like assignment-5.
2) Submit your reports only in PDF format.
Please avoid submitting .doc/.docx and submit only the PDF version. You can merge all the reports into
a single PDF or keep them separate. You should also include output (if any) and test cases (if any) in the
PDF file in the proper format (e.g. tables, charts etc. as required).
3) Your code needs to be submitted on https://git.cs.dal.ca/
CSCI 5408 Group Project – Building a simple and portable distributed database, DDBMS, and analytics systems
Project Objective:
By completing this project, a student will be able to describe concepts in modelling notation (e.g.,
Entity-Relation Diagrams or UML) and how they would be used. A student will be able to describe
the most common designs for core database system components including the query optimizer,
query executor, storage manager, access methods, and transaction processor. By completing this
task, a student will be able to explain the techniques used for data replication, allocation during the
database design process, and data analytics
In this project, each group is required to create a light-weight distributed database management
system (e.g. metadata management, data structure design, data storing, retrieval, building database
and logs, analysis, and security), with custom database structure for in memory operations using
various data structures such as Arrays, Tree, Linked Lists etc., and custom file design for persistent
storage.
Plagiarism Policy:
β’ This project is a group task. Collaboration of any type (outside the assigned group) amounts to
a violation of the academic integrity policy and will be reported to the AIO.
β’ Content cannot be copied verbatim from any source(s). Please understand the concept and
write in your own words. In addition, cite the actual source. Failing to do so will be considered
as plagiarism and/or cheating.
β’ The Dalhousie Academic Integrity policy applies to all material submitted as part of this course.
Please understand the policy, which is available at:
https://www.dal.ca/dept/university_secretariat/academic-integrity.html
Project Rubric:
Based on the discussion board rubric (McKinney, 2018)
Completeness including Citation
- Excellent (20%): All required tasks are completed.
- Proficient (15%): Submission highlights task completion; however, some tasks in between were missed, which created a disconnection.
- Marginal (10%): Some tasks are completed, which are disjoint in nature.
- Unacceptable (0%): Incorrect and irrelevant.

Correctness
- Excellent (20%): All parts of the given tasks are correct.
- Proficient (15%): Most of the given tasks are correct; however, some portions need minor modifications.
- Marginal (10%): Most of the given tasks are incorrect; the submission requires major modifications.
- Unacceptable (0%): Incorrect and unacceptable.

Novelty
- Excellent (20%): The submission contains novel contributions in key segments, which is a clear indication of application knowledge.
- Proficient (15%): The submission lacks novel contributions; there is some evidence of novelty, but it is not significant.
- Marginal (10%): The submission does not contain novel contributions; however, there is evidence of some effort.
- Unacceptable (0%): There is no novelty.

Clarity
- Excellent (20%): The written or graphical materials and developed applications provide a clear picture of the concept and highlight the clarity.
- Proficient (15%): The written or graphical materials and developed applications do not show a clear picture of the concept; there is room for improvement.
- Marginal (10%): The written or graphical materials and developed applications fail to prove the clarity; background knowledge is needed.
- Unacceptable (0%): Failed to prove the clarity; proper background knowledge is needed to perform the tasks.

Group Work
- Excellent (20%): Evidence of group work, meeting logs, coordination.
- Proficient (15%): Evidence of group work; however, missed meeting logs; room for improvement in coordination.
- Marginal (10%): Missed meeting logs; failed to display group coordination.
- Unacceptable (0%): No group work done; project is unacceptable.
Project Requirements
In this project, you need to build a simple distributed database management system (D2_DB) that
will operate in 2 Linux virtual instances in the Google Cloud Platform. Your team should explore and
implement data structure concepts for creating the databases, their management system, and security.
In addition, your team should build an analytics engine to perform basic data analytics. Your database
should handle requests from 2 users (one user for each VM instance). The database management
system layer should provide a command-line interface (a graphical user interface is not required), and
perform the various functionalities of a database management system listed below.
Functional Requirements:
Module 1: DB Design
1. You need to identify one or two linear data structure(s) which can be used for query and
data processing.
2. Once the data is processed, it can be stored in a customized file format, which will be
considered your persistent storage.
The customized file format cannot be any of these formats: JSON/XML/CSV/serialized or
binary
3. You need to maintain a data dictionary or metadata file for each local DBMS, and a global
metadata file; you can use the same customized file format to maintain the data dictionary.
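One way to satisfy the "not JSON/XML/CSV/serialized/binary" constraint is a line-per-row text layout with an uncommon delimiter. The delimiter `~|~` and the row encoding below are pure invention for illustration; a real implementation would also escape the delimiter and define a header convention for the data dictionary.

```java
import java.util.ArrayList;
import java.util.List;

public class CustomFormat {

    // Row delimiter chosen to be unlikely to occur in real data, which is what
    // distinguishes this layout from CSV. A real implementation would escape it.
    static final String DELIM = "~|~";

    // Encode one row, e.g. a "student" table row (id, name) -> "1~|~Alice"
    static String encodeRow(List<String> values) {
        return String.join(DELIM, values);
    }

    // Decode a stored line back into its field values (trailing empties kept).
    static List<String> decodeRow(String line) {
        List<String> out = new ArrayList<>();
        for (String v : line.split("~\\|~", -1)) out.add(v);
        return out;
    }
}
```

The same encode/decode pair can serve both the data files and the local/global metadata files, keeping Module 1 to a single storage convention.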
CSCI 5408 (Winter 2022) saurabh.dey@dal.ca
Module 2: Query Implementation
1. Your application should validate and execute the following operations
(standard SQL commands and queries only):
a. Create database
b. Use database
c. Create table
d. Insert into table
e. Select from table with a single WHERE condition (AND || OR || NOT are not required)
f. Update one column with a single WHERE condition (AND not required)
g. Delete a row with a single WHERE condition
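Validation of the statement shapes above could start from regular expressions. The patterns below are deliberately simplified assumptions (plain identifiers, numeric or single-quoted values), not a complete SQL grammar, and only three of the required commands are shown:

```java
import java.util.regex.Pattern;

// Sketch only: regex-based validation of a few of the required statements.
public class QueryValidator {
    static final Pattern CREATE_DB = Pattern.compile("(?i)^CREATE DATABASE \\w+;$");
    static final Pattern USE_DB    = Pattern.compile("(?i)^USE \\w+;$");
    static final Pattern SELECT    = Pattern.compile(
        "(?i)^SELECT (\\*|[\\w, ]+) FROM \\w+( WHERE \\w+ ?= ?('[^']*'|\\d+))?;$");

    static boolean isValid(String query) {
        String q = query.trim();
        return CREATE_DB.matcher(q).matches()
            || USE_DB.matcher(q).matches()
            || SELECT.matcher(q).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("USE DB1;"));                              // true
        System.out.println(isValid("SELECT * FROM Employee WHERE id = 5;"));  // true
        System.out.println(isValid("DROP TABLE Employee;"));                  // false: unsupported
    }
}
```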
Module 3: Transaction Processing
1. Your system should distinguish between a transaction and a normal query.
2. In the case of a transaction, you cannot write the data immediately to persistent storage;
you need to perform the operation in the linear data structure.
3. Transactions follow ACID properties, so you need to consider that. [Hint: it can be achieved
by scanning logs]
{You do not need to work on concurrency control. A single distributed transaction is sufficient}
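One way to separate transactions from normal queries is to stage statements between BEGIN TRANSACTION and COMMIT in a LinkedList and apply them only on COMMIT, discarding them on ROLLBACK. The keywords and the in-memory "committed" stand-in for persistent storage below are assumptions for illustration:

```java
import java.util.*;

// Sketch only: transaction statements are buffered in a linear structure
// and applied on COMMIT; normal queries are applied immediately.
public class TxnManager {
    private boolean inTxn = false;
    final LinkedList<String> buffer = new LinkedList<>();   // staged writes
    final List<String> committed = new ArrayList<>();       // stand-in for data files

    void accept(String query) {
        String s = query.trim().toUpperCase(Locale.ROOT);
        if (s.startsWith("BEGIN"))    { inTxn = true; return; }
        if (s.startsWith("COMMIT"))   { committed.addAll(buffer); buffer.clear(); inTxn = false; return; }
        if (s.startsWith("ROLLBACK")) { buffer.clear(); inTxn = false; return; }
        if (inTxn) buffer.add(query);   // transaction: keep in memory only
        else committed.add(query);      // normal query: apply immediately
    }

    public static void main(String[] args) {
        TxnManager m = new TxnManager();
        m.accept("BEGIN TRANSACTION");
        m.accept("INSERT INTO Employee VALUES (1, 'Alex');");
        m.accept("COMMIT");
        System.out.println(m.committed.size()); // 1
    }
}
```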
Module 4: Log Management
You need to create 3 logs (JSON format is allowed for logs):
a. General logs: query execution time, state of the database (e.g., how many tables
there are, with the number of records at a given time)
b. Event logs: changes in the database, concurrent transactions, crash reports, etc.
c. Query logs: capture user queries and the timestamp of query submission
{Do not use any built-in package or libraries; you need to perform normal file read/write operations
to capture the query, timestamp, login details, data changes, etc.}
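Since no logging or JSON library may be used, a log entry can be assembled as a JSON string by hand and appended with plain file I/O. The file name and field names below are assumptions:

```java
import java.io.FileWriter;
import java.io.IOException;

// Sketch only: a query-log line built manually (no JSON library) and
// appended with normal file write operations, as the spec requires.
public class QueryLogger {
    static String toJson(String user, String query, long timestamp) {
        return "{\"user\":\"" + user + "\",\"query\":\"" + query.replace("\"", "\\\"")
             + "\",\"timestamp\":" + timestamp + "}";
    }

    static void append(String path, String line) throws IOException {
        try (FileWriter w = new FileWriter(path, true)) {   // append mode
            w.write(line + System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        append("query_log.json", toJson("SDEY", "SELECT * FROM Employee;", System.currentTimeMillis()));
    }
}
```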
Module 5: Data Modelling - Reverse Engineering
1. Your application should provide an option for generating an ERD based on the current
database state.
2. The user will provide the database name, and the application will create the ERD based on
the metadata and data files.
{You do not have to create any graphical ERD, a text file containing tables, columns, relationships,
cardinality will be sufficient}
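Since a text file is sufficient, the "ERD" can be rendered directly from the metadata. The metadata shape below (table name to column list, plus relationship description strings) is an assumption about how the data dictionary might be held in memory:

```java
import java.util.*;

// Sketch only: a text-file ERD (tables, columns, relationships) rendered
// from in-memory metadata; no graphics required.
public class TextErd {
    static String render(String db, Map<String, List<String>> tables, List<String> relationships) {
        StringBuilder sb = new StringBuilder("Database: ").append(db).append("\n");
        for (Map.Entry<String, List<String>> t : tables.entrySet()) {
            sb.append("Table ").append(t.getKey()).append(" [")
              .append(String.join(", ", t.getValue())).append("]\n");
        }
        for (String r : relationships) sb.append("Relationship: ").append(r).append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> tables = new LinkedHashMap<>();
        tables.put("Employee", Arrays.asList("id", "name", "dept_id"));
        tables.put("Department", Arrays.asList("id", "name"));
        System.out.print(render("DB1", tables,
            Arrays.asList("Employee.dept_id -> Department.id (N:1)")));
    }
}
```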
Module 6: Export Structure & Value
1. Your application should provide an option for exporting the structure and values (in standard
SQL format) of each database selected by the end user.
2. This is similar to the SQL dump (structure + value) created by the Export option of any
standard DBMS.
3. You cannot use any external packages, and, as mentioned, your export format must be SQL.
4. Your data dump must capture the current state of the database (e.g., if any update is
performed on the data, it must be reflected in the data dump). Do not write create or insert
statements from console input to a file as the SQL export; you must generate it dynamically.
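Generating the dump dynamically means rebuilding each INSERT from the rows currently in storage, not replaying captured console input. A minimal sketch (all values quoted as strings for simplicity; names are illustrative):

```java
import java.util.*;

// Sketch only: an INSERT statement regenerated from the current in-memory
// row, so the dump reflects live database state.
public class SqlExporter {
    static String insertFor(String table, List<String> columns, List<String> row) {
        StringBuilder sb = new StringBuilder("INSERT INTO ").append(table)
            .append(" (").append(String.join(", ", columns)).append(") VALUES (");
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append("'").append(row.get(i).replace("'", "''")).append("'"); // escape quotes
        }
        return sb.append(");").toString();
    }

    public static void main(String[] args) {
        System.out.println(insertFor("Employee",
            Arrays.asList("id", "name"), Arrays.asList("1", "Alex")));
    }
}
```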
Module 7: Analytics
1. The D2_DB application should provide some basic analytics. The results must be written to
files and displayed on screen.
a. How many queries (valid + invalid) are submitted per database (e.g., run in the VM1
instance)
E.g.: >> count queries;
"user SDEY submitted 10 queries for DB1 running on Virtual Machine 1"
"user Alex submitted 3 queries for DB2 running on Virtual Machine 2"
b. How many update operations are successfully executed per table
E.g.: >> count update DB1;
"Total 9 update operations are performed on Employee"
"Total 3 update operations are performed on Department"
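The per-user query count can be computed by scanning the query log. The "user|vm|query" line layout below is an assumption for illustration; the real count would be driven by whatever format your query log actually uses:

```java
import java.util.*;

// Sketch only: counting submitted queries per user from query-log lines.
public class Analytics {
    static Map<String, Integer> countByUser(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            counts.merge(line.split("\\|", -1)[0], 1, Integer::sum); // first field = user
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "SDEY|VM1|USE DB1;", "SDEY|VM1|SELECT * FROM Employee;", "Alex|VM2|USE DB2;");
        countByUser(log).forEach((user, n) ->
            System.out.println("user " + user + " submitted " + n + " queries"));
    }
}
```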
Module 8: User Interface & Login Security
1. Your user interface should be a basic console design.
2. It must provide access to valid users only.
3. The user interface can be menu driven, providing options for registration or login.
4. For registration, it should accept the userID, password, and security questions/answers, and
store all the information in the User_Profile text file.
a. Use some hashing such as MD5, SHA-1, etc., and store the hashed userID and hashed
password in the file. (You can use Java libraries for hashing.)
b. You do not have to encrypt the security question/answer.
5. If a registered user wants to access the system, they need to provide a valid userID
and password, which will be hashed and checked against the entry available in the User_Profile
text file. The security answer will also be asked at login. Since you are building a distributed
database management system, the User_Profile text file needs to be present in both virtual
machine instances.
6. Once users gain access, they get 5 options.
E.g.
1. Write Queries
2. Export
3. Data Model
4. Analytics
7. The "Write Queries" option should work for both normal queries and transactions.
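The credential hashing in point 4 can use java.security.MessageDigest, which the spec explicitly allows. MD5 is shown because the spec names it; outside this assignment, a dedicated password-hashing scheme would be the safer choice:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: MD5 hashing of a credential; store md5Hex(userID) and
// md5Hex(password) in User_Profile, then compare hashes at login.
public class Hasher {
    static String md5Hex(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b)); // hex-encode
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("SDEY"));
    }
}
```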
Deliverables:
Phase 1 (Feasibility Study and Design): (2%)
• Each group will work on the problem and identify a solution.
• Each group will perform a feasibility study and submit a short recorded group presentation
along with their meeting logs. (Only 1 member, as the group representative, may present for 10 min.)
• Each group will submit their tentative design and implementation plan (maximum 5 pages;
include citations if you use any source for background study, coding skills, or concepts).
Phase 2 (Go-Live and Project Closure): (20%)
• Each group will complete the task, incorporating the changes suggested in Phase 1.
• Each group will complete the project implementation and record the details in the form of a
final report.
• Each group will submit a report (40 to 65 pages) and a recorded group presentation
explaining the major code blocks, with a demonstration of the product. (Every member needs to
present.)
o The final report should contain architecture details, design details, implementation
details of each module, use cases, test cases, meeting logs, etc.
• There will be a synchronous Q&A session with each group at the end of the course.
Visual Representation for your understanding
Local Machine
- ssh to either VM and execute your Java code, which will perform the distributed DB operation

Cloud Machine 1 (GCP VM 1 running Linux)
- D2_DB Package: Java Code + Data Files_VM1 + Folders_VM1 + Meta Data_Local_VM1 + Meta Data_Global

Cloud Machine 2 (GCP VM 2 running Linux)
- D2_DB Package: Java Code + Data Files_VM2 + Folders_VM2 + Meta Data_Local_VM2 + Meta Data_Global