CSC 555 / DSC 333 Assignment 1


Part 1

  1. Compute the following (you can use any tool or software to compute the answers in this part, but if you do not know how to perform these computations, please talk to me about your course prerequisites; a minimal Python sketch for checking your work follows this list):

2^11

(2^4)^4

4^4

8^5

837 MOD 100 (MOD is the modulo operator, a.k.a. the remainder)

842 MOD 20

23 MOD 112

112 MOD 23
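
For illustration, the computations above could be checked in Python (any other tool is equally fine); ** is exponentiation and % is the modulo operator:

    # Sanity checks for the arithmetic above.
    print(2 ** 11)
    print((2 ** 4) ** 4)
    print(4 ** 4)
    print(8 ** 5)
    print(837 % 100)   # % returns the remainder
    print(842 % 20)
    print(23 % 112)
    print(112 % 23)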

 

  2. Given vectors V1 = (1, 1, 3) and V2 = (1, 2, 2) and a 3×3 matrix M = [(2, 1, 3), (1, 2, 1), (1, 0, 1)], compute the following (a NumPy sketch for checking your work follows this list):

V2 + V1

V1 – V1

|V1| (Euclidean vector length, not the number of dimensions)

|V2|

M * V2 (matrix times vector, transpose it as necessary)

M * M (or M^2)

M^4
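
One way to check these is with NumPy (not required for the assignment; hand computation or any other tool is fine). A minimal sketch, assuming NumPy is installed:

    import numpy as np

    # Vectors and matrix from the problem statement.
    V1 = np.array([1, 1, 3])
    V2 = np.array([1, 2, 2])
    M = np.array([[2, 1, 3],
                  [1, 2, 1],
                  [1, 0, 1]])

    print(V2 + V1)                        # vector addition
    print(V1 - V1)                        # vector subtraction
    print(np.linalg.norm(V1))             # Euclidean length |V1|
    print(np.linalg.norm(V2))             # Euclidean length |V2|
    print(M @ V2)                         # matrix times vector
    print(M @ M)                          # M^2
    print(np.linalg.matrix_power(M, 4))   # M^4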

 

  3. Suppose we are flipping a coin with Head (H) and Tail (T) sides. The coin is not balanced: the probability of H coming up is 0.4 (and of T, 0.6). Compute the probabilities of getting the following (a short sketch of the reasoning follows this list):

HTHH

THTT

Exactly 1 Head out of a sequence of 3 coin flips.

Exactly 2 Tails out of a sequence of 3 coin flips.
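
As a reminder of the reasoning: the flips are independent, so the probability of a specific sequence is the product of the per-flip probabilities, and "exactly k Heads" sums that product over all sequences with k Heads. A minimal Python sketch (the helper names are just for illustration):

    from itertools import product

    P = {'H': 0.4, 'T': 0.6}   # per-flip probabilities from the problem

    def seq_prob(seq):
        # Probability of one specific sequence of independent flips.
        p = 1.0
        for flip in seq:
            p *= P[flip]
        return p

    def exactly_k_heads(k, n=3):
        # Sum the probabilities of all length-n sequences with exactly k Heads.
        return sum(seq_prob(s) for s in product('HT', repeat=n)
                   if s.count('H') == k)

    print(seq_prob('HTHH'), exactly_k_heads(1))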

 

  4. Consider a database schema consisting of three tables: Employee (ID, Name, Address), Project (PID, Name, Deadline), and Assign (EID, PID, Date). Assign.EID is a foreign key referencing Employee.ID and Assign.PID is a foreign key referencing Project.PID.

Write SQL queries for:

  a. Find projects that are not assigned to any employees (return the project's Name and Deadline).

 

  b. For each date, find how many assignments were made that day.

 

  c. Find all projects that have fewer than 2 employees assigned to them (note that projects with 0 or 1 assigned employees should both be included for the answer to be correct).

 

  5. Mining of Massive Datasets, Exercise 1.3.3.

Justify your answer (an example alone is worth only partial credit).

 

  6. Hadoop Distributed File System (HDFS).

 

  a. What are the guarantees offered by a replication factor of 3 (3 copies of each block)?

 

  b. What action does the NameNode have to take when a machine in the Hadoop cluster fails/crashes?

 

  c. What is the overall storage cost for a file of size 950 MB when the HDFS replication factor is set to 3?

Part 2

Please be sure to submit all Python code with your answers (you can either include it in the same document or submit it as a separate .py file).

 

  a. Write Python code that reads a text file and computes a total word count using a dictionary (e.g., {'Hadoop': 3, 'cloud.': 2, 'MapReduce': 4}). For our purposes, a word is anything split by a space (.split(' ')), even if it includes things like punctuation.

Test the code on HadoopBlurb.txt (attached to the homework, taken from the Apache Hadoop Wikipedia entry).

How many keys does your dictionary have?
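
A minimal sketch of one possible approach (the split follows the assignment's definition of a word; the file name refers to the attached HadoopBlurb.txt):

    # Count words in the file; a "word" is anything produced by splitting
    # the raw text on single spaces, punctuation included.
    with open('HadoopBlurb.txt', 'r') as f:
        text = f.read()

    counts = {}
    for word in text.split(' '):
        counts[word] = counts.get(word, 0) + 1

    print(len(counts), 'keys')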

 

  b. Write Python code that creates two different word count dictionaries instead, assigning the words at random. Each time you process a word, choose at random which count dictionary to add it to (this means some words will appear in both dictionaries simultaneously).

How many keys does each dictionary have?
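
A minimal sketch, assuming the same HadoopBlurb.txt input as in Part 2-a:

    import random

    with open('HadoopBlurb.txt', 'r') as f:
        text = f.read()

    d1, d2 = {}, {}
    for word in text.split(' '):
        # Each occurrence is sent to one of the two dictionaries at random,
        # so a word's total count may end up split across both.
        target = random.choice((d1, d2))
        target[word] = target.get(word, 0) + 1

    print(len(d1), 'keys in d1,', len(d2), 'keys in d2')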

 

  c. Write Python code to merge the two dictionaries into one (adding the counts) and verify that the result matches the dictionary from Part 2-a.
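
Continuing the sketches above (d1 and d2 from the Part 2-b sketch, counts from the Part 2-a sketch), the merge might look like this:

    def merge_counts(a, b):
        # Combine two word-count dictionaries by adding counts for shared keys.
        merged = dict(a)
        for word, n in b.items():
            merged[word] = merged.get(word, 0) + n
        return merged

    print(merge_counts(d1, d2) == counts)   # should print True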

 

  d. Write Python code that randomly but deterministically assigns each word to one of the two dictionaries instead. For example, you can make that assignment using the remainder (YourNumber % 2 will always return 0 or 1, depending on the number). You can convert a word string into a numeric value using hash (e.g., hash('Hadoop.')). We will talk about hashing in more detail later in the quarter.

How many keys does each dictionary have?
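
A minimal sketch of the hash-based split (note that in Python 3 string hashes are randomized per run unless PYTHONHASHSEED is fixed, so the assignment is deterministic within a single run):

    with open('HadoopBlurb.txt', 'r') as f:
        text = f.read()

    dicts = ({}, {})
    for word in text.split(' '):
        # hash() maps the word to an integer; % 2 picks dictionary 0 or 1,
        # so every occurrence of a given word lands in the same dictionary.
        target = dicts[hash(word) % 2]
        target[word] = target.get(word, 0) + 1

    print(len(dicts[0]), 'keys,', len(dicts[1]), 'keys')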

 

Part 3

 

Write (and test) Python code that measures the speed of reading from the web (using urllib or similar), reading from a file, and writing to a file on your computer.

That means your code will read or write some amount of data, time the operation, and compute the read or write rate (in MBytes/sec). The files have to be sufficiently large that each of the measured operations executes for at least 4 seconds (we will see why in Part 3-a). A minimal sketch of the timing pattern is given below.
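
A minimal sketch of the timing pattern for the read case (the file name is just a placeholder; the same pattern applies to the other measurements):

    import time

    def read_rate_mb_per_sec(path):
        # Read the whole file once, time it, and return the rate in MBytes/sec.
        start = time.time()
        with open(path, 'rb') as f:
            data = f.read()
        elapsed = time.time() - start
        return (len(data) / (1024 * 1024)) / elapsed

    # Example call (placeholder file name):
    # print(read_rate_mb_per_sec('some_large_file.txt'))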

 

  a. Compute the speed of reading from disk. We will do that in two different ways:

 

  • Use the HadoopBlurb file as the file you read; time the read and compute the MB/sec speed (this one will take less than 4 seconds).

 

  • Use a large file (at least 4 seconds of reading from disk) and compute the MB/sec speed.

 

How do they compare? Which one do you think is more accurate?

 

  b. Compute the speed of reading from the web (you can use http://dbgroup.cdm.depaul.edu/DSC450/OneDayOfTweets.txt if you need a large file, but remember that you don’t need to read the whole thing).
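
A minimal sketch using urllib from the standard library; the target size of roughly 200 MB is just an assumption to make the read last several seconds and can be adjusted:

    import time
    import urllib.request

    URL = 'http://dbgroup.cdm.depaul.edu/DSC450/OneDayOfTweets.txt'
    CHUNK = 1024 * 1024            # read 1 MB at a time
    TARGET_BYTES = 200 * CHUNK     # stop after ~200 MB (adjust as needed)

    start = time.time()
    total = 0
    with urllib.request.urlopen(URL) as resp:
        while total < TARGET_BYTES:
            data = resp.read(CHUNK)
            if not data:           # stop early if the server closes the stream
                break
            total += len(data)
    elapsed = time.time() - start

    print((total / (1024 * 1024)) / elapsed, 'MB/sec')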

 

  c. Compute the speed of writing to disk.
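
A minimal sketch for the write measurement; the output file name and total size are placeholders, and the block count should be adjusted so the write takes at least 4 seconds:

    import time

    block = b'x' * (1024 * 1024)   # 1 MB of dummy data
    n_blocks = 2000                # ~2 GB total; adjust for a run of 4+ seconds

    start = time.time()
    with open('write_test.bin', 'wb') as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()                  # note: the OS may still buffer the writes
    elapsed = time.time() - start

    print(n_blocks / elapsed, 'MB/sec')   # each block is 1 MB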

 

  d. Finally, add a print statement to your code from Part 3-a (i.e., print everything you read from the file) and measure the new throughput in MBytes/sec.

 

 

Submit a single document containing your written answers.  Be sure that this document contains your name and “CSC 555 Assignment 1” at the top.