CSC 555 / DSC 333 Mining Big Data Assignment 3


1) Describe how to implement the following queries in MapReduce (an illustrative sketch of the reduce-side join pattern follows the three queries):

  a. SELECT a.First, a.Last, e.EID, a.AID, e.Age

FROM Employee as e, Agent as a

WHERE e.Last = a.Last AND e.First = a.First;

 

  b. SELECT lo_quantity, SUM(lo_extendedprice)

FROM lineorder, dwdate

WHERE lo_orderdate = d_datekey

AND d_yearmonth = 'Feb1996'

AND lo_discount = 5

GROUP BY lo_quantity;

 

  c. SELECT d_month, COUNT(d_year)

FROM dwdate

GROUP BY d_month

ORDER BY COUNT(d_year);
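
For reference, a common way to express the join query (a) above in MapReduce is a reduce-side join: mappers for both tables emit (join key, tagged record) pairs, and the reducer pairs up records that share the key. Queries (b) and (c) instead follow the usual pattern of filtering in the mapper and aggregating in the reducer. The sketch below is illustrative only (Hadoop Streaming style, Python); the column layouts are assumptions, and your written answer should describe the approach rather than copy code.

#!/usr/bin/env python
# mapper.py -- illustrative reduce-side join for query (a), Hadoop Streaming style.
# Assumes '|'-delimited rows and hypothetical column layouts:
#   Employee: EID|First|Last|Age|...     Agent: AID|First|Last|...
# Hadoop Streaming typically exposes the current input file name in the
# mapreduce_map_input_file environment variable.
import os
import sys

source = 'E' if 'Employee' in os.environ.get('mapreduce_map_input_file', '') else 'A'

for line in sys.stdin:
    f = line.rstrip('\n').split('|')
    if source == 'E':
        eid, first, last, age = f[0], f[1], f[2], f[3]
        print('\t'.join([first + '|' + last, 'E', eid, age]))
    else:
        aid, first, last = f[0], f[1], f[2]
        print('\t'.join([first + '|' + last, 'A', aid]))

#!/usr/bin/env python
# reducer.py -- receives lines sorted by the (First, Last) key and pairs every
# Employee record with every Agent record that shares that key.
import sys
from itertools import groupby

def rows(stream):
    for line in stream:
        parts = line.rstrip('\n').split('\t')
        yield parts[0], parts[1:]

for key, group in groupby(rows(sys.stdin), key=lambda kv: kv[0]):
    employees, agents = [], []
    for _, val in group:
        if val[0] == 'E':
            employees.append((val[1], val[2]))   # (EID, Age)
        else:
            agents.append(val[1])                # AID
    first, last = key.split('|')
    for eid, age in employees:
        for aid in agents:
            print('\t'.join([first, last, eid, aid, age]))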

 

2) Consider a Hadoop job that processes an input data file of size equal to 79 disk blocks (79 different blocks, not considering the HDFS replication factor). The mapper in this job requires 1 minute to read and fully process a single block of data. The reducer requires 1 second (not a minute) to produce an answer for one key's worth of values, and there are a total of 5000 distinct keys (mappers generate many more key-value pairs, but keys only occur in the 1-5000 range, for a total of 5000 unique entries). Assume that each node has a reducer and that the keys are distributed evenly.

The total cost will consist of time to perform the Map phase plus the cost to perform the Reduce phase.
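
As an illustration of this cost model only (not a substitute for showing your own work): the Map phase runs in waves of parallel map tasks, so map time is ceil(79 blocks / number of nodes) waves at 1 minute each, and reduce time is ceil(5000 keys / number of nodes) seconds, since the busiest reducer finishes last. The sketch below applies that reasoning to a hypothetical 10-node cluster, which is not one of the cases asked about:

import math

BLOCKS, MAP_SEC_PER_BLOCK = 79, 60      # each mapper needs 1 minute (60 s) per block
KEYS, REDUCE_SEC_PER_KEY = 5000, 1      # each reducer needs 1 s per key

def job_seconds(nodes):
    # one mapper and one reducer per node; blocks and keys spread evenly
    map_waves = math.ceil(BLOCKS / nodes)        # full waves of parallel map tasks
    keys_per_reducer = math.ceil(KEYS / nodes)   # the busiest reducer determines reduce time
    return map_waves * MAP_SEC_PER_BLOCK + keys_per_reducer * REDUCE_SEC_PER_KEY

print(job_seconds(10))   # hypothetical 10 nodes: 8 * 60 + 500 = 980 seconds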

 

  a. How long will it take to complete the job if you have only one Hadoop worker node? For simplicity, assume that only one mapper and only one reducer are created on every node.

 

  b. 30 Hadoop worker nodes?

 

  c. 50 Hadoop worker nodes?

 

  d. 100 Hadoop worker nodes?

 

  e. Would changing the replication factor have any effect on your answers for a-d?

 

You can ignore the network transfer costs as well as the possibility of node failure.

 

3) Suppose you have an 8-node cluster with a replication factor of 3. Describe what MapReduce has to do after it determines that a node has crashed while a job is being processed. For simplicity, assume that the failed node is not replaced and your cluster is reduced to 7 nodes. Specifically:

 

  a. What does HDFS (the storage layer) have to do in response to node failure in this case? I.e., what is the guarantee that HDFS has to maintain?

 

  b. What does the MapReduce engine (the execution layer) have to do to respond to the node failure? Assume that there was a job in progress at the time of the crash (the MapReduce engine only needs to take action if a job was in progress).

 

  c. Where does the Mapper store output key-value pairs before they are sent to Reducers?

 

  d. Can Reducers begin processing before the Mapper phase is complete? Why or why not?

 

4) Repeat the RSA computation examples (a small worked sketch follows parts a-d):
  a) Select two (small) primes and generate a public-private key pair.

 

  b) Compute a sample ciphertext using your public key.

 

  c) Decrypt your ciphertext from 4-b using the private key.

 

  d) Why can't the encrypted message sent through this mechanism be larger than the value of n?
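
As referenced above, here is a small worked sketch of the textbook RSA steps in Python, using arbitrarily chosen toy primes (p = 5, q = 11) that you should replace with your own selection; it only illustrates the arithmetic and does not answer part d for you:

# Textbook RSA with toy numbers -- illustrative only, not secure.
p, q = 5, 11                 # two small primes (choose your own)
n = p * q                    # 55
phi = (p - 1) * (q - 1)      # 40

e = 3                        # public exponent, coprime with phi; public key is (e, n)
d = pow(e, -1, phi)          # private exponent (Python 3.8+): 27, since 3 * 27 = 81 = 1 mod 40

m = 8                        # sample plaintext message, must be smaller than n
c = pow(m, e, n)             # encrypt with the public key: 8^3 mod 55 = 17
print(c, pow(c, d, n))       # decrypt with the private key: 17^27 mod 55 = 8, recovering m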

 

 

NOTE: By default, Hive assumes '\t'-separated tables. You will need to modify the CREATE TABLE statement provided above to account for the '|' delimiter in the data.

 

5) Use a Hive user-defined function to perform the following transformation on the Part table (creating a new PartSwapped table with the same number of columns): in the 7th column/p_type, swap the first and last word in the column and replace the space by a comma. For example, STANDARD BRUSHED TIN would become TIN, BRUSHED STANDARD. For the rest of the columns, where applicable, replace space (' ') and # characters by a tilde (~), so that MFGR#4 becomes MFGR~4 and MED BAG becomes MED~BAG.

 

Keep in mind that your transform Python code (split/join) should always use a tab ('\t') between fields even if the source data is |-separated. You can also take a look at the transform example included with this assignment for your reference (Examples_Assignment3.doc), which deliberately uses a different delimiter ('?').
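
As an illustration only, a minimal sketch of what such a transform script might look like is shown below; the 0-based position of p_type and the exact handling of the swapped words are assumptions based on the example above, so adapt them to the actual Part schema. Hive's TRANSFORM streams each row to the script tab-separated on stdin and reads tab-separated rows back from stdout:

#!/usr/bin/env python
# part_swap.py -- illustrative transform script for the PartSwapped step.
import sys

P_TYPE = 6   # assumed 0-based index of the 7th column (p_type)

for line in sys.stdin:
    cols = line.rstrip('\n').split('\t')
    out = []
    for i, col in enumerate(cols):
        if i == P_TYPE and ' ' in col:
            words = col.split(' ')
            words[0], words[-1] = words[-1], words[0]          # swap first and last word
            # per the example above: STANDARD BRUSHED TIN -> TIN, BRUSHED STANDARD
            out.append(words[0] + ', ' + ' '.join(words[1:]))
        else:
            # other columns: replace ' ' and '#' with '~' (MFGR#4 -> MFGR~4, MED BAG -> MED~BAG)
            out.append(col.replace(' ', '~').replace('#', '~'))
    print('\t'.join(out))

The script would then be wired in with Hive's ADD FILE and SELECT TRANSFORM(...) USING syntax, as in the transform example in Examples_Assignment3.doc.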

 

6) Download and install Pig:

cd

wget http://cdmgcsarprd01.dpu.depaul.edu/CSC555/pig-0.15.0.tar.gz

gunzip pig-0.15.0.tar.gz

tar xvf pig-0.15.0.tar

 

Set the environment variables (these exports can also be placed in ~/.bashrc to make them permanent):

export PIG_HOME=/home/ec2-user/pig-0.15.0

export PATH=$PATH:$PIG_HOME/bin

 

Use the same vehicles file as before. Copy vehicles.csv to HDFS if it is not already there.

 

Now run Pig (using the PIG_HOME variable we set earlier):

cd $PIG_HOME

bin/pig

 

Create the same table as the one we used in Hive, assuming that vehicles.csv is in the home directory on HDFS:

 

VehicleData = LOAD '/user/ec2-user/vehicles.csv' USING PigStorage(',')

AS (barrels08:FLOAT, barrelsA08:FLOAT, charge120:FLOAT, charge240:FLOAT, city08:FLOAT);

 

You can see the table description by running:

DESCRIBE VehicleData;

Verify that your data has loaded by running:

VehicleG = GROUP VehicleData ALL;

Count = FOREACH VehicleG GENERATE COUNT(VehicleData);

DUMP Count;

 

How many rows did you get? (if you get an error here, it is likely because vehicles.csv is not in HDFS)

 

Create the same ThreeColExtract file that you created in the previous assignment by placing barrels08, city08, and charge120 into a new file using PigStorage. You want the STORE command to record the output in HDFS (discussed on p. 457 of the Pig chapter, "Data Processing Operators" section).

 

For example, you can use this to get one column (multiple columns are comma-separated):

 

OneCol = FOREACH VehicleData GENERATE barrels08;

 

Verify that the new file has been created and report the size of the newly created file.

(You can use quit to exit the Grunt shell.)

Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Assignment 3” at the top.