Description
CS544 Module 1 Assignment
Part1) 50 points
The data set rivers contains the lengths (in miles) of “major” rivers in North America, as
compiled by the US Geological Survey. Use the data set to answer the following
questions using R:
a) How many data points are there in the data set?
b) Compute the mean, median, and mode.
c) Compute the variance and the standard deviation.
d) Compute the five number summary, the interquartile range, and outliers, if
any.
e) Compute the standardized version (z-scores) of the above data.
f) Create a matrix of size 2 x 30 using the first 60 data points in rivers. The first 30
values belong to the first row of the matrix. Assign the result to the variable,
rivers.60, and display the result.
g) Without hardcoding, display the first and last columns of the matrix.
h) Assign row names for rivers.60 as Row_1 and Row_2 and column
names as Length_1, Length_2, ..., Length_30. The code should not hard
code the numbers in the row and column names.
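A minimal sketch of two ideas from Part 1 that are easy to get wrong: base R has no built-in function for the statistical mode, and names can be generated from the matrix dimensions instead of being typed out. It uses the built-in rivers data; the helper name stat.mode is an assumption made for this illustration, not something the assignment prescribes.

```r
# rivers ships with base R (datasets package).
x <- rivers

# Hypothetical helper: the most frequent value(s); returns every tie.
stat.mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
modes <- stat.mode(x)

# 2 x 30 matrix filled row by row, so the first 30 values form row 1.
rivers.60 <- matrix(x[1:60], nrow = 2, byrow = TRUE)

# Names derived from the dimensions rather than from literal numbers.
rownames(rivers.60) <- paste0("Row_", seq_len(nrow(rivers.60)))
colnames(rivers.60) <- paste0("Length_", seq_len(ncol(rivers.60)))

# First and last columns, located with ncol() instead of a literal 30.
rivers.60[, c(1, ncol(rivers.60))]
```

Because matrix() fills column by column by default, byrow = TRUE is what places the first 30 values in the first row.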
Part 2) 50 points
The data file Johnson.csv contains quarterly earnings (dollars) per Johnson & Johnson
share 1960–80.
a) Read the data from johnson.csv into a data frame. In the data frame, the data in the
“Year” column should be used as row names, and “Qtr1”, “Qtr2”, “Qtr3”, and “Qtr4”
should be column names.
b) Show the summary for earnings for each quarter.
c) Add a new column, Yearly, showing the earnings for the whole year (the sum of
earnings for the 4 quarters). Display the new resulting data frame.
d) Which was the best performing year (in terms of highest earning) and worst performing
year?
e) Show all rows of the data frame whose “Yearly” is greater than 20.
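The data frame operations in Part 2 can be sketched on a tiny inline table. The three rows are made-up numbers for illustration only, not the contents of Johnson.csv; the column names follow the ones named in the assignment.

```r
# Made-up values for illustration; not the real earnings data.
csv.text <- "Year,Qtr1,Qtr2,Qtr3,Qtr4
2001,1.0,2.0,3.0,4.0
2002,5.0,6.0,7.0,8.0
2003,0.5,0.5,0.5,0.5"

# row.names = "Year" moves that column into the row names.
df <- read.csv(text = csv.text, row.names = "Year")

# Per-column summary, then a row-wise total across the four quarters.
summary(df)
df$Yearly <- rowSums(df[, c("Qtr1", "Qtr2", "Qtr3", "Qtr4")])

# Row names of the largest and smallest yearly totals.
best  <- rownames(df)[which.max(df$Yearly)]
worst <- rownames(df)[which.min(df$Yearly)]

# Logical row filter on the new column.
high <- df[df$Yearly > 20, ]
```

With a real file, the text = argument would be replaced by the file path.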
Submission:
Create a folder, CS544_HW1_lastName and place the following files in this folder.
Write the solution in a Word document, HW1_lastName.doc.
For the code portions (Part1 and Part2), provide all R code in a single file,
HW1_lastName.r. For this homework only, you can earn an extra credit of 10 points by
providing the solutions in a Jupyter Notebook, HW1_lastName.ipynb, instead. (You can
include the R file and the Jupyter notebook if you wish).
Archive the folder (CS544_HW1_lastName.zip). Upload the zip file to the Assignments
section of Blackboard.
CS544 Module 2 Assignment
Part1) Probability – 20 points
a) From the Bayes’ rule example given in Section 3.10, compute the probabilities that a
randomly selected non-smoker i) has lung disease and ii) does not have lung disease. Show the
calculations without using R. Then, verify with the bayes function provided in the code samples.
b) Suppose that in a particular state, among the registered voters, 40% are democrats, 50%
are republicans, and the rest are independents. Suppose that a ballot question asks whether to
impose sales tax on internet purchases. Suppose that 70% of democrats, 40% of
republicans, and 20% of independents favor the sales tax. If a person chosen at random
favors the sales tax, what is the probability that the person is i) a democrat, ii) a republican, iii)
an independent? Show the calculations without using R. Then, verify with the
bayes function provided in the code samples.
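For checking a hand calculation, Bayes' rule can be written in a few lines of base R. The helper below is a hypothetical stand-in written for this sketch; it is not the bayes function from the course code samples, whose signature is not shown here. The inputs are the voter percentages from part b), with independents taken as the remaining 10%.

```r
# Hypothetical helper, not the course's bayes function.
posterior <- function(prior, likelihood) {
  joint <- prior * likelihood   # P(A_i) * P(B | A_i)
  joint / sum(joint)            # divide by P(B), the total probability
}

prior      <- c(democrat = 0.40, republican = 0.50, independent = 0.10)
likelihood <- c(democrat = 0.70, republican = 0.40, independent = 0.20)

post <- posterior(prior, likelihood)
post
```

The denominator sum(joint) is the overall probability of favoring the tax, which is the quantity the hand calculation has to find first.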
Part2) Random Variables – 30 points
a) Consider the experiment of rolling a pair of dice. Using R, show how you would define a
random variable for the absolute value of the difference of the two rolls, using a user-defined
function.
b) Using the above result, what is the probability that the two rolls differ by exactly 2? What is
the probability that the two rolls differ by at most 2? What is the probability that the two rolls
differ by at least 3? Use the Prob function as shown in the code samples.
c) Show the marginal distribution of the above random variable (using R).
d) Using R, add another random variable to the above probability space using a user defined
function. The random variable is TRUE if the sum of the two rolls is even, and FALSE otherwise.
What is the probability that the sum of the two rolls is even? Show also the marginal distribution
for this random variable.
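The logic behind Part 2 can be sketched in base R by enumerating the 36 outcomes directly. This is an illustration of the idea only: it does not use the Prob function from the code samples, and the names S, abs.diff, U, and V are assumptions made for the sketch.

```r
# All 36 equally likely outcomes of two dice.
S <- expand.grid(X1 = 1:6, X2 = 1:6)
S$probs <- 1 / nrow(S)

# User-defined function for the random variable |X1 - X2|.
abs.diff <- function(x1, x2) abs(x1 - x2)
S$U <- abs.diff(S$X1, S$X2)

# An event's probability is the sum over the outcomes it contains.
p.exactly.2  <- sum(S$probs[S$U == 2])
p.at.most.2  <- sum(S$probs[S$U <= 2])
p.at.least.3 <- sum(S$probs[S$U >= 3])

# Marginal distribution of U: total probability for each distinct value.
marginal.U <- tapply(S$probs, S$U, sum)

# Second random variable: TRUE when the sum of the rolls is even.
S$V <- (S$X1 + S$X2) %% 2 == 0
marginal.V <- tapply(S$probs, S$V, sum)
```

Note that "at most 2" and "at least 3" are complementary events, so their probabilities add to 1.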
Part3) Functions – 20 points
Using a for loop, write your own R function, evensum(data), that returns the sum of all the even
values in the given numeric data vector.
Now, without using any loop, write your own R function, evensum2(data), that returns the sum
of all the even values in the given numeric data vector.
Test both functions with sample data.
Sample output:
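A minimal sketch of the two versions described in Part 3, assuming the input is a numeric vector with no missing values (NA handling is left out):

```r
# Version 1: explicit for loop.
evensum <- function(data) {
  total <- 0
  for (value in data) {
    if (value %% 2 == 0) {
      total <- total + value
    }
  }
  total
}

# Version 2: vectorized, no loop. The logical index keeps even values only.
evensum2 <- function(data) {
  sum(data[data %% 2 == 0])
}

evensum(1:10)    # 2 + 4 + 6 + 8 + 10 = 30
evensum2(1:10)   # 30
```

Both versions return 0 for a vector with no even values, since the loop never adds anything and sum() of an empty vector is 0.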
Part4) R – 30 points
Initialize the Dow Jones Industrials daily closing data as shown below:
dow <- read.csv('https://kalathur.com/dow.csv', stringsAsFactors = FALSE)
Provide the simplest R code and output for all of the following. The code should work for any
given data.
a) Use the diff function to calculate the differences between consecutive values.
Insert the value 0 at the beginning of these differences. Add this result as the DIFFS column of
the data frame.
b) How many days did the Dow close higher than its previous day value? How many days did
the Dow close lower than its previous day value?
c) Show the subset of the data where there was a gain of at least 400 points from its previous
day value.
d) Provide the solution to compute the longest gaining streak of at least 100 points in the data.
Show the data for that longest gaining streak. Hint: Use the rle function provided by R.
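The rle hint in part d) can be illustrated on a short made-up vector of closing values. The column names of dow.csv are not given here, so the sketch works on a plain vector called close (a hypothetical name), and it reads "gaining streak of at least 100 points" as consecutive days that each gained at least 100; that reading is an assumption.

```r
# Made-up closing values for illustration only.
close <- c(1000, 1120, 1270, 1240, 1341, 1541, 1841, 1891)

# Day-over-day differences, with 0 inserted for the first day.
diffs <- c(0, diff(close))

# Run-length encode the TRUE/FALSE sequence "gained at least 100".
runs <- rle(diffs >= 100)

# Ignore FALSE runs by giving them length 0, then pick the longest TRUE run.
true.lengths <- ifelse(runs$values, runs$lengths, 0)
best <- which.max(true.lengths)

# Convert the run number back to positions in the original vector.
end    <- cumsum(runs$lengths)[best]
start  <- end - runs$lengths[best] + 1
streak <- start:end

close[streak]
```

The sketch assumes at least one qualifying day exists; with none, which.max would point at a run of length 0.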
Submission:
Create a folder, CS544_HW2_lastName and place the following files in this folder.
Provide the text and code part of the solutions and the corresponding output in a single
Word document, HW2_lastName.doc.
For the code portions, provide the R file, HW2_lastName.R, with each portion of the
code identified by comments.
Archive the folder (CS544_HW2_lastName.zip). Upload the zip file to the Assignments
section of Blackboard.
CS544 Module 3 Assignment
Part 1) 10 points
Use the primes (UsingR) dataset. Use the diff function to compute the
differences between successive primes. Show the frequencies of these
differences. Show the barplot of these differences.
Part 2) 10 points
Use the coins (UsingR) dataset. Do not use explicit loops for any
calculations. Do not hard code the denominations in the solution. The
solution should work for any denominations.
a) How many coins are there of each denomination?
b) What is the total value of the coins for each denomination?
c) What is the total value of all the coins?
d) Show the barplot for the number of coins by year.
Part 3) 10 points
Use the south (UsingR) dataset.
a) Show the stem plot of the data. What do you interpret from this plot?
b) Show the five number summary of the data. Calculate the lower and
upper ends of the outlier ranges. What are the outliers in the data?
c) Show the horizontal boxplot of the data along with the appropriate labels
on the plot.
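The outlier ranges in part b) follow the usual 1.5 x IQR rule. A minimal sketch on made-up values (not the south data), assuming the quartiles are taken from fivenum:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up values

f   <- fivenum(x)      # min, lower hinge, median, upper hinge, max
iqr <- f[4] - f[2]

# Anything beyond 1.5 * IQR from the hinges counts as an outlier.
lower.fence <- f[2] - 1.5 * iqr
upper.fence <- f[4] + 1.5 * iqr

outliers <- x[x < lower.fence | x > upper.fence]
outliers
```

fivenum reports hinges, which can differ slightly from the quartiles returned by quantile or summary, so the fences may shift a little depending on which one is used.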
Part 4) 10 points
Use the pi2000 (UsingR) dataset.
a) How many times does each of the digits 0 to 9 occur in this dataset?
b) Show the percentages of their frequencies.
c) Show the histogram of the data.
Part 5) 15 points
Suppose that football (NFL), basketball (NBA), and hockey (NHL)
games are being shown at the same time. Consider the two-way
summarized data shown below, which gives the preferences of men and
women for the sport they wish to watch.
a) Using cbind, create the matrix for the above data.
b) Set the row names for the data.
c) Set the column names for the data.
d) Now, add the dimension variables Gender and Sport to the data.
e) Show the marginal distributions for the Gender and the Sport.
f) Show the result of adding margins to the data.
g) Show the proportional data separately for Gender and Sport. Interpret
the results.
h) Using appropriate colors, show the mosaic plot for the data. Also show
the barplot for Gender and Sport separately with the bars side by side. Add
a legend to the plots.
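The table steps in Part 5 can be sketched with base R. The counts below are hypothetical placeholders, because the assignment's table is not reproduced in this text; only the function calls are the point of the example.

```r
# Hypothetical counts; rows are Men and Women, one column per sport.
tv <- cbind(c(25, 20), c(10, 15), c(5, 5))

rownames(tv) <- c("Men", "Women")
colnames(tv) <- c("NFL", "NBA", "NHL")

# Name the two dimensions themselves.
names(dimnames(tv)) <- c("Gender", "Sport")

by.gender <- margin.table(tv, 1)   # row totals
by.sport  <- margin.table(tv, 2)   # column totals
with.sums <- addmargins(tv)        # table plus a Sum row and column

prop.gender <- prop.table(tv, 1)   # each row sums to 1
prop.sport  <- prop.table(tv, 2)   # each column sums to 1
```

For the plots, mosaicplot(tv) and barplot(tv, beside = TRUE, legend.text = TRUE) accept the same matrix.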
Part 6) 10 points
Use the midsize (UsingR) dataset.
a) Show the pairwise plots for all the variables.
b) Provide at least 4 interpretations of the results.
Part 7) 15 points
Use the MLBattend (UsingR) dataset.
a) Extract the wins for the teams BAL, BOS, DET, LA, PHI into the
respective vectors.
b) Create a data frame of five columns using these vectors. Use the team
names for the columns.
c) Show the boxplot of the data frame.
d) Provide at least 5 interpretations of the results.
Part 8) 20 points
Initialize the House and Senate data as shown below:
house <- read.csv('https://kalathur.com/house.csv', stringsAsFactors = FALSE)
senate <- read.csv('https://kalathur.com/senate.csv', stringsAsFactors = FALSE)
Provide the simplest R code for the following:
a) Show how many senators and house members there are by party.
b) Show the top 10 states in decreasing order by the number of house
members in that state.
c) Use a box plot of the number of house members per state to
determine which states are outliers.
d) What is the average number of years served, by party, in the house
and senate?
Submission:
Create a folder, CS544_HW3_lastName and place the following files
in this folder.
Provide the R code, HW3_lastName.R, with each portion of the code
clearly identified by the corresponding question. Prepare a corresponding
Word document by pasting the output for each question
(HW3_lastName.docx).
Archive the folder (CS544_HW3_lastName.zip). Upload the zip file to
the Assignments section of Blackboard.
CS544 Module 4 Assignment
Part1) Binomial distribution (20 points)
Suppose a pitcher in Baseball has 50% chance of getting a strike-out when
throwing to a batter. Using the binomial distribution,
a) Compute and plot the probability distribution for striking out the next 6
batters.
b) Plot the CDF for the above distribution.
c) Repeat a) and b) if the pitcher has 70% chance of getting a strike-out.
d) Repeat a) and b) if the pitcher has 30% chance of getting a strike-out.
e) State what you infer from the shapes of the distributions.
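As a rough sketch of the calculations behind Part 1, dbinom gives the PMF and pbinom the CDF. The example treats each of the 6 batters as an independent trial, which is the assumption the binomial model itself makes.

```r
n <- 6
k <- 0:n   # possible numbers of strike-outs

pmf.50 <- dbinom(k, size = n, prob = 0.5)
cdf.50 <- pbinom(k, size = n, prob = 0.5)

pmf.70 <- dbinom(k, size = n, prob = 0.7)
pmf.30 <- dbinom(k, size = n, prob = 0.3)

# The p = 0.3 distribution is the mirror image of the p = 0.7 one.
mirror <- all.equal(pmf.30, rev(pmf.70))
```

For the plots, plot(k, pmf.50, type = "h") draws the PMF as vertical spikes and plot(k, cdf.50, type = "s") draws the CDF as a step function.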
Part2) Binomial distribution (15 points)
Suppose that 80% of the flights arrive on time. Using the binomial
distribution,
a) What is the probability that four flights will arrive on time in the next 10
flights?
b) What is the probability that four or fewer flights will arrive on time in the
next 10 flights?
c) Compute the probability distribution for flights arriving on time for the next
10 flights.
d) Show the PMF and the CDF for the next 10 flights.
Part3) Poisson distribution (15 points)
Suppose that on average 10 cars drive up to the teller window at your bank
between 3 PM and 4 PM and the random variable has a Poisson
distribution. During this time period,
a) What is the probability of serving exactly 3 cars?
b) What is the probability of serving at least 3 cars?
c) What is the probability of serving between 2 and 5 cars (inclusive)?
d) Calculate and plot the PMF for the first 20 cars.
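A minimal sketch of the Poisson calculations, using dpois for point probabilities and ppois for cumulative ones. Reading "the first 20 cars" as the counts 0 through 20 is an assumption.

```r
lambda <- 10   # average cars per hour

p.exactly.3  <- dpois(3, lambda)

# P(X >= 3) = P(X > 2), which is the upper tail beyond 2.
p.at.least.3 <- ppois(2, lambda, lower.tail = FALSE)

# P(2 <= X <= 5) = P(X <= 5) - P(X <= 1).
p.2.to.5     <- ppois(5, lambda) - ppois(1, lambda)

# PMF over the counts 0..20 (assumed range).
pmf <- dpois(0:20, lambda)
```

The off-by-one in the cumulative calls is the usual pitfall: "at least 3" needs ppois(2, ...), not ppois(3, ...), because the variable is discrete.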
Part4) Uniform distribution (15 points)
Suppose that your exams are graded using a uniform distribution between
60 and 100 (both inclusive).
a) What is the probability of scoring i) 60? ii) 80? iii) 100?
b) What is the mean and standard deviation of this distribution?
c) What is the probability of getting a score of at most 70?
d) What is the probability of getting a score greater than 80 (use the
lower.tail option)?
e) What is the probability of getting a score between 90 and 100 (both
inclusive)?
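A sketch of the uniform calculations, assuming the continuous uniform distribution on [60, 100]. Under that assumption dunif returns a density value rather than a probability, and the probability of any single exact score is zero; whether part a) expects the density is not stated in the assignment.

```r
a <- 60
b <- 100

# Density at three points: constant 1 / (b - a) inside the interval.
dens <- dunif(c(60, 80, 100), min = a, max = b)

# Mean and standard deviation of a continuous uniform distribution.
mu    <- (a + b) / 2
sigma <- (b - a) / sqrt(12)

p.at.most.70 <- punif(70, min = a, max = b)
p.above.80   <- punif(80, min = a, max = b, lower.tail = FALSE)
p.90.to.100  <- punif(100, min = a, max = b) - punif(90, min = a, max = b)
```

Because the distribution is continuous, "inclusive" and "exclusive" endpoints give the same probabilities.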
Part5) Normal distribution (20 points)
Suppose that visitors at a theme park spend an average of $100 on
souvenirs. Assume that the money spent is normally distributed with a
standard deviation of $10.
a) Show the PDF plot of this distribution covering the three standard
deviations on either side of the mean.
b) What is the probability that a randomly selected visitor will spend more
than $120?
c) What is the probability that a randomly selected visitor will spend
between $80 and $90 (inclusive)?
d) What are the probabilities of spending within one standard deviation, two
standard deviations, and three standard deviations, respectively?
e) Between what two values will the middle 90% of the money spent
fall?
f) Show a plot for 10,000 visitors using the above distribution.
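The normal-distribution questions map onto dnorm, pnorm, qnorm, and rnorm. A minimal sketch follows; the seed value and the variable names are arbitrary choices made for the example.

```r
mu    <- 100
sigma <- 10

# Grid covering three standard deviations on either side of the mean.
x    <- seq(mu - 3 * sigma, mu + 3 * sigma, length.out = 200)
dens <- dnorm(x, mean = mu, sd = sigma)

p.more.120 <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)
p.80.to.90 <- pnorm(90, mu, sigma) - pnorm(80, mu, sigma)

# Probability of falling within k standard deviations, k = 1, 2, 3.
within <- sapply(1:3, function(k) {
  pnorm(mu + k * sigma, mu, sigma) - pnorm(mu - k * sigma, mu, sigma)
})

# Middle 90%: cut 5% from each tail.
middle.90 <- qnorm(c(0.05, 0.95), mean = mu, sd = sigma)

# Simulated spending for 10,000 visitors.
set.seed(544)   # arbitrary seed so the draw is repeatable
spend <- rnorm(10000, mean = mu, sd = sigma)
```

plot(x, dens, type = "l") draws the PDF curve, and hist(spend) is one way to show the simulated visitors.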
Part6) Exponential distribution (15 points)
Suppose your cell phone provider’s customer support receives calls at the
rate of 18 per hour.
a) What is the probability that the next call will arrive within 2 minutes?
b) What is the probability that the next call will arrive within 5 minutes?
c) What is the probability that the next call will arrive between 2 minutes
and 5 minutes (both inclusive)?
d) Show the CDF of this distribution.
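For the exponential questions the main step is matching units: a rate of 18 calls per hour is 0.3 calls per minute. A minimal sketch, with times measured in minutes:

```r
rate.per.min <- 18 / 60   # 0.3 calls per minute

p.within.2 <- pexp(2, rate = rate.per.min)
p.within.5 <- pexp(5, rate = rate.per.min)

# P(2 <= T <= 5) is the difference of the two cumulative probabilities.
p.2.to.5 <- p.within.5 - p.within.2

# CDF values on a grid of waiting times, ready for plotting.
t   <- seq(0, 20, by = 0.5)
cdf <- pexp(t, rate = rate.per.min)
```

Keeping the rate per hour instead would work equally well, provided the times are then given in hours (2/60 and 5/60).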
Submission:
Create a folder, CS544_HW4_lastName and place the following file in this
folder.
Provide the R code, HW4_lastName.R, with each portion of the code
clearly identified by the corresponding question. Prepare a corresponding
Word document by pasting the output for each question
(HW4_lastName.docx).
Archive the folder (CS544_HW4_lastName.zip). Upload the zip file to
the Assignments section of Blackboard.
CS544 Module 5 Assignment
Part1) Central Limit Theorem (20 points)
The input data consists of the sequence from 1 to 20 (1:20). Show the following three
plots in a single row.
a) Show the histogram of the densities of this distribution.
b) Using all samples of this data of size 2, show the histogram of the densities of the
sample means.
c) Using all samples of this data of size 5, show the histogram of the densities of the
sample means.
d) Compare the means and standard deviations of the above three distributions.
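One way to enumerate "all samples" is combn, which lists every distinct combination without replacement; whether the assignment intends samples with or without replacement is not stated, so that choice is an assumption of this sketch.

```r
x <- 1:20

# Each column of the combn result is one sample.
samples.2 <- combn(x, 2)
xbar.2    <- colMeans(samples.2)

samples.5 <- combn(x, 5)
xbar.5    <- colMeans(samples.5)

# Means stay at the population mean; spreads shrink as the sample size grows.
means <- c(mean(x), mean(xbar.2), mean(xbar.5))
sds   <- c(sd(x), sd(xbar.2), sd(xbar.5))
```

par(mfrow = c(1, 3)) followed by three hist(..., prob = TRUE) calls places the density histograms in a single row.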
Part2) Central Limit Theorem (20 points)
The data in the file queries.csv contains the number of queries Google has had each day for a one
year period (365 days). The data file is also available at
https://kalathur.com/cs544/data/queries.csv. Use this link to read the data using read.csv function
when submitting the homework.
a) Show the histogram of the distribution of the number of queries. Compute the mean and
standard deviation of the number of queries Google has had per day.
b) Draw 1000 samples of this data of size 5, show the histogram of the densities of the sample
means. Compute the mean of the sample means and the standard deviation of the sample means.
c) Draw 1000 samples of this data of size 20, show the histogram of the densities of the sample
means. Compute the mean of the sample means and the standard deviation of the sample means.
d) Compare the means and standard deviations of the above three distributions.
Part3) Central Limit Theorem – Negative Binomial distribution (20
points)
Suppose the input data follows the negative binomial distribution with the
parameters size = 5 and prob = 0.5.
a) Generate 1000 random numbers from this distribution. Show the barplot
with the proportions of the distinct values of this distribution.
b) With sample sizes of 10, 20, 30, and 40, generate the data for 5000
samples using the same distribution. Show the histograms of the densities
of the sample means. Use a 2 x 2 layout.
c) Compare the means and standard deviations of the data from a) with the
four sequences generated in b).
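A minimal simulation sketch for Part 3, using rnbinom and replicate. The seed is an arbitrary choice so that the run is repeatable, and the variable names are assumptions.

```r
set.seed(544)   # arbitrary seed

size <- 5
prob <- 0.5

# 1000 draws and the proportion of each distinct value.
x     <- rnbinom(1000, size = size, prob = prob)
props <- prop.table(table(x))

# For each sample size, the means of 5000 independent samples.
sample.sizes <- c(10, 20, 30, 40)
xbars <- lapply(sample.sizes, function(n) {
  replicate(5000, mean(rnbinom(n, size = size, prob = prob)))
})
names(xbars) <- sample.sizes

xbar.means <- sapply(xbars, mean)
xbar.sds   <- sapply(xbars, sd)
```

The theoretical mean of this distribution is size * (1 - prob) / prob = 5, so the sample means should cluster around 5 with a spread that narrows as the sample size grows. barplot(props) shows the proportions, and par(mfrow = c(2, 2)) gives the 2 x 2 layout for the four histograms.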
Part4) Sampling (40 points)
Use the MU284 dataset from the sampling package. Use a sample size of
20 for each of the following.
a) Show the sample drawn using simple random sampling without
replacement. Show the frequencies for each region (REG). Show the
percentages of these with respect to the entire dataset.
b) Show the sample drawn using systematic sampling. Show the
frequencies for each region (REG). Show the percentages of these with
respect to the entire dataset.
c) Calculate the inclusion probabilities using the S82 variable. Using these
values, show the sample drawn using systematic sampling. Show the
frequencies for each region (REG). Show the percentages of these with
respect to the entire dataset.
d) Order the data using the REG variable. Draw a stratified sample using
proportional sizes based on the REG variable. Show the frequencies for
each region (REG). Show the percentages of these with respect to the
entire dataset.
e) Compare the means of RMT85 variable for these four samples with the
entire data.
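The index arithmetic behind the first two sampling methods can be sketched in base R. This is an illustration of the idea only and does not use the sampling package's own functions; N = 284 assumes the MU284 data has 284 rows, and the seed is arbitrary.

```r
set.seed(544)   # arbitrary seed

N <- 284   # assumed population size
n <- 20    # sample size

# Simple random sampling without replacement: n distinct row indices.
srs.idx <- sort(sample(N, n))

# Systematic sampling: every k-th row from a random start r in 1..k.
k <- ceiling(N / n)
r <- sample(k, 1)
sys.idx <- seq(r, by = k, length.out = n)

# Indices past the last row are dropped, so fewer than n rows may remain.
sys.idx <- sys.idx[sys.idx <= N]
```

Given a data frame, the selected rows are then data[srs.idx, ] or data[sys.idx, ], and table() on the REG column of that subset gives the per-region frequencies.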
Submission:
Create a folder, CS544_HW5_lastName and place the following file in this
folder.
Provide the R code, HW5_lastName.R, with each portion of the code
clearly identified by the corresponding question. Prepare a corresponding
Word document by pasting the output for each question
(HW5_lastName.docx).
Archive the folder (CS544_HW5_lastName.zip). Upload the zip file to
the Assignments section of Blackboard.
CS544 Module 6 Assignment
Part1) Strings (60 points)
Use the stringr functions for the following:
Initialize the vector of words from Lincoln’s Gettysburg address with the
following code:
file <- "https://kalathur.com/cs544/data/lincoln.txt"
words <- scan(file, what=character())
a) Detect and show all the words that have a punctuation symbol.
b) Replace all the punctuations in the corresponding words with an empty
string. Make this the new words data.
c) Show the frequencies of the word lengths in the above data. Plot the
distribution of these frequencies.
d) What are the words with the longest length?
e) Show all the words that start with the letter p.
f) Show all the words that end with the letter r.
g) Show all the words that start with the letter p and end with the letter r.
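The regular expressions behind Part 1 can be tried out in base R on a handful of made-up words. The sketch uses grepl and gsub only to stay self-contained; the same patterns can be passed to stringr functions such as str_detect, str_subset, and str_replace_all, which the assignment asks for.

```r
# Made-up words for illustration; not the Gettysburg address.
words <- c("people,", "proper", "power.", "prayer", "nation", "our")

# Words containing any punctuation character.
has.punct <- words[grepl("[[:punct:]]", words)]

# Remove every punctuation character.
clean <- gsub("[[:punct:]]", "", words)

# Word-length frequencies and the longest words.
length.freq <- table(nchar(clean))
longest     <- clean[nchar(clean) == max(nchar(clean))]

# ^ anchors the start of the word, $ anchors the end.
starts.p <- clean[grepl("^p", clean)]
ends.r   <- clean[grepl("r$", clean)]
p.and.r  <- clean[grepl("^p.*r$", clean)]
```

The patterns are case-sensitive, so "^p" does not match a word that starts with a capital P.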
Part2) Data Wrangling (40 points)
Use the tidyverse library for the following:
Download the following csv file,
https://people.bu.edu/kalathur/usa_daily_avg_temps.csv
locally first and use read.csv to load the data into a data frame.
a) Convert the data frame into a tibble and assign it to the variable
usaDailyTemps.
b) What are the maximum temperatures recorded for each year? Show the
values and also the appropriate plot for the results.
c) What are the maximum temperatures recorded for each state? Show the
values and also the appropriate plot for the results.
d) Filter the Boston data and assign it to the variable bostonDailyTemps.
e) What are the average monthly temperatures for Boston? Show the
values and also the appropriate plot for the results.
Submission:
Create a folder, CS544_HW6_lastName and place the following file in this
folder.
Provide the R code, HW6_lastName.R, with each portion of the code
clearly identified by the corresponding question. Prepare a corresponding
Word document by pasting the output for each question
(HW6_lastName.docx).
Archive the folder (CS544_HW6_lastName.zip). Upload the zip file to
the Assignments section of Blackboard.
CS544 Final Project
Picking the Data Set
Look into the following sites as an example and select a data set that interests you.
1. https://www.kaggle.com/datasets
2. https://www.kdnuggets.com/datasets/index.html
3. Any other source of your choice
Preparing the data
• Import the data set into R.
• Document the steps for the import process and any preprocessing that had
to be done prior to or after the import. Any R code used in the process
should be included.
Analyzing the data
• Do the analysis as in Module3 for at least one categorical variable and at least one
numerical variable. Show appropriate plots for your data.
• Do the analysis as in Module3 for at least one set of two or more variables. Show
appropriate plots for your data.
• Pick one variable with numerical data and examine the distribution of the data.
• Draw various random samples of the data and show the applicability of the
Central Limit Theorem for this variable.
• Show how various sampling methods can be used on your data. What are your
conclusions if these samples are used instead of the whole dataset?
Presenting the Project
• You will do your project presentation in the classroom.
• Each presentation is for at most 10 minutes.
• The final files for the project will be due on Monday, June 25, 6 AM.
Submitting the Project
Upload a zip file (CS544Final_lastName.zip) containing all the code (R file), the
presentation document (PDF or PPT), and all the results in a Word/PDF Document.
Grading Rubric:
• Preparing the data and documenting the data preparation (10 points)
• Analyzing the data and documenting the same (60 points)
• Implementation of any extra feature(s) not mentioned in the specification
(10 points)
• Presenting the project in the classroom (20 points)