Description
Question 1 (15 points)
The dataset data.online.scores (in data.zip) provides the exam score records for students
who take online courses. These records are sampled from a general population. Data in
each row are separated by tabs. The first column shows students’ IDs. The second column
is students’ midterm scores and the third column is students’ final scores. Please give the
following statistical descriptions of the final scores. If a result is not an integer, round
it to 3 decimal places.
Purpose
• Have a better understanding of basic statistical descriptions of data.
Requirements
• For sub-questions (a), (b) and (c), you should write scripts to calculate statistical
descriptions. There are no restrictions on the language you use. You are not allowed to
calculate using calculators or by hand. You are required to submit your source code
for sub-questions (a), (b) and (c).
• For sub-question (d), you are required to answer the question in the PDF file you will
submit.
a. (6’) First quartile Q1, the median, and the third quartile Q3.
b. (3’) Mean.
c. (3’) Mode.
d. (3’) Is the distribution of students’ final scores positively skewed or
negatively skewed? Explain how you reached your conclusion.
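As an illustration of the kind of script sub-questions (a), (b) and (c) ask for, here is a
minimal Python sketch. It is not a reference solution: the rows below are made up, the helper
names parse_final_scores and describe are hypothetical, and the sketch assumes the file has
no header row. Quartile conventions differ between tools; the "inclusive" method used here is
only one choice, so check which definition the lecture slides use.

```python
import statistics

# Made-up rows in the format described above: ID <tab> midterm <tab> final.
# These are NOT the records in data.online.scores.
sample_rows = [
    "1001\t62\t55", "1002\t70\t60", "1003\t58\t60",
    "1004\t81\t70", "1005\t77\t75", "1006\t90\t80",
    "1007\t66\t85", "1008\t88\t90", "1009\t93\t95",
]

def parse_final_scores(lines):
    """Return the third tab-separated column of each non-empty line as floats."""
    return [float(line.split("\t")[2]) for line in lines if line.strip()]

def describe(values):
    """Return Q1, median, Q3, mean and mode(s), rounded to 3 decimal places."""
    # Assumption: "inclusive" quartiles; other conventions can give
    # slightly different Q1 and Q3 on small samples.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "q1": round(q1, 3),
        "median": round(median, 3),
        "q3": round(q3, 3),
        "mean": round(statistics.mean(values), 3),
        # multimode returns every most-frequent value, in case of ties.
        "modes": statistics.multimode(values),
    }

summary = describe(parse_final_scores(sample_rows))
print(summary)
```

To run it on the real file, you could pass the lines of an opened file to
parse_final_scores instead of sample_rows.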
Question 2 (15 points)
In the following questions, you are required to evaluate the similarity/dissimilarity among
data samples. If a result is not an integer, round it to 3 decimal places.
Purpose
• Have a better understanding of measuring data similarity and dissimilarity.
Requirements
• For sub-questions (a) and (b), you should write the important steps and the result in the
PDF file you will submit. A result alone will not receive credit.
• For sub-question (c), you should explain clearly in the PDF file.
• For sub-question (d), you should write a script to do the calculation. There are no
restrictions on the language you use. You are required to submit your source code for
sub-question (d).
a. (3’) Given two objects Obj1 and Obj2, each of them has 200 binary attributes. Table
1 is the contingency table for these two objects. Each cell in the table shows the
number of attributes where Obj1 and Obj2 have the corresponding combination of
values. E.g., for cell Obj1 = 1 and Obj2 = 0, there are 28 attributes with such a
combination. Suppose all the attributes are asymmetric binary attributes. Calculate
the Jaccard coefficient of Obj1 and Obj2.
                  Obj2
                1      0
Obj1      1     21     28
          0     39     112
Table 1: Contingency Table for Obj1 and Obj2
b. (6’) Given two points in the 3-D space, A = (3, 1, 2) and B = (1, 0, 8). Please
calculate the following distances between these two points.
1. Euclidean distance.
2. Manhattan distance.
3. Minkowski distance where h = 1.
c. (2’) Suppose we have two random points A and B in space. Explain why the Euclidean
distance between A and B is always shorter than (or equal to) the Manhattan distance.
d. (4’) In the dataset vectors.txt, you will find two vectors (A and B). Each vector
has 100 attributes (separated by tabs). Calculate the following distances between these
two vectors:
1. Minkowski distance where h = 2.
2. Minkowski distance where h = 3.
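The measures used in this question can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function names jaccard_from_counts and minkowski
are hypothetical, the inputs are toy values rather than the ones in the sub-questions, and
the Jaccard form shown is the usual one for asymmetric binary attributes, where attributes
that are 0 in both objects are ignored.

```python
def jaccard_from_counts(both_one, only_first, only_second):
    """Jaccard coefficient for asymmetric binary attributes.

    both_one:    number of attributes equal to 1 in both objects
    only_first:  number of attributes equal to 1 only in the first object
    only_second: number of attributes equal to 1 only in the second object
    Attributes equal to 0 in both objects do not enter the formula.
    """
    return both_one / (both_one + only_first + only_second)

def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)

# Toy inputs, not the values from the sub-questions.
p = (0.0, 0.0)
q = (3.0, 4.0)

toy_jaccard = jaccard_from_counts(2, 1, 1)
manhattan = minkowski(p, q, 1)   # h = 1 gives the Manhattan distance
euclidean = minkowski(p, q, 2)   # h = 2 gives the Euclidean distance
cubic = minkowski(p, q, 3)

print(round(toy_jaccard, 3))  # 0.5
print(manhattan)              # 7.0
print(euclidean)              # 5.0
print(round(cubic, 3))
```

For sub-question (d), the same minkowski function could be applied to two lists of 100
floats read from vectors.txt.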
Question 3 (10 points)
Based on the data of students’ scores (file data.online.scores, contained in the file data.zip),
normalize the midterm scores using z-score normalization (use the empirical standard
deviation as the standard deviation).
Purpose
• Understand the intuition and usage of z-score normalization.
Requirements
• Write a script to normalize the data using z-score normalization, in any language of
your choice. You need to include the script file in your submission.
a. (5’) Compare the mean and empirical variance before and after normalization.
b. (5’) For an original score of 90, what is the corresponding score after normalization?
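A z-score normalization script can be very short. The Python sketch below is one possible
shape, not a reference solution. The function name z_score_normalize and the toy scores are
hypothetical. It also assumes that "empirical standard deviation" means dividing by n
(statistics.pstdev); if the lecture slides divide by n - 1 instead, statistics.stdev is the
matching call.

```python
import statistics

def z_score_normalize(values):
    """Return (normalized values, mean, standard deviation).

    Assumption: the standard deviation divides by n (statistics.pstdev).
    Swap in statistics.stdev if the n - 1 definition is required.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values], mu, sigma

# Toy midterm scores, NOT the assignment data.
midterms = [70.0, 80.0, 90.0]
normalized, mu, sigma = z_score_normalize(midterms)

# Mean and variance before and after, for a side-by-side comparison.
print(round(mu, 3), round(statistics.pvariance(midterms), 3))
print(round(statistics.fmean(normalized), 3),
      round(statistics.pvariance(normalized), 3))

# Normalizing a single score reuses the mean and deviation of the data.
single = (90.0 - mu) / sigma
print(round(single, 3))
```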
Question 4 (30 points)
Purpose
• Understand the intuition and usage of Pearson correlation coefficients and Principal
Component Analysis (PCA).
Requirements
• Apply the algorithms described in the lecture slides on a toy dataset.
• Give explanations based on your understanding of the algorithms.
• Use Matlab, MS Excel, or similar software applications for calculation and visualization.
Consider 10 data points in 2-D space as specified in the table below.
X 0.69 -1.31 0.39 0.05 1.29 0.49 0.19 -0.81 -0.31 0.71
Y 0.89 -1.11 0.59 0.45 1.19 0.69 0.25 -0.71 -0.21 0.71
a. (5’) What is the (Pearson) correlation coefficient between X and Y in the data set
above? Show your calculations. What do you learn about the data set from the
quantity?
b. (3’) Based on the quantity and conclusion above, without actually applying PCA, can
you guess if PCA may or may not help to reduce the data size? Explain your guess by
the intuition of PCA.
c. (6’) What is the covariance matrix for the data set above? Show your calculation.
Hint: It is easy to miss a few steps. Follow the steps described in the lecture slides.
d. (6’) How many principal components does the dataset have? What are they? What
is the first principal component, i.e., the most important one? Show your calculation.
Hint: You can use Matlab or similar software applications to find eigenvectors.
e. (5’) Scatterplot all the data points and draw the lines showing the directions of all the
principal components. Hint: You may use Matlab or Excel to draw.
f. (5’) Suppose we only use the first principal component, i.e., the most important
component, as the basis for the new space. Project the data points A = (0.05, 0.45) and
B = (0.49, 0.69) to the new space. Show your calculation. Draw the projections on
the figure in sub-question (e).
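To check hand calculations for this question, the quantities involved can be sketched in plain
Python. This is an illustration only, run on a made-up 5-point dataset rather than the table
above, and every function name in it is hypothetical. It assumes the sample covariance
(dividing by n - 1) and projects points after subtracting the mean; follow the lecture slides
if they define either step differently. The eigenvectors come from the closed-form solution
for a symmetric 2 x 2 matrix.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance. Assumption: divide by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def principal_components_2d(x, y):
    """Return [(eigenvalue, unit eigenvector), ...], largest eigenvalue first,
    for the covariance matrix [[a, b], [b, c]] of x and y."""
    a, b, c = covariance(x, x), covariance(x, y), covariance(y, y)
    if b == 0:
        # Already diagonal: the coordinate axes are the principal components.
        axes = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(axes, key=lambda item: item[0], reverse=True)
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    components = []
    for lam in (half_trace + root, half_trace - root):
        # (b, lam - a) solves (C - lam * I) v = 0 when b is non-zero.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        components.append((lam, (vx / norm, vy / norm)))
    return components

def project(point, x, y, direction):
    """Coordinate of a mean-centered point along a unit direction."""
    return ((point[0] - mean(x)) * direction[0]
            + (point[1] - mean(y)) * direction[1])

# Made-up data, NOT the table from the question.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson(xs, ys)
components = principal_components_2d(xs, ys)
first_direction = components[0][1]
coordinate = project((4.0, 3.0), xs, ys, first_direction)

print(round(r, 3))
print(components)
print(round(coordinate, 3))
```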
Mini Machine Problem 1 (15 points)
This MP borrows a considerable amount of material from a certain source. We will
publish the source after the submission due date because it contains answers to a few
questions. Don’t try to find the existing answers: the MP is not hard, and it is really
fun and useful. To finish the MP, please read this document carefully.
In this MP, we use the Matlab built-in data set carsmall, a data set containing
information for 100 cars from 1970, 1976 and 1982. For this data set, we focus on 5 attributes:
Acceleration (the rate of change of velocity of a car), MPG (Miles Per Gallon, fuel
efficiency), Displacement (the volume of the cylinder), Horsepower and Weight. We use the
Cylinders attribute (the number of cylinders) to group our observations.
You will be required to run some code provided to you in this PDF file. However, do not
copy the code from this file to Matlab directly, since the PDF encoding of some
special symbols is not supported by Matlab. You should type the code
into Matlab.
Purpose
• Learn the basic techniques for data visualization using Matlab.
Requirements
• This MP requires Matlab. Please do not use other software, since that will make some
questions harder. Matlab is free for UIUC students from the UIUC Webstore, and it is
also available on EWS machines on campus. If you cannot access either source, please
let us know ASAP. Please also note that the MP may take you only 1-2 hours to finish,
so if you don’t often use the heavy Matlab software, you may want to use one of the
EWS machines on campus.
• You should write all your answers (code, graphs and texts) in the PDF file you will
submit. For code and graphs, you could paste them to the file.
1. Load the data carsmall in Matlab using the following code.
load carsmall
X = [MPG,Acceleration,Displacement,Weight,Horsepower];
varNames = {'MPG'; 'Acceleration'; 'Displacement'; 'Weight'; 'Horsepower'};
2. (2’) Comet graph is an animated graph. To trace the data points on the screen for
the Displacement attribute, we use the following code to visualize the Displacement
attribute. Show the final comet graph in the PDF file you will submit by running
the following code on Matlab.
comet(Displacement)
xlabel('Index of Car')
ylabel('Displacement')
3. (5’) Drawing a boxplot is a popular way to visualize a distribution. The two whiskers
show the minimum and maximum observations. The central line shows the median.
The edges of the box are the first quartile and the third quartile.
a. (1’) Run the following code on your Matlab to draw a boxplot for the Acceleration
attribute. Show the boxplot in the PDF file you will submit.
boxplot(Acceleration)
ylabel('Acceleration')
b. (4’) Write code to visualize the Acceleration attribute using the boxplot for
cars with different numbers of cylinders. In this graph, you group cars using the
Cylinders attribute (the number of cylinders). For each group of cars, you draw a
box to show the five-number summary of Acceleration. All the boxes should
be drawn on the same graph. In your graph, the x-axis represents the number of
cylinders and the y-axis shows the Acceleration. You should also add labels
for the x-axis (Cylinders) and the y-axis (Acceleration). (Hint: only a few lines of
code are needed to finish this task. Try to use the boxplot(X,G) function where X
is the attribute to be visualized and G is the grouping attribute.) Show your code
and grouped boxplot in the PDF file you will submit.
4. (4’) 3-D scatter plots are popularly used to visualize 3 attributes at the same time.
a. (2’) Run the following code to draw a 3-D scatter plot. Show the 3-D plot in
the PDF file you will submit.
scatter3(Displacement,Cylinders,Horsepower,'filled','r')
xlabel('Displacement')
ylabel('Cylinders')
zlabel('Horsepower')
b. (2’) By observing the graph you get, could you identify a pair of correlated attributes? Could you explain why the positive or negative correlation makes sense?
Give your answer in the PDF file you will submit. (Hint: You could rotate the
graph in Matlab when you try to find the correlation between two attributes on a
3-D graph.)
5. (4’) Interactive star plots are used to show the values of attributes for each observation.
In each star (observation), the spoke length is proportional to the value of that attribute
for that observation.
a. (2’) Run the following code. Show the graph in the PDF file you will submit.
h = glyphplot(X(1:9,:), 'glyph','star', 'varLabels',varNames,...
'obslabels',Model(1:9,:));
set(h(:,3),'FontSize',8);
b. (2’) In the Matlab figure dialog menu, there is a button called data cursor (see
Figure 1; the data cursor item is in the red circle). Based on the graph you get
in 5a, if you click on the data cursor button and then click on any star (car), you
will get the value of each attribute of that car. In the PDF file you will submit,
show the value of each attribute for the star (car) at the top left corner of the
graph you plotted in Question 5a.
Figure 1: Matlab Figure Dialog Menu
Mini Machine Problem 2 (15 points)
This mini-MP asks you to play around with a few basic functionalities of Pentaho Kettle
(Spoon) software to do data preprocessing for a customer table. In particular, you need to
build a simple workflow that outputs pairs of last names that look similar. They currently
belong to different customers, possibly because of mistakes of the employees when inputting
the data.
Purpose
• Know what a tool specifically designed for data preprocessing may look like.
• Consider using the open-source software in your future work.
Requirements
• Do a few basic tasks with Spoon. You will have to install the software on your machine.
In particular, you may want to watch Long’s demonstration on the usage of Kettle
Spoon in the video lecture on 09/08/2015. It shows basic things about Spoon, and you
only need to combine and modify those things to finish this assignment.
You can download the software (≈ 800 MB) at
http://community.pentaho.com/projects/data-integration/, and the tutorial about how to
launch it at http://wiki.pentaho.com/display/EAI/02.+Spoon+Introduction.
As we will need MySQL, you need to copy mysql-connector-java-5.1.36-bin.jar to the
lib folder of your kettle installation folder: http://dev.mysql.com/downloads/connector/j/
We will use Sakila sample database from MySQL. We are particularly interested in the
Customer table. We uploaded it to our database, so you can use the database online, which
means you do not need to install MySQL server. You do not need knowledge of MySQL to
do this mini MP either.
To get started, open file cs412 minimp1.ktr in Spoon. You can find the file in the data.zip
file on the assignment page of the course website. After opening it, you will see the following
components:
• ReadSource: It downloads the table Customers from our online database. If you
double click on the component, you will see it contains a SQL query to obtain the
necessary information. The component is incomplete because it does not specify the
connection. You will have to create a new one by clicking on “New…”, and then enter
the following information:
– Host name: engr-cpanel-mysql.engr.illinois.edu
– Database name: ltpham3 sakila
– Port number: 3306
– Username: ltpham3 cs412
– Password: cs412kevin
You can test the connection by clicking on “Test”, or clicking on “Preview” after double
clicking on the icon of the component.
• Lkp Lastname: It downloads a list of last names. You also need to specify the connection
you created for the component above.
• MatchLastName: It compares the last names from Lkp Lastname with the last names
from ReadSource. Feel free to choose one of the built-in algorithms for the comparison.
Your tasks are as follows:
1. (5’) Your first task is to make MatchLastName work by specifying the flow of data
from ReadSource to Lkp Lastname, as well as filling in the necessary information in the
component. We specified the flow of data from ReadSource to MatchLastName as an
example. You will also notice that we specified the min value 0.8 and max value 0.99
in MatchLastName. Can you explain why the max value must be 0.99 rather than 1.0?
You may try max value 1.00 to see why we must do that.
2. (5’) Search for the “Filter Rows” component in the Design tab on the left, and drag it to
the canvas. It takes rows from MatchLastName as input and outputs the rows that satisfy
your criteria. Specify the filter with the necessary criteria so that it outputs the rows
containing information about the customers whose last names match those
of someone else. Report the screenshots of the workflow and its output.
3. (5’) The score column in the output above seems to be redundant. Search for the
“Select values” component in the Design tab on the left, and fill in the necessary information
to remove the score column. Report the screenshots of the workflow and its output.