CS412: Introduction to Data Mining Assignment 5 solution


1 Conceptual Questions (10 points)

a. (L3, 2’) Given the same parameters, does DBSCAN always output the same cluster
assignment or not? Briefly explain.

b. (L3, 2’) With improper choice of initial cluster centers, K-Means can give us a very
bad clustering output. Could you propose a method to alleviate the issue?

c. (L1, 2’) Between k-Medoids (PAM) and k-Means, which is more efficient? Briefly
explain.

d. (L1, 2’) OPTICS is superior to DBSCAN because we do not have to set the parameter
ε. True or False? Briefly explain.

e. (L1, 2’) Is it true that lazy learners (e.g., kNN) are more efficient than non-lazy
learners (e.g., decision trees)?

2 k-Nearest Neighbors (10 points)

You will use a k-nearest-neighbor classifier to predict labels for new data points, and
investigate under what situations kNN works well. The set of labeled data points is given
in Table 1.

Purpose
• Understand how kNN works and its pros and cons.

Requirements
• No need to use distance-weighted labels of nearest neighbors.
• Show your calculations for questions asking for a number.
id  x    y    label
1   0    0    +1
2   0    1    +1
3   1    0    -1
4   1    1    -1
5   1.5  0.5  +1
6   2    0.5  -1

Table 1: Data for kNN

a. (L2, 2’) Plot the given labeled data in the x-y plane, and highlight the regions where
data will be classified as ‘+1’ by a 1NN classifier (i.e., kNN for k = 1).

b. (L2, 2’) Suppose we use a 3NN classifier based on 100 (labeled) data points uniformly
distributed in a d-dimensional unit cube, i.e., [0, 1]^d. For d = 2 (i.e., the unit square),
what is the minimum size of a d-dimensional cube centered at q for it to cover at least
3 neighbors of q in expectation? Show your steps. (Note that there is no need to
consider the special case in which the query point q lies near the boundary of the unit
cube.)

c. (L2, 2’) Calculate the minimum cube size asked in (b) for d = 100.
d. (L3, 2’) According to your calculations in (b) and (c), do you see any problem in
applying a kNN classifier to data in high-dimensional spaces (e.g. d = 100)? Briefly
explain. (Hint: consider the underlying principle of why kNN works generally.)
e. (L3, 2’) What are the pros and cons of using a large k for the kNN classifier?
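As a sanity check for parts (b) and (c), the expected number of the n points falling inside an axis-aligned cube of side s is n·s^d, so the minimum side is s = (k/n)^(1/d). A small Python sketch of this calculation, together with a plain kNN classifier over the Table 1 data (the query point and helper names are illustrative, not part of the assignment):

```python
import math

# Labeled points from Table 1: ((x, y), label)
data = [((0, 0), +1), ((0, 1), +1), ((1, 0), -1),
        ((1, 1), -1), ((1.5, 0.5), +1), ((2, 0.5), -1)]

def knn_predict(q, k=1):
    """Majority vote among the k nearest neighbors (Euclidean distance)."""
    nearest = sorted(data, key=lambda d: math.dist(q, d[0]))[:k]
    votes = sum(label for _, label in nearest)
    return +1 if votes > 0 else -1

# Expected neighbors inside a cube of side s: n * s**d >= k, so s = (k/n)**(1/d)
def min_cube_side(k, n, d):
    return (k / n) ** (1 / d)

print(knn_predict((0.2, 0.1), k=1))   # nearest point is id 1 at (0, 0)
print(min_cube_side(3, 100, 2))       # ≈ 0.1732 for d = 2
print(min_cube_side(3, 100, 100))     # ≈ 0.9655 for d = 100
```

The d = 100 result already hints at the answer to (d): the cube must span almost the entire unit cube before it is expected to contain even 3 neighbors.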

3 Perceptrons (12 points)

You will design neurons and neural networks to implement some specified functions.

Purpose
• Understand how neurons and neural networks approximate functions.
• See the different modeling capacities of perceptrons and multi-layer neural networks
(multilayer perceptrons).

Requirements
• Show the neurons and neural networks using diagrams.
a. (L2, 2’) Given x1, x2 ∈ {0, 1}, design a neuron that implements the logical operation
AND. That is, the neuron takes x1 and x2 as input and computes y = x1 AND x2 as
output (y ∈ {0, 1}). (Hint: use the activation function threshold(a) = 1 if a > 0 and 0
otherwise.)

b. (L2, 2’) Design a neuron that outputs y = x1 OR x2.
c. (L2, 2’) Design a neuron that takes a single input x ∈ {0, 1} and outputs y = NOT x.
d. (L3, 4’) Design a neural network that takes x1, x2 ∈ {0, 1} as input and computes
y = x1 XOR x2 as output (x1 XOR x2 = 1 if and only if x1 ≠ x2). (Hint: you can
reuse the neurons you designed in (a)–(c).)

e. (L3, 2’) Is it possible to implement XOR using a single-layer neural network (without
any hidden layer)? Explain your answer.
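The neurons in (a)–(d) can be sketched directly in Python using the threshold activation from the hint. The specific weights and biases below are just one valid choice among many, not the only correct answer:

```python
def threshold(a):
    # Activation from the hint: 1 if a > 0, else 0
    return 1 if a > 0 else 0

# Each neuron computes threshold(w1*x1 + w2*x2 + bias);
# these particular weights are one valid choice
def AND(x1, x2):
    return threshold(x1 + x2 - 1.5)

def OR(x1, x2):
    return threshold(x1 + x2 - 0.5)

def NOT(x):
    return threshold(-x + 0.5)

def XOR(x1, x2):
    # Two layers: x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2)
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", XOR(x1, x2))
```

Note that XOR needs the hidden layer: a single threshold neuron can only separate its inputs with one line, and no line separates {(0,1), (1,0)} from {(0,0), (1,1)}.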

4 Hierarchical Agglomerative Clustering and B-Cubed
Evaluation (8 points)

Purpose
• Understand how AGNES and B-Cubed work.

Requirements
• In sub-question a, only draw the dendrogram.
• In sub-question b, only list the members of each cluster.
• In sub-question c, show detailed calculations of B-Cubed precision and recall.

Suppose we have 13 data points as listed and plotted in Figure 1. The ground truth (the
correct clustering) is also provided.
Point x y Ground-truth cluster
P1 1 3 C1
P2 1 2 C1
P3 2 1 C1
P4 2 2 C1
P5 2 3 C1
P6 3 2 C1
P7 4 3 C1
P8 6 3 C2
P9 4 5 C2
P10 5 4 C2
P11 5 5 C2
P12 6 4 C2
P13 6 5 C2
Figure 1: Data and ground truth

a. (L2, 4’) Draw the dendrogram using AGNES. Please use single link and Euclidean
distance as the dissimilarity measure.

b. (L2, 2’) If we want to cluster the dataset into 3 groups based on the dendrogram, what
are the members of each of the 3 groups?

c. (L2, 2’) Based on the given ground truth, what are the B-Cubed precision and recall
of the output?
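For checking a hand-drawn dendrogram, single-link AGNES can be sketched with a brute-force merge loop; this naive version recomputes pairwise distances on every merge and is for illustration only, not an efficient implementation:

```python
import math
from itertools import combinations

# The 13 points from Figure 1 with their ids
points = {"P1": (1, 3), "P2": (1, 2), "P3": (2, 1), "P4": (2, 2),
          "P5": (2, 3), "P6": (3, 2), "P7": (4, 3), "P8": (6, 3),
          "P9": (4, 5), "P10": (5, 4), "P11": (5, 5), "P12": (6, 4),
          "P13": (6, 5)}

def single_link(a, b):
    # Single link: distance of the closest pair across the two clusters
    return min(math.dist(points[p], points[q]) for p in a for q in b)

# AGNES: start from singletons and repeatedly merge the closest pair of
# clusters; here we stop at 3 clusters (sub-question b) instead of
# building the full dendrogram
clusters = [{p} for p in points]
while len(clusters) > 3:
    a, b = min(combinations(clusters, 2), key=lambda ab: single_link(*ab))
    clusters.remove(a)
    clusters.remove(b)
    clusters.append(a | b)

for c in sorted(clusters, key=len):
    print(sorted(c, key=lambda p: int(p[1:])))
```

Recording the single-link distance at each merge gives the heights needed to draw the dendrogram in sub-question a.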

5 K-Means (8 points)

Purpose
• Understand how k-Means works.

Requirements
• In sub-questions a and b, for each iteration of k-Means:
– Annotate the data points in Figure 1 to show which points belong to which clusters,
e.g., draw a red circle at each point belonging to cluster 1, and green and blue
circles for clusters 2 and 3. (The file for the figure is provided in data.zip.)

– Plot the mean of each cluster with the same color you use for data points in this
cluster, but in a different shape to differentiate them from data points.

– Show the coordinates of the cluster centers in each iteration.
– Do not include scanned pictures. You might use annotation or image processing
tools such as Mac Preview or Microsoft Paint to annotate the file.

– Do not show numerical distances in each iteration.
• In sub-question c, any reasonable method with brief explanation is acceptable.
We will use the same data as in question 4 (Figure 1).

a. (L2, 4’) Perform k-means using Euclidean distance with k = 2 and the initial cluster
centers P1 and P2.

b. (L2, 2’) Perform k-means using Euclidean distance with k = 2 and the initial cluster
centers P2 and P12.

c. (L3, 2’) As you can see in sub-questions a and b, the choice of initial cluster centers
can affect the speed of the clustering process. Assuming k = 2 and 2-dimensional
data, suggest a general and reasonable way to select initial cluster centers to
speed up the clustering process. Briefly justify your choice.
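For checking your hand-worked iterations, Lloyd's algorithm on the Figure 1 points can be sketched as follows. The initial centers match sub-question b; the sketch assumes no cluster ever becomes empty, which holds for this data:

```python
import math

# The 13 points from Figure 1 (P1 ... P13)
points = [(1, 3), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2), (4, 3),
          (6, 3), (4, 5), (5, 4), (5, 5), (6, 4), (6, 5)]

def kmeans(centers, max_iter=100):
    """Lloyd's algorithm; assumes no cluster ever becomes empty."""
    centers = [tuple(map(float, c)) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[i].append(p)
        # Update step: move each center to the mean of its group
        new = [tuple(sum(x) / len(g) for x in zip(*g)) for g in groups]
        if new == centers:  # converged: assignments can no longer change
            break
        centers = new
    return groups, centers

# Sub-question b: initial centers P2 = (1, 2) and P12 = (6, 4)
groups, centers = kmeans([(1, 2), (6, 4)])
print([len(g) for g in groups])
print(centers)
```

Printing the centers after each update step gives the per-iteration coordinates the requirements ask for.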

6 Machine Problem (50 points)

Purposes
• Get deeper understanding and working experience of DBSCAN algorithm and B-Cubed
evaluation through implementation and result visualization.
• Get hands-on experience with cluster analysis using Weka.

General Requirements
• This is the second and the last MP (not a mini one) we have in this course. It consists
of several programming tasks, which usually take more time than written assignments,
so please start early.

• This is an individual assignment, which means it is OK to discuss with your classmates
and TAs regarding the methods, but it is not OK to work together or share code.

• Libraries or programs for cluster analysis algorithms can be found online, but you are
prohibited from using these resources directly. This means you cannot include external
libraries or modify existing programs, since the purpose of this MP is to help you go
through DBSCAN step by step.

• You can use Java/C++/Python as your programming language. No other languages
are allowed.

• Each sub-question will contain detailed requirements. Please note that most of them
require you to not only submit code but also write discussions, report results, or draw
graphs. If you submit code without answering those questions, you will receive 0 points.

• Put all your code in a separate folder with the name NetId_assign5_codes. Do
not use sub-folders inside this folder. All of your code should compile successfully
before submission. Do not include files other than the code you write. Put
a single readme.txt file in the code folder to briefly describe your code and how to
run it.

• Put the results you generate in another folder with the name NetId_assign5_results.
• Compress folders NetId_assign5_codes and NetId_assign5_results into a single zip
file, and name it NetId_assign5.zip. Submit both this zip file and the Answer Document
NetId_assign5_answer.pdf through Compass2g. Please do not zip the PDF file!

• If you copy source code from other students or external sources, you will receive a
serious penalty. We will run plagiarism detection software. Please do not cheat.

Description of Data
• You can find data-hw5.zip from the course website. It contains three files: data.txt,
data.arff, and truth.txt.

• The description of data.txt is as follows:
– The first line contains an integer n as the number of data points.
– Each of the n following lines corresponds to a 2-dimensional data point, which
contains two floating-point numbers as the coordinates of a point. They are
separated by a comma.

• File data.arff is a Weka-friendly version of data.txt.
• File truth.txt is the ground truth of clustering for the data in data.txt. It is hand-labeled by domain experts. The description of truth.txt is as follows:
– The first line contains an integer n as the number of data points.
– The second line contains an integer m as the number of clusters.

– Each of the n following lines corresponds to a 2-dimensional data point in data.txt,
which contains three numbers, where the first two are the coordinates of the point,
and the last one is an integer c ∈ {1, …, m} indicating that the point belongs to
cluster c, with c = 0 if the point is an outlier. The three numbers are separated by
commas. Please note that the name/index of each cluster is unimportant. It is just
a notation to indicate which points belong to the same cluster.

Step 0: (0’) Normalizing Data.
The first step you need to do is preprocess the data. In this assignment, you should
min-max normalize each dimension of the data. The formula is as follows:

x_normalized = (x − min_X) / (max_X − min_X)
We also provide you with the min-max-normalized data in file data-normalized-hw5.zip
from the course website.
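The formula above can be sketched in a few lines of Python. Note that a constant dimension (max_X = min_X) would divide by zero; that case is not handled in this sketch:

```python
def min_max_normalize(rows):
    """Scale each dimension to [0, 1] via (x - min_X) / (max_X - min_X)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    # A constant dimension (max_X == min_X) would divide by zero;
    # that case is not handled here
    return [tuple((x - l) / (h - l) for x, l, h in zip(row, lo, hi))
            for row in rows]

print(min_max_normalize([(1.0, 10.0), (3.0, 20.0), (5.0, 40.0)]))
```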

Step 1: (23’, L3) Implementing DBSCAN.
In this step, you need to implement DBSCAN:
• Name: dbscan.java, or dbscan.py, or dbscan.cpp

• Input
– Dataset file, e.g., /home/mike/mike_assign5/mike_assign5_data/data.txt
– Output file, e.g., /home/mike/mike_assign5/mike_assign5_results/step1.txt
– Parameter MinPts, e.g., 25
– Parameter ε, e.g., 0.065
• The structure of the output file:
– The first line contains MinPts
– The second line contains ε
– The third line contains the number of clusters

– Each of the n following lines corresponds to a 2-dimensional data point in
data.txt, which contains three numbers, where the first two are the coordinates of the point, and the last one is an integer c ∈ {1, …, m} if the point
belongs to cluster c, or c = 0 if the point is an outlier. The three numbers
are separated by comma characters.
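The core clustering loop (without the required file I/O) can be sketched as below. This is a naive O(n²) version with linear-scan neighborhood queries, and the expansion order is one of several equivalent formulations of the textbook algorithm; the sample data is illustrative, not data.txt:

```python
import math

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; returns one label per point (0 = outlier/noise)."""
    n = len(points)

    def neighbors(i):
        # Linear scan; the eps-neighborhood includes the point itself
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None = not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = 0  # noise (may later become a border point)
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == 0:  # previously noise: claim it as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:  # j is also core: keep expanding
                queue.extend(js)
    return labels

# Two dense groups and one isolated point (illustrative data)
pts = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (99, 99)]
print(dbscan(pts, eps=0.5, min_pts=3))  # [1, 1, 1, 2, 2, 2, 0]
```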

As it is hard for humans to interpret the output, you need to visualize it. In particular,
do a scatterplot for all the data points, and color the points according to the clustering.
For example, all outlier points are black, all the points belonging to cluster 1 are red, all
the points belonging to cluster 2 are blue, etc. You can create the plot using MATLAB,
Excel or any tools you like. You do not need to include the code for creating the graph.

Requirements:
• Put your DBSCAN source code in folder NetId_assign5_codes
• Run your program on the given data file with MinPts = 25 and ε = 0.065 so that
we can check the correctness of your code by looking at the output file. Name
the output file step1.txt, and put it in folder NetId_assign5_results.
• Include in your Answer Document two scatterplots, one for your output and one
for the ground truth.

• Question to ponder A: How long does it take for your machine to run DBSCAN
on the given data? What is the time complexity of DBSCAN in the worst case
scenario? Can you describe a worst case scenario?

• Question to ponder B: Usually, a good clustering algorithm should put similar
points in the same cluster, and dissimilar points in different clusters. Based on
that intuition, what do you think of the clustering result? Given the same MinPts,
should we increase or decrease ε to get more intuitive clustering results?

Step 2: (10’, L3) Parameter tuning for DBSCAN.
Given MinPts = 25, find two new values of ε such that the numbers of clusters in the
corresponding results are different from each other and from the result for the ε in Step 1.

Requirements:
• Run DBSCAN twice with the two ε values described above, and create two output
files with a format similar to step1.txt. Name them step2a.txt and
step2b.txt, and put them in folder NetId_assign5_results
• Create a table in the Answer Document to report the number of clusters and the
number of outliers corresponding to each of the two new ε values and the ε in Step 1
(your table should have three rows excluding the header).
• Draw two scatter plots corresponding to the two new ε values. Include them in the
Answer Document.

• Question to ponder C: Which of the three ε values is the best? Could you suggest a
general way to tune the parameters MinPts and ε for DBSCAN?

Step 3: (7’, L3) Implementing B-Cubed evaluation.
In this step, you need to implement B-Cubed evaluation:
• Name: bcubed.java, or bcubed.py, or bcubed.cpp

• Input: two parameters:
– Clustering output file, e.g., /home/mike/mike_assign5/mike_assign5_results/step1.txt
– Ground truth file, e.g., /home/mike/mike_assign5/mike_assign5_data/truth.txt
• Output: Two lines in stdout
– The first line contains precision
– The second line contains recall
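The B-Cubed computation itself can be sketched as follows. Note that how outliers (label 0) should be treated is a convention choice; the sketch below simply treats 0 like any other cluster label, which may differ from the intended grading convention:

```python
def bcubed(pred, truth):
    """B-Cubed precision/recall for two parallel lists of cluster labels."""
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        out = {j for j in range(n) if pred[j] == pred[i]}     # i's output cluster
        true = {j for j in range(n) if truth[j] == truth[i]}  # i's true cluster
        both = len(out & true)
        precision += both / len(out)   # correct fraction of i's output cluster
        recall += both / len(true)     # correct fraction of i's true cluster
    return precision / n, recall / n

# Toy check: one true cluster split in two by the output
print(bcubed([1, 1, 2, 2], [1, 1, 1, 1]))  # (1.0, 0.5)
```

The example shows the expected asymmetry: splitting a true cluster hurts recall but leaves precision perfect.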

Requirements:
• Put your B-Cubed source code in folder NetId_assign5_codes
• Run B-Cubed with the outputs from Step 1 and Step 2. Create a table to report
the precision and recall corresponding to each of the outputs.
• Question to ponder D: Can you suggest a way to combine B-Cubed precision and
recall into a single measure so that it would be easier to compare different results?

Step 4: (10’, L2) Doing cluster analysis in Weka.
In class, we have demonstrated how to do this. You can watch the online lecture video
to review the process. We also provide the procedure below.
(a) Open Weka.
(b) Choose Explorer.
(c) Click on the Preprocess tab.

(d) Click on Open file, choose data.arff, click on Choose, and wait until Weka
finishes loading the data.
(e) Click on the Cluster tab. Here you can find common algorithms for cluster analysis,
such as OPTICS, DBSCAN, etc.

Requirements:
• Run SimpleKMeans with different numbers of clusters. Report the best number
of clusters, and the plot showing the corresponding clustering result. You can
generate the plot by right-clicking on the entry in the Result list and choosing
Visualize cluster assignments.

• Run DBSCAN with the best MinPts and ε from Step 2. Show the plot for the
corresponding clustering result.
• Question to ponder E: Compare the results from k-Means and DBSCAN. Is k-Means
or DBSCAN more suitable for the provided dataset? Why?