ISYE 6740 Homeworks 1 to 4 solution


ISYE 6740 Homework 1

1 Clustering. [100 points total. Each part is 25 points.]
[a-b] Given N data points x^n (n = 1, . . . , N), the K-means clustering algorithm groups them into K clusters by minimizing the distortion function over {r^{nk}, \mu^k}:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r^{nk} \|x^n - \mu^k\|^2,

where r^{nk} = 1 if x^n belongs to the k-th cluster and r^{nk} = 0 otherwise.
(a) Prove that using the squared Euclidean distance \|x^n - \mu^k\|^2 as the dissimilarity function and minimizing the distortion function, we will have

\mu^k = \frac{\sum_n r^{nk} x^n}{\sum_n r^{nk}}.

That is, \mu^k is the center of the k-th cluster. [5 pts]
(b) Prove that K-means algorithm converges to a local optimum in finite steps. [5 pts]
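To make the notation concrete, here is a minimal Python/NumPy sketch (illustrative only, not part of the assignment) of the alternating updates that decrease the distortion function J: the assignment step sets r^{nk} and the update step recomputes each \mu^k as the cluster mean. Function and variable names are our own.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is (N, d); returns cluster labels and centers."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]             # initial centers
    for _ in range(n_iter):
        # assignment step: r_nk = 1 for the nearest center, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # update step: mu_k becomes the mean of the points assigned to cluster k
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                               # stop once the centers no longer move
            break
        mu = new_mu
    return labels, mu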
[c-d] In class, we discussed bottom-up hierarchical clustering. For each iteration, we need to find two clusters {x_1, x_2, . . . , x_m} and {y_1, y_2, . . . , y_p} with the minimum distance to merge. Some of the most commonly used distance metrics between two clusters are:
• Single linkage: the minimum distance between any pair of points from the two clusters, i.e.,

\min_{i=1,\dots,m,\; j=1,\dots,p} \|x_i - y_j\|

• Complete linkage: the maximum distance between any pair of points from the two clusters, i.e.,

\max_{i=1,\dots,m,\; j=1,\dots,p} \|x_i - y_j\|

• Average linkage: the average distance between all pairs of points from the two clusters, i.e.,

\frac{1}{mp} \sum_{i=1}^{m} \sum_{j=1}^{p} \|x_i - y_j\|
(c) When we use bottom-up hierarchical clustering to partition the data, which of the three cluster distance metrics described above would most likely result in clusters most similar to those given by K-means? (Suppose K is a power of 2 in this case.) [5 pts]
(d) For the following data (two moons), which of these three distance metrics (if any) would
successfully separate the two moons? [5 pts]
[Figure: scatter plot of the two-moons dataset.]
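For part (d), one quick way to experiment (a sketch assuming scikit-learn and matplotlib are available; it is not a required deliverable) is to generate a two-moons sample and compare the three linkages side by side:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # synthetic two-moons data
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, linkage in zip(axes, ["single", "complete", "average"]):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)
    ax.set_title(linkage + " linkage")
plt.show()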

ISYE 6740 Homework 2

1 Image compression using clustering [40 points]
In this programming assignment, you are going to apply clustering algorithms to image compression. To ease your implementation, we provide skeleton code containing the image processing part. homework2.m is designed to read an RGB bitmap image file and then cluster its pixels with a given number of clusters K. It shows the converted image using only K colors, each being the representative color of its centroid. To see what it looks like, you are encouraged to run, for example, homework2('beach.bmp', 3) or homework2('football.bmp', 2).
Your task is to implement the clustering part with two algorithms: K-means and K-medoids. We learned and demonstrated K-means in class, so you may start from the sample code we distributed. The files you need to edit are mykmeans.m and mykmedoids.m, provided with this homework. In these files, you can see that they initially call the Matlab function kmeans. Comment that line out and implement your own version. You should expect to see a similar result from your implementation of K-means as from the kmeans function in Matlab.
K-medoids
In class, we learned that basic K-means works in Euclidean space, both for computing distances between data points and for updating centroids by the arithmetic mean. Sometimes, however, a dataset may work better with other distance measures. It can even be impossible to compute an arithmetic mean when a feature is categorical, e.g., the gender or nationality of a person. With K-medoids, you choose a representative data point for each cluster instead of computing the cluster averages.
Given N data points x^n (n = 1, . . . , N), the K-medoids clustering algorithm groups them into K clusters by minimizing the distortion function

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r^{nk} D(x^n, \mu^k),

where D(x, y) is a distance measure between two vectors x and y of the same size (in the case of K-means, D(x, y) = \|x - y\|^2), \mu^k is the center of the k-th cluster, and r^{nk} = 1 if x^n belongs to the k-th cluster and r^{nk} = 0 otherwise. In this exercise, we will use the following iterative procedure:
• Initialize the cluster centers \mu^k, k = 1, . . . , K.
• Iterate until convergence:
  – Update the cluster assignment for every data point x^n: r^{nk} = 1 if k = arg min_j D(x^n, \mu^j), and r^{nk} = 0 otherwise.
  – Update the center of each cluster k: choose another representative if necessary.
There can be many ways to implement this procedure; for example, you can try other distance measures in addition to Euclidean distance, and you can be creative in deciding a better representative for each cluster. We will not restrict these choices in this assignment. You are encouraged to try several distance measures as well as ways of choosing representatives; a minimal sketch of one possible approach is given below.
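As an illustration only (the deliverables are the Matlab files mykmeans.m and mykmedoids.m), here is a hedged Python/NumPy sketch of one K-medoids variant: Euclidean distance, with each medoid updated to the assigned point that minimizes the within-cluster distance sum. Names are our own, and the all-pairs distance in the update step is only practical for modest cluster sizes.

import numpy as np

def kmedoids(pixels, K, max_iter=50, seed=0):
    """One possible K-medoids variant: representatives are actual data points."""
    X = np.asarray(pixels, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest medoid
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)   # (N, K)
        labels = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) == 0:
                continue   # empty cluster: keep the old medoid instead of terminating
            # update step: pick the member with the smallest total distance to the others
            within = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).sum(axis=1)
            new_medoids[k] = members[within.argmin()]
        if np.allclose(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels + 1, medoids   # 1-based labels, matching the formatting instructions below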
Formatting instruction
Both mykmeans.m and mykmedoids.m take the following input and output format. You should not alter this interface; otherwise your submission will produce an error, which leads to zero credit.
Input
• pixels: the input image representation. Each row contains one data point (pixel). For an image dataset, there are 3 columns, corresponding to the Red, Green, and Blue components. Each component has an integer value between 0 and 255.
• K: the number of desired clusters. Too high a value of K may result in an empty-cluster error; in that case, reduce K.
Output
• class: the cluster assignment of each data point in pixels. The assignments should be 1, 2, 3, etc. For K = 5, for example, each entry of class should be 1, 2, 3, 4, or 5. The output should be a column vector with size(pixels, 1) elements.
• centroid: the locations of the K centroids (or representatives) in your result. With images, each centroid corresponds to the representative color of its cluster. The output should be a matrix with K rows and 3 columns. The range of values should be [0, 255], possibly floating-point numbers.
Hand-in
Both your code and your report will be evaluated. Upload the code for your implementation. In your report, answer the following questions:
1. Within the K-medoids framework, you have several choices for the detailed implementation. Explain how you designed and implemented the details of your K-medoids algorithm, including (but not limited to) how you chose the representative of each cluster, which distance measures you tried and which one you chose, and when you stopped iterating.
2. Attach a picture of your own. We recommend a size of 320 × 240 or smaller.
3. Run your K-medoids implementation with the picture you chose above, with several different values of K (e.g., small values like 2 or 3, and large values like 16 or 32). What did you observe with different K? How long does it take to converge for each K?
4. Run your K-medoids implementation with different initial centroids/representatives. Does it affect the final result? Do you see the same or different results for each trial with different initial assignments? (We usually randomize the initial locations of centroids. To answer this question, an intentionally poor assignment may be useful.)
5. Repeat questions 2 and 3 with K-means. Do you see a significant difference between K-medoids and K-means, in terms of output quality, robustness, or running time?
Note
• You may see an error message about empty clusters, even with the Matlab implementation, when you use too large a K. Your implementation should handle this exception as well; that is, do not terminate even if you end up with an empty cluster, but use a smaller number of clusters in that case.
• We will grade using test pictures that are not provided. We recommend testing your code with several different pictures so that you can detect problems that might happen occasionally.
• If we detect copying from any other student's code or from the web, you will not be eligible for any credit for the entire homework, not just the programming part. Also, directly calling the Matlab function kmeans or other clustering functions is not allowed.
2 Spectral clustering [40 points]
1. Consider an undirected graph with non-negative edge weights w_{ij} and graph Laplacian L. Suppose there are m connected components A_1, A_2, . . . , A_m in the graph. Show that there are m eigenvectors of L corresponding to eigenvalue zero, and that the indicator vectors of these components, I_{A_1}, . . . , I_{A_m}, span the zero eigenspace.
2. Real data: political blogs dataset. We will study a political blogs dataset first compiled for the
paper Lada A. Adamic and Natalie Glance, “The political blogosphere and the 2004 US Election”, in
Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). The dataset nodes.txt
contains a graph with n = 1490 vertices (“nodes”) corresponding to political blogs. Each vertex has a
0-1 label (in the 3rd column) corresponding to the political orientation of that blog. We will treat this as the true label and try to reconstruct it from the graph using spectral clustering. The dataset edges.txt contains the edges between the vertices.
Here we assume the number of clusters to be estimated is k = 2. Use spectral clustering to find the 2 clusters. Compare the clustering results with the true labels. What is the false classification rate (the percentage of nodes that are classified incorrectly)?
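One possible pipeline, sketched in Python/NumPy (the exact column layout of nodes.txt and edges.txt, the 1-based node ids, and the use of the unnormalized Laplacian are assumptions you may need to adapt):

import numpy as np
from sklearn.cluster import KMeans

# Build the adjacency matrix from the edge list (assumed format: two node ids per line).
edges = np.loadtxt("edges.txt", dtype=int)
n = 1490
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = 1   # assuming 1-based node ids
    A[j - 1, i - 1] = 1

# Unnormalized graph Laplacian L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A

# The k eigenvectors of L with smallest eigenvalues embed the nodes; cluster them with k-means.
k = 2
eigvals, eigvecs = np.linalg.eigh(L)
Z = eigvecs[:, :k]
pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)

# Compare with the true labels (assumed layout of nodes.txt: id, blog, label, ...),
# remembering that cluster ids may be flipped relative to the 0-1 labels.
true = np.loadtxt("nodes.txt", dtype=str)[:, 2].astype(int)
err = min(np.mean(pred != true), np.mean(pred == true))
print("false classification rate:", err)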
3 PCA: Food consumption in European area [20 points]
The data food-consumption.csv contains 16 countries in the European area and their consumption for 20 food
items, such as tea, jam, coffee, yoghurt, and others. There are some missing data entries: you may remove
the rows “Sweden”, “Finland”, and “Spain”.
Implement PCA by writing your own code.
1. Find the first two principal directions, and plot them.
2. Compute the reduced representation of the data points (sometimes called the principal components of the data). Draw a scatter plot of the two-dimensional reduced representation for each country. Do you observe any pattern?
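A minimal sketch of PCA written from scratch (Python/pandas/NumPy assumed; the column layout of food-consumption.csv, with countries as the first column, is an assumption to check):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("food-consumption.csv", index_col=0)            # countries as row labels (assumption)
df = df.drop(index=["Sweden", "Finland", "Spain"], errors="ignore")
X = df.to_numpy(dtype=float)

# Center the data, then take the top-2 right singular vectors (principal directions).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
directions = Vt[:2]            # first two principal directions
scores = Xc @ directions.T     # two-dimensional reduced representation

plt.scatter(scores[:, 0], scores[:, 1])
for name, (pc1, pc2) in zip(df.index, scores):
    plt.annotate(name, (pc1, pc2))
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()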

ISYE 6740 Homework 3

1. Density estimation: Psychological experiments. (50 points)
The data set n90pol.csv contains information on 90 university students who participated in a psychological experiment designed to look for relationships between the size of different regions of the brain
and political views. The variables amygdala and acc indicate the volume of two particular brain regions
known to be involved in emotions and decision-making, the amygdala and the anterior cingulate cortex;
more exactly, these are residuals from the predicted volume, after adjusting for height, sex, and similar
body-type variables. The variable orientation gives the students’ locations on a five-point scale from 1
(very conservative) to 5 (very liberal).
(a) Form a 2-dimensional histogram for the pair of variables (amygdala, acc). Decide on a suitable number of bins so you can see the shape of the distribution clearly.
(b) Now implement kernel density estimation (KDE) to estimate the two-dimensional density function of (amygdala, acc). Use a simple multi-dimensional Gaussian kernel: for x = (x_1, x_2)^T \in \mathbb{R}^2, where x_1 and x_2 are the two dimensions respectively,

K(x) = \frac{1}{2\pi} e^{-\frac{x_1^2 + x_2^2}{2}}.

Recall that in this case the kernel density estimator (KDE) for a density is given by

p(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{h} K\left(\frac{x^i - x}{h}\right),

where the x^i are two-dimensional vectors and h > 0 is the kernel bandwidth. Set an appropriate h so you can see the shape of the distribution clearly. Plot a contour plot (like the ones in the slides) of your estimated density.
(c) Plot the conditional distribution of the volume of the amygdala as a function of political orientation: p(amygdala|orientation = a), a = 1, . . . , 5. Do the same for the volume of the acc: plot p(acc|orientation = a), a = 1, . . . , 5. You may use either histograms or KDE to achieve the goal.
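For part (b), a hedged sketch of the 2-D Gaussian KDE and its contour plot (Python/NumPy/pandas assumed; the column names amygdala and acc follow the problem statement, and the bandwidth value is an assumption to tune):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("n90pol.csv")
X = data[["amygdala", "acc"]].to_numpy()       # shape (m, 2)
h = 0.01                                       # bandwidth h is an assumption; tune it

def kde(points, X, h):
    """p(x) = (1/m) sum_i (1/h) K((x_i - x)/h) with a standard 2-D Gaussian kernel."""
    diffs = (X[None, :, :] - points[:, None, :]) / h           # (P, m, 2)
    K = np.exp(-0.5 * (diffs ** 2).sum(axis=2)) / (2 * np.pi)  # kernel values, (P, m)
    return K.mean(axis=1) / h

# evaluate the density on a grid and draw contours
g1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
g2 = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
G1, G2 = np.meshgrid(g1, g2)
grid = np.column_stack([G1.ravel(), G2.ravel()])
Z = kde(grid, X, h).reshape(G1.shape)

plt.contour(G1, G2, Z)
plt.xlabel("amygdala"); plt.ylabel("acc")
plt.show()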
2. Implementing the EM algorithm for the MNIST dataset. (50 points)
Implement the EM algorithm for fitting a Gaussian mixture model to the MNIST dataset. We reduce the dataset to only two classes, the digits "2" and "6". Thus, you will fit a GMM with C = 2. Use the data file data.mat or data.dat on Canvas. The true labels of the data are also provided in label.mat and label.dat.
The matrix images is of size 784-by-1990, i.e., there are 1990 images in total, and each column of the matrix corresponds to one image of size 28-by-28 pixels (the image is vectorized; the original image can be recovered, e.g., using the MATLAB code reshape(images(:,1), 28, 28)).
(a) Select from the data one raw image of "2" and one of "6" and visualize them.
(b) Use random Gaussian vectors with zero mean as the initial means, and the identity matrix as the initial covariance matrix for the clusters. Please plot the log-likelihood function versus the number of iterations to show that your algorithm is converging.
(c) Report the final fitted GMM model when EM terminates: the weight of each component and the mean vectors (please reshape the vectors into 28-by-28 images and show these images in your submission). Ideally, you should be able to see that these means correspond to "average" images. No need to report the covariance matrices.
(d) (Optional) Use the posterior probabilities p_{i,c} to infer the labels of the images, and compare with the true labels. Report the misclassification rate for the digits "2" and "6" respectively. Perform K-means clustering with K = 2. Find the misclassification rate for the digits "2" and "6" respectively, and compare with the GMM. Which one achieves the better performance?
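A hedged Python/NumPy sketch of the EM loop for a two-component GMM (illustrative only; in practice the 784-dimensional covariances usually need regularization or a PCA-reduced representation first, and the regularization constant below is an assumption):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, C=2, n_iter=50, reg=1e-2, seed=0):
    """EM for a Gaussian mixture; X is (n, d). Returns weights, means, covariances, posteriors, log-likelihoods."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(C, 1.0 / C)                         # component weights
    mu = rng.normal(size=(C, d))                     # random zero-mean Gaussian initialization
    Sigma = np.array([np.eye(d) for _ in range(C)])  # identity initial covariances
    log_liks = []
    for _ in range(n_iter):
        # E-step: posteriors p_{i,c} proportional to pi_c * phi(x_i | mu_c, Sigma_c)
        logp = np.column_stack([
            np.log(pi[c]) + multivariate_normal.logpdf(X, mu[c], Sigma[c], allow_singular=True)
            for c in range(C)])
        log_norm = np.logaddexp.reduce(logp, axis=1)
        P = np.exp(logp - log_norm[:, None])         # (n, C) posteriors
        log_liks.append(log_norm.sum())              # track log-likelihood per iteration
        # M-step: closed-form updates for weights, means, covariances
        Nc = P.sum(axis=0)
        pi = Nc / n
        mu = (P.T @ X) / Nc[:, None]
        for c in range(C):
            diff = X - mu[c]
            Sigma[c] = (P[:, c, None] * diff).T @ diff / Nc[c] + reg * np.eye(d)
    return pi, mu, Sigma, P, log_liks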

ISYE 6740 Homework 4

1. Basic optimization. (50 points.)
The background of logistic regression will be discussed in the next lecture. Here, we just focus on the properties of the optimization problem related to training a logistic regression model.
Consider a simplified logistic regression problem. Given n training samples (x_i, y_i), i = 1, . . . , n, the data x_i \in \mathbb{R} and y_i \in \{0, 1\}. To train/fit a logistic regression model for classification, we solve the following optimization problem, where \theta \in \mathbb{R} is the parameter we aim to find:

\max_{\theta} \; \ell(\theta),   (1)

where the log-likelihood function is

\ell(\theta) = \sum_{i=1}^{n} \{ -\log(1 + \exp(-\theta x_i)) + (y_i - 1)\theta x_i \}.
(a) (15 points) Derive the gradient of the cost function \ell(\theta) in (1) and write pseudo-code for performing gradient descent to find the optimizer \theta^*. This is essentially what the training procedure does. Pseudo-code means you write down the steps of the algorithm, not necessarily in any specific programming language.
(b) (15 points) Present a stochastic gradient descent algorithm to solve the training of the logistic regression problem (1).
(c) (20 points) We will show that the training problem in basic logistic regression is concave. Derive the Hessian matrix of \ell(\theta) and, based on this, show that the training problem (1) is concave. Explain why the problem can be solved efficiently and why gradient descent will achieve a unique global optimizer, as we discussed in class.
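As a hedged illustration of what the pseudo-code in (a) might compute (not a substitute for the derivation), here is a small Python sketch of gradient ascent on \ell(\theta), equivalent to gradient descent on -\ell(\theta). It assumes the gradient works out to \sum_i x_i (y_i - \sigma(\theta x_i)) with \sigma the sigmoid; verify that as part of your answer. A stochastic variant (part (b)) would replace the full sum with one randomly sampled term per step.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(x, y, lr=0.01, n_iter=2000):
    """Gradient ascent on the concave log-likelihood l(theta); x, y are 1-D arrays."""
    theta = 0.0
    for _ in range(n_iter):
        grad = np.sum(x * (y - sigmoid(theta * x)))   # assumed gradient of l(theta); verify in (a)
        theta += lr * grad                            # ascent step, since we maximize l
    return theta

# illustrative toy data
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (rng.random(100) < sigmoid(2.0 * x)).astype(float)
print(train_logreg(x, y))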
2. Comparing Bayes, logistic, and KNN classifiers. (50 points)
In lectures we learned about three different classifiers. This question asks you to implement and compare them. We suggest using scikit-learn, a commonly used and powerful Python library with various machine learning tools. However, you can also use a similar library in another language of your choice to perform the tasks.
Part One (Divorce classification/prediction). (30 points)
This dataset is about participants who completed a personal information form and a divorce predictors scale. The data is a modified version of the dataset publicly available at https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set (noise has been injected so you will not replicate the results on the UCI website). There are 170 participants and 54 attributes (or predictor variables), all real-valued. The dataset is q3.csv. The last column of the CSV file is the label y (1 means "divorce", 0 means "no divorce"). Each column is one feature (predictor variable), and each row is a sample (participant). A detailed explanation of each feature (predictor variable) can be found at the website linked above. Our goal is to build a classifier using training data such that, given a test sample, we can classify (or essentially predict) whether its label is 0 ("no divorce") or 1 ("divorce").
Build three classifiers (Naive Bayes, Logistic Regression, KNN). Use the first 80% of the data for training and the remaining 20% for testing. If you use scikit-learn, you can use train_test_split to split the dataset.
(a) Report the testing accuracy for each of the three classifiers (a minimal scikit-learn sketch is given after part (b)). Comment on their performance: which performs best, and make a guess as to why it performs best in this setting.
(b) Use the first two features to train three new classifiers. Plot the data points and the decision boundary of each classifier. Comment on the differences between the decision boundaries of the three classifiers. Please clearly represent the data points with different labels using different colors.
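A hedged sketch of part (a) using scikit-learn (the file name q3.csv follows the problem statement; the default hyperparameters are assumptions to revisit):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("q3.csv")
X, y = data.iloc[:, :-1].to_numpy(), data.iloc[:, -1].to_numpy()  # last column is the label

# first 80% for training, last 20% for testing (shuffle=False keeps the original order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))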
Part Two (Handwritten digits classification). (20 points) Repeat the above using the MNIST data from Homework 3. Here, give "digit" 6 the label y = 1 and "digit" 2 the label y = 0. All the pixels in each image will be the features (predictor variables) for that sample (i.e., image). Our goal is to build a classifier such that, given a new test sample, we can tell whether it is a 2 or a 6. Use the first 80% of the samples for training and the remaining 20% for testing. Report the classification accuracy on the testing data for each of the three classifiers. Comment on their performance: which performs best, and make a guess as to why it performs best in this setting.
3. Deriving M-step in EM for GMM. (Bonus question: 10 points)
Consider the Q function in EM for GMM:

Q(\theta|\theta^k) = \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \pi_c + \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \phi(x_i | \mu_c, \Sigma_c),

where \theta = \{\pi_c, \mu_c, \Sigma_c\}_{c=1,\dots,C}, and p_{i,c} \propto \pi_c^{(k)} \phi(x_i | \mu_c^{(k)}, \Sigma_c^{(k)}), i = 1, . . . , n, c = 1, . . . , C, depend on the parameters from the previous iteration. Here \phi(x|\mu, \Sigma) denotes the pdf of a multi-variate Gaussian with mean vector \mu and covariance matrix \Sigma.
Solve for the \pi_c, \mu_c, and \Sigma_c that maximize the Q(\theta|\theta^k) function. In other words, we want to find

\theta^{k+1} := \arg\max_{\theta} Q(\theta|\theta^k) \quad \text{subject to} \quad \sum_{c=1}^{C} \pi_c = 1.
Show that the solution \theta^{k+1} is given by the following expressions:

\pi_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c}}{n},

\mu_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c} x_i}{\sum_{i=1}^{n} p_{i,c}},

\Sigma_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c} (x_i - \mu_c^{(k)})(x_i - \mu_c^{(k)})^T}{\sum_{i=1}^{n} p_{i,c}}.
(Hint: Consider the Lagrangian function for solving this constrained optimization problem. You only
need to introduce one Lagrangian multiplier because you only have one constraint. Then solve it from
there.)
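As a hedged illustration of the hinted approach (only the starting step for the \pi_c sub-problem, which you still need to justify and finish), the Lagrangian might be set up as

% only the terms of Q involving \pi_c matter for this sub-problem (an assumption to justify)
\mathcal{L}(\pi, \lambda) = \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \pi_c + \lambda \Big( \sum_{c=1}^{C} \pi_c - 1 \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \pi_c} = \frac{\sum_{i=1}^{n} p_{i,c}}{\pi_c} + \lambda = 0 .

Solving for \pi_c, enforcing \sum_c \pi_c = 1, and using \sum_c p_{i,c} = 1 determines \lambda and recovers the stated \pi_c^{(k+1)}; the \mu_c and \Sigma_c updates follow from unconstrained maximization.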
4. Naive Bayes for spam filtering. (Bonus question: 20 points)
In this problem we will use the Naive Bayes algorithm to fit a spam filter by hand. This will enhance your understanding of the Bayes classifier and build intuition.
Spam filters are used in all email services to classify received emails as "Spam" or "Not Spam". A simple approach involves maintaining a vocabulary of words that commonly occur in "Spam" emails and classifying an email as "Spam" if the number of words from the dictionary that are present in the email is over a certain threshold. We are given a vocabulary consisting of 15 words
V = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}.
We will use V_i to represent the i-th word in V. As our training dataset, we are also given 3 example spam messages,
• million dollar offer
• secret offer today
• secret is secret
and 4 example non-spam messages
• low price for valued customer
• play secret sports today
• sports is healthy
• low price pizza
Recall that the Naive Bayes classifier assumes the probability of an input x = [x_1, x_2, . . . , x_n]^T depends on its class y. In our case the input vector x corresponding to each message has length n = 15, equal to the number of words in the vocabulary V, where each entry x_i is equal to the number of times word V_i occurs in x.
(a) Calculate P(y = 0) and P(y = 1) from the training data, where y = 0 corresponds to spam
messages, and y = 1 corresponds to non-spam messages.
(b) List the feature vector x for each spam and non-spam message.
(c) In the Naive Bayes model, the likelihood of a sentence with feature vector x given a class c is

P(x|y = c) = \prod_{k=1}^{n} \theta_{c,k}^{x_k},

where \theta_{c,k} \in (0, 1) is the weight of word k in class c, which satisfies \sum_{k=1}^{n} \theta_{c,k} = 1, \forall c. Calculate the maximum likelihood estimates of \theta_{0,1}, \theta_{0,7}, \theta_{1,1}, \theta_{1,15} by maximizing P(x|y = c) with respect to \theta_{c,k} and the given data. (Hint: Consider the Lagrangian function for solving this constrained optimization problem. You only need to introduce one Lagrangian multiplier because you only have one constraint. Consider the log-likelihood (i.e., take the log of the cost function). Then solve it from there.)
(d) Given a new message “today is secret”, decide whether it is spam or not spam, based on the Naive
Bayes classifier, learned from the above data.
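As a hedged illustration of the counting involved (a Python sketch only; it does not replace the by-hand derivation the question asks for), one might tabulate the feature vectors, class priors, and per-class word frequencies like this:

from collections import Counter

V = ["secret", "offer", "low", "price", "valued", "customer", "today", "dollar",
     "million", "sports", "is", "for", "play", "healthy", "pizza"]
spam = ["million dollar offer", "secret offer today", "secret is secret"]
ham = ["low price for valued customer", "play secret sports today",
       "sports is healthy", "low price pizza"]

def feature_vector(msg):
    """x_k = number of times word V_k occurs in the message."""
    counts = Counter(msg.split())
    return [counts[w] for w in V]

# class priors P(y = 0) (spam) and P(y = 1) (non-spam)
n_total = len(spam) + len(ham)
print("P(y=0) =", len(spam) / n_total, " P(y=1) =", len(ham) / n_total)

# MLE of theta_{c,k}: fraction of word occurrences in class c that are word k
for c, msgs in enumerate([spam, ham]):
    total = [sum(col) for col in zip(*[feature_vector(m) for m in msgs])]
    theta = [t / sum(total) for t in total]
    print("class", c, {w: round(t, 3) for w, t in zip(V, theta)})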