COMPSCI527-Homework3 solution

$29.99

Original Work ?

Download Details:

  • Name: hw3.zip
  • Type: zip
  • Size: 1.32 MB

Category: You will Instantly receive a download link upon Payment||Click Original Work Button for Custom work

Description

5/5 - (2 votes)

Forthisassignmentyouwillneedthefiledata.matandafew MATLAB functions,allavailableinazipfileonthehomeworkpage. 1. The K-means algorithm is described in Section 13.4.4 of the textbook. It takes I data points Z = {z1,…,zI}in Rd and K points m1,…,mK inRd and computes a local minimum ˆ S of the function
f(S) =
K X k=1
min µk∈Rd X z∈Sk
kz−µkk2
where S = {S1,…,SK}varies over the setS of all partitions of Z. The partition ˆ S is a local minimum in the sense that no move of a singlepoint zi fromonesetof ˆ S toanotheryieldsavalueof f thatislowerthan f(ˆ S),nordoesanyinfinitesimalchangeoftheresulting centroids µ1,…,µK. See algorithm 1 for a pseudo-code listing of the K-means algorithm.
Algorithm1. The K-means algorithm Input: z1,…,zI,m1,…,mK for k = 1,…,K do µk ← mk . Initialize the means endfor fc ←∞ . fc will become the current value of f(S) repeat for k = 1,…,K do Sk ←{z | z ∈ Z andkz−µkk2 ≤kz−µ`k2 for ` 6= k} . Partition the points in Z by their closest mean endfor for k = 1,…,K do if Sk 6= ∅then µk ← 1 |Sk|Pz∈Sk z . µk is the centroid of Skendif endfor fp ← fc . Remember the previous value of f(S) fc ← f({S1,…,SK}) . Compute the current value of f(S)until fp ≤ fc + τ . Stop if improvement goes below threshold τOutput: µ1,…,µK,S1,…,SK
(a) Implement from scratch a MATLAB function
function [M, R] = kmeans(Z, M) that takes a d×I matrix Z whose columns are the points in Z and a d×K matrix M whose columns are the initial means mk and runs the K-means algorithm. The output d×K matrix M has the final means µk as its columns, and the K ×I logical matrix R has entries R(k,i) = true if zi ∈ Sk false otherwise
COMPSCI 527 — Duke — October 20, 2014
Use a termination threshold (see theuntilclause in algorithm 1)
τ = sqrt(eps)
where eps is a predefined constant in MATLAB. [Warning: If you compare this matrix with what the textbook calls the responsibilities in chapter 7, then your matrix is the transpose of that: R(k,i) = rik.] Hand in your implementation of kmeans and a plot of the result of running
load data [M, R] = kmeans(blobs, M0); showClusters(blobs, M, R, M0, 1, ’blobsKmeans’);
The function showClusters is provided with this assignment and will produce a PDF file named blobsKmeans.pdf. Also turn in the values in M. (b) Run your code on the cigars data in data.mat using M0(:, 1:2) to initialize. Show the output matrix M and your results with showClusters. (c) Run your code on the cigars data in data.mat again, this time using M1 (also provided) to initialize. Show the output matrix M and your results with showClusters. (d) Explain in what way the results in your previous two answers show that kmeans does not work well with these data points. In particular, are these shortcomings due to the fact that the K-means algorithm computes a local (as opposed to global) minimum? If so, how? If not, what else is responsible?
(e) The K-meansalgorithmgetstriviallyintroublewhentwoormoreofthecentroidsintheinitialmatrix M0 areequaltoeachother. Inthisway, thecoincidingcentroidsdefineexactlythesamepartition, andtheyneversplitapart. Theresultingbehaviorisexactly the same as if the duplicate means were removed, and the algorithm effectively returns a result for a lower value of K. However, this is not the only way in which poor initialization leads to bad results. Construct a set Z of points on the plane and a partition S = {S1,S2,S3,S4}of Z such that the following properties hold: 1. There are four tight, well-separated clusters C1, C2, C3, C4 in Z, and the cluster centroids are µ∗ 1, µ∗ 2, µ∗ 3, µ∗ 4. 2. If mk is the centroid of Sk, then the set{m1,m2,m3,m4}is different from the set{µ∗ 1,µ∗ 2,µ∗ 3,µ∗ 4}. 3. Ifthe K-meansalgorithmisrunon Z withinitialcentroids M0 = [m1,m2,m3,m4], thenthealgorithmstopsafterasingle iteration of therepeatloop in algorithm 1 with µ1 = m1 , µ2 = m2 , µ3 = m3 , µ4 = m4 (in words, the algorithm makes no progress). Describe your construction clearly and succinctly (a drawing may help), and give the coordinates of the points in S1, S2, S3, S3, of their centroids m1, m2, m3, m4, and of the centroids µ∗ 1, µ∗ 2, µ∗ 3, µ∗ 4 of the clusters C1, C2, C3, C4. If you prefer, give formulas for all these coordinates. The clusters need not be large, but each of them must have more than two points. Explain why the algorithm makes no progress with your input. [Hints: It is possible to construct an example with three clusters, but with four it’s easier to draw. Symmetry helps. There are many possible answers. You may want to use your kmeans function to verify that your construction is correct. However, please do not hand in the result of this check.]
(f) In what way does the construction in the previous question highlight a problem with the K-means algorithm?
2. TheEMdensityestimationalgorithmformixturesofnormaldistributionsisdescribedinsection7.4.2ofthetextbook. Ittakes I data points Z = {z1,…,zI}in Rd; K points m1,…,mK in Rd; and K positive real scalars σ2 1,…,σ2 K and computes a local maximum ˆ θ of the likelihood function
F(θ) =
I Y i=1
p(zi|θ) where p(zi|θ) =
K X k=1
λkNormzi[µk,Σk]
is a mixture of normals and where θ = [λ1,…,λK,µ1,…,µK,Σ1,…,ΣK]T is the parameter vector. The vector ˆ θ is a local maximum in the sense that no infinitesimal change of θ away from ˆ θ produces an increase of F(θ). There is a numerical twist to this algorithm. The likelihood F(θ) is the product of positive terms: all the normals are positive, and notall λk canbe zero(becausetheyadduptoone). However, the values ofthenormals canbe close tozeroif zi isfarfrom µk forall k
COMPSCI 527 — Duke — October 20, 2014
where λk 6= 0. In that case, p(zi|θ) ≈ 0, and if there are enough such small values, their product can cause numerical underflow. That is, the exact value of F(θ) is never zero, but it may bee too small to be stored in a double variable, and the EM algorithm will make no progress as a consequence. To prevent this problem, we instead minimize the negative logarithm of F(θ),
f(θ) = −logF(θ) = −
I X i=1
log
K X k=1
λkNormzi[µk,Σk] .
The minus sign is not important. It is introduced to eliminate the minus signs that arise when taking the logarithm of a Gaussian distribution. In addition, the minus sign makes EM a minimization problem, just like K-means. The important part from a numerical standpoint is the logarithm in the definition of f(θ). The number, say, 10−500 is too small to be stored in a double variable, but its (natural) logarithm−500log10 is about -1,151, and that is not too small. In industrial-quality code,onewouldhavetomakesurethatEMstillworkswheneven f(θ) istoosmall,butwewillnotworryaboutthisinthisassignment. You may assume that a single value of p(zi|θ) can be represented in a double variable, even if their product for i = 1,…,I cannot. See algorithm 2 for a pseudo-code listing of the EM algorithm, cast as a local minimization algorithm.
Algorithm2. The EM algorithm Input: z1,…,zI,m1,…,mK,σ2 1,…,σ2 K for k = 1,…,K do λk ← 1/K . Initialize the mixture coefficients to an uninformative prior µk ← mk . Initialize the means Σk ← σ2 k Id . Initialize the covariances. Id is the d×d identityendfor fc ←∞ . fc will become the current value of f(θ) repeat for k = 1,…,K do for i = 1,…,I do rki ← λkNormzi[µk,Σk] PK j=1 λjNormzi[µj,Σj] . Compute the responsibilities. This is the E step endfor endfor for k = 1,…,K do . The three assignments in this loop are the M step λk ← 1 IPI i=0 rki . λk is the coefficient of the k-th mixture component µk ←PI i=1 rkiziP I i=1 rki . µk is the sample mean of the k-th mixture component Σk ←PI i=1 rki(zi−µk)(zi−µk)T PI i=1 rki . Σk is the sample covariance of the k-th mixture component endfor fp ← fc . Remember the previous value of f(θ) fc ← f(λ1,…,λK,µ1,…,µK,Σ1,…,ΣK) . Compute the current value of f(θ)until fp ≥ fc + τ . Stop if improvement falls below τOutput: λ1,…,λK,µ1,…,µK,Σ1,…,Σk,r11,…,rKI
(a) Implement from scratch a MATLAB function
[lambda, M, Sigma, R] = EM(Z, M, sigma2)
that takes the same arguments Z and M as kmeans in addition to a vector sigma2 of K positive scalars to initialize EM. Your function EM uses the variances σ2 k in sigma2 to generate the initial values of the d×d mixture covariance matrices as Σk = sigma2(k)∗eye(d) where eye(d) generates the d×d identity matrix and d=size(Z,1). The output lambda of EM is a K-dimensional vector with the mixture coefficients λk. The output M is a d×K matrix that collects the estimated centroids of the mixture components. The output Sigma is a d×d×K array such that Sigma(:,:,k) = Σk , thefull k-thcovariancematrixestimatedbytheEMalgorithm. The K×I outputmatrix R containstheresponsibilitiescomputed by EM. Use the same termination threshold as in kmeans.
COMPSCI 527 — Duke — October 20, 2014
[Warning: If you compare this matrix with the responsibilities the textbook defines in chapter 7, then your matrix is the transpose of that: R(k,i) = rik.] Hand in your implementation of EM and a plot of the result of running
load data [lambda, M, Sigma, R] = EM(blobs, M0, ones(1, 3)); showMixture(blobs, lambda, M, Sigma, R, M0, 1, ’blobsEM’)
The function showMixtures is provided with this assignment, and will produce a PDF file named blobsEM.pdf. The resulting plot is in color, so you see more clearly what happens on screen, but it is OK to turn in a black-and-white print. Also give the resulting values of lambda, M, and Sigma. (b) Run your function EM with first argument cigars, initializing with M0(:, 1:2).
(c) Is this result more satisfactory than using K-means on the same data? Explain briefly.
(d) With either K-means or EM you need to know the number of clusters ahead of time. Show plots (using showMixture) of running your EM function on the blobs data with K = 2 and then with K = 4. Use M0(:, 1:2)) and M04 (provided) to initialize. (e) Runyourfunction EM withfirstargument bananas (provided),initializingwith M0(:, 1:2). Turnin M andtheplotgenerated with showMixture.
(f) In what ways is the result in your answer to the previous question unsatisfactory? Is this a limitation of the EM algorithm or of modeling data with a mixture of normal distributions? Explain briefly.
3. The issues with EM (and therefore with K-means) highlighted by some of your answers in the previous problem motivate us to find a clustering method (that is, an algorithm that groups data into tight groups for some definition of “tight”) that can detect non-convex clusters and does not require knowing the number of clusters ahead of time. One such method is based on the mean shift mode-seeking algorithmintroducedbyK.FukunagaandL.D.Hostetler(IEEETransactionsonInformationTheory,21(1):32-40,1975). Meanshiftis not a clustering algorithm in itself, but can be used to make one. So please bear with me. Let Kh(z) be a Gaussian kernel defined as follows:
Kh(z) = e−(kzk h )2
where h > 0 is called the bandwidth of the kernel (Kh(z) is not a probability density, as it is not normalized to integrate to one). Given a set Z = {z1,…,zI}of data points inRd, the quantity
φh(z) =
I X i=1
Kh(z−zi) ,
is a measure of the local density of the data in a neighborhood of z: If there are many data points zi near z, then the value of φh(z) is large. The vector µh(z) =PI i=1 ziKh(z−zi)P I i=1 Kh(z−zi) is an average of the data zi weighted by a decreasing function of their distance from z, and can therefore be interpreted as the local centroid (or mean) of the data around z. If the data is locally Gaussian, the density at the centroid µh(z) is no less than the density at z. Figure 1 illustrates this point. Thus, we have found (i) a way to measure the data density around any point z ∈ Rd and (ii) a new point z0 = µh(z) where the densityisequaltoorgreaterthanthatatz. Theseobservationsyieldanastonishinglysimplealgorithmthatseeksthemodeofthedensity of the data in Z: Start anywhere (at some initial point zstart) and keep shifting from z to the local mean µh(z) (which becomes the new z), until z and µh(z) coincide. Algorithm 3 summarizes this procedure. Of course, this algorithm is local, and which mode it finds depends on zstart. In addition, what “local” means depends on the value of the bandwidth parameter h.
(a) Implement from scratch a MATLAB function with header
function [z, zh] = meanShift(zstart, Z, h)
COMPSCI 527 — Duke — October 20, 2014
z
z’
Figure 1. The large circle suggests a Gaussian function centered at z (small hollow circle). The distribution of points (black dots) under this Gaussian is lopsided, and the weighted mean z0 (small hollow square) of the data does not coincide with z. If z is moved (arrow) to point z0, then the local density increases.
Algorithm3. The mean shift algorithm Input: z1,…,zI,zstart,h > 0 z0 ← zstart repeat z ← z0 z0 ←PI i=1 ziKh(z−zi)P I i=1 Kh(z−zi) untilkz−z0k≤ τ . Stop if there is not enough improvement Output: z0
that computes a local mode of the data in a set Z starting at some starting point zstart, and with bandwidth h. The input zstart isa d×1 vectorwiththestartingpoint;the d×I inputmatrix Z contains I datapointsinitscolumns;andthepositiveinputscalar h isthebandwidth. Theoutput z isa d×1 vectorwiththelocalmodeofthedata. Ifthealgorithmtakes k stepstoreachthemode, then the output zh is a d×k matrix that records the path traversed from zstart to z. Use termination threshold τ = h/1000 .
Hand in your code and the two plots resulting from the calls
zstart = [0 1.5]’; [z1, zh1] = meanShift(zstart, blobs, 0.2); [z2, zh2] = meanShift(zstart, blobs, 2); showPaths({zh1, zh2}, blobs, ’paths’);
The function showPaths is provided with this assignment, and will produce a PDF file named paths.pdf. It is OK to hand in a black-and-white print. Also state how many iterations it took for each of the two paths. (b) Explain the apparently unintuitive result for the greater bandwidth h = 2.
4. The mean shift algorithm can be used to construct a mean shift clustering algorithm. The recipe is again very simple: run the mean shift algorithm I times, starting at each of the data points in turn, and record the final mode for each run. Points that are in the same cluster (at a scale measured by the bandwidth h of the mean shift algorithm) will converge to the same mode, so we can group all the points in Z by the resulting modes: if two points lead to the same mode, they belong to the same cluster. Algorithm 4 summarizes.
(a) Write a MATLAB function with header
function [U, R] = meanShiftCluster(Z, h) thattakesa d×I array Z ofdataandapositivebandwidth h andreturnsa d×K array U with K modesandalogical K×I array R of cluster memberships, in the same style as kmeans. K is the number of clusters that meanShiftCluster finds.
COMPSCI 527 — Duke — October 20, 2014
Algorithm4. The mean shift algorithm Input: Z = {z1,…,zI}and h > 0 for i = 1,…,I do M(:,i) ← meanShift(zi) . meanShift terminates when there is less than τ motion endfor U ←near-unique columns(M) . Two columns in M are the same if their Euclidean distance is less than τ/10 K ←number of columns(U) . The algorithm determines the number of clusters on its own for k = 1,…,K do for i = 1,…,I do if M(:,i) = U(:,k) then R(k,i) ← 1else R(k,i) ← 0endif endfor endfor Output: Modes U and logical matrix R of cluster memberships
Use termination threshold
τ = h/1000 for meanShift, and declare two modes to be the same if they are more than 10τ apart. Handinyourcodeandaplotthatuses showClusters toshowthemodesandtheclustersfoundonthe bananas datasetwith a bandwidth h = 0.8. [You do not have the argument M0, so you can pass the empty matrix instead.] Also report the running time of your code in seconds, and the number of clusters found. [Hints: While the algorithm is simple in principle, its computational cost is high: For each starting point zi in Z and at each iteration of mean shift it is necessary to compute the distance between z and every point in Z. So with k iterations per point on average mean shift clustering takes O(kI2) distance computations. Toreducethecomputationalburden,notethatpointszi thatareveryfarfromz(relativetoh)leadtonear-zerovaluesofKh(z−zi), and can be ignored. Ways to take advantage of this fact are beyond the scope of this course, so you may use the function rangesearch available in the MATLAB Statistics Toolbox. If you do not have that toolbox, you can download a similar function from the Mathworks web site by searching for ”Yi Cao fast range search”. Ignore distances that are greater than 2h.] (b) Look (carefully) at your result in your previous answer and comment on the extent to which mean shift clustering addresses the problems of EM. Discuss trade-offs on the selection of the bandwidth h.