COMPSCI527-Homework2 solution

$29.99

Original Work ?

Download Details:

  • Name: hw2.zip
  • Type: zip
  • Size: 11.99 MB

Category: You will Instantly receive a download link upon Payment||Click Original Work Button for Custom work

Description

5/5 - (3 votes)

Some questions in this assignment require writing MATLAB code. Please intersperse concise comments with your code whenever useful, but do not go overboard with comments. Try to use matrix notation whenever possible, so as to avoid explicit for loops over large arrays. These are very inefficient in MATLAB. Looping over smaller arrays (up to a few hundred entries) is OK. 1. Inhomework1weextensivelyusedtheMATLAB functionsumtocomputemarginaldistributions(andthroughthesealsoconditional distributions)fromjointdistributions. Forinstance,ifarray pxy containsajointdistributionforrandomvariables X and Y (withvalues of X increasing with the column index of pxy), then the marginal distributions of X and Y can be computed as follows:
px = sum(pxy, 1); py = sum(pxy, 2); With continuous probabilities, we cannot do this, because integrals now replace sums. There are two ways to proceed. One is analytic: start with probabilities expressed by formulas (or, in MATLAB, by functions that compute formulas given input values of the random variables) and then integrate by calculus. The other way is numerical: assume that probability distributions are given as arrays of finely spaced samples, and then use numerical integration to approximate the values of the integrals. This problem explores some of these issues. You may assume that function samples are always given over uniform grids. That is, if values of a function are given for x = x0,…,xm and y = y0,…,yn, then xi −xi−1 = ∆x for all i = 1,…,m and yj −yj−1 = ∆y for all j = 1,…,n . However, ∆x may be different from ∆y. (a) The trapezoidal rule computes an approximate definite integral of a function of a single variable as follows Z b a f(x)dx ≈ ∆x 2 m X i=1 (f(xi−1) + f(xi)) where a = x0 , b = xm and xi −xi−1 = ∆x = b−a m . Thiscomputationapproximatesthefunction f byafunctionthatislinearbetweensamples, andaddsuptheareaofthetrapezoids with vertices xi−1,xi,f(xi),f(xi−1) for i = 1,…,m. To compute an integral over two variables, one can then use Fubini’s theorem Z b a dxZ d c dy f(x,y) =Z b a “Z d c f(x,y)dy#dx . So one first computes the internal integral for every value of xi by the trapezoidal rule to obtain m + 1 approximate samples g(x0),…,g(xm) of the function g(x) =Z d c f(x,y)dy , and then one uses the trapezoidal rule again to approximate Z b a dxZ d c dy f(x,y) =Z b a g(x)dx . Write a MATLAB function with header
function i = integrate(p, dx, dy)
that computes integrals over one or two dimensions using the trapezoidal rule. Specifically, p is either a vector or a two-dimensional array of values, and either or both of the positive scalars dx or dy are specified. The function integrate works as follows:

• If p is a vector (you can use isvector to check), then dy must be either set to the empty matrix [] or omitted altogether. Either way, integrate approximates the integral of the function whose samples in p are assumed to be spaced by an interval ∆x equal to the positive scalar dx. • If p is a two-dimensional array, dx is a positive number, and dy is either unspecified or set to the empty matrix, then integrate uses the trapezoidal rule to integrate each row of p (as you would when computing a marginal distribution) to produce a column vector of results. • Ifpisatwo-dimensionalarray,dxisistheemptymatrix,anddyisapositivenumber,thenintegrateusesthetrapezoidal rule to integrate each column of p to produce a row vector of results. • Ifpisatwo-dimensionalarrayandbothdxanddyaresettopositivescalarvalues,thenintegrateusesFubini’stheorem and the trapezoidal rule to integrate the function whose samples in p are assumed to be spaced by dx and dy in the x and y directions. The result in this case is a single number. Leaving both dx and dy unspecified, or specifying both of them when p is a vector, should cause an error. Use no explicit for loops in your code. Turn in your code for integrate and a plot of the result of evaluating the marginal distribution p(x) from the joint distribution p(x,y) computed by running the function pXYa (the ’a’ stands for “analytic”) provided in the zip file that comes with this assignment on the x and y values generated by the following commands:
x = linspace(-0.1, 2.1, 201); y = linspace(0, 1, 101)’;
(note that x is a row vector and y is a column vector). There are various ways to calculate dx and dy from x and y for use in integrate. Since the spacing in x and y is regular, it should not matter what calculation you use. Nexttoyourplot,alsoshowaplotonthesamerangeofxvaluesofthefunctionpXaprovidedwiththisassignment. Thisfunction computes the marginal analytically, and you can just call pXa(x) to get all the values at once. Label all the axes. (b) What is the root-mean-square discrepancy per sample (kpxn−pxak2/201) between the two versions (numerical and analytical) of the marginal distribution p(x)? (c) Compute (numerically) the conditional probability p(y|x) from the joint probability density p(x,y) given earlier. Feel free to use either the analytical or the numerical version of the marginal. Wherever the marginal in the definition of p(y|x) is less than δ = 0.001
set the conditional probability equal to zero. Show your MATLAB code snippet and a 20-value contour plot of the result. If the result is in array pygxn (n for “numerical”), this plot is obtained with the following command:
contour(x, y, pygxn, 20, ’Color’, ’k’)
(d) How can you tell immediately from the contour plot in your previous answer that the two random variables X and Y are not independent? State what you would expect the contour plot to look like if X and Y were independent.
(e) Find (numerically) samples of a new joint probability distribution p0(x,y) that has the same marginals in x and y as the distribution p(x,y) from earlier, but where the two random variables are mutually independent. Then compute samples ppygx of the conditional probability p0(y|x) from the new joint distribution. Explain your solution, and show your code snippet (starting with pxya, x, y, and delta assumed to be already available from before) and the contour plot of the new conditional probability distribution. This plot is obtained with a command similar to the one you used for pygxn. (f) To apply the trapezoidal rule for the computation of a one-dimensional integral takes m sums if the input function is given over m + 1 samples. If there are m + 1 samples in each of x and y, how many sums does it take to compute a 2-dimensional integral using Fubini’s theorem and the trapezoidal rule?
(g) How many steps does it take to compute a d-dimensional integral using Fubini’s theorem and the trapezoidal rule? . (h) WhatdoesyouranswertothepreviousquestionsuggestabouttheapplicabilityofFubini’stheoremandthetrapezoidalruletothe computation of high-dimensional integrals? The answer to this question leads to Monte Carlo integration, a fascinating topic that is beyond the scope of this course.

2. This problem guides you through the implementation of a simple Bayes classifier for handwritten digits. All the required materials are in the zip file that comes with this assignment. The MNIST handwritten digit database was developed by NYU’s Yann LeCun in the late Nineties to provide a benchmark for software that recognizes isolated handwritten digits. The site https://yann.lecun.com/exdb/mnist/ describes the database, which is contained in the four files in the data directory in the homework zip file. An annotated training database is used for learning a classifier, and a similarly annotated testing database is used to evaluate the classifier’s performance. If you make the code directory in the zip file your MATLAB working directory and type
[train, test] = readMNISTDatabase(’../data’); then the two resulting structures train and test will contain the training and testing parts of the database. The train structure has a field train.image with 60,000 black-and-white images of size 28×28 pixels of type uint8 arranged in a single array of size 28×28×60,000. Each image shows a single handwritten digit between 0 and 9, properly sized and centered. There is also a structure train.label with the corresponding 60,000 labels (each label is a uint8 number between 0 and 9). The test structure contains an annotated database used for testing. Its format is the same as that of train, except that there are 10,000 images and labels. (a) Assumingyoudidnotdeleteormovethefile decoder.mat inthe code directory,thefunction code.m willcompresseachof the images in a database into a 64-dimensional vector of numbers. For instance, if you type
cimg = code(train.image); the function will produce a 64×60,000 matrix of double numbers, each column being the encoding of one of the images in train.image. How many times shorter are the vectors in cimg, compared to the original images strung into vectors? The length of a vector is simply the number of its entries.
(b) The function code does double-duty. If you type
rimg = code(cimg, size(train.image)); then the resulting 28 × 28 × 60,000 array rimg will contain reconstructions of the images in train.image from their 64number compressed representations. So code behaves like an encoder if it is called with one argument, and like a decoder if it is called with two (with appropriate contents). Display pairs of side-to-side images (train.image(:, :, k(i)), rimg(:, :, k(i))) for all entries k(i) of the vector k obtained with the following command:
[˜, k] = unique(train.label);
Thiswillgiveyouonepairofimagesforeachdistinctdigit. Noneedtoshowyourcode,justthe20pictures. Pleasetrytoplaceall picturesononepage,andletusknowwheretolookforthem. Thesepicturesareself-explanatory,sothereisnoneedforcaptions. To save printer ink, you may want to display the negatives (black on white rather than white on black): If img is the image, then display uint8(255) – img.
(c) Hopefully, the pictures in your previous answer will have convinced you that the compression achieved by code preserves the information in the original images well enough for recognition. If you want to know how this compression works, read section 13.5, Dimensionality Reduction in the textbook. However, you do not need to understand this material for this assignment. You will now use the compressed representations in cimg and the labels in train.label to estimate the parameters of a generative model for this classification problem. Specifically, the prior p(w) is a categorical distribution (section 3.3 of the textbook) over the 10 labels. The likelihood p(x|w) of each digit w ∈{0,…,9}is assumed to be a normal distribution, defined in equation (5.1) in the textbook. For simplicity, we assume that the covariance matrix Σw is diagonal:
Σw =
     
σ2 w1 0 ··· 0 0 σ2 w2 … . . . . . . … … 0 0 ··· 0 σ2wd
     
where d is the dimension of each column x in cimg. This diagonal form is equivalent to assuming that the components of x are mutually independent given the digit w. Equation (5.1) then simplifies to the following expression:
p(x|w) =
1 (2π)d/2qQd i=1 σ2wi
exp −1 2
d X i=1
(xi −µwi)2 σ2wi != Norm[µw,Σw] .
Note the slight notational redundancy in the denominator: Of course, v u u t d Y i=1 σ2wi = d Y i=1
σwi ,
but perhaps the redundant notation avoids mistakes later. The vector µw = [µw1,…,µwd]T is the mean of the distribution (the superscript T denotes transposition: we view the mean as a column vector, even if we write is as a row vector to save space). It would be wasteful to store Σw as a matrix, so we introduce a column vector sw that contains the diagonal entries of this matrix: sw = [σ2 w1,…,σ2wd]T (variances, not standard deviations!). Since we have plenty of data, Maximum Likelihood estimation (section 4.4.1) of the 2d parameters in µw and sw is perfectly adequate. Also,becausethecomponentsof x areassumedtobeindependentofeachothergiven w,wecanusetheformula(4.12) in the textbook to estimate each entry of µw and formula (4.13) to estimate each entry of sw. Of course, we need to estimate a pair of vectors (µw,sw) for each digit w ∈{0,…,9}. Write a MATLAB function with header
function [likelihood, prior] = normalModel(X, L) that takes a d×n matrix X of compressed vectors (such as cimg) and a row vector L of n labels (such as train.label) and usesformulas(4.12)and(4.13)inthetextbooktoproduceastructure likelihood withtwofields M and S,andacolumnvector prior (replace the symbol I in the formulas with n). Both fields likelihood.M and likelihood.S are d×` matrices, where ` = 10 is the number of labels (digits). Column likelihood.M(:, w+1) is an estimate of the mean µw for digit w. Similarly, column likelihood.S(:, w+1) is an estimate of the variances in sw for digit w. Turn in your code. [Hints: Makesureyoucast X todouble,toavoidroundingandoverflowissues. Thefunction normalModel israthergeneral,so youdon’twanttohardwireanynumber(suchas `)init;onewaytoobtainalistofalllabels(andfromitalso `)istowrite label = unique(L);. If you think how to rewrite formulas (4.12) and (4.13) in vector form, you will see that this function can be written with just a loop over the digits by using the built-in functions mean and std (remember to square standard deviations to obtain variances).] (d) The function drawRandom provided with this assignment draws samples at random from the generative model you just defined. Call the function as follows
[sample, slabel] = drawRandom(likelihood, prior, [28 28 64]);
to generate 64 sample images from the model, together with labels that tell what likelihood each image was drawn from. Make a single composite of these 64 images (arranged in 8 rows and 8 columns) and show the negative of the result (negative, again, meanstodisplaytheimagesinuint8(255) – sampleinsteadofthoseinsample). Oneachimage,overlay(withthetext function) the corresponding slabel (make the overlay legible, please). The images will not look like digits, but should share someoftheoverallmorphologicalfeaturesofdigits. Perhapstheywilleachlooklikevariouspiecesofdigitsgarbledtogether,with the “right” digit somewhat emphasized. Turn in your composite image.
(e) The square root of the sum in the exponent of the normal distribution, δ(x,µw) =v u u t d X i=1 (xi −µwi)2 σ2wi

is called the Mahalanobis distance between vectors x and µw relative to the diagonal covariance Σw. This definition generalizes to d dimensionsthenotionofmeasuringhowmanystandarddeviationsavalue x isfromareferencevalue µw: If δ(x,µw) islow, thenthelikelihoodthat x comesfrom Norm[µw,Σw] ishigh,andviceversa. Wecanusethisnotiontomeasurehowfarapartthe means µw are from each other by measuring the `2 quantities δ(µw,µ0w). If this distance is small, then we expect digit w to be confused often with digit w0. Note that the distance δ is not symmetric in its arguments, that is, δ(µw,µ0w) is in general different from δ(µ0w,µw), because the covariance of the second likelihood is used in the formula. So it is possible that digit w may often be mistaken for digit w0 and not vice versa. As a first step in computing a matrix of all Mahalanobis distances between the means of your digit model, and also for later use, write a MATLAB function
function [v, delta] = normalValue(X, m, s) thattakesad×nmatrixXofdata,ad-dimensionalmeanvectorm,andad-dimensionalvectorsofvariances,andoutputstworow vectors v and delta, each of length n. The vector v contains the values—computed at each of the columns of the data matrix X—of the normal distribution Norm[m,Σ], where Σ is the diagonal matrix that has on its diagonal the values in the vector s. The output vector delta contains the corresponding Mahalanobis distances (do not forget the square root). Turn in your code. [Hint: This requires no explicit for loops.] (f) Write a MATLAB function with header
function D = distances(L) that computes the ` × ` matrix of Mahalanobis distances δ(µw,µ0w) between the digit means. Make sure to use the entries in sw0 = diag(Σw0)—and not those in sw—in the definition of δ. Turn in your code, as well as a printout of the matrix D for the likelihood you computed with your function normalModel. Please keep only the first two digits after the decimal period, to keep the printout small. (g) Write a MATLAB function with header
function label = classify(img, likelihood, prior)
that takes an uncompressed array img of images, such as test.image, and a generative model in the form of a likelihood and prior as specified earlier, and produces a vector label with as many entries as img has images. Entry label(k) is the digit (a number between 0 and 9) that the Bayes classifier with the model developed earlier (on the compressed images) assigns to the image img(:, :, k). Turn in your code. (h) Write a MATLAB function with header
function [E, errorRate, pCgT] = errorStats(computedLabel, trueLabel)
thattakes asinput twovectors ofequal length n. The vector computedLabel is aset of labelsestimated bya classifier, and the vector trueLabel is a set of labels, such as those in test.label, that are known to be true. The function computes an `×` array E, a scalar errorRate between 0 and 1, and an array of probabilities pCgT the same size as E. Specifically, entry E(i, j) of E is the number of times that an image that was truly of digit i-1 was classified as digit j-1 instead (the entries in E should add up to n). The scalar errorRate is the fraction (not the percentage) of wrong entries in computedLabel. Entry pCgT(i, j) of array pCgT is the conditional probability, estimated from the data in the input vectors, that a digit was classified as digit i-1 given that the digit was truly j-1. Turninyourcode,andthevaluesyourfunctioncomputesontheresultofyourclassifieronthesettestoftestdata. Alsoexpress the error rate as a percentage. Use three decimal digits after the period for pCgT. [Hints: Both errorRate and pCgT can be computed from E. Computing E may require a nested for loop on its entries.] (i) How can you check that pCgT is a valid conditional probability and that it represents p(computed|true), and not p(true|computed)? (j) State-of-the-art isolated-digit classifiers achieve error rates around 0.2 percent, so they do much better than the very simple classifier developed in this exercise. The improvements come mainly from the design of image features that are much more sophisticated than those our code function computes. Assuming that the classifier error rate is p, and that errors on different digits are mutually independent, what is the probability pZ that the classifier gets a five-digit ZIP code wrong? Evaluate your answer for p = errorRate (your computed rate) and for p = 0.002 (the best available rate today).

(k) The US Post Office makes about 400 million deliveries each day. How many zip codes would be classified erroneously each day if the state-of-the-art digit classifier were used?
(l) In light of your previous answer, it is useful for your classifier to output not only a label per digit image, but also the value of the posterior p( ˆ w|x) for the label ˆ w chosen by the classifier for input feature x. What is the range of values that this posterior can take? Explain your answer.
(m) How could the US Post Office automatic ZIP code scanner use the value of the posterior p( ˆ w|x)? Give a qualitative answer, noneed for formulas.
(n) The considerations above assume that errors on adjacent digits are mutually independent. Is this assumption valid? Why or why not?