Description
CS498 Homework 1 Naive Bayes
- Problem 1 I strongly advise you to use the R language for this homework (but word is out on Piazza that you could use Python; note I don’t know if packages are available in Python). You will have a place to upload your code with the submission. The UC Irvine machine learning data repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find this data at https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. You should look over the site and check the description of the data. In the “Data Folder” directory, the primary file you need is named “pima-indians-diabetes.data”. This data has a set of attributes for each patient, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable.
- Part 1A Build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training. There are a total of 768 data points. You should use a normal distribution to model each of the class-conditional distributions. You should write this classifier yourself (it’s quite straightforward). Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.
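The structure of such a classifier is simple enough to sketch directly. Below is a minimal version in Python (allowed per the Piazza note above); the function names and the 1e-9 guard against zero variance are my own choices, not part of the assignment:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature normal parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # prior P(y = c)
                     Xc.mean(axis=0),           # per-feature means
                     Xc.std(axis=0) + 1e-9)     # per-feature std devs
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the largest log-posterior for each row."""
    classes = sorted(params)
    scores = []
    for c in classes:
        prior, mu, sd = params[c]
        # Sum of log normal densities plus log prior (constants dropped).
        ll = -0.5 * np.sum(((X - mu) / sd) ** 2 + 2 * np.log(sd), axis=1)
        scores.append(ll + np.log(prior))
    return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]
```

For Part 1A you would fit the parameters on the 80% training split and report `(predicted == labels).mean()` on the held-out 20%.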
- Part 1B Now adjust your code so that, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions, and the posterior. R uses a special number NA to flag a missing value. Most functions handle this number in special, but sensible, ways; but you’ll need to do a bit of looking at manuals to check. Report the accuracy of the classifier on the 20% that was held out for evaluation.
- Part 1C Now use the caret and klaR packages to build a naive Bayes classifier for this data, assuming that no attribute has a missing value. The caret package does cross-validation (look at train) and can be used to hold out data. You should do 10-fold cross-validation. You may find the following fragment helpful: train(features, labels, classifier, trControl=trainControl(method='cv', number=10)). The klaR package can estimate class-conditional densities using a density estimation procedure that I will describe much later in the course. I have not been able to persuade the combination of caret and klaR to handle missing values the way I’d like them to, but that may be ignorance (look at the na.action argument). Report the accuracy of the classifier on the held-out 20%.
- Part 1D Now install SVMLight, which you can find at https://svmlight.joachims.org, via the interface in klaR (look for svmlight in the manual) to train and evaluate an SVM to classify this data. For training the model, use: svmlight(features, labels, pathsvm). You don’t need to understand much about SVMs to do this, as we’ll do that in following exercises. You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8. Using the predict function in R, report the accuracy of the classifier on the held-out 20%. Hint: If you are having trouble invoking svmlight from within RStudio, make sure your svmlight executable directory is added to your system path. Here are some instructions about editing your system path on various operating systems: https://www.java.com/en/download/help/path.xml You will need to restart RStudio (or possibly restart your computer) afterwards for the change to take effect.
- Problem 2 For this assignment, you should do your coding in R once again, but you may use libraries for the algorithms themselves. The MNIST dataset is a dataset of 60,000 training and 10,000 test examples of handwritten digits, originally constructed by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. It is very widely used to check simple methods. There are 10 classes in total (“0” to “9”). This dataset has been extensively studied, and there is a history of methods and feature constructions at https://en.wikipedia.org/wiki/MNIST_database and at the original site, https://yann.lecun.com/exdb/mnist/. You should notice that the best methods perform extremely well. There is also a version of the data that was used for a Kaggle competition. I used it for convenience, so I wouldn’t have to decompress LeCun’s original format. I found it at https://www.kaggle.com/c/digit-recognizer. If you use the original MNIST data files from https://yann.lecun.com/exdb/mnist/, the dataset is stored in an unusual format, described in detail on the page. You should begin by reading over the technical details. Writing your own reader is pretty simple, but web search yields readers for standard packages. There is reader code for R available (at least) at https://stackoverflow.com/questions/21521571/how-to-read-mnist-database-in-r. Please note that if you follow the recommendations in the accepted answer there, at https://stackoverflow.com/a/21524980, you must also provide the readBin call with the flag signed=FALSE, since the data values are stored as unsigned integers. You need to use R for this course, but for additional reference, there is reader code in MATLAB available at https://ufldl.stanford.edu/wiki/index.php/Using_the_MNIST_Dataset.
Regardless of which format you find the dataset stored in, the dataset consists of 28 x 28 images. These were originally binary images, but appear to be grey-level images as a result of some anti-aliasing. I will ignore mid-grey pixels (there aren’t many of them) and call dark pixels “ink pixels” and light pixels “paper pixels”; you can modify the data values with a threshold to specify the distinction, as described at https://en.wikipedia.org/wiki/Thresholding_(image_processing). The digit has been centered in the image by centering the center of gravity of the image pixels, but as mentioned on the original site, this is probably not ideal. Here are some options for re-centering the digits that I will refer to in the exercises.
- Untouched: Do not re-center the digits, but use the images as is.
- Bounding box: Construct a 20 x 20 bounding box so that the horizontal (resp. vertical) range of ink pixels is centered in the box.
- Stretched bounding box: Construct a 20 x 20 bounding box so that the horizontal (resp. vertical) range of ink pixels runs the full horizontal (resp. vertical) range of the box. Obtaining this representation will involve rescaling image pixels: you find the horizontal and vertical ink range, cut that out of the original image, then resize the result to 20 x 20. Once the image has been re-centered, you can compute features.
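The stretched bounding box step can be sketched as follows, in Python. The threshold of 128 and nearest-neighbour resampling are illustrative choices (any sensible threshold and resizing method will do), and ink is assumed to be the high pixel values, as in the raw MNIST encoding:

```python
import numpy as np

def stretched_bounding_box(img, out_size=20, threshold=128):
    """Crop the ink region of a grey-level digit image and stretch it
    to an out_size x out_size box using nearest-neighbour resampling."""
    ink = img >= threshold                        # binarise: ink vs paper
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:                            # blank image: nothing to crop
        return np.zeros((out_size, out_size), dtype=img.dtype)
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest-neighbour resize: index the crop at out_size evenly spaced points.
    r_idx = np.linspace(0, crop.shape[0] - 1, out_size).round().astype(int)
    c_idx = np.linspace(0, crop.shape[1] - 1, out_size).round().astype(int)
    return crop[np.ix_(r_idx, c_idx)]
```

The plain (non-stretched) bounding box variant would instead paste the uncropped ink range, centered, into a 20 x 20 box without rescaling.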
Here are some pictures, which may help

- Part 2A Investigate classifying MNIST using naive Bayes. Fill in the accuracy values for the four combinations of Gaussian v. Bernoulli distributions and untouched images v. stretched bounding boxes in a table like this. Please use 20 x 20 for your bounding box dimensions.
| Accuracy | Gaussian | Bernoulli |
| --- | --- | --- |
| Untouched images | | |
| Stretched bounding box | | |

Which distribution (Gaussian or Bernoulli) is better for untouched pixels? Which is better for stretched bounding box images?
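For the Bernoulli column, each binarised pixel is modelled as a coin flip whose bias depends on the class. A minimal sketch in Python, with Laplace smoothing (the `alpha=1` default is my choice, so no estimated probability is exactly 0 or 1):

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Per-class ink probabilities for binary pixel features."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # theta[c, j] = smoothed P(pixel j is ink | class c)
    theta = np.vstack([(X[y == c].sum(axis=0) + alpha) /
                       ((y == c).sum() + 2 * alpha) for c in classes])
    return classes, np.log(priors), theta

def predict_bernoulli_nb(model, X):
    classes, log_prior, theta = model
    # log P(x | c) = sum_j x_j log theta + (1 - x_j) log(1 - theta)
    ll = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(ll + log_prior, axis=1)]
```

The Gaussian rows use the same structure with per-pixel normal densities instead of Bernoulli probabilities.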
- Part 2B Investigate classifying MNIST using a decision forest. For this you should use a library. For your forest construction, try out and compare the combinations of parameters shown in the table (i.e. depth of tree, number of trees, etc.) by listing the accuracy for each of the following cases: untouched raw pixels; stretched bounding box. Please use 20 x 20 for your bounding box dimensions. In each case, fill in a table like those shown below.
| Accuracy | depth = 4 | depth = 8 | depth = 16 |
| --- | --- | --- | --- |
| #trees = 10 | | | |
| #trees = 20 | | | |
| #trees = 30 | | | |
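A sketch of filling in this table with a library forest, here scikit-learn’s `RandomForestClassifier` (the assignment only asks that you use some library; the 80/20 split and fixed seeds are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def forest_grid_accuracies(X, y, depths=(4, 8, 16), n_trees=(10, 20, 30)):
    """Fill in the depth x #trees accuracy table for one representation."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    table = {}
    for d in depths:
        for n in n_trees:
            rf = RandomForestClassifier(max_depth=d, n_estimators=n,
                                        random_state=0).fit(Xtr, ytr)
            table[(d, n)] = rf.score(Xte, yte)   # held-out accuracy
    return table
```

You would run this once on untouched raw pixels and once on the stretched bounding box features, producing one table for each.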
CS498 Homework 2 SVM
- Problem 1 You may use any programming language that amuses you for this homework. The UC Irvine machine learning data repository hosts a collection of data on adult income, donated by Ronny Kohavi and Barry Becker. You can find this data at https://archive.ics.uci.edu/ml/datasets/Adult. For each record, there is a set of continuous attributes, and a class “less than 50K” or “greater than 50K”. There are 48842 examples. You should use only the continuous attributes (see the description on the web page) and drop examples where there are missing values of the continuous attributes. Separate the resulting dataset randomly into 10% validation, 10% test, and 80% training examples.
Write a program to train a support vector machine on this data using stochastic gradient descent. You should not use a package to train the classifier (that’s the point), but your own code. You should ignore the id number, and use the continuous variables as a feature vector. You should scale these variables so that each has unit variance. You should search for an appropriate value of the regularization constant, trying at least the values [1e-3, 1e-2, 1e-1, 1]. Use the validation set for this search. You should use at least 50 epochs of at least 300 steps each. In each epoch, you should separate out 50 training examples at random for evaluation (call this the set held out for the epoch). You should compute the accuracy of the current classifier on the set held out for the epoch every 30 steps. You should produce:
- A plot of the accuracy every 30 steps, for each value of the regularization constant.
- A plot of the magnitude of the coefficient vector every 30 steps, for each value of the regularization constant.
- Your estimate of the best value of the regularization constant, together with a brief description of why you believe that is a good value.
- Your estimate of the accuracy of the best classifier on the 10% test dataset.
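The core SGD loop can be sketched as below, in Python. The step-length schedule is one reasonable choice, not prescribed by the assignment, and the per-epoch held-out evaluation and plotting are omitted for brevity:

```python
import numpy as np

def train_svm_sgd(X, y, lam=1e-2, epochs=50, steps=300, seed=0):
    """Hinge-loss SVM trained by SGD; y must be +/-1 and X pre-scaled
    so each column has unit variance."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1]); b = 0.0
    for e in range(epochs):
        eta = 1.0 / (0.01 * e + 50.0)            # one reasonable schedule
        for _ in range(steps):
            i = rng.integers(len(X))
            margin = y[i] * (X[i] @ w + b)
            if margin >= 1:                      # correct side of margin:
                w -= eta * lam * w               # only the penalty pulls on w
            else:                                # hinge loss is active
                w -= eta * (lam * w - y[i] * X[i])
                b += eta * y[i]
    return w, b
```

To produce the required plots you would record `(np.sign(X_held @ w + b) == y_held).mean()` and `np.linalg.norm(w)` every 30 steps, for each value of `lam`.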
CS498 Homework 3 PCA
Problem 1
CIFAR-10 is a dataset of 32×32 images in 10 categories, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is often used to evaluate machine learning algorithms. You can download this dataset from https://www.cs.toronto.edu/~kriz/cifar.html.
- For each category, compute the mean image and the first 20 principal components. Plot the error resulting from representing the images of each category using the first 20 principal components against the category.
- Compute the distances between mean images for each pair of classes. Use principal coordinate analysis to make a 2D map of the means of the categories. For this exercise, compute distances by thinking of the images as vectors.
- Here is another measure of the similarity of two classes. For class A and class B, define E(A | B) to be the average error obtained by representing all the images of class A using the mean of class A and the first 20 principal components of class B. Now define the similarity between classes to be (1/2)(E(A | B) + E(B | A)). If A and B are very similar, then this error should be small, because A’s principal components should be good at representing B. But if they are very different, then A’s principal components should represent B poorly; in turn, the similarity measure should be big. Use principal coordinate analysis to make a 2D map of the classes. Compare this map to the map in the previous exercise. Are they different? Why?
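Principal coordinate analysis (classical multidimensional scaling) turns a symmetric distance matrix into 2D coordinates, and is the tool both mapping exercises need. A minimal numpy sketch:

```python
import numpy as np

def principal_coordinate_analysis(D, k=2):
    """Classical MDS: embed points in k dims so Euclidean distances
    approximate the entries of the symmetric distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]           # keep the k largest
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

For the first map, D holds distances between class mean images; for the second, D holds the (1/2)(E(A | B) + E(B | A)) values.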
CS498 Homework 4 vector quantization
Problem 1
You can find a dataset dealing with European employment in 1979 at https://lib.stat.cmu.edu/DASL/Stories/EuropeanJobs.html. This dataset gives the percentage of people employed in each of a set of areas in 1979 for each of a set of European countries. Notice this dataset contains only 26 data points. That’s fine; it’s intended to give you some practice in visualization of clustering.
- Use an agglomerative clusterer to cluster this data. Produce a dendrogram of this data for each of single link, complete link, and group average clustering. You should label the countries on the axis. What structure in the data does each method expose? It’s fine to look for code, rather than writing your own. Hint: I made plots I liked a lot using R’s hclust clustering function, then turning the result into a phylogenetic tree and using a fan plot, a trick I found on the web; try plot(as.phylo(hclustresult), type='fan'). You should see dendrograms that “make sense” (at least if you remember some European history), and have interesting differences.
- Using k-means, cluster this dataset. What is a good choice of k for this data and why?
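The assignment says it is fine to look for code, but plain Lloyd’s k-means is short enough to sketch; the within-cluster sum of squared errors it returns is what you would plot against k when arguing for a good choice:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns labels and total within-cluster SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else centers[j] for j in range(k)])
        if np.allclose(new, centers):            # converged
            break
        centers = new
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, sse
```

Running this for a range of k and looking for the elbow in the SSE curve is one standard way to justify a choice of k.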
Problem 2
Do exercise 6.2 in the Jan 15 version of the course text.
Questions about the homework
- Can we use linear vector quantization functions like lvqinit, lvqtest, lvq1 available in the ‘class’ library in R for this exercise? Answer: Sure; I don’t know how well they work for this, as I haven’t used the package. For this one, it may be simpler to build your own than to understand the package.
- How should we handle test/train splits? Answer: You should not test on examples that you used to build the dictionary, but you can train on them. In a perfect world, I would split the volunteers into a dictionary portion (about half), then do a test/train split for the classifier on the remaining half. You can’t do that, because for some signals there are very few volunteers. For each category, choose 20% of the signals (or close!) to be test. Then use the others to both build the dictionary and build the classifier.
- When we carve up the signals into blocks for making the dictionary, what do we do about leftover bits at the end of the signal? Answer: Ignore them; they shouldn’t matter (think through the logic of the method again if you’re uncertain about this).
CS498 Homework 5 Regression
You may do this homework in groups of up to 3 contributors. Groups of 1 or of 2 are just fine, too. A group can consist of any mixture of any type of student (MSCS-DS/Online/Face). We do not offer coordination services for complex group interactions, and you may want to take this into account when forming your group.
CS498 Homework 6 Linear regression
Linear regression with various regularizers The UCI Machine Learning dataset repository hosts a dataset giving features of music, and the latitude and longitude from which that music originates here. Investigate methods to predict latitude and longitude from these features, as below. There are actually two versions of this dataset. Either one is OK by me, but I think you’ll find the one with more independent variables more interesting. You should ignore outliers (by this I mean you should ignore the whole question; do not try to deal with them). You should regard latitude and longitude as entirely independent.
- First, build a straightforward linear regression of latitude (resp. longitude) against features. What is the R-squared? Plot a graph evaluating each regression.
- Does a Box-Cox transformation improve the regressions? Why do you say so? Notice that the dependent variable has some negative values, which Box-Cox doesn’t like. You can deal with this by remembering that these are angles, so you get to choose the origin. For the rest of the exercise, use the transformation if it does improve things; otherwise, use the raw data.
- Use glmnet to produce:
- A regression regularized by L2 (equivalently, a ridge regression). You should estimate the regularization coefficient that produces the minimum error. Is the regularized regression better than the unregularized regression?
- A regression regularized by L1 (equivalently, a lasso regression). You should estimate the regularization coefficient that produces the minimum error. How many variables are used by this regression? Is the regularized regression better than the unregularized regression?
- A regression regularized by elastic net (equivalently, a regression regularized by a convex combination of L1 and L2). Try three values of alpha, the weight setting how big L1 and L2 are. You should estimate the regularization coefficient that produces the minimum error. How many variables are used by this regression? Is the regularized regression better than the unregularized regression?
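glmnet is the natural tool in R; to make the mechanics concrete, here is a closed-form ridge fit in Python with the regularization coefficient chosen on a validation set (the lambda grid is illustrative, and the intercept is handled by centering so it is not penalised):

```python
import numpy as np

def ridge_path(Xtr, ytr, Xval, yval, lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    """Closed-form ridge regression for each lambda; returns the lambda
    with the smallest validation MSE and its coefficient vector."""
    xm, ym = Xtr.mean(axis=0), ytr.mean()
    Xc, yc = Xtr - xm, ytr - ym                  # center so intercept is free
    best = None
    for lam in lambdas:
        w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ yc)
        mse = (((Xval - xm) @ w + ym - yval) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, lam, w)
    return best[1], best[2]
```

The lasso and elastic net cases have no closed form; glmnet (or a coordinate-descent equivalent) fits those, but the lambda-selection logic is the same.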
- Logistic regression The UCI Machine Learning dataset repository hosts a dataset giving whether a Taiwanese credit card user defaults against a variety of features here. Use logistic regression to predict whether the user defaults. You should ignore outliers, but you should try the various regularization schemes we have discussed.
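A minimal sketch of L2-regularised logistic regression by gradient descent, in Python; the `lam` and learning-rate values are illustrative, and swapping the penalty term for an L1 subgradient gives another of the regularization schemes discussed:

```python
import numpy as np

def logistic_l2(X, y, lam=1e-2, lr=0.1, iters=2000):
    """L2-regularised logistic regression by plain gradient descent;
    y in {0, 1}, X pre-scaled. Returns weights and intercept."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        grad_w = X.T @ (p - y) / len(y) + lam * w
        grad_b = (p - y).mean()                  # intercept is unpenalised
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Prediction is then `(sigmoid(X @ w + b) > 0.5)`, and you would compare accuracies across the regularization schemes on held-out data.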
CS498 Homework 7 EM Topic models
EM Topic models The UCI Machine Learning dataset repository hosts several datasets recording word counts for documents here. You will use the NIPS dataset. You will find (a) a table of word counts per document and (b) a vocabulary list for this dataset at the link. You must implement the multinomial mixture of topics model, lectured in class. For this problem, you should write the clustering code yourself (i.e. not use a package for clustering).
- Cluster this to 30 topics, using a simple mixture of multinomial topic model, as lectured in class.
- Produce a graph showing, for each topic, the probability with which the topic is selected.
- Produce a table showing, for each topic, the 10 words with the highest probability for that topic.
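The EM updates for the multinomial mixture can be sketched as follows; working with log probabilities in the E-step avoids underflow on long documents, and the small smoothing constant in the M-step is my own guard against log(0):

```python
import numpy as np

def em_multinomial_mixture(counts, k, iters=100, seed=0):
    """EM for a mixture of multinomials over word counts (docs x vocab).
    Returns topic weights pi and per-topic word distributions p."""
    rng = np.random.default_rng(seed)
    n, v = counts.shape
    pi = np.full(k, 1.0 / k)
    p = rng.dirichlet(np.ones(v), size=k)        # random topic initialisation
    for _ in range(iters):
        # E-step: log of the unnormalised posterior P(topic j | doc i).
        log_w = counts @ np.log(p).T + np.log(pi)
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)        # responsibilities
        # M-step: re-estimate mixture weights and word probabilities.
        pi = w.mean(axis=0)
        p = (w.T @ counts) + 1e-9                # tiny smoothing avoids log(0)
        p /= p.sum(axis=1, keepdims=True)
    return pi, p
```

For the assignment you would run this with k = 30, plot `pi`, and for each row of `p` list the 10 words with the highest probability.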
- Image segmentation using EM You can segment an image using a clustering method – each segment is the cluster center to which a pixel belongs. In this exercise, you will represent an image pixel by its r, g, and b values (so use color images!). Use the EM algorithm applied to the mixture of normal distributions model lectured in class to cluster image pixels, then segment the image by mapping each pixel to the cluster center with the highest value of the posterior probability for that pixel. You must implement the EM algorithm yourself (rather than using a package). Test images are here, and you should display results for all three of them. Until then, use any color image you care to.
- Segment each of the test images to 10, 20, and 50 segments. You should display these segmented images as images, where each pixel’s color is replaced with the mean color of the closest segment.
- We will identify one special test image. You should segment this to 20 segments using five different start points, and display the result for each case. Is there much variation in the result? The test image is the sunset image.
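The EM algorithm for a mixture of spherical Gaussians over (r, g, b) pixel rows can be sketched as below. The farthest-point initialisation is one way of choosing start points (varying the seed gives the different start points the last part asks for), and painting each pixel with `mu[label]` produces the displayed segmentation:

```python
import numpy as np

def em_gmm_segments(pixels, k, iters=50, seed=0):
    """EM for a mixture of spherical Gaussians over (r, g, b) pixel rows.
    Returns component mean colours and a hard label per pixel."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    n, d = pixels.shape
    idx = [int(rng.integers(n))]                 # farthest-point initialisation
    for _ in range(k - 1):
        dist = ((pixels[:, None] - pixels[idx][None]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(dist.argmax()))
    mu = pixels[idx].copy()
    pi = np.full(k, 1.0 / k)
    sq = ((pixels[:, None] - mu[None]) ** 2).sum(axis=2)
    var = np.full(k, sq.min(axis=1).mean() / d + 1e-6)
    for _ in range(iters):
        # E-step: responsibilities from spherical Gaussian log densities.
        log_r = np.log(pi) - 0.5 * (sq / var + d * np.log(var))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and per-component variances.
        nk = r.sum(axis=0) + 1e-9
        pi = nk / n
        mu = (r.T @ pixels) / nk[:, None]
        sq = ((pixels[:, None] - mu[None]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * nk) + 1e-6
    return mu, r.argmax(axis=1)
```

For an H x W image you would call this on `img.reshape(-1, 3)` and reshape the painted labels back to H x W for display.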


