Description
Assignment overview. This assignment is designed to practice Python and to introduce you to the sklearn library for machine learning in Python. It requires you to install sklearn and tensorflow, and to download the Iris and Wine data sets in order to learn and predict labels using algorithms implemented in sklearn. We also practice displaying MNIST data, which we will need later.
Questions:
- [10 marks] MNIST (https://yann.lecun.com/exdb/mnist/) is a famous dataset and benchmark for pattern recognition. It contains images of handwritten digits that are normalized and centered in a 28×28 pixel array. The dataset was derived from a NIST (National Institute of Standards and Technology) dataset, hence the name Modified NIST. There are many ways to load this dataset. An easy way is to use routines from tensorflow to load 50 images into a Python tuple with:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
batch = mnist.train.next_batch(50)
Write a program that displays some of these examples.
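For illustration, a minimal sketch of one way to display the examples is given below. It assumes TensorFlow 1.x (where the tutorials module above is still available) and matplotlib for plotting; the 2×5 grid layout is just one choice.

import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

# Load 50 training examples as in the snippet above.
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
images, labels = mnist.train.next_batch(50)

# Show the first ten digits in a 2x5 grid; each image is a flattened 784-vector.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, image, label in zip(axes.flat, images, labels):
    ax.imshow(image.reshape(28, 28), cmap='gray')
    ax.set_title(str(label.argmax()))  # one-hot label -> digit
    ax.axis('off')
plt.show()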
- [20 marks] In the example code for the Iris data classification, we used 10-fold cross-validation to evaluate the accuracy of predictions with a linear SVM, using the sklearn function cross_val_score (a sketch of this baseline is shown after the sub-questions below).
a. Explain briefly what k-fold cross-validation is and what it is used for.
b. Write a script that performs k-fold cross-validation without using the cross_val_score function and compare its results with those of the sklearn function.
c. Compare the cross-validated results of the SVM and a random forest (RF) and comment on which method is better.
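For reference, the cross_val_score baseline mentioned above might look like the following minimal sketch; the exact SVM settings used in the example code are assumptions here.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Evaluate a linear SVM on the Iris data with 10-fold cross-validation.
X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear')
scores = cross_val_score(clf, X, y, cv=10)  # one accuracy score per fold
print('mean accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))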
- [20 marks] Please download the zip file and extract it to the directory for this assignment. Read through the wine_names.txt file to understand the problem and the wine data contained in the wine.train dataset. Train one of the models SVM, MLP, or RF to develop the best possible model for classifying the wine data, given the training data, and apply it to the hold-out test set of 58 records in the wine.test file. In other words, you must submit a list of 58 classifications (as a separate *.csv file) for the hold-out test set, in the same order as received. We will use your answers to score how well your model performs.
Describe briefly your methodology for determining the best model, and submit your final prediction program as well as the .csv file with the labels. Everyone will be ranked on how well their model classifies the hold-out test set, and 2 additional marks will be given to the best 10% of submissions.
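For illustration, a minimal sketch of the expected workflow is given below. The file layout is an assumption: wine.train is taken to be comma-separated with the class label in the first column (as in the UCI wine data) and wine.test to contain features only; the RF model and the output filename wine_predictions.csv are likewise only examples.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed format: label in column 0 of wine.train, features in the rest;
# wine.test is assumed to hold the same features without labels.
train = np.loadtxt('wine.train', delimiter=',')
X_train, y_train = train[:, 1:], train[:, 0]
X_test = np.loadtxt('wine.test', delimiter=',')

# RF is used here only as an illustration; any of SVM, MLP, or RF could be tuned.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Write one predicted label per line, preserving the order of wine.test rows.
np.savetxt('wine_predictions.csv', predictions, fmt='%d')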