Description
In this homework, students work in teams of three on one of the three given datasets. Each team
applies five machine learning algorithms in total, and each student explains the results of 3 of
them: neural networks, decision trees, and one further algorithm of the student's own choice.
1. Form a team of 3 students and sign up on the Google Sheet for one data set (choice
of either 1, 2 or 3). We allow at most 15 teams per data set. Once 15 teams have
signed up for a data set, no more teams can select it. If you have problems forming
a team, please use Piazza to find team members.
Please use the following Google Sheet to sign up your team:
https://docs.google.com/spreadsheets/d/1VKfuFAcerdW2eFAqH2uSkhglTB1eJrcl0BpE0ZLIk8/edit?usp=sharing
2. 40pts For every dataset, you must apply neural networks and decision tree algorithms.
In addition, each team member should apply one further machine learning algorithm
that no other team member has applied, so that each team runs 5
different prediction algorithms in total.
a. The whole assignment must be implemented in Wolfram/Mathematica. We
recommend using the Mathematica Desktop version on your machine for best
performance. Please use the newest version to ensure everyone can run each
other’s code.
b. Provide a discussion of the problem you are solving, how you set up the data,
and a comparison of the results from the three learning algorithms.
c. Provide visualizations of your results, including these 3 performance metrics: ROC curve,
accuracy, and precision/recall.
d. Explain why you chose the third machine learning algorithm and what
conclusions you were able to draw from your results.
3. 20pts Each student must prepare a video (max. 5 min duration) to present the findings
using a presentation of 3 slides.
4. 35pts Each student must prepare a poster and present it either on July 18 or on July 20.
5. 5pts Each student must prepare 3 true/false or multiple-choice questions that can be
used to assess a student’s knowledge of machine learning techniques used in this
project. We will select the best questions to appear on the final exam.
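As a rough illustration of item 2, the following sketch trains the two required classifiers and computes the three performance metrics on held-out data. The toy data, the variable names, and the choice of the built-in functions Classify and ClassifierMeasurements are assumptions made for illustration only; the assignment does not prescribe particular functions, and real datasets will need their own preprocessing.

```wolfram
(* Hypothetical toy data: one numeric feature -> class label "A" or "B" *)
SeedRandom[1];
data = RandomSample[Join[
    Table[RandomVariate[NormalDistribution[0, 1]] -> "A", 50],
    Table[RandomVariate[NormalDistribution[3, 1]] -> "B", 50]]];

(* Same split rule as in the Data Sets section: every 5th sample is test data *)
test = Take[data, {1, -1, 5}];
train = Drop[data, {1, -1, 5}];

(* The two required methods; a third method would be chosen by each student *)
nn = Classify[train, Method -> "NeuralNetwork"];
dt = Classify[train, Method -> "DecisionTree"];

(* Evaluate one classifier on the test data and query the three metrics *)
cm = ClassifierMeasurements[nn, test];
cm["Accuracy"]    (* overall fraction of correct predictions *)
cm["Precision"]   (* per-class precision *)
cm["Recall"]      (* per-class recall *)
cm["ROCCurve"]    (* ROC curve plot(s) *)
```

Repeating the last block with dt in place of nn gives the corresponding measurements for the decision tree, which makes a side-by-side comparison straightforward.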
Data Sets
The following datasets must be separated into test and training data as follows: every 5th
sample, starting with the first, belongs to the test data; the remaining samples belong to the training data.
For example, if data is a List, the two following Mathematica commands separate it into test
and training data:
test = Take[data, {1, -1, 5}]
train = Drop[data, {1, -1, 5}]
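To see what these two commands select, here is a minimal check on a toy list. The list Range[12] is only a stand-in for a real dataset.

```wolfram
(* Toy data standing in for a real dataset: the integers 1 through 12 *)
data = Range[12];

(* Positions 1, 6, 11, ...: every 5th element, starting with the first *)
test = Take[data, {1, -1, 5}]

(* Everything that was not taken for the test set *)
train = Drop[data, {1, -1, 5}]

(* Sanity checks: the two parts do not overlap and together cover the data *)
Intersection[test, train] === {}
Sort[Join[test, train]] === data
```

For this toy list, test is {1, 6, 11} and train is {2, 3, 4, 5, 7, 8, 9, 10, 12}, so roughly 20% of the samples end up in the test data.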
1. CIFAR-100
https://datarepository.wolframcloud.com/resources/CIFAR-100_1
This is a computer vision problem where you need to classify 50,000 images (32 by 32 pixels)
into 100 possible categories.
2. Tornadoes in the U.S., 1950-2015
https://datarepository.wolframcloud.com/resources/United%2BStates%2Btornadoes%2B1950-2015
The goal of this problem is to predict the magnitude of tornadoes based on the F-scale rating (F0
to F5). While there exists a clear description of the F-scale methodology, it will be interesting to
see how the different classification algorithms stack up against the defined metric.
3. Amazon review sentiment analysis:
https://www.kaggle.com/bittlingmayer/amazonreviews
This dataset is an extract from the Amazon Reviews Kaggle competition. The goal is to perform
sentiment analysis to determine whether a review is positive or negative. We have provided a
CSV file on D2L which contains the binary label (positive/negative) and the corresponding text
for the 400,000 reviews.
The CSV file can be read using the command:
Import["reviews.csv", "Table", "FieldSeparators" -> "|"]
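Once imported, the rows usually need to be turned into input -> label pairs before training. The sketch below shows one way to do this on two made-up rows. It assumes that each row has the label in the first field and the review text in the second; check the actual column order of the file on D2L before reusing it. The names sample, rows, and examples are hypothetical.

```wolfram
(* Two made-up rows in the ASSUMED layout label|text; not taken from the real file *)
sample = "positive|Great product, works well\nnegative|Stopped working after a week";

(* Parse the string the same way the CSV file is read, with | as field separator *)
rows = ImportString[sample, "Table", "FieldSeparators" -> "|"];

(* Turn each row {label, text} into a rule text -> label *)
examples = Map[Function[row, row[[2]] -> row[[1]]], rows];

(* The resulting list can then be split into test and training data as described above *)
```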
Software tools
Mathematica installation:
Wolfram Language:
https://wolframlanguage.org
Mathematica tutorial:
https://www.math.mtu.edu/~msgocken/pdebook2/mathtut2.pdf
Submission:
You must submit the following by July 16, 11:59 pm:
1. the link to your 5-min video
2. a Mathematica notebook with your dataset, preprocessing, classifications, results, and
a discussion of the problem you are solving, how you set up the data, a comparison of the
results, and an explanation of the conclusions you were able to draw.
3. your 3-slide presentation which covers
a. the dataset and problem
b. 3 machine learning techniques you used and why
c. visualizations of your results including these 3 performance metrics (ROC curve,
accuracy, precision/recall)
4. your exam questions (TF or multiple choice).
We will later send out a Google Form where you can submit your homework.