CS 4641 Final Project: Experimental Design for Supervised Learning

Introduction

This project is different from the homeworks you’ve worked on previously in the semester. There is no
skeleton code file or list of questions to answer. You are welcome to use libraries which implement the
algorithms you’ll be applying. But the primary difference is that this project is much more open-ended. The
purpose is for you to apply what you’ve learned throughout this semester to a problem that interests you,
and report on what you find. The remainder of this document outlines a basic report structure and some
required components, but the bulk of the work will be in designing, running, and interpreting the results
of experiments which compare the performance of several machine learning algorithms on a dataset of your
choice.

There is no “correct” result, because what you find will depend entirely on the problem you’re
trying to solve and the experiments you design. What we will be looking for in grading this assignment is
whether you include everything we ask you to, whether your experimental methodology is well designed and
described, and whether your conclusions are supported by the results of your experiments.

1 Description

Pick one dataset that is interesting from an ML perspective and important to you. There should be a
supervised learning signal available (e.g. class labels or regression targets), and the dataset should be
non-trivial. Non-triviality is subjective, but here are some heuristics:

• There should be at least 1000 datapoints.
• There should be at least 5 features for each datapoint.
• The datapoints should have a non-trivial distribution. For example, linearly separable data is not
very interesting.
• If in doubt, ask yourself, “Will the TAs agree that this is interesting?”
• For the purposes of this assignment, datasets we have used in previous homeworks won’t be considered
non-trivial.
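The heuristics above can be checked quickly before committing to a dataset. The following is a minimal sketch, not a required procedure: the helper names `meets_size_heuristics` and `linear_baseline_accuracy` are hypothetical, the synthetic data from `make_classification` only stands in for your own dataset, and using a cross-validated logistic regression as a rough proxy for "close to linearly separable" is an assumption rather than a rule from this handout.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def meets_size_heuristics(X, min_rows=1000, min_features=5):
    """Return True if X has at least min_rows datapoints and min_features features."""
    n_rows, n_features = X.shape
    return n_rows >= min_rows and n_features >= min_features


def linear_baseline_accuracy(X, y, folds=5):
    """Mean cross-validated accuracy of a purely linear classifier.

    A score very close to 1.0 suggests the classes are (nearly) linearly
    separable, which would make the dataset less interesting.
    """
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()


# Synthetic stand-in for a real dataset (assumption: replace with your own X, y).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)

print("size heuristics met:", meets_size_heuristics(X))
print("linear baseline accuracy:", round(linear_baseline_accuracy(X, y), 3))
```

If the linear baseline already scores near the ceiling of your metric, that is a hint that the non-linear models required below will have little room to show a difference.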

Using any implementation (yours, a library, code off the internet), compare the performance of the following
algorithms:

If you are doing regression:
• Kernel Ridge Regression with a non-linear kernel [1]
• k-Neighbors Regression [2]
• Neural Network with at least 2 hidden layers (preferably more)

If you are doing classification:
• Random Forests with Bagging [3] or Boosting [4]
• Support Vector Machines with a non-linear kernel
• Neural Network with at least 2 hidden layers (preferably more)
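For readers who choose scikit-learn, the two lists above might be instantiated roughly as follows. This is only a sketch under assumptions: the function names, the dictionary keys, the RBF kernel choice, the layer sizes, and every hyperparameter value are placeholders to be tuned later, and wrapping distance-based and neural models in a `StandardScaler` pipeline is a common habit rather than a requirement of this assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def regression_models(seed=0):
    """Untuned starting points for the three regression algorithms."""
    return {
        "kernel_ridge": make_pipeline(
            StandardScaler(), KernelRidge(kernel="rbf", alpha=1.0)
        ),
        "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
        # Two hidden layers of 64 units each (sizes are placeholders).
        "mlp": make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=seed),
        ),
    }


def classification_models(seed=0):
    """Untuned starting points for the three classification algorithms."""
    return {
        # bootstrap=True means each tree is fit on a bagged sample of the data.
        "random_forest": RandomForestClassifier(
            n_estimators=100, bootstrap=True, random_state=seed
        ),
        "svm_rbf": make_pipeline(
            StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
        ),
        "mlp": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=seed),
        ),
    }
```

Keeping the models in a dictionary makes it easy to loop over them with the same tuning and evaluation code, so that every algorithm is treated identically in the experiments.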

2 Analysis

Your report should be a PDF with a maximum of 15 pages or 5000 words (whichever is shorter). It should
include the following sections.

Introduction to the dataset. Explain what the data consist of, what the underlying supervised learning
problem is, and why it’s important to you. Explain the metric(s) that you will use to measure performance.

Description of the algorithms. Briefly describe each of the algorithms you will be testing. Identify
the hyperparameters that need to be tuned in the next section, and explain how they are related to the
complexity of the hypothesis class.
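One way to make the link between a hyperparameter and model complexity concrete is a validation curve. The sketch below assumes scikit-learn and uses k-Neighbors Regression on synthetic data as an illustration: a small `n_neighbors` gives a very flexible model (it fits the training data almost perfectly), while a large value gives a smoother, simpler one. The dataset, the range of k values, and the R² metric are all placeholder choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a real regression dataset (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

k_values = [1, 3, 10, 30, 100]
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(),
    X,
    y,
    param_name="n_neighbors",
    param_range=k_values,
    cv=5,
    scoring="r2",
)

# Each row corresponds to one value of k, each column to one CV fold.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for k, tr, va in zip(k_values, train_mean, val_mean):
    print(f"k={k:3d}  train R^2={tr:.3f}  validation R^2={va:.3f}")
```

A large gap between training and validation scores at small k, shrinking as k grows, is the kind of evidence that supports a written explanation of how the hyperparameter controls complexity.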

Tuning hyperparameters. Describe a set of experiments for tuning the relevant hyperparameters for each
method using cross-validation (or another appropriate tuning technique). Include figures (graphs and/or
tables) showing the performance of each algorithm in terms of the metric(s) introduced in the first section.
Justify the claim that the data has a non-trivial distribution. Report the final hyperparameters that you
chose, the performance the chosen hyperparameters achieve, and how much computation time was necessary
to run each experiment.
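A tuning experiment of this kind could be sketched with scikit-learn's `GridSearchCV`, timing the search with `time.perf_counter`. Everything specific here is an assumption made for illustration: the synthetic data, the 75/25 split, the SVM pipeline, the grid values, and accuracy as the metric. Note that the test split is set aside before tuning and is never passed to the search.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)

# Hold out a test set first; only the tuning split is used for cross-validation.
X_tune, X_test, y_tune, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
param_grid = {"svm__C": [0.1, 1.0, 10.0], "svm__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")

start = time.perf_counter()
search.fit(X_tune, y_tune)
elapsed = time.perf_counter() - start

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
print(f"wall-clock time for the search: {elapsed:.2f} s")
```

The `search.cv_results_` dictionary holds the mean score and fit time for every grid point, which is convenient raw material for the tables and graphs the report asks for.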

Comparing algorithm performance. Describe a set of experiments for comparing the performance of
each tuned model with the others on a held-out test set (unseen during tuning). Be sure that your results
include confidence intervals or other appropriate measures of statistical significance. Based on the results,
draw a conclusion about which algorithm works best and why. Support your assertions with figures.
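One common way to attach a confidence interval to a test-set metric is the percentile bootstrap: resample the test predictions with replacement many times and take quantiles of the resulting scores. The sketch below uses only NumPy. The function names, the number of resamples, and the toy labels are assumptions for illustration, and this interval reflects only the variability due to the finite test set, not variability from retraining the model.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))


def bootstrap_ci(y_true, y_pred, metric=accuracy, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a test-set metric.

    Returns (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)

    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample test-set indices with replacement.
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_pred[idx])

    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), float(lower), float(upper)


# Toy held-out labels and predictions (assumption: replace with real model output).
y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

point, lo, hi = bootstrap_ci(y_test, y_hat)
print(f"accuracy = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Computing the interval for each tuned model on the same held-out test set gives a first indication of whether an observed difference between algorithms is larger than the sampling noise.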

Conclusion. Compare the algorithms based on metrics other than performance (e.g. training time, number
of hyperparameters). Use this, along with performance, to make a claim about what algorithm you would
use in practice on real world data from the same domain.

Acknowledgements. List any external sources you used to help you with this assignment. This includes
software libraries, other students, and websites. When in doubt, cite it!

Submission format
Please submit one zip file named final-GTusername.zip, replacing “GTusername” with your username (for example “final-bhrolenok3.zip”). This zip file should include:
• report.pdf A PDF with your written analysis. See “Analysis” above. To maintain some anonymity
when grading, please do not put your name on the PDF itself.
• README A text file that explains how to reproduce your experiments. See “Reproducibility” below.
• Anything else needed to reproduce your experiments (code, raw data, preprocessed data, etc.).

[1] https://scikit-learn.org/stable/modules/kernel_ridge.html
[2] https://scikit-learn.org/stable/modules/neighbors.html#regression
[3] https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
[4] https://scikit-learn.org/stable/modules/ensemble.html#adaboost

3 Reproducibility

Reproducibility is an important part of experimental design. Accordingly, the expectations for reproducibility
for this project are higher than they were for the previous assignments.

Obtaining the data
If at all possible, please include your dataset(s) in your zip file. The Canvas file size limit is 500 MB, so try
to choose datasets that are smaller than this. If uploading your dataset is not possible, please explain in
your README how to obtain the data. Try to include a URL. If even this is not possible (e.g. your dataset is
protected by privacy or intellectual property laws), please obtain prior permission from the instructor.

Re-running your experiments
Your README should also explain how, starting from the raw data and with only a handful of terminal
commands, we can re-generate all of the graphs and tables in your report.pdf. In particular, it should
explain:
• how to install software dependencies
• what terminal commands to run to re-generate your graphs and figures.
Remember, this should require as input only the raw data (e.g. loading pre-trained regression parameters
from a file is not allowed).
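A single entry-point script with a fixed random seed is one way to meet these expectations. The following is a minimal sketch using only the Python standard library: the flag names `--data`, `--out`, and `--seed`, the default seed value, and the 80/20 split are all hypothetical choices, and the script only performs the deterministic train/test split that the rest of a pipeline would build on.

```python
import argparse
import csv
import random
from pathlib import Path

DEFAULT_SEED = 4641  # arbitrary fixed seed (assumption), so reruns give identical splits


def load_raw(path):
    """Read the raw CSV and return (header, rows)."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


def split_rows(rows, test_fraction, seed):
    """Shuffle deterministically and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Re-create the train/test split.")
    parser.add_argument("--data", required=True, help="path to the raw CSV file")
    parser.add_argument("--out", default="results", help="output directory")
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)
    args = parser.parse_args(argv)

    header, rows = load_raw(args.data)
    train, test = split_rows(rows, test_fraction=0.2, seed=args.seed)

    out_dir = Path(args.out)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in (("train", train), ("test", test)):
        with open(out_dir / f"{name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train), len(test)


# In a standalone script, finish with:
#     if __name__ == "__main__":
#         main()
```

Saved under a name such as run_experiments.py (a hypothetical file name), the README could then list one command of the form `python run_experiments.py --data data/raw.csv --out results`, followed by similar commands for the tuning and plotting steps.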

Appendix

Extra credit
At their own discretion, the TAs may offer a few points of extra credit for exceptionally good reports. Some
qualities that make a good report include:
• Conciseness: A short report will get more extra credit than a long one.
• Formatting: A well-formatted PDF will make the reader well-disposed toward you.

Examples of interesting datasets
You may use any dataset(s) that you like for this assignment, but here are some ideas to get you started.
• https://www.dataquest.io/blog/free-datasets-for-projects/
• https://www.cooldatasets.com/
• https://github.com/awesomedata/awesome-public-datasets