## Description

1 Before the Introductory lab

In the lab you’ll familiarize yourself with:

1. scikit-learn, a powerful machine learning library for Python.

2. pandas, a powerful Python data analysis toolkit.

3. numpy and scipy, powerful scientific computing packages for Python.

4. matplotlib, a Python 2D plotting library.

The scikit-learn documentation can be found at http://scikit-learn.org/stable/

The pandas documentation can be found at http://pandas.pydata.org/pandas-docs/stable/

The numpy and scipy documentation can be found at http://docs.scipy.org/doc/

The matplotlib documentation can be found at http://matplotlib.org/

You just need to execute the import statements at the top of the lab .ipynb notebook to import these standard Python libraries.
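For reference, that import cell is typically along these lines (a sketch; the exact imports in your notebook may differ):

```python
# Standard imports for the lab (a sketch; match the notebook's own import cell)
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
```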

2 In the Introductory lab

In the Introductory lab section, we will go through the process of loading a dataset called 20 Newsgroups and running a baseline classification using logistic regression.

The data you need is distributed on the course website but should be identical to the data found at http://qwone.com/~jason/20Newsgroups/ with the following description:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
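For orientation, the baseline pipeline from the lab has roughly this shape. The snippet below is a sketch on a tiny toy corpus (the real lab loads the 20 Newsgroups files distributed on the course website; scikit-learn's `fetch_20newsgroups` loader provides the same corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the 20 Newsgroups documents and their class labels
docs = ["the patient needs medicine", "the rocket reached orbit",
        "doctors prescribe medicine", "orbit insertion of the rocket"] * 10
labels = [0, 1, 0, 1] * 10

# Binary {0, 1} encoding of the most common words, as in the lab baseline
vec = CountVectorizer(binary=True, max_features=1000)
X = vec.fit_transform(docs)

# Single train/test split, then a logistic regression baseline
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
```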

3 Main Assignment

In the introductory lab, you have been exposed to the 20 newsgroup dataset. The lab included loading and basic preparation of the data and a baseline classification using logistic regression, and covered the libraries used in this assignment.

In this assignment, we will analyze different choices when configuring your classifier:


**Feature Set** Feature selection may not only improve classification performance (especially when data is sparse), but it can also improve computational performance. For this assignment, you can experiment with any features and feature selection technique, although you will probably find single-term (unigram) features and a frequency-based feature selection approach to work well.

**Feature Encoding** As two simple variations of feature encoding, consider a boolean {0, 1} vector encoding (as produced in the lab) as well as a term frequency (TF) encoding (which you need to produce yourself).
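With scikit-learn's `CountVectorizer`, the two encodings differ only in the `binary` flag (a sketch; the template notebook may construct its encodings differently):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple apple banana", "banana cherry"]

# Boolean {0, 1} encoding: 1 if the term occurs at all in the document
binary_vec = CountVectorizer(binary=True)
X_bin = binary_vec.fit_transform(docs).toarray()

# Term-frequency (TF) encoding: raw occurrence counts per document
tf_vec = CountVectorizer(binary=False)
X_tf = tf_vec.fit_transform(docs).toarray()

# Vocabulary is sorted alphabetically: apple, banana, cherry
# X_bin -> [[1, 1, 0], [0, 1, 1]]
# X_tf  -> [[2, 1, 0], [0, 1, 1]]
```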

**Amount of Data** While generally you should use all training data available, it is instructive to vary the amount of training data provided to an algorithm to assess the impact of the amount of data on learning performance.

**Hyperparameters** All algorithms typically have at least one hyperparameter that needs tuning. We should use cross-validation for tuning hyperparameters in practice, but in this assignment, we will simply analyze performance as a function of hyperparameters.

We provide a template notebook for the coding part that you must use in your submission. Autograding requires that function names are not modified and that the entire notebook executes in one pass from top to bottom in the Colab environment (please verify before submitting). Please fill in the missing parts of the functions in the template notebook to provide the experimental results. The function you need to implement for each part of the assignment is clearly noted in the question description, and the return value is validated using asserts at the end of the functions in the template notebook.

Please note the following:

• For all questions except Q3, use a hyperparameter value that you find performs well (it can be the default value). In general one should use nested cross-validation (CV) for hyperparameter tuning, but we avoid that here due to the time-consuming nature of nested CV.

• Do NOT change the function name. We need that for grading.

• Do NOT use/refer to global variables in the functions. This will cause your code to fail during autograding because we execute functions independently of any content outside the function. However, you can (and should) call other functions when needed.

Please answer the following questions:

Q1. Binary Encoding

(a) The function binary_baseline_data takes in a list of files and runs a baseline evaluation based on a binary encoding of the most common words as the feature set. Based on the above description of choices, please describe the feature set, the amount of data, and the hyperparameters used in this baseline.

(b) Try to improve the results of the baseline by improving (only) the feature set. You can use all the techniques covered in the IR lab to improve your features (e.g., stemming, lemmatization, lowercasing, stopwords; you can use NLTK for this purpose). Your code should be written in the provided function binary_improved_data (input and return values should be similar to binary_baseline_data).


(c) Calculate the train accuracy and test accuracy of your new function (partial code provided). How did the results change?

(d) Different train-test splits can lead to different results. In order to get a more robust estimate of the performance of your classifier, we want to calculate the mean and the 95% confidence interval on the accuracy of the classifier over a set of multiple runs with random splits. Notice that the function train_test_split takes an argument random_state that can be used to create different (random) splits by passing a random value to this argument. Please implement the function random_mean_ci that creates multiple random splits of your dataset (the argument num_tests will determine the number of splits to evaluate) and returns a tuple (train_mean, train_ci_low, train_ci_high, test_mean, test_ci_low, test_ci_high) that represents the mean and the low and high ends of the 95% confidence interval for both the training accuracy and the test accuracy. We recommend you use test_size=0.3 in train_test_split.

Note the following:

• To generate random numbers for the random_state, you can use random.randint(1,1000), which generates a random integer in the range 1 to 1000.

• The code to calculate the mean and confidence interval is provided, given lists of accuracy results (the variables train_results, test_results) for the different random splits.
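For intuition, the provided mean/CI computation is along these lines (a sketch using a t-distribution interval; mean_ci is a hypothetical name, and the template's exact formula may differ):

```python
import numpy as np
import scipy.stats


def mean_ci(results, confidence=0.95):
    """Mean and 95% confidence interval of a list of accuracy results."""
    results = np.asarray(results, dtype=float)
    mean = results.mean()
    # Standard error of the mean, scaled by a t-distribution critical value
    sem = scipy.stats.sem(results)
    half_width = sem * scipy.stats.t.ppf((1 + confidence) / 2, len(results) - 1)
    return mean, mean - half_width, mean + half_width


m, lo, hi = mean_ci([0.80, 0.82, 0.78, 0.81, 0.79])
```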

(e) Run the above function for 10 iterations (num_tests=10, see provided code). What do the average and 95% confidence intervals tell you? Are they more informative than a single trial? Yes or no, and why? [2 sentences.]

(f) Implement a function random_cm that produces a confusion matrix based on multiple random splits. Such a matrix is created by summing the confusion matrices for the different splits. Build a confusion matrix based on the results of 10 iterations (produced, as before, by calling the train_test_split function with random random_state values). Note that partial code is provided that includes the summation of the different confusion matrices.
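The summation in the provided code is conceptually as follows (a sketch on toy predictions; passing labels fixes the matrix shape so matrices from different splits can be added):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Fixing the label order keeps every per-split matrix the same shape
labels = [0, 1]
total = np.zeros((2, 2), dtype=int)

# Two toy "splits" of (y_true, y_pred) pairs
for y_true, y_pred in [([0, 1, 1], [0, 1, 0]), ([0, 0, 1], [0, 1, 1])]:
    total += confusion_matrix(y_true, y_pred, labels=labels)

# total -> [[2, 1], [1, 2]] (rows: true class, columns: predicted class)
```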

(g) Show the confusion matrix for 10 random splits (num_tests=10, see provided code). Are some classes more easily confused with others? Which ones and why? [2 sentences.]

Q2. Number of Features

In this question, you will vary the number of words used as features and see how it affects the results. Please be careful to only use a single train/test split for this evaluation.

(a) Calculate the train accuracy and the test accuracy when using p percent of the features, p ∈ [10%, 20%, 40%, 60%, 80%, 100%]. The function feature_num has partial code you need to complete. It returns a dataframe of the results.
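One way to keep only the top p percent of the most frequent terms is via CountVectorizer's max_features (a sketch on a toy corpus; feature_num in the template may parameterize this differently):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["aa bb cc dd", "bb cc dd ee", "cc dd ee ff"]

# Size of the full vocabulary over the corpus
full_vocab = len(CountVectorizer().fit(docs).vocabulary_)

# Keep only the top p percent of terms, ranked by corpus frequency
p = 0.4
vec = CountVectorizer(binary=True, max_features=max(1, int(p * full_vocab)))
X = vec.fit_transform(docs)
```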

(b) Use the provided code to plot the results. Explain any trends you see (average over multiple trials if trends are not clear). [1 sentence.]

Q3. Hyperparameter Tuning

(a) Calculate the train accuracy and the test accuracy for different values of the hyperparameter C: [10⁻³, 10⁻², …, 10⁰, …, 10³]. The function hyperparameter has partial code you need to complete. It returns a dataframe of the results.
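The sweep over C has roughly this shape (a sketch on synthetic data; the real function uses the encoded 20 Newsgroups features):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for the encoded 20 Newsgroups features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rows = []
for C in [10.0**k for k in range(-3, 4)]:  # 10^-3, 10^-2, ..., 10^3
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    rows.append({"C": C,
                 "train_acc": clf.score(X_tr, y_tr),
                 "test_acc": clf.score(X_te, y_te)})
results = pd.DataFrame(rows)
```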


(b) Use the provided code to plot the results (we use a logarithmic x-axis). Explain any trends you see (average over multiple trials if trends are not clear). [1 sentence.]

Note: In practice, you should tune hyperparameters on a validation set only, never on the test set!

Q4. Feature Encoding

In this question, you will evaluate the effect of using a term-frequency (TF) encoding instead of a binary encoding on this dataset.

(a) Implement a TF encoding in the function tf_improved_data. You should use your improved function binary_improved_data from Q1 (b) as a base, and change the encoding from binary to TF.

(b) Compare the two encodings by comparing the mean accuracy and 95% confidence interval. Use the function random_mean_ci from Q1 (d) and run enough trials to obtain non-overlapping 95% CIs on the average accuracy of each method. (Note this is technically statistically unsound experimentation, but it will suffice for our basic analysis here.) Which method performs better on this dataset? Why do you think this occurs? [1 sentence.]

Q5. Comparison vs. Naive Bayes

In this question, you will compare the mean accuracy and 95% confidence interval of the logistic regression classifier to those of a naive Bayes (NB) classifier.

(a) Implement a naive Bayes classifier evaluated over multiple random splits in the function nb_random_mean_ci. You should use your random_mean_ci function from Q1 (d) as a base, and change the classifier from logistic regression to NB. Use the encoding (binary or TF) you found to be better.
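Swapping the classifier amounts to replacing LogisticRegression with a naive Bayes estimator; BernoulliNB matches a binary encoding and MultinomialNB matches TF counts (a sketch on a toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy stand-in for the encoded documents
docs = ["medicine dose patient", "rocket launch orbit"] * 20
labels = [0, 1] * 20

X = CountVectorizer(binary=True).fit_transform(docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

# BernoulliNB for binary features; use MultinomialNB() for TF counts
nb = BernoulliNB().fit(X_tr, y_tr)
nb_acc = nb.score(X_te, y_te)
```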

(b) Run enough trials to obtain non-overlapping 95% CIs on the average accuracy of each classifier. (Again, this is technically statistically unsound but it will suffice for our analysis.) Which method performs better on this dataset? Why do you think this occurs? [1 sentence.]

Q6. Binary Logistic Regression

In this question you will build a binary logistic regression classifier that is trained to classify the target sci.med vs. any other target. Use the binary encoding of features for this question.

(a) Implement the function binary_med_data that returns the features and targets dataframes. In this question there are only two possible targets: 1 for sci.med and 0 for any other label. You should use the code in binary_improved_data as a base, and change the targets to be binary.
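Making the targets binary is a simple relabeling (a sketch with toy labels; the real target names come from the loaded dataset):

```python
import numpy as np

# Toy stand-in: multiclass target names and integer-coded labels
target_names = ["alt.atheism", "sci.med", "sci.space"]
y = np.array([0, 1, 2, 1, 0])

# Map to 1 for sci.med and 0 for any other label
med_idx = target_names.index("sci.med")
y_binary = (y == med_idx).astype(int)
# y_binary -> [0, 1, 0, 1, 0]
```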

(b) Using the function random_mean_ci from Q1 (d), calculate the average accuracy and 95% confidence interval over ten iterations (num_tests=10, see provided code). What do the average and 95% confidence intervals tell you? How do they compare to the multiclass logistic regression in Q1? [1 sentence.]
