# CSCI 5525: Machine Learning Homework 4 solution


## Description

1. (35 points) In this problem, we consider AdaBoost. Implement the AdaBoost algorithm with
100 weak learners and apply it to the cancer dataset described above. For the weak learners,
use decision stumps (1-level decision trees). You must implement the decision stumps from
scratch. Use information gain as the splitting measure.
Please submit (a) summary of methods and results report and (b) code:
(a) Summary of methods and results: Briefly describe the approaches used above, along
with relevant equations. Report a plot of the classification error on both the train and
test sets as the number of weak learners increases. (One plot where the x-axis is the
number of weak learners from 1 to 100 and the y-axis is the classification error.)
(b) Code: Submit the file adaboost.py which contains the function def adaboost(dataset: str) -> None:. The function takes in a string of the dataset filename and does not return anything but must print out to the terminal (stdout) the train and test classification error rates as the number of weak learners increases.
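Not part of the required deliverable, but the core of such an implementation might be sketched as follows. This assumes binary labels encoded as {-1, +1}; `adaboost_fit`, `fit_stump`, and the other names are illustrative helpers, not the required `adaboost(dataset)` entry point:

```python
import numpy as np

def entropy(y, w):
    """Weighted entropy of binary labels y in {-1, +1} with sample weights w."""
    p = w[y == 1].sum() / w.sum()
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold) split with the highest
    weighted information gain; each side predicts its weighted majority label."""
    best, base = None, entropy(y, w)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if not left.any() or not right.any():
                continue
            gain = base - (w[left].sum() * entropy(y[left], w[left]) +
                           w[right].sum() * entropy(y[right], w[right])) / w.sum()
            if best is None or gain > best[0]:
                pl = 1 if w[left][y[left] == 1].sum() >= w[left][y[left] == -1].sum() else -1
                pr = 1 if w[right][y[right] == 1].sum() >= w[right][y[right] == -1].sum() else -1
                best = (gain, j, t, pl, pr)
    return best[1:]  # (feature, threshold, left_label, right_label)

def stump_predict(stump, X):
    j, t, pl, pr = stump
    return np.where(X[:, j] <= t, pl, pr)

def adaboost_fit(X, y, n_learners=100):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # start with uniform sample weights
    ensemble = []
    for _ in range(n_learners):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        err = w[pred != y].sum()     # weighted training error of this stump
        if err == 0:                 # perfect stump: keep it and stop
            ensemble.append((1.0, stump))
            break
        if err >= 0.5:               # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified points
        w /= w.sum()
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

The train/test curves for the report come from evaluating `adaboost_predict` on prefixes of the ensemble of increasing length 1 to 100.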
2. (35 points) In this problem, we consider Random Forests. Implement the Random Forest
algorithm with 100 decision stumps. You must implement the decision stumps from scratch.
Use information gain as the splitting measure. Apply your Random Forest implementation
to the cancer dataset described above and do the following:
(i) Use m = 3 random attributes to determine the split of your decision stumps. Learn a
model for an increasing number of decision stumps in the ensemble. Compute the train
and test set classification error as the number of decision stumps increases.
(Cancer dataset: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29)
(ii) Vary the number of random attributes m from {2, . . . , p = 10} and fit a model using
100 decision stumps. Compute the train and test set classification error as the number
of random features m increases.
Please submit (a) summary of methods and results report and (b) code:
(a) Summary of methods and results: Briefly describe the approaches used above, along
with relevant equations. Report a plot of the classification error on both the train and
test sets for both (i) and (ii) above. (A total of 2 plots where for (i) the x-axis is the
number of decision trees and y-axis is the classification error, and (ii) the x-axis is the
number of random features m and y-axis is the classification error.)
(b) Code: Submit the file rf.py which contains the function def rf(dataset: str) -> None:. The function takes in a string of the dataset filename and does not return anything but must print out to the terminal (stdout) the train and test classification error rates for (i) and (ii) above.
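The two sources of randomness, a bootstrap sample of the rows and a random subset of m attributes per stump, might be sketched as follows. Labels are again assumed to be in {-1, +1}, and `rf_fit`/`rf_predict` are illustrative names, not the required `rf(dataset)` entry point:

```python
import numpy as np

def entropy(y):
    """Entropy of binary labels y in {-1, +1}."""
    p = np.mean(y == 1)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fit_stump(X, y, feats):
    """Best information-gain split, searching only the candidate features."""
    best, base = None, entropy(y)
    for j in feats:
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if not left.any() or not right.any():
                continue
            gain = base - (left.mean() * entropy(y[left]) +
                           right.mean() * entropy(y[right]))
            if best is None or gain > best[0]:
                pl = 1 if (y[left] == 1).sum() >= (y[left] == -1).sum() else -1
                pr = 1 if (y[right] == 1).sum() >= (y[right] == -1).sum() else -1
                best = (gain, j, t, pl, pr)
    return best[1:] if best else None

def rf_fit(X, y, n_trees=100, m=3, rng=None):
    """Random forest of stumps: bootstrap rows, m random attributes per stump."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                           # bootstrap sample
        feats = rng.choice(p, size=min(m, p), replace=False)  # random attributes
        stump = fit_stump(X[idx], y[idx], feats)
        if stump is not None:
            forest.append(stump)
    return forest

def rf_predict(forest, X):
    """Unweighted majority vote over all stumps (ties go to +1)."""
    votes = sum(np.where(X[:, j] <= t, pl, pr) for j, t, pl, pr in forest)
    return np.where(votes >= 0, 1, -1)
```

For part (i), evaluate `rf_predict` on growing prefixes of the forest; for part (ii), refit with each m in {2, ..., 10}.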
3. (30 points) In this problem, we consider k-means for image segmentation. We can use
k-means to cluster pixels with similar (color) values together to generate a segmented or
compressed version of the original image. Implement the k-means algorithm and apply it to
the provided image “umn csci.png”. For each k = {3, 5, 7}, generate a segmented image and
compute the cumulative loss (i.e., distortion measure from the lecture notes). (Note, it may
be helpful to test on a smaller version of the image “umn csci.png” to ensure your code works
but report final results on the full version.)
Please submit (a) summary of methods and results report and (b) code:
(a) Summary of methods and results: Briefly describe the approaches used above, along
with relevant equations. For each value of k = {3, 5, 7}, report the final (i.e., after k-means has converged) segmented image and a plot of the cumulative loss during training. (This will be 3 segmented images and 3 plots of the loss where the x-axis is the training iteration number and the y-axis is the loss value.)
(b) Code: Submit the file kmeans.py which contains the function def kmeans(image: str) -> None:. The function takes in a string of the image to segment and does not return anything but must print out to the terminal (stdout) the cumulative loss at each iteration during training.
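The clustering core might be sketched as follows; `kmeans_segment` is an illustrative helper (not the required `kmeans(image)` entry point), and it assumes the image has already been flattened into an (n, d) array of pixel values:

```python
import numpy as np

def kmeans_segment(pixels, k, n_iter=100, rng=None):
    """Lloyd's algorithm on an (n, d) array of pixel values.
    Returns (centers, labels, losses); losses traces the distortion
    J = sum_n ||x_n - mu_{c(n)}||^2 after each assignment step."""
    pixels = np.asarray(pixels, dtype=float)
    if rng is None:
        rng = np.random.default_rng(0)
    # initialize centers at k distinct randomly chosen pixels
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    losses = []
    for _ in range(n_iter):
        # assignment step: nearest center for every pixel
        d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        losses.append(d2[np.arange(len(pixels)), labels].sum())
        # update step: each center moves to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
        if len(losses) > 1 and np.isclose(losses[-1], losses[-2]):
            break  # converged: distortion stopped decreasing
    return centers, labels, losses
```

For an H x W RGB image, `pixels` would be the array reshaped to (H*W, 3); the segmented image is then `centers[labels]` reshaped back to (H, W, 3). Note the distortion is non-increasing across iterations, which is a useful sanity check on the loss plot.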
Additional instructions: Code can only be written in Python 3.6+; no other programming languages will be accepted. One should be able to execute all programs from the Python command line. Each function must take the inputs in the order specified in the problem and display the textual output via the terminal; plots/figures should be included in the report.
For each part, you can submit additional files/functions (as needed) which will be used by the
main file. In your code, you cannot use machine learning libraries such as those available from
scikit-learn for learning the models. However, you may now use scikit-learn for cross validation
and computing misclassification errors. You may also use libraries for basic matrix computations
and plotting such as numpy, pandas, and matplotlib. Put comments in your code so that one can follow what it is doing.
Your code must be runnable on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu).
One option is to SSH into a machine. Learn about SSH at these links: https://cseit.umn.edu/
and https://cseit.umn.edu/knowledge-help/remote-linux-applications-over-ssh.
## Instructions
Follow the rules strictly. If we cannot run your code, you will not get any credit.
• Things to submit
1. hw4.pdf: The report that contains the solutions to Problems 1, 2, and 3 including the
summary of methods and results.
2. adaboost.py: Code for Problem 1.
3. rf.py: Code for Problem 2.
4. kmeans.py: Code for Problem 3.