Description
1. Introduction
In this lab you will implement a Bayes Classifier and the Adaboost algorithm that
improves the performance of a weak classifier by aggregating multiple hypotheses generated across different distributions of the training data. Some predefined functions for
visualization and basic operations are provided, but you will have to program the key
algorithms yourself.
In this exercise we will work with three well known machine learning datasets for classification and see how different modeling assumptions affect the classification. Each dataset
is provided in the form of two csv files: datasetnameX.txt for the feature vectors and
datasetnameY.txt for the corresponding labels. You will be provided with functions
that will import the datasets into Python.
The datasets we will work with are the
following:
Figure 1. Example of Iris flower, by David Iliff. License: CC-BY-SA 3.0.
1.1. Iris. This dataset contains 150 instances of 3 different types of iris plants. The
feature consists of 2 attributes describing the characteristics of the Iris. Our task is to
classify instances as belonging to one of the types of Irises.
1.2. Vowel. This dataset contains 528 instances of utterances of 11 different vowels.
Our task is to classify instances as belonging to one of the types of the vowels.
Figure 2. Visualization of the different classes of vowels.
Figure 3. A few example images from the Olivetti faces dataset.
1.3. (Voluntary) Olivetti Faces. This dataset contains 400 different images of the
faces of 40 different persons, each person is represented by 10 images. The images
were taken at different times with varying lighting and facial expressions against a dark
homogeneous background. They are all grayscale, and 64×64 pixels. Our task is to
classify instances as belonging to one of the people.
More information on the different datasets can be found at https://archive.ics.uci.edu/ml/.
2. Code skeleton & Python packages
You will use Python for this lab assignment. You have the option to use either pure
Python or a Jupyter notebook. There are code skeletons available for both in lab3.py
and lab3.ipynb respectively. At the beginning of both is a short description of Jupyter
and how to install it.
You will need the following Python packages for this exercise: numpy, scipy, matplotlib
and sklearn. For nicer plots, you may also use the seaborn library. To enable this,
uncomment lines 12-13 in labfuns.py.
3. REALLY Useful Tips For Python and Numpy
3.1. First of all.
• Use logical indexing and broadcasting.
• Iterate over classes not data points.
• Try to avoid for loops when possible. If you cannot think in matrix multiplications, do the for loop first then simplify.
• Remember that 2/3 might not be the same as 2./3., that is, integer division
vs. float division.
• Comment your code.
3.2. Numpy Creation of Matrices. Here are some simple instructions for matrices.
It is good to preallocate the matrix before usage to avoid resizing, etc.
import numpy as np
# Dimensions of Matrices
rows = 7
cols = 4
depth = 5
# Creating matrices
A = np.zeros((rows,cols)) # 2D Matrix of zeros
print(A.shape)
>> (7,4)
A = np.zeros((depth,rows,cols)) # 3D Matrix of zeros
A = np.ones((rows,cols)) # 2D Matrix of ones
A = np.array([(1,2,3),(4,5,6),(7,8,9)]) # 2D 3×3 matrix with values
# Turn it into a square diagonal matrix with zeros of-diagonal
d = np.diag(A) # Get diagonal as a row vector
d = np.diag(d) # Turn a row vector into a diagonal matrix
print(d)
>> array([[1, 0, 0],
[0, 5, 0],
[0, 0, 9]])
# This works
print(1.0/A)
>> array([[ 1. , 0.5 , 0.33333333],
[ 0.25 , 0.2 , 0.16666667],
[ 0.14285714, 0.125 , 0.11111111]])
3.3. Numpy reshaping. Sometimes you will need to convert a vector of size (N,) to a
matrix (N,1) or a matrix to vector
a = np.array((1,2,3))
print(a.shape)
>>(3,)
print(a.reshape(-1,1).shape)
>>(3,1)
print(a.reshape(1,-1).shape)
>>(1,3)
LAB 3: BAYESIAN LEARNING AND BOOSTING 4
print(a.reshape(1,-1).reshape(-1).shape)
>>(3,)
3.4. Numpy Logical Indexing. Given a data vector X of row vectors and a label
vector y we want to extract the data points for one of the labeled classes. If we assume
that y consists of four different classes and we want to extract all vectors from X for the
specific classes for the corresponding label we can do the following:
X,y = getData()
classes = np.unique(y) # Get the unique examples
# Iterate over both index and value
for jdx,class in enumerate(classes):
idx = y==class # Returns a true or false with the length of y
# Or more compactly extract the indices for which y==class is true,
# analogous to MATLAB’s find
idx = np.where(y==class)[0]
xlc = X[idx,:] # Get the x for the class labels. Vectors are rows.
3.5. Broadcasting. Sometimes we want to subtract or add a row vector u to a matrix
X consisting of row vectors. Instead of iterating over all rows in X we can make use
of something in numpy called broadcasting. Broadcasting vectorizes arithmetic array
operations such that looping occurs in C instead of Python.
It can be used in the
following way:
# Subtract the vector using for loop
X = np.array([(1,2,3),(4,5,6),(7,8,9)])
u = np.array((1,2,3))
for i_row in range(0,X.shape[0]):
X[i_row,:] = X[i_row,:] – u
print(X)
>> array([[0, 0, 0],
[3, 3, 3],
[6, 6, 6]])
# Subtract using broadcasting
X = np.array([(1,2,3),(4,5,6),(7,8,9)])
u = np.array((1,2,3))
X = X – u
print(X)
>> array([[0, 0, 0],
[3, 3, 3],
[6, 6, 6]])
4. Bayesian Learning
4.1. Bayesian model fitting. In Bayesian model fitting we wish to infer the model
parameters, α, given the data, D. Using Bayes’ theorem we can write the posterior for
α as,
P(α|D) = P(D|α)P(α)
P(D)
. (1)
Here P(D|α) is the likelihood of the data under our model with the given parameters.
P(α) is the prior probability of the parameters and P(D) is a normalizing constant, the
model evidence. In words this becomes,
P osterior =
Likelihood × P rior
Evidence (2)
4.2. Bayesian Classification. In a classification scenario we want to classify a set
of points as belonging to one of a given set of classes, C. To classify a point x
∗ as
belonging to the class k we want to compute the class posterior for each of the classes.
A straightforward application of Bayes’ theorem gives,
p(k|x
∗
) = pk(x
∗
|k) p(k)
P
k
0∈C
pk
0(x∗|k
0) p(k
0)
, (3)
where pk(x
∗
|k) is the class conditional density, p(k) is the prior probability of a point
belonging to class k and the sum in the denominator is a normalizing constant. To
classify a point we pick the class that has max posterior probability.
4.3. Modeling. In this lab we will assume that the density that best models the data
for each of the classes, indexed by k, is multivariate Gaussian,
pk(x|k) = pk(x|µk
, Σk) = 1
p
(2π)
d|Σk|
exp
−
1
2
(x − µk
)Σk
−1
(x − µk
)
T
(4)
where x is a d-dimensional row vector, µk
is the mean vector, Σk is the covariance
matrix and |Σk| the determinant.
In a fully Bayesian treatment we would place a prior over the parameters and marginalize them out, but to simplify things we instead choose to find the parameters by the
maximum likelihood (ML) estimate. We also assume that all of the data points are independent and identically distributed (i.i.d).
We write the likelihood for the ML-estimate
as,
arg max
µk
,Σk
L(µk
, Σk|Dk) = pk(Dk|µk
, Σk) ∝
Y
N
{i|ci=k}
pk(xi
|µk
, Σk), (5)
where Dk are the points in D belonging to class k.
A less complicated form of the Gaussian to work with is the log transform, the loglikelihood,
ln(pk(x|µk
, Σk)) = −
1
2
ln(|Σk|) −
1
2
(x − µk
)Σ
−1
k
(x − µk
)
T −
d
2
ln(2π) (6)
To maximize log likelihood we take the derivate with respect to both µk and Σk and
equate to zero,
d ln(L)
dµk
= 0,
d ln(L)
dΣk
= 0. (7)
After some manipulation we arrive at the following expressions for the ML-estimate of
µk and Σk. ci denotes the class of the i:th training instance:
µk =
P
{i|ci=k} xi
Nk
(8)
Σk =
1
Nk
X
{i|ci=k}
(xi − µk
)
T(xi − µk
). (9)
For simplicity, we will assume that all of the feature dimensions are uncorrelated, with
no off diagonal covariance elements. This allows us to further simplify the expression for
the covariance matrix at diagonal indices (m, m) and (m, n), m 6= n respectively to
Σk (m, m) = 1
Nk
X
{i|ci=k}
(xi (m) − µk
(m))2
, and Σk (m, n) = 0 for m 6= n. (10)
This is knowm as the naive Bayes assumption, making it a Naive Bayes Classifier.
4.4. Assignment 1. Write a function, mlParams(X,labels), that computes the MLestimates of µk and Σk for the different classes in the dataset. X here is a set of row
vectors, and labels are the class labels for each of the data points (again, ignore the
W argument for now).
The function should return a C × d-array mu that contains the
class means, a C ×d×d-array sigma that contains the class covariances. The covariance
should be implemented using your own code and not by applying a library function.
Use the provided function, genBlobs(), that returns Gaussian distributed data points
together with class labels, to generate some test data. Compute the ML-estimates for
the data and plot the 95%-confidence interval using the function plotGaussians.
4.5. Classification. The next step is to program the discriminant function based on
the log posterior, δk(x) = ln(p(k|x)) for predicting the class of an unseen instance, x
∗
.
Using Eq. 3 and the log transform we can write the discriminant function as,
δk(x
∗
) = ln(p(k|x
∗
)) = ln(pk(x
∗
|k)) + ln(p(k)) − lnX
l∈C
pl(x
∗
|l)
= −
1
2
ln(|Σk|) −
1
2
(x
∗ − µk
)Σ
−1
k
(x
∗ − µk
)
T −
d
2
ln(2π) + ln(p(k)) − lnX
l∈C
pl(x
∗
|l)
= −
1
2
ln(|Σk|) −
1
2
(x
∗ − µk
)Σ
−1
k
(x
∗ − µk
)
T + ln(p(k)) + C.
(11)
Since we have diagonal covariance matrices Σk, there is no need to compute the inverse of
the full matrices. Rather, these products can be computed for each dimension separately.
When classifying new data points, we can ignore C when comparing the values given by
Eq. 11 as it will not vary with our test data or class assignments.
We compute the class prior, p(k), as the frequency of the occurrences of the different
classes,
p(k) = Nk
N
. (12)
4.6. Assignment 2.
(1) Write a function computePrior(labels) that estimates and returns the class
prior in X (ignore the W argument).
(2) Write a function classifyBayes(X,prior,mu,sigma) that computes the discriminant function values for all classes and data points, and classifies each point
to belong to the max discriminant value. The function should return a length N
vector containing the predicted class value for each point.
4.7. Assignment 3. We now have all functions we need for doing the training and
classification. Use the provided function testClassifier to test the accuracy for the
vowels and iris datasets. testClassifier runs a loop that does the following things:
0. Uses the provided random partitioning function to split the dataset into a training and test dataset.
1. Trains your classifier on the training partition.
2. Evaluate the performance of the classifier on the test partition.
Run testClassifier for the datasets and take note of the accuracies. Use plotBoundary
to plot the decision boundary of the 2D iris dataset.
Answer the following questions:
(1) When can a feature independence assumption be reasonable and when not?
(2) How does the decision boundary look for the Iris dataset? How could one improve
the classification results for this scenario by changing classifier or, alternatively,
manipulating the data?
5. Boosting
5.1. Boosting a Weak Classifier. Boosting aggregates multiple hypotheses generated
by the same learning algorithm invoked over different distributions of training data into
a single composite classifier.
Boosting generates a classifier with a smaller error on the
training data as it combines multiple hypotheses which individually have a larger error
(but lower than 50%). Boosting requires unstable classifiers whose learning algorithms
are sensitive to changes in the training examples.
The idea of boosting is to repeatedly apply a weak learning algorithm on various distributions of the training data and to aggregate the individual classifiers into a single
overall classifier. After each iteration the distribution of training instances is changed
based on the error the current classifier exhibits on the training set.
The weight ωi of an
instance (xi
, ci) specifies its relative importance, which can be interpreted as if the training set would contain ωi
identical copies of the training example (xi
, ci). The weights ωi
of correctly classified instances (xi
, ci) are reduced, whereas those of incorrectly classified
instances are increased.
Thereby the next invocation of the learning algorithm will focus
on the incorrect examples.
In order to be able to boost the Bayes classifier, the algorithm for computing the MAP
parameters and the discriminant function has to be modified such that it can deal with
fractional (weighted) instances.
Assume that ωi
is the weight assigned to the i:th training
instance. Without going into a straightforward detailed derivation Equations 8, 10 for
the MAP parameter with weighted instances become:
µk =
P
{i|ci=k} ωixi
P
{i|ci=k} ωi
(13)
Σk (m, m) = 1
P
{i|ci=k} ωi
X
{i|ci=k}
ωi(xi (m) − µk
(m))2
, and Σk (m, n) = 0 for m 6= n.
(14)
5.2. Assignment 4: Extend the old mlParams function to mlParams(X, labels, W)
that handles weighted instances. Again X is N × d matrix of feature vectors, labels a
length N vector containing the corresponding labels and W is a N × 1 matrix of weights.
The signature should look like
def mlParams(X, labels, W)
…
return mu, sigma
Here, the types of return parameters mu and sigma are identical to the old mlParams.
The function computes the maximum posterior parameters µk and Σk for a dataset D
according to Equations 13-14. Assume the usual data format for the first two parameters.
Test your function mlParams(X, labels, W), for a uniform weight vector with ω =
1/N. The MAP parameters should be identical to those obtained with the previous
version of mlParams.
5.3. The Adaboost algorithm. The Adaboost algorithm repeatedly invokes a weak
learning algorithm, for example the BayesClassifier that we implemented, and after
each round updates the weights of training instances (xi
, ci). In the following, t will
denote the iteration round of the algorithm.
Further, we assume that ω is always
normalized, P
i ωi = 1. Adaboost proceeds as follows (repeating steps 1-4 for every t).
0. Initialize all weights uniformly ω
1
i = 1/N.
1. Train weak learner using distribution ω
t
.
2. Get weak hypothesis h
t and compute its error
t with respect to the weighted distribution ω
t
. In case of the Bayes classifier a single hypothesis h
t
is represented
by the the learned parameters (µ
t
k
, Σt
k
).
The weighted error is given by
t =
X
N
i=1
ω
t
i
1 − δ(h
t
(xi), ci)
,
where h
t
(xi) is the classification of instance xi made by the hypothesis h
t
. The
function δ(h
t
(xi), ci) is 1 if h
t
(xi) = ci and 0 otherwise.
3. Choose α
t =
1
2
ln(1 −
t
) − ln(
t
)
.
4. Update weights according to
ω
t+1
i =
ω
t
i
Zt
×
e
−α
t
if h
t
(xi) = ci
e
α
t
if h
t
(xi) 6= ci
,
where Z
t
is a normalization factor ensuring that P
i ω
t+1
i = 1.
The overall classification of the boosted classifier of an unseen instance x is obtained by
aggregating the votes casted by each individual classifier. As we have higher confidence
in classifiers that have a low error (large α
t
), their votes count relatively more.
The final
classification H(x) is the class cmax that receives the majority of votes
H(x) = cmax = arg max
ci
X
T
t=1
α
t
δ(h
t
(x), ci) (15)
5.4. Assignment 5:
(1) Modify computePrior to have the signature computePrior(labels, W), taking
the boosting weights ω into account. We can look at the weights as taking a
particular training point xi
into account ωi times. So if previously there w
Nk points in a particular class, we should now think about “how many times we
count” each point. Note that the prior probabilities should still sum to one.
(2) Implement the Adaboost algorithm and apply it to the Bayes classifier. Design a function trainBoost(base classifier, X, labels, T) that generates
a set of boosted hypotheses, where the parameter T determines the number of hypotheses. Use the modified computePrior(labels, W). The signature in Python
should look like
def trainBoost(base_classifier, X, labels, T):
…
return classifiers, alphas
Note that a new classifier of type base classifier can be trained by calling
new classifier = base classifier.trainClassifier(X, y, W), which can
then be used for classification by calling new classifier.classify(X).
(3) Design a function
def classifyBoost(X, classifiers, alphas, Nclasses):
…
return yPred
that classifies the instances in data by means of the aggregated boosted classifier
according to Equation 15. The resulting classifications are returned in the vector
yPred.
Observe: The return parameter alphas, a length T list, holds the classifier vote weights
α
t
. classifiers should be a length T list of trained classifiers. yPred will be a length
N vector of predicted class labels. Note that you have to compute and store all the
hypotheses generated with classifiers[t].classify(data, labels, W) for each of
the different distributions ω
t+1 and later aggregate their classifications to obtain the
overall classification.
Compute the classification accuracy of the boosted classifier on some data sets using
testClassifier from labfuns.py and compare it with those of the basic classifier on
the vowels and iris data sets (see Assignment 3):
(1) Is there any improvement in classification accuracy? Why/why not?
(2) Plot the decision boundary of the boosted classifier on iris and compare it with
that of the basic. What differences do you notice? Is the boundary of the boosted
version more complex?
(3) Can we make up for not using a more advanced model in the basic classifier
(e.g. independent features) by using boosting?
You may use the function plotBoundary provided in labfuns.py to plot the decision
boundary for different datasets and parameters.
5.5. Assignment 6. We have implemented a class DecisionTreeClassifier based
upon skLearns decision tree classifier. The skLearn implementation is similar to the one
used in the first lab, however, here the values are continuous and we use the default Gini
index to compute the split.
Test the decision tree classifier on the vowels and iris data sets. Repeat but now by
passing it as an argument to the BoostClassifier object. Answer questions 1-3 in
assignment 5 for the decision tree.
5.6. Assignment 7. If you had to pick a classifier, naive Bayes or a decision tree or
the boosted versions of these, which one would you pick? Motivate from the following
criteria:
• Outliers
• Irrelevant inputs: part of the feature space is irrelevant
• Predictive power
• Mixed types of data: binary, categorical or continuous features, etc.
• Scalability: the dimension of the data, D, is large or the number of instances,
N, is large, or both.
6. The End
If you have followed the instructions you should be done now. Please report your findings
carefully by saving your plots and classification results. Use these to reason about the
questions in the lab description.
For the interested, there is also the voluntary assignment at the end. This showcases
how we can use the already implemented code to classify and visualize real world data.
7. Voluntary Assignment
Note that this part is completely voluntary!
Figure 4. Example of visualizeOlivettiVectors output.
7.1. Classifying Faces. Now we can use the implemented classifiers to achieve something a bit more tangible. For this section, we will refer mostly to the code. In
labfuns.py, there is a function called visualizeOlivettiVectors, which takes in
columns of training vectors from the Olivetti data set together with a single test point.
It then visualizes these images together. The idea is that we will take a test point at
random, classify it, and then visualize it together with the training vectors belonging to
that class to see how well the classification worked.
To begin with, we should try some different combinations of classifiers and boosted
classifiers to see how good results we can expect. While a boosted BayesClassifier
might be too computationally demanding to try out, you can try the basic bayes classifier
as well as DecisionTreeClassifier with and without boosting. In some combinations,
you should be able to approach accuracies of 90%.
When you have identified a good combination, simply use it with the supplied code to
visualize the classifications. An example output is shown in Figure 4.