Description
PART I
(10 points)
1. Consider a regression problem of trying to estimate the function f: X -> y where
X is a vector of feature attributes and y is a continuous real valued output variable.
You would like to use a bagging model where you first create M bootstrap samples
and then use them to create M different models β h1, h2, β¦, hM. You can assume that
all models are of the same type.
The error for each of the models would be described as:
ππ(π₯) = π(π₯) β βπ(π₯)
where x is the input data and hi is the model created using ith bootstrap sample
The expected value of the squared error for any of the models will be defined as:
πΈ(ππ
(π₯)
2
) = πΈ[(π(π₯) β βπ
(π₯))
2
]
The average value of the expected squared error for each of the models acting
individually is defined as:
πΈππ£π =
1
πβπΈ(ππ
(π₯)
2
)
π
π=1
Now, you decide to aggregate the models using a committee approach as follows:
βπππ(π₯) =
1
πββπ
π
π=1
(π₯)
The error using the aggregated model is defined as:
πΈπππ(π₯) = πΈ[{
1
πββπ
π
π=1
(π₯) β π(π₯)}
2
]
which can be simplified as:
πΈπππ(π₯) = πΈ[{
1
πβππ
π
π=1
(π₯)}
2
]
where we used the value of ο₯i is defined above.
Prove that
πΈπππ =
1
π
πΈππ£π
provided you make the following assumptions:
1. Each of the errors have a 0 mean
πΈ(ππ(π₯)) = 0 for all i
2. Errors are uncorrelated
πΈ(ππ
(π₯)ππ(π₯)) = 0 for all i β j
(10 points)
2. Jensen’s inequality states that for any convex function f:
π(βπππ₯π) β€ βπππ(π₯π)
π
π=1
π
π=1
In question 1, we had assumed that each of the errors are uncorrelated i.e.
πΈ(ππ
(π₯)ππ(π₯)) = 0 for all i β j
This is not really true, as the models are created using bootstrap samples and have
correlation with each other. Now, let’s remove that assumption. Show that using
Jensen’s inequality, it is still possible to prove that:
πΈπππ β€ πΈππ£π
(10 points)
3. Deriving the training error for AdaBoost:
In class, we discussed the steps of Adaboost algorithm. Recall that the final
hypothesis for a Boolean classification problem at the end of T iterations is given
by:
π»(π₯) = π πππ(βπΌπ‘βπ‘(π₯))
π
π‘=1
The above equation says that the final hypothesis is the weighted hypothesis
generated at the end of each individual step.
Also recall that the weight for the point i at step t+1 is given by:
π·π‘+1
(π) =
π·π‘
(π)
ππ‘
Γ π
βπΌπ‘βπ‘
(π)π¦(π)
where:
π·π‘(π) is the normalized weight of point i in step t
βπ‘(π) is the hypothesis (prediction) at step t for point i
πΌπ‘
is the final βvoting powerβ of hypothesis βπ‘
π¦(π) is the true label for point i
ππ‘
is the normalization factor at step t (it ensures that the weights sum up to 1.0)
Note that at step 1, the points have equal weight
π·1 =
1
π
where N is the total number of data points.
At each of the steps, the total error of βπ‘ will be defined as ππ‘ =
1
2
β πΎπ‘
, which is a
way of saying that the error will be better than 50% by a value πΎπ‘
.
Prove that at the end of T steps, the overall training error will be bounded by:
exp(β2 β πΎπ‘
π 2
π=1
)
That is, the overall training error of the hypothesis H will be less than or equal to
the amount indicated above.
Part II (70 points)
Tweets Clustering using k-means
Twitter provides a service for posting short messages. In practice, many of the tweets are very
similar to each other and can be clustered together. By clustering similar tweets together, we can
generate a more concise and organized representation of the raw tweets, which will be very
useful for many Twitter-based applications (e.g., truth discovery, trend analysis, search ranking,
etc.)
In this assignment, you will learn how to cluster tweets by utilizing Jaccard Distance metric and
K-means clustering algorithm.
Objectives:
Compute the similarity between tweets using the Jaccard Distance metric.
Cluster tweets using the K-means clustering algorithm.
Introduction to Jaccard Distance:
The Jaccard distance, which measures dissimilarity between two sample sets (A and B). It is
defined as the difference of the sizes of the union and the intersection of two sets divided by the
size of the union of the sets.
For example, consider the following tweets:
Tweet A: the long march
Tweet B: ides of march
|A Γ B | = 1 and |A U B | = 5, therefore the distance is 1 β (1/5)
In this assignment, a tweet can be considered as an unordered set of words such as {a,b,c}. By
“unordered”, we mean that {a,b,c}={b,a,c}={a,c,b}=…
Jaccard Distance Dist(A, B) between tweet A and B has the following properties:
It is small if tweet A and B are similar.
It is large if they are not similar.
It is 0 if they are the same.
It is 1 if they are completely different (i.e., no overlapping w
Here is the reference for more details about Jaccard Distance:
https://en.wikipedia.org/wiki/Jaccard_index
Hint: Note that since the tweets do not have numerical coordinates as in Euclidean space, you
might want to think of a sensible way to compute the “centroid” of a tweet cluster. This could be
the tweet having minimum distance to all of the other tweets in a cluster.
Exercise:
Implement the tweet clustering function using the Jaccard Distance metric and K-means
clustering algorithm to cluster redundant/repeated tweets into the same cluster. Remember that
you have to write your own code for K-means clustering. It is acceptable to use external
libraries for data loading and pre-processing only.
Python is the preferred language for this
assignment. If you want to use any other language, clearly specify how to compile and run your
code in the README file.
Note that while the K-means algorithm is proved to converge, the algorithm is sensitive to the k
initial selected cluster centroids (i.e., seeds) and the clustering result is not necessarily optimal on
a random selection of seeds.
Steps of the exercise:
(1) We are going to use the following dataset for this exercise:
https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter
Follow the βData Folderβ link and unzip the given file. You will file a folder containing tweets
that contain links to various news sources e.g. the file βusnewshealth.txtβ contains tweets that
refer to articles published in US News. You have to choose one such file and proceed.
(2) Perform the following pre-processing steps:
– Remove the tweet id and timestamp
– Remove any word that starts with the symbol @ e.g. @AnnaMedaris
– Remove any hashtag symbols e.g. convert #depression to depression
– Remove any URL
– Convert every word to lowercase
(3) Perform K-means clustering on the resulting tweets using at least 5 different values of K and
report your results in the format below
Note that the sum of squared error is defined as:
where K is the number of clusters and mi is the centroid of the ith cluster.
Γ₯Γ₯= Γ
=
K
i x C
i
i
SSE dist m x
1
2
( , )
Value of
K
SSE Size of each cluster
10 200 1: 10 tweets
2: 25 tweets
3: 20 tweets
β¦.
10: 100 tweets
β¦.
What to Turn In for Part II :
(1) Table of results as mentioned earlier.
(2) The source code including a README file indicating how to run your code.