Description
THE PROBLEM

Spam is e-mail that is both unsolicited by the recipient and sent in substantively identical form to many recipients. Spam is dealt with on the receiving end by automated spam filter programs that attempt to classify each message as either “spam” (undesired mass email) or “ham” (mail the user would like to receive). A Naïve Bayes spam filter uses a Naïve Bayes classifier to identify spam email. Many modern mail programs implement Bayesian spam filtering; examples include SpamAssassin and SpamBayes. In this lab, you will create a spam filter based on Naïve Bayes classification.

THE MATH OF A NAÏVE BAYES CLASSIFIER

A Naïve Bayes classifier takes objects described by a feature vector and classifies each object based on the features. Let V be the set of possible classifications. In this case, that will be “spam” or “ham.”

$V = \{\text{spam}, \text{ham}\}$   (1)
Let $a_1$ through $a_n$ be Boolean values. We will represent each mail document by these n Boolean values: $a_i$ is true if the i-th word in our dictionary is present in the document, and false otherwise.

$a_i \in \text{Dictionary}$   (2)

If we have a dictionary consisting of the following set of words: {“feature”, “ham”, “impossible”, “zebra”}, then the first paragraph of this section would be characterized by a vector of four Boolean values (one for each word):

$\text{paragraph}_1: a_1 = T \wedge a_2 = T \wedge a_3 = F \wedge a_4 = F$

We write the probability of $\text{paragraph}_1$ as:

$P(a_1 = T \wedge a_2 = T \wedge a_3 = F \wedge a_4 = F)$

When we don’t know the values for the attributes yet, we write

$P(a_1 \wedge a_2 \wedge a_3 \wedge a_4)$

but we don’t mean the probability that these variables are TRUE; we mean the probability that these variables take the observed values in the text we’re encoding. A Bayesian classifier finds the classification that produces the maximal probability in the following equation. This is the Maximum A Posteriori (MAP) classification.

$v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(v_j \mid a_1 \wedge a_2 \wedge \cdots \wedge a_n)$   (3)
Now…to estimate values for our probabilities, it is going to be more convenient to flip these probabilities around so that we can talk about the probability of each attribute given that something is spam. We can do this by applying Bayes’ rule.
$v_{MAP} = \operatorname*{argmax}_{v_j \in V} \dfrac{P(a_1 \wedge a_2 \wedge \cdots \wedge a_n \mid v_j)\, P(v_j)}{P(a_1 \wedge a_2 \wedge \cdots \wedge a_n)}$   (4)
Since we are comparing a single text over the set of labels V in this equation, we can remove the divisor (which represents the text we’ve encoded), since it will be the same no matter what label $v_j$ is set to.

$v_{MAP} = \operatorname*{argmax}_{v_j \in V} P(a_1 \wedge a_2 \wedge \cdots \wedge a_n \mid v_j)\, P(v_j)$   (5)
A Naïve Bayes classifier is “naïve” because it assumes the likelihood of each feature being present is completely independent of whether any other feature is present.

$v_{NB} = \operatorname*{argmax}_{v_j \in V} P(a_1 \mid v_j)\, P(a_2 \mid v_j) \cdots P(a_n \mid v_j)\, P(v_j) = \operatorname*{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$   (6)
Now, given this mathematical framework, building a spam classifier is fairly straightforward. Each attribute $a_i$ is a word drawn from a dictionary. To classify a document, generate a feature vector for that document that specifies which words from the dictionary are present and which are absent. Note about this encoding: repeating a word 100 times in a document won’t make the answer to “Is word X present in document Y?” any more true than if the word appears once. This is true BOTH when you build your probabilities in the dictionary AND when you encode an email to see if it is spam.
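To make the encoding concrete, here is a minimal sketch in Python; the function name encode and the dictionary-as-list representation are illustrative assumptions, not part of the starter code.

# Illustrative sketch of presence/absence encoding (not the starter code).
# Splitting on whitespace and collapsing to a set means a word that appears
# 100 times counts exactly the same as a word that appears once.
def encode(document_text, dictionary_words):
    words_present = set(document_text.split())
    return {word: (word in words_present) for word in dictionary_words}

# Example, using the four-word dictionary from the math section above:
vector = encode("no zebra here", ["feature", "ham", "impossible", "zebra"])
# vector == {"feature": False, "ham": False, "impossible": False, "zebra": True}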
Then, for each attribute (such as the word “halo”), look up the probability of that attribute being present (or absent), given a particular categorization (“spam” or “ham”), and multiply these probabilities together. Don’t forget, of course, to determine the base probability of a particular category (how likely is it for any given document to be spam?). Voila, the categorization that produces the highest probability is the category selected for the document.
Oh…one other thing. Your dictionary will have thousands of words in it. If you multiply thousands of probabilities together, what happens to the product? Can you represent it? You might consider using equation (7), instead of (6), when implementing your code.
$v_{NB} = \operatorname*{argmax}_{v_j \in V} \left[ \log P(v_j) + \sum_i \log P(a_i \mid v_j) \right]$   (7)
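To see why equation (7) helps, here is a minimal sketch of log-space classification in Python. It assumes the dictionary maps each word to a pair (P(word | spam), P(word | ham)) with values strictly between 0 and 1 (e.g., after smoothing); that layout is an assumption for illustration, not the starter code’s exact format.

import math

# Sketch of equation (7): score each class in log space and pick the best.
# Absent words contribute their 1 - p terms, since each attribute takes its
# observed value (present OR absent) in the encoded document.
def classify(document_words, dictionary, spam_prior_probability):
    present = set(document_words)
    log_spam = math.log(spam_prior_probability)
    log_ham = math.log(1.0 - spam_prior_probability)
    for word, (p_spam, p_ham) in dictionary.items():
        if word in present:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            log_spam += math.log(1.0 - p_spam)
            log_ham += math.log(1.0 - p_ham)
    return "spam" if log_spam > log_ham else "ham"

Summing logs never underflows the way a product of thousands of small probabilities does, which is exactly the issue raised above.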
A TRAINING CORPUS

To train your system, we suggest you use the spam corpus used by the people who built the SpamAssassin program. To get this spam corpus, go to the following link:

https://spamassassin.apache.org/publiccorpus/
You should pick a spam file and a ham file to train and test your system. Two that seem to work fairly well are 20030228_spam.tar.bz2 and 20021010_easy_ham.tar.bz2, but you don’t HAVE to use those files in your experiments. You can unzip these files using the standard Unix utility tar (with the options xjf) or from Windows using WinRAR (which can be found here: https://www.rarlab.com/download.htm).
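If you would rather stay in Python, the standard-library tarfile module can unpack the same archives; a minimal sketch:

import tarfile

# The Python equivalent of `tar xjf`: extract a bzip2-compressed corpus
# archive into the current directory.
with tarfile.open("20030228_spam.tar.bz2", "r:bz2") as archive:
    archive.extractall()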
STARTER CODE

We have provided you a starter framework for your work. A file called spamfilter.py contains a function makedictionary, a function spamsort, and a main that you can call from the command line. These functions are the skeleton for what you need to write; they don’t do the full job, though. The first function is the one that makes a dictionary:

def makedictionary(spam_directory, ham_directory, dictionary_filename):

The parameter spam_directory is a character string containing the path to a directory that should contain only ASCII text files that are spam. The parameter ham_directory is a character string containing the path to a directory containing only ham ASCII text files. The starter code will read in each ASCII file, separate it into whitespace-delimited strings, and add each unique string (a word) to a Python dictionary. Associated with each word in this dictionary are two probabilities:

• The probability of observing that word at least once in a spam document.
• The probability of observing that word at least once in a ham document.

makedictionary returns two things to the calling function: the dictionary (words) and the prior probability of spam, based on the training data (spam_prior_probability). It also saves the dictionary to an ASCII file named dictionary_filename, placed in the current directory. NOTE THAT THE STARTER CODE DOESN’T ACTUALLY CALCULATE THE PROBABILITIES AND JUST INITIALIZES THEM ALL TO 0.

spamsort is a Python function that classifies each document in a directory containing ASCII text documents as either spam (undesired junk mail) or ham (desired document):

def spamsort(mail_directory, spam_directory, ham_directory, dictionary, spam_prior_probability):

The parameter mail_directory specifies the path to a directory where every file is an ASCII text file to be classified as spam or not. The parameters spam_directory and ham_directory specify the destination directories for the mail. The parameter dictionary is the dictionary constructed by makedictionary. The parameter spam_prior_probability is the base probability that any given document is spam. The function classifies each document in the mail directory as either spam or ham, then copies the file to spam_directory if it is spam and to ham_directory if it is ham (not spam). When done, the number of mails in mail_directory should be exactly the same as the sum of the number of mails in spam_directory and ham_directory, since every file will have been copied to one or the other. NOTE: THE STARTER CODE CLASSIFIES THINGS BY ASSIGNING THE MOST PROBABLE CLASS, ACCORDING TO PRIOR PROBABILITY. YOU WILL NEED TO MAKE SPAMSORT INTO A NAÏVE BAYES CLASSIFIER.

PROBLEMS TO SOLVE

1) (2 points) The starter code, as given to you, assigns a probability of 0 to every word. You need to replace the line that assigns a 0 to every probability with a reasonable estimate using the approach described in “The Math of a Naïve Bayes Classifier”. For each word in the dictionary, your system should determine the probability of observing that word in a spam document and the probability of observing that word in a ham document. Note: the probability of observing a word in a document of a particular class is calculated by counting how many documents in that class contain one or more instances of that word. If I have 100 documents, where 99 of them have no instances of “smurf” and 1 document has 100 instances of “smurf,” the probability of “smurf” is 0.01, NOT 1.0. A sketch of this document-counting estimate follows below.
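For problem 1, here is a minimal sketch of the document-counting estimate. The helper name word_probabilities and the add-one smoothing on the document counts are illustrative assumptions, not requirements of the assignment.

import os

# Illustrative sketch: estimate P(word present | class) by counting
# DOCUMENTS that contain the word at least once, not word occurrences
# (100 instances of "smurf" in one file still count as one document).
def word_probabilities(directory):
    filenames = os.listdir(directory)
    doc_counts = {}
    for filename in filenames:
        with open(os.path.join(directory, filename), errors="ignore") as f:
            unique_words = set(f.read().split())  # presence, not frequency
        for word in unique_words:
            doc_counts[word] = doc_counts.get(word, 0) + 1
    n = len(filenames)
    # Add-one smoothing keeps estimates away from exactly 0 or 1; this is
    # one common choice, not the only reasonable one. Words never seen in
    # this class get no entry here; handling them is part of your design.
    return {word: (count + 1) / (n + 2) for word, count in doc_counts.items()}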
2) (3 points) The starter code simply uses prior probability to classify files as spam or ham. You need to replace the code that always decides to use the most likely class with code that uses the probabilities in the dictionary to implement a Naïve Bayes classifier as described in “The Math of a Naïve Bayes Classifier”.
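For problem 2, a hedged sketch of how the finished spamsort loop might look, reusing the illustrative classify scorer from the equation (7) sketch above; the starter code’s actual structure may differ.

import os
import shutil

# Sketch: classify every file in mail_directory and copy it to the
# matching destination directory. `classify` is the illustrative
# equation (7) scorer sketched earlier in this handout.
def spamsort(mail_directory, spam_directory, ham_directory,
             dictionary, spam_prior_probability):
    for filename in os.listdir(mail_directory):
        path = os.path.join(mail_directory, filename)
        with open(path, errors="ignore") as f:
            words = f.read().split()
        label = classify(words, dictionary, spam_prior_probability)
        destination = spam_directory if label == "spam" else ham_directory
        shutil.copy(path, destination)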
3) (2 points) Would equations (6) and (7) always return the same result if there were no issues with underflow or precision? Give a proof to support your answer. What happens if there is underflow? Do they still give the same result? Why or why not?
4) (3 points) Design and run an experiment to compare your spam filter to one using prior probability.
A) Datasets: In your experiment, what data set did you use to build your dictionary? What data set did you use to test the resulting spam filter? Were they the same? Why did you choose the data sets you chose? How many files are in your data? What is the percentage of spam in the data? Is your data representative of the kinds of spam you might see in the real world? Why or why not?

B) Your Testing Methodology: Explain how you tested the spam filter to see if it worked better than just using the prior probability of spam learned from your training set. Did you do cross validation? If so, how many folds? If not, why not? What is your error measure?

C) Experimental Results: Does your spam filter do better than just using the prior probability of spam in your data set? Give evidence to back up your assertion. This may include well-labeled graphs, tables, and statistical tests. The idea is to convince a reader that you really have established that your system works (or doesn’t).
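If you opt for cross validation in part B, here is a minimal sketch of splitting labeled filenames into folds; the fold count, the seed, and the use of accuracy as the error measure are illustrative choices, not requirements.

import random

# Illustrative k-fold split over a list of (filename, label) pairs.
def make_folds(examples, k=10, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Each round: build the dictionary from k-1 folds, classify the held-out
# fold, and compare your filter's error rate to the prior-only baseline.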