This is actually a deep problem that many people and companies have worked on. Here are some basics:
First, we need to represent the data well. This means encoding each document as a vector in $d$-dimensional space. Ideally, in this space, samples with the same label should be close together in Euclidean distance and samples with different labels should be far apart. This step can be really hard, but one tried-and-true representation is Term Frequency-Inverse Document Frequency (tf-idf). Here, each dimension of the space corresponds to a particular word, and the value in that dimension for a given document is roughly the count of that word in the document, down-weighted by how common the word is across the whole corpus. You could read more about that here. There's a pretty good scikit-learn implementation of this representation if you want to try it out (a quick sketch follows below).
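Here is a minimal sketch of that step using scikit-learn's `TfidfVectorizer`; the two documents are made-up placeholders, not real data:

```python
# A minimal sketch of the tf-idf step with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",        # hypothetical spam-like text
    "meeting agenda for tuesday",  # hypothetical ham-like text
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Each column corresponds to one vocabulary word, so the number of columns is d.
print(X.shape)
```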
Now the data is in a useful space, but a really high-dimensional one (one dimension per vocabulary word, so $d$ can easily reach tens of thousands). I'd recommend reducing this dimensionality somehow (one common option is sketched below), but that's a whole subject for another thread.
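As one illustration, truncated SVD (the idea behind latent semantic analysis) works directly on sparse tf-idf output. The toy corpus and the choice of 2 components below are placeholders; real corpora typically use a few hundred components:

```python
# A minimal sketch of dimensionality reduction via truncated SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",
    "claim your free prize today",
    "meeting agenda for tuesday",
    "notes from the tuesday meeting",
]

X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2)   # placeholder; tune for your corpus
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (4, 2): each document is now a dense 2-dimensional vector
```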
Finally, you can train some algorithm to classify the samples (which is what the other answers are about). There are lots of great choices: neural networks, AdaBoost, SVMs, Naive Bayes, and graphical classification models will all give you good results. Many of these also have implementations in scikit-learn (a minimal end-to-end example is sketched below).
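For instance, here is a hedged end-to-end sketch that feeds tf-idf features into one of the classifiers mentioned above (a linear SVM). The tiny corpus and labels are made-up placeholders:

```python
# A minimal sketch: tf-idf features + a linear SVM, chained in a pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "win a free prize now",
    "claim your free prize today",
    "meeting agenda for tuesday",
    "notes from the tuesday meeting",
]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

# Predict the label of a new, unseen document.
print(clf.predict(["free prize meeting"]))
```

Any of the other classifiers listed above could be swapped into the pipeline in place of `LinearSVC`.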
But the best algorithms leverage the fact that this is really a transfer learning problem. That is, the distributions the training and test data come from might not be exactly the same, because the sorts of things one person considers spam can differ from what another person considers spam.
Is $d$ variable? Is it a fixed number that is chosen by a scientist? – Martin Vseticka Jun 29 '15 at 15:29