This is actually a deep problem that many people and companies have worked on. Here are some basics:
First, we need to represent the data well. This means encoding each document as a vector in $d$-dimensional space. Ideally, in this space, samples with the same label should be close together in Euclidean distance and samples with different labels should be far apart. This step can be really hard, but one tried-and-true representation is Term Frequency-Inverse Document Frequency (tf-idf). Here, each dimension of the space corresponds to a particular word, and the value in that dimension for a given document is roughly the count of that word in the document, down-weighted by how common the word is across the whole corpus. You could read more about that here. There's a pretty good scikit-learn implementation of this representation if you want to try it out (a quick sketch follows below).
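Here is a minimal sketch of that step using scikit-learn's `TfidfVectorizer`; the two documents are made-up placeholders, not real data:

```python
# A minimal sketch of the tf-idf step with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",        # hypothetical spam-like text
    "meeting agenda for tuesday",  # hypothetical ham-like text
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Each column corresponds to one vocabulary word, so the number of columns is d.
print(X.shape)
```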
Now the data is in a useful space, but a really high-dimensional one (one dimension per vocabulary word, so $d$ can easily reach tens of thousands). I'd recommend reducing this dimensionality somehow (one common option is sketched below), but that's a whole subject for another thread.
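As one illustration, truncated SVD (the idea behind latent semantic analysis) works directly on sparse tf-idf output. The toy corpus and the choice of 2 components below are placeholders; real corpora typically use a few hundred components:

```python
# A minimal sketch of dimensionality reduction via truncated SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "win a free prize now",
    "claim your free prize today",
    "meeting agenda for tuesday",
    "notes from the tuesday meeting",
]

X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2)   # placeholder; tune for your corpus
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (4, 2): each document is now a dense 2-dimensional vector
```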
Finally, you can train some algorithm to classify the samples (which is what the other answers are about). There are lots of great choices: neural networks, AdaBoost, SVMs, Naive Bayes, and graphical classification models will all give you good results. Many of these also have implementations in scikit-learn (a minimal end-to-end example is sketched below).
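For instance, here is a hedged end-to-end sketch that feeds tf-idf features into one of the classifiers mentioned above (a linear SVM). The tiny corpus and labels are made-up placeholders:

```python
# A minimal sketch: tf-idf features + a linear SVM, chained in a pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "win a free prize now",
    "claim your free prize today",
    "meeting agenda for tuesday",
    "notes from the tuesday meeting",
]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

# Predict the label of a new, unseen document.
print(clf.predict(["free prize meeting"]))
```

Any of the other classifiers listed above could be swapped into the pipeline in place of `LinearSVC`.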
But the best algorithms leverage the fact that this is really a transfer learning problem. That is, the distributions the training and test data come from might not be exactly the same, because the sorts of things one person considers spam can differ from what another person considers spam.
Is $d$ variable? Is it a fixed number that is chosen by a scientist? – Martin Vseticka Jun 29 '15 at 15:29