There is a very simple hack to incorporate word order in an existing bag-of-words model implementation. Treat some of the phrases, such as the frequently occurring bi-grams (e.g. "New York"), as a unit, i.e. as a single word instead of treating them as separate entities. This will ensure that "New York" is treated differently from "York New". You could also define higher-order word shingles (n-grams with n = 3, 4, etc.).
You could use the Lucene ShingleFilter to decompose your document text into shingles as a pre-processing step and then apply the classifier on this decomposed text.
import java.io.*;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.util.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.charfilter.*;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
class TestAnalyzer extends Analyzer {
TestAnalyzer() {
super();
}
protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
String token;
TokenStream result = null;
Tokenizer source = new WhitespaceTokenizer( Version.LUCENE_CURRENT, reader );
result = new ShingleFilter(source, 2, 2);
return new TokenStreamComponents( source, result );
}
}
/**
 * Demo driver: runs {@link TestAnalyzer} over a sample sentence and prints
 * every emitted token (unigrams and bi-gram shingles), one per line.
 */
public class LuceneTest {
    public static void main(String[] args) throws Exception {
        TestAnalyzer analyzer = new TestAnalyzer();
        // try-with-resources guarantees the TokenStream is closed even if
        // reset()/incrementToken() throws mid-iteration; the original only
        // closed it on the success path, leaking it on error.
        try (TokenStream stream =
                analyzer.tokenStream("field", new StringReader("This is a sample sentence."))) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            // print all tokens until stream is exhausted
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }
            // end() must be called after the last incrementToken() so
            // end-of-stream attributes (e.g. final offset) are set.
            stream.end();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}