Who teaches English?

After tokenizing and stemming, it gives me

Who, teach, English

In my list of words, I have the word

teacher

Lemmatizing or stemming teacher gives teacher, while lemmatizing or stemming teaches gives teach.

Even calculating edit_distance will not solve this, since the edit_distance between teach and teacher comes out to be 2.
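To make the edit-distance point concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function name `edit_distance` is only borrowed from the question for illustration, and this is not NLTK's implementation. It shows that teach and teacher are 2 edits apart, while an unrelated word such as beach is only 1 edit from teach, so a distance threshold alone cannot separate related forms from unrelated ones.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("teach", "teacher"))  # 2 (two insertions: "e", "r")
print(edit_distance("teach", "beach"))    # 1 (one substitution: "t" -> "b")
```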

What can I do so that teacher and teach are treated as similar? There may also be other cases, such as an extra 's' at the end. Is there a stemmer that solves this problem, or some other solution?

Another similar example is instructor and instructs.

Sabbiu Shah

2 Answers


Use an aggressive stemmer. The Lancaster Stemmer is one of the most aggressive and popular stemmers around.

Here is the Python code:

from nltk.stem.lancaster import LancasterStemmer

# Both 'teacher' and 'teaches' should reduce to the same stem, 'teach'.
lancaster_stemmer = LancasterStemmer()
assert 'teach' == lancaster_stemmer.stem('teacher') == lancaster_stemmer.stem('teaches')

Brian Spiering

Check out fastText. It works similarly to word2vec in that it creates word embeddings, but it also analyzes character n-grams, so words that share substrings end up close together. That gives you the kind of syntactic similarity you're thinking about.
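The character n-gram idea can be illustrated without training any model. The sketch below is not fastText itself: fastText learns a vector for each n-gram and combines them, whereas this only measures the overlap (Jaccard similarity) between two words' n-gram sets. The helper names `char_ngrams` and `ngram_overlap` are hypothetical, and the `<`/`>` boundary markers and the 3 to 6 n-gram range are assumptions chosen to mirror fastText's usual defaults.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set:
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Related forms share more n-grams than an unrelated word that is closer in edit distance.
print(ngram_overlap("teach", "teacher"))
print(ngram_overlap("teach", "beach"))
print(ngram_overlap("instructor", "instructs"))
```

With this measure, teach scores higher against teacher than against beach, even though beach is closer by edit distance, because the shared prefix contributes many common n-grams.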

j.a.gartner