What can be done so that 'teacher' and 'teaches' are treated as similar?

Question

Who teaches English?

After tokenizing and stemming, it gives me:

Who, teach, English

In my list of words, I have the word:

teacher

Lemmatizing and stemming teacher gives teacher, whereas lemmatizing and stemming teaches gives teach.

Even calculating edit_distance will not solve this, since the edit_distance between teacher and teach comes out to be 2.

What can I do to have teacher and teach treated as similar? There may also be other cases with an extra 's' at the end. Is there a stemmer that solves this problem? Is there any other solution?

Another similar example: instructor and instructs

Brian Spiering · Accepted Answer · 2021-05-05T20:06:32.493

2

Use an aggressive stemmer. The Lancaster Stemmer is one of the most aggressive and popular stemmers around.

Here is the Python code:

from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
assert 'teach' == lancaster_stemmer.stem('teacher') == lancaster_stemmer.stem('teaches')
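
Once both forms reduce to the same stem, matching question tokens against a word list is a dictionary lookup. Below is a minimal sketch of that step. The names `match_by_stem` and `toy_stem` are hypothetical, and `toy_stem` is only a stand-in so the example runs without NLTK; it is not the Lancaster algorithm. In practice you would probably pass `LancasterStemmer().stem` as the `stem` argument.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercases and strips a few suffixes.

    This is NOT the Lancaster algorithm; it only keeps the example
    self-contained. Swap in LancasterStemmer().stem for real use.
    """
    word = word.lower()
    for suffix in ("er", "es", "or", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def match_by_stem(tokens, vocabulary, stem=toy_stem):
    """Map each token to the vocabulary words that share its stem."""
    index = defaultdict(list)
    for word in vocabulary:
        index[stem(word)].append(word)
    # index.get avoids creating empty entries for unmatched tokens.
    return {token: index.get(stem(token), []) for token in tokens}


matches = match_by_stem(["Who", "teaches", "English"], ["teacher", "instructor"])
print(matches)
```

Keep in mind that the more aggressive the stemmer, the more likely unrelated words collapse to the same stem, so it is worth spot-checking the matches on your own vocabulary.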

edited May 05 '21 at 20:06

answered Jul 03 '17 at 23:21

Brian Spiering


score 1 · Answer 2 · answered Jun 28 '17 at 17:26

1

Check out fastText. It works similarly to word2vec in that it creates word embeddings, but it also analyzes character n-grams, so words that share most of their spelling, such as 'teacher' and 'teaches', end up with similar vectors. That captures the kind of surface similarity you're thinking about.
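
To give a rough feel for why character n-grams help, here is a small sketch that compares two words by the overlap of their character n-grams. This is not fastText itself (which learns a vector per n-gram); it is only a Jaccard overlap on the raw n-gram sets. The function names and the 3-to-5 n-gram range are assumptions chosen for illustration.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce distinct n-grams. The n-gram range is an illustrative choice.
    """
    padded = f"<{word.lower()}>"
    return {
        padded[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_similarity(a, b):
    """Jaccard overlap of the two words' character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Words sharing most of their spelling score higher than unrelated words.
print(ngram_similarity("teacher", "teaches"))
print(ngram_similarity("teacher", "english"))
```

With learned embeddings the comparison would be a cosine similarity between vectors rather than a set overlap, but the intuition is the same: shared subword pieces pull the two words together.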

answered Jun 28 '17 at 17:26

j.a.gartner


I tried using fastText but ran into a problem. @j.a.gartner, could you please see this query – Sabbiu Shah Jun 30 '17 at 02:17
It is now solved! – Sabbiu Shah Jun 30 '17 at 11:16
Sorry, didn't see until you had already solved it. – j.a.gartner Jun 30 '17 at 15:06
