
Consider the following code:

from keras.preprocessing.text import Tokenizer

# texts is my corpus: a list of raw text strings
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
print('Found %d unique words.' % len(tokenizer.word_index))

When I run this, it prints:

Found 88582 unique words.

My question is: isn't num_words the parameter that controls the size of the mapping dictionary tokenizer.word_index? Then why does it still hold 88582 words when I explicitly asked it to keep only 5000?

Mehran

1 Answer


The problem is with the way this is documented: num_words does not trim word_index at all. fit_on_texts always records every word it sees, so word_index holds the full vocabulary. The num_words limit is only applied later, when you call texts_to_sequences or texts_to_matrix, which keep just the num_words - 1 most frequent words. See this question for details: https://stackoverflow.com/questions/46202519/keras-tokenizer-num-words-doesnt-seem-to-work
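
A minimal sketch of that behavior, using a made-up three-sentence corpus (the texts list below is just for illustration):

from keras.preprocessing.text import Tokenizer

# a toy corpus, purely for illustration
texts = ['the cat sat', 'the dog sat', 'the cat ran']

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

# word_index records every unique word, regardless of num_words
print(len(tokenizer.word_index))  # 5

# texts_to_sequences keeps only the num_words - 1 most frequent
# words ('the' and 'cat'); everything else is dropped
print(tokenizer.texts_to_sequences(texts))
# [[1, 2], [1], [1, 2]]

So the 5000 limit in your code will take effect once you actually convert your texts, even though word_index still reports 88582 entries.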

Prince