
I tried to load the fastText pretrained model from here: Fasttext model. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

But it shows the following error:

Traceback (most recent call last):
  File "nltk_check.py", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\site-packages\gensim\models\keyedvectors.py", line 206, in load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load the fastText model with Gensim?

Question 2: Also, after loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')  # something like this
 # Output: 0.99

How do I do this?

leakey
  • Is gensim an absolute requirement? In the end, I just wound up going with the fasttext library directly, since I really just needed the words to get transformed – information_interchange May 10 '20 at 16:24

6 Answers


Here's the link to the methods available for the fastText implementation in gensim: fasttext.py

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754
Sabbiu Shah
  • I get DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors), so I am using from gensim.models.fasttext import load_facebook_model – hru_d Oct 29 '19 at 22:35
  • @Sabbiu, could you please specify the model you used? I'm receiving ModuleNotFoundError: No module named 'gensim.models.wrappers' from Gensim version 4.1.1. – Oleg Melnikov Oct 27 '21 at 05:35

For .bin use load_fasttext_format() (this typically contains the full model with parameters, ngrams, etc.).

For .vec use load_word2vec_format() (this contains ONLY word-vectors: no ngrams, and you can't update the model).

Note: If you are facing memory issues or you are not able to load .bin models, then check the pyfasttext model for the same.

Credits: Ivan Menshikh (Gensim maintainer)

Akash Kandpal
  • "For .bin ... you can continue training after loading." This is not true, as the documentation states: "Due to limitations in the FastText API, you cannot continue training with a model loaded this way." https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format – Andriy Drozdyuk Nov 18 '18 at 04:57
  • This is no longer true: DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead. – mickythump Aug 17 '19 at 16:02
  • @mickythump Can you please suggest some edits here? – Akash Kandpal Sep 28 '20 at 12:57

Update 04/2020

load_fasttext_format() is now deprecated; the updated way to load the models is with gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors(), for binaries and vecs respectively.

For example:

from gensim.models.fasttext import load_facebook_model

wv = load_facebook_model('<path_to.bin.gz>')

jcaliz

I really wanted to use gensim, but ultimately found that using the native fasttext library worked better for me. You can copy/paste the following code into Google Colab and it will work out of the box:

pip install fasttext

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

It works for out-of-vocabulary words too:

ft.get_word_vector("another")
ft.get_word_vector("dkjeri37id20hnd")
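The fasttext Python API doesn't ship a word-to-word similarity method like gensim's model.similarity, but you can compute the cosine similarity of two word vectors yourself. A minimal sketch with plain NumPy (cosine_similarity is a made-up helper, not part of the fasttext API):

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# with the model loaded above, something like:
# sim = cosine_similarity(ft.get_word_vector('teacher'),
#                         ft.get_word_vector('teaches'))

# demo on toy vectors
print(round(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                              np.array([1.0, 2.0, 4.0])), 3))  # prints 0.991
```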

The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.

There's some discussion of the issue (and a workaround) on the FastText GitHub page. In short, you'll have to load the text format (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size, and speed up future loading.

https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302

Fred

Let’s use a pre-trained model rather than training our own word embeddings. For this, you can download pre-trained vectors from here. Each line of this file contains a word and its corresponding n-dimensional vector. We will create a dictionary from this file mapping each word to its vector representation.

import numpy as np
from tqdm import tqdm

def load_fasttext():
    print('loading word embeddings...')
    embeddings_index = {}
    f = open('../input/fasttext/wiki.simple.vec', encoding='utf-8')
    next(f)  # the first line of a .vec file is a "<vocab_size> <dim>" header
    for line in tqdm(f):
        values = line.strip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('found %s word vectors' % len(embeddings_index))
    return embeddings_index

embeddings_index = load_fasttext()


Let’s check the embedding for a word:


embeddings_index['london'].shape
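With the dictionary built above, the similarity asked about in Question 2 can be computed with plain NumPy (dict_similarity is a hypothetical helper, and the toy vectors below just stand in for the real index):

```python
import numpy as np

def dict_similarity(embeddings_index, w1, w2):
    # cosine similarity between two words in the embeddings dictionary
    v1, v2 = embeddings_index[w1], embeddings_index[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# toy index standing in for the real wiki.simple.vec vectors
toy = {
    'teacher': np.asarray([0.1, 0.2, 0.3], dtype='float32'),
    'teaches': np.asarray([0.1, 0.2, 0.4], dtype='float32'),
}
print(round(dict_similarity(toy, 'teacher', 'teaches'), 3))  # close to 1.0
```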

Here’s a bit more info, from a blog post I wrote for my company, on FastText and other document classification methods (for smaller datasets)

Stephen Rauch