
I tried to load the fastText pretrained model from here: Fasttext model. I am using wiki.simple.en.

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)

But it shows the following error:

Traceback (most recent call last):
  File "nltk_check.py", line 28, in <module>
    word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
  File "P:\major_project\venv\lib\site-packages\gensim\models\keyedvectors.py", line 206, in load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "P:\major_project\venv\lib\site-packages\gensim\utils.py", line 235, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Question 1: How do I load the fastText model with Gensim?

Question 2: Also, after loading the model, I want to find the similarity between two words:

 model.find_similarity('teacher', 'teaches')  # something like this
 # Output: 0.99

How do I do this?

leakey
  • Is gensim an absolute requirement? In the end, I just wound up going with the fasttext library directly, since I really just needed the words to get transformed – information_interchange May 10 '20 at 16:24

6 Answers


Here's the link to the methods available for the fastText implementation in gensim: fasttext.py

from gensim.models.wrappers import FastText

model = FastText.load_fasttext_format('wiki.simple')

print(model.most_similar('teacher'))
# Output = [('headteacher', 0.8075869083404541), ('schoolteacher', 0.7955552339553833), ('teachers', 0.733420729637146), ('teaches', 0.6839243173599243), ('meacher', 0.6825737357139587), ('teach', 0.6285147070884705), ('taught', 0.6244685649871826), ('teaching', 0.6199781894683838), ('schoolmaster', 0.6037642955780029), ('lessons', 0.5812176465988159)]

print(model.similarity('teacher', 'teaches'))
# Output = 0.683924396754
Sabbiu Shah
  • I get DeprecationWarning: Call to deprecated `load_fasttext_format` (use load_facebook_vectors), so I am using from gensim.models.fasttext import load_facebook_model – hru_d Oct 29 '19 at 22:35
  • @Sabbiu, could you please specify the model you used? I'm receiving ModuleNotFoundError: No module named 'gensim.models.wrappers' from Gensim version 4.1.1. – Oleg Melnikov Oct 27 '21 at 05:35

For .bin use load_fasttext_format() (this typically contains the full model with parameters, ngrams, etc.).

For .vec use load_word2vec_format() (this contains ONLY word-vectors: no ngrams, and you can't update the model).

Note: If you are facing memory issues or you are not able to load .bin models, then check the pyfasttext model for the same.

Credits: Ivan Menshikh (Gensim maintainer)

Akash Kandpal
  • "For .bin ... you can continue training after loading." This is not true, as the documentation states: "Due to limitations in the FastText API, you cannot continue training with a model loaded this way." https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format – Andriy Drozdyuk Nov 18 '18 at 04:57
  • This is no longer true: DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead. – mickythump Aug 17 '19 at 16:02
  • @mickythump Can you please suggest some edits here? – Akash Kandpal Sep 28 '20 at 12:57

Update 04/2020

load_fasttext_format() is now deprecated; the updated way to load the models is with gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors(), for binaries and vecs respectively.

For example:

from gensim.models.fasttext import load_facebook_model

wv = load_facebook_model('<path_to.bin.gz>')

jcaliz

I really wanted to use gensim, but ultimately found that using the native fasttext library worked better for me. You can copy/paste the following code into Google Colab and it will work out of the box:

pip install fasttext

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

It works for out-of-vocabulary words too:

ft.get_word_vector("another")
ft.get_word_vector("dkjeri37id20hnd")
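The fasttext Python API doesn't ship a word-to-word similarity method like gensim's model.similarity, but you can compute the cosine similarity of two word vectors yourself. A minimal sketch with plain NumPy (cosine_similarity is a made-up helper, not part of the fasttext API):

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# with the model loaded above, something like:
# sim = cosine_similarity(ft.get_word_vector('teacher'),
#                         ft.get_word_vector('teaches'))

# demo on toy vectors
print(round(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                              np.array([1.0, 2.0, 4.0])), 3))  # prints 0.991
```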

The FastText binary format (which is what it looks like you're trying to load) isn't compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec doesn't make use of.

There's some discussion of the issue (and a workaround) on the FastText GitHub page. In short, you'll have to load the text format (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md).

Once you've loaded the text format, you can use Gensim to save it in binary format, which will dramatically reduce the model size, and speed up future loading.

https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302

Fred

Let’s use a pre-trained model rather than training our own word embeddings. For this, you can download pre-trained vectors from here. Each line of this file contains a word and its corresponding n-dimensional vector. We will create a dictionary from this file mapping each word to its vector representation.

import numpy as np
from tqdm import tqdm

def load_fasttext():
    print('loading word embeddings...')
    embeddings_index = {}
    f = open('../input/fasttext/wiki.simple.vec', encoding='utf-8')
    next(f)  # the first line of a .vec file is a "<vocab_size> <dim>" header
    for line in tqdm(f):
        values = line.strip().rsplit(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('found %s word vectors' % len(embeddings_index))
    return embeddings_index

embeddings_index = load_fasttext()


Let’s check the embedding for a word:


embeddings_index['london'].shape
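With the dictionary built above, the similarity asked about in Question 2 can be computed with plain NumPy (dict_similarity is a hypothetical helper, and the toy vectors below just stand in for the real index):

```python
import numpy as np

def dict_similarity(embeddings_index, w1, w2):
    # cosine similarity between two words in the embeddings dictionary
    v1, v2 = embeddings_index[w1], embeddings_index[w2]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# toy index standing in for the real wiki.simple.vec vectors
toy = {
    'teacher': np.asarray([0.1, 0.2, 0.3], dtype='float32'),
    'teaches': np.asarray([0.1, 0.2, 0.4], dtype='float32'),
}
print(round(dict_similarity(toy, 'teacher', 'teaches'), 3))  # close to 1.0
```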

Here’s a bit more info, from a blog post I wrote for my company, on FastText and other document classification methods (for smaller datasets)

Stephen Rauch