models._fasttext_bin – Facebook I/O

Load models from the native binary format released by Facebook.

The main entry point is the load() function. It returns a Model namedtuple containing everything loaded from the binary.


Load a model from a binary file:

>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See also

FB Implementation.

class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, maxn, min_count, minn, model, neg, nwords, raw_vocab, t, vectors_ngrams, vocab_size, ws)

Bases: tuple

Holds data loaded from the Facebook binary.

Parameters:

  • dim (int) – The dimensionality of the vectors.
  • ws (int) – The window size.
  • epoch (int) – The number of training epochs.
  • neg (int) – If non-zero, indicates that the model uses negative sampling.
  • loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.
  • model (int) – If equal to 2, indicates that the model uses skip-grams.
  • bucket (int) – The number of buckets.
  • min_count (int) – The threshold below which the model ignores terms.
  • t (float) – The sample threshold.
  • minn (int) – The minimum ngram length.
  • maxn (int) – The maximum ngram length.
  • raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.
  • nwords (int) – The number of words.
  • vocab_size (int) – The size of the vocabulary.
  • vectors_ngrams (numpy.array) – A matrix of the vectors learned by the model, one vector per row. The number of rows equals the number of words plus the number of buckets. The number of columns equals the vector dimensionality.
  • hidden_output (numpy.array) – A matrix that contains the shallow neural network output. It has the same dimensions as vectors_ngrams. May be None, in which case training of the model cannot be continued.
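As a rough illustration of how the integer fields above might be read, the sketch below turns model, loss and neg into readable labels and computes the expected number of rows in vectors_ngrams. The helpers describe_training and expected_ngram_rows are hypothetical, not part of gensim, and the values are made up; any object with the same attribute names would work in place of the stand-in.

```python
from types import SimpleNamespace


def describe_training(model):
    """Summarize the integer flags of a loaded model as readable values."""
    return {
        # model == 2 indicates skip-gram; other values are reported as-is.
        "architecture": "skip-gram" if model.model == 2
        else "other (model=%d)" % model.model,
        # loss == 1 indicates hierarchical softmax (fastText's "hs" loss).
        "loss": "hierarchical softmax" if model.loss == 1
        else "other (loss=%d)" % model.loss,
        # A non-zero neg indicates negative sampling.
        "negative_sampling": model.neg != 0,
    }


def expected_ngram_rows(model):
    """Rows in vectors_ngrams: one per word plus one per bucket."""
    return model.nwords + model.bucket


# Hypothetical values standing in for a loaded Model.
fake = SimpleNamespace(model=2, loss=1, neg=5, nwords=1000, bucket=2000)
summary = describe_training(fake)
rows = expected_ngram_rows(fake)
```

For a real model, comparing expected_ngram_rows(model) with model.vectors_ngrams.shape[0] is one way to sanity-check a load.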

x.__getitem__(y) <==> x[y]

bucket

Alias for field number 0

count(value) → integer -- return number of occurrences of value

dim

Alias for field number 1

epoch

Alias for field number 2

hidden_output

Alias for field number 3

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

loss

Alias for field number 4

maxn

Alias for field number 5

min_count

Alias for field number 6

minn

Alias for field number 7

model

Alias for field number 8

neg

Alias for field number 9

nwords

Alias for field number 10

raw_vocab

Alias for field number 11

t

Alias for field number 12

vectors_ngrams

Alias for field number 13

vocab_size

Alias for field number 14

ws

Alias for field number 15
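Because Model is a namedtuple, each field can be read either by name or by its position in the class signature. The following sketch shows the correspondence using a stand-in namedtuple built from the field names in that signature; the stand-in and all values are assumptions for illustration, and the real class is the one in gensim.models._fasttext_bin.

```python
from collections import namedtuple

# Stand-in with the same field names and order as the class signature above.
Model = namedtuple(
    "Model",
    "bucket dim epoch hidden_output loss maxn min_count minn model neg "
    "nwords raw_vocab t vectors_ngrams vocab_size ws",
)

# Hypothetical values, for illustration only.
m = Model(
    bucket=100, dim=5, epoch=5, hidden_output=None, loss=1, maxn=6,
    min_count=5, minn=3, model=2, neg=5, nwords=10, raw_vocab={},
    t=1e-4, vectors_ngrams=None, vocab_size=10, ws=5,
)

# Positional and named access refer to the same value.
assert m[0] == m.bucket   # field number 0
assert m[15] == m.ws      # field number 15

# Map each field name to its position, and the tuple to a dict.
field_positions = {name: i for i, name in enumerate(Model._fields)}
as_dict = m._asdict()
```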

gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)

Load a model from a binary stream.

Parameters:

  • fin (file) – The readable binary stream.
  • encoding (str, optional) – The encoding to use for decoding text.
  • full_model (boolean, optional) – If False, skips loading the hidden output matrix. This saves CPU time and RAM, but the loaded model cannot be used to continue training.

Returns:

The loaded model.

Return type:

Model
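When a model is only needed for lookups, passing full_model=False avoids reading the hidden output matrix. A minimal sketch of such a wrapper follows; load_for_inference and its loader argument are hypothetical names introduced here (the argument exists only so the loading function can be swapped out), and by default the wrapper calls gensim.models._fasttext_bin.load.

```python
def load_for_inference(path, loader=None):
    """Load a Facebook binary without the hidden output matrix.

    `loader` is a hypothetical hook for substituting the loading function;
    by default it is gensim.models._fasttext_bin.load.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without gensim installed.
        from gensim.models._fasttext_bin import load as loader
    with open(path, "rb") as fin:
        # full_model=False skips hidden_output, so training cannot be continued.
        return loader(fin, encoding="utf-8", full_model=False)
```

Called with the path to a Facebook binary, this should return a Model whose hidden_output is None.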