models._fasttext_bin – Facebook I/O

Load models from the native binary format released by Facebook.

The main entry point is the load() function. It returns a Model namedtuple containing everything loaded from the binary.

Examples

Load a model from a binary file:

>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See also

Facebook's reference fastText implementation.

class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, maxn, min_count, minn, model, neg, nwords, raw_vocab, t, vectors_ngrams, vocab_size, ws)

Bases: tuple

Holds data loaded from the Facebook binary.

Parameters:
  • dim (int) – The dimensionality of the vectors.
  • ws (int) – The window size.
  • epoch (int) – The number of training epochs.
  • neg (int) – If non-zero, indicates that the model uses negative sampling.
  • loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.
  • model (int) – If equal to 2, indicates that the model uses skip-grams.
  • bucket (int) – The number of buckets.
  • min_count (int) – The threshold below which the model ignores terms.
  • t (float) – The sample threshold.
  • minn (int) – The minimum ngram length.
  • maxn (int) – The maximum ngram length.
  • raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.
  • nwords (int) – The number of words.
  • vocab_size (int) – The size of the vocabulary.
  • vectors_ngrams (numpy.array) – A matrix of the vectors learned by the model, one vector per row. The number of rows equals the number of words plus the number of buckets. The number of columns equals the vector dimensionality.
  • hidden_output (numpy.array) – A matrix containing the output of the shallow neural network. It has the same dimensions as vectors_ngrams. May be None, in which case training of the model cannot be continued.
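
To illustrate how the integer-coded fields are read, here is a minimal sketch built on a hypothetical stand-in namedtuple (`Hyperparams`, which is not part of gensim) carrying a few of the fields listed above. The values of `nwords` and `dim` follow the example at the top of this page; `bucket=100` and the remaining values are assumptions chosen for illustration only.

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the Model fields; not part of gensim.
Hyperparams = namedtuple("Hyperparams", "bucket dim loss model neg nwords")


def summarize(m):
    """Translate the integer-coded fields into readable settings."""
    return {
        # model == 2 marks skip-gram; other values are not decoded here.
        "skip_gram": m.model == 2,
        # loss == 1 marks hierarchical softmax.
        "hierarchical_softmax": m.loss == 1,
        # Any non-zero neg marks negative sampling.
        "negative_sampling": m.neg != 0,
        # vectors_ngrams has one row per word plus one row per bucket.
        "expected_ngram_shape": (m.nwords + m.bucket, m.dim),
    }


# nwords and dim follow the example above; the other values are assumed.
params = Hyperparams(bucket=100, dim=5, loss=1, model=2, neg=0, nwords=291)
summary = summarize(params)
print(summary["expected_ngram_shape"])  # (391, 5)
```

With these assumed values, the expected shape matches the `vectors_ngrams.shape` shown in the loading example.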
__getitem__

x.__getitem__(y) <==> x[y]

bucket

Alias for field number 0

count(value) → integer -- return number of occurrences of value

dim

Alias for field number 1

epoch

Alias for field number 2

hidden_output

Alias for field number 3

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

loss

Alias for field number 4

maxn

Alias for field number 5

min_count

Alias for field number 6

minn

Alias for field number 7

model

Alias for field number 8

neg

Alias for field number 9

nwords

Alias for field number 10

raw_vocab

Alias for field number 11

t

Alias for field number 12

vectors_ngrams

Alias for field number 13

vocab_size

Alias for field number 14

ws

Alias for field number 15
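
Because Model is a namedtuple, each field can be read either by name or by the position given in its "Alias for field number" entry. The sketch below demonstrates this with a hypothetical stand-in (`ModelLike`, not part of gensim) that copies the field order from the class signature above and stores each field's own position as its value.

```python
from collections import namedtuple

# Hypothetical stand-in that copies the field order of Model; not gensim itself.
FIELDS = (
    "bucket dim epoch hidden_output loss maxn min_count minn "
    "model neg nwords raw_vocab t vectors_ngrams vocab_size ws"
)
ModelLike = namedtuple("ModelLike", FIELDS)

# Fill every field with its own position so the aliases are easy to check.
m = ModelLike(*range(len(ModelLike._fields)))

assert m.bucket == m[0]   # "Alias for field number 0"
assert m.nwords == m[10]  # "Alias for field number 10"
assert m.ws == m[15]      # "Alias for field number 15"

# _asdict() gives a name-to-value mapping, handy for logging hyperparameters.
as_dict = m._asdict()

# _replace() returns a modified copy; the tuple itself is immutable.
m2 = m._replace(dim=300)
```

Access by name is usually preferable, since positional indices depend on the alphabetical field order shown here.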

gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)

Load a model from a binary stream.

Parameters:
  • fin (file) – The readable binary stream.
  • encoding (str, optional) – The encoding to use for decoding text.
  • full_model (bool, optional) – If False, skips loading the hidden output matrix. This saves CPU time and RAM, but makes it impossible to continue training the model.
Returns:

The loaded model.

Return type:

Model
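
When only the vectors are needed, the hidden output matrix can be skipped. The following is a minimal sketch of that usage, not a definitive recipe: `load_vectors_only` is a hypothetical helper name, and the `path` argument is whatever Facebook binary you have on disk.

```python
def load_vectors_only(path, encoding="utf-8"):
    """Load a Facebook binary without the hidden output matrix."""
    # Imported lazily so the helper can be defined without gensim installed.
    from gensim.models._fasttext_bin import load

    with open(path, "rb") as fin:
        # full_model=False skips the hidden output matrix.
        return load(fin, encoding=encoding, full_model=False)
```

For example, calling `load_vectors_only(datapath('crime-and-punishment.bin'))`, with `datapath` imported from `gensim.test.utils`, should return a Model whose `hidden_output` is expected to be None, so training cannot be continued from it.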