models._fasttext_bin – Facebook’s fastText I/O
Load models from the native binary format released by Facebook.
The main entry point is the load() function.
It returns a Model namedtuple containing everything loaded from the binary.
Examples
Load a model from a binary file:
>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']
- class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, lr_update_rate, maxn, min_count, minn, model, neg, ntokens, nwords, raw_vocab, t, vectors_ngrams, vocab_size, word_ngrams, ws)
Bases: tuple
Holds data loaded from the Facebook binary.
- Parameters
dim (int) – The dimensionality of the vectors.
ws (int) – The window size.
epoch (int) – The number of training epochs.
neg (int) – If non-zero, indicates that the model uses negative sampling.
loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.
model (int) – If equal to 2, indicates that the model uses skip-grams.
bucket (int) – The number of buckets.
min_count (int) – The threshold below which the model ignores terms.
t (float) – The sample threshold.
minn (int) – The minimum ngram length.
maxn (int) – The maximum ngram length.
raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.
nwords (int) – The number of words.
vocab_size (int) – The size of the vocabulary.
vectors_ngrams (numpy.array) – This is a matrix that contains vectors learned by the model. Each row corresponds to a vector. The number of vectors is equal to the number of words plus the number of buckets. The number of columns is equal to the vector dimensionality.
hidden_output (numpy.array) – This is a matrix that contains the shallow neural network output. This array has the same dimensions as vectors_ngrams. May be None, in which case it is impossible to continue training the model.
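The integer-coded fields and the shape rule for vectors_ngrams described above can be checked with a few lines of Python. This is a minimal sketch, not gensim code: `Header` is a hypothetical stand-in namedtuple carrying only a handful of the fields, and the sample values are illustrative. They were chosen so that the expected shape matches the (391, 5) shown in the example above for a model with 291 words, which under the stated rule implies 100 buckets.

```python
from collections import namedtuple

# Hypothetical stand-in for a loaded Model; only the fields used below are included.
Header = namedtuple("Header", ["bucket", "dim", "loss", "model", "neg", "nwords"])


def describe(m):
    """Summarise the integer-coded fields using the rules from the parameter list."""
    return {
        "skip_gram": m.model == 2,
        # loss == 1 is fastText's `hs` (hierarchical softmax) loss.
        "hierarchical_softmax": m.loss == 1,
        "negative_sampling": m.neg != 0,
        # One row per word plus one row per bucket, one column per dimension.
        "expected_ngram_shape": (m.nwords + m.bucket, m.dim),
    }


# Illustrative values only, not read from a real binary.
sample = Header(bucket=100, dim=5, loss=2, model=2, neg=5, nwords=291)
summary = describe(sample)
print(summary["expected_ngram_shape"])  # (391, 5)
```

On a real Model returned by load(), the same comparison would be made against model.vectors_ngrams.shape.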
- __getitem__(key, /)
Return self[key].
- bucket
Alias for field number 0
- count(value, /)
Return number of occurrences of value.
- dim
Alias for field number 1
- epoch
Alias for field number 2
- hidden_output
Alias for field number 3
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- loss
Alias for field number 4
- lr_update_rate
Alias for field number 5
- maxn
Alias for field number 6
- min_count
Alias for field number 7
- minn
Alias for field number 8
- model
Alias for field number 9
- neg
Alias for field number 10
- ntokens
Alias for field number 11
- nwords
Alias for field number 12
- raw_vocab
Alias for field number 13
- t
Alias for field number 14
- vectors_ngrams
Alias for field number 15
- vocab_size
Alias for field number 16
- word_ngrams
Alias for field number 17
- ws
Alias for field number 18
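Because Model is a namedtuple, each "Alias for field number N" entry above means that attribute access and positional indexing return the same object. The following is a minimal sketch using a stand-in namedtuple (`StandIn` is a hypothetical name) built from the field names in the class signature. The values are placeholders, not real model data.

```python
from collections import namedtuple

# Field names copied from the class signature above, in the same order.
FIELDS = (
    "bucket dim epoch hidden_output loss lr_update_rate maxn min_count minn "
    "model neg ntokens nwords raw_vocab t vectors_ngrams vocab_size word_ngrams ws"
)
StandIn = namedtuple("StandIn", FIELDS)

# Placeholder values: each field simply holds its own position.
m = StandIn(*range(len(StandIn._fields)))

assert m.dim == m[1]    # "Alias for field number 1"
assert m.ws == m[18]    # "Alias for field number 18"
assert StandIn._fields.index("hidden_output") == 3

# _asdict() gives a name -> value mapping, convenient for logging or inspection.
as_dict = m._asdict()
print(as_dict["nwords"])  # 12
```

Attribute access is usually the more readable choice, since the positional order is alphabetical rather than meaningful.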
- gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)
Load a model from a binary stream.
- Parameters
fin (file) – The readable binary stream.
encoding (str, optional) – The encoding to use for decoding text.
full_model (boolean, optional) – If False, skips loading the hidden output matrix. This saves a fair bit of CPU time and RAM, but prevents training continuation.
- Returns
The loaded model.
- Return type
Model
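When only the vectors are needed, the hidden output matrix can be skipped at load time. The sketch below wraps load() for that case. `load_for_inference` is a hypothetical helper name, and only load() and its encoding and full_model arguments come from the signature above.

```python
def load_for_inference(path, encoding="utf-8"):
    """Load a Facebook .bin file without the hidden output matrix.

    `load_for_inference` is a hypothetical helper, not part of gensim.
    """
    # Imported lazily so this sketch can be defined without gensim installed.
    from gensim.models._fasttext_bin import load

    # load() expects a readable *binary* stream, so open the file in 'rb' mode.
    with open(path, "rb") as fin:
        return load(fin, encoding=encoding, full_model=False)
```

Going by the parameter descriptions, a model loaded this way should have hidden_output set to None, so it can be used to look up vectors but not to continue training.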
- gensim.models._fasttext_bin.save(model, fout, fb_fasttext_parameters, encoding)
Save word embeddings to Facebook’s native fastText .bin format.
- Parameters
fout (file name or writeable binary stream) – The file name or stream to which the model is saved.
model (gensim.models.fasttext.FastText) – The model to save.
fb_fasttext_parameters (dictionary) – A dictionary of the parameters lr_update_rate and word_ngrams. The gensim implementation does not use them, so they have to be provided externally.
encoding (str) – The encoding used in the output file.
Notes
Unfortunately, there is no documentation of Facebook’s native fastText .bin format.
This is just a reimplementation of [FastText::saveModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc),
based on v0.9.1, more precisely commit da2745fcccb848c7a225a7d558218ee4c64d5333.
The code follows the naming of the original C++ code.
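A call to save() might be assembled as in the sketch below. Both helper names (`fb_parameters`, `export_facebook_bin`) are hypothetical. The dictionary keys are taken from the parameter description above, and the default values of 100 and 1 are assumptions for illustration, so prefer the values the model was actually trained with when they are known.

```python
def fb_parameters(lr_update_rate=100, word_ngrams=1):
    """Build the dictionary of fastText-only settings that save() needs.

    The defaults are illustrative assumptions, not values read from a model.
    """
    return {"lr_update_rate": lr_update_rate, "word_ngrams": word_ngrams}


def export_facebook_bin(model, path, encoding="utf-8", **fb_kwargs):
    """Write a gensim FastText model to `path` in Facebook's .bin format."""
    # Imported lazily so this sketch can be defined without gensim installed.
    from gensim.models._fasttext_bin import save

    # save() accepts a writeable binary stream, so open the file in 'wb' mode.
    with open(path, "wb") as fout:
        save(model, fout, fb_parameters(**fb_kwargs), encoding)
```

Keeping the dictionary construction in its own function makes the two externally supplied settings explicit at the call site, for example `export_facebook_bin(model, "out.bin", word_ngrams=2)`.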