models.fasttext_inner – Cython routines for training FastText models
Optimized Cython functions for training a FastText model. The main entry point is train_batch_any(), which may be called directly from Python code.
Notes
The implementation of the above functions depends heavily on the FastTextConfig struct defined in gensim/models/fasttext_inner.pxd.
The gensim.models.word2vec.FAST_VERSION value reports which flavor of BLAS we’re currently using:
- 0: double
- 1: float
- 2: no BLAS; use plain Cython loops instead
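You can check which variant is in use at runtime; a minimal sketch (the printed value depends on how your gensim build was compiled):

```python
from gensim.models.word2vec import FAST_VERSION

# 0: double-precision BLAS, 1: single-precision BLAS, 2: no BLAS (plain Cython loops)
print(FAST_VERSION)
```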
- gensim.models.fasttext_inner.compute_ngrams(word, unsigned int min_n, unsigned int max_n)
Get the list of all possible ngrams for a given word.
- Parameters
word (str) – The word whose ngrams need to be computed.
min_n (unsigned int) – Minimum character length of the ngrams.
max_n (unsigned int) – Maximum character length of the ngrams.
- Returns
Sequence of character ngrams.
- Return type
list of str
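A minimal usage sketch; the exact output below is illustrative, reflecting fastText’s convention of padding the word with '<' and '>' boundary markers before extracting ngrams:

```python
from gensim.models.fasttext_inner import compute_ngrams

# fastText pads words with '<' and '>' before extracting character ngrams,
# so each ngram is a substring of '<word>'.
print(compute_ngrams("word", 3, 6))
# e.g. ['<wo', 'wor', 'ord', 'rd>', '<wor', 'word', ..., '<word>']
```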
- gensim.models.fasttext_inner.compute_ngrams_bytes(word, unsigned int min_n, unsigned int max_n)
Compute ngrams for a word, ported from the original Facebook implementation.
- Parameters
word (str) – A unicode string.
min_n (unsigned int) – The minimum ngram length.
max_n (unsigned int) – The maximum ngram length.
- Returns
A list of ngrams, where each ngram is a bytes object.
- Return type
list of bytes
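A short sketch of how this variant differs from compute_ngrams(): it returns UTF-8 encoded bytes objects rather than strings, which feed directly into ft_hash_bytes():

```python
from gensim.models.fasttext_inner import compute_ngrams_bytes

# Each ngram comes back as a UTF-8 encoded bytes object rather than a str.
for ngram in compute_ngrams_bytes("word", 3, 6):
    print(ngram)  # e.g. b'<wo', b'wor', ...
```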
- gensim.models.fasttext_inner.ft_hash_bytes(bytes bytez)
Calculate a hash based on bytez, reproducing the hash method from the Facebook fastText implementation.
- Parameters
bytez (bytes) – The string whose hash needs to be calculated, encoded as UTF-8.
- Returns
The hash of the string.
- Return type
unsigned int
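Together with compute_ngrams_bytes(), this hash is how ngrams are mapped to rows of the ngram embedding matrix. A hedged sketch; the bucket count of 2,000,000 is fastText’s default and an assumption here (a given model’s value lives in model.wv.bucket):

```python
from gensim.models.fasttext_inner import compute_ngrams_bytes, ft_hash_bytes

NUM_BUCKETS = 2_000_000  # fastText's default bucket count; an assumption here

for ngram in compute_ngrams_bytes("word", 3, 6):
    h = ft_hash_bytes(ngram)
    # The hash modulo the bucket count selects the ngram's embedding row.
    print(ngram, h % NUM_BUCKETS)
```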
- gensim.models.fasttext_inner.init()
Precompute the function sigmoid(x) = 1 / (1 + exp(-x)) for x values discretized into the table EXP_TABLE. Also compute log(sigmoid(x)) into LOG_TABLE.
We recalculate, rather than reuse the table from word2vec_inner, because Facebook’s FastText code uses a 512-slot table rather than the 1000-slot table of word2vec.c.
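A rough NumPy sketch of the discretization; the variable names and the exact slot-to-x mapping are assumptions (the Cython code builds the tables in C):

```python
import numpy as np

EXP_TABLE_SIZE = 512  # fastText's slot count (word2vec.c uses 1000)
MAX_EXP = 6           # sigmoid is treated as saturated outside [-MAX_EXP, MAX_EXP]

# Map slot index i onto x in (-MAX_EXP, MAX_EXP), then tabulate sigmoid and log-sigmoid.
x = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP
EXP_TABLE = 1.0 / (1.0 + np.exp(-x))  # sigmoid(x)
LOG_TABLE = np.log(EXP_TABLE)         # log(sigmoid(x))
```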
- gensim.models.fasttext_inner.train_batch_any(model, sentences, alpha, _work, _neu1)
Update the model by training on a sequence of sentences.
Each sentence is a list of string tokens, which are looked up in the model’s vocab dictionary. Called internally from train().
- Parameters
model (FastText) – Model to be trained.
sentences (iterable of list of str) – A single batch: part of the corpus streamed directly from disk/network.
alpha (float) – Learning rate.
_work (np.ndarray) – Private working memory for each worker.
_neu1 (np.ndarray) – Private working memory for each worker.
- Returns
Effective number of words trained.
- Return type
int
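In practice you rarely call this function yourself; FastText.train() dispatches each batch of sentences to it. A minimal end-to-end sketch, assuming the gensim 4.x API (the toy corpus and hyperparameters are illustrative only):

```python
from gensim.models import FastText

sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
]

model = FastText(vector_size=32, min_count=1)
model.build_vocab(sentences)
# train() streams batches to train_batch_any() in the Cython layer.
model.train(sentences, total_examples=len(sentences), epochs=5)
```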