models.deprecated.fasttext – FastText model

Warning

Deprecated since version 3.3.0: Use gensim.models.fasttext instead.

Learn word representations with fastText's skip-gram and CBOW models, using either hierarchical softmax or negative sampling [1].

Notes

There are more ways to get word vectors in Gensim than just FastText. See the wrappers for VarEmbed and WordRank, or Word2Vec.

This module allows training a word embedding from a training corpus with the additional ability to obtain word vectors for out-of-vocabulary words.

For a tutorial on gensim’s native fasttext, refer to the notebook [2].

Make sure you have a C compiler before installing gensim, so that the optimized (compiled) fasttext training can be used.

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606. https://arxiv.org/abs/1607.04606
[2] https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb
class gensim.models.deprecated.fasttext.FastText(sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000)

Bases: gensim.models.deprecated.word2vec.Word2Vec

Class for training, using and evaluating word representations learned with the method described in [1], also known as fastText.

The model can be stored/loaded via its save() and load() methods, or loaded in a format compatible with the original fasttext implementation via load_fasttext_format().

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Parameters:
  • sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
  • sg (int {1, 0}) – Defines the training algorithm. If 1, skip-gram is used, otherwise, CBOW is employed.
  • size (int) – Dimensionality of the feature vectors.
  • window (int) – The maximum distance between the current and predicted word within a sentence.
  • alpha (float) – The initial learning rate.
  • min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
  • seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
  • min_count (int) – Ignores all words with total frequency lower than this.
  • max_vocab_size (int) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
  • sample (float) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers (int) – Use this many worker threads to train the model (faster training on multicore machines).
  • hs (int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
  • negative (int) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • cbow_mean (int {1,0}) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
  • hashfxn (function) – Hash function to use to randomly initialize weights, for increased training reproducibility.
  • iter (int) – Number of iterations (epochs) over the corpus.
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  • sorted_vocab (int {1,0}) – If 1, sort the vocabulary by descending frequency before assigning word indexes.
  • batch_words (int) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
  • min_n (int) – Min length of char ngrams to be used for training word representations.
  • max_n (int) – Max length of char ngrams to be used for training word representations. Set max_n to be less than min_n to avoid char ngrams being used.
  • word_ngrams (int {1,0}) – If 1, enriches word vectors with subword (ngram) information. If 0, this is equivalent to word2vec.
  • bucket (int) – Character ngrams are hashed into a fixed number of buckets, in order to limit the memory usage of the model. This option specifies the number of buckets used by the model.
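
The interplay of min_n, max_n and bucket can be illustrated with a minimal pure-Python sketch. The helper names and the FNV-1a stand-in hash below are assumptions made for illustration; they are not gensim's API, and the hash used internally by the library may differ.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    # If max_n < min_n the outer range is empty, so no n-grams are produced.
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes (stand-in hash, an assumption)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket_index(ngram, bucket=2000000):
    """Map an n-gram to one of `bucket` slots, bounding memory usage."""
    return fnv1a_32(ngram) % bucket
```

Because distinct n-grams can land in the same slot, a smaller bucket value trades memory for more collisions.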

Examples

Initialize and train a FastText model

>>> from gensim.models import FastText
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = FastText(sentences, min_count=1)
>>> say_vector = model['say']  # get vector for word
>>> of_vector = model['of']  # get vector for out-of-vocab word
__getitem__(word)

Get word representations in vector space, as a 1D numpy array.

Parameters:word (str) – A single word whose vector needs to be returned.
Returns:The word’s representations in vector space, as a 1D numpy array.
Return type:numpy.ndarray
Raises:KeyError – For words with all ngrams absent, a KeyError is raised.
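
How an out-of-vocabulary lookup can succeed or raise KeyError is sketched below, assuming the vector is the average of the known ngram vectors. The function name and the plain-dict storage are hypothetical and do not reflect the model's internal data structures.

```python
def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Average the vectors of the word's known char n-grams.

    `ngram_vectors` is a hypothetical dict mapping n-gram -> list of floats.
    """
    wrapped = "<" + word + ">"
    known = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            vec = ngram_vectors.get(wrapped[start:start + n])
            if vec is not None:
                known.append(vec)
    if not known:
        # Mirrors the documented behaviour: no known n-gram means no vector.
        raise KeyError("all ngrams for word %r are absent" % word)
    size = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]
```

With this scheme, a word such as 'of' can get a vector from a model trained on 'woof', because both share the ngram 'of>'.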

Example

>>> from gensim.models import FastText
>>> from gensim.test.utils import datapath
>>>
>>> trained_model = FastText.load_fasttext_format(datapath('lee_fasttext'))
>>> hello_vector = trained_model['hello']  # get vector for word
accuracy(questions, restrict_vocab=30000, most_similar=None, case_insensitive=True)
build_vocab(sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False)

Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.

Parameters:
  • sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
  • keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  • progress_per (int) – Indicates how many words to process before showing/updating the progress.
  • update (bool) – If true, the new words in sentences will be added to model’s vocab.
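
A trim_rule callable might look like the following sketch. The word sets are hypothetical, and the fallback constant values are an assumption that is only used when gensim is not importable.

```python
try:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD, RULE_KEEP
except ImportError:
    # Stand-in values so the sketch runs without gensim (assumption).
    RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

STOPWORDS = {"the", "a", "of"}   # hypothetical words to always drop
DOMAIN_TERMS = {"fasttext"}      # hypothetical words to always keep


def my_trim_rule(word, count, min_count):
    """Decide per word; fall back to the min_count default otherwise."""
    if word in STOPWORDS:
        return RULE_DISCARD
    if word in DOMAIN_TERMS:
        return RULE_KEEP
    return RULE_DEFAULT
```

Such a function would be passed as trim_rule=my_trim_rule to build_vocab(); as noted above, it is not stored with the model.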

Example

Train a model and update vocab for online training

>>> from gensim.models import FastText
>>> sentences_1 = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> sentences_2 = [["dude", "say", "wazzup!"]]
>>>
>>> model = FastText(min_count=1)
>>> model.build_vocab(sentences_1)
>>> model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)
>>> model.build_vocab(sentences_2, update=True)
>>> model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build the model vocabulary from a dictionary that maps each word to its count. Words must be unicode strings.

Parameters:
  • word_freq (dict) – Dictionary mapping each word to its count.
  • keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
  • corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
  • update (bool) – If true, the new provided words in word_freq dict will be added to model’s vocab.
Return type:None

Examples

>>> from gensim.models.word2vec import Word2Vec
>>> model = Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})
clear_sims()

Remove all L2-normalized word vectors from the model. You will have to recompute them using the init_sims() method.

create_binary_tree()

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().
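
The idea that frequent words receive shorter codes can be seen in a small stand-alone sketch built on heapq. The function name and the dict-based output are illustrative assumptions, not the structures the model stores.

```python
import heapq
import itertools


def huffman_codes(word_counts):
    """Return {word: binary code string} built from word counts."""
    tiebreak = itertools.count()  # keeps heap entries comparable on ties
    heap = [(count, next(tiebreak), word) for word, count in word_counts.items()]
    if not heap:
        return {}
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two least frequent nodes into an inner node.
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tiebreak), (left, right)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, code = stack.pop()
        if isinstance(node, tuple):  # inner node: descend into both children
            stack.append((node[0], code + "0"))
            stack.append((node[1], code + "1"))
        else:                        # leaf: a vocabulary word
            codes[node] = code
    return codes
```

For the counts {"say": 10, "cat": 3, "dog": 2, "meow": 1}, the most frequent word "say" ends up with a one-bit code while "meow" needs three bits.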

delete_temporary_training_data(replace_word_vectors_with_normalized=False)

Discard parameters that are used in training and scoring. Use this only if you are sure you are done training the model. If replace_word_vectors_with_normalized is set, forget the original vectors and keep only the normalized ones, which saves a lot of memory.

doesnt_match(words)

Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.KeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)

Estimate required memory for a model using current settings and provided vocabulary size.

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.KeyedVectors.evaluate_word_pairs

finalize_vocab(update=False)

Build tables and model weights based on final vocabulary settings.

get_latest_training_loss()
get_vocab_word_vecs()

Calculate vectors for words in vocabulary and store them in wv.syn0.

init_ngrams(update=False)

Compute ngrams of all words present in vocabulary and store vectors for only those ngrams. Vectors for other ngrams are initialized with a random uniform distribution in FastText.

Parameters:update (bool) – If True, the new vocab words and their new ngrams word vectors are initialized with random uniform distribution and updated/added to the existing vocab word and ngram vectors.
init_sims(replace=False)

init_sims() resides in KeyedVectors because it deals mainly with syn0. However, syn1 is not an attribute of KeyedVectors, so it has to be deleted in this class, while the normalization of syn0 happens inside KeyedVectors.

initialize_word_vectors()

Initializes FastTextKeyedVectors instance to store all vocab/ngram vectors for the model.

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')

Merge the input-hidden weight matrix from the original C word2vec-tool format given, where it intersects with the current vocabulary. (No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)

binary is a boolean indicating whether the data is in binary word2vec format.

lockf is a lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.

load(*args, **kwargs)
classmethod load_fasttext_format(*args, **kwargs)

Load a FastText model from a format compatible with the original fasttext implementation.

Parameters:fname (str) – Path to the file.
load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)

Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.

log_accuracy(section)
log_evaluate_word_pairs(pearson, spearman, oov, pairs)

Deprecated. Use self.wv.log_evaluate_word_pairs() instead. Refer to the documentation for gensim.models.KeyedVectors.log_evaluate_word_pairs

make_cum_table(power=0.75, domain=2147483647)

Create a cumulative-distribution table using stored vocabulary word counts for drawing random words in the negative-sampling training routines.

To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]), then find that integer’s sorted insertion point (as if by bisect_left or ndarray.searchsorted()). That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.

Called internally from ‘build_vocab()’.
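
The table-and-bisect draw described above can be sketched in pure Python. The function names are hypothetical, a plain list stands in for the array, and counts is assumed to be non-empty.

```python
import random
from bisect import bisect_left


def build_cum_table(counts, power=0.75, domain=2**31 - 1):
    """Cumulative table over counts raised to `power`, scaled to `domain`."""
    weights = [count ** power for count in counts]
    total = sum(weights)
    table, cumulative = [], 0.0
    for weight in weights:
        cumulative += weight
        table.append(int(round(cumulative / total * domain)))
    table[-1] = domain  # guard against rounding drift in the last slot
    return table


def draw_index(table, rng):
    """Pick a random integer below table[-1] and find its insertion point."""
    return bisect_left(table, rng.randrange(table[-1]))
```

Each index is drawn with probability proportional to the increment at its slot, so words with higher counts are drawn as noise words more often.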

most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Deprecated. Use self.wv.most_similar() instead. Refer to the documentation for gensim.models.KeyedVectors.most_similar

most_similar_cosmul(positive=None, negative=None, topn=10)

Deprecated. Use self.wv.most_similar_cosmul() instead. Refer to the documentation for gensim.models.KeyedVectors.most_similar_cosmul

n_similarity(ws1, ws2)

Deprecated. Use self.wv.n_similarity() instead. Refer to the documentation for gensim.models.KeyedVectors.n_similarity

predict_output_word(context_words_list, topn=10)

Report the probability distribution of the center word given the context words as input to the trained model.

reset_from(other_model)

Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

reset_ngram_weights()

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary and their ngrams.

reset_weights()

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

save(*args, **kwargs)

Save the model. This saved model can be loaded again using load(), which supports online training and getting vectors for out-of-vocabulary words.

Parameters:fname (str) – Path to the file.
save_word2vec_format(fname, fvocab=None, binary=False)

Deprecated. Use model.wv.save_word2vec_format instead.

scale_vocab(min_count=None, sample=None, dry_run=False, keep_raw_vocab=False, trim_rule=None, update=False)

Apply vocabulary settings for min_count (discarding less-frequent words) and sample (controlling the downsampling of more-frequent words).

Calling with dry_run=True will only simulate the provided settings and report the size of the retained vocabulary, effective corpus length, and estimated memory requirements. Results are both printed via logging and returned as a dict.

Delete the raw vocabulary after the scaling is done to free up RAM, unless keep_raw_vocab is set.
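
The effect of sample on frequent words can be approximated with the word2vec-style keep probability. This is a sketch under the assumption that 0 < sample < 1; the function name is hypothetical and the exact formula used by the library may differ in details.

```python
from math import sqrt


def keep_probability(word_count, total_words, sample=0.001):
    """Probability of keeping one occurrence of a word during training."""
    if sample <= 0:
        return 1.0  # downsampling disabled
    threshold = sample * total_words
    prob = (sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)
```

Under this formula, a word that makes up 10% of a one-million-word corpus keeps roughly 11% of its occurrences, while rare words are always kept.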

scan_vocab(sentences, progress_per=10000, trim_rule=None)

Do an initial scan of all words appearing in sentences.

score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)

Score the log probability for a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings. This does not change the fitted model in any way (see Word2Vec.train() for that).

We have currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; we’ll run into problems if you ask to score more than this number of sentences but it is inefficient to set the value too high.

See the article by Taddy [3] and the gensim demo [4] for examples of how to use such scores in document classification.

[3] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations. In Proceedings of the 2015 Conference of the Association of Computational Linguistics.
[4] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
seeded_vector(seed_string)

Create one ‘random’ vector (but deterministic by seed_string)
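
A deterministic "random" vector can be sketched as follows. This illustration uses zlib.crc32 as a stable stand-in hash (an assumption; the model uses its configured hashfxn), which sidesteps the PYTHONHASHSEED issue of Python 3's built-in hash.

```python
import random
import zlib


def sketch_seeded_vector(seed_string, size=100):
    """Small pseudo-random vector fully determined by `seed_string`."""
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    rng = random.Random(seed)
    # Values are centred on zero and scaled down by the vector size.
    return [(rng.random() - 0.5) / size for _ in range(size)]
```

Calling it twice with the same string, for example a word concatenated with str(seed), yields the same vector across interpreter launches.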

similar_by_vector(vector, topn=10, restrict_vocab=None)

Deprecated. Use self.wv.similar_by_vector() instead. Refer to the documentation for gensim.models.KeyedVectors.similar_by_vector

similar_by_word(word, topn=10, restrict_vocab=None)

Deprecated. Use self.wv.similar_by_word() instead. Refer to the documentation for gensim.models.KeyedVectors.similar_by_word

similarity(w1, w2)

Deprecated. Use self.wv.similarity() instead. Refer to the documentation for gensim.models.KeyedVectors.similarity

sort_vocab()

Sort the vocabulary so the most frequent words have the lowest indexes.

train(sentences, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0)

Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). For FastText, each sentence must be a list of unicode strings. (Subclasses may accept other examples.)

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided (if the corpus is the same as was provided to build_vocab(), the count of examples in that corpus will be available in the model’s corpus_count property).
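
The linear decay needs a notion of progress, which is why a total count is required. A minimal sketch, with a hypothetical function name and progress taken as the fraction of all planned examples already processed:

```python
def linear_alpha(start_alpha, end_alpha, progress):
    """Learning rate after `progress` (0.0 to 1.0) of the training is done."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return start_alpha - (start_alpha - end_alpha) * progress
```

Here progress could be computed as examples_done / (total_examples * epochs); without total_examples or total_words this fraction is undefined.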

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.

Parameters:
  • sentences (iterable of iterables) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.
  • total_examples (int) – Count of sentences.
  • total_words (int) – Count of raw words in sentences.
  • epochs (int) – Number of iterations (epochs) over the corpus.
  • start_alpha (float) – Initial learning rate.
  • end_alpha (float) – Final learning rate. Drops linearly from start_alpha.
  • word_count (int) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.
  • queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
  • report_delay (float) – Seconds to wait before reporting progress.

Examples

>>> from gensim.models import FastText
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = FastText(min_count=1)
>>> model.build_vocab(sentences)
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
update_weights()

Copy all the existing weights, and reset the weights for the newly added vocabulary.

wmdistance(document1, document2)

Deprecated. Use self.wv.wmdistance() instead. Refer to the documentation for gensim.models.KeyedVectors.wmdistance

word_vec(word, use_norm=False)

Get the word’s representations in vector space, as a 1D numpy array.

Parameters:
  • word (str) – A single word whose vector needs to be returned.
  • use_norm (bool) – If True, returns normalized vector.
Returns:The word’s representations in vector space, as a 1D numpy array.
Return type:numpy.ndarray
Raises:KeyError – For words with all ngrams absent, a KeyError is raised.

Example

>>> from gensim.models import FastText
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = FastText(sentences, min_count=1)
>>> meow_vector = model.word_vec('meow')  # get vector for word
gensim.models.deprecated.fasttext.load_old_fasttext(*args, **kwargs)
gensim.models.deprecated.fasttext.train_batch_cbow(model, sentences, alpha, work=None, neu1=None)

Update CBOW model by training on a sequence of sentences.

Each sentence is a list of string tokens, which are looked up in the model’s vocab dictionary. Called internally from gensim.models.fasttext.FastText.train().

This is the non-optimized, Python version. If you have cython installed, gensim will use the optimized version from fasttext_inner instead.
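
The core of a CBOW step is combining the context vectors into one input, which is where the cbow_mean setting applies. The sketch below uses plain lists and a hypothetical function name; it omits the ngram handling and the weight updates of the real routine.

```python
def cbow_hidden(context_vectors, cbow_mean=1):
    """Combine context vectors by mean (cbow_mean=1) or sum (cbow_mean=0)."""
    size = len(context_vectors[0])
    summed = [sum(vec[i] for vec in context_vectors) for i in range(size)]
    if cbow_mean:
        return [value / len(context_vectors) for value in summed]
    return summed
```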

Parameters:
  • model (FastText) – FastText instance.
  • sentences (iterable of iterables) – Iterable of the sentences directly from disk/network.
  • alpha (float) – Learning rate.
  • work (numpy.ndarray) – Private working memory for each worker.
  • neu1 (numpy.ndarray) – Private working memory for each worker.
Returns:Effective number of words trained.
Return type:int

gensim.models.deprecated.fasttext.train_batch_sg(model, sentences, alpha, work=None, neu1=None)

Update skip-gram model by training on a sequence of sentences.

Each sentence is a list of string tokens, which are looked up in the model’s vocab dictionary. Called internally from gensim.models.fasttext.FastText.train().

This is the non-optimized, Python version. If you have cython installed, gensim will use the optimized version from fasttext_inner instead.
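
Skip-gram training works on (center, context) pairs drawn from a window around each word. The sketch below enumerates them with a fixed window, which is a simplifying assumption (training routines typically shrink the window at random); the function name is hypothetical.

```python
def skipgram_pairs(sentence, window=5):
    """List (center, context) pairs within `window` positions of each word."""
    pairs = []
    for pos, center in enumerate(sentence):
        start = max(0, pos - window)
        end = min(len(sentence), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:  # a word is not its own context
                pairs.append((center, sentence[ctx_pos]))
    return pairs
```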

Parameters:
  • model (FastText) – FastText instance.
  • sentences (iterable of iterables) – Iterable of the sentences directly from disk/network.
  • alpha (float) – Learning rate.
  • work (numpy.ndarray) – Private working memory for each worker.
  • neu1 (numpy.ndarray) – Private working memory for each worker.
Returns:Effective number of words trained.
Return type:int