models.fasttext – FastText model

Introduction

Learn word representations via fastText: Enriching Word Vectors with Subword Information.

This module allows training word embeddings from a training corpus with the additional ability to obtain word vectors for out-of-vocabulary words.

This module contains a fast native C implementation of fastText with Python interfaces. It is not just a wrapper around Facebook’s implementation.

This module supports loading models trained with Facebook’s fastText implementation. It also supports continuing training from such models.

For a tutorial see FastText Model.

Usage examples

Initialize and train a model:

>>> from gensim.models import FastText
>>> from gensim.test.utils import common_texts  # some example sentences
>>>
>>> print(common_texts[0])
['human', 'interface', 'computer']
>>> print(len(common_texts))
9
>>> model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
>>> model.build_vocab(corpus_iterable=common_texts)
>>> model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=10)  # train

Once you have a model, you can access its keyed vectors via the model.wv attribute. The keyed vectors instance is quite powerful: it can perform a wide range of NLP tasks. For a full list of examples, see KeyedVectors.
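
For example, on the toy model trained above (results on such a tiny corpus are illustrative only):

>>> computer_vec = model.wv['computer']  # numpy vector for a single word
>>> neighbours = model.wv.most_similar('computer', topn=3)  # list of (word, similarity) pairs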

You can also pass all the above parameters to the constructor to do everything in a single line:

>>> model2 = FastText(vector_size=4, window=3, min_count=1, sentences=common_texts, epochs=10)

The two models above are instantiated differently, but behave identically. For example, we can compare the embeddings they’ve calculated for the word “computer”:

>>> import numpy as np
>>>
>>> np.allclose(model.wv['computer'], model2.wv['computer'])
True

In the above examples, we trained the model from sentences (lists of words) loaded into memory. This is OK for smaller datasets, but for larger datasets, we recommend streaming the file, for example from disk or the network. In Gensim, we refer to such datasets as “corpora” (singular “corpus”), and keep them in the format described in LineSentence. Passing a corpus is simple:

>>> from gensim.test.utils import datapath
>>>
>>> corpus_file = datapath('lee_background.cor')  # absolute path to corpus
>>> model3 = FastText(vector_size=4, window=3, min_count=1)
>>> model3.build_vocab(corpus_file=corpus_file)  # scan over corpus to build the vocabulary
>>>
>>> total_words = model3.corpus_total_words  # number of words in the corpus
>>> model3.train(corpus_file=corpus_file, total_words=total_words, epochs=5)

The model needs the total_words parameter in order to manage the training rate (alpha) correctly, and to give accurate progress estimates. The above example relies on an implementation detail: the build_vocab() method sets the corpus_total_words (and also corpus_count) model attributes. You may calculate them by scanning over the corpus yourself, too.
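
As a rough sketch of doing that yourself (assuming the corpus is in LineSentence format, as lee_background.cor is), the counts below should match the corpus_count and corpus_total_words attributes set by build_vocab():

>>> from gensim.models.word2vec import LineSentence
>>>
>>> manual_examples = sum(1 for _ in LineSentence(corpus_file))    # number of sentences
>>> manual_words = sum(len(s) for s in LineSentence(corpus_file))  # number of raw words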

If you have a corpus in a different format, then you can use it by wrapping it in an iterator. Your iterator should yield a list of strings each time, where each string should be a separate word. Gensim will take care of the rest:

>>> from gensim.utils import tokenize
>>> from gensim import utils
>>>
>>>
>>> class MyIter:
...     def __iter__(self):
...         path = datapath('crime-and-punishment.txt')
...         with utils.open(path, 'r', encoding='utf-8') as fin:
...             for line in fin:
...                 yield list(tokenize(line))
>>>
>>>
>>> model4 = FastText(vector_size=4, window=3, min_count=1)
>>> model4.build_vocab(corpus_iterable=MyIter())
>>> total_examples = model4.corpus_count
>>> model4.train(corpus_iterable=MyIter(), total_examples=total_examples, epochs=5)

Persist a model to disk with:

>>> from gensim.test.utils import get_tmpfile
>>>
>>> fname = get_tmpfile("fasttext.model")
>>>
>>> model.save(fname)
>>> model = FastText.load(fname)

Once loaded, such models behave identically to those created from scratch. For example, you can continue training the loaded model:

>>> import numpy as np
>>>
>>> 'computation' in model.wv.key_to_index  # New word, currently out of vocab
False
>>> old_vector = np.copy(model.wv['computation'])  # Grab the existing vector
>>> new_sentences = [
...     ['computer', 'aided', 'design'],
...     ['computer', 'science'],
...     ['computational', 'complexity'],
...     ['military', 'supercomputer'],
...     ['central', 'processing', 'unit'],
...     ['onboard', 'car', 'computer'],
... ]
>>>
>>> model.build_vocab(new_sentences, update=True)  # Update the vocabulary
>>> model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
>>>
>>> new_vector = model.wv['computation']
>>> np.allclose(old_vector, new_vector, atol=1e-4)  # Vector has changed, model has learnt something
False
>>> 'computation' in model.wv.key_to_index  # Word is still out of vocab
False

Important

Be sure to call the build_vocab() method with update=True before the train() method when continuing training. Without this call, previously unseen terms will not be added to the vocabulary.

You can also load models trained with Facebook’s fastText implementation:

>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)

Once loaded, such models behave identically to those trained from scratch. You may continue training them on new data:

>>> 'computer' in fb_model.wv.key_to_index  # New word, currently out of vocab
False
>>> old_computer = np.copy(fb_model.wv['computer'])  # Grab the existing vector
>>> fb_model.build_vocab(new_sentences, update=True)
>>> fb_model.train(new_sentences, total_examples=len(new_sentences), epochs=fb_model.epochs)
>>> new_computer = fb_model.wv['computer']
>>> np.allclose(old_computer, new_computer, atol=1e-4)  # Vector has changed, model has learnt something
False
>>> 'computer' in fb_model.wv.key_to_index  # New word is now in the vocabulary
True

If you do not intend to continue training the model, consider using the gensim.models.fasttext.load_facebook_vectors() function instead. That function only loads the word embeddings (keyed vectors), consuming much less CPU and RAM:

>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> wv = load_facebook_vectors(cap_path)
>>>
>>> 'landlord' in wv.key_to_index  # Word is out of vocabulary
False
>>> oov_vector = wv['landlord']  # Even OOV words have vectors in FastText
>>>
>>> 'landlady' in wv.key_to_index  # Word is in the vocabulary
True
>>> iv_vector = wv['landlady']

Retrieve word vectors for an in-vocabulary and an out-of-vocabulary word:

>>> existent_word = "computer"
>>> existent_word in model.wv.key_to_index
True
>>> computer_vec = model.wv[existent_word]  # numpy vector of a word
>>>
>>> oov_word = "graph-out-of-vocab"
>>> oov_word in model.wv.key_to_index
False
>>> oov_vec = model.wv[oov_word]  # numpy vector for OOV word

You can perform various NLP word tasks with the model; some of them are already built in:

>>> similarities = model.wv.most_similar(positive=['computer', 'human'], negative=['interface'])
>>> most_similar = similarities[0]
>>>
>>> similarities = model.wv.most_similar_cosmul(positive=['computer', 'human'], negative=['interface'])
>>> most_similar = similarities[0]
>>>
>>> not_matching = model.wv.doesnt_match("human computer interface tree".split())
>>>
>>> sim_score = model.wv.similarity('computer', 'human')

Correlation with human opinion on word similarity:

>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

And on word analogies:

>>> analogies_result = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
class gensim.models.fasttext.FastText(sentences=None, corpus_file=None, sg=0, hs=0, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, word_ngrams=1, sample=0.001, seed=1, workers=3, min_alpha=0.0001, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=10000, callbacks=(), max_final_vocab=None, shrink_windows=True)

Bases: Word2Vec

Train, use and evaluate word representations learned using the method described in Enriching Word Vectors with Subword Information, aka FastText.

The model can be stored/loaded via its save() and load() methods, or loaded from a format compatible with the original Fasttext implementation via load_facebook_model().

Parameters
  • sentences (iterable of list of str, optional) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, LineSentence in word2vec module for such examples. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).

  • min_count (int, optional) – The model ignores all words with total frequency lower than this.

  • vector_size (int, optional) – Dimensionality of the word vectors.

  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.

  • workers (int, optional) – Use this many worker threads to train the model (=faster training with multicore machines).

  • alpha (float, optional) – The initial learning rate.

  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.

  • sg ({1, 0}, optional) – Training algorithm: skip-gram if sg=1, otherwise CBOW.

  • hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.

  • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).

  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

  • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.

  • cbow_mean ({1,0}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.

  • hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.

  • epochs (int, optional) – Number of iterations (epochs) over the corpus.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model. See the illustrative sketch at the end of the Examples section below.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • sorted_vocab ({1,0}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indices.

  • batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

  • min_n (int, optional) – Minimum length of char n-grams to be used for training word representations.

  • max_n (int, optional) – Max length of char ngrams to be used for training word representations. Set max_n to be less than min_n to avoid char ngrams being used.

  • word_ngrams (int, optional) – In Facebook’s FastText, “max length of word ngram” - but gensim only supports the default of 1 (regular unigram word handling).

  • bucket (int, optional) – Character ngrams are hashed into a fixed number of buckets, in order to limit the memory usage of the model. This option specifies the number of buckets used by the model. The default value of 2000000 consumes as much memory as having 2000000 more in-vocabulary words in your model.

  • callbacks – List of callbacks that need to be executed/run at specific stages during training.

  • max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically selecting a matching min_count. If the specified min_count is more than the automatically calculated min_count, the former will be used. Set to None if not required.

  • shrink_windows (bool, optional) – New in 4.1. Experimental. If True, the effective window size is uniformly sampled from [1, window] for each target word during training, to match the original word2vec algorithm’s approximate weighting of context words by distance. Otherwise, the effective window size is always fixed to window words to either side.

Examples

Initialize and train a FastText model:

>>> from gensim.models import FastText
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = FastText(sentences, min_count=1)
>>> say_vector = model.wv['say']  # get vector for word
>>> of_vector = model.wv['of']  # get vector for out-of-vocab word
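
As a minimal sketch of the trim_rule parameter described above (the rule below, which discards purely numeric tokens, is purely illustrative):

>>> from gensim.utils import RULE_DEFAULT, RULE_DISCARD
>>>
>>> def my_trim_rule(word, count, min_count):
...     if word.isdigit():
...         return RULE_DISCARD  # drop purely numeric tokens, regardless of frequency
...     return RULE_DEFAULT  # otherwise fall back to the usual min_count behaviour
>>>
>>> model_trimmed = FastText(vector_size=4, window=3, min_count=1)
>>> model_trimmed.build_vocab(corpus_iterable=sentences, trim_rule=my_trim_rule)
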
wv

This object essentially contains the mapping between words and embeddings. These are similar to the embeddings computed in Word2Vec; however, here we also include vectors for n-grams. This allows the model to compute embeddings even for unseen words (that do not exist in the vocabulary), as the aggregate of the n-grams included in the word. After training the model, this attribute can be used directly to query those embeddings in various ways. Check the module level docstring for some examples.

Type

FastTextKeyedVectors

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
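
For example (the event name and payload below are arbitrary, illustrative values):

>>> model.add_lifecycle_event(
...     "custom_note",
...     message="retrained on new_sentences",
... )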

add_null_word()
build_vocab(corpus_iterable=None, corpus_file=None, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)

Build vocabulary from a sequence of sentences (can be a once-only generator stream).

Parameters
  • corpus_iterable (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence module for such examples.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (not both of them).

  • update (bool) – If true, the new words in sentences will be added to model’s vocab.

  • progress_per (int, optional) – Indicates how many words to process before showing/updating the progress.

  • keep_raw_vocab (bool, optional) – If False, the raw vocabulary will be deleted after the scaling is done to free up RAM.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • **kwargs (object) – Keyword arguments propagated to self.prepare_vocab.

build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build vocabulary from a dictionary of word frequencies.

Parameters
  • word_freq (dict of (str, int)) – A mapping from a word in the vocabulary to its frequency count.

  • keep_raw_vocab (bool, optional) – If False, delete the raw vocabulary after the scaling is done to free up RAM.

  • corpus_count (int, optional) – Even if no corpus is provided, this argument can set corpus_count explicitly.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • update (bool, optional) – If true, the new provided words in word_freq dict will be added to model’s vocab.

create_binary_tree()

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().

estimate_memory(vocab_size=None, report=None)

Estimate memory that will be needed to train a model, and print the estimates to log.

get_latest_training_loss()

Get current value of the training loss.

Returns

Current training loss.

Return type

float

init_sims(replace=False)

Precompute L2-normalized vectors. Obsoleted.

If you need a single unit-normalized vector for some key, call get_vector() instead: fasttext_model.wv.get_vector(key, norm=True).

To refresh norms after you performed some atypical out-of-band vector tampering, call fill_norms() instead.

Parameters

replace (bool) – If True, forget the original trained vectors and only keep the normalized ones. You lose information if you do this.

init_weights()

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

classmethod load(*args, **kwargs)

Load a previously saved FastText model.

Parameters

fname (str) – Path to the saved file.

Returns

Loaded model.

Return type

FastText

See also

save()

Save FastText model.

load_binary_data(encoding='utf8')

Load data from a binary file created by Facebook’s native FastText.

Parameters

encoding (str, optional) – Specifies the encoding.

classmethod load_fasttext_format(model_file, encoding='utf8')

Deprecated.

Use gensim.models.fasttext.load_facebook_model() or gensim.models.fasttext.load_facebook_vectors() instead.

make_cum_table(domain=2147483647)

Create a cumulative-distribution table using stored vocabulary word counts for drawing random words in the negative-sampling training routines.

To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]), then find that integer’s sorted insertion point (as if by bisect_left or ndarray.searchsorted()). That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
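
A rough sketch of that procedure (assuming the model was built with negative sampling, so that the cum_table attribute exists):

>>> import numpy as np
>>>
>>> draw = np.random.default_rng().integers(model.cum_table[-1])  # random integer up to cum_table[-1]
>>> slot = int(np.searchsorted(model.cum_table, draw))            # sorted insertion point, as with bisect_left
>>> sampled_word = model.wv.index_to_key[slot]                    # drawn in proportion to the increment at that slot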

predict_output_word(context_words_list, topn=10)

Get the probability distribution of the center word given context words.

Note this performs a CBOW-style propagation, even in SG models, and doesn’t quite weight the surrounding words the same as in training – so it’s just one crude way of using a trained model as a predictor.

Parameters
  • context_words_list (list of (str and/or int)) – List of context words, which may be words themselves (str) or their index in self.wv.vectors (int).

  • topn (int, optional) – Return topn words and their probabilities.

Returns

topn length list of tuples of (word, probability).

Return type

list of (str, float)
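
For example (this assumes the model was trained with negative sampling, which is the default configuration):

>>> predictions = model.predict_output_word(['human', 'interface'], topn=3)  # list of (word, probability) tuples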

prepare_vocab(update=False, keep_raw_vocab=False, trim_rule=None, min_count=None, sample=None, dry_run=False)

Apply vocabulary settings for min_count (discarding less-frequent words) and sample (controlling the downsampling of more-frequent words).

Calling with dry_run=True will only simulate the provided settings and report the size of the retained vocabulary, effective corpus length, and estimated memory requirements. Results are both printed via logging and returned as a dict.

Delete the raw vocabulary after the scaling is done to free up RAM, unless keep_raw_vocab is set.

prepare_weights(update=False)

Build tables and model weights based on final vocabulary settings.

reset_from(other_model)

Borrow shareable pre-built structures from other_model and reset hidden layer weights.

Structures copied are:
  • Vocabulary

  • Index to word mapping

  • Cumulative frequency table (used for negative sampling)

  • Cached corpus length

Useful when testing multiple models on the same corpus in parallel. However, as the models then share all vocabulary-related structures other than vectors, neither should then expand their vocabulary (which could leave the other in an inconsistent, broken state). And, any changes to any per-word ‘vecattr’ will affect both models.

Parameters

other_model (Word2Vec) – Another model to copy the internal structures from.

save(*args, **kwargs)

Save the Fasttext model. This saved model can be loaded again using load(), which supports incremental training and getting vectors for out-of-vocabulary words.

Parameters

fname (str) – Store the model to this file.

See also

load()

Load FastText model.

scan_vocab(corpus_iterable=None, corpus_file=None, progress_per=10000, workers=None, trim_rule=None)
score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)

Score the log probability for a sequence of sentences. This does not change the fitted model in any way (see train() for that).

Gensim has currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; you’ll run into problems if you ask to score more than this number of sentences, but it is inefficient to set the value too high.

See the article by Matt Taddy: “Document Classification by Inversion of Distributed Language Representations” and the gensim demo for examples of how to use such scores in document classification.

Parameters
  • sentences (iterable of list of str) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.

  • total_sentences (int, optional) – Count of sentences.

  • chunksize (int, optional) – Chunksize of jobs

  • queue_factor (int, optional) – Multiplier for size of queue (number of workers * queue_factor).

  • report_delay (float, optional) – Seconds to wait before reporting progress.

seeded_vector(seed_string, vector_size)
train(corpus_iterable=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=(), **kwargs)

Update the model’s neural weights from a sequence of sentences.

Notes

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided. If sentences is the same corpus that was provided to build_vocab() earlier, you can simply use total_examples=self.corpus_count.

Warning

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case where train() is only called once, you can set epochs=self.epochs.

Parameters
  • corpus_iterable (iterable of list of str) – The corpus_iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network, to limit RAM usage. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (not both of them).

  • total_examples (int) – Count of sentences.

  • total_words (int) – Count of raw words in sentences.

  • epochs (int) – Number of iterations (epochs) over the corpus.

  • start_alpha (float, optional) – Initial learning rate. If supplied, replaces the starting alpha from the constructor, for this one call to train(). Use only if making multiple calls to train(), when you want to manage the alpha learning-rate yourself (not recommended).

  • end_alpha (float, optional) – Final learning rate. Drops linearly from start_alpha. If supplied, this replaces the final min_alpha from the constructor, for this one call to train(). Use only if making multiple calls to train(), when you want to manage the alpha learning-rate yourself (not recommended).

  • word_count (int, optional) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.

  • queue_factor (int, optional) – Multiplier for size of queue (number of workers * queue_factor).

  • report_delay (float, optional) – Seconds to wait before reporting progress.

  • compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().

  • callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.

Examples

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)  # prepare the model vocabulary
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)  # train word vectors
(1, 30)
update_weights()

Copy all the existing weights, and reset the weights for the newly added vocabulary.

class gensim.models.fasttext.FastTextKeyedVectors(vector_size, min_n, max_n, bucket, count=0, dtype=<class 'numpy.float32'>)

Bases: KeyedVectors

Vectors and vocab for FastText.

Implements significant parts of the FastText algorithm. For example, the word_vec() method calculates vectors for out-of-vocabulary (OOV) entities. FastText achieves this by keeping vectors for ngrams: adding the vectors for the ngrams of an entity yields the vector for the entity.

Similar to a hashmap, this class keeps a fixed number of buckets, and maps all ngrams to buckets using a hash function.

Parameters
  • vector_size (int) – The dimensionality of all vectors.

  • min_n (int) – The minimum number of characters in an ngram

  • max_n (int) – The maximum number of characters in an ngram

  • bucket (int) – The number of buckets.

  • count (int, optional) – If provided, vectors will be pre-allocated for at least this many vectors. (Otherwise they can be added later.)

  • dtype (type, optional) – Vector dimensions will default to np.float32 (AKA REAL in some Gensim code) unless another type is provided here.

vectors_vocab

Each row corresponds to a vector for an entity in the vocabulary. Columns correspond to vector dimensions. When embedded in a full FastText model, these are the full-word-token vectors updated by training, whereas the inherited vectors are the actual per-word vectors synthesized from the full-word-token and all subword (ngram) vectors.

Type

np.array

vectors_ngrams

A vector for each ngram across all entities in the vocabulary. Each row is a vector that corresponds to a bucket. Columns correspond to vector dimensions.

Type

np.array

buckets_word

For each key (by its index), the bucket slots its subwords map to.

Type

list of np.array

__contains__(word)

Check if word or any character ngrams in word are present in the vocabulary. A vector for the word is guaranteed to exist if this method returns True.

Parameters

word (str) – Input word.

Returns

True if word or any character ngrams in word are present in the vocabulary, False otherwise.

Return type

bool

Note

This method always returns True with char ngrams, because of the way FastText works.

If you want to check if a word is an in-vocabulary term, use this instead:
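
For example, reusing the wv keyed vectors loaded via load_facebook_vectors() in the usage examples above:

>>> 'landlady' in wv.key_to_index  # in-vocabulary word
True
>>> 'landlord' in wv.key_to_index  # out-of-vocabulary word, although wv['landlord'] still returns a vector
False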

__getitem__(key_or_keys)

Get vector representation of key_or_keys.

Parameters

key_or_keys ({str, list of str, int, list of int}) – Requested key or list-of-keys.

Returns

Vector representation for key_or_keys (1D if key_or_keys is single key, otherwise - 2D).

Return type

numpy.ndarray

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

add_vector(key, vector)

Add one new vector at the given key, into existing slot if available.

Warning: using this repeatedly is inefficient, requiring a full reallocation & copy, if this instance hasn’t been preallocated to be ready for such incremental additions.

Parameters
  • key (str) – Key identifier of the added vector.

  • vector (numpy.ndarray) – 1D numpy array with the vector values.

Returns

Index of the newly added vector, so that self.vectors[result] == vector and self.index_to_key[result] == key.

Return type

int

add_vectors(keys, weights, extras=None, replace=False)

Append keys and their vectors in a manual way. If some key is already in the vocabulary, the old vector is kept unless replace flag is True.

Parameters
  • keys (list of (str or int)) – Keys specified by string or int ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for keys which already exist in the map; if True - replace vectors, otherwise - keep old vectors.

adjust_vectors()

Adjust the vectors for words in the vocabulary.

The adjustment composes the trained full-word-token vectors with the vectors of the subword ngrams, matching the Facebook reference implementation behavior.

allocate_vecattrs(attrs=None, types=None)

Ensure arrays for given per-vector extra-attribute names & types exist, at right size.

The length of the index_to_key list is the canonical ‘intended size’ of the KeyedVectors, even if other properties (such as the vectors array) haven’t yet been allocated or expanded. So this allocation targets that size.

closer_than(key1, key2)

Get all keys that are closer to key1 than key2 is to key1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).

  • vectors_all (numpy.ndarray) – For each row in vectors_all, distance from vector_1 is computed, expected shape (num_vectors, dim).

Returns

Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type

numpy.ndarray
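
A small illustrative call, reusing the toy model from the usage examples above:

>>> query = model.wv['computer']
>>> sims = model.wv.cosine_similarities(query, model.wv.vectors[:5])  # similarities to the first 5 vectors, shape (5,)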

distance(w1, w2)

Compute cosine distance between two keys. Calculate 1 - similarity().

Parameters
  • w1 (str) – Input key.

  • w2 (str) – Input key.

Returns

Distance between w1 and w2.

Return type

float

distances(word_or_vector, other_words=())

Compute cosine distances from given word or vector to all words in other_words. If other_words is empty, return distance between word_or_vector and all words in vocab.

Parameters
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.

  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).

Returns

Array containing distances to all words in other_words from input word_or_vector.

Return type

numpy.array

Raises

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which key from the given list doesn’t go with the others?

Parameters

words (list of str) – List of keys.

Returns

The key furthest away from the mean of all keys.

Return type

str

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False, similarity_function='most_similar')

Compute performance of the model on an analogy test set.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.

  • similarity_function (str, optional) – Function name used for similarity calculation.

Returns

  • score (float) – The overall evaluation score on the entire evaluation set

  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.

evaluate_word_pairs(pairs, delimiter='\t', encoding='utf8', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html and https://www.cl.cam.ac.uk/~fh295/simlex.html.

Parameters
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.

  • delimiter (str, optional) – Separator in pairs file.

  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero similarities for word pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.

Returns

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.

  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.

  • oov_ratio (float) – The ratio of pairs with unknown words.

fill_norms(force=False)

Ensure per-vector norms are available.

Any code which modifies vectors should ensure the accompanying norms are either recalculated or ‘None’, to trigger a full recalculation later on-request.

get_index(key, default=None)

Return the integer index (slot/position) where the given key’s vector is stored in the backing vectors array.

get_mean_vector(keys, weights=None, pre_normalize=True, post_normalize=False, ignore_missing=True)

Get the mean vector for a given list of keys.

Parameters
  • keys (list of (str or int or ndarray)) – Keys specified by string or int ids or numpy array.

  • weights (list of float or numpy.ndarray, optional) – 1D array of the same size as keys, specifying the weight for each key.

  • pre_normalize (bool, optional) – Flag indicating whether to normalize each key vector before taking the mean. If False, individual key vectors will not be normalized.

  • post_normalize (bool, optional) – Flag indicating whether to normalize the final mean vector. If True, the normalized mean vector will be returned.

  • ignore_missing (bool, optional) – If False, will raise an error if a key doesn’t exist in the vocabulary.

Returns

Mean vector for the list of keys.

Return type

numpy.ndarray

Raises
  • ValueError – If the sizes of the keys and weights lists don’t match.

  • KeyError – If any of the keys doesn’t exist in the vocabulary and ignore_missing is False.
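
For example, an (unweighted) mean of two word vectors from the toy model in the usage examples above:

>>> mean_vec = model.wv.get_mean_vector(['computer', 'human'])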

get_normed_vectors()

Get all embedding vectors normalized to unit L2 length (euclidean), as a 2D numpy array.

To see which key corresponds to which vector = which array row, refer to the index_to_key attribute.

Returns

2D numpy array of shape (number_of_keys, embedding dimensionality), L2-normalized along the rows (key vectors).

Return type

numpy.ndarray

get_sentence_vector(sentence)

Get a single 1-D vector representation for a given sentence. This function is a workalike of the official fastText get_sentence_vector().

Parameters

sentence (list of (str or int)) – list of words specified by string or int ids.

Returns

1-D numpy array representation of the sentence.

Return type

numpy.ndarray
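
For example, using the toy model from the usage examples above (out-of-vocabulary tokens are handled via their character ngrams):

>>> sentence_vec = model.wv.get_sentence_vector(['human', 'computer', 'interaction'])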

get_vecattr(key, attr)

Get attribute value associated with given key.

Parameters
  • key (str) – Vector key for which to fetch the attribute value.

  • attr (str) – Name of the additional attribute to fetch for the given key.

Returns

Value of the additional attribute fetched for the given key.

Return type

object
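
For example, the raw word frequency collected during vocabulary building is stored under the 'count' attribute:

>>> computer_count = model.wv.get_vecattr('computer', 'count')  # corpus frequency of 'computer'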

get_vector(word, norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word.

  • norm (bool, optional) – If True, resulting vector will be L2-normalized (unit Euclidean length).

Returns

Vector representation of word.

Return type

numpy.ndarray

Raises

KeyError – If the word and all of its ngrams are not present in the vocabulary.
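
For example, comparing the raw and the unit-normalized vector for the same word:

>>> raw_vec = model.wv.get_vector('computer')              # raw vector
>>> unit_vec = model.wv.get_vector('computer', norm=True)  # same direction, scaled to unit L2 length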

has_index_for(key)

Can this model return a single index for this key?

Subclasses that synthesize vectors for out-of-vocabulary words (like FastText) may respond True for a simple word in wv (__contains__()) check but False for this more-specific check.

property index2entity
property index2word
init_post_load(fb_vectors)

Perform initialization after loading a native Facebook model.

Expects that the vocabulary (self.key_to_index) has already been initialized.

Parameters

fb_vectors (np.array) – A matrix containing vectors for all the entities, including words and ngrams. This comes directly from the binary model. The order of the vectors must correspond to the indices in the vocabulary.

init_sims(replace=False)

Precompute data helpful for bulk similarity calculations.

fill_norms() is now preferred for this purpose.

Parameters

replace (bool, optional) – If True - forget the original vectors and only keep the normalized ones.

Warning

You cannot sensibly continue training after doing a replace on a model’s internal KeyedVectors, and a replace is no longer necessary to save RAM. Do not use this method.

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

Parameters
  • fname (str) – The file path to load the vectors from.

  • lockf (float, optional) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.

  • binary (bool, optional) – If True, fname is in the binary word2vec C format.

  • encoding (str, optional) – Encoding of text for unicode function (python2 only).

  • unicode_errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).

classmethod load(fname_or_handle, **kwargs)

Load a previously saved FastTextKeyedVectors model.

Parameters

fname (str) – Path to the saved file.

Returns

Loaded model.

Return type

FastTextKeyedVectors

See also

save()

Save FastTextKeyedVectors model.

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)

Load KeyedVectors from a file produced by the original C word2vec-tool format.

Warning

The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

Parameters
  • fname (str) – The file path to the saved word2vec-format file.

  • fvocab (str, optional) – File path to the vocabulary. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

  • binary (bool, optional) – If True, indicates whether the data is in binary word2vec format.

  • encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.

  • unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.

  • limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.

  • datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.

  • no_header (bool, optional) – Default False means a usual word2vec-format file, with a 1st line declaring the count of following vectors & number of dimensions. If True, the file is assumed to lack a declaratory (vocab_size, vector_size) header and instead start with the 1st vector, and an extra reading-pass will be used to discover the number of vectors. Works only with binary=False.

Returns

Loaded model.

Return type

KeyedVectors

static log_accuracy(section)
static log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, restrict_vocab=None, indexer=None)

Find the top-N most similar keys. Positive keys contribute positively towards the similarity, negative keys negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given keys and the vectors for each key in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters
  • positive (list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional) – List of keys that contribute positively. If tuple, second element specifies the weight (default 1.0)

  • negative (list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional) – List of keys that contribute negatively. If tuple, second element specifies the weight (default -1.0)

  • topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.

  • clip_start (int) – Start clipping index.

  • clip_end (int) – End clipping index.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.) If specified, overrides any values of clip_start or clip_end.

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

most_similar_cosmul(positive=None, negative=None, topn=10, restrict_vocab=None)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Allows calls like most_similar_cosmul('dog', 'cat'), as a shorthand for most_similar_cosmul(['dog'], ['cat']), where 'dog' is positive and 'cat' is negative.

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int or None, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. This may be meaningful if vocabulary is sorted by descending frequency.

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

most_similar_to_given(key1, keys_list)

Get the key from keys_list most similar to key1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of keys.

Parameters
  • ws1 (list of str) – Sequence of keys.

  • ws2 (list of str) – Sequence of keys.

Returns

Similarities between ws1 and ws2.

Return type

numpy.ndarray

rank(key1, key2)

Rank of the distance of key2 from key1, in relation to distances of all keys from key1.

rank_by_centrality(words, use_norm=True)

Rank the given words by similarity to the centroid of all the words.

Parameters
  • words (list of str) – List of keys.

  • use_norm (bool, optional) – Whether to calculate centroid using unit-normed vectors; default True.

Returns

Ranked list of (similarity, key), most-similar to the centroid first.

Return type

list of (float, str)

recalc_char_ngram_buckets()

Scan the vocabulary, calculate ngrams and their hashes, and cache the list of ngrams for each known word.

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, as proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.

Parameters
  • wa (str) – Word for which to look up the top-n similar words.

  • wb (str) – Word for which we are evaluating the relative cosine similarity with wa.

  • topn (int, optional) – Number of top-n similar words to consider with respect to wa.

Returns

Relative cosine similarity between wa and wb.

Return type

numpy.float64
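
For example, on the toy model from the usage examples above (the exact value depends on training):

>>> rcs = model.wv.relative_cosine_similarity('computer', 'human', topn=10)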

resize_vectors(seed=0)

Make underlying vectors match ‘index_to_key’ size; random-initialize any new rows.

save(*args, **kwargs)

Save object.

Parameters

fname (str) – Path to the output file.

See also

load()

Load object.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters
  • fname (str) – File path to save the vectors to.

  • fvocab (str, optional) – File path to save additional vocabulary information to. None to not store the vocabulary.

  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

  • total_vec (int, optional) – Explicitly specify total number of vectors (in case word vectors are appended with document vectors afterwards).

  • write_header (bool, optional) – If False, don’t write the 1st line declaring the count of vectors and dimensions. This is the format used by e.g. GloVe vectors.

  • prefix (str, optional) – String to prepend in front of each stored word. Default = no prefix.

  • append (bool, optional) – If set, open fname in ab mode instead of the default wb mode.

  • sort_attr (str, optional) – Sort the output vectors in descending order of this attribute. Default: most frequent keys first.
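
For example (the temporary file name below is arbitrary):

>>> from gensim.test.utils import get_tmpfile
>>>
>>> w2v_fname = get_tmpfile("fasttext_vectors_w2v.txt")
>>> model.wv.save_word2vec_format(w2v_fname, binary=False)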

set_vecattr(key, attr, val)

Set attribute associated with the given key to value.

Parameters
  • key (str) – Store the attribute for this vector key.

  • attr (str) – Name of the additional attribute to store for the given key.

  • val (object) – Value of the additional attribute to store for the given key.

Returns

Return type

None

similar_by_key(key, topn=10, restrict_vocab=None)

Find the top-N most similar keys.

Parameters
  • key (str) – Key

  • topn (int or None, optional) – Number of top-N similar keys to return. If topn is None, similar_by_key returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar keys by vector.

Parameters
  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)

Compatibility alias for similar_by_key().

similarity(w1, w2)

Compute cosine similarity between two keys.

Parameters
  • w1 (str) – Input key.

  • w2 (str) – Input key.

Returns

Cosine similarity between w1 and w2.

Return type

float
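
For example, a minimal sketch (model is assumed to be a trained FastText model containing both keys):

>>> sim = model.wv.similarity('computer', 'human')  # cosine similarity, in the range [-1, 1]
>>> bool(-1.0 <= sim <= 1.0)
True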

similarity_unseen_docs(*args, **kwargs)

sort_by_descending_frequency()

Sort the vocabulary so the most frequent words have the lowest indexes.

unit_normalize_all()

Destructively scale all vectors to unit-length.

You cannot sensibly continue training after such a step.
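
For example, a minimal sketch of the in-place effect (model is assumed to be a trained FastText model that you do not need to train further):

>>> import numpy as np
>>>
>>> model.wv.unit_normalize_all()  # rescales the stored vectors in place
>>> vector_norm = np.linalg.norm(model.wv['computer'])  # now approximately 1.0 for in-vocabulary words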

vectors_for_all(keys: Iterable, allow_inference: bool = True, copy_vecattrs: bool = False) → KeyedVectors

Produce vectors for all given keys as a new KeyedVectors object.

Notes

The keys will always be deduplicated. For optimal performance, you should not pass entire corpora to the method. Instead, you should construct a dictionary of unique words in your corpus:

>>> from collections import Counter
>>> import itertools
>>>
>>> from gensim.models import FastText
>>> from gensim.test.utils import datapath, common_texts
>>>
>>> model_corpus_file = datapath('lee_background.cor')  # train word vectors on some corpus
>>> model = FastText(corpus_file=model_corpus_file, vector_size=20, min_count=1)
>>> corpus = common_texts  # infer word vectors for words from another corpus
>>> word_counts = Counter(itertools.chain.from_iterable(corpus))  # count words in your corpus
>>> words_by_freq = (k for k, v in word_counts.most_common())
>>> word_vectors = model.wv.vectors_for_all(words_by_freq)  # create word-vectors for words in your corpus

Parameters
  • keys (iterable) – The keys that will be vectorized.

  • allow_inference (bool, optional) – In subclasses such as FastTextKeyedVectors, vectors for out-of-vocabulary keys (words) may be inferred. Default is True.

  • copy_vecattrs (bool, optional) – Additional attributes set via the KeyedVectors.set_vecattr() method will be preserved in the produced KeyedVectors object. Default is False. To ensure that all the produced vectors will have vector attributes assigned, you should set allow_inference=False.

Returns

keyedvectors – Vectors for all the given keys.

Return type

KeyedVectors

property vectors_norm

property vocab

wmdistance(document1, document2, norm=True)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the original Word Mover’s Distance paper, “From Word Embeddings To Document Distances” (Kusner et al., 2015).

Parameters
  • document1 (list of str) – Input document.

  • document2 (list of str) – Input document.

  • norm (boolean) – Normalize all word vectors to unit length before computing the distance? Defaults to True.

Returns

Word Mover’s distance between document1 and document2.

Return type

float

Warning

This method only works if POT is installed.

If one of the documents has no words that exist in the vocabulary, float(‘inf’) (i.e. infinity) is returned.

Raises

ImportError

If POT isn’t installed.
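
For example, a minimal sketch (requires the POT package to be installed; model is assumed to be a trained FastText model, and the two example documents are arbitrary):

>>> doc_a = ['human', 'computer', 'interface']
>>> doc_b = ['graph', 'trees']
>>> wmd = model.wv.wmdistance(doc_a, doc_b)  # lower distance means more similar documents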

word_vec(*args, **kwargs)

Compatibility alias for get_vector(); must exist so subclass calls reach subclass get_vector().

words_closer_than(word1, word2)

class gensim.models.fasttext.FastTextTrainables

Bases: SaveLoad

Obsolete class, retained only for backward-compatible load()s.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
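
For example, a minimal sketch of recording a custom event (the event name and payload are arbitrary; the same method is inherited from SaveLoad by the FastText model itself, which is where it is typically used):

>>> model.add_lifecycle_event("note", message="switched to a larger training corpus")  # arbitrary name and payload
>>> events = model.lifecycle_events  # list of event dicts, including the one just added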

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.models.fasttext.FastTextVocab

Bases: SaveLoad

This is a redundant class. It exists only to maintain backwards compatibility with older gensim versions.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

gensim.models.fasttext.ft_ngram_hashes(word, minn, maxn, num_buckets)

Calculate the ngrams of the word and hash them.

Parameters
  • word (str) – The word to calculate ngram hashes for.

  • minn (int) – Minimum ngram length

  • maxn (int) – Maximum ngram length

  • num_buckets (int) – The number of buckets

Returns

Return type

A list of hashes (integers), one per detected ngram.
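
For example, a minimal sketch (minn=3, maxn=6 and a bucket count of 2000000 mirror the usual fastText defaults, but any values work):

>>> from gensim.models.fasttext import ft_ngram_hashes
>>>
>>> hashes = ft_ngram_hashes("night", minn=3, maxn=6, num_buckets=2000000)  # one hash per character ngram
>>> all(0 <= h < 2000000 for h in hashes)  # every hash indexes one of the num_buckets buckets
True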

gensim.models.fasttext.load_facebook_model(path, encoding='utf-8')

Load the model from Facebook’s native fasttext .bin output file.

Notes

Facebook provides both .vec and .bin files with their models. The former contains human-readable vectors. The latter contains machine-readable vectors along with other model parameters. This function requires you to provide the full path to the .bin file. It effectively ignores the .vec output file, since it is redundant.

This function uses the smart_open library to open the path. The path may be on a remote host (e.g. HTTP, S3, etc). It may also be gzip or bz2 compressed (i.e. end in .bin.gz or .bin.bz2). For details, see https://github.com/RaRe-Technologies/smart_open.

Parameters
  • path (str) – Path to the FastText output files. FastText outputs two model files: /path/to/model.vec and /path/to/model.bin. Expected value for this example: /path/to/model or /path/to/model.bin, as Gensim requires only the .bin file to load the entire fastText model.

  • encoding (str, optional) – Specifies the file encoding.

Examples

Load, infer, continue training:

>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)
>>>
>>> 'landlord' in fb_model.wv.key_to_index  # Word is out of vocabulary
False
>>> oov_term = fb_model.wv['landlord']
>>>
>>> 'landlady' in fb_model.wv.key_to_index  # Word is in the vocabulary
True
>>> iv_term = fb_model.wv['landlady']
>>>
>>> new_sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
>>> fb_model.build_vocab(new_sent, update=True)
>>> fb_model.train(corpus_iterable=new_sent, total_examples=len(new_sent), epochs=5)

Returns

The loaded model.

Return type

gensim.models.fasttext.FastText

See also

load_facebook_vectors() loads the word embeddings only. It’s faster, but does not enable you to continue training.

gensim.models.fasttext.load_facebook_vectors(path, encoding='utf-8')

Load word embeddings from a model saved in Facebook’s native fasttext .bin format.

Notes

Facebook provides both .vec and .bin files with their models. The former contains human-readable vectors. The latter contains machine-readable vectors along with other model parameters. This function requires you to provide the full path to the .bin file. It effectively ignores the .vec output file, since it is redundant.

This function uses the smart_open library to open the path. The path may be on a remote host (e.g. HTTP, S3, etc). It may also be gzip or bz2 compressed. For details, see https://github.com/RaRe-Technologies/smart_open.

Parameters
  • path (str) – The location of the model file.

  • encoding (str, optional) – Specifies the file encoding.

Returns

The word embeddings.

Return type

gensim.models.fasttext.FastTextKeyedVectors

Examples

Load and infer:

>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fbkv = load_facebook_vectors(cap_path)
>>>
>>> 'landlord' in fbkv.key_to_index  # Word is out of vocabulary
False
>>> oov_vector = fbkv['landlord']
>>>
>>> 'landlady' in fbkv.key_to_index  # Word is in the vocabulary
True
>>> iv_vector = fbkv['landlady']

See also

load_facebook_model() loads the full model, not just word embeddings, and enables you to continue model training.

gensim.models.fasttext.save_facebook_model(model, path, encoding='utf-8', lr_update_rate=100, word_ngrams=1)

Save word embeddings to Facebook’s native fasttext .bin format.

Notes

Facebook provides both .vec and .bin files with their models. The former contains human-readable vectors. The latter contains machine-readable vectors along with other model parameters. This function saves only the .bin file.

Parameters
  • model (gensim.models.fasttext.FastText) – FastText model to be saved.

  • path (str) – Output path and filename (including .bin extension)

  • encoding (str, optional) – Specifies the file encoding. Defaults to utf-8.

  • lr_update_rate (int) – This parameter is used by the Facebook fastText tool but is unused by Gensim. It defaults to the Facebook fastText default value of 100. In very rare circumstances you might wish to change it.

  • word_ngrams (int) – This parameter is used by the Facebook fastText tool but is unused by Gensim. It defaults to the Facebook fastText default value of 1. In very rare circumstances you might wish to change it.

Returns

Return type

None
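
For example, a minimal sketch of a save/load round trip (the output file name is illustrative, and model is assumed to be a trained gensim FastText model as in the usage examples above):

>>> from gensim.test.utils import get_tmpfile
>>> from gensim.models.fasttext import save_facebook_model, load_facebook_model
>>>
>>> bin_path = get_tmpfile("saved_fasttext.bin")  # illustrative output file name
>>> save_facebook_model(model, bin_path)  # write the full model in Facebook's .bin format
>>> roundtrip_model = load_facebook_model(bin_path)  # usable by gensim or by Facebook's fastText tool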