models.word2vec – Word2vec embeddings

This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming and Pythonic interfaces.

The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.

Other embeddings

There are more ways to train word vectors in Gensim than just Word2Vec. See also Doc2Vec, FastText and wrappers for VarEmbed and WordRank.

The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality and optimizations over the years.

For a tutorial on Gensim word2vec, with an interactive web app trained on GoogleNews, visit https://rare-technologies.com/word2vec-tutorial/.

Make sure you have a C compiler before installing Gensim, to use the optimized word2vec routines (70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).

Usage examples

Initialize a model with e.g.:

>>> from gensim.test.utils import common_texts, get_tmpfile
>>> from gensim.models import Word2Vec
>>>
>>> path = get_tmpfile("word2vec.model")
>>>
>>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
>>> model.save(path)

The training is streamed, meaning sentences can be a generator, reading input data from disk on-the-fly, without loading the entire corpus into RAM.
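
For example, a minimal sketch of such a streaming corpus (the class name and file path below are hypothetical, not part of Gensim):

>>> class MySentences(object):
...     """Stream sentences (lists of tokens) from a plain-text file, one sentence per line."""
...     def __init__(self, path):
...         self.path = path
...     def __iter__(self):
...         with open(self.path) as fin:
...             for line in fin:
...                 yield line.split()
>>>
>>> # streamed_model = Word2Vec(MySentences('/path/to/corpus.txt'), size=100, workers=4)  # hypothetical path

Because the Word2Vec constructor iterates over the corpus several times (once to build the vocabulary, then once per training epoch), a restartable iterable like the class above is the safest choice; a one-shot generator is only suitable for a single pass, such as build_vocab().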

This streamed, online training also means you can continue training the model later:

>>> model = Word2Vec.load(path)
>>> model.train([["hello", "world"]], total_examples=1, epochs=1)
(0, 2)

The trained word vectors are stored in a KeyedVectors instance in model.wv:

>>> vector = model.wv['computer']  # numpy vector of a word

The reason for separating the trained vectors into KeyedVectors is that if you don't need the full model state any more (i.e. you don't need to continue training), the state can be discarded. This results in a much smaller and faster object that can be mmapped for lightning-fast loading and for sharing the vectors in RAM between processes:

>>> from gensim.models import KeyedVectors
>>>
>>> path = get_tmpfile("wordvectors.kv")
>>>
>>> model.wv.save(path)
>>> wv = KeyedVectors.load(path, mmap='r')
>>> vector = wv['computer']  # numpy vector of a word

Gensim can also load word vectors in the “word2vec C format”, as a KeyedVectors instance:

>>> from gensim.test.utils import datapath
>>>
>>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
>>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)  # C bin format

It is impossible to continue training the vectors loaded from the C format because the hidden weights, vocabulary frequencies and the binary tree are missing. To continue training, you’ll need the full Word2Vec object state, as stored by save(), not just the KeyedVectors.

You can perform various NLP word tasks with a trained model. Some of them are already built in; you can see them in gensim.models.keyedvectors.

If you’re finished training a model (i.e. no more updates, only querying), you can switch to the KeyedVectors instance:

>>> word_vectors = model.wv
>>> del model

This trims unneeded model state, leaving an object that uses much less RAM and allows fast loading and memory sharing (mmap).

Note that there is a gensim.models.phrases module which lets you automatically detect phrases longer than one word. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Phrases
>>>
>>> bigram_transformer = Phrases(common_texts)
>>> model = Word2Vec(bigram_transformer[common_texts], min_count=1)
class gensim.models.word2vec.BrownCorpus(dirname)

Bases: object

Iterate over sentences from the Brown corpus (part of NLTK data).
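
Examples

A minimal sketch; the directory path below is hypothetical and should point to your local copy of the NLTK Brown corpus data:

>>> from gensim.models.word2vec import BrownCorpus
>>>
>>> sentences = BrownCorpus('/path/to/nltk_data/corpora/brown')  # hypothetical path to the Brown corpus files
>>> for sentence in sentences:
...     pass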

class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)

Bases: object

Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.

Parameters
  • source (string or a file-like object) – Path to the file on disk, or an already-open file object (must support seek(0)).

  • limit (int or None) – Clip the file to the first limit lines. Do no clipping if limit is None (the default).

Examples

>>> from gensim.test.utils import datapath
>>> sentences = LineSentence(datapath('lee_background.cor'))
>>> for sentence in sentences:
...     pass
class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)

Bases: object

Like LineSentence, but process all files in a directory in alphabetical order by filename.

The directory must only contain files that can be read by gensim.models.word2vec.LineSentence: .bz2, .gz, and text files. Any file not ending with .bz2 or .gz is assumed to be a text file.

The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace.

Warning

Does not recurse into subdirectories.

Parameters
  • source (str) – Path to the directory.

  • limit (int or None) – Read only the first limit lines from each file. Read all if limit is None (the default).
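
Examples

A minimal sketch, assuming a hypothetical directory containing pre-tokenized .txt, .gz or .bz2 files:

>>> from gensim.models.word2vec import PathLineSentences
>>>
>>> sentences = PathLineSentences('/path/to/corpus_directory')  # hypothetical directory
>>> for sentence in sentences:
...     pass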

class gensim.models.word2vec.Text8Corpus(fname, max_sentence_length=10000)

Bases: object

Iterate over sentences from the “text8” corpus, unzipped from http://mattmahoney.net/dc/text8.zip.
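
Examples

A minimal sketch, assuming you have downloaded and unzipped the text8 archive locally (the path is hypothetical):

>>> from gensim.models import Word2Vec
>>> from gensim.models.word2vec import Text8Corpus
>>>
>>> sentences = Text8Corpus('/path/to/text8')  # hypothetical path to the unzipped file
>>> model = Word2Vec(sentences)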

class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)

Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel

Train, use and evaluate neural networks described in https://code.google.com/p/word2vec/.

Once you’re finished training a model (=no more updates, only querying) store and use only the KeyedVectors instance in self.wv to reduce memory.

The model can be stored/loaded via its save() and load() methods.

The trained word vectors can also be stored/loaded from a format compatible with the original word2vec implementation via self.wv.save_word2vec_format and gensim.models.keyedvectors.KeyedVectors.load_word2vec_format().

Some important attributes are the following:

wv

This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.

Type

Word2VecKeyedVectors

vocabulary

This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. Besides keeping track of all unique words, this object provides extra functionality, such as constructing a Huffman tree (frequent words are closer to the root), or discarding extremely rare words.

Type

Word2VecVocab

trainables

This object represents the inner shallow neural network used to train the embeddings. The semantics of the network differ slightly between the two available training modes (CBOW or SG), but you can think of it as an NN with a single projection and hidden layer which we train on the corpus. The weights are then used as our embeddings (which means that the size of the hidden layer is equal to the number of features, self.size).

Type

Word2VecTrainables

Parameters
  • sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).

  • size (int, optional) – Dimensionality of the word vectors.

  • window (int, optional) – Maximum distance between the current and predicted word within a sentence.

  • min_count (int, optional) – Ignores all words with total frequency lower than this.

  • workers (int, optional) – Use this many worker threads to train the model (=faster training with multicore machines).

  • sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.

  • hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.

  • negative (int, optional) – If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.

  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.

  • cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when CBOW is used.

  • alpha (float, optional) – The initial learning rate.

  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.

  • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).

  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

  • max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.

  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

  • hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.

  • iter (int, optional) – Number of iterations (epochs) over the corpus.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used; see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during build_vocab() and is not stored as part of the model. A sketch of a custom rule is shown after the Examples below.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().

  • batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.

  • compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().

  • callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.

Examples

Initialize and train a Word2Vec model

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1)
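
The trim_rule parameter accepts a custom callable; a minimal sketch (the rule below is hypothetical and simply discards the word "say" regardless of its count):

>>> from gensim.utils import RULE_DEFAULT, RULE_DISCARD
>>>
>>> def my_trim_rule(word, count, min_count):
...     """Hypothetical rule: always discard "say", defer to min_count for everything else."""
...     return RULE_DISCARD if word == "say" else RULE_DEFAULT
>>>
>>> model = Word2Vec(sentences, min_count=1, trim_rule=my_trim_rule)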
accuracy(questions, restrict_vocab=30000, most_similar=None, case_insensitive=True)

Deprecated. Use self.wv.accuracy instead. See accuracy().

build_vocab(sentences=None, corpus_file=None, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)

Build vocabulary from a sequence of sentences (can be a once-only generator stream).

Parameters
  • sentences (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in the word2vec module for such examples.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (not both of them).

  • update (bool) – If true, the new words in sentences will be added to model’s vocab.

  • progress_per (int, optional) – Indicates how many words to process before showing/updating the progress.

  • keep_raw_vocab (bool, optional) – If False, the raw vocabulary will be deleted after the scaling is done to free up RAM.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used; see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • **kwargs (object) – Keyword arguments propagated to self.vocabulary.prepare_vocab.
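
For example, a minimal sketch of an incremental vocabulary update followed by further training (more_sentences is hypothetical new data):

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)  # initial vocabulary
>>> counts = model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
>>>
>>> more_sentences = [["dog", "say", "woof", "again"]]  # hypothetical new data
>>> model.build_vocab(more_sentences, update=True)  # add the new words to the existing vocab
>>> counts = model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)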

build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build vocabulary from a dictionary of word frequencies.

Parameters
  • word_freq (dict of (str, int)) – A mapping from a word in the vocabulary to its frequency count.

  • keep_raw_vocab (bool, optional) – If False, delete the raw vocabulary after the scaling is done to free up RAM.

  • corpus_count (int, optional) – Even if no corpus is provided, this argument can set corpus_count explicitly.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used; see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • update (bool, optional) – If True, the new words provided in the word_freq dict will be added to the model's vocab.
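
For example, a minimal sketch using a hypothetical, hand-made frequency dictionary:

>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(size=50, min_count=1)
>>> model.build_vocab_from_freq({"cat": 20, "dog": 15, "meow": 7, "woof": 5})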

clear_sims()

Remove all L2-normalized word vectors from the model, to free up memory.

You can recompute them later again using the init_sims() method.

property cum_table
delete_temporary_training_data(replace_word_vectors_with_normalized=False)

Discard parameters that are used in training and scoring, to save memory.

Warning

Use only if you’re sure you’re done training a model.

Parameters

replace_word_vectors_with_normalized (bool, optional) – If True, forget the original (not normalized) word vectors and only keep the L2-normalized word vectors, to save even more memory.

doesnt_match(words)

Deprecated, use self.wv.doesnt_match() instead.

Refer to the documentation for doesnt_match().

estimate_memory(vocab_size=None, report=None)

Estimate required memory for a model using current settings and provided vocabulary size.

Parameters
  • vocab_size (int, optional) – Number of unique tokens in the vocabulary

  • report (dict of (str, int), optional) – A dictionary from string representations of the model’s memory consuming members to their size in bytes.

Returns

A dictionary from string representations of the model’s memory consuming members to their size in bytes.

Return type

dict of (str, int)
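
For example, a minimal sketch of checking the estimate once a vocabulary exists (the exact numbers depend on your settings and vocabulary):

>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab([["cat", "say", "meow"], ["dog", "say", "woof"]])
>>> report = model.estimate_memory()  # dict mapping member names (including a 'total' entry) to estimated bytes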

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Deprecated, use self.wv.evaluate_word_pairs() instead.

Refer to the documentation for evaluate_word_pairs().

get_latest_training_loss()

Get current value of the training loss.

Returns

Current training loss.

Return type

float
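
For example, a minimal sketch; the loss is only tracked when the model is trained with compute_loss=True:

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1, compute_loss=True)
>>> loss = model.get_latest_training_loss()  # loss accumulated during the most recent training run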

property hashfxn
init_sims(replace=False)

Deprecated. Use self.wv.init_sims instead. See init_sims().

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

Parameters
  • fname (str) – The file path to load the vectors from.

  • lockf (float, optional) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.

  • binary (bool, optional) – If True, fname is in the binary word2vec C format.

  • encoding (str, optional) – Encoding of text for unicode function (python2 only).

  • unicode_errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
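
For example, a hedged sketch of seeding a model with pre-trained vectors; the .bin path is hypothetical, the vocabulary must be built first, and size must match the dimensionality of the pre-trained vectors:

>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(size=300, min_count=1)
>>> model.build_vocab([["cat", "say", "meow"], ["dog", "say", "woof"]])
>>> model.intersect_word2vec_format('/path/to/pretrained_vectors.bin', binary=True, lockf=1.0)  # hypothetical file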

property iter
property layer1_size
classmethod load(*args, **kwargs)

Load a previously saved Word2Vec model.

See also

save()

Save model.

Parameters

fname (str) – Path to the saved file.

Returns

Loaded model.

Return type

Word2Vec

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>)

Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format() instead.

static log_accuracy(section)

Deprecated. Use self.wv.log_accuracy instead. See log_accuracy().

property min_count
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Deprecated, use self.wv.most_similar() instead.

Refer to the documentation for most_similar().

most_similar_cosmul(positive=None, negative=None, topn=10)

Deprecated, use self.wv.most_similar_cosmul() instead.

Refer to the documentation for most_similar_cosmul().

n_similarity(ws1, ws2)

Deprecated, use self.wv.n_similarity() instead.

Refer to the documentation for n_similarity().

predict_output_word(context_words_list, topn=10)

Get the probability distribution of the center word given context words.

Parameters
  • context_words_list (list of str) – List of context words.

  • topn (int, optional) – Return topn words and their probabilities.

Returns

topn length list of tuples of (word, probability).

Return type

list of (str, float)
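
For example, a minimal sketch using a toy model; prediction requires negative sampling (the default), and the exact probabilities will vary from run to run:

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1)
>>> predictions = model.predict_output_word(["cat", "say"], topn=2)  # list of (word, probability) pairs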

reset_from(other_model)

Borrow shareable pre-built structures from other_model and reset hidden layer weights.

Structures copied are:
  • Vocabulary

  • Index to word mapping

  • Cumulative frequency table (used for negative sampling)

  • Cached corpus length

Useful when testing multiple models on the same corpus in parallel.

Parameters

other_model (Word2Vec) – Another model to copy the internal structures from.
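
For example, a hedged sketch of reusing one model's vocabulary and frequency tables in a second model with different hyperparameters:

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> base_model = Word2Vec(sentences, min_count=1)
>>>
>>> other_model = Word2Vec(min_count=1, window=2)  # different hyperparameters, no corpus scan of its own
>>> other_model.reset_from(base_model)  # borrow vocab, cumulative table and corpus length
>>> counts = other_model.train(sentences, total_examples=other_model.corpus_count, epochs=other_model.iter)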

property sample
save(*args, **kwargs)

Save the model. This saved model can be loaded again using load(), which supports online training and getting vectors for vocabulary words.

Parameters

fname (str) – Path to the file.

save_word2vec_format(fname, fvocab=None, binary=False)

Deprecated. Use model.wv.save_word2vec_format instead. See gensim.models.KeyedVectors.save_word2vec_format().

score(sentences, total_sentences=1000000, chunksize=100, queue_factor=2, report_delay=1)

Score the log probability for a sequence of sentences. This does not change the fitted model in any way (see train() for that).

Gensim has currently only implemented score for the hierarchical softmax scheme, so you need to have run word2vec with hs=1 and negative=0 for this to work.

Note that you should specify total_sentences; you'll run into problems if you ask to score more sentences than this, but it is inefficient to set the value too high.

See the article by Matt Taddy: “Document Classification by Inversion of Distributed Language Representations” and the gensim demo for examples of how to use such scores in document classification.

Parameters
  • sentences (iterable of list of str) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples.

  • total_sentences (int, optional) – Count of sentences.

  • chunksize (int, optional) – Chunksize of jobs

  • queue_factor (int, optional) – Multiplier for size of queue (number of workers * queue_factor).

  • report_delay (float, optional) – Seconds to wait before reporting progress.
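
For example, a minimal sketch; scoring requires a model trained with hierarchical softmax (hs=1, negative=0):

>>> from gensim.models import Word2Vec
>>>
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>> model = Word2Vec(sentences, min_count=1, hs=1, negative=0)
>>> log_probs = model.score([["cat", "say", "meow"]], total_sentences=1)  # one log probability per sentence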

similar_by_vector(vector, topn=10, restrict_vocab=None)

Deprecated, use self.wv.similar_by_vector() instead.

Refer to the documentation for similar_by_vector().

similar_by_word(word, topn=10, restrict_vocab=None)

Deprecated, use self.wv.similar_by_word() instead.

Refer to the documentation for similar_by_word().

similarity(w1, w2)

Deprecated, use self.wv.similarity() instead.

Refer to the documentation for similarity().

property syn0_lockf
property syn1
property syn1neg
train(sentences=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=())

Update the model’s neural weights from a sequence of sentences.

Notes

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided. If sentences is the same corpus that was provided to build_vocab() earlier, you can simply use total_examples=self.corpus_count.

Warning

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case where train() is only called once, you can set epochs=self.iter.

Parameters
  • sentences (iterable of list of str) –

    The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (not both of them).

  • total_examples (int) – Count of sentences.

  • total_words (int) – Count of raw words in sentences.

  • epochs (int) – Number of iterations (epochs) over the corpus.

  • start_alpha (float, optional) – Initial learning rate. If supplied, replaces the starting alpha from the constructor, for this one call to train(). Use only if making multiple calls to train(), when you want to manage the alpha learning-rate yourself (not recommended).

  • end_alpha (float, optional) – Final learning rate. Drops linearly from start_alpha. If supplied, this replaces the final min_alpha from the constructor, for this one call to train(). Use only if making multiple calls to train(), when you want to manage the alpha learning-rate yourself (not recommended).

  • word_count (int, optional) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.

  • queue_factor (int, optional) – Multiplier for size of queue (number of workers * queue_factor).

  • report_delay (float, optional) – Seconds to wait before reporting progress.

  • compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().

  • callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.

Examples

>>> from gensim.models import Word2Vec
>>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
>>>
>>> model = Word2Vec(min_count=1)
>>> model.build_vocab(sentences)  # prepare the model vocabulary
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)  # train word vectors
(1, 30)
wmdistance(document1, document2)

Deprecated, use self.wv.wmdistance() instead.

Refer to the documentation for wmdistance().

class gensim.models.word2vec.Word2VecTrainables(vector_size=100, seed=1, hashfxn=<built-in function hash>)

Bases: gensim.utils.SaveLoad

Represents the inner shallow neural network used to train Word2Vec.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

prepare_weights(hs, negative, wv, update=False, vocabulary=None)

Build tables and model weights based on final vocabulary settings.

reset_weights(hs, negative, wv)

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing of the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

seeded_vector(seed_string, vector_size)

Get a random vector (deterministic, given seed_string).

update_weights(hs, negative, wv)

Copy all the existing weights, and reset the weights for the newly added vocabulary.

class gensim.models.word2vec.Word2VecVocab(max_vocab_size=None, min_count=5, sample=0.001, sorted_vocab=True, null_word=0, max_final_vocab=None, ns_exponent=0.75)

Bases: gensim.utils.SaveLoad

Vocabulary used by Word2Vec.

add_null_word(wv)
create_binary_tree(wv)

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

make_cum_table(wv, domain=2147483647)

Create a cumulative-distribution table using stored vocabulary word counts for drawing random words in the negative-sampling training routines.

To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]), then find that integer's sorted insertion point (as if by bisect_left or ndarray.searchsorted()). That insertion point is the drawn index, coming up in proportion to the increment at that slot.

Called internally from build_vocab().
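
As an illustration only (not part of the Gensim API), the drawing step might be sketched with numpy like this:

>>> import numpy as np
>>>
>>> cum_table = np.array([50, 80, 95, 100])  # toy cumulative counts for a 4-word vocabulary
>>> random_ints = np.random.randint(cum_table[-1], size=5)  # uniform integers in [0, cum_table[-1])
>>> drawn_indices = cum_table.searchsorted(random_ints)  # word indices, roughly in proportion to each slot's increment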

prepare_vocab(hs, negative, wv, update=False, keep_raw_vocab=False, trim_rule=None, min_count=None, sample=None, dry_run=False)

Apply vocabulary settings for min_count (discarding less-frequent words) and sample (controlling the downsampling of more-frequent words).

Calling with dry_run=True will only simulate the provided settings and report the size of the retained vocabulary, effective corpus length, and estimated memory requirements. Results are both printed via logging and returned as a dict.

Delete the raw vocabulary after the scaling is done to free up RAM, unless keep_raw_vocab is set.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing of the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

scan_vocab(sentences=None, corpus_file=None, progress_per=10000, workers=None, trim_rule=None)
sort_vocab(wv)

Sort the vocabulary so the most frequent words have the lowest indexes.

gensim.models.word2vec.score_cbow_pair(model, word, l1)

Score the trained CBOW model on a pair of words.

Parameters
  • model (Word2Vec) – The trained model.

  • word (Vocab) – Vocabulary representation of the first word.

  • l1 (list of float) – Vector representation of the second word.

Returns

Logarithm of the sum of exponentiations of input words.

Return type

float

gensim.models.word2vec.score_sg_pair(model, word, word2)

Score the trained Skip-gram model on a pair of words.

Parameters
  • model (Word2Vec) – The trained model.

  • word (Vocab) – Vocabulary representation of the first word.

  • word2 (Vocab) – Vocabulary representation of the second word.

Returns

Logarithm of the sum of exponentiations of input words.

Return type

float

gensim.models.word2vec.train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True, compute_loss=False, context_vectors=None, context_locks=None, is_ft=False)

Train the passed model instance on a word and its context, using the CBOW algorithm.

Parameters
  • model (Word2Vec) – The model to be trained.

  • word (str) – The label (predicted) word.

  • input_word_indices (list of int) – The vocabulary indices of the words in the context.

  • l1 (list of float) – Vector representation of the label word.

  • alpha (float) – Learning rate.

  • learn_vectors (bool, optional) – Whether the vectors should be updated.

  • learn_hidden (bool, optional) – Whether the weights of the hidden layer should be updated.

  • compute_loss (bool, optional) – Whether or not the training loss should be computed.

  • context_vectors (list of list of float, optional) – Vector representations of the words in the context. If None, these will be retrieved from the model.

  • context_locks (list of float, optional) – The lock factors for each word in the context.

  • is_ft (bool, optional) – If True, weights will be computed using model.wv.syn0_vocab and model.wv.syn0_ngrams instead of model.wv.syn0.

Returns

Error vector to be back-propagated.

Return type

numpy.ndarray

gensim.models.word2vec.train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_hidden=True, context_vectors=None, context_locks=None, compute_loss=False, is_ft=False)

Train the passed model instance on a word and its context, using the Skip-gram algorithm.

Parameters
  • model (Word2Vec) – The model to be trained.

  • word (str) – The label (predicted) word.

  • context_index (list of int) – The vocabulary indices of the words in the context.

  • alpha (float) – Learning rate.

  • learn_vectors (bool, optional) – Whether the vectors should be updated.

  • learn_hidden (bool, optional) – Whether the weights of the hidden layer should be updated.

  • context_vectors (list of list of float, optional) – Vector representations of the words in the context. If None, these will be retrieved from the model.

  • context_locks (list of float, optional) – The lock factors for each word in the context.

  • compute_loss (bool, optional) – Whether or not the training loss should be computed.

  • is_ft (bool, optional) – If True, weights will be computed using model.wv.syn0_vocab and model.wv.syn0_ngrams instead of model.wv.syn0.

Returns

Error vector to be back-propagated.

Return type

numpy.ndarray