models.doc2vec – Deep learning with paragraph2vec

Deep learning via the distributed memory (PV-DM) and distributed bag of words (PV-DBOW) models from [1], using either hierarchical softmax or negative sampling [2] [3]. For a hands-on tutorial, see [4].

Make sure you have a C compiler before installing gensim, to use optimized (compiled) doc2vec training (70x speedup [blog]).

Initialize a model with e.g.:

>>> model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)

Persist a model to disk with:

>>> model.save(fname)
>>> model = Doc2Vec.load(fname)  # you can continue training with the loaded model!

If you’re finished training a model (i.e. no more updates, only querying), you can call

>>> model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

to discard the temporary training state and use (much) less RAM.
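
A minimal end-to-end sketch (the toy corpus and hyperparameters below are made up for illustration):

>>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
>>>
>>> # toy corpus: each document carries a unique integer tag
>>> documents = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
...              TaggedDocument(words=['survey', 'user', 'computer', 'system'], tags=[1])]
>>> model = Doc2Vec(documents, size=50, window=2, min_count=1, workers=1, iter=20)
>>> vector = model.infer_vector(['human', 'computer'])  # vector for an unseen document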

[1] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[4] Doc2vec in gensim tutorial, https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
[blog] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
class gensim.models.doc2vec.Doc2Vec(documents=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)

Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel

Class for training, using and evaluating neural networks described in http://arxiv.org/pdf/1405.4053v2.pdf

Initialize the model from an iterable of documents. Each document is a TaggedDocument object that will be used for training.

Parameters:
  • documents (iterable of iterables) – The documents iterable can be simply a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network. If you don’t supply documents, the model is left uninitialized – use if you plan to initialize it in some other way.
  • dm (int {1,0}) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
  • size (int) – Dimensionality of the feature vectors.
  • window (int) – The maximum distance between the current and predicted word within a sentence.
  • alpha (float) – The initial learning rate.
  • min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
  • seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
  • min_count (int) – Ignores all words with total frequency lower than this.
  • max_vocab_size (int) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
  • sample (float) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers (int) – Use this many worker threads to train the model (= faster training with multicore machines).
  • iter (int) – Number of iterations (epochs) over the corpus.
  • hs (int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
  • negative (int) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • dm_mean (int {1,0}) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
  • dm_concat (int {1,0}) – If 1, use concatenation of context vectors rather than sum/average; Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
  • dm_tag_count (int) – Expected constant number of document tags per document, when using dm_concat mode; default is 1.
  • dbow_words (int {1,0}) – If set to 1, trains word-vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training; if 0, only trains doc-vectors (faster).
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  • callbacks – List of callbacks that need to be executed/run at specific stages during training.
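
For example, a PV-DBOW model that also trains skip-gram word vectors alongside the document vectors could be set up like this (sketch; documents is assumed to be an iterable of TaggedDocument objects):

>>> model = Doc2Vec(documents, dm=0, dbow_words=1, size=100, window=8, min_count=5, workers=4)
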
build_vocab(documents, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)

Build vocabulary from a sequence of documents (which can be a once-only generator stream). Each document must be a TaggedDocument object (a list of unicode word tokens plus a list of tags).

Parameters:
  • documents (iterable of iterables) – The documents iterable can be simply a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network. See TaggedBrownCorpus or TaggedLineDocument in doc2vec module for such examples.
  • keep_raw_vocab (bool) – If False, delete the raw vocabulary after the scaling is done, freeing up RAM.
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  • progress_per (int) – Indicates how many words to process before showing/updating the progress.
  • update (bool) – If true, the new words in sentences will be added to model’s vocab.
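
For example, a custom trim_rule can force a specific word to be kept regardless of its frequency (sketch; the corpus and the word 'gensim' are purely illustrative):

>>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
>>> from gensim import utils
>>>
>>> def my_rule(word, count, min_count):
...     if word == 'gensim':
...         return utils.RULE_KEEP   # always keep this word
...     return utils.RULE_DEFAULT    # otherwise fall back to the min_count check
>>>
>>> documents = [TaggedDocument(words=['gensim', 'is', 'fun'], tags=[0])]
>>> model = Doc2Vec(size=50, min_count=2)
>>> model.build_vocab(documents, trim_rule=my_rule)
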
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build the model vocabulary from a dictionary mapping each word (a unicode string) to its count.

Parameters:
  • word_freq (dict) – A dictionary mapping words to their counts.
  • keep_raw_vocab (bool) – If False, delete the raw vocabulary after the scaling is done, freeing up RAM.
  • corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
  • trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  • update (bool) – If True, the words provided in the word_freq dict will be added to the model’s vocab.

Examples

>>> from gensim.models.word2vec import Word2Vec
>>> model = Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})
clear_sims()
cum_table
dbow

int {1,0} – dbow=1 indicates distributed bag of words (PV-DBOW); otherwise ‘distributed memory’ (PV-DM) is used.

delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

Discard parameters that are used in training and scoring. Use if you’re sure you’re done training a model.

Parameters:
  • keep_doctags_vectors (bool) – Set keep_doctags_vectors to False if you don’t want to keep the doctag vectors; in that case you cannot use the docvecs’ most_similar, similarity, etc. methods.
  • keep_inference (bool) – Set keep_inference to False if you don’t want to store the parameters used by the infer_vector method.
dm

int {1,0} – dm=1 indicates ‘distributed memory’ (PV-DM); otherwise distributed bag of words (PV-DBOW) is used.

doesnt_match(**kwargs)

Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)

Estimate required memory for a model using current settings.

estimated_lookup_memory()

Estimated memory for tag lookup; 0 if using pure int tags.

evaluate_word_pairs(**kwargs)

Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.evaluate_word_pairs

hashfxn
infer_vector(doc_words, alpha=0.1, min_alpha=0.0001, steps=5)

Infer a vector for a given document, after bulk training has finished.

Parameters:
  • doc_words (list of str) – The document, as a list of word tokens.
  • alpha (float) – The initial learning rate.
  • min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
  • steps (int) – Number of times to train the new document.
Returns:

Returns the inferred vector for the new document.

Return type:

numpy.ndarray
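
For example (assuming model is an already trained Doc2Vec instance):

>>> tokens = ['machine', 'learning', 'with', 'gensim']
>>> vector = model.infer_vector(tokens, alpha=0.025, min_alpha=0.0001, steps=20)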

init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and keep only the normalized ones, saving lots of memory.

Note that you cannot continue training or inference after doing a replace. The model becomes effectively read-only: you can call most_similar, similarity, etc., but not train or infer_vector.

iter
layer1_size
classmethod load(*args, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
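
For example, a large model whose arrays were saved separately can be loaded with memory-mapping (the file name below is hypothetical):

>>> model = Doc2Vec.load('/tmp/my_doc2vec.model', mmap='r')
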
min_count
most_similar(**kwargs)

Deprecated. Use self.wv.most_similar() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar

most_similar_cosmul(**kwargs)

Deprecated. Use self.wv.most_similar_cosmul() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar_cosmul

n_similarity(**kwargs)

Deprecated. Use self.wv.n_similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.n_similarity

reset_from(other_model)

Reuse shareable structures from other_model.

sample
save(fname_or_handle, **kwargs)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

save_word2vec_format(fname, doctag_vec=False, word_vec=True, prefix='*dt_', fvocab=None, binary=False)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters:
  • fname (str) – The file path used to save the vectors in.
  • doctag_vec (bool) – Indicates whether to store document vectors.
  • word_vec (bool) – Indicates whether to store word vectors.
  • prefix (str) – Uniquely identifies doctags from word vocab, and avoids collision in case of repeated string in doctag and word vocab.
  • fvocab (str) – Optional file path used to save the vocabulary.
  • binary (bool) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
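
For example, to export both document and word vectors in plain-text word2vec format (the output path is hypothetical):

>>> model.save_word2vec_format('/tmp/my_vectors.txt', doctag_vec=True, word_vec=True, binary=False)
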
similar_by_vector(**kwargs)

Deprecated. Use self.wv.similar_by_vector() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_vector

similar_by_word(**kwargs)

Deprecated. Use self.wv.similar_by_word() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similar_by_word

similarity(**kwargs)

Deprecated. Use self.wv.similarity() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity

syn0_lockf
syn1
syn1neg
train(documents, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, callbacks=())

Update the model’s neural weights from a sequence of documents (which can be a once-only generator stream). The documents iterable can simply be a list of TaggedDocument elements.

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of sentences) or total_words (count of raw words in sentences) MUST be provided (if the corpus is the same as was provided to build_vocab(), the count of examples in that corpus will be available in the model’s corpus_count property).

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.

Parameters:
  • documents (iterable of iterables) – The documents iterable can be simply a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network. See TaggedBrownCorpus or TaggedLineDocument in doc2vec module for such examples.
  • total_examples (int) – Count of documents.
  • total_words (int) – Count of raw words in documents.
  • epochs (int) – Number of iterations (epochs) over the corpus.
  • start_alpha (float) – Initial learning rate.
  • end_alpha (float) – Final learning rate. Drops linearly from start_alpha.
  • word_count (int) – Count of words already trained. Set this to 0 for the usual case of training on all words in sentences.
  • queue_factor (int) – Multiplier for size of queue (number of workers * queue_factor).
  • report_delay (float) – Seconds to wait before reporting progress.
  • callbacks – List of callbacks that need to be executed/run at specific stages during training.
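
For example, building the vocabulary first and then training explicitly, reusing the counts cached on the model (sketch; documents is assumed to be the same corpus passed to build_vocab()):

>>> model = Doc2Vec(size=100, min_count=2, iter=10)
>>> model.build_vocab(documents)
>>> model.train(documents, total_examples=model.corpus_count, epochs=model.iter)
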
wmdistance(**kwargs)

Deprecated. Use self.wv.wmdistance() instead. Refer to the documentation for gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.wmdistance

class gensim.models.doc2vec.Doc2VecTrainables(dm=1, dm_concat=0, dm_tag_count=1, vector_size=100, seed=1, hashfxn=<built-in function hash>, window=5)

Bases: gensim.models.word2vec.Word2VecTrainables

get_doctag_trainables(doc_words, vector_size)
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
prepare_weights(hs, negative, wv, docvecs, update=False)

Build tables and model weights based on final vocabulary settings.

reset_doc_weights(docvecs)
reset_weights(hs, negative, wv, docvecs, vocabulary=None)

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

seeded_vector(seed_string, vector_size)

Create one ‘random’ vector, deterministically derived from seed_string.

update_weights(hs, negative, wv)

Copy all the existing weights, and reset the weights for the newly added vocabulary.

class gensim.models.doc2vec.Doc2VecVocab(max_vocab_size=None, min_count=5, sample=0.001, sorted_vocab=True, null_word=0)

Bases: gensim.models.word2vec.Word2VecVocab

add_null_word(wv)
create_binary_tree(wv)

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().

indexed_doctags(doctag_tokens, docvecs)

Return indexes and backing-arrays used in training examples.

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
make_cum_table(wv, power=0.75, domain=2147483647)

Create a cumulative-distribution table using stored vocabulary word counts for drawing random words in the negative-sampling training routines.

To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]), then find that integer’s sorted insertion point (as if by bisect_left or ndarray.searchsorted()). That insertion point is the drawn index, coming up in proportion to the increment at that slot.

Called internally from build_vocab().
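
A minimal sketch of that drawing procedure (the cum_table values below are made up for illustration):

>>> import numpy as np
>>>
>>> cum_table = np.array([5, 15, 40, 100])          # cumulative counts; cum_table[-1] is the domain
>>> draw = np.random.randint(cum_table[-1])         # random integer up to the table maximum
>>> word_index = int(cum_table.searchsorted(draw))  # sorted insertion point = drawn word index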

note_doctag(key, document_no, document_length, docvecs)

Note a document tag during initial corpus scan, for structure sizing.

prepare_vocab(hs, negative, wv, update=False, keep_raw_vocab=False, trim_rule=None, min_count=None, sample=None, dry_run=False)

Apply vocabulary settings for min_count (discarding less-frequent words) and sample (controlling the downsampling of more-frequent words).

Calling with dry_run=True will only simulate the provided settings and report the size of the retained vocabulary, effective corpus length, and estimated memory requirements. Results are both printed via logging and returned as a dict.

Delete the raw vocabulary after the scaling is done to free up RAM, unless keep_raw_vocab is set.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

scan_vocab(documents, docvecs, progress_per=10000, trim_rule=None)

Do an initial scan of all words appearing in the documents.

sort_vocab(wv)

Sort the vocabulary so the most frequent words have the lowest indexes.

class gensim.models.doc2vec.Doctag

Bases: gensim.models.doc2vec.Doctag

A string document tag discovered during the initial vocabulary scan. (The document-vector equivalent of a Vocab object.)

Will not be used if all presented document tags are ints.

The offset is only the true index into the doctags_syn0/doctags_syn0_lockf if-and-only-if no raw-int tags were used. If any raw-int tags were used, string Doctag vectors begin at index (max_rawint + 1), so the true index is (rawint_index + 1 + offset). See also _index_to_doctag().

Create new instance of Doctag(offset, word_count, doc_count)

count(value) → integer -- return number of occurrences of value
doc_count

Alias for field number 2

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

offset

Alias for field number 0

repeat(word_count)
word_count

Alias for field number 1

gensim.models.doc2vec.LabeledSentence(*args, **kwargs)
class gensim.models.doc2vec.TaggedBrownCorpus(dirname)

Bases: object

Iterate over documents from the Brown corpus (part of NLTK data), yielding each document out as a TaggedDocument object.

class gensim.models.doc2vec.TaggedDocument

Bases: gensim.models.doc2vec.TaggedDocument

A single document, made up of words (a list of unicode string tokens) and tags (a list of tokens). Tags may be one or more unicode string tokens, but typical practice (which will also be most memory-efficient) is for the tags list to include a unique integer id as the only tag.

Replaces “sentence as a list of words” from Word2Vec.

Create new instance of TaggedDocument(words, tags)
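
For example (a toy document with a single integer tag):

>>> from gensim.models.doc2vec import TaggedDocument
>>>
>>> doc = TaggedDocument(words=['the', 'cat', 'sat'], tags=[42])
>>> doc.words, doc.tags
(['the', 'cat', 'sat'], [42])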

count(value) → integer -- return number of occurrences of value
index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

tags

Alias for field number 1

words

Alias for field number 0

class gensim.models.doc2vec.TaggedLineDocument(source)

Bases: object

Simple format: one document = one line = one TaggedDocument object.

Words are expected to be already preprocessed and separated by whitespace, tags are constructed automatically from the document line number.

source can be either a string (filename) or a file object.

Example:

documents = TaggedLineDocument('myfile.txt')

Or for compressed files:

documents = TaggedLineDocument('compressed_text.txt.bz2')
documents = TaggedLineDocument('compressed_text.txt.gz')