
models.doc2vec_inner – Cython routines for training Doc2Vec models


Optimized Cython functions for training Doc2Vec models.

gensim.models.doc2vec_inner.train_document_dbow(model, doc_words, doctag_indexes, alpha, work=None, train_words=False, learn_doctags=True, learn_words=True, learn_hidden=True, word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None)

Update distributed bag of words model (“PV-DBOW”) by training on a single document.

Called internally from train() and infer_vector().

Parameters:
  • model (Doc2Vec) – The model to train.
  • doc_words (list of str) – The input document as a list of words to be used for training. Each word will be looked up in the model’s vocabulary.
  • doctag_indexes (list of int) – Indices into doctag_vectors used to obtain the tags of the document.
  • alpha (float) – Learning rate.
  • work (list of float, optional) – Updates to be performed on each neuron in the hidden layer of the underlying network.
  • train_words (bool, optional) – Whether to also train word vectors, exactly as per Word2Vec skip-gram training. Word vectors are updated only if both train_words and learn_words are True.
  • learn_doctags (bool, optional) – Whether the tag vectors should be updated.
  • learn_words (bool, optional) – Whether the word vectors should be updated. Word vectors are trained only if both learn_words and train_words are True.
  • learn_hidden (bool, optional) – Whether or not the weights of the hidden layer will be updated.
  • word_vectors (numpy.ndarray, optional) – The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
  • word_locks (numpy.ndarray, optional) – A learning-lock factor for each word vector: a value of 0.0 completely blocks updates, while 1.0 allows full updates.
  • doctag_vectors (numpy.ndarray, optional) – Vector representations of the tags. If None, these will be retrieved from the model.
  • doctag_locks (numpy.ndarray, optional) – The lock factors for each tag, same as word_locks, but for document-vectors.
Returns:

Number of words in the input document that were actually used for training.

Return type:

int
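The PV-DBOW update can be illustrated with a simplified pure-NumPy sketch (not gensim's optimized Cython path): each document-tag vector is trained, skip-gram style, to predict each word in the document. This sketch assumes negative sampling with hypothetical array names (`syn1neg` for the output weights, `neg_table` for the sampling table); the real routine also honours the various `learn_*` flags and lock factors.

```python
import numpy as np

def train_document_dbow_sketch(doc_word_indexes, doctag_indexes,
                               doctag_vectors, syn1neg, alpha,
                               neg_table, k=5, rng=None):
    """Simplified PV-DBOW update with negative sampling.

    For each word in the document, every document-tag vector is trained
    to predict that word, with the tag vector playing the role of the
    skip-gram "input" word. Returns the number of (tag, word) pairs
    processed.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    pairs = 0
    for w in doc_word_indexes:
        for d in doctag_indexes:
            # 1 positive target (the actual word) + k negative samples
            targets = np.concatenate(([w], rng.choice(neg_table, size=k)))
            labels = np.zeros(len(targets))
            labels[0] = 1.0
            l1 = doctag_vectors[d]               # input: the tag vector
            l2 = syn1neg[targets]                # output-layer rows
            f = 1.0 / (1.0 + np.exp(-(l2 @ l1))) # sigmoid scores
            g = (labels - f) * alpha             # gradient * learning rate
            neu1e = g @ l2                       # error propagated to tag
            syn1neg[targets] += np.outer(g, l1)  # update output layer
            doctag_vectors[d] += neu1e           # update tag vector
            pairs += 1
    return pairs
```

Note the return value mirrors the documented contract: the count of training pairs actually used, which callers can use to track effective training progress.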

gensim.models.doc2vec_inner.train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None, learn_doctags=True, learn_words=True, learn_hidden=True, word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None)

Update distributed memory model (“PV-DM”) by training on a single document. This method implements the DM model with a projection (input) layer that is either the sum or mean of the context vectors, depending on the model’s dm_mean configuration field.

Called internally from train() and infer_vector().

Parameters:
  • model (Doc2Vec) – The model to train.
  • doc_words (list of str) – The input document as a list of words to be used for training. Each word will be looked up in the model’s vocabulary.
  • doctag_indexes (list of int) – Indices into doctag_vectors used to obtain the tags of the document.
  • alpha (float) – Learning rate.
  • work (np.ndarray, optional) – Private working memory for each worker.
  • neu1 (np.ndarray, optional) – Private working memory for each worker.
  • learn_doctags (bool, optional) – Whether the tag vectors should be updated.
  • learn_words (bool, optional) – Whether the word vectors should be updated.
  • learn_hidden (bool, optional) – Whether or not the weights of the hidden layer will be updated.
  • word_vectors (numpy.ndarray, optional) – The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
  • word_locks (numpy.ndarray, optional) – A learning-lock factor for each word vector: a value of 0.0 completely blocks updates, while 1.0 allows full updates.
  • doctag_vectors (numpy.ndarray, optional) – Vector representations of the tags. If None, these will be retrieved from the model.
  • doctag_locks (numpy.ndarray, optional) – The lock factors for each tag, same as word_locks, but for document-vectors.
Returns:

Number of words in the input document that were actually used for training.

Return type:

int
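The dm_mean=1 case of the PV-DM update can be sketched in pure NumPy (again, not the optimized Cython implementation): the projection layer is the mean of the document-tag vectors and the context word vectors, and that combined vector is trained to predict the centre word. Negative sampling and the array names (`syn1neg`, `neg_table`) are assumptions for illustration.

```python
import numpy as np

def train_document_dm_sketch(doc_word_indexes, doctag_indexes, word_vectors,
                             doctag_vectors, syn1neg, alpha, neg_table,
                             window=2, k=5, rng=None):
    """Simplified PV-DM update (mean projection) with negative sampling."""
    if rng is None:
        rng = np.random.default_rng(0)
    count = 0
    for pos, w in enumerate(doc_word_indexes):
        start = max(0, pos - window)
        context = (doc_word_indexes[start:pos]
                   + doc_word_indexes[pos + 1:pos + 1 + window])
        n_inputs = len(doctag_indexes) + len(context)
        # Projection layer: mean of tag vectors and context word vectors
        l1 = (doctag_vectors[doctag_indexes].sum(axis=0)
              + word_vectors[context].sum(axis=0)) / n_inputs
        targets = np.concatenate(([w], rng.choice(neg_table, size=k)))
        labels = np.zeros(len(targets))
        labels[0] = 1.0
        l2 = syn1neg[targets]
        f = 1.0 / (1.0 + np.exp(-(l2 @ l1)))  # sigmoid scores
        g = (labels - f) * alpha              # gradient * learning rate
        syn1neg[targets] += np.outer(g, l1)   # update output layer
        err = (g @ l2) / n_inputs             # spread error over all inputs
        doctag_vectors[doctag_indexes] += err
        word_vectors[context] += err
        count += 1
    return count
```

With dm_mean=0, gensim uses the sum of the context vectors instead of the mean; the sketch above divides by the number of inputs to show the mean variant.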

gensim.models.doc2vec_inner.train_document_dm_concat(model, doc_words, doctag_indexes, alpha, work=None, neu1=None, learn_doctags=True, learn_words=True, learn_hidden=True, word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None)
Update distributed memory model (“PV-DM”) by training on a single document, using a concatenation of the context window word vectors (rather than a sum or average). This can be slower, since the input to each training step is significantly larger.

Called internally from train() and infer_vector().

Parameters:
  • model (Doc2Vec) – The model to train.
  • doc_words (list of str) – The input document as a list of words to be used for training. Each word will be looked up in the model’s vocabulary.
  • doctag_indexes (list of int) – Indices into doctag_vectors used to obtain the tags of the document.
  • alpha (float) – Learning rate.
  • work (np.ndarray, optional) – Private working memory for each worker.
  • neu1 (np.ndarray, optional) – Private working memory for each worker.
  • learn_doctags (bool, optional) – Whether the tag vectors should be updated.
  • learn_words (bool, optional) – Whether the word vectors should be updated.
  • learn_hidden (bool, optional) – Whether or not the weights of the hidden layer will be updated.
  • word_vectors (numpy.ndarray, optional) – The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
  • word_locks (numpy.ndarray, optional) – A learning-lock factor for each word vector: a value of 0.0 completely blocks updates, while 1.0 allows full updates.
  • doctag_vectors (numpy.ndarray, optional) – Vector representations of the tags. If None, these will be retrieved from the model.
  • doctag_locks (numpy.ndarray, optional) – The lock factors for each tag, same as word_locks, but for document-vectors.
Returns:

Number of words in the input document that were actually used for training.

Return type:

int
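To see why the concatenated variant produces a larger input at each step, consider how its projection layer is built (a hypothetical helper, for illustration only): tag vectors and context word vectors are concatenated rather than summed or averaged, so the hidden-layer input grows linearly with the number of tags and the window size.

```python
import numpy as np

def build_dm_concat_input(doctag_vectors, word_vectors,
                          doctag_indexes, context_indexes):
    """Build a PV-DM-concat projection layer: tag vectors and context
    word vectors are concatenated (not summed/averaged), so the input
    has length (n_tags + n_context) * vector_size."""
    parts = [doctag_vectors[d] for d in doctag_indexes]
    parts += [word_vectors[w] for w in context_indexes]
    return np.concatenate(parts)
```

For example, with vector_size=8, one tag, and a symmetric window of 2 (four context words), the concatenated input has length (1 + 4) * 8 = 40, versus just 8 for the sum or mean projection used by train_document_dm().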