models.ldamulticore – parallelized Latent Dirichlet Allocation
Online Latent Dirichlet Allocation (LDA) in Python, using all CPU cores to parallelize and speed up model training.
The parallelization uses multiprocessing; if this does not work for you,
try the gensim.models.ldamodel.LdaModel class, which is an equivalent but more
straightforward, single-core implementation.
The training algorithm:
Wall-clock performance on the English Wikipedia (2G corpus positions, 3.5M documents, 100K features, 0.54G non-zero entries in the final bag-of-words matrix), requesting 100 topics:
algorithm                                           training time
--------------------------------------------------  -------------
LdaMulticore(workers=1)                             2h30m
LdaMulticore(workers=2)                             1h24m
LdaMulticore(workers=3)                             1h6m
old LdaModel()                                      3h44m
simply iterating over input corpus = I/O overhead   20m
(Measured on an i7 server with 4 physical cores, so the optimal setting is workers=3, one less than the number of cores.)
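The timings above can be turned into speedup factors with a little arithmetic. The following is a minimal sketch: the helper names (`to_minutes`, `suggested_workers`) are made up for illustration, and the "one less than the number of cores" rule is taken from the note above. Note that `multiprocessing.cpu_count()` reports logical cores, which may exceed the number of physical cores on hyperthreaded machines.

```python
import multiprocessing


def to_minutes(hours, minutes):
    """Convert an 'XhYm' timing into minutes."""
    return hours * 60 + minutes


# Training times copied from the table above.
timings = {
    "LdaMulticore(workers=1)": to_minutes(2, 30),
    "LdaMulticore(workers=2)": to_minutes(1, 24),
    "LdaMulticore(workers=3)": to_minutes(1, 6),
    "old LdaModel()": to_minutes(3, 44),
}

# Speedup of each configuration relative to the single-core LdaModel.
baseline = timings["old LdaModel()"]
speedups = {name: baseline / minutes for name, minutes in timings.items()}


def suggested_workers(physical_cores):
    """Hypothetical helper: one less than the number of cores, at least 1."""
    return max(1, physical_cores - 1)


# Assumption: cpu_count() is used as a stand-in for the physical core count.
workers = suggested_workers(multiprocessing.cpu_count())

for name, factor in speedups.items():
    print("%-26s %.2fx" % (name, factor))
```

With three workers the table gives 224 / 66 minutes, a speedup of roughly 3.4x over the old LdaModel.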
This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.
The core estimation code is based on the onlineldavb.py script, by Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.
The constructor estimates Latent Dirichlet Allocation model parameters based on a training corpus:
>>> from gensim.models import LdaMulticore
>>> from gensim.test.utils import common_corpus, common_dictionary
>>>
>>> lda = LdaMulticore(common_corpus, id2word=common_dictionary, num_topics=10)
Save a model to disk, or reload a pretrained model:
>>> from gensim.test.utils import datapath
>>>
>>> # Save model to disk.
>>> temp_file = datapath("model")
>>> lda.save(temp_file)
>>>
>>> # Load a potentially pretrained model from disk.
>>> lda = LdaMulticore.load(temp_file)
Query, or update the model using new, unseen documents:
>>> other_texts = [
... ['computer', 'time', 'graph'],
... ['survey', 'response', 'eps'],
... ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = lda[unseen_doc] # get topic probability distribution for a document
>>>
>>> # Update the model by incrementally training on the new corpus.
>>> lda.update(other_corpus) # update the LDA model with additional documents
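The topic distribution returned for a document is a list of (topic ID, probability) pairs, so it can be post-processed with plain Python. A minimal sketch follows, using a hand-written `example_vector` in place of a real model output; the helper names are hypothetical and not part of gensim.

```python
def dominant_topic(topic_vector):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_vector, key=lambda pair: pair[1])


def filter_topics(topic_vector, minimum_probability):
    """Keep only topics whose probability reaches the given threshold."""
    return [(topic_id, prob) for topic_id, prob in topic_vector
            if prob >= minimum_probability]


# Hypothetical values standing in for the result of lda[unseen_doc].
example_vector = [(0, 0.05), (3, 0.80), (7, 0.15)]

print(dominant_topic(example_vector))
print(filter_topics(example_vector, 0.1))
```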
gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None, minimum_probability=0.01, minimum_phi_value=0.01, per_word_topics=False, dtype=<type 'numpy.float32'>)
Bases: gensim.models.ldamodel.LdaModel
An optimized implementation of the LDA algorithm, able to harness the power of multicore CPUs.
Follows a similar API to the parent class LdaModel.
Parameters: 


bound(corpus, gamma=None, subsample_ratio=1.0)
Estimate the variational bound of documents from the corpus as E_q[log p(corpus)] - E_q[log q(corpus)].
Parameters: 


Returns:  The variational bound score calculated for each document. 
Return type:  numpy.ndarray 
clear()
Clear the model’s state to free some memory. Used in the distributed implementation.
diff(other, distance='kullback_leibler', num_words=100, n_ann_terms=10, diagonal=False, annotation=True, normed=True)
Calculate the difference in topic distributions between two models: self and other.
Parameters: 


Returns: 

Examples
Get the differences between each pair of topics inferred by two models:
>>> from gensim.models.ldamulticore import LdaMulticore
>>> from gensim.test.utils import datapath
>>>
>>> m1, m2 = LdaMulticore.load(datapath("lda_3_0_1_model")), LdaMulticore.load(datapath("ldamodel_python_3_5"))
>>> mdiff, annotation = m1.diff(m2)
>>> topic_diff = mdiff # get matrix with difference for each topic pair from `m1` and `m2`
do_estep(chunk, state=None)
Perform inference on a chunk of documents, and accumulate the collected sufficient statistics.
Parameters: 


Returns:  Gamma parameters controlling the topic weights, shape (len(chunk), self.num_topics). 
Return type:  numpy.ndarray 
do_mstep(rho, other, extra_pass=False)
Maximization step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.
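The linear interpolation mentioned here can be illustrated on plain nested lists. This is a simplified sketch of the idea only, not gensim's implementation: the function name `blend` and the list-of-lists layout are assumptions made for the example, and any scaling the library applies to the collected statistics is left out.

```python
def blend(current, collected, rho):
    """Element-wise linear interpolation: (1 - rho) * current + rho * collected.

    `current` and `collected` are matrices (lists of rows) of equal shape.
    rho = 0 keeps the existing values, rho = 1 replaces them entirely.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(current, collected)
    ]


existing_topics = [[1.0, 3.0], [2.0, 2.0]]
new_statistics = [[3.0, 1.0], [4.0, 0.0]]
print(blend(existing_topics, new_statistics, 0.5))
```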
Parameters: 


get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Get the topic distribution for the given document.
Parameters: 


Returns: 

get_term_topics(word_id, minimum_probability=None)
Get the most relevant topics to the given word.
Parameters: 


Returns:  The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word. 
Return type:  list of (int, float) 
get_topic_terms(topicid, topn=10)
Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic(), which represents words by the actual strings.
Parameters: 


Returns:  Word ID - probability pairs for the most relevant words generated by the topic. 
Return type:  list of (int, float) 
get_topics()
Get the term-topic matrix learned during inference.
Returns:  The probability for each word in each topic, shape (num_topics, vocabulary_size). 

Return type:  numpy.ndarray 
inference(chunk, collect_sstats=False)
Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk.
This function does not modify the model. The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline. Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: “Algorithms for non-negative matrix factorization”.
Parameters: 


Returns:  The first element is always returned and corresponds to the gamma matrix. The second element is only returned if collect_sstats == True and corresponds to the sufficient statistics for the M step. 
Return type:  (numpy.ndarray, {numpy.ndarray, None}) 
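Because each chunk must fit in RAM, a large streamed corpus has to be split into chunks before it reaches this method. A minimal sketch of such chunking with the standard library is shown below; `iter_chunks` is a hypothetical helper written for illustration, not the routine gensim uses internally.

```python
from itertools import islice


def iter_chunks(corpus, chunksize):
    """Yield successive lists of at most `chunksize` documents from an iterable.

    Only one chunk is held in memory at a time, so `corpus` may be a stream.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(corpus)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Stand-in documents; a real corpus would yield bag-of-words vectors.
documents = ["doc%d" % i for i in range(5)]
for chunk in iter_chunks(documents, 2):
    print(chunk)
```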
init_dir_prior(prior, name)
Initialize priors for the Dirichlet distribution.
Parameters: 


load(fname, *args, **kwargs)
Load a previously saved gensim.models.ldamodel.LdaModel from file.
See also
save()
Parameters: 

Examples
Large arrays can be memmap'ed back as read-only (shared memory) by setting mmap='r':
>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.test.utils import datapath
>>>
>>> fname = datapath("lda_3_0_1_model")
>>> lda = LdaModel.load(fname, mmap='r')
log_perplexity(chunk, total_docs=None)
Calculate and return the per-word likelihood bound, using a chunk of documents as evaluation corpus.
Also output the calculated statistics, including the perplexity=2^(-bound), to log at INFO level.
Parameters: 


Returns:  The variational bound score calculated for each word. 
Return type:  numpy.ndarray 
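The relationship between the per-word bound and the logged perplexity is perplexity = 2^(-bound). A minimal sketch of that conversion is given below; the function name is hypothetical and the input values are made up for illustration.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into perplexity: 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)


# A more negative bound means a higher (worse) perplexity.
for bound in (-6.0, -8.0, -10.0):
    print(bound, perplexity_from_bound(bound))
```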
print_topic(topicno, topn=10)
Get a single topic as a formatted string.
Parameters: 


Returns:  String representation of topic, like ‘0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘. 
Return type:  str 
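The string layout described above can be reproduced from a list of (word, probability) pairs. The sketch below only mirrors the shape shown in the example string; `format_topic` is a hypothetical helper, and the exact spacing and precision used by the library may differ.

```python
def format_topic(word_probabilities):
    """Join (word, probability) pairs into a string like 0.340 * "category" + ..."""
    return " + ".join(
        '%.3f * "%s"' % (probability, word)
        for word, probability in word_probabilities
    )


# Hypothetical topic, reusing the words from the example string above.
topic = [("category", 0.34), ("$M$", 0.298), ("algebra", 0.183)]
print(format_topic(topic))
```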
print_topics(num_topics=20, num_words=10)
Get the most significant topics (alias for show_topics() method).
Parameters: 


Returns:  Sequence with (topic_id, [(word, value), … ]). 
Return type:  list of (int, list of (str, float)) 
save(fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs)
Save the model to a file.
Large internal arrays may be stored into separate files, with fname as prefix.
Notes
If you intend to use models across Python 2/3 versions, there are a few things to keep in mind:
- The pickled Python dictionaries will not work across Python versions.
- The save method does not automatically save all numpy arrays separately, only those that exceed sep_limit set in save(). The main concern here is the alpha array, for instance when using alpha='auto'.
Please refer to the wiki recipes section for an example of how to work around these issues.
See also
load()
Parameters: 


show_topic(topicid, topn=10)
Get the representation for a single topic. Words here are the actual strings, in contrast to get_topic_terms(), which represents words by their vocabulary ID.
Parameters: 


Returns:  Word - probability pairs for the most relevant words generated by the topic. 
Return type:  list of (str, float) 
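The difference between the ID-based and string-based representations comes down to a dictionary lookup. The following sketch converts (word ID, probability) pairs into (word, probability) pairs; the mapping and the topic values are invented for the example, and `terms_to_words` is a hypothetical helper rather than a gensim function.

```python
def terms_to_words(topic_terms, id2word):
    """Replace integer word IDs with their string form, keeping probabilities."""
    return [(id2word[word_id], probability) for word_id, probability in topic_terms]


# Hypothetical vocabulary and ID-based topic representation.
id2word = {0: "computer", 1: "graph", 2: "survey"}
topic_terms = [(1, 0.4), (0, 0.35), (2, 0.25)]

print(terms_to_words(topic_terms, id2word))
```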
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
Get a representation for selected topics.
Parameters: 


Returns:  A list of topics, each represented either as a string (when formatted == True) or as word-probability pairs. 
Return type:  list of {str, tuple of (str, float)} 
sync_state()
Propagate the state’s topic probabilities to the inner object’s attribute.
top_topics(corpus=None, texts=None, dictionary=None, window_size=None, coherence='u_mass', topn=20, processes=1)
Get the topics with the highest coherence score, along with the coherence for each topic.
Parameters: 


Returns:  Each element in the list is a pair of a topic representation and its coherence score. Topic representations are distributions of words, represented as a list of pairs of word IDs and their probabilities. 
Return type:  list of (list of (int, str), float) 
update(corpus, chunks_as_numpy=False)
Train the model with new documents, by EM-iterating over the corpus until the topics converge, or until the maximum number of allowed iterations is reached. corpus must be an iterable. The E step is distributed across several processes.
Notes
This update also supports updating an already trained model (self) with new documents from corpus; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for nonstationary input streams.
For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Hoffman et al. and is guaranteed to converge for any decay in (0.5, 1.0].
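In the online update of Hoffman et al., the weight given to each new mini-batch follows a step size of the form rho_t = (offset + t)^(-decay). The sketch below evaluates that schedule for the constructor defaults shown earlier (decay=0.5, offset=1.0). How the library counts the update index t is not covered here, so treating t as a simple update counter is an assumption.

```python
def learning_rate(t, decay=0.5, offset=1.0):
    """Step size rho_t = (offset + t) ** (-decay) for update number t >= 0."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return (offset + t) ** (-decay)


# The step size shrinks as more updates are applied, so later
# mini-batches change the topics less than early ones.
schedule = [learning_rate(t) for t in range(5)]
print(schedule)
```

A larger decay makes the step size fall off faster, so old information is forgotten more slowly; a larger offset damps the first few updates.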
Parameters: 


update_alpha(gammat, rho)
Update parameters for the Dirichlet prior on the per-document topic weights.
Parameters: 


Returns:  Sequence of alpha parameters. 
Return type:  numpy.ndarray 
update_eta(lambdat, rho)
Update parameters for the Dirichlet prior on the per-topic word weights.
Parameters: 


Returns:  The updated eta parameters. 
Return type:  numpy.ndarray 
gensim.models.ldamulticore.worker_e_step(input_queue, result_queue)
Perform E-step for each job.
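The general pattern of a worker that reads jobs from an input queue and writes results to a result queue can be sketched with the standard library. This is a simplified stand-in, not the library's code: it uses a thread instead of a separate process, a `None` sentinel to stop the loop, and `len` as a placeholder for the real per-chunk E-step, all of which are assumptions made for the example.

```python
import queue
import threading


def worker_loop(input_queue, result_queue, process_job):
    """Consume jobs until a None sentinel arrives, pushing one result per job."""
    while True:
        job = input_queue.get()
        if job is None:  # sentinel: no more work
            break
        result_queue.put(process_job(job))


jobs = queue.Queue()
results = queue.Queue()

# `len` stands in for the per-chunk computation a real worker would run.
worker = threading.Thread(target=worker_loop, args=(jobs, results, len))
worker.start()

chunks = (["a", "b"], ["c"], [])
for chunk in chunks:
    jobs.put(chunk)
jobs.put(None)  # tell the worker to stop
worker.join()

collected = [results.get() for _ in chunks]
print(collected)
```

With a single worker the results come back in submission order; with several workers the order is no longer guaranteed, so results would need to be merged in an order-independent way.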
Parameters: 

