models.ldamodel – Latent Dirichlet Allocation
Optimized Latent Dirichlet Allocation (LDA) <https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation> in Python.
For a faster implementation of LDA (parallelized for multicore machines), see also gensim.models.ldamulticore.
This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.
The core estimation code is based on the onlineldavb.py script, by Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.
Usage examples:
Train an LDA model using a Gensim corpus
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora.dictionary import Dictionary
>>> from gensim.models import LdaModel
>>>
>>> # Create a corpus from a list of texts
>>> common_dictionary = Dictionary(common_texts)
>>> common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
>>>
>>> # Train the model on the corpus.
>>> lda = LdaModel(common_corpus, num_topics=10)
Save a model to disk, or reload a pretrained model
>>> from gensim.test.utils import datapath
>>>
>>> # Save model to disk.
>>> temp_file = datapath("model")
>>> lda.save(temp_file)
>>>
>>> # Load a potentially pretrained model from disk.
>>> lda = LdaModel.load(temp_file)
Query the model using new, unseen documents
>>> # Create a new corpus, made of previously unseen documents.
>>> other_texts = [
... ['computer', 'time', 'graph'],
... ['survey', 'response', 'eps'],
... ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = lda[unseen_doc] # get topic probability distribution for a document
Update the model by incrementally training on the new corpus
>>> lda.update(other_corpus)
>>> vector = lda[unseen_doc]
A lot of parameters can be tuned to optimize training for your specific case
>>> lda = LdaModel(common_corpus, num_topics=50, alpha='auto', eval_every=5) # learn asymmetric alpha from data
class gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf=None, minimum_phi_value=0.01, per_word_topics=False, callbacks=None, dtype=<type 'numpy.float32'>)
Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel
Train and use Online Latent Dirichlet Allocation (OLDA) models as presented in Hoffman et al.: “Online Learning for Latent Dirichlet Allocation”.
Examples
Initialize a model using a Gensim corpus
>>> from gensim.test.utils import common_corpus
>>>
>>> lda = LdaModel(common_corpus, num_topics=10)
You can then infer topic distributions on new, unseen documents.
>>> doc_bow = [(1, 0.3), (2, 0.1), (0, 0.09)]
>>> doc_lda = lda[doc_bow]
The model can be updated (trained) with new documents.
>>> # In practice the update corpus would differ from the initial training corpus, but we reuse it here for simplicity.
>>> other_corpus = common_corpus
>>>
>>> lda.update(other_corpus)
Model persistence is achieved through the load() and save() methods.
Parameters: 


bound(corpus, gamma=None, subsample_ratio=1.0)
Estimate the variational bound of documents from the corpus as E_q[log p(corpus)] - E_q[log q(corpus)].
Parameters: 


Returns:  The variational bound score calculated for each document. 
Return type:  numpy.ndarray 
clear()
Clear the model’s state to free some memory. Used in the distributed implementation.
diff(other, distance='kullback_leibler', num_words=100, n_ann_terms=10, diagonal=False, annotation=True, normed=True)
Calculate the difference in topic distributions between two models: self and other.
Parameters: 


Returns: 

Examples
Get the differences between each pair of topics inferred by two models
>>> from gensim.models.ldamulticore import LdaMulticore
>>> from gensim.test.utils import datapath
>>>
>>> m1 = LdaMulticore.load(datapath("lda_3_0_1_model"))
>>> m2 = LdaMulticore.load(datapath("ldamodel_python_3_5"))
>>> mdiff, annotation = m1.diff(m2)
>>> topic_diff = mdiff # get matrix with difference for each topic pair from `m1` and `m2`
do_estep(chunk, state=None)
Perform inference on a chunk of documents, and accumulate the collected sufficient statistics.
Parameters: 


Returns:  Gamma parameters controlling the topic weights, shape (len(chunk), self.num_topics). 
Return type:  numpy.ndarray 
do_mstep(rho, other, extra_pass=False)
Maximization step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.
Parameters: 


get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Get the topic distribution for the given document.
Parameters: 


Returns: 

get_term_topics(word_id, minimum_probability=None)
Get the most relevant topics to the given word.
Parameters: 


Returns:  The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word. 
Return type:  list of (int, float) 
get_topic_terms(topicid, topn=10)
Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic() that represents words by the actual strings.
Parameters: 


Returns:  Word ID - probability pairs for the most relevant words generated by the topic. 
Return type:  list of (int, float) 
get_topics()
Get the term-topic matrix learned during inference.
Returns:  The probability for each word in each topic, shape (num_topics, vocabulary_size). 

Return type:  numpy.ndarray 
inference(chunk, collect_sstats=False)
Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk.
This function does not modify the model. The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline. Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: “Algorithms for non-negative matrix factorization”.
Parameters: 


Returns:  The first element is always returned and it corresponds to the state’s gamma matrix. The second element is only returned if collect_sstats == True and corresponds to the sufficient statistics for the M step. 
Return type:  (numpy.ndarray, {numpy.ndarray, None}) 
init_dir_prior(prior, name)
Initialize priors for the Dirichlet distribution.
Parameters: 


load(fname, *args, **kwargs)
Load a previously saved gensim.models.ldamodel.LdaModel from file.
See also
save()
Parameters: 

Examples
Large arrays can be memmap’ed back as read-only (shared memory) by setting mmap=’r’:
>>> from gensim.test.utils import datapath
>>>
>>> fname = datapath("lda_3_0_1_model")
>>> lda = LdaModel.load(fname, mmap='r')
log_perplexity(chunk, total_docs=None)
Calculate and return the per-word likelihood bound, using a chunk of documents as the evaluation corpus.
Also output the calculated statistics, including the perplexity=2^(-bound), to the log at INFO level.
Parameters: 


Returns:  The variational bound score calculated for each word. 
Return type:  numpy.ndarray 
print_topic(topicno, topn=10)
Get a single topic as a formatted string.
Parameters: 


Returns:  String representation of topic, like ‘0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘. 
Return type:  str 
print_topics(num_topics=20, num_words=10)
Get the most significant topics (alias for the show_topics() method).
Parameters: 


Returns:  Sequence with (topic_id, [(word, value), … ]). 
Return type:  list of (int, list of (str, float)) 
save(fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs)
Save the model to a file.
Large internal arrays may be stored into separate files, with fname as prefix.
Notes
If you intend to use models across Python 2/3 versions there are a few things to keep in mind:
* The pickled Python dictionaries will not work across Python versions.
* The save method does not automatically save all numpy arrays separately, only those that exceed sep_limit set in save(). The main concern here is the alpha array if, for instance, alpha=’auto’ is used.
Please refer to the wiki recipes section for an example on how to work around these issues.
See also
load()
Parameters: 


show_topic(topicid, topn=10)
Get the representation for a single topic. Words here are the actual strings, in contrast to get_topic_terms() that represents words by their vocabulary ID.
Parameters: 


Returns:  Word - probability pairs for the most relevant words generated by the topic. 
Return type:  list of (str, float) 
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
Get a representation for selected topics.
Parameters: 


Returns:  A list of topics, each represented either as a string (when formatted == True) or as word-probability pairs. 
Return type:  list of {str, tuple of (str, float)} 
sync_state(current_Elogbeta=None)
Propagate the state’s topic probabilities to the inner object’s attribute.
Parameters:  current_Elogbeta (numpy.ndarray) – Posterior probabilities for each topic, optional. If omitted, it will get Elogbeta from state. 

top_topics(corpus=None, texts=None, dictionary=None, window_size=None, coherence='u_mass', topn=20, processes=1)
Get the topics with the highest coherence score, along with the coherence for each topic.
Parameters: 


Returns:  Each element in the list is a pair of a topic representation and its coherence score. Topic representations are distributions of words, represented as a list of pairs of word IDs and their probabilities. 
Return type:  list of (list of (int, str), float) 
update(corpus, chunksize=None, decay=None, offset=None, passes=None, update_every=None, eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False)
Train the model with new documents, by EM-iterating over the corpus until the topics converge, or until the maximum number of allowed iterations is reached. corpus must be an iterable.
In distributed mode, the E step is distributed over a cluster of machines.
Notes
This update also supports updating an already trained model with new documents; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams. For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation”, NIPS 2010, and is guaranteed to converge for any decay in (0.5, 1.0]. Additionally, for smaller corpus sizes, an increasing offset may be beneficial (see Table 1 in the same paper).
Parameters: 


update_alpha(gammat, rho)
Update parameters for the Dirichlet prior on the per-document topic weights.
Parameters: 


Returns:  Sequence of alpha parameters. 
Return type:  numpy.ndarray 
update_eta(lambdat, rho)
Update parameters for the Dirichlet prior on the per-topic word weights.
Parameters: 


Returns:  The updated eta parameters. 
Return type:  numpy.ndarray 
class gensim.models.ldamodel.LdaState(eta, shape, dtype=<type 'numpy.float32'>)
Bases: gensim.utils.SaveLoad
Encapsulate information for distributed computation of LdaModel objects.
Objects of this class are sent over the network, so try to keep them lean to reduce traffic.
Parameters: 


blend(rhot, other, targetsize=None)
Merge the current state with another one using a weighted average for the sufficient statistics.
The number of documents is stretched in both state objects, so that they are of comparable magnitude. This procedure corresponds to the stochastic gradient update from Hoffman et al.: “Online Learning for Latent Dirichlet Allocation”, see equations (5) and (9).
Parameters: 


blend2(rhot, other, targetsize=None)
Merge the current state with another one using a weighted sum for the sufficient statistics.
In contrast to blend(), the sufficient statistics are not scaled prior to aggregation.
Parameters: 


get_Elogbeta()
Get the log (posterior) probabilities for each topic.
Returns:  Posterior probabilities for each topic. 

Return type:  numpy.ndarray 
get_lambda()
Get the parameters of the posterior over the topics, also referred to as “the topics”.
Returns:  Parameters of the posterior probability over topics. 

Return type:  numpy.ndarray 
load(fname, *args, **kwargs)
Load a previously stored state from disk.
Overrides load by enforcing the dtype parameter to ensure backwards compatibility.
Parameters: 


Returns:  The state loaded from the given file. 
Return type:  LdaState 
merge(other)
Merge the result of an E step from one node with that of another node (summing up sufficient statistics).
The merging is trivial and after merging all cluster nodes, we have the exact same result as if the computation was run on a single node (no approximation).
Parameters:  other (LdaState ) – The state object with which the current one will be merged. 

reset()
Prepare the state for a new EM iteration (reset sufficient stats).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)
Save the object to a file.
Parameters: 


See also
load()
gensim.models.ldamodel.update_dir_prior(prior, N, logphat, rho)
Update a given prior using Newton’s method, described in J. Huang: “Maximum Likelihood Estimation of Dirichlet Distribution Parameters”.
Parameters: 


Returns:  The updated prior. 
Return type:  list of float 