models.ldaseqmodel – Dynamic Topic Modeling in Python
Lda Sequence model, inspired by David M. Blei, John D. Lafferty: “Dynamic Topic Models”. The original C/C++ implementation can be found at blei-lab/dtm <https://github.com/blei-lab/dtm>.
TODO: The next steps to take this forward would be:
Include DIM mode. Most of the infrastructure for this is in place.
See if LdaPost can be replaced by LdaModel completely without breaking anything.
Heavy lifting going on in the Sslm class - efforts can be made to cythonise mathematical methods; in particular, update_obs and the optimization take a lot of time.
Try and make it distributed, especially around the E and M step.
Remove all C/C++ coding style/syntax.
Examples
Set up a model using 9 documents, with 2 in the first time-slice, 4 in the second, and 3 in the third
>>> from gensim.test.utils import common_corpus
>>> from gensim.models import LdaSeqModel
>>>
>>> ldaseq = LdaSeqModel(corpus=common_corpus, time_slice=[2, 4, 3], num_topics=2, chunksize=1)
Persist a model to disk and reload it later
>>> from gensim.test.utils import datapath
>>>
>>> temp_file = datapath("model")
>>> ldaseq.save(temp_file)
>>>
>>> # Load a potentially pre-trained model from disk.
>>> ldaseq = LdaSeqModel.load(temp_file)
Access the document embeddings generated from the DTM
>>> doc = common_corpus[1]
>>>
>>> embedding = ldaseq[doc]
gensim.models.ldaseqmodel.
LdaPost
(doc=None, lda=None, max_doc_len=None, num_topics=None, gamma=None, lhood=None)¶Bases: gensim.utils.SaveLoad
Posterior values associated with each set of documents.
TODO: use Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010. to update phi, gamma. End game would be to somehow replace LdaPost entirely with LdaModel.
Initialize the posterior value structure for the given LDA model.
doc (list of (int, int)) – A BOW representation of the document. Each element in the list is a pair of a word’s ID and its number of occurrences in the document.
lda (LdaModel, optional) – The underlying LDA model.
max_doc_len (int, optional) – The maximum number of words in a document.
num_topics (int, optional) – Number of topics discovered by the LDA model.
gamma (numpy.ndarray, optional) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhood (float, optional) – The log likelihood lower bound.
compute_lda_lhood
()¶Compute the log likelihood bound.
The optimal lower bound for the true posterior using the approximate distribution.
float
fit_lda_post
(doc_number, time, ldaseq, LDA_INFERENCE_CONVERGED=1e-08, lda_inference_max_iter=25, g=None, g3_matrix=None, g4_matrix=None, g5_matrix=None)¶Posterior inference for lda.
doc_number (int) – The document’s number.
time (int) – Time slice.
ldaseq (object) – Unused.
LDA_INFERENCE_CONVERGED (float) – Epsilon value used to check whether the inference step has sufficiently converged.
lda_inference_max_iter (int) – Maximum number of iterations in the inference step.
g (object) – Unused. Will be useful when the DIM model is implemented.
g3_matrix (object) – Unused. Will be useful when the DIM model is implemented.
g4_matrix (object) – Unused. Will be useful when the DIM model is implemented.
g5_matrix (object) – Unused. Will be useful when the DIM model is implemented.
The optimal lower bound for the true posterior using the approximate distribution.
float
init_lda_post
()¶Initialize variational posterior.
load
(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
update_gamma
()¶Update variational dirichlet parameters.
This operation is described in the original Blei LDA paper: gamma = alpha + sum(phi), over every topic for every word.
The updated gamma parameters for each word in the document.
list of float
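The update rule above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not gensim's implementation: the function name `update_gamma_sketch` and the array layout are hypothetical, and each word's phi row is weighted by the word's count in the bag-of-words document.

```python
import numpy as np


def update_gamma_sketch(alpha, phi, counts):
    """Return gamma_k = alpha_k + sum_n counts[n] * phi[n, k].

    alpha  : (num_topics,) Dirichlet prior.
    phi    : (num_unique_words, num_topics) variational multinomial parameters.
    counts : (num_unique_words,) occurrences of each unique word in the document.
    """
    alpha = np.asarray(alpha, dtype=float)
    phi = np.asarray(phi, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # counts @ phi sums the phi rows, weighting each row by its word count.
    return alpha + counts @ phi


# Toy document with three unique words and two topics (illustrative values).
alpha = np.array([0.01, 0.01])
phi = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
counts = np.array([2, 1, 1])
gamma = update_gamma_sketch(alpha, phi, counts)
```

Because every phi row sums to one, the entries of gamma sum to the prior mass plus the total number of word tokens in the document.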
update_lda_seq_ss
(time, doc, topic_suffstats)¶Update lda sequence sufficient statistics from an lda posterior.
This is very similar to the update_gamma() method and uses the same formula.
time (int) – The time slice.
doc (list of (int, float)) – Unused but kept here for backwards compatibility. The document set in the constructor (self.doc) is used instead.
topic_suffstats (list of float) – Sufficient statistics for each topic.
The updated sufficient statistics for each topic.
list of float
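A rough sketch of this accumulation is given here. It assumes, hypothetically, that the statistics are kept as one (vocab_len, num_time_slices) array per topic and that phi has one row per unique word of the document; the helper name `accumulate_suffstats` is not part of gensim.

```python
import numpy as np


def accumulate_suffstats(topic_suffstats, doc, phi, time):
    """Add count * phi[n, k] to the (word, time) cell of each topic's statistics."""
    for k, topic_ss in enumerate(topic_suffstats):
        for n, (word_id, count) in enumerate(doc):
            topic_ss[word_id, time] += count * phi[n, k]
    return topic_suffstats


# Illustrative sizes and values only.
vocab_len, num_time_slices, num_topics = 4, 2, 2
suffstats = [np.zeros((vocab_len, num_time_slices)) for _ in range(num_topics)]
doc = [(0, 2), (3, 1)]                      # BOW pairs of (word_id, count)
phi = np.array([[0.75, 0.25], [0.5, 0.5]])  # one row per word in doc
suffstats = accumulate_suffstats(suffstats, doc, phi, time=1)
```

Only the column of the given time slice changes, so statistics from other slices are left untouched.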
update_phi
(doc_number, time)¶Update variational multinomial parameters, based on a document and a time-slice.
This is done based on the original Blei-LDA paper, where: phi ∝ beta * exp(Ψ(gamma)) (equivalently, log_phi = log(beta) + Ψ(gamma) up to a normalizing constant), over every topic for every word.
TODO: incorporate the Lee-Seung trick used in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.
doc_number (int) – Document number. Unused.
time (int) – Time slice. Unused.
Multinomial parameters, and their logarithm, for each word in the document.
(list of float, list of float)
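The phi update from the Blei LDA paper can be sketched in log space like this. It is a minimal sketch, not gensim's code: the function name `update_phi_sketch`, the (vocab_len, num_topics) layout of `log_beta`, and the use of SciPy's `digamma` and `logsumexp` are assumptions made for illustration.

```python
import numpy as np
from scipy.special import digamma, logsumexp


def update_phi_sketch(log_beta, gamma, word_ids):
    """Return (phi, log_phi) for the unique words of one document.

    log_beta : (vocab_len, num_topics) log topic-word probabilities.
    gamma    : (num_topics,) variational Dirichlet parameters.
    word_ids : IDs of the unique words in the document.
    """
    log_beta = np.asarray(log_beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # Unnormalized log phi: Psi(gamma_k) + log beta_{w,k}.
    log_phi = digamma(gamma)[np.newaxis, :] + log_beta[word_ids, :]
    # Normalize each word's row so that phi sums to one over topics.
    log_phi = log_phi - logsumexp(log_phi, axis=1, keepdims=True)
    return np.exp(log_phi), log_phi


# Toy topic-word probabilities for a 3-word vocabulary and 2 topics.
log_beta = np.log(np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]))
gamma = np.array([1.0, 1.0])
phi, log_phi = update_phi_sketch(log_beta, gamma, word_ids=[0, 2])
```

Working in log space avoids underflow when topic-word probabilities are very small, which is why both phi and its logarithm are returned.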
gensim.models.ldaseqmodel.
LdaSeqModel
(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)¶Bases: gensim.utils.SaveLoad
Estimate Dynamic Topic Model parameters based on a training corpus.
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
If not given, the model is left untrained (presumably because you want to call update() manually).
time_slice (list of int, optional) – Number of documents in each time-slice. Each time slice could for example represent a year’s published papers, in case the corpus comes from a journal publishing over multiple years. It is assumed that sum(time_slice) == num_documents.
id2word (dict of (int, str), optional) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
alphas (float, optional) – The prior probability for the model.
num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.
initialize ({'gensim', 'own', 'ldamodel'}, optional) –
’gensim’: Uses gensim’s LDA initialization.
’own’: Uses your own initialization matrix of an LDA model that has been previously trained.
’ldamodel’: Uses a previously trained LDA model, passed through the lda_model argument.
sstats (numpy.ndarray , optional) – Sufficient statistics used for initializing the model if initialize == ‘own’. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).
lda_model (LdaModel) – Model whose sufficient statistics will be used to initialize the current object if initialize == ‘ldamodel’.
obs_variance (float, optional) –
Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
passes (int, optional) – Number of passes over the corpus for the initial LdaModel.
random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.
lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.
em_min_iter (int, optional) – Minimum number of iterations of the Expectation-Maximization algorithm before convergence is accepted.
em_max_iter (int, optional) – Maximum number of iterations of the Expectation-Maximization algorithm.
chunksize (int, optional) – Number of documents in the corpus to be processed in a chunk.
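Since time_slice only holds document counts, the corpus itself has to be ordered chronologically before it is passed in. One way to derive the counts from per-document period labels is sketched here; the helper `build_time_slice` and the `doc_years` list are hypothetical and not part of gensim.

```python
from collections import Counter


def build_time_slice(doc_periods):
    """Count documents per period; documents must already be in chronological order."""
    doc_periods = list(doc_periods)
    if doc_periods != sorted(doc_periods):
        raise ValueError("documents must be ordered chronologically")
    counts = Counter(doc_periods)
    return [counts[period] for period in sorted(counts)]


# One label per document, in corpus order (illustrative values).
doc_years = [2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021, 2021]
time_slice = build_time_slice(doc_years)  # [2, 4, 3]
```

The resulting list satisfies sum(time_slice) == num_documents by construction and can be passed as the time_slice argument.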
doc_topics
(doc_number)¶Get the topic mixture for a document.
Uses the priors for the dirichlet distribution that approximates the true posterior with the optimal lower bound, and therefore requires the model to be already trained.
doc_number (int) – Index of the document for which the mixture is returned.
Probability for each topic in the mixture (essentially a point in the self.num_topics - 1 simplex).
list of length self.num_topics
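As an illustration of how a topic mixture can be read off the variational Dirichlet parameters, the sketch here normalizes one document's gamma vector, which gives the mean of that Dirichlet. The name `topic_mixture` and the (num_documents, num_topics) layout of `gammas` are assumptions, not gensim's internals.

```python
import numpy as np


def topic_mixture(gammas, doc_number):
    """Normalize one document's Dirichlet parameters into topic proportions."""
    gammas = np.asarray(gammas, dtype=float)
    row = gammas[doc_number]
    return row / row.sum()


# Two documents, two topics (illustrative values).
gammas = np.array([[3.0, 1.0], [0.5, 1.5]])
mixture = topic_mixture(gammas, 0)  # [0.75, 0.25]
```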
dtm_coherence
(time)¶Get the coherence for each topic.
Can be used to measure the quality of the model, or to inspect the convergence through training via a callback.
time (int) – The time slice.
The word representation for each topic, for each time slice. This can be used to check the time coherence of topics as time evolves: if the most relevant words remain the same, the topic has converged or is relatively static; if they change rapidly, the topic is evolving.
list of list of str
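One simple way to quantify how much a topic's word list changes between two time slices is the Jaccard overlap of the lists. The helper `top_word_overlap` is a hypothetical illustration that works on plain lists of words, such as those returned for two different values of time; it is not a gensim function.

```python
def top_word_overlap(topics_a, topics_b):
    """Jaccard overlap between each topic's word lists at two time slices."""
    overlaps = []
    for words_a, words_b in zip(topics_a, topics_b):
        a, b = set(words_a), set(words_b)
        union = a | b
        # Two empty lists are treated as identical.
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return overlaps


# Illustrative word lists for two topics at two consecutive time slices.
slice0 = [["graph", "trees", "minors"], ["user", "system", "time"]]
slice1 = [["graph", "trees", "survey"], ["user", "system", "time"]]
overlap = top_word_overlap(slice0, slice1)  # [0.5, 1.0]
```

An overlap near 1 suggests a static topic, while a low overlap suggests the topic is drifting between slices.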
dtm_vis
(time, corpus)¶Get the information needed to visualize the corpus model at a given time slice, using the pyLDAvis format.
time (int) – The time slice we are interested in.
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – The corpus we want to visualize at the given time slice.
doc_topics (list of length self.num_topics) – Probability for each topic in the mixture (essentially a point in the self.num_topics - 1 simplex).
topic_term (numpy.ndarray) – The representation of each topic as a multinomial over words in the vocabulary, expected shape (num_topics, vocabulary length).
doc_lengths (list of int) – The number of words in each document. These could be fixed, or drawn from a Poisson distribution.
term_frequency (numpy.ndarray) – The term frequency matrix (denoted as beta in the original Blei paper). This could also be the TF-IDF representation of the corpus, expected shape (number of documents, length of vocabulary).
vocab (list of str) – The set of unique terms existing in the corpus’s vocabulary.
fit_lda_seq
(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)¶Fit a LDA Sequence model (DTM).
This method will iteratively set up LDA models and perform EM steps until the sufficient statistics converge, or until the maximum number of iterations is reached. Because the true posterior is intractable, an appropriately tight lower bound must be used instead. This function will optimize this bound by minimizing the Kullback-Leibler divergence between the approximate distribution and the true posterior.
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
em_min_iter (int) – Minimum number of EM iterations to run.
em_max_iter (int) – Maximum number of EM iterations to run.
chunksize (int) – Number of documents to be processed in each chunk.
The highest lower bound for the true posterior produced after all iterations.
float
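The interplay between em_min_iter, em_max_iter and convergence can be sketched as a generic loop. This is a schematic under assumptions rather than gensim's code: `run_em`, the `em_step` callback, the relative-change criterion and the `threshold` value are all hypothetical.

```python
def run_em(em_step, em_min_iter=6, em_max_iter=20, threshold=1e-4):
    """Run EM steps until the bound stabilizes, respecting min/max iteration counts.

    em_step(iteration) is assumed to perform one E+M step and return the bound.
    """
    bound = None
    converged = False
    iteration = 0
    # Always run em_min_iter steps; afterwards stop on convergence or at em_max_iter.
    while iteration < em_min_iter or (not converged and iteration < em_max_iter):
        old_bound = bound
        bound = em_step(iteration)
        if old_bound is not None and old_bound != 0:
            converged = abs((bound - old_bound) / old_bound) < threshold
        iteration += 1
    return bound, iteration


# A bound that never changes converges immediately but still runs em_min_iter steps.
_, steps_constant = run_em(lambda i: -5.0)
# A bound that keeps changing by a large factor stops only at em_max_iter.
_, steps_diverging = run_em(lambda i: -float(2 ** i))
```

The minimum guards against stopping on an accidental early plateau, and the maximum bounds the running time when the bound keeps moving.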
fit_lda_seq_topics
(topic_suffstats)¶Fit the sequential model topic-wise.
topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).
The sum of the optimized lower bounds for all topics.
float
inferDTMseq
(corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)¶Compute the likelihood of a sequential corpus under an LDA seq model, and report the likelihood bound.
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).
gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhoods (list of float of length self.num_topics) – The total log probability bound for each topic. Corresponds to phi from the linked paper.
lda (LdaModel) – The trained LDA model of the previous iteration.
ldapost (LdaPost) – Posterior probability variables for the given LDA model. This will be used as the true (but intractable) posterior.
iter_ (int) – The current iteration.
bound (float) – The LDA bound produced after all iterations.
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
chunksize (int) – Number of documents to be processed in each chunk.
The first value is the highest lower bound for the true posterior. The second value is the list of optimized dirichlet variational parameters for the approximation of the posterior.
(float, list of float)
init_ldaseq_ss
(topic_chain_variance, topic_obs_variance, alpha, init_suffstats)¶Initialize State Space Language Model, topic-wise.
topic_chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.
topic_obs_variance (float) –
Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
alpha (float) – The prior probability for the model.
init_suffstats (numpy.ndarray) – Sufficient statistics used for initializing the model, expected shape (self.vocab_len, num_topics).
lda_seq_infer
(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)¶Inference (or E-step) for the lower bound EM optimization.
This is used to set up the gensim LdaModel to be used for each time-slice.
It also allows for Document Influence Model code to be written in.
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
topic_suffstats (numpy.ndarray) – Sufficient statistics for time slice 0, used for initializing the model if initialize == ‘own’, expected shape (self.vocab_len, num_topics).
gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhoods (list of float) – The total log probability lower bound for each topic. Corresponds to the phi variational parameters in the linked paper.
iter_ (int) – Current iteration.
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
chunksize (int) – Number of documents to be processed in each chunk.
The first value is the highest lower bound for the true posterior. The second value is the list of optimized dirichlet variational parameters for the approximation of the posterior.
(float, list of float)
load
(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
make_lda_seq_slice
(lda, time)¶Update the LDA model topic-word values using time slices.
print_topic
(topic, time=0, top_terms=20)¶Get the list of words most relevant to the given topic.
topic (int) – The index of the topic to be inspected.
time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).
top_terms (int, optional) – Number of words associated with the topic to be returned.
The representation of this topic. Each element in the list includes the word itself, along with the probability assigned to it by the topic.
list of (str, float)
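Selecting the most relevant words from a topic's word distribution amounts to sorting the probabilities and mapping IDs back to words. The sketch here shows that idea on a toy vector; `top_terms_for_topic` and the toy `id2word` mapping are hypothetical and do not reproduce gensim's internals.

```python
import numpy as np


def top_terms_for_topic(topic_probs, id2word, top_terms=20):
    """Return the top_terms most probable (word, probability) pairs for one topic."""
    topic_probs = np.asarray(topic_probs, dtype=float)
    # Sort word IDs by descending probability and keep the first top_terms.
    best_ids = np.argsort(topic_probs)[::-1][:top_terms]
    return [(id2word[int(i)], float(topic_probs[i])) for i in best_ids]


# Toy vocabulary and topic-word probabilities at one time slice.
id2word = {0: "human", 1: "graph", 2: "trees", 3: "user"}
probs = [0.1, 0.4, 0.3, 0.2]
top = top_terms_for_topic(probs, id2word, top_terms=2)
# top == [("graph", 0.4), ("trees", 0.3)]
```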
print_topic_times
(topic, top_terms=20)¶Get the most relevant words for a topic, for each timeslice. This can be used to inspect the evolution of a topic through time.
topic (int) – The index of the topic.
top_terms (int, optional) – Number of most relevant words associated with the topic to be returned.
The top_terms most relevant terms for the topic, for each time slice.
list of list of str
print_topics
(time=0, top_terms=20)¶Get the most relevant words for every topic.
time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).
top_terms (int, optional) – Number of most relevant words to be returned for each topic.
Representation of all topics. Each of them is represented by a list of pairs of words and their assigned probability.
list of list of (str, float)
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
gensim.models.ldaseqmodel.
df_obs
(x, *args)¶Derivative of the objective function which optimises obs.
x (list of float) – The obs values for this word.
sslm (sslm) – The State Space Language Model for DTM.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
word (int) – The word’s ID.
deriv (list of float) – Mean derivative for each time slice.
The derivative of the objective function evaluated at point x.
list of float
gensim.models.ldaseqmodel.
f_obs
(x, *args)¶Objective function that is minimized when optimizing obs.
x (list of float) – The obs values for this word.
sslm (sslm) – The State Space Language Model for DTM.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
word (int) – The word’s ID.
deriv (list of float) – Mean derivative for each time slice.
The value of the objective function evaluated at point x.
list of float
gensim.models.ldaseqmodel.
sslm
(vocab_len=None, num_time_slices=None, num_topics=None, obs_variance=0.5, chain_variance=0.005)¶Bases: gensim.utils.SaveLoad
Encapsulate the inner State Space Language Model for DTM.
Some important attributes of this class:
obs is a matrix containing the document to topic ratios.
e_log_prob is a matrix containing the topic to word ratios.
mean contains the mean values to be used for inference for each word for a time slice.
variance contains the variance values to be used for inference of word in a time slice.
fwd_mean and fwd_variance are the forward posterior values for the mean and the variance.
zeta is an extra variational parameter with a value for each time slice.
compute_bound
(sstats, totals)¶Compute the maximized lower bound achieved for the log probability of the true posterior.
Uses the formula presented in the appendix of the DTM paper (formula no. 5).
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
The maximized lower bound.
float
compute_expected_log_prob
()¶Compute the expected log probability given values of m.
The appendix of the DTM paper describes the expectation of log-probabilities in equation 5; the implementation below is the result of solving that equation and follows the original Blei DTM code.
The expected value for the log probabilities for each word and time slice.
numpy.ndarray of float
compute_mean_deriv
(word, time, deriv)¶Helper functions for optimizing a function.
Compute the derivative of:
word (int) – The word’s ID.
time (int) – The time slice.
deriv (list of float) – Derivative for each time slice.
Mean derivative for each time slice.
list of float
compute_obs_deriv
(word, word_counts, totals, mean_deriv_mtx, deriv)¶Compute the derivative of obs, which is used by the derivative function df_obs while optimizing.
word (int) – The word’s ID.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
deriv (list of float) – Mean derivative for each time slice.
Mean derivative for each time slice.
list of float
compute_post_mean
(word, chain_variance)¶Get the mean, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).
Notes
This function essentially computes E[eta_{t,w}] for t = 1:T.
word (int) – The word’s ID.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
The first returned value is the mean of each word in each time slice, the second value is the inferred posterior mean for the same pairs.
(numpy.ndarray, numpy.ndarray)
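The forward-backward structure of this computation can be illustrated with a textbook Kalman filter and smoother for a scalar random walk. The sketch makes several assumptions: the function name, the argument layout, the convention that index 0 is the state before the first time slice, and the example `fwd_variance` values are all hypothetical, and the code is not claimed to match gensim line by line.

```python
import numpy as np


def posterior_mean_sketch(obs, fwd_variance, obs_variance, chain_variance):
    """Return (mean, fwd_mean), each of length T + 1, for one word.

    obs          : (T,) variational observations for the word.
    fwd_variance : (T + 1,) forward (filtered) variances, index 0 = initial state.
    """
    T = len(obs)
    fwd_mean = np.zeros(T + 1)
    mean = np.zeros(T + 1)

    # Forward pass: blend the previous estimate with the new observation.
    for t in range(1, T + 1):
        c = obs_variance / (fwd_variance[t - 1] + chain_variance + obs_variance)
        fwd_mean[t] = c * fwd_mean[t - 1] + (1 - c) * obs[t - 1]

    # Backward pass: smooth each estimate using the following time slice.
    mean[T] = fwd_mean[T]
    for t in range(T - 1, -1, -1):
        if chain_variance > 0:
            c = chain_variance / (fwd_variance[t] + chain_variance)
        else:
            c = 0.0
        mean[t] = c * fwd_mean[t] + (1 - c) * mean[t + 1]
    return mean, fwd_mean


# Illustrative inputs: three time slices with a constant observation.
obs = np.array([1.0, 1.0, 1.0])
fwd_variance = np.array([5.0, 0.45, 0.24, 0.17])
mean, fwd_mean = posterior_mean_sketch(obs, fwd_variance, 0.5, 0.005)
```

The forward pass only uses observations up to each time slice, while the backward pass lets later slices inform earlier ones, which is why two arrays are returned.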
compute_post_variance
(word, chain_variance)¶Get the variance, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).
This function accepts the word to compute variance for, along with the associated sslm class object, and returns the variance and the posterior approximation fwd_variance.
Notes
This function essentially computes Var[beta_{t,w}] for t = 1:T
word (int) – The word’s ID.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
The first returned value is the variance of each word in each time slice, the second value is the inferred posterior variance for the same pairs.
(numpy.ndarray, numpy.ndarray)
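A matching sketch for the variance uses the standard Kalman filter and Rauch-Tung-Striebel smoother recursions for a scalar random walk. As before, this is an illustration under assumptions: the function name, the `init_scale` factor used for a diffuse initial variance, and the index-0 convention are hypothetical rather than taken from gensim.

```python
import numpy as np


def posterior_variance_sketch(num_time_slices, obs_variance, chain_variance,
                              init_scale=1000.0):
    """Return (variance, fwd_variance), each of length T + 1, for one word."""
    T = num_time_slices
    fwd_variance = np.zeros(T + 1)
    variance = np.zeros(T + 1)

    # Forward pass: predict (add chain noise), then shrink by the observation.
    fwd_variance[0] = chain_variance * init_scale  # assumed diffuse start
    for t in range(1, T + 1):
        predicted = fwd_variance[t - 1] + chain_variance
        fwd_variance[t] = predicted * obs_variance / (predicted + obs_variance)

    # Backward pass: smooth using the variance of the following time slice.
    variance[T] = fwd_variance[T]
    for t in range(T - 1, -1, -1):
        predicted = fwd_variance[t] + chain_variance
        gain = fwd_variance[t] / predicted
        variance[t] = fwd_variance[t] + gain ** 2 * (variance[t + 1] - predicted)
    return variance, fwd_variance


variance, fwd_variance = posterior_variance_sketch(
    num_time_slices=5, obs_variance=0.5, chain_variance=0.005
)
```

In this recursion the smoothed variance never exceeds the forward variance, reflecting that later time slices can only add information.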
fit_sslm
(sstats)¶Fits variational distribution.
This is essentially the M-step.
Maximizes the approximation of the true posterior for a particular topic using the provided sufficient statistics. Updates the values using update_obs() and compute_expected_log_prob().
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the current time slice, expected shape (self.vocab_len, num_topics).
The lower bound for the true posterior achieved using the fitted approximate distribution.
float
load
(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
sslm_counts_init
(obs_variance, chain_variance, sstats)¶Initialize the State Space Language Model with LDA sufficient statistics.
Called for each topic-chain and initializes initial mean, variance and Topic-Word probabilities for the first time-slice.
obs_variance (float, optional) – Observed variance used to approximate the true and forward variance.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
sstats (numpy.ndarray) – Sufficient statistics of the LDA model. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).
update_obs
(sstats, totals)¶Optimize the bound with respect to the observed variables.
TODO: This is by far the slowest function in the whole algorithm. Replacing or improving the performance of this would greatly speed things up.
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
The updated optimized values for obs and the zeta variational parameter.
(numpy.ndarray of float, numpy.ndarray of float)
update_zeta
()¶Update the Zeta variational parameter.
Zeta is described in the appendix and is equal to sum(exp(mean[word] + variance[word] / 2)), summed over every word and computed for each time-slice. It is the value of the variational parameter zeta which maximizes the lower bound.
The updated zeta values for each time slice.
list of float
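The zeta formula can be written as a one-line NumPy reduction. The sketch assumes, hypothetically, that mean and variance are arrays of shape (vocab_len, num_time_slices) already aligned with the time slices; the function name `update_zeta_sketch` is not part of gensim.

```python
import numpy as np


def update_zeta_sketch(mean, variance):
    """Return zeta[t] = sum over words of exp(mean[w, t] + variance[w, t] / 2)."""
    mean = np.asarray(mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    # Sum over the word axis, leaving one value per time slice.
    return np.exp(mean + variance / 2).sum(axis=0)


# With zero mean and zero variance every word contributes exp(0) = 1.
mean = np.zeros((3, 2))
variance = np.zeros((3, 2))
zeta = update_zeta_sketch(mean, variance)  # [3.0, 3.0]
```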