
models.ldaseqmodel – Dynamic Topic Modeling in Python

Inspired by Blei's original DTM code and paper.

Original DTM C/C++ code: https://github.com/blei-lab/dtm
DTM paper: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf

TODO: The next steps to take this forward would be:

  1. Include DIM mode. Most of the infrastructure for this is in place.
  2. See if LdaPost can be replaced by LdaModel completely without breaking anything.
  3. Heavy lifting goes on in the sslm class; efforts can be made to Cythonise its mathematical methods.
    • In particular, update_obs and the optimization take a lot of time.
  4. Try to make it distributed, especially around the E and M steps.
  5. Remove all C/C++ coding style/syntax.
class gensim.models.ldaseqmodel.LdaPost(doc=None, lda=None, max_doc_len=None, num_topics=None, gamma=None, lhood=None)

Bases: gensim.utils.SaveLoad

Posterior values associated with each set of documents. TODO: use Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010 to update phi and gamma. The end goal would be to replace LdaPost entirely with LdaModel.

compute_lda_lhood()

Compute the likelihood bound.

fit_lda_post(doc_number, time, ldaseq, LDA_INFERENCE_CONVERGED=1e-08, lda_inference_max_iter=25, g=None, g3_matrix=None, g4_matrix=None, g5_matrix=None)

Posterior inference for LDA. g, g3_matrix, g4_matrix and g5_matrix are matrices used in the Document Influence Model and are not used currently.

init_lda_post()

Initialize the variational posterior. Does not return anything.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

update_gamma()

Update the variational Dirichlet parameters, as described in the original Blei LDA paper: gamma = alpha + sum(phi), where the sum runs over every word for each topic.
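As a rough numpy illustration of this update (shapes and variable names are made up for the example, not the class's actual attributes):

>>> import numpy as np
>>> phi = np.full((4, 3), 1.0 / 3)   # (doc_len, num_topics) variational multinomials
>>> counts = np.array([2, 1, 3, 1])  # counts of the 4 unique words in the document
>>> alpha = 0.01
>>> gamma = alpha + (counts[:, None] * phi).sum(axis=0)  # gamma_k = alpha + sum_w count_w * phi_wk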

update_lda_seq_ss(time, doc, topic_suffstats)

Update the LDA sequence sufficient statistics from an LDA posterior. This is very similar to the update_gamma method and uses the same formula.

update_phi(doc_number, time)

Update the variational multinomial parameters, based on a document and a time-slice. This follows the original Blei LDA paper, where phi ∝ beta * exp(Ψ(gamma)), i.e. log_phi = log(beta) + Ψ(gamma) up to normalization, over every topic for every word.

TODO: incorporate the Lee-Seung trick used in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.
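A hedged numpy sketch of the phi update above (names and shapes are illustrative; beta_doc stands in for the topic-word probabilities of the document's words):

>>> import numpy as np
>>> from scipy.special import digamma
>>> beta_doc = np.random.rand(4, 3)                # (doc_len, num_topics) topic-word probabilities
>>> gamma = np.array([1.2, 0.7, 2.1])              # current variational Dirichlet parameters
>>> log_phi = np.log(beta_doc) + digamma(gamma)    # log phi ∝ log(beta) + Ψ(gamma)
>>> log_phi -= log_phi.max(axis=1, keepdims=True)  # stabilize before exponentiating
>>> phi = np.exp(log_phi)
>>> phi /= phi.sum(axis=1, keepdims=True)          # normalize over topics for every word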

class gensim.models.ldaseqmodel.LdaSeqModel(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)

Bases: gensim.utils.SaveLoad

The constructor estimates Dynamic Topic Model parameters based on a training corpus. If we have 30 documents, with 5 in the first time-slice, 10 in the second, and 15 in the third, we would set up our model like this:

>>> ldaseq = LdaSeqModel(corpus=corpus, time_slice=[5, 10, 15], num_topics=5)

Model persistence is achieved through inheriting utils.SaveLoad.

>>> ldaseq.save("ldaseq")

saves the model to disk.
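The saved model can be loaded back later via the inherited load (a minimal sketch, reusing the "ldaseq" file from the save call above):

>>> loaded_ldaseq = LdaSeqModel.load("ldaseq")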

corpus is any iterable gensim corpus.

time_slice, as described above, is a list containing the number of documents in each time-slice.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size and for printing topics.

alphas is a prior of your choice and should be a double or float value; the default is 0.01.

num_topics is the number of requested latent topics to be extracted from the training corpus.

initialize allows the user to decide how to initialise the DTM model. The default is through gensim LDA. You can also use the sstats of a previously trained LDA model by specifying 'own' and passing a numpy matrix through sstats; the shape of sstats is (vocab_len, num_topics). If you wish to just pass a previously trained LDA model, pass it through lda_model.
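For instance, a minimal sketch of the 'own' initialization path (my_sstats is a hypothetical numpy array of shape (vocab_len, num_topics), e.g. taken from a previously trained LDA model):

>>> ldaseq = LdaSeqModel(corpus=corpus, time_slice=[5, 10, 15], num_topics=5,
...                      initialize='own', sstats=my_sstats)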

chain_variance is a constant which dictates how fast the beta (topic-word) values evolve over time; it is the variance of the Gaussian prior that chains the beta values across adjacent time-slices.

passes is the number of passes of the initial LdaModel.

random_state can be a np.random.RandomState object or the seed for one, for the LdaModel.

doc_topics(doc_number)

Given a trained LdaSeqModel and the doc_number of a document in the corpus, returns the doc-topic probabilities of that document.
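For example, assuming the ldaseq model trained above:

>>> doc_topic_dist = ldaseq.doc_topics(0)  # topic proportions of the first document in the corpus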

dtm_coherence(time)

Returns all topics of a particular time-slice, without probability values, so that they can be used with either “u_mass” or “c_v” coherence.
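A usage sketch with gensim's CoherenceModel, assuming the training corpus and a gensim dictionary are at hand:

>>> from gensim.models.coherencemodel import CoherenceModel
>>> topics_dtm = ldaseq.dtm_coherence(time=2)  # topics of the third time-slice, terms only
>>> cm = CoherenceModel(topics=topics_dtm, corpus=corpus, dictionary=dictionary, coherence='u_mass')
>>> print(cm.get_coherence())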

dtm_vis(time, corpus)

Returns term_frequency, vocab, doc_lengths, topic-term distributions and doc-topic distributions, in the format specified by pyLDAvis. All of these are needed to visualise the topics of DTM for a particular time-slice via pyLDAvis. The input parameter time is the time-slice to visualise.
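A sketch of feeding this into pyLDAvis (the unpacking order below is an assumption; check it against your installed version):

>>> import pyLDAvis
>>> doc_topic, topic_term, doc_lengths, term_frequency, vocab = ldaseq.dtm_vis(time=0, corpus=corpus)
>>> vis_data = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic,
...                             doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
>>> pyLDAvis.display(vis_data)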

fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)

Fit an LDA sequence model:

for each time period:
    set up an LDA model with E[log p(w|z)] and alpha
    for each document:
        perform posterior inference
        update sufficient statistics/likelihood
maximize topics

fit_lda_seq_topics(topic_suffstats)

Fit the LDA sequence model, topic-wise.

inferDTMseq(corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)

Computes the likelihood of a sequential corpus under an LDA seq model and returns the likelihood bound. You need to pass the corpus, the sufficient statistics, the gammas and lhoods matrices created previously, and LdaModel and LdaPost objects.

init_ldaseq_ss(topic_chain_variance, topic_obs_variance, alpha, init_suffstats)

Method to initialize the State Space Language Model, topic-wise.

lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)

Inference, or the E-step. This is used to set up the gensim LdaModel to be used for each time-slice. It also allows for Document Influence Model code to be written in.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

make_lda_seq_slice(lda, time)

Set up the LDA model's topic-word values to be those of ldaseq at a particular time-slice.

print_topic(topic, time=0, top_terms=20)

topic is the topic number; time selects a particular time-slice; top_terms is the number of terms to display.
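For example:

>>> ldaseq.print_topic(topic=0, time=1, top_terms=10)  # top 10 terms of topic 0 in the second time-slice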

print_topic_times(topic, top_terms=20)

Prints one topic showing each time-slice.

print_topics(time=0, top_terms=20)

Prints all topics in a particular time-slice.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

gensim.models.ldaseqmodel.df_obs(x, *args)

Derivative of the function which optimises obs.

gensim.models.ldaseqmodel.f_obs(x, *args)

The objective function which is minimised when optimising obs.

class gensim.models.ldaseqmodel.sslm(vocab_len=None, num_time_slices=None, num_topics=None, obs_variance=0.5, chain_variance=0.005)

Bases: gensim.utils.SaveLoad

The sslm class is the State Space Language Model for DTM and contains the following information:

• obs contains the doc-topic ratios
• e_log_prob contains the topic-word ratios
• mean, fwd_mean contain the mean values to be used for inference, for each word in each time-slice
• variance, fwd_variance contain the variance values to be used for inference, for each word in each time-slice
• fwd_mean, fwd_variance are the forward posterior values
• zeta is an extra variational parameter, with one value for each time-slice

compute_bound(sstats, totals)

Compute the log probability bound. The formula is as described in the appendix of the DTM paper by Blei (formula no. 5).

compute_expected_log_prob()

Compute the expected log probability given the values of m. The appendix of the DTM paper describes the expectation of log-probabilities in equation 5; this implementation is the result of solving that equation and matches the original Blei DTM code.

compute_mean_deriv(word, time, deriv)

Helper used in finding the optimum: computes the derivative of E[eta_{t,w}] with respect to obs_{s,w} for t = 1:T, and puts the result in deriv, a pre-allocated vector of length T+1.

compute_obs_deriv(word, word_counts, totals, mean_deriv_mtx, deriv)

Derivative of obs, used in the derivative function df_obs while optimizing.

compute_post_mean(word, chain_variance)

Based on the Variational Kalman Filtering approach for Approximate Inference [https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf]. This function accepts the word to compute the mean for, along with the associated sslm object, and returns mean and fwd_mean. It is essentially a forward-backward pass to compute E[eta_{t,w}] for t = 1:T:

fwd_mean[t] ≡ E(beta_{t,w} | beta^_{1:t}) = c * fwd_mean[t-1] + (1 - c) * beta^_{t,w},  where c = obs_variance / (fwd_variance[t-1] + chain_variance + obs_variance)

mean[t] ≡ E(beta_{t,w} | beta^_{1:T}) = c * fwd_mean[t] + (1 - c) * mean[t+1],  where c = chain_variance / (fwd_variance[t] + chain_variance)
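A minimal numpy sketch of this forward-backward recursion, assuming the variational observations obs (length T) and the forward variances (length T+1) are already available; it mirrors the equations above, not the class internals:

>>> import numpy as np
>>> obs = np.array([0.1, -0.3, 0.2]); T = len(obs)
>>> fwd_variance = np.full(T + 1, 5.0)       # placeholder variances for the sketch
>>> obs_variance, chain_variance = 0.5, 0.005
>>> fwd_mean = np.zeros(T + 1)               # forward (filtered) means; fwd_mean[0] = 0
>>> for t in range(1, T + 1):
...     c = obs_variance / (fwd_variance[t - 1] + chain_variance + obs_variance)
...     fwd_mean[t] = c * fwd_mean[t - 1] + (1 - c) * obs[t - 1]
>>> mean = np.zeros(T + 1)                   # smoothed means from the backward pass
>>> mean[T] = fwd_mean[T]
>>> for t in range(T - 1, -1, -1):
...     c = chain_variance / (fwd_variance[t] + chain_variance)
...     mean[t] = c * fwd_mean[t] + (1 - c) * mean[t + 1]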

compute_post_variance(word, chain_variance)

Based on the Variational Kalman Filtering approach for Approximate Inference [https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf]. This function accepts the word to compute the variance for, along with the associated sslm object, and returns variance and fwd_variance. It computes Var[eta_{t,w}] for t = 1:T:

fwd_variance[t] ≡ E((beta_{t,w} - fwd_mean_{t,w})^2 | beta^_{1:t}) = c * (fwd_variance[t-1] + chain_variance),  where c = obs_variance / (fwd_variance[t-1] + chain_variance + obs_variance)

variance[t] ≡ E((beta_{t,w} - mean_{t,w})^2 | beta^_{1:T}) = c^2 * (variance[t+1] - chain_variance) + (1 - c^2) * fwd_variance[t],  where c = fwd_variance[t] / (fwd_variance[t] + chain_variance)

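And a matching sketch of the two variance recursions (the large initial variance constant is an assumption made for illustration):

>>> import numpy as np
>>> T = 3
>>> obs_variance, chain_variance = 0.5, 0.005
>>> fwd_variance = np.zeros(T + 1)
>>> fwd_variance[0] = 1000 * chain_variance  # very high initial uncertainty
>>> for t in range(1, T + 1):
...     c = obs_variance / (fwd_variance[t - 1] + chain_variance + obs_variance)
...     fwd_variance[t] = c * (fwd_variance[t - 1] + chain_variance)
>>> variance = np.zeros(T + 1)
>>> variance[T] = fwd_variance[T]
>>> for t in range(T - 1, -1, -1):
...     c = (fwd_variance[t] / (fwd_variance[t] + chain_variance)) ** 2
...     variance[t] = c * (variance[t + 1] - chain_variance) + (1 - c) * fwd_variance[t]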
fit_sslm(sstats)

Fits the variational distribution. This is essentially the M-step. Accepts the sstats for a particular topic as input and maximizes the values for that topic. The values are updated via the update_obs and compute_expected_log_prob methods.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

sslm_counts_init(obs_variance, chain_variance, sstats)

Initialize the State Space Language Model with LDA sufficient statistics. Called for each topic-chain; initializes the initial mean, variance and topic-word probabilities for the first time-slice.

update_obs(sstats, totals)

Function to perform the optimization of obs. The parameters are the sufficient statistics set up in the fit_sslm method.

TODO: This is by far the slowest function in the whole algorithm. Replacing or improving the performance of this would greatly speed things up.

update_zeta()

Updates the zeta variational parameter. Zeta is described in the appendix of the DTM paper: for each time-slice, zeta = sum(exp(mean[word] + variance[word] / 2)) over every word. This is the value of the variational parameter zeta which maximizes the lower bound.
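As a one-line numpy sketch, assuming mean and variance are arrays of shape (vocab_len, num_time_slices + 1) whose column 0 holds the initial state:

>>> import numpy as np
>>> vocab_len, T = 4, 3
>>> mean = np.zeros((vocab_len, T + 1)); variance = np.ones((vocab_len, T + 1))
>>> zeta = np.exp(mean[:, 1:] + variance[:, 1:] / 2).sum(axis=0)  # one zeta value per time-slice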