models.ldaseqmodel – Dynamic Topic Modeling in Python

Lda Sequence model, inspired by David M. Blei, John D. Lafferty: “Dynamic Topic Models”. The original C/C++ implementation can be found on blei-lab/dtm.

TODO: The next steps to take this forward would be:

  1. Include DIM mode. Most of the infrastructure for this is in place.

  2. See if LdaPost can be replaced by LdaModel completely without breaking anything.

  3. Heavy lifting happens in the sslm class: efforts can be made to cythonise its mathematical methods; in particular, update_obs and the optimization it drives take a lot of time.

  4. Try and make it distributed, especially around the E and M step.

  5. Remove all C/C++ coding style/syntax.

Examples

Set up a model using 9 documents, with 2 in the first time-slice, 4 in the second, and 3 in the third

>>> from gensim.test.utils import common_corpus
>>> from gensim.models import LdaSeqModel
>>>
>>> ldaseq = LdaSeqModel(corpus=common_corpus, time_slice=[2, 4, 3], num_topics=2, chunksize=1)

Persist a model to disk and reload it later

>>> from gensim.test.utils import datapath
>>>
>>> temp_file = datapath("model")
>>> ldaseq.save(temp_file)
>>>
>>> # Load a potentially pre-trained model from disk.
>>> ldaseq = LdaSeqModel.load(temp_file)

Access the document embeddings generated from the DTM

>>> doc = common_corpus[1]
>>>
>>> embedding = ldaseq[doc]
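
The embedding is the document’s inferred topic mixture, a vector of length num_topics. You can also inspect the topics themselves, for instance (a minimal sketch; output omitted since the words depend on the training run):

>>> topics = ldaseq.print_topics(time=0, top_terms=2)  # each topic as a list of (word, probability) pairs
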
class gensim.models.ldaseqmodel.LdaPost(doc=None, lda=None, max_doc_len=None, num_topics=None, gamma=None, lhood=None)

Bases: SaveLoad

Posterior values associated with each set of documents.

TODO: use Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010, to update phi and gamma. The end game would be to somehow replace LdaPost entirely with LdaModel.

Initialize the posterior value structure for the given LDA model.

Parameters
  • doc (list of (int, int)) – A BOW representation of the document. Each element in the list is a pair of a word’s ID and its number of occurrences in the document.

  • lda (LdaModel, optional) – The underlying LDA model.

  • max_doc_len (int, optional) – The maximum number of words in a document.

  • num_topics (int, optional) – Number of topics discovered by the LDA model.

  • gamma (numpy.ndarray, optional) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.

  • lhood (float, optional) – The log likelihood lower bound.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

compute_lda_lhood()

Compute the log likelihood bound.

Returns

The optimal lower bound for the true posterior using the approximate distribution.

Return type

float

fit_lda_post(doc_number, time, ldaseq, LDA_INFERENCE_CONVERGED=1e-08, lda_inference_max_iter=25, g=None, g3_matrix=None, g4_matrix=None, g5_matrix=None)

Posterior inference for LDA.

Parameters
  • doc_number (int) – The document’s number.

  • time (int) – Time slice.

  • ldaseq (object) – Unused.

  • LDA_INFERENCE_CONVERGED (float) – Epsilon value used to check whether the inference step has sufficiently converged.

  • lda_inference_max_iter (int) – Maximum number of iterations in the inference step.

  • g (object) – Unused. Will be useful when the DIM model is implemented.

  • g3_matrix (object) – Unused. Will be useful when the DIM model is implemented.

  • g4_matrix (object) – Unused. Will be useful when the DIM model is implemented.

  • g5_matrix (object) – Unused. Will be useful when the DIM model is implemented.

Returns

The optimal lower bound for the true posterior using the approximate distribution.

Return type

float

init_lda_post()

Initialize variational posterior.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

update_gamma()

Update variational Dirichlet parameters.

This operation is described in the original Blei LDA paper: gamma = alpha + sum(phi), summed over every word for each topic.

Returns

The updated gamma parameters, one for each topic.

Return type

list of float
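
Schematically, the update looks like the following (an illustrative numpy sketch of the formula above, not the internal code; all values are hypothetical):

>>> import numpy as np
>>>
>>> alpha = np.full(2, 0.01)                    # symmetric Dirichlet prior, one entry per topic
>>> phi = np.array([[0.5, 0.5], [0.9, 0.1]])    # per-word topic responsibilities, shape (num_words, num_topics)
>>> counts = np.array([1, 2])                   # occurrences of each word in the document
>>> gamma = alpha + (counts[:, None] * phi).sum(axis=0)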

update_lda_seq_ss(time, doc, topic_suffstats)

Update the LDA sequence sufficient statistics from an LDA posterior.

This is very similar to the update_gamma() method and uses the same formula.

Parameters
  • time (int) – The time slice.

  • doc (list of (int, float)) – Unused but kept here for backwards compatibility. The document set in the constructor (self.doc) is used instead.

  • topic_suffstats (list of float) – Sufficient statistics for each topic.

Returns

The updated sufficient statistics for each topic.

Return type

list of float

update_phi(doc_number, time)

Update variational multinomial parameters, based on a document and a time-slice.

This is done based on the original Blei-LDA paper, where phi is proportional to beta * exp(Ψ(gamma)) (equivalently, log_phi := log(beta) + Ψ(gamma), up to normalization), over every topic for every word.

TODO: incorporate the Lee-Seung trick used in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.

Parameters
  • doc_number (int) – Document number. Unused.

  • time (int) – Time slice. Unused.

Returns

Multinomial parameters, and their logarithm, for each word in the document.

Return type

(list of float, list of float)
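
Schematically (an illustrative numpy sketch of the update above, not the internal code; all values are hypothetical):

>>> import numpy as np
>>> from scipy.special import digamma
>>>
>>> beta = np.array([[0.7, 0.3], [0.2, 0.8]])   # topic-word probabilities, shape (num_topics, num_words)
>>> gamma = np.array([1.2, 0.8])                # current Dirichlet parameters, one entry per topic
>>> log_phi = np.log(beta.T) + digamma(gamma)   # shape (num_words, num_topics)
>>> log_phi -= np.logaddexp.reduce(log_phi, axis=1, keepdims=True)  # log-normalize over topics
>>> phi = np.exp(log_phi)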

class gensim.models.ldaseqmodel.LdaSeqModel(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)

Bases: SaveLoad

Estimate Dynamic Topic Model parameters based on a training corpus.

Parameters
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms). If not given, the model is left untrained (presumably because you want to call update() manually).

  • time_slice (list of int, optional) – Number of documents in each time-slice. Each time slice could for example represent a year’s published papers, in case the corpus comes from a journal publishing over multiple years. It is assumed that sum(time_slice) == num_documents.

  • id2word (dict of (int, str), optional) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.

  • alphas (float, optional) – The prior probability for the model (a symmetric Dirichlet hyperparameter over per-document topic proportions).

  • num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

  • initialize ({'gensim', 'own', 'ldamodel'}, optional) –

    Controls the initialization of the DTM model. Supports three different modes:
    • ’gensim’: Uses gensim’s LDA initialization.

    • ’own’: Uses your own initialization matrix of an LDA model that has been previously trained.

    • ’ldamodel’: Uses a previously trained LDA model, passed in through the lda_model argument.

  • sstats (numpy.ndarray, optional) – Sufficient statistics used for initializing the model if initialize == ‘own’. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).

  • lda_model (LdaModel) – Model whose sufficient statistics will be used to initialize the current object if initialize == ‘ldamodel’.

  • obs_variance (float, optional) –

    Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.

  • chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.

  • passes (int, optional) – Number of passes over the corpus for the initial LdaModel.

  • random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.

  • lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.

  • em_min_iter (int, optional) – Minimum number of iterations of the Expectation-Maximization algorithm.

  • em_max_iter (int, optional) – Maximum number of iterations of the Expectation-Maximization algorithm.

  • chunksize (int, optional) – Number of documents in the corpus to be processed in a chunk.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
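
For example, to record a custom note on the model trained above (the event name and payload here are arbitrary):

>>> ldaseq.add_lifecycle_event("custom_note", note="trained on the toy common_corpus")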

doc_topics(doc_number)

Get the topic mixture for a document.

Uses the priors for the Dirichlet distribution that approximates the true posterior with the optimal lower bound, and therefore requires the model to be already trained.

Parameters

doc_number (int) – Index of the document for which the mixture is returned.

Returns

Probability for each topic in the mixture (essentially a point in the (self.num_topics - 1)-simplex).

Return type

list of length self.num_topics
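
Continuing the toy model from the top of this page (num_topics=2; the exact proportions depend on the training run):

>>> mixture = ldaseq.doc_topics(0)  # topic proportions of the first document
>>> len(mixture)
2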

dtm_coherence(time)

Get the coherence for each topic.

Can be used to measure the quality of the model, or to inspect the convergence through training via a callback.

Parameters

time (int) – The time slice.

Returns

The word representation for each topic, for each time slice. This can be used to check the time coherence of topics as time evolves: If the most relevant words remain the same then the topic has somehow converged or is relatively static, if they change rapidly the topic is evolving.

Return type

list of list of str
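
The returned word lists can be scored with CoherenceModel, e.g. (a sketch using the toy corpus; ‘u_mass’ coherence is chosen here because it needs only the corpus, not the raw texts):

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models import CoherenceModel
>>>
>>> topics_at_t0 = ldaseq.dtm_coherence(time=0)
>>> cm = CoherenceModel(topics=topics_at_t0, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')
>>> coherence = cm.get_coherence()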

dtm_vis(time, corpus)

Get the information needed to visualize the corpus model at a given time slice, using the pyLDAvis format.

Parameters
  • time (int) – The time slice we are interested in.

  • corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – The corpus we want to visualize at the given time slice.

Returns

  • doc_topics (list of length self.num_topics) – Probability for each topic in the mixture (essentially a point in the (self.num_topics - 1)-simplex).

  • topic_term (numpy.ndarray) – The representation of each topic as a multinomial over words in the vocabulary, expected shape (num_topics, vocabulary length).

  • doc_lengths (list of int) – The number of words in each document. These could be fixed, or drawn from a Poisson distribution.

  • term_frequency (numpy.ndarray) – The term frequency matrix (denoted as beta in the original Blei paper). This could also be the TF-IDF representation of the corpus, expected shape (number of documents, length of vocabulary).

  • vocab (list of str) – The set of unique terms existing in the corpus’s vocabulary.
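
These return values map directly onto pyLDAvis.prepare (a sketch; assumes the optional pyLDAvis package is installed, which is not a gensim dependency):

>>> doc_topic, topic_term, doc_lengths, term_frequency, vocab = ldaseq.dtm_vis(time=0, corpus=common_corpus)
>>>
>>> # import pyLDAvis
>>> # vis = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic,
>>> #                        doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)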

fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)

Fit a LDA Sequence model (DTM).

This method will iteratively set up LDA models and perform EM steps until the sufficient statistics converge, or until the maximum number of iterations is reached. Because the true posterior is intractable, an appropriately tight lower bound must be used instead. This function optimizes the bound by minimizing the Kullback-Leibler divergence between the approximate and the true posterior.

Parameters
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).

  • lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.

  • em_min_iter (int) – Minimum number of iterations of the EM algorithm.

  • em_max_iter (int) – Maximum number of iterations of the EM algorithm.

  • chunksize (int) – Number of documents to be processed in each chunk.

Returns

The highest lower bound for the true posterior produced after all iterations.

Return type

float

fit_lda_seq_topics(topic_suffstats)

Fit the sequential model topic-wise.

Parameters

topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).

Returns

The sum of the optimized lower bounds for all topics.

Return type

float

inferDTMseq(corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)

Compute the likelihood of a sequential corpus under an LDA seq model, and report the likelihood bound.

Parameters
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).

  • topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).

  • gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.

  • lhoods (list of float of length self.num_topics) – The total log probability bound for each topic. Corresponds to phi from the linked paper.

  • lda (LdaModel) – The trained LDA model of the previous iteration.

  • ldapost (LdaPost) – Posterior probability variables for the given LDA model. This will be used as the true (but intractable) posterior.

  • iter_ (int) – The current iteration.

  • bound (float) – The LDA bound produced after all iterations.

  • lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.

  • chunksize (int) – Number of documents to be processed in each chunk.

Returns

The first value is the highest lower bound for the true posterior. The second value is the list of optimized Dirichlet variational parameters for the approximation of the posterior.

Return type

(float, list of float)

init_ldaseq_ss(topic_chain_variance, topic_obs_variance, alpha, init_suffstats)

Initialize State Space Language Model, topic-wise.

Parameters
  • topic_chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.

  • topic_obs_variance (float) –

    Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.

  • alpha (float) – The prior probability for the model.

  • init_suffstats (numpy.ndarray) – Sufficient statistics used for initializing the model, expected shape (self.vocab_len, num_topics).

lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)

Inference (or E-step) for the lower bound EM optimization.

This is used to set up the gensim LdaModel to be used for each time-slice. It also allows for Document Influence Model code to be written in.

Parameters
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).

  • topic_suffstats (numpy.ndarray) – Sufficient statistics for time slice 0, used for initializing the model if initialize == ‘own’, expected shape (self.vocab_len, num_topics).

  • gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.

  • lhoods (list of float) – The total log probability lower bound for each topic. Corresponds to the phi variational parameters in the linked paper.

  • iter_ (int) – Current iteration.

  • lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.

  • chunksize (int) – Number of documents to be processed in each chunk.

Returns

The first value is the highest lower bound for the true posterior. The second value is the list of optimized Dirichlet variational parameters for the approximation of the posterior.

Return type

(float, list of float)

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

make_lda_seq_slice(lda, time)

Update the LDA model topic-word values using time slices.

Parameters
  • lda (LdaModel) – The stationary model to be updated.

  • time (int) – The time slice assigned to the stationary model.

Returns

lda – The stationary model updated to reflect the passed time slice.

Return type

LdaModel

print_topic(topic, time=0, top_terms=20)

Get the list of words most relevant to the given topic.

Parameters
  • topic (int) – The index of the topic to be inspected.

  • time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).

  • top_terms (int, optional) – Number of words associated with the topic to be returned.

Returns

The representation of this topic. Each element in the list includes the word itself, along with the probability assigned to it by the topic.

Return type

list of (str, float)

print_topic_times(topic, top_terms=20)

Get the most relevant words for a topic, for each time slice. This can be used to inspect the evolution of a topic through time.

Parameters
  • topic (int) – The index of the topic.

  • top_terms (int, optional) – Number of most relevant words associated with the topic to be returned.

Returns

Top top_terms relevant terms for the topic for each time slice.

Return type

list of list of str

print_topics(time=0, top_terms=20)

Get the most relevant words for every topic.

Parameters
  • time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).

  • top_terms (int, optional) – Number of most relevant words to be returned for each topic.

Returns

Representation of all topics. Each of them is represented by a list of pairs of words and their assigned probability.

Return type

list of list of (str, float)
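
A quick tour of the three inspection methods on the toy model (outputs omitted, since they depend on the training run):

>>> topic_now = ldaseq.print_topic(topic=0, time=0, top_terms=3)       # one topic at one time slice
>>> topic_over_time = ldaseq.print_topic_times(topic=0, top_terms=3)   # the same topic across all time slices
>>> all_topics = ldaseq.print_topics(time=0, top_terms=3)              # all topics at one time slice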

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

gensim.models.ldaseqmodel.df_obs(x, *args)

Derivative of the objective function used to optimize obs.

Parameters
  • x (list of float) – The obs values for this word.

  • sslm (sslm) – The State Space Language Model for DTM.

  • word_counts (list of int) – Total word counts for each time slice.

  • totals (list of int of length len(self.time_slice)) – The totals for each time slice.

  • mean_deriv_mtx (list of float) – Mean derivative for each time slice.

  • word (int) – The word’s ID.

  • deriv (list of float) – Mean derivative for each time slice.

Returns

The derivative of the objective function evaluated at point x.

Return type

list of float

gensim.models.ldaseqmodel.f_obs(x, *args)

The objective function that is minimized in order to optimize obs.

Parameters
  • x (list of float) – The obs values for this word.

  • sslm (sslm) – The State Space Language Model for DTM.

  • word_counts (list of int) – Total word counts for each time slice.

  • totals (list of int of length len(self.time_slice)) – The totals for each time slice.

  • mean_deriv_mtx (list of float) – Mean derivative for each time slice.

  • word (int) – The word’s ID.

  • deriv (list of float) – Mean derivative for each time slice.

Returns

The value of the objective function evaluated at point x.

Return type

float
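
f_obs and df_obs are shaped to be handed to a gradient-based scipy optimizer; update_obs() uses them roughly like this (a schematic sketch; args stands in for the internal state tuple of (sslm instance, word_counts, totals, mean_deriv_mtx, word, deriv)):

>>> # from scipy import optimize
>>> # obs = optimize.fmin_cg(f=f_obs, fprime=df_obs, x0=obs, args=args, disp=0)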

class gensim.models.ldaseqmodel.sslm(vocab_len=None, num_time_slices=None, num_topics=None, obs_variance=0.5, chain_variance=0.005)

Bases: SaveLoad

Encapsulate the inner State Space Language Model for DTM.

Some important attributes of this class:

  • obs is a matrix containing the document to topic ratios.

  • e_log_prob is a matrix containing the topic to word ratios.

  • mean contains the mean values to be used for inference for each word for a time slice.

  • variance contains the variance values to be used for inference of word in a time slice.

  • fwd_mean and fwd_variance are the forward posterior values for the mean and the variance.

  • zeta is an extra variational parameter with a value for each time slice.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

compute_bound(sstats, totals)

Compute the maximized lower bound achieved for the log probability of the true posterior.

Uses the formula presented in the appendix of the DTM paper (formula no. 5).

Parameters
  • sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).

  • totals (list of int of length len(self.time_slice)) – The totals for each time slice.

Returns

The maximized lower bound.

Return type

float

compute_expected_log_prob()

Compute the expected log probability given values of m.

The appendix describes the expectation of log-probabilities in equation 5 of the DTM paper; the implementation below is the result of solving that equation, and follows the original Blei DTM code.

Returns

The expected value for the log probabilities for each word and time slice.

Return type

numpy.ndarray of float

compute_mean_deriv(word, time, deriv)

Helper function for optimizing obs.

Compute the derivative d E[beta_{t,w}] / d obs_{s,w}, for t = 1:T.

Parameters
  • word (int) – The word’s ID.

  • time (int) – The time slice.

  • deriv (list of float) – Derivative for each time slice.

Returns

Mean derivative for each time slice.

Return type

list of float

compute_obs_deriv(word, word_counts, totals, mean_deriv_mtx, deriv)

Derivative of obs, used in the derivative function df_obs while optimizing.

Parameters
  • word (int) – The word’s ID.

  • word_counts (list of int) – Total word counts for each time slice.

  • totals (list of int of length len(self.time_slice)) – The totals for each time slice.

  • mean_deriv_mtx (list of float) – Mean derivative for each time slice.

  • deriv (list of float) – Mean derivative for each time slice.

Returns

Mean derivative for each time slice.

Return type

list of float

compute_post_mean(word, chain_variance)

Get the mean, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).

Notes

This function essentially computes E[eta_{t,w}] for t = 1:T.

Parameters
  • word (int) – The word’s ID.

  • chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.

Returns

The first returned value is the mean of each word in each time slice, the second value is the inferred posterior mean for the same pairs.

Return type

(numpy.ndarray, numpy.ndarray)

compute_post_variance(word, chain_variance)

Get the variance, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).

This function accepts the word whose variance is to be computed, along with the chain variance, and returns the variance and the posterior approximation fwd_variance.

Notes

This function essentially computes Var[beta_{t,w}] for t = 1:T

Parameters
  • word (int) – The word’s ID.

  • chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.

Returns

The first returned value is the variance of each word in each time slice, the second value is the inferred posterior variance for the same pairs.

Return type

(numpy.ndarray, numpy.ndarray)

fit_sslm(sstats)

Fit the variational distribution.

This is essentially the M-step. Maximizes the approximation of the true posterior for a particular topic using the provided sufficient statistics. Updates the values using update_obs() and compute_expected_log_prob().

Parameters

sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the current time slice, expected shape (self.vocab_len, num_topics).

Returns

The lower bound for the true posterior achieved using the fitted approximate distribution.

Return type

float

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

sslm_counts_init(obs_variance, chain_variance, sstats)

Initialize the State Space Language Model with LDA sufficient statistics.

Called for each topic chain; initializes the mean, variance and topic-word probabilities for the first time slice.

Parameters
  • obs_variance (float, optional) – Observed variance used to approximate the true and forward variance.

  • chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.

  • sstats (numpy.ndarray) – Sufficient statistics of the LDA model. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).

update_obs(sstats, totals)

Optimize the bound with respect to the observed variables.

TODO: This is by far the slowest function in the whole algorithm. Replacing or improving the performance of this would greatly speed things up.

Parameters
  • sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).

  • totals (list of int of length len(self.time_slice)) – The totals for each time slice.

Returns

The updated optimized values for obs and the zeta variational parameter.

Return type

(numpy.ndarray of float, numpy.ndarray of float)

update_zeta()

Update the Zeta variational parameter.

Zeta is described in the appendix and is equal to sum(exp(mean[word] + variance[word] / 2)) over every word, computed for each time slice. It is the value of the variational parameter zeta which maximizes the lower bound.

Returns

The updated zeta values for each time slice.

Return type

list of float
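
Schematically (an illustrative numpy sketch of the formula above, not the internal code):

>>> import numpy as np
>>>
>>> vocab_len, num_time_slices = 10, 3
>>> mean = np.zeros((vocab_len, num_time_slices))      # posterior mean per word and time slice
>>> variance = np.ones((vocab_len, num_time_slices))   # posterior variance per word and time slice
>>> zeta = np.exp(mean + variance / 2).sum(axis=0)     # one zeta value per time slice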