models.ldaseqmodel – Dynamic Topic Modeling in Python
Lda Sequence model, inspired by David M. Blei, John D. Lafferty: “Dynamic Topic Models”. The original C/C++ implementation can be found on blei-lab/dtm.
TODO: The next steps to take this forward would be:
Include DIM mode. Most of the infrastructure for this is in place.
See if LdaPost can be replaced by LdaModel completely without breaking anything.
The heavy lifting happens in the sslm class. Efforts can be made to cythonise its mathematical methods; in particular, update_obs and the optimization take a lot of time.
Try and make it distributed, especially around the E and M step.
Remove all C/C++ coding style/syntax.
Examples
Set up a model using 9 documents, with 2 in the first time-slice, 4 in the second, and 3 in the third
>>> from gensim.test.utils import common_corpus
>>> from gensim.models import LdaSeqModel
>>>
>>> ldaseq = LdaSeqModel(corpus=common_corpus, time_slice=[2, 4, 3], num_topics=2, chunksize=1)
Persist a model to disk and reload it later
>>> from gensim.test.utils import datapath
>>>
>>> temp_file = datapath("model")
>>> ldaseq.save(temp_file)
>>>
>>> # Load a potentially pre-trained model from disk.
>>> ldaseq = LdaSeqModel.load(temp_file)
Access the document embeddings generated from the DTM
>>> doc = common_corpus[1]
>>>
>>> embedding = ldaseq[doc]
- class gensim.models.ldaseqmodel.LdaPost(doc=None, lda=None, max_doc_len=None, num_topics=None, gamma=None, lhood=None)¶
Bases:
SaveLoad
Posterior values associated with each set of documents.
TODO: use Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010, to update phi and gamma. The end goal would be to replace LdaPost entirely with LdaModel.
Initialize the posterior value structure for the given LDA model.
- Parameters
doc (list of (int, int)) – A BOW representation of the document. Each element in the list is a pair of a word’s ID and its number of occurrences in the document.
lda (LdaModel, optional) – The underlying LDA model.
max_doc_len (int, optional) – The maximum number of words in a document.
num_topics (int, optional) – Number of topics discovered by the LDA model.
gamma (numpy.ndarray, optional) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhood (float, optional) – The log likelihood lower bound.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
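As a rough illustration of the record described above, the sketch below builds an event dict with the automatically added keys and checks that it is JSON-serializable. The helper `make_event` is hypothetical and is not gensim's implementation; the `gensim` version key is left out because the sketch does not import gensim.

```python
import json
import platform
import sys
from datetime import datetime


def make_event(event_name, **event):
    """Hypothetical helper: build a record shaped like a lifecycle event."""
    record = dict(event)  # user-supplied key-value pairs
    record["datetime"] = datetime.now().isoformat()
    record["python"] = sys.version
    record["platform"] = platform.platform()
    record["event"] = event_name
    return record


lifecycle_events = []
lifecycle_events.append(make_event("created", num_topics=2))

# json.dumps raises TypeError if any value is not JSON-serializable,
# which is why the event payload should be kept simple.
serialized = json.dumps(lifecycle_events)
```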
- compute_lda_lhood()¶
Compute the log likelihood bound.
- Returns
The optimal lower bound for the true posterior using the approximate distribution.
- Return type
float
- fit_lda_post(doc_number, time, ldaseq, LDA_INFERENCE_CONVERGED=1e-08, lda_inference_max_iter=25, g=None, g3_matrix=None, g4_matrix=None, g5_matrix=None)¶
Posterior inference for lda.
- Parameters
doc_number (int) – The document’s number.
time (int) – Time slice.
ldaseq (object) – Unused.
LDA_INFERENCE_CONVERGED (float) – Epsilon value used to check whether the inference step has sufficiently converged.
lda_inference_max_iter (int) – Maximum number of iterations in the inference step.
g (object) – Unused. Will be useful when the DIM model is implemented.
g3_matrix (object) – Unused. Will be useful when the DIM model is implemented.
g4_matrix (object) – Unused. Will be useful when the DIM model is implemented.
g5_matrix (object) – Unused. Will be useful when the DIM model is implemented.
- Returns
The optimal lower bound for the true posterior using the approximate distribution.
- Return type
float
- init_lda_post()¶
Initialize variational posterior.
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- update_gamma()¶
Update variational dirichlet parameters.
This operation is described in the original Blei LDA paper: gamma = alpha + sum(phi), over every topic for every word.
- Returns
The updated gamma parameters for each word in the document.
- Return type
list of float
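The update above can be sketched in plain Python. This is a minimal illustration, not gensim's code: the function name `gamma_update` is hypothetical, and weighting each word's phi row by its count in the BOW document is an assumption (the text above only states gamma = alpha + sum(phi)).

```python
def gamma_update(alpha, phi, doc):
    """gamma[k] = alpha[k] + sum over words n of count_n * phi[n][k]."""
    gamma = list(alpha)
    for (word_id, count), phi_row in zip(doc, phi):
        for k, phi_nk in enumerate(phi_row):
            # Assumption: a word occurring `count` times contributes `count` times.
            gamma[k] += phi_nk * count
    return gamma


alpha = [0.01, 0.01]              # Dirichlet prior over 2 topics
doc = [(0, 2), (3, 1)]            # BOW: word 0 occurs twice, word 3 once
phi = [[0.75, 0.25], [0.5, 0.5]]  # one row of topic responsibilities per word

gamma = gamma_update(alpha, phi, doc)
```

Because every phi row sums to one, the entries of gamma sum to sum(alpha) plus the number of word tokens in the document.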
- update_lda_seq_ss(time, doc, topic_suffstats)¶
Update lda sequence sufficient statistics from an lda posterior.
This is very similar to the update_gamma() method and uses the same formula.
- Parameters
time (int) – The time slice.
doc (list of (int, float)) – Unused but kept here for backwards compatibility. The document set in the constructor (self.doc) is used instead.
topic_suffstats (list of float) – Sufficient statistics for each topic.
- Returns
The updated sufficient statistics for each topic.
- Return type
list of float
- update_phi(doc_number, time)¶
Update variational multinomial parameters, based on a document and a time-slice.
This is done based on the original Blei-LDA paper, where phi is proportional to beta * exp(Ψ(gamma)) (equivalently, log_phi = log(beta) + Ψ(gamma) up to an additive constant), over every topic for every word.
TODO: incorporate the Lee-Seung trick used in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.
- Parameters
doc_number (int) – Document number. Unused.
time (int) – Time slice. Unused.
- Returns
Multinomial parameters, and their logarithm, for each word in the document.
- Return type
(list of float, list of float)
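A minimal sketch of this update for a single word, using only the standard library. The names `digamma` and `phi_for_word` are hypothetical helpers, not gensim functions; the digamma approximation (recurrence plus asymptotic series) is a standard numerical technique chosen here to avoid a SciPy dependency.

```python
import math


def digamma(x):
    """Approximate the digamma function psi(x) for x > 0."""
    result = 0.0
    # Shift x upward with the recurrence psi(x) = psi(x + 1) - 1/x.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    # Asymptotic series: ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def phi_for_word(beta_column, gamma):
    """phi[k] proportional to beta[k][word] * exp(psi(gamma[k])), normalized over topics."""
    log_phi = [math.log(b) + digamma(g) for b, g in zip(beta_column, gamma)]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_phi)
    unnormalized = [math.exp(v - shift) for v in log_phi]
    total = sum(unnormalized)
    phi = [u / total for u in unnormalized]
    return phi, [math.log(p) for p in phi]


beta_column = [0.2, 0.05]  # probability of one word under each of 2 topics
gamma = [2.01, 1.01]       # variational Dirichlet parameters of the document
phi, log_phi = phi_for_word(beta_column, gamma)
```

Working in log space and normalizing at the end keeps the computation stable when beta values are very small.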
- class gensim.models.ldaseqmodel.LdaSeqModel(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)¶
Bases:
SaveLoad
Estimate Dynamic Topic Model parameters based on a training corpus.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms). If not given, the model is left untrained (presumably because you want to call update() manually).
time_slice (list of int, optional) – Number of documents in each time-slice. Each time slice could for example represent a year’s published papers, in case the corpus comes from a journal publishing over multiple years. It is assumed that sum(time_slice) == num_documents.
id2word (dict of (int, str), optional) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
alphas (float, optional) – The prior probability for the model.
num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.
initialize ({'gensim', 'own', 'ldamodel'}, optional) –
Controls the initialization of the DTM model. Supports three different modes:
’gensim’: Uses gensim’s LDA initialization.
’own’: Uses your own initialization matrix of an LDA model that has been previously trained.
’ldamodel’: Uses a previously trained LDA model, passed in through the lda_model argument.
sstats (numpy.ndarray, optional) – Sufficient statistics used for initializing the model if initialize == ‘own’. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).
lda_model (LdaModel) – Model whose sufficient statistics will be used to initialize the current object if initialize == ‘ldamodel’.
obs_variance (float, optional) –
Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
passes (int, optional) – Number of passes over the corpus for the initial LdaModel.
random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.
lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.
em_min_iter (int, optional) – Minimum number of iterations until convergence of the Expectation-Maximization algorithm.
em_max_iter (int, optional) – Maximum number of iterations until convergence of the Expectation-Maximization algorithm.
chunksize (int, optional) – Number of documents in the corpus to be processed in a chunk.
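The time_slice argument partitions the corpus in document order: with time_slice=[2, 4, 3], the first 2 documents form the first slice, the next 4 the second, and the last 3 the third. A minimal sketch of that mapping follows; the helper `slice_bounds` is hypothetical and not part of gensim.

```python
from itertools import accumulate


def slice_bounds(time_slice):
    """Return (start, end) document index pairs, one per time slice (end exclusive)."""
    ends = list(accumulate(time_slice))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


time_slice = [2, 4, 3]
num_documents = 9

# The slice sizes must account for every document in the corpus.
assert sum(time_slice) == num_documents

bounds = slice_bounds(time_slice)
print(bounds)  # [(0, 2), (2, 6), (6, 9)]
```

Checking sum(time_slice) against the corpus length before training is a cheap way to catch a mismatched slice specification.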
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- doc_topics(doc_number)¶
Get the topic mixture for a document.
Uses the priors for the dirichlet distribution that approximates the true posterior with the optimal lower bound, and therefore requires the model to be already trained.
- Parameters
doc_number (int) – Index of the document for which the mixture is returned.
- Returns
Probability for each topic in the mixture (essentially a point in the self.num_topics - 1 simplex).
- Return type
list of length self.num_topics
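One common way to turn Dirichlet parameters into a point on the simplex is to normalize them so that they sum to one, which gives the mean of the Dirichlet distribution. The sketch below illustrates that idea only; whether gensim computes the mixture exactly this way is an assumption, and `topic_mixture` is a hypothetical helper.

```python
def topic_mixture(gamma):
    """Normalize Dirichlet parameters into topic proportions (the Dirichlet mean)."""
    total = sum(gamma)
    return [g / total for g in gamma]


gamma = [2.01, 1.01]  # hypothetical variational parameters for one document
mixture = topic_mixture(gamma)
```

The result has one entry per topic, and the entries are non-negative and sum to one.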
- dtm_coherence(time)¶
Get the coherence for each topic.
Can be used to measure the quality of the model, or to inspect the convergence through training via a callback.
- Parameters
time (int) – The time slice.
- Returns
The word representation for each topic, for each time slice. This can be used to check the time coherence of topics as time evolves: if the most relevant words remain the same, the topic has converged or is relatively static; if they change rapidly, the topic is evolving.
- Return type
list of list of str
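To make the "static versus evolving" check concrete, one option is to measure the overlap between a topic's word lists at consecutive time slices. The sketch below uses Jaccard similarity on hypothetical word lists; it assumes you have already collected the list for one topic at each time slice, and it is an illustration rather than a metric provided by gensim.

```python
def jaccard(a, b):
    """Jaccard similarity between two word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical top words for one topic at three consecutive time slices.
top_words_by_time = [
    ["graph", "trees", "minors", "survey"],
    ["graph", "trees", "minors", "system"],
    ["system", "user", "time", "response"],
]

# Overlap between each time slice and the next one.
overlaps = [
    jaccard(prev, curr)
    for prev, curr in zip(top_words_by_time, top_words_by_time[1:])
]
```

An overlap close to 1 suggests a static topic, while a value close to 0 suggests that the topic's vocabulary is shifting quickly.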
- dtm_vis(time, corpus)¶
Get the information needed to visualize the corpus model at a given time slice, using the pyLDAvis format.
- Parameters
time (int) – The time slice we are interested in.
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – The corpus we want to visualize at the given time slice.
- Returns
doc_topics (list of length self.num_topics) – Probability for each topic in the mixture (essentially a point in the self.num_topics - 1 simplex).
topic_term (numpy.ndarray) – The representation of each topic as a multinomial over words in the vocabulary, expected shape (num_topics, vocabulary length).
doc_lengths (list of int) – The number of words in each document. These could be fixed, or drawn from a Poisson distribution.
term_frequency (numpy.ndarray) – The term frequency matrix (denoted as beta in the original Blei paper). This could also be the TF-IDF representation of the corpus, expected shape (number of documents, length of vocabulary).
vocab (list of str) – The set of unique terms existing in the corpus’s vocabulary.
- fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)¶
Fit a LDA Sequence model (DTM).
This method will iteratively set up LDA models and perform EM steps until the sufficient statistics converge, or until the maximum number of iterations is reached. Because the true posterior is intractable, an appropriately tight lower bound must be used instead. This function will optimize this bound by minimizing the Kullback-Leibler divergence between the approximation and the true posterior.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
em_min_iter (int) – Minimum number of iterations of the Expectation-Maximization algorithm.
em_max_iter (int) – Maximum number of iterations of the Expectation-Maximization algorithm.
chunksize (int) – Number of documents to be processed in each chunk.
- Returns
The highest lower bound for the true posterior produced after all iterations.
- Return type
float
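The stopping rule described above (keep iterating until the bound converges or the maximum number of iterations is reached) can be sketched as a small control loop. Everything here is illustrative: `run_em`, `em_step` and the relative-change `threshold` are hypothetical names and values, not gensim's implementation.

```python
def run_em(em_step, em_min_iter=6, em_max_iter=20, threshold=1e-4):
    """Run em_step until the bound stops changing or the iteration cap is hit."""
    bound = 0.0
    convergence = float("inf")
    iteration = 0
    # Always run at least em_min_iter iterations, never more than em_max_iter.
    while iteration < em_min_iter or (convergence > threshold and iteration < em_max_iter):
        old_bound = bound
        bound = em_step(iteration)  # one E-step plus M-step, returning the new bound
        if old_bound != 0.0:
            # Relative change of the bound between consecutive iterations.
            convergence = abs((bound - old_bound) / old_bound)
        iteration += 1
    return bound, iteration


# A bound that is flat from the start stops as soon as em_min_iter is reached.
flat_bound, flat_iters = run_em(lambda i: -100.0)

# A bound that keeps changing runs until em_max_iter.
moving_bound, moving_iters = run_em(lambda i: -10.0 * (i + 1))
```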
- fit_lda_seq_topics(topic_suffstats)¶
Fit the sequential model topic-wise.
- Parameters
topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).
- Returns
The sum of the optimized lower bounds for all topics.
- Return type
float
- inferDTMseq(corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)¶
Compute the likelihood of a sequential corpus under an LDA seq model, and report the likelihood bound.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).
topic_suffstats (numpy.ndarray) – Sufficient statistics of the current model, expected shape (self.vocab_len, num_topics).
gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhoods (list of float of length self.num_topics) – The total log probability bound for each topic. Corresponds to phi from the linked paper.
lda (LdaModel) – The trained LDA model of the previous iteration.
ldapost (LdaPost) – Posterior probability variables for the given LDA model. This will be used as the true (but intractable) posterior.
iter_ (int) – The current iteration.
bound (float) – The LDA bound produced after all iterations.
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
chunksize (int) – Number of documents to be processed in each chunk.
- Returns
The first value is the highest lower bound for the true posterior. The second value is the list of optimized dirichlet variational parameters for the approximation of the posterior.
- Return type
(float, list of float)
- init_ldaseq_ss(topic_chain_variance, topic_obs_variance, alpha, init_suffstats)¶
Initialize State Space Language Model, topic-wise.
- Parameters
topic_chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.
topic_obs_variance (float) –
Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
alpha (float) – The prior probability for the model.
init_suffstats (numpy.ndarray) – Sufficient statistics used for initializing the model, expected shape (self.vocab_len, num_topics).
- lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)¶
Inference (or E-step) for the lower bound EM optimization.
This is used to set up the gensim LdaModel to be used for each time-slice. It also allows for Document Influence Model code to be written in.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_documents, num_terms).
topic_suffstats (numpy.ndarray) – Sufficient statistics for time slice 0, used for initializing the model if initialize == ‘own’, expected shape (self.vocab_len, num_topics).
gammas (numpy.ndarray) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.
lhoods (list of float) – The total log probability lower bound for each topic. Corresponds to the phi variational parameters in the linked paper.
iter_ (int) – Current iteration.
lda_inference_max_iter (int) – Maximum number of iterations for the inference step of LDA.
chunksize (int) – Number of documents to be processed in each chunk.
- Returns
The first value is the highest lower bound for the true posterior. The second value is the list of optimized dirichlet variational parameters for the approximation of the posterior.
- Return type
(float, list of float)
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- make_lda_seq_slice(lda, time)¶
Update the LDA model topic-word values using time slices.
- print_topic(topic, time=0, top_terms=20)¶
Get the list of words most relevant to the given topic.
- Parameters
topic (int) – The index of the topic to be inspected.
time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).
top_terms (int, optional) – Number of words associated with the topic to be returned.
- Returns
The representation of this topic. Each element in the list includes the word itself, along with the probability assigned to it by the topic.
- Return type
list of (str, float)
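Because each call returns (word, probability) pairs for one time slice, results from several time slices can be combined to follow a single word over time. The sketch below works on hypothetical values shaped like that return type; `word_trajectory` is an illustrative helper, not a gensim function.

```python
# Hypothetical (word, probability) lists for one topic at time slices 0, 1 and 2.
topic_over_time = [
    [("graph", 0.12), ("trees", 0.10), ("system", 0.03)],
    [("graph", 0.10), ("system", 0.08), ("trees", 0.07)],
    [("system", 0.15), ("graph", 0.06), ("user", 0.05)],
]


def word_trajectory(word, topic_over_time):
    """Probability of `word` per time slice, or None where it is not among the top terms."""
    return [dict(terms).get(word) for terms in topic_over_time]


trajectory = word_trajectory("system", topic_over_time)
print(trajectory)  # [0.03, 0.08, 0.15]
```

A word that is missing from the top terms of a slice shows up as None, so a larger top_terms value gives more complete trajectories.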
- print_topic_times(topic, top_terms=20)¶
Get the most relevant words for a topic, for each timeslice. This can be used to inspect the evolution of a topic through time.
- Parameters
topic (int) – The index of the topic.
top_terms (int, optional) – Number of most relevant words associated with the topic to be returned.
- Returns
Top top_terms relevant terms for the topic for each time slice.
- Return type
list of list of str
- print_topics(time=0, top_terms=20)¶
Get the most relevant words for every topic.
- Parameters
time (int, optional) – The time slice we are interested in (since topics evolve over time, it is expected that the most relevant words will also gradually change).
top_terms (int, optional) – Number of most relevant words to be returned for each topic.
- Returns
Representation of all topics. Each of them is represented by a list of pairs of words and their assigned probability.
- Return type
list of list of (str, float)
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- gensim.models.ldaseqmodel.df_obs(x, *args)¶
Derivative of the objective function which optimises obs.
- Parameters
x (list of float) – The obs values for this word.
sslm (sslm) – The State Space Language Model for DTM.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
word (int) – The word’s ID.
deriv (list of float) – Mean derivative for each time slice.
- Returns
The derivative of the objective function evaluated at point x.
- Return type
list of float
- gensim.models.ldaseqmodel.f_obs(x, *args)¶
Objective function that is minimized when optimizing obs.
- Parameters
x (list of float) – The obs values for this word.
sslm (sslm) – The State Space Language Model for DTM.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
word (int) – The word’s ID.
deriv (list of float) – Mean derivative for each time slice.
- Returns
The value of the objective function evaluated at point x.
- Return type
list of float
- class gensim.models.ldaseqmodel.sslm(vocab_len=None, num_time_slices=None, num_topics=None, obs_variance=0.5, chain_variance=0.005)¶
Bases:
SaveLoad
Encapsulate the inner State Space Language Model for DTM.
Some important attributes of this class:
obs is a matrix containing the document to topic ratios.
e_log_prob is a matrix containing the topic to word ratios.
mean contains the mean values to be used for inference for each word for a time slice.
variance contains the variance values to be used for inference of word in a time slice.
fwd_mean and fwd_variance are the forward posterior values for the mean and the variance.
zeta is an extra variational parameter with a value for each time slice.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- compute_bound(sstats, totals)¶
Compute the maximized lower bound achieved for the log probability of the true posterior.
Uses the formula presented in the appendix of the DTM paper (formula no. 5).
- Parameters
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
- Returns
The maximized lower bound.
- Return type
float
- compute_expected_log_prob()¶
Compute the expected log probability given values of m.
The appendix of the DTM paper describes the expectation of the log-probabilities in equation 5. The implementation below is the result of solving that equation and follows the original Blei DTM code.
- Returns
The expected value for the log probabilities for each word and time slice.
- Return type
numpy.ndarray of float
- compute_mean_deriv(word, time, deriv)¶
Helper functions for optimizing a function.
Compute the derivative of:
- Parameters
word (int) – The word’s ID.
time (int) – The time slice.
deriv (list of float) – Derivative for each time slice.
- Returns
Mean derivative for each time slice.
- Return type
list of float
- compute_obs_deriv(word, word_counts, totals, mean_deriv_mtx, deriv)¶
Derivative of obs, used by the derivative function df_obs during optimization.
- Parameters
word (int) – The word’s ID.
word_counts (list of int) – Total word counts for each time slice.
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
mean_deriv_mtx (list of float) – Mean derivative for each time slice.
deriv (list of float) – Mean derivative for each time slice.
- Returns
Mean derivative for each time slice.
- Return type
list of float
- compute_post_mean(word, chain_variance)¶
Get the mean, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).
Notes
This function essentially computes E[eta_{t,w}] for t = 1:T.
- Parameters
word (int) – The word’s ID.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
- Returns
The first returned value is the mean of each word in each time slice, the second value is the inferred posterior mean for the same pairs.
- Return type
(numpy.ndarray, numpy.ndarray)
- compute_post_variance(word, chain_variance)¶
Get the variance, based on the Variational Kalman Filtering approach for Approximate Inference (section 3.1).
This function accepts the word to compute variance for, along with the associated sslm class object, and returns the variance and the posterior approximation fwd_variance.
Notes
This function essentially computes Var[beta_{t,w}] for t = 1:T
- Parameters
word (int) – The word’s ID.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
- Returns
The first returned value is the variance of each word in each time slice, the second value is the inferred posterior variance for the same pairs.
- Return type
(numpy.ndarray, numpy.ndarray)
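The forward and backward passes behind this computation can be illustrated with the standard scalar Kalman filter and Rauch-Tung-Striebel smoother for a random walk observed with Gaussian noise. This is a sketch under stated assumptions, not gensim's code: `post_variance` is a hypothetical function, and `init_variance` is a hypothetical parameter because the text above does not say how the first time step is initialized.

```python
def post_variance(num_time_slices, obs_variance, chain_variance, init_variance):
    """Return (smoothed variance, forward variance), each of length T + 1."""
    T = num_time_slices
    fwd_variance = [0.0] * (T + 1)
    variance = [0.0] * (T + 1)

    # Forward pass: predict with the chain variance, then condition on an observation.
    fwd_variance[0] = init_variance
    for t in range(1, T + 1):
        predicted = fwd_variance[t - 1] + chain_variance
        fwd_variance[t] = predicted * obs_variance / (predicted + obs_variance)

    # Backward pass: smooth each step using the estimate of the following step.
    variance[T] = fwd_variance[T]
    for t in range(T - 1, -1, -1):
        gain = fwd_variance[t] / (fwd_variance[t] + chain_variance)
        variance[t] = fwd_variance[t] + gain ** 2 * (
            variance[t + 1] - fwd_variance[t] - chain_variance
        )
    return variance, fwd_variance


variance, fwd_variance = post_variance(
    3, obs_variance=0.5, chain_variance=0.005, init_variance=5.0
)
```

Smoothing uses information from later time slices as well, so the smoothed variance is never larger than the forward variance at the same step.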
- fit_sslm(sstats)¶
Fits variational distribution.
This is essentially the M-step. Maximizes the approximation of the true posterior for a particular topic using the provided sufficient statistics. Updates the values using update_obs() and compute_expected_log_prob().
- Parameters
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the current time slice, expected shape (self.vocab_len, num_topics).
- Returns
The lower bound for the true posterior achieved using the fitted approximate distribution.
- Return type
float
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- sslm_counts_init(obs_variance, chain_variance, sstats)¶
Initialize the State Space Language Model with LDA sufficient statistics.
Called for each topic-chain and initializes initial mean, variance and Topic-Word probabilities for the first time-slice.
- Parameters
obs_variance (float, optional) – Observed variance used to approximate the true and forward variance.
chain_variance (float) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve over time.
sstats (numpy.ndarray) – Sufficient statistics of the LDA model. Corresponds to matrix beta in the linked paper for time slice 0, expected shape (self.vocab_len, num_topics).
- update_obs(sstats, totals)¶
Optimize the bound with respect to the observed variables.
TODO: This is by far the slowest function in the whole algorithm. Replacing or improving the performance of this would greatly speed things up.
- Parameters
sstats (numpy.ndarray) – Sufficient statistics for a particular topic. Corresponds to matrix beta in the linked paper for the first time slice, expected shape (self.vocab_len, num_topics).
totals (list of int of length len(self.time_slice)) – The totals for each time slice.
- Returns
The updated optimized values for obs and the zeta variational parameter.
- Return type
(numpy.ndarray of float, numpy.ndarray of float)
- update_zeta()¶
Update the Zeta variational parameter.
Zeta is described in the appendix and is equal to sum (exp(mean[word] + Variance[word] / 2)), over every time-slice. It is the value of variational parameter zeta which maximizes the lower bound.
- Returns
The updated zeta values for each time slice.
- Return type
list of float
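A minimal sketch of the zeta formula, reading it as a sum over words computed separately for each time slice (this reading matches the per-time-slice return value, but it is an interpretation). The function `zeta_update` and the `[word][time]` indexing of the inputs are assumptions made for illustration.

```python
import math


def zeta_update(mean, variance):
    """zeta[t] = sum over words w of exp(mean[w][t] + variance[w][t] / 2)."""
    num_time_slices = len(mean[0])
    return [
        sum(
            math.exp(mean_row[t] + var_row[t] / 2.0)
            for mean_row, var_row in zip(mean, variance)
        )
        for t in range(num_time_slices)
    ]


mean = [[0.0, 1.0], [0.0, -1.0]]      # 2 words x 2 time slices
variance = [[0.0, 0.0], [0.0, 0.0]]   # zero variance keeps the arithmetic easy to check
zeta = zeta_update(mean, variance)
```

With zero variance, the first time slice gives exp(0) + exp(0) = 2 and the second gives e + 1/e.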