models.hdpmodel – Hierarchical Dirichlet Process


Module for the online Hierarchical Dirichlet Process (HDP).

The core estimation code is adapted directly from the blei-lab/online-hdp code, which implements Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process”, JMLR (2011).

Examples

Train HdpModel

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models import HdpModel
>>>
>>> hdp = HdpModel(common_corpus, common_dictionary)

You can then infer topic distributions on new, unseen documents, with

>>> unseen_document = [(1, 3.), (2, 4)]
>>> doc_hdp = hdp[unseen_document]

To print 20 topics with the 10 most probable words for each:

>>> topic_info = hdp.print_topics(num_topics=20, num_words=10)

The model can be updated (trained) with new documents via

>>> hdp.update([[(1, 2)], [(1, 1), (4, 5)]])

class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None, random_state=None)

Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel

Hierarchical Dirichlet Process model

Topic models promise to help summarize and organize large archives of texts that cannot be easily analyzed by hand. The hierarchical Dirichlet process (HDP) is a powerful mixed-membership model for the unsupervised analysis of grouped data. Unlike its finite counterpart, latent Dirichlet allocation, the HDP topic model infers the number of topics from the data. This class implements online HDP, which provides the speed of online variational Bayes with the modeling flexibility of the HDP. The idea behind online variational Bayes in general is to optimize the variational objective function with stochastic optimization. The challenge is that existing coordinate-ascent variational Bayes algorithms for the HDP require complicated approximation methods or numerical optimization. This model uses the stick-breaking construction of the HDP, which allows coordinate-ascent variational Bayes without numerical approximation.

Stick breaking construction

To understand the HDP model, we need to understand how it is modelled using the stick-breaking construction. A good analogy for this construction is the Chinese restaurant franchise.

Assume that there is a restaurant franchise (the corpus) with a large number of restaurants (documents, j). The restaurants share a global menu of dishes (topics, $\Phi_{k}$ ). Each table t serves a single dish (topic, $\Phi_{k}$ ), which is shared by all the customers (words, $\theta_{j,i}$ ) who sit at that table. A customer who enters a restaurant chooses where to sit: either at a table where other customers are already sitting, or at a new table. The two options are not equally likely.

In this analogy, the global menu of dishes corresponds to the global atoms $\Phi_{k}$ , and each restaurant corresponds to a single document j. The number of dishes served in a particular restaurant corresponds to the number of topics in that document, and the number of people sitting at each table corresponds to the number of words belonging to each topic inside document j.
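The seating choice can be sketched for a single restaurant as follows. This is a minimal illustration, assuming the standard Chinese restaurant process rule: an occupied table is chosen with probability proportional to the number of customers already sitting at it, and a new table with probability proportional to a concentration parameter. The function name `seat_customers` and its arguments are invented for this example and are not part of gensim. The sketch does not model the franchise-level choice of dishes.

```python
import random


def seat_customers(num_customers, concentration, seed=0):
    """Seat customers one at a time; return the number of customers per table."""
    rng = random.Random(seed)
    table_sizes = []
    for _ in range(num_customers):
        # Occupied tables are weighted by their size, a new table by `concentration`.
        weights = table_sizes + [concentration]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)      # the customer opens a new table
        else:
            table_sizes[choice] += 1   # the customer joins an existing table
    return table_sizes


sizes = seat_customers(num_customers=50, concentration=1.0)
print(len(sizes), "tables occupied:", sizes)
```

Because popular tables attract more customers, a few tables tend to hold most of the customers. The number of occupied tables is not fixed in advance.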

The concept behind the Chinese restaurant franchise carries over directly to the stick-breaking construction for the HDP (“Figure 1” from “Online Variational Inference for the Hierarchical Dirichlet Process”).

A two-level hierarchical Dirichlet process is a collection of Dirichlet processes $G_{j}$ , one for each group, which share a base distribution $G_{0}$ , which is itself a Dirichlet process. All $G_{j}$ share the same set of atoms, $\Phi_{k}$ , and only the atom weights $\pi _{jt}$ differ.

There can be multiple document-level atoms $\psi_{jt}$ which map to the same corpus-level atom $\Phi_{k}$ . Here, $\beta$ denotes the weights given to each of the topics globally. Each factor $\theta_{j,i}$ is distributed according to $G_{j}$ , i.e., it takes on the value of the document-level atom $\psi_{jt}$ with probability $\pi _{jt}$ . $C_{j,t}$ is an indicator variable whose value k is the index of the corpus-level atom $\Phi_{k}$ that $\psi_{jt}$ maps to.

The top-level (corpus-level) stick proportions correspond to the values of $\beta$ , and the bottom-level (document-level) stick proportions correspond to the values of $\pi$ . The truncation levels for the corpus (K) and for each document (T) bound the number of $\beta$ and $\pi$ weights that are represented. Note that this paragraph follows the paper's notation; in the constructor of this class, T is the top-level truncation and K is the second-level truncation.
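A truncated two-level stick-breaking draw can be sketched as below. This is an illustrative sample from the generative construction, not the variational code used by the model. It assumes that each stick proportion is drawn from a Beta(1, concentration) distribution and that the last stick takes the remaining mass so the weights sum to 1. The names `stick_breaking_weights`, `corpus_truncation` and `doc_truncation`, and the values used, are hypothetical.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Return `truncation` weights that sum to 1, built by breaking a unit stick."""
    weights = []
    remaining = 1.0
    for index in range(truncation):
        if index == truncation - 1:
            fraction = 1.0  # the last piece takes whatever is left of the stick
        else:
            fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


rng = random.Random(0)
corpus_truncation, doc_truncation = 15, 5  # illustrative values only

beta = stick_breaking_weights(1.0, corpus_truncation, rng)  # corpus-level weights
pi_j = stick_breaking_weights(1.0, doc_truncation, rng)     # weights for one document j

# c_j[t] is the index k of the corpus-level atom that document-level atom t maps to.
c_j = rng.choices(range(corpus_truncation), weights=beta, k=doc_truncation)
print(c_j)
```

Several entries of `c_j` can be equal, which is how multiple document-level atoms end up pointing to the same corpus-level atom.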

Coordinate-ascent updates are performed at two levels: the document level and the corpus level.

At document level, we update the following:

The parameters of the document-level sticks, i.e., the a and b parameters of the Beta distribution of the variable $\pi _{jt}$ .
The parameters of the per-word topic indicators, $Z_{j,n}$ . Here $Z_{j,n}$ selects the topic parameter $\psi_{jt}$ .
The parameters of the per-document topic indices $\Phi_{jtk}$ .

At corpus level, we update the following:

The parameters of the top-level sticks, i.e., the parameters of the Beta distribution for the corpus-level $\beta$ , which represent the topic distribution at corpus level.
The parameters of the topics $\Phi_{k}$ .

The procedure for online variational inference for the HDP model is as follows:

1. Initialise the corpus-level parameters and topic parameters randomly, and set the current time to 1.
2. Fetch a random document j from the corpus.
3. Compute all the parameters required for document-level updates.
4. Compute the natural gradients of the corpus-level parameters.
5. Compute the learning rate as a function of kappa, tau and the current time, then increment the current time by 1.
6. Update the corpus-level parameters.

Repeat steps 2 to 6 until a stopping condition is met.

The stopping condition is any one of the following:

time limit expired
chunk limit reached
whole corpus processed
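The control flow of this procedure can be sketched as follows. This is a simplified skeleton, not the gensim implementation. It assumes the learning-rate schedule from the paper with an added scale factor, `scale * (tau + t) ** (-kappa)`; the exact offsets used inside the library may differ. It makes a single pass over the chunks and leaves the document-level and corpus-level updates as comments. The function names are hypothetical.

```python
import time


def learning_rate(update_count, kappa=1.0, tau=64.0, scale=1.0):
    """Step size that shrinks as more updates are performed (assumed form)."""
    return scale * (tau + update_count) ** (-kappa)


def run_online_updates(chunks, max_chunks=None, max_time=None):
    """Iterate over chunks until one of the three stopping conditions holds."""
    start = time.monotonic()
    update_count = 1
    rates = []
    for chunks_done, chunk in enumerate(chunks):
        if max_chunks is not None and chunks_done >= max_chunks:
            break  # chunk limit reached
        if max_time is not None and time.monotonic() - start >= max_time:
            break  # time limit expired
        # Steps 3 and 4 (document-level updates, natural gradients) would use `chunk`.
        rates.append(learning_rate(update_count))  # step 5
        update_count += 1
        # Step 6 (corpus-level update) would use the rate just computed.
    return rates  # the loop also ends once the whole corpus has been processed


rates = run_online_updates([["doc"]] * 10, max_chunks=3)
print(rates)
```

With a positive kappa the rate decreases at every step, so later chunks change the corpus-level parameters less than earlier ones.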

lda_alpha

Same as $\alpha$ from gensim.models.ldamodel.LdaModel.

Type: numpy.ndarray

lda_beta

Same as $\beta$ from gensim.models.ldamodel.LdaModel.

Type: numpy.ndarray

m_D

Number of documents in the corpus.

Type: int

m_Elogbeta

Stores the Dirichlet expectation, i.e., $E[\log \theta]$ for a vector $\theta \sim Dir(\alpha)$ .

Type: numpy.ndarray
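For reference, this expectation has a closed form for a Dirichlet-distributed vector. The index names i and j below are notation chosen for this note, and digamma denotes the derivative of the log-gamma function:

```latex
E[\log \theta_i] = \operatorname{digamma}(\alpha_i) - \operatorname{digamma}\Big(\sum_{j} \alpha_j\Big)
```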

m_lambda

Drawn samples from the parameterized gamma distribution.

Type: {numpy.ndarray, float}

m_lambda_sum

An array with the same shape as m_lambda, with the specified axis (1) removed.

Type: {numpy.ndarray, float}

m_num_docs_processed

Number of documents processed so far. This is incremented by the size of each chunk.

Type: int

m_r

Acts as normaliser in lazy updating of m_lambda attribute.

Type: list

m_rhot

Weight assigned to the information obtained from the mini-chunk; its value is between 0 and 1.

Type: float
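The role of such a weight can be illustrated with a convex combination. This sketch assumes the usual stochastic variational update, in which the old value moves toward the estimate from the current mini-chunk by a fraction `rhot`. The helper `blend` and the numbers are made up for illustration and are not gensim code.

```python
def blend(old_value, chunk_estimate, rhot):
    """Move the old value toward the chunk estimate by a fraction `rhot`."""
    return (1.0 - rhot) * old_value + rhot * chunk_estimate


old_weight = 2.0
for rhot in (0.0, 0.25, 1.0):
    # rhot = 0 ignores the mini-chunk, rhot = 1 keeps only the mini-chunk estimate.
    print(rhot, blend(old_weight, 10.0, rhot))
```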

m_status_up_to_date

Flag indicating whether lambda and $E[\log \theta]$ are up to date: True if they have been updated, False otherwise.

Type: bool

m_timestamp

Helps to keep track and perform lazy updates on lambda.

Type: numpy.ndarray

m_updatect

Keeps track of current time and is incremented every time update_lambda() is called.

Type: int

m_var_sticks

Array of values for stick.

Type: numpy.ndarray

m_varphi_ss

Used to update top level sticks.

Type: numpy.ndarray

m_W

Length of dictionary for the input corpus.

Type: int

Parameters

corpus (iterable of list of (int, float)) – Corpus in BoW format.
id2word (Dictionary) – Dictionary for the input corpus.
max_chunks (int, optional) – Upper bound on how many chunks to process. If the corpus does not contain enough chunks, processing wraps around to the beginning of the corpus for another pass.
max_time (int, optional) – Upper bound on the time (in seconds) for which the model will be trained.
chunksize (int, optional) – Number of documents in one chunk.
kappa (float, optional) – Learning parameter which acts as an exponential decay factor to influence the extent of learning from each batch.
tau (float, optional) – Learning parameter which down-weights early iterations of documents.
K (int, optional) – Second-level truncation level.
T (int, optional) – Top-level truncation level.
alpha (int, optional) – Second-level concentration.
gamma (int, optional) – First-level concentration.
eta (float, optional) – The topic Dirichlet.
scale (float, optional) – Weights information from the mini-chunk of corpus to calculate rhot.
var_converge (float, optional) – Lower bound on the right side of convergence. Used when updating variational parameters for a single document.
outputdir (str, optional) – Stores topic and options information in the specified directory.
random_state ({None, int, array_like, RandomState}, optional) – Adds a little random jitter to randomize results around the same alpha when trying to fetch the closest corresponding LDA model from suggested_lda_model().

doc_e_step(ss, Elogsticks_1st, unique_words, doc_word_ids, doc_word_counts, var_converge)

Performs E step for a single doc.

Parameters

ss (SuffStats) – Stats for all document(s) in the chunk.
Elogsticks_1st (numpy.ndarray) – Computed Elogsticks value by stick-breaking process.
unique_words (dict of (int, int)) – Number of unique words in the chunk.
doc_word_ids (iterable of int) – Word ids for a single document.
doc_word_counts (iterable of int) – Word counts of all words in a single document.
var_converge (float) – Lower bound on the right side of convergence. Used when updating variational parameters for a single document.

Returns

Computed value of likelihood for a single document.

Return type

float

evaluate_test_corpus(corpus)

Evaluates the model on a test corpus.

Parameters: corpus (iterable of list of (int, float)) – Test corpus in BoW format.
Returns: The value of total likelihood obtained by evaluating the model for all documents in the test corpus.
Return type: float

get_topics()

Get the term-topic matrix learned during inference.

Returns: num_topics x vocabulary_size array of floats
Return type: np.ndarray

hdp_to_lda()

Get the alpha and beta values of an LDA model that is approximately equivalent to the current HDP.

Returns: Alpha and Beta arrays.
Return type: (numpy.ndarray, numpy.ndarray)

inference(chunk)

Infers the gamma value for the chunk.

Parameters: chunk (iterable of list of (int, float)) – Corpus in BoW format.
Returns: First level concentration, i.e., Gamma value.
Return type: numpy.ndarray
Raises: RuntimeError – If the model has not been trained yet.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
