
models.hdpmodel – Hierarchical Dirichlet Process

This module encapsulates functionality for the online Hierarchical Dirichlet Process algorithm.

It allows both model estimation from a training corpus and inference of topic distribution on new, unseen documents.

The core estimation code is directly adapted from the onlinehdp.py script by C. Wang; see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process”, JMLR (2011).

http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf

The algorithm:

  • is streamed: training documents arrive sequentially, with no random access required,
  • runs in constant memory w.r.t. the number of documents: the size of the training corpus does not affect the memory footprint.
class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None, random_state=None)

Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel

The constructor estimates Hierarchical Dirichlet Process model parameters based on a training corpus:

>>> hdp = HdpModel(corpus, id2word)

You can infer topic distributions on new, unseen documents with:

>>> doc_hdp = hdp[doc_bow]

Inference on new documents is based on the approximately LDA-equivalent topics.

To print 20 topics with the top 10 most probable words:

>>> hdp.print_topics(num_topics=20, num_words=10)

Model persistence is achieved through the load/save methods.

  • gamma: first-level concentration
  • alpha: second-level concentration
  • eta: the topic Dirichlet
  • T: top-level truncation level
  • K: second-level truncation level
  • kappa: learning rate
  • tau: slow-down parameter
  • max_time: stop training after this many seconds
  • max_chunks: stop after having processed this many chunks (wrap around corpus beginning in another corpus pass, if there are not enough chunks in the corpus)

doc_e_step(ss, Elogsticks_1st, unique_words, doc_word_ids, doc_word_counts, var_converge)

Perform the E-step of variational inference for a single document.

evaluate_test_corpus(corpus)
get_topics()
Returns:num_topics x vocabulary_size array of floats which represents the term topic matrix learned during inference.
Return type:np.ndarray
hdp_to_lda()

Compute the LDA that is almost equivalent to the trained HDP model.

inference(chunk)
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When the method is called on an instance (it should be called on the class).
optimal_ordering()

Reorder the topics.

print_topic(topicno, topn=10)

Get a single topic as a formatted string.

Parameters:
  • topicno (int) – Topic id.
  • topn (int) – Number of words from topic that will be used.
Returns:

String representation of topic, like ‘-0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘.

Return type:

str

print_topics(num_topics=20, num_words=10)

Get the most significant topics (alias for show_topics() method).

Parameters:
  • num_topics (int, optional) – The number of topics to be selected; if -1, all topics will be in the result (ordered by significance).
  • num_words (int, optional) – The number of words to be included per topic (ordered by significance).
Returns:

Sequence with (topic_id, [(word, value), … ]).

Return type:

list of (int, list of (str, float))

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

save_options()

Legacy method; use self.save() instead.

save_topics(doc_count=None)

Legacy method; use self.save() instead.

show_topic(topic_id, topn=20, log=False, formatted=False, num_words=None)

Print the num_words most probable words for topic topic_id.

Set formatted=True to return the topics as a list of strings, or False as lists of (weight, word) pairs.

show_topics(num_topics=20, num_words=20, log=False, formatted=True)

Print the num_words most probable words for num_topics number of topics. Set num_topics=-1 to print all topics.

Set formatted=True to return the topics as a list of strings, or False as lists of (weight, word) pairs.

suggested_lda_model()

Returns the closest corresponding LdaModel object for the current HDP model. The hdp_to_lda method only returns the corresponding alpha and beta values, while this method returns a trained LdaModel. num_topics is set to m_T (default 150) so as to preserve the matrix shapes when alpha and beta are assigned.

update(corpus)
update_chunk(chunk, update=True, opt_o=True)
update_expectations()

Since we’re doing lazy updates on lambda, at any given moment the current state of lambda may not be accurate. This function updates all of the elements of lambda and Elogbeta so that if (for example) we want to print out the topics we’ve learned we’ll get the correct behavior.

update_finished(start_time, chunks_processed, docs_processed)
update_lambda(sstats, word_list, opt_o)
class gensim.models.hdpmodel.HdpTopicFormatter(dictionary=None, topic_data=None, topic_file=None, style=None)

Bases: object

STYLE_GENSIM = 1
STYLE_PRETTY = 2
format_topic(topic_id, topic_terms)
print_topic(topic_id, topn=None, num_words=None)
print_topics(num_topics=10, num_words=10)
show_topic(topic_id, topn=20, log=False, formatted=False, num_words=None)
show_topic_terms(topic_data, num_words)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
class gensim.models.hdpmodel.SuffStats(T, Wt, Dt)

Bases: object

set_zero()
gensim.models.hdpmodel.expect_log_sticks(sticks)

For the stick-breaking representation of the HDP, return E[log(sticks)].
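The expectation follows the standard stick-breaking identities: for stick weights W_k ~ Beta(a_k, b_k), E[log W_k] = psi(a_k) - psi(a_k + b_k) and E[log(1 - W_k)] = psi(b_k) - psi(a_k + b_k), so E[log sigma_k] = E[log W_k] + sum over l < k of E[log(1 - W_l)]. A minimal NumPy sketch of this formula (mirroring the math, not necessarily gensim's exact implementation):

```python
import numpy as np
from scipy.special import psi  # the digamma function


def expect_log_sticks(sticks):
    """E[log(sticks)] for a stick-breaking construction.

    `sticks` is a 2 x (T-1) array of Beta(a, b) parameters:
    row 0 holds the a parameters, row 1 the b parameters.
    Returns a length-T vector of E[log sigma_k].
    """
    dig_sum = psi(np.sum(sticks, 0))
    ElogW = psi(sticks[0]) - dig_sum    # E[log W_k]
    Elog1_W = psi(sticks[1]) - dig_sum  # E[log (1 - W_k)]

    n = len(sticks[0]) + 1
    Elogsticks = np.zeros(n)
    Elogsticks[0:n - 1] = ElogW
    # The last stick takes all remaining mass; every stick k accumulates
    # the E[log(1 - W_l)] terms of the sticks broken before it.
    Elogsticks[1:] = Elogsticks[1:] + np.cumsum(Elog1_W)
    return Elogsticks


# Uniform Beta(1, 1) sticks: each E[log W] = psi(1) - psi(2) = -1,
# so the result is approximately [-1, -2, -3, -3].
print(expect_log_sticks(np.ones((2, 3))))
```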

gensim.models.hdpmodel.lda_e_step(doc_word_ids, doc_word_counts, alpha, beta, max_iter=100)