models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Document Influence Model (DIM)

Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM) [1].

This module allows for DTM and DIM model estimation from a training corpus.

Example:

>>> from gensim.models.wrappers import DtmModel
>>> model = DtmModel('dtm-win64.exe', my_corpus, my_timeslices, num_topics=20, id2word=dictionary)

[1] https://github.com/magsilva/dtm/tree/master/bin
class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)

Bases: gensim.utils.SaveLoad

Class for DTM training using the DTM binary. Communication between DTM and Python takes place by passing data files around on disk and executing the DTM binary as a subprocess.
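The file-plus-subprocess pattern described above can be sketched in a few lines. This is only an illustration of the general idea, not the wrapper's actual code: `run_external_tool` and the file names are hypothetical, and the Python interpreter stands in for the DTM binary so that the sketch runs without DTM installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(command, input_text, workdir):
    """Write input to disk, run a child process, then read its output file.

    Hypothetical helper: the real wrapper uses its own file layout and flags.
    """
    workdir = Path(workdir)
    in_path = workdir / "input.txt"
    out_path = workdir / "output.txt"
    in_path.write_text(input_text, encoding="utf-8")
    # The child process receives both paths as command-line arguments.
    subprocess.run(command + [str(in_path), str(out_path)], check=True)
    return out_path.read_text(encoding="utf-8")


# Stand-in for a real binary: a one-liner that upper-cases the input file.
stand_in = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
]

with tempfile.TemporaryDirectory() as tmp:
    result = run_external_tool(stand_in, "topic model input", tmp)
```

Passing `check=True` makes a non-zero exit status of the child raise an exception instead of silently producing no output file.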

dtm_path is the path to the dtm executable, e.g. C:/dtm/dtm-win64.exe.

corpus is a gensim corpus, i.e. a stream of sparse document vectors.

id2word is a mapping between token ids and tokens.

mode controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM (essentially a held-out set).

model controls the choice of model. ‘fixed’ is for DIM and ‘dtm’ for DTM.

lda_sequence_min_iter is the minimum number of LDA iterations.

lda_sequence_max_iter is the maximum number of LDA iterations.

lda_max_em_iter is the maximum number of EM optimization iterations in LDA.

alpha is a hyperparameter that affects sparsity of the document-topics for the LDA models in each timeslice.

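top_chain_var is a hyperparameter that affects how quickly topics evolve over time; a smaller value keeps a topic's word distribution more similar across consecutive time slices.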

rng_seed is the random seed.

initialize_lda, if True, initializes DTM with LDA.
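To make the input parameters concrete, the following sketch builds a toy corpus, id2word mapping and time_slices list by hand, using only the standard library. The data is invented, and the reading of time_slices as "number of documents in each consecutive slice, summing to the corpus length" is an assumption, since the parameter is not described above.

```python
from collections import Counter

# Tokenized toy documents, ordered chronologically.
docs = [
    ["topic", "model", "topic"],  # belongs to slice 0
    ["dynamic", "topic"],         # belongs to slice 0
    ["influence", "model"],       # belongs to slice 1
]

# id2word: integer token id -> token.
vocab = sorted({tok for doc in docs for tok in doc})
id2word = dict(enumerate(vocab))
word2id = {tok: i for i, tok in id2word.items()}

# corpus: each document as a sorted list of (token_id, count) pairs.
corpus = [sorted(Counter(word2id[t] for t in doc).items()) for doc in docs]

# time_slices (assumed meaning): document counts per consecutive slice.
time_slices = [2, 1]
assert sum(time_slices) == len(corpus)
```

In practice a gensim Dictionary and its bag-of-words conversion would normally produce id2word and corpus; the hand-rolled version here only shows the shape of the data.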

convert_input(corpus, time_slices)

Serialize documents in LDA-C format to a temporary text file.
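For orientation, an LDA-C file stores one document per line: the number of distinct terms, followed by space-separated id:count pairs. The helper below is a hypothetical sketch of that line format only; it is not the wrapper's own serialization code.

```python
def to_ldac_line(doc):
    """Format one bag-of-words document, a list of (term_id, count), as an LDA-C line."""
    pairs = " ".join(f"{term_id}:{count}" for term_id, count in doc)
    # An empty document becomes "0" with no trailing space.
    return f"{len(doc)} {pairs}".rstrip()


line = to_ldac_line([(2, 1), (3, 2)])
```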

dtm_coherence(time, num_words=20)

Return all topics of a particular time slice, without probability values, for use with either “u_mass” or “c_v” coherence.

Todo: because of the current print format, this can only return topics for the first time slice. Should we fix the coherence printing, or change the print statements to mirror DTM Python?
dtm_vis(corpus, time)

Return term_frequency, vocab, doc_lengths, topic-term distributions and doc-topic distributions, in the format expected by pyLDAvis. All of these are needed to visualise the DTM topics for a particular time slice via pyLDAvis. The time parameter selects the time slice (for example, the year) to visualise.
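Two of these quantities, doc_lengths and term_frequency, depend only on the corpus. The sketch below shows one way they could be derived from a bag-of-words corpus; `corpus_stats` is a hypothetical helper, and defining document length as the total token count is an assumption rather than a statement about what dtm_vis computes.

```python
from collections import Counter


def corpus_stats(corpus, vocab_size):
    """Return (doc_lengths, term_frequency) for a bag-of-words corpus."""
    # Assumed definition: document length = total number of tokens.
    doc_lengths = [sum(count for _, count in doc) for doc in corpus]
    totals = Counter()
    for doc in corpus:
        for term_id, count in doc:
            totals[term_id] += count
    # One entry per vocabulary id, zero for terms that never occur.
    term_frequency = [totals[i] for i in range(vocab_size)]
    return doc_lengths, term_frequency


toy_corpus = [[(2, 1), (3, 2)], [(0, 1), (3, 1)], [(1, 1), (2, 1)]]
doc_lengths, term_frequency = corpus_stats(toy_corpus, vocab_size=4)
```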

fcorpus()
fcorpustxt()
fem_steps()
finit_alpha()
finit_beta()
flda_ss()
fout_gamma()
fout_influence()
fout_liklihoods()
fout_observations()
fout_prob()
foutname()
ftimeslices()
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

print_topic(topicid, time, topn=10, num_words=None)

Return the given topic, formatted as a string.

print_topics(num_topics=10, times=5, num_words=10)
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

show_topic(topicid, time, topn=50, num_words=None)

Return num_words most probable words for the given topicid, as a list of (word_probability, word) 2-tuples.

show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)

Print the num_words most probable words for num_topics topics at ‘times’ time slices. Set num_topics=-1 to print all topics.

Set formatted=True to return the topics as a list of strings, or False as lists of (weight, word) pairs.
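The difference between the two return shapes can be illustrated with a small formatter that turns a list of (weight, word) pairs into a single string. `format_topic` is a hypothetical helper, and the "weight*word + ..." layout is assumed here for illustration; the exact string produced by show_topics may differ.

```python
def format_topic(pairs):
    """Join (weight, word) pairs into one human-readable topic string."""
    return " + ".join(f"{weight:.3f}*{word}" for weight, word in pairs)


pairs = [(0.031, "topic"), (0.02, "model")]  # unformatted shape
text = format_topic(pairs)                   # formatted shape
```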

train(corpus, time_slices, mode, model)

Train DTM model using specified corpus and time slices.