models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)


Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM).

Installation

There are two ways to obtain the binaries:

  1. Use precompiled binaries for your OS version from the magsilva/dtm repository on GitHub

  2. Compile the binaries manually from blei-lab/dtm (original instructions available at https://github.com/blei-lab/dtm/blob/master/README.md), or use the following:

    git clone https://github.com/blei-lab/dtm.git
    sudo apt-get install libgsl0-dev
    cd dtm/dtm
    make
    

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import DtmModel
>>>
>>> path_to_dtm_binary = "/path/to/dtm/binary"
>>> model = DtmModel(
...    path_to_dtm_binary, corpus=common_corpus, id2word=common_dictionary,
...    time_slices=[1] * len(common_corpus)
... )
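The time_slices argument tells DTM how many consecutive documents belong to each slice, so the corpus must be ordered chronologically and the counts must sum to the corpus length. A minimal sketch of building it from per-document years (make_time_slices is a hypothetical helper, not part of gensim):

```python
from collections import Counter

def make_time_slices(years):
    """Count documents per year. The years must be sorted ascending,
    matching the chronological corpus order DTM expects."""
    assert list(years) == sorted(years), "corpus must be ordered by time"
    counts = Counter(years)
    return [counts[y] for y in sorted(counts)]

print(make_time_slices([2019, 2019, 2020, 2020, 2020, 2021]))  # [2, 3, 1]
```

Note that sum(time_slices) must equal len(corpus), which is why the example above passes [1] * len(common_corpus) (one document per slice).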
class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)

Bases: gensim.utils.SaveLoad

Python wrapper around the original DTM implementation.

Communication between DTM and Python takes place by passing around data files on disk and executing the DTM binary as a subprocess.

Warning

This is only a Python wrapper around the original DTM implementation; you need to install that implementation first and pass the path to its binary as dtm_path.

Parameters:
  • dtm_path (str) – Path to the dtm binary, e.g. /home/username/dtm/dtm/main.
  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
  • mode ({'fit', 'time'}, optional) – Controls the mode of operation: ‘fit’ trains a model, ‘time’ analyzes documents through time according to a previously trained DTM (essentially a held-out set).
  • model ({'fixed', 'dtm'}, optional) – Controls which model will be run: ‘fixed’ for DIM, ‘dtm’ for DTM.
  • num_topics (int, optional) – Number of topics.
  • id2word (Dictionary, optional) – Mapping between token ids and words; if not specified, it will be inferred from the corpus.
  • prefix (str, optional) – Prefix for produced temporary files.
  • lda_sequence_min_iter (int, optional) – Minimum number of LDA sequence iterations.
  • lda_sequence_max_iter (int, optional) – Maximum number of LDA sequence iterations.
  • lda_max_em_iter (int, optional) – Maximum number of EM optimization iterations in LDA.
  • alpha (float, optional) – Hyperparameter that affects the sparsity of the document-topic distributions for the LDA models in each time slice.
  • top_chain_var (float, optional) – Hyperparameter controlling the variance of the topic chain across time slices, i.e. how much topics are allowed to drift over time.
  • rng_seed (int, optional) – Random seed.
  • initialize_lda (bool, optional) – If True, initialize DTM with LDA.
convert_input(corpus, time_slices)

Convert the corpus into LDA-C format via BleiCorpus and save it to a temporary file. The time-slices file is written to the path produced by ftimeslices().

Parameters:
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
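In LDA-C format, each document occupies one line holding the number of unique terms followed by term_id:count pairs. A rough sketch of that serialization (to_lda_c is an illustrative helper, not gensim code):

```python
def to_lda_c(corpus):
    """Render a BoW corpus in Blei's LDA-C format, one document per line:
    '<num_unique_terms> <term_id>:<count> ...'"""
    lines = []
    for doc in corpus:
        pairs = ["%d:%d" % (term_id, int(count)) for term_id, count in doc]
        lines.append("%d %s" % (len(doc), " ".join(pairs)))
    return "\n".join(lines)

print(to_lda_c([[(0, 1), (3, 2)], [(1, 1)]]))
# 2 0:1 3:2
# 1 1:1
```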
dtm_coherence(time, num_words=20)

Get all topics of a particular time slice, without probability values, for use with either “u_mass” or “c_v” coherence.

Parameters:
  • num_words (int) – Number of words.
  • time (int) – Timestamp
Returns:

coherence_topics – All topics of the given time slice, without probability values, ready for coherence computation.

Return type:

list of list of str

Warning

TODO: Because of the current print format, this can only return topics for the first time slice; either the coherence printing should be fixed, or the print statements changed to mirror the DTM Python implementation.
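In other words, this method reduces each topic from (probability, word) pairs to a bare word list, which is the shape a coherence measure needs. A sketch of that reduction (strip_probabilities is a hypothetical name, not gensim API):

```python
def strip_probabilities(topics):
    """Drop the probability values from DTM-style topics, keeping only
    the word lists that a coherence measure consumes."""
    return [[word for _, word in topic] for topic in topics]

topics = [[(0.4, "bank"), (0.3, "river")], [(0.5, "loan"), (0.2, "money")]]
print(strip_probabilities(topics))  # [['bank', 'river'], ['loan', 'money']]
```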

dtm_vis(corpus, time)

Get the data required by the pyLDAvis format.

Parameters:
  • corpus (iterable of iterable of (int, float)) – Collection of texts in BoW format.
  • time (int) – The time slice for which to extract the data.

Notes

All of these are needed to visualise topics for DTM for a particular time-slice via pyLDAvis.

Returns:
  • doc_topic (numpy.ndarray) – Document-topic proportions.
  • topic_term (numpy.ndarray) – Topic-term distributions, in the form pyLDAvis expects.
  • doc_lengths (list of int) – Length of each document in the corpus.
  • term_frequency (numpy.ndarray) – Frequency of each word from the vocabulary.
  • vocab (list of str) – List of words from the dictionary (the vocabulary).
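Two of these return values, doc_lengths and term_frequency, can be derived directly from the BoW corpus. A pure-Python sketch (doc_stats is a hypothetical helper; gensim returns NumPy arrays rather than plain lists):

```python
def doc_stats(corpus, vocab_size):
    """Compute the doc_lengths and term_frequency arrays that a
    pyLDAvis-style visualisation expects, from a BoW corpus."""
    doc_lengths = []
    term_frequency = [0] * vocab_size
    for doc in corpus:
        doc_lengths.append(sum(count for _, count in doc))
        for term_id, count in doc:
            term_frequency[term_id] += count
    return doc_lengths, term_frequency

corpus = [[(0, 2), (1, 1)], [(1, 3)]]
print(doc_stats(corpus, 3))  # ([3, 3], [2, 4, 0])
```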
fcorpus()

Get path to corpus file.

Returns:Path to corpus file.
Return type:str
fcorpustxt()

Get path to temporary file.

Returns:Path to the temporary plain-text corpus file.
Return type:str
fem_steps()

Get path to temporary em_step data file.

Returns:Path to em_step data file.
Return type:str
finit_alpha()

Get path to initially trained lda alpha file.

Returns:Path to initially trained lda alpha file.
Return type:str
finit_beta()

Get path to initially trained lda beta file.

Returns:Path to initially trained lda beta file.
Return type:str
flda_ss()

Get path to initial lda binary file.

Returns:Path to initial lda binary file.
Return type:str
fout_gamma()

Get path to temporary gamma data file.

Returns:Path to gamma data file.
Return type:str
fout_influence()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
fout_liklihoods()

Get path to temporary lhood data file.

Returns:Path to lhood data file.
Return type:str
fout_observations()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
fout_prob()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
foutname()

Get path to temporary file.

Returns:Path to file.
Return type:str
ftimeslices()

Get path to time slices binary file.

Returns:Path to time slices binary file.
Return type:str
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When the method is called on an instance rather than on the class.
print_topic(topicid, time, topn=10, num_words=None)

Get the given topic, formatted as a string.

Parameters:
  • topicid (int) – Id of topic.
  • time (int) – Timestamp.
  • topn (int, optional) – Number of most probable words to return for the topic.
  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
Returns:

The given topic in string format, like ‘0.132*someword + 0.412*otherword + …’.

Return type:

str
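The string is built by joining the topic’s (probability, word) pairs with ‘ + ’. A sketch of that formatting (format_topic is illustrative, and the exact precision may differ from the wrapper’s output):

```python
def format_topic(pairs):
    """Join (probability, word) pairs into gensim's topic-string style,
    e.g. '0.132*someword + 0.412*otherword'."""
    return " + ".join("%.3f*%s" % (prob, word) for prob, word in pairs)

print(format_topic([(0.132, "someword"), (0.412, "otherword")]))
# 0.132*someword + 0.412*otherword
```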

print_topics(num_topics=10, times=5, num_words=10)

Alias for show_topics().

Parameters:
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
  • times (int, optional) – Number of time slices.
  • num_words (int, optional) – Number of words per topic.
Returns:

Topics as a list of strings.

Return type:

list of str

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This avoids pickle memory errors and allows mmap’ing large arrays back efficiently on load. If a list of str, these attributes will be stored in separate files and the automatic check is not performed.
  • sep_limit (int) – Size limit (in bytes) for the automatic separation of arrays.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

show_topic(topicid, time, topn=50, num_words=None)

Get the topn most probable words for the given topicid.

Parameters:
  • topicid (int) – Id of topic.
  • time (int) – Timestamp.
  • topn (int, optional) – Number of most probable words to return for the topic.
  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
Returns:

Sequence of probable words, as a list of (word_probability, word).

Return type:

list of (float, str)

show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)

Get the num_words most probable words for num_topics topics at each of times time slices.

Parameters:
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
  • times (int, optional) – Number of time slices.
  • num_words (int, optional) – Number of words per topic.
  • log (bool, optional) – This parameter is ignored.
  • formatted (bool, optional) – If True - return the topics as a list of strings, otherwise as lists of (weight, word) pairs.
Returns:

  • list of str – Topics as a list of strings (if formatted=True) OR
  • list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)

train(corpus, time_slices, mode, model)

Train DTM model.

Parameters:
  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
  • mode ({'fit', 'time'}, optional) – Controls the mode of operation: ‘fit’ trains a model, ‘time’ analyzes documents through time according to a previously trained DTM (essentially a held-out set).
  • model ({'fixed', 'dtm'}, optional) – Controls which model will be run: ‘fixed’ for DIM, ‘dtm’ for DTM.