models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)

Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM).

Installation

There are two ways to obtain the DTM binary:

  1. Use precompiled binaries for your OS from /magsilva/dtm/

  2. Compile the binary manually from /blei-lab/dtm (original instructions available at https://github.com/blei-lab/dtm/blob/master/README.md), or use:

    git clone https://github.com/blei-lab/dtm.git
    sudo apt-get install libgsl0-dev
    cd dtm/dtm
    make
    

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import DtmModel
>>>
>>> path_to_dtm_binary = "/path/to/dtm/binary"
>>> model = DtmModel(
...    path_to_dtm_binary, corpus=common_corpus, id2word=common_dictionary,
...    time_slices=[1] * len(common_corpus)
... )
class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)

Bases: gensim.utils.SaveLoad

Python wrapper around the original DTM implementation.

Communication between DTM and Python takes place by passing around data files on disk and executing the DTM binary as a subprocess.

Warning

This is only a Python wrapper around the DTM implementation; you need to install the original implementation first and pass the path to its binary via dtm_path.

Parameters:
  • dtm_path (str) – Path to the dtm binary, e.g. /home/username/dtm/dtm/main.
  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
  • mode ({'fit', 'time'}, optional) – Mode of operation: ‘fit’ trains a new model, ‘time’ analyzes documents through time according to an already trained DTM (essentially a held-out set).
  • model ({'fixed', 'dtm'}, optional) – Which model to run: ‘fixed’ for DIM, ‘dtm’ for DTM.
  • num_topics (int, optional) – Number of topics.
  • id2word (Dictionary, optional) – Mapping between token ids and words from corpus; if not specified, it will be inferred from corpus.
  • prefix (str, optional) – Prefix for produced temporary files.
  • lda_sequence_min_iter (int, optional) – Minimum number of iterations in the LDA sequence.
  • lda_sequence_max_iter (int, optional) – Maximum number of iterations in the LDA sequence.
  • lda_max_em_iter (int, optional) – Maximum number of EM optimization iterations in LDA.
  • alpha (float, optional) – Hyperparameter that affects the sparsity of the document-topic distributions for the LDA models in each time slice.
  • top_chain_var (float, optional) – Hyperparameter controlling the variance of the topic chains over time; larger values allow topics to change more between consecutive time slices.
  • rng_seed (int, optional) – Random seed.
  • initialize_lda (bool, optional) – If True, initialize DTM with LDA.
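Note that time_slices gives the number of documents in each consecutive time slice, so the corpus must be sorted by time and the slice counts must sum to the corpus length. A minimal, stdlib-only sketch of constructing it from hypothetical per-document years (the values are made up for illustration):

```python
from collections import Counter

def make_time_slices(doc_years):
    """Given one timestamp per document (corpus assumed sorted by time),
    return the number of documents in each consecutive time slice."""
    counts = Counter(doc_years)
    return [counts[year] for year in sorted(counts)]

# Hypothetical per-document years; the corpus must be ordered the same way.
years = [2015, 2015, 2016, 2016, 2016, 2017]
time_slices = make_time_slices(years)  # [2, 3, 1]
assert sum(time_slices) == len(years)
```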
convert_input(corpus, time_slices)

Convert corpus into LDA-C format using BleiCorpus and save it to a temporary file; the time slices are written to the file returned by ftimeslices().

Parameters:
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
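The LDA-C format produced here stores one document per line: the number of unique terms, followed by id:count pairs. A stdlib-only sketch of that serialization (illustrative only, not the wrapper's actual I/O code):

```python
def to_ldac_line(bow):
    """Serialize one BoW document [(term_id, count), ...] into an LDA-C line:
    '<num_unique_terms> <id>:<count> <id>:<count> ...'."""
    parts = ["%d:%d" % (term_id, count) for term_id, count in bow]
    return " ".join([str(len(bow))] + parts)

doc = [(0, 2), (3, 1), (7, 4)]
to_ldac_line(doc)  # '3 0:2 3:1 7:4'
```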
dtm_coherence(time, num_words=20)

Get all topics of a particular time slice, without probability values, for use with either “u_mass” or “c_v” coherence.

Parameters:
  • num_words (int) – Number of words per topic.
  • time (int) – Timestamp (index of the time slice).
Returns:

coherence_topics – All topics of the given time slice, without probability values, ready to be passed to a coherence model.

Return type:

list of list of str

Warning

TODO: because of the current print format this can only return topics for the first time slice. Should we fix the coherence printing, or change the print statements to mirror the DTM Python implementation?

dtm_vis(corpus, time)

Get data in the format required by pyLDAvis.

Parameters:
  • corpus (iterable of iterable of (int, float)) – Collection of texts in BoW format.
  • time (int) – Index of the time slice to visualize.

Notes

All of these are needed to visualise topics for DTM for a particular time-slice via pyLDAvis.

Returns:
  • doc_topic (numpy.ndarray) – Document-topic proportions.
  • topic_term (numpy.ndarray) – Topic-term distributions in a form suitable for pyLDAvis.
  • doc_lengths (list of int) – Length of each document in corpus.
  • term_frequency (numpy.ndarray) – Frequency of each word from the vocabulary.
  • vocab (list of str) – List of words from the corpus vocabulary.
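Two of these return values, doc_lengths and term_frequency, are plain corpus statistics; a stdlib sketch of how they can be derived from a BoW corpus (illustrative, not the wrapper's internal code):

```python
def corpus_stats(corpus, vocab_size):
    """Compute doc_lengths (token count per document) and
    term_frequency (total count per term id) from a BoW corpus."""
    doc_lengths = [sum(count for _, count in doc) for doc in corpus]
    freq = [0] * vocab_size
    for doc in corpus:
        for term_id, count in doc:
            freq[term_id] += count
    return doc_lengths, freq

corpus = [[(0, 2), (1, 1)], [(1, 3)]]
corpus_stats(corpus, 2)  # ([3, 3], [2, 4])
```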
fcorpus()

Get path to corpus file.

Returns:Path to corpus file.
Return type:str
fcorpustxt()

Get path to temporary file.

Returns:Path to the temporary training corpus text file.
Return type:str
fem_steps()

Get path to temporary em_step data file.

Returns:Path to em_step data file.
Return type:str
finit_alpha()

Get path to initially trained lda alpha file.

Returns:Path to initially trained lda alpha file.
Return type:str
finit_beta()

Get path to initially trained lda beta file.

Returns:Path to initially trained lda beta file.
Return type:str
flda_ss()

Get path to initial lda binary file.

Returns:Path to initial lda binary file.
Return type:str
fout_gamma()

Get path to temporary gamma data file.

Returns:Path to gamma data file.
Return type:str
fout_influence()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
fout_liklihoods()

Get path to temporary lhood data file.

Returns:Path to lhood data file.
Return type:str
fout_observations()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
fout_prob()

Get template of path to temporary file.

Returns:Path to file.
Return type:str
foutname()

Get path to temporary file.

Returns:Path to file.
Return type:str
ftimeslices()

Get path to time slices binary file.

Returns:Path to time slices binary file.
Return type:str
classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()
Save object to file.
Returns:Object loaded from fname.
Return type:object
Raises:AttributeError – When called on an object instance instead of class (this is a class method).
print_topic(topicid, time, topn=10, num_words=None)

Get the given topic, formatted as a string.

Parameters:
  • topicid (int) – Id of topic.
  • time (int) – Timestamp.
  • topn (int, optional) – Number of the most significant words to include.
  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
Returns:

The given topic in string format, like ‘0.132*someword + 0.412*otherword + …’.

Return type:

str
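The returned string can be parsed back into (weight, word) pairs with a few lines of stdlib Python; a sketch, using the illustrative string format shown above:

```python
def parse_topic_string(topic):
    """Parse '0.132*someword + 0.412*otherword' into [(0.132, 'someword'), ...]."""
    pairs = []
    for chunk in topic.split(" + "):
        weight, word = chunk.split("*")
        pairs.append((float(weight), word))
    return pairs

parse_topic_string("0.132*someword + 0.412*otherword")
# [(0.132, 'someword'), (0.412, 'otherword')]
```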

print_topics(num_topics=10, times=5, num_words=10)

Alias for show_topics().

Parameters:
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
  • times (int, optional) – Number of time slices.
  • num_words (int, optional) – Number of words per topic.
Returns:

Topics as a list of strings

Return type:

list of str

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading, sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
show_topic(topicid, time, topn=50, num_words=None)

Get the topn most probable words for the given topicid.

Parameters:
  • topicid (int) – Id of topic.
  • time (int) – Timestamp.
  • topn (int, optional) – Number of the most significant words to include.
  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
Returns:

Sequence of probable words, as a list of (word_probability, word).

Return type:

list of (float, str)

show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)

Get the num_words most probable words for num_topics topics across times time slices.

Parameters:
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
  • times (int, optional) – Number of time slices.
  • num_words (int, optional) – Number of words per topic.
  • log (bool, optional) – THIS PARAMETER WILL BE IGNORED.
  • formatted (bool, optional) – If True - return the topics as a list of strings, otherwise as lists of (weight, word) pairs.
Returns:

  • list of str – Topics as a list of strings (if formatted=True) OR
  • list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
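The relationship between the two return shapes can be illustrated with a small stdlib formatter that renders (weight, word) pairs in the formatted=True style (a sketch, not the wrapper's own code):

```python
def format_topic(pairs):
    """Render [(weight, word), ...] in the formatted=True string style,
    e.g. '0.132*someword + 0.412*otherword'."""
    return " + ".join("%.3f*%s" % (weight, word) for weight, word in pairs)

format_topic([(0.132, "someword"), (0.412, "otherword")])
# '0.132*someword + 0.412*otherword'
```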

train(corpus, time_slices, mode, model)

Train DTM model.

Parameters:
  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
  • time_slices (list of int) – Sequence of timestamps.
  • mode ({'fit', 'time'}, optional) – Mode of operation: ‘fit’ trains a new model, ‘time’ analyzes documents through time according to an already trained DTM (essentially a held-out set).
  • model ({'fixed', 'dtm'}, optional) – Which model to run: ‘fixed’ for DIM, ‘dtm’ for DTM.