models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Dynamic Influence Models (DIM)
Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM).
There are two ways to obtain the binaries:
Use precompiled binaries for your OS version from /magsilva/dtm/
Compile binaries manually from /blei-lab/dtm (the original instructions are available at https://github.com/blei-lab/dtm/blob/master/README.md), or use the following commands:
git clone https://github.com/blei-lab/dtm.git
sudo apt-get install libgsl0-dev
cd dtm/dtm
make
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import DtmModel
>>>
>>> path_to_dtm_binary = "/path/to/dtm/binary"
>>> model = DtmModel(
... path_to_dtm_binary, corpus=common_corpus, id2word=common_dictionary,
... time_slices=[1] * len(common_corpus)
... )
gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)
Bases: gensim.utils.SaveLoad
Python wrapper around the DTM implementation.
Communication between DTM and Python takes place by passing around data files on disk and executing the DTM binary as a subprocess.
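The file-based exchange described above can be illustrated with a minimal sketch. This is not the wrapper's actual code: the helper `run_external_tool`, the file names `input.dat` and `output.dat`, and the stand-in command are all hypothetical, and a small Python one-liner plays the role of the compiled binary so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(command, input_text, workdir):
    """Write input to disk, run a subprocess on it, and read its output file back."""
    workdir = Path(workdir)
    in_path = workdir / "input.dat"    # hypothetical file name
    out_path = workdir / "output.dat"  # hypothetical file name
    in_path.write_text(input_text, encoding="utf-8")

    # The external program only sees file paths, never Python objects.
    subprocess.run(command + [str(in_path), str(out_path)], check=True)
    return out_path.read_text(encoding="utf-8")


# Stand-in for a compiled binary: upper-cases the input file into the output file.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
]

with tempfile.TemporaryDirectory() as tmp:
    result = run_external_tool(STAND_IN, "topic data", tmp)
```

Because everything passes through files, the binary can be swapped or inspected independently of Python, at the cost of extra disk I/O.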
Warning
This is only a Python wrapper for the DTM implementation; you need to install the original implementation first and pass the path to the binary as dtm_path.
dtm_path (str) – Path to the dtm binary, e.g. /home/username/dtm/dtm/main.
corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
time_slices (list of int) – Number of documents in each time slice, in chronological order; the values must sum to the number of documents in corpus.
mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM, essentially a held-out set.
model ({'fixed', 'dtm'}, optional) – Selects the model to run: ‘fixed’ is for DIM and ‘dtm’ for DTM.
num_topics (int, optional) – Number of topics.
id2word (Dictionary, optional) – Mapping between token ids and words from the corpus; if not specified, it will be inferred from the corpus.
prefix (str, optional) – Prefix for produced temporary files.
lda_sequence_min_iter (int, optional) – Minimum number of LDA iterations.
lda_sequence_max_iter (int, optional) – Maximum number of LDA iterations.
lda_max_em_iter (int, optional) – Maximum number of EM optimization iterations in LDA.
alpha (float, optional) – Hyperparameter that affects the sparsity of the document-topic distributions of the LDA models in each time slice.
top_chain_var (float, optional) – Variance of the topic chain, a hyperparameter that affects how much topics can change between time slices.
rng_seed (int, optional) – Random seed.
initialize_lda (bool, optional) – If True - initialize DTM with LDA.
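The relationship between corpus and time_slices can be checked before training. The sketch below assumes that time_slices holds the number of documents per slice, as the `[1] * len(common_corpus)` example above suggests; `check_time_slices` is a hypothetical helper, not part of the wrapper.

```python
def check_time_slices(corpus, time_slices):
    """Group documents by slice, or raise if the counts do not cover the corpus."""
    docs = list(corpus)
    if sum(time_slices) != len(docs):
        raise ValueError(
            "time_slices sums to %d but the corpus has %d documents"
            % (sum(time_slices), len(docs))
        )
    groups, start = [], 0
    for count in time_slices:
        # Documents are assumed to be sorted chronologically already.
        groups.append(docs[start:start + count])
        start += count
    return groups


toy_corpus = [[(0, 1)], [(1, 2)], [(0, 1), (2, 1)], [(2, 3)], [(1, 1)]]
slices = check_time_slices(toy_corpus, [2, 2, 1])
```

Here the first two documents form slice 0, the next two slice 1, and the last one slice 2.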
convert_input(corpus, time_slices)
Convert the corpus into LDA-C format using BleiCorpus and save it to a temporary file. The time slices are written to the temporary file whose path is returned by ftimeslices().
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
time_slices (list of int) – Number of documents in each time slice.
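As a rough illustration of the on-disk layout, the sketch below formats BoW documents as LDA-C lines and time slices as a count followed by one value per line. The helper names are hypothetical, and the exact bytes the wrapper writes may differ; this only shows the general shape of the two files.

```python
def to_ldac_line(bow):
    """Format one BoW document as '<number of distinct terms> id:count id:count ...'."""
    pairs = ["%d:%d" % (term_id, count) for term_id, count in bow]
    return " ".join([str(len(pairs))] + pairs)


def to_time_slice_lines(time_slices):
    """First line is the number of slices, then one document count per line."""
    return [str(len(time_slices))] + [str(n) for n in time_slices]


toy_corpus = [[(0, 1), (3, 2)], [(1, 4)]]
corpus_lines = [to_ldac_line(doc) for doc in toy_corpus]
slice_lines = to_time_slice_lines([1, 1])
```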
dtm_coherence(time, num_words=20)
Get all topics of a particular time slice without probability values, for use with either “u_mass” or “c_v” coherence.
num_words (int) – Number of words.
time (int) – Timestamp
coherence_topics – All topics of a particular time slice, without probability values.
list of list of str
Warning
TODO: because of print format right now can only return for 1st time-slice, should we fix the coherence printing or make changes to the print statements to mirror DTM python?
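The shape of the return value can be sketched as follows: starting from topics given as (probability, word) pairs, only the words are kept. `strip_probabilities` is a hypothetical helper used for illustration, and the input data is made up.

```python
def strip_probabilities(topics):
    """Keep only the words of each topic, dropping the probability values."""
    return [[word for _prob, word in topic] for topic in topics]


topics_with_probs = [
    [(0.132, "someword"), (0.412, "otherword")],
    [(0.250, "graph"), (0.100, "trees")],
]
coherence_topics = strip_probabilities(topics_with_probs)
```

A list of word lists like this is the form a coherence measure typically expects as its topics input.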
dtm_vis(corpus, time)
Get the data in the format expected by pyLDAvis.
corpus (iterable of iterable of (int, float)) – Collection of texts in BoW format.
time (int) – Timestamp.
Notes
All of these are needed to visualise topics for DTM for a particular time-slice via pyLDAvis.
doc_topic (numpy.ndarray) – Document-topic proportions.
topic_term (numpy.ndarray) – Topic-term distribution in the format expected by pyLDAvis.
doc_lengths (list of int) – Length of each document in the corpus.
term_frequency (numpy.ndarray) – Frequency of each word from vocab.
vocab (list of str) – List of words from the corpus.
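The corpus-level quantities in this list can be derived from a BoW corpus alone. The sketch below is one plausible way to do it in plain Python; `corpus_statistics` is a hypothetical helper, and it assumes a document's length is its total token count, which may not match how the wrapper counts.

```python
def corpus_statistics(corpus, id2word):
    """Compute document lengths, per-term frequencies, and the vocabulary list."""
    num_terms = len(id2word)
    term_frequency = [0] * num_terms
    doc_lengths = []
    for doc in corpus:
        # Assumption: length is the sum of counts, not the number of distinct terms.
        doc_lengths.append(sum(count for _term_id, count in doc))
        for term_id, count in doc:
            term_frequency[term_id] += count
    vocab = [id2word[i] for i in range(num_terms)]
    return doc_lengths, term_frequency, vocab


id2word = {0: "human", 1: "graph", 2: "trees"}
toy_corpus = [[(0, 1), (1, 2)], [(2, 3)], [(0, 1), (2, 1)]]
doc_lengths, term_frequency, vocab = corpus_statistics(toy_corpus, id2word)
```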
fcorpus()
Get path to corpus file.
Path to corpus file.
str
fcorpustxt()
Get path to temporary file.
Path to multiple train binary file.
str
fem_steps()
Get path to temporary em_step data file.
Path to em_step data file.
str
finit_alpha()
Get path to initially trained lda alpha file.
Path to initially trained lda alpha file.
str
finit_beta()
Get path to initially trained lda beta file.
Path to initially trained lda beta file.
str
flda_ss()
Get path to initial lda binary file.
Path to initial lda binary file.
str
fout_gamma()
Get path to temporary gamma data file.
Path to gamma data file.
str
fout_influence()
Get template of path to temporary file.
Path to file.
str
fout_liklihoods()
Get path to temporary lhood data file.
Path to lhood data file.
str
fout_observations()
Get template of path to temporary file.
Path to file.
str
fout_prob()
Get template of path to temporary file.
Path to file.
str
foutname()
Get path to temporary file.
Path to file.
str
ftimeslices()
Get path to time slices binary file.
Path to time slices binary file.
str
load(fname, mmap=None)
Load an object previously saved using save() from a file.
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
print_topic(topicid, time, topn=10, num_words=None)
Get the given topic, formatted as a string.
topicid (int) – Id of topic.
time (int) – Timestamp.
topn (int, optional) – Number of top words to return.
num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
The given topic in string format, like ‘0.132*someword + 0.412*otherword + …’.
str
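The string layout described above can be reproduced with a short sketch. `format_topic` is a hypothetical helper, and the three-decimal precision is an assumption based on the example string rather than a documented guarantee.

```python
def format_topic(weighted_words):
    """Join (probability, word) pairs as 'prob*word + prob*word + ...'."""
    return " + ".join("%.3f*%s" % (prob, word) for prob, word in weighted_words)


topic_string = format_topic([(0.132, "someword"), (0.412, "otherword")])
```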
print_topics(num_topics=10, times=5, num_words=10)
Alias for show_topics().
num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
times (int, optional) – Number of time slices.
num_words (int, optional) – Number of words.
Topics as a list of strings
list of str
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
show_topic(topicid, time, topn=50, num_words=None)
Get the topn most probable words for the given topicid.
topicid (int) – Id of topic.
time (int) – Timestamp.
topn (int, optional) – Number of top words to return.
num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
Sequence of probable words, as a list of (word_probability, word).
list of (float, str)
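Selecting the most probable words amounts to ranking a topic's word probabilities and keeping the first topn. The sketch below shows that idea on made-up numbers; `top_words` is a hypothetical helper and does not reflect how the wrapper stores its topic-word probabilities.

```python
def top_words(word_probs, id2word, topn=3):
    """Return the topn (probability, word) pairs with the highest probability."""
    ranked = sorted(range(len(word_probs)), key=lambda i: word_probs[i], reverse=True)
    return [(word_probs[i], id2word[i]) for i in ranked[:topn]]


id2word = {0: "human", 1: "graph", 2: "trees", 3: "system"}
best = top_words([0.1, 0.4, 0.2, 0.3], id2word, topn=2)
```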
show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)
Get the num_words most probable words for num_topics number of topics at ‘times’ time slices.
num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
times (int, optional) – Number of time slices.
num_words (int, optional) – Number of words.
log (bool, optional) – THIS PARAMETER WILL BE IGNORED.
formatted (bool, optional) – If True - return the topics as a list of strings, otherwise as lists of (weight, word) pairs.
list of str – Topics as a list of strings (if formatted=True) OR
list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
train(corpus, time_slices, mode, model)
Train DTM model.
corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
time_slices (list of int) – Number of documents in each time slice.
mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM, essentially a held-out set.
model ({'fixed', 'dtm'}, optional) – Selects the model to run: ‘fixed’ is for DIM and ‘dtm’ for DTM.