models.ldamallet – Latent Dirichlet Allocation via Mallet

Python wrapper for Latent Dirichlet Allocation (LDA) from MALLET, the Java topic modelling toolkit [1].

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed Gibbs sampling from MALLET.

MALLET's LDA training requires O(#corpus_words) memory, keeping the entire corpus in RAM. If you find yourself running out of memory, either decrease the workers constructor parameter, or use gensim's LdaModel, which needs only O(1) memory.
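
A minimal sketch of that streamed alternative, reusing the my_corpus and dictionary placeholders from the example below:

>>> lda = gensim.models.LdaModel(my_corpus, num_topics=20, id2word=dictionary)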

The wrapped model can NOT be updated with new documents for online training – use gensim’s LdaModel for that.

Example:

>>> model = gensim.models.LdaMallet('/Users/kofola/mallet-2.0.7/bin/mallet', corpus=my_corpus, num_topics=20, id2word=dictionary)
>>> print(model[my_vector])  # print the LDA topics of a document
[1] http://mallet.cs.umass.edu/

class gensim.models.ldamallet.LdaMallet(mallet_path, corpus=None, num_topics=100, id2word=None, workers=4, prefix=None, optimize_interval=0, iterations=1000)

Bases: gensim.utils.SaveLoad

Class for LDA training using MALLET. Communication between MALLET and Python takes place by passing around data files on disk and calling Java with subprocess.call().

mallet_path is the path to the MALLET executable, e.g. /home/kofola/mallet-2.0.7/bin/mallet.
corpus is a gensim corpus, i.e. a stream of sparse document vectors.
id2word is a mapping between token IDs and tokens.
workers is the number of threads to use for parallel training.
prefix is the string prefix under which all data files will be stored; default: system temp + random filename prefix.
optimize_interval optimizes hyperparameters every N iterations (this sometimes leads to a Java exception; set to 0 to switch off hyperparameter optimization).
iterations is the number of sampling iterations.
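
For example, a model written under an explicit file prefix with periodic hyperparameter optimization might be constructed like this (the prefix path, corpus and dictionary are placeholders):

>>> model = gensim.models.ldamallet.LdaMallet(
...     '/home/kofola/mallet-2.0.7/bin/mallet', corpus=my_corpus,
...     num_topics=100, id2word=dictionary, workers=4,
...     prefix='/tmp/my_model_', optimize_interval=10, iterations=1000)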

convert_input(corpus, infer=False)

Serialize documents (lists of unicode tokens) to a temporary text file, then convert that text file into a MALLET-format outfile.

Helper methods that return the paths of the MALLET data files stored under prefix:

fcorpusmallet()
fcorpustxt()
fdoctopics()
finferencer()
fstate()
ftopickeys()
fwordweights()
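
Each of these returns a data file path derived from prefix; for instance (the exact filename depends on the prefix):

>>> model.fdoctopics()  # path of MALLET's per-document topic output, e.g. '/tmp/045bfe_doctopics.txt'
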
classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
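
For example, loading a model saved earlier while memory-mapping its large arrays (the filename is illustrative):

>>> model = gensim.models.ldamallet.LdaMallet.load('/tmp/mallet_lda.model', mmap='r')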

load_word_topics()
print_topic(topicid, topn=10)
print_topics(num_topics=10, num_words=10)
save(fname, separately=None, sep_limit=10485760, ignore=frozenset([]))

Save the object to file (also see load).

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.
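
A minimal example (the filename is illustrative); large arrays are detected and stored into separate files automatically:

>>> model.save('/tmp/mallet_lda.model')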

show_topic(topicid, topn=10)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)

Return the num_words most probable words for each of num_topics topics. Set num_topics=-1 to print all topics.

If formatted=True, return the topics as a list of formatted strings; if formatted=False, return lists of (weight, word) pairs.
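
For example, to get the ten most probable words of five topics as (weight, word) pairs instead of formatted strings:

>>> topics = model.show_topics(num_topics=5, num_words=10, formatted=False)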

train(corpus)

Train the model on corpus. The constructor calls this automatically when a corpus is passed to it; call it explicitly when the model was constructed without one.
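
A two-step sketch equivalent to passing the corpus to the constructor:

>>> model = gensim.models.LdaMallet('/Users/kofola/mallet-2.0.7/bin/mallet', num_topics=20, id2word=dictionary)
>>> model.train(my_corpus)
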
gensim.models.ldamallet.read_doctopics(fname, eps=1e-06)

Yield document topic vectors from MALLET’s “doc-topics” format, as sparse gensim vectors.
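
For example, to stream back the topic distributions MALLET inferred for the training corpus (reusing the trained model and its fdoctopics() output file from above):

>>> for doc in gensim.models.ldamallet.read_doctopics(model.fdoctopics()):
...     print(doc)  # a sparse vector: list of (topic_id, weight) pairs, weights below eps omitted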