models.wrappers.ldavowpalwabbit – Latent Dirichlet Allocation via Vowpal Wabbit

Python wrapper around Vowpal Wabbit’s Latent Dirichlet Allocation (LDA) implementation [1].

This uses Matt Hoffman’s online algorithm for LDA [2], i.e. the same algorithm that Gensim’s LdaModel is based on.

Note: Currently working and tested with Vowpal Wabbit versions 7.10 to 8.1.1. Vowpal Wabbit’s API isn’t currently stable, so this may or may not work with older/newer versions. The aim will be to ensure this wrapper always works with the latest release of Vowpal Wabbit.

Tested with Python 2.6, 2.7, and 3.4.

Example

>>> import gensim
>>> # train model
>>> lda = gensim.models.wrappers.LdaVowpalWabbit('/usr/local/bin/vw',
...                                              corpus=corpus,
...                                              num_topics=20,
...                                              id2word=dictionary)
>>> # update an existing model
>>> lda.update(another_corpus)
>>> # get topic probability distributions for a document
>>> print(lda[doc_bow])
>>> # print 10 topics
>>> print(lda.print_topics())
>>> # save/load the trained model:
>>> lda.save('vw_lda.model')
>>> lda = gensim.models.wrappers.LdaVowpalWabbit.load('vw_lda.model')
>>> # get bound on log perplexity for given test set
>>> print(lda.log_perplexity(test_corpus))

Vowpal Wabbit works on files, so this wrapper maintains a temporary directory while it’s around, reading/writing there as necessary.
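
The temporary-directory behaviour described above can be illustrated with a minimal sketch. The helper name working_dir is hypothetical and is not part of the wrapper; it only mirrors what the tmp_prefix and cleanup_files parameters are documented to do.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def working_dir(tmp_prefix="tmp", cleanup_files=True):
    """Create a temporary working directory and optionally delete it afterwards.

    Hypothetical helper for illustration; the wrapper's own code may differ.
    """
    path = tempfile.mkdtemp(prefix=tmp_prefix)
    try:
        yield path
    finally:
        if cleanup_files:
            shutil.rmtree(path, ignore_errors=True)


with working_dir(tmp_prefix="vw_demo_") as tmp:
    # Files exchanged with an external binary would live here.
    corpus_file = os.path.join(tmp, "corpus.vw")
    with open(corpus_file, "w") as handle:
        handle.write("| 4:7 14:1\n")
    existed_inside = os.path.exists(corpus_file)

removed_after = not os.path.exists(tmp)
```

Passing cleanup_files=False in this sketch leaves the directory in place, which is the situation the documentation suggests for debugging.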

Output from Vowpal Wabbit is logged at either INFO or DEBUG level; enable logging to view it.
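
One way to enable logging with Python's standard logging module is sketched below. It assumes that gensim names its loggers after its modules, so that the "gensim" logger is a parent of the wrapper's logger; the format string is only an example.

```python
import logging

# Attach a basic handler to the root logger
# (this call has no effect if a handler is already configured).
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s")

# Assumption: the wrapper logs under a name beginning with "gensim",
# so lowering the level here also exposes its DEBUG messages.
logging.getLogger("gensim").setLevel(logging.DEBUG)
```

Use logging.INFO instead of logging.DEBUG for less verbose output.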

[1] https://github.com/JohnLangford/vowpal_wabbit/wiki
[2] http://www.cs.princeton.edu/~mdhoffma/

class gensim.models.wrappers.ldavowpalwabbit.LdaVowpalWabbit(vw_path, corpus=None, num_topics=100, id2word=None, chunksize=256, passes=1, alpha=0.1, eta=0.1, decay=0.5, offset=1, gamma_threshold=0.001, random_seed=None, cleanup_files=True, tmp_prefix=u'tmp')

Bases: gensim.utils.SaveLoad

Class for LDA training using Vowpal Wabbit’s online LDA. Communication between Vowpal Wabbit and Python takes place by passing around data files on disk and calling the ‘vw’ binary with the subprocess module.
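
As a rough illustration of how the constructor parameters map onto a ‘vw’ command line, the sketch below assembles an argument list from the flags named in the parameter descriptions on this page. The function build_vw_lda_args is hypothetical; the actual wrapper presumably passes further arguments (such as the data file), which are omitted here.

```python
def build_vw_lda_args(vw_path, num_topics=100, chunksize=256, passes=1,
                      alpha=0.1, eta=0.1, decay=0.5, offset=1,
                      gamma_threshold=0.001, random_seed=None):
    """Return an argument list using only the flags named on this page.

    Hypothetical helper: a real invocation needs more arguments than these.
    """
    args = [
        vw_path,
        "--lda", str(num_topics),
        "--minibatch", str(chunksize),
        "--passes", str(passes),
        "--lda_alpha", str(alpha),
        "--lda_rho", str(eta),
        "--power_t", str(decay),
        "--initial_t", str(offset),
        "--epsilon", str(gamma_threshold),
    ]
    # The seed flag is only added when a seed was requested.
    if random_seed is not None:
        args += ["--random_seed", str(random_seed)]
    return args


cmd = build_vw_lda_args("/usr/local/bin/vw", num_topics=20, random_seed=42)
print(" ".join(cmd))
```

A list in this form is what the subprocess module expects as its first argument.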

vw_path is the path to Vowpal Wabbit’s ‘vw’ executable.

corpus is an iterable training corpus. If given, training will start immediately, otherwise the model is left untrained (presumably because you want to call update() manually).

num_topics is the number of requested latent topics to be extracted from the training corpus. Corresponds to VW’s ‘--lda <num_topics>’ argument.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

chunksize is the number of documents examined in each batch. Corresponds to VW’s ‘--minibatch <batch_size>’ argument.

passes is the number of passes over the dataset to use. Corresponds to VW’s ‘--passes <passes>’ argument.

alpha is a float affecting the sparsity of per-document topic weights. This is applied symmetrically, and should be set higher when documents are thought to look more similar. Corresponds to VW’s ‘--lda_alpha <alpha>’ argument.

eta is a float which affects the sparsity of topic distributions. This is applied symmetrically, and should be set higher when topics are thought to look more similar. Corresponds to VW’s ‘--lda_rho <rho>’ argument.

decay is the learning rate decay, which affects how quickly learnt values are forgotten. Should be set to a value between 0.5 and 1.0 to guarantee convergence. Corresponds to VW’s ‘--power_t <tau>’ argument.

offset is an integer learning offset; set it to higher values to slow down learning on early iterations of the algorithm. Corresponds to VW’s ‘--initial_t <tau>’ argument.

gamma_threshold affects when the learning loop is broken out of; higher values result in earlier loop completion. Corresponds to VW’s ‘--epsilon <eps>’ argument.

random_seed sets Vowpal Wabbit’s random seed when learning. Corresponds to VW’s ‘--random_seed <seed>’ argument.

cleanup_files sets whether or not to delete the temporary directory and files used by this wrapper. Setting it to False can be useful for debugging, or for re-using Vowpal Wabbit files elsewhere.

tmp_prefix is used to prefix the temporary working directory name.
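
To give a feel for the decay and offset parameters described above: in Hoffman’s online LDA algorithm, the weight given to mini-batch number t is (offset + t) raised to the power -decay. The sketch below evaluates that schedule. The function name step_size is hypothetical, and how Vowpal Wabbit applies these two flags internally is not covered by this page, so treat this as an illustration of the algorithm rather than of VW itself.

```python
def step_size(t, decay=0.5, offset=1):
    """Weight of mini-batch t (0-based) in online LDA: (offset + t) ** -decay."""
    return (offset + t) ** -decay


# A larger offset shrinks the early step sizes, i.e. slows down early learning.
for offset in (1, 64):
    rates = [round(step_size(t, decay=0.5, offset=offset), 3) for t in (0, 3, 15)]
    print(offset, rates)
```

With decay closer to 1.0 the weights fall off faster, so later mini-batches change the model less.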

get_topics()
Returns: num_topics x vocabulary_size array of floats which represents the term-topic matrix learned during inference.
Return type: np.ndarray

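A matrix of this shape can be read row by row, one row per topic. The sketch below picks the highest-weighted words from each row of a small made-up matrix. The helper top_words, the example weights, and the example vocabulary are all hypothetical; this is not how show_topic() is implemented.

```python
def top_words(topic_row, id2word, topn=10):
    """Return the topn (word, weight) pairs with the largest weights in one row."""
    ranked = sorted(enumerate(topic_row), key=lambda pair: pair[1], reverse=True)
    return [(id2word[word_id], weight) for word_id, weight in ranked[:topn]]


# Made-up 2 x 4 matrix standing in for the array returned by get_topics().
topics = [
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.60, 0.05, 0.30],
]
id2word = {0: "wabbit", 1: "topic", 2: "carrot", 3: "model"}

for topic_id, row in enumerate(topics):
    print(topic_id, top_words(row, id2word, topn=2))
```
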
classmethod load(fname, *args, **kwargs)

Load LDA model from file with given name.

log_perplexity(chunk)

Return per-word lower bound on log perplexity.

Also logs this and perplexity at INFO level.
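
This page does not state how the logged perplexity is derived from the bound. The sketch below assumes the base-2 convention that gensim’s LdaModel uses when it logs perplexity, i.e. perplexity = 2 ** (-bound); the function name is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word bound to a perplexity estimate.

    Assumption: base-2 convention, perplexity = 2 ** (-bound).
    """
    return 2.0 ** (-per_word_bound)


bound = -8.0  # made-up value for illustration
print(perplexity_from_bound(bound))
```

Under this convention, a bound closer to zero corresponds to a lower perplexity.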

print_topic(topicid, topn=10)
print_topics(num_topics=10, num_words=10)
save(fname, *args, **kwargs)

Serialise this model to file with given name.

show_topic(topicid, topn=10)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
train(corpus)

Clear any existing model state, and train on given corpus.

update(corpus)

Update existing model (if any) on corpus.

gensim.models.wrappers.ldavowpalwabbit.corpus_to_vw(corpus)

Iterate over corpus, yielding lines in Vowpal Wabbit format.

For LDA, each document occupies a single line consisting of a space-separated list of <word_id>:<count> pairs. Each line starts with a ‘|’ character.

E.g.:
| 4:7 14:1 22:8 6:3
| 14:22 22:4 0:1 1:3
| 7:2 8:2
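
The format can be reproduced with a few lines of plain Python. The generator below is a minimal sketch with a hypothetical name (bow_to_vw_lines), not the library's own code; it assumes each document is a list of (word_id, count) pairs, as in a gensim bag-of-words corpus.

```python
def bow_to_vw_lines(corpus):
    """Yield one '| id:count id:count ...' line per bag-of-words document."""
    for document in corpus:
        pairs = " ".join("{}:{}".format(word_id, count) for word_id, count in document)
        yield "| " + pairs


corpus = [
    [(4, 7), (14, 1), (22, 8), (6, 3)],
    [(14, 22), (22, 4), (0, 1), (1, 3)],
    [(7, 2), (8, 2)],
]
lines = list(bow_to_vw_lines(corpus))
print("\n".join(lines))
```
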
gensim.models.wrappers.ldavowpalwabbit.vwmodel2ldamodel(vw_model, iterations=50)

Convert a Vowpal Wabbit model to a gensim LdaModel. This works by simply copying the training model weights (alpha, beta…) from a trained vwmodel into the gensim model.

Parameters:
  • vw_model – Trained Vowpal Wabbit model.
  • iterations – Number of iterations to be used for inference of the new LdaModel.
Returns:

model_gensim – copied gensim LdaModel.

Return type:

LdaModel instance

gensim.models.wrappers.ldavowpalwabbit.write_corpus_as_vw(corpus, filename)

Iterate over corpus, writing each document as a line to given file.

Returns the number of lines written.
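
A file-writing counterpart might look like the sketch below. The function write_vw_file is a hypothetical stand-in for illustration, assuming the ‘|’-prefixed <word_id>:<count> line format described for corpus_to_vw; it is not the library's implementation.

```python
import os
import tempfile


def write_vw_file(corpus, filename):
    """Write each document as one VW-format line; return the number of lines written."""
    lines_written = 0
    with open(filename, "w") as handle:
        for document in corpus:
            pairs = " ".join("{}:{}".format(word_id, count) for word_id, count in document)
            handle.write("| " + pairs + "\n")
            lines_written += 1
    return lines_written


corpus = [[(4, 7), (14, 1)], [(7, 2), (8, 2)]]
path = os.path.join(tempfile.mkdtemp(prefix="vw_demo_"), "corpus.vw")
count = write_vw_file(corpus, path)
with open(path) as handle:
    contents = handle.read()
```

Returning the line count lets the caller know how many documents the file holds without reading it back.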