
models.wrappers.ldavowpalwabbit – Latent Dirichlet Allocation via Vowpal Wabbit

Python wrapper for Vowpal Wabbit’s Latent Dirichlet Allocation.

This uses Matt Hoffman’s online algorithm, i.e. the same algorithm that Gensim’s LdaModel is based on.

Installation

Follow the official installation guide, or build from source as follows:

git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make
make test
sudo make install

Warning

Currently working and tested with Vowpal Wabbit versions 7.10 to 8.1.1. Vowpal Wabbit’s API isn’t currently stable, so this wrapper may or may not work with older or newer versions. The aim is to keep this wrapper working with the latest release of Vowpal Wabbit.

Examples

Train model

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import LdaVowpalWabbit
>>>
>>> path_to_vw_binary = "/path/to/vw/binary"
>>> model = LdaVowpalWabbit(path_to_vw_binary, corpus=common_corpus, num_topics=20, id2word=common_dictionary)

Update existing model

>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> model.update(another_corpus)

Get topic probability distributions for a document

>>> document_bow = [(1, 1)]
>>> print(model[document_bow])

Print topics

>>> print(model.print_topics())

Save/load the trained model

>>> from gensim.test.utils import get_tmpfile
>>>
>>> temp_path = get_tmpfile("vw_lda.model")
>>> model.save(temp_path)
>>>
>>> loaded_lda = LdaVowpalWabbit.load(temp_path)

Calculate log-perplexity on a given corpus

>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> print(model.log_perplexity(another_corpus))

Vowpal Wabbit works on files, so this wrapper maintains a temporary directory while it’s around, reading/writing there as necessary.
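
For example, to keep those temporary files around for debugging, pass cleanup_files=False when constructing the model (a sketch reusing the names from the Examples above):

>>> debug_model = LdaVowpalWabbit(path_to_vw_binary, corpus=common_corpus, num_topics=20,
...                               id2word=common_dictionary, cleanup_files=False)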

class gensim.models.wrappers.ldavowpalwabbit.LdaVowpalWabbit(vw_path, corpus=None, num_topics=100, id2word=None, chunksize=256, passes=1, alpha=0.1, eta=0.1, decay=0.5, offset=1, gamma_threshold=0.001, random_seed=None, cleanup_files=True, tmp_prefix='tmp')

Bases: gensim.utils.SaveLoad

Python wrapper using Vowpal Wabbit’s online LDA.

Communication between Vowpal Wabbit and Python takes place by passing around data files on disk and calling the ‘vw’ binary with the subprocess module.

Warning

This is only a Python wrapper around Vowpal Wabbit’s online LDA; you need to install the original implementation first and pass the path to its binary as vw_path.

Parameters
  • vw_path (str) – Path to Vowpal Wabbit’s binary.

  • corpus (iterable of list of (int, int), optional) – Collection of texts in BoW format. If given, training will start immediately, otherwise, you should call train() or update() manually for training.

  • num_topics (int, optional) – Number of requested latent topics to be extracted from the training corpus. Corresponds to VW’s --lda <num_topics> argument.

  • id2word (Dictionary, optional) – Mapping from word ids (integers) to words (strings).

  • chunksize (int, optional) – Number of documents examined in each batch. Corresponds to VW’s --minibatch <batch_size> argument.

  • passes (int, optional) – Number of passes over the dataset to use. Corresponds to VW’s --passes <passes> argument.

  • alpha (float, optional) – Float affecting the sparsity of per-document topic weights. This is applied symmetrically, and should be set higher when documents are thought to look more similar. Corresponds to VW’s --lda_alpha <alpha> argument.

  • eta (float, optional) – Affects the sparsity of topic distributions. This is applied symmetrically, and should be set higher when topics are thought to look more similar. Corresponds to VW’s --lda_rho <rho> argument.

  • decay (float, optional) – Learning rate decay, affects how quickly learnt values are forgotten. Should be set to a value between 0.5 and 1.0 to guarantee convergence. Corresponds to VW’s --power_t <tau> argument.

  • offset (int, optional) – Learning offset, set to higher values to slow down learning on early iterations of the algorithm. Corresponds to VW’s --initial_t <tau> argument.

  • gamma_threshold (float, optional) – Convergence threshold that determines when the per-document inference loop is broken out of; higher values result in earlier loop completion. Corresponds to VW’s --epsilon <eps> argument.

  • random_seed (int, optional) – Sets random seed when learning. Corresponds to VW’s --random_seed <seed> argument.

  • cleanup_files (bool, optional) – Whether or not to delete temporary directory and files used by this wrapper. Setting to False can be useful for debugging, or for re-using Vowpal Wabbit files elsewhere.

  • tmp_prefix (str, optional) – Prefix for the temporary working directory name.

get_topics()

Get topics X words matrix.

Returns

A num_topics x vocabulary_size array of floats which represents the learned term-topic matrix.

Return type

numpy.ndarray
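
For instance, continuing from the model trained in the Examples above (a sketch; the actual shape depends on num_topics and the size of id2word):

>>> topic_word = model.get_topics()
>>> print(topic_word.shape)  # (num_topics, vocabulary_size)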

classmethod load(fname, *args, **kwargs)

Load model from fname.

Parameters

fname (str) – Path to file with LdaVowpalWabbit.

log_perplexity(chunk)

Get per-word lower bound on log perplexity.

Parameters

chunk (iterable of list of (int, int)) – Collection of texts in BoW format.

Returns

bound – Per-word lower bound on log perplexity.

Return type

float

print_topic(topicid, topn=10)

Get text representation of topic.

Parameters
  • topicid (int) – Id of topic.

  • topn (int, optional) – Number of the most probable words to include in the topic representation.

Returns

Topic topicid in text representation.

Return type

str
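
For example, a sketch printing the five most probable words of the first topic of the model trained above:

>>> print(model.print_topic(0, topn=5))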

print_topics(num_topics=10, num_words=10)

Alias for show_topics().

Parameters
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.

  • num_words (int, optional) – Number of words per topic.

Returns

Topics as a list of strings

Return type

list of str

save(fname, *args, **kwargs)

Save model to file.

Parameters

fname (str) – Path to output file.

show_topic(topicid, topn=10)

Get the topn most probable words for the given topicid.

Parameters
  • topicid (int) – Id of topic.

  • topn (int, optional) – Number of the most probable words to return.

Returns

Sequence of probable words, as a list of (word, word_probability) for topicid topic.

Return type

list of (str, float)
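
A usage sketch, again continuing from the model trained in the Examples above:

>>> for word, prob in model.show_topic(0, topn=5):
...     print(word, prob)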

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

Get the num_words most probable words for num_topics number of topics.

Parameters
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.

  • num_words (int, optional) – Number of words per topic.

  • log (bool, optional) – If True, additionally log the topics via the module logger.

  • formatted (bool, optional) – If True, return the topics as a list of formatted strings, otherwise as lists of (weight, word) pairs.

Returns

  • list of str – Topics as a list of strings (if formatted=True) OR

  • list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
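
For instance, a sketch listing all topics as (weight, word) pairs:

>>> for topic in model.show_topics(num_topics=-1, num_words=5, formatted=False):
...     print(topic)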

train(corpus)

Clear any existing model state, and train on given corpus.

Parameters

corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
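
Unlike update(), this discards any previously learned state before training. A sketch with a small hypothetical BoW corpus:

>>> fresh_corpus = [[(0, 2), (3, 1)], [(2, 1), (4, 2)]]  # hypothetical corpus
>>> model.train(fresh_corpus)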

update(corpus)

Update existing model with corpus.

Parameters

corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

gensim.models.wrappers.ldavowpalwabbit.corpus_to_vw(corpus)

Convert corpus to Vowpal Wabbit format.

Parameters

corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

Notes

Vowpal Wabbit format

| 4:7 14:1 22:8 6:3
| 14:22 22:4 0:1 1:3
| 7:2 8:2
Yields

str – Corpus in Vowpal Wabbit format, line by line.
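
For example, a sketch converting a small BoW corpus into the line format shown above:

>>> from gensim.models.wrappers.ldavowpalwabbit import corpus_to_vw
>>>
>>> small_corpus = [[(4, 7), (14, 1)], [(7, 2), (8, 2)]]
>>> for line in corpus_to_vw(small_corpus):
...     print(line)
| 4:7 14:1
| 7:2 8:2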

gensim.models.wrappers.ldavowpalwabbit.vwmodel2ldamodel(vw_model, iterations=50)

Convert LdaVowpalWabbit to LdaModel.

This works by simply copying the trained model weights (alpha, beta, …) from vw_model into the Gensim model.

Parameters
  • vw_model (LdaVowpalWabbit) – Trained Vowpal Wabbit model.

  • iterations (int) – Number of iterations to be used for inference of the new LdaModel.

Returns

Gensim native LDA.

Return type

LdaModel
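
A sketch converting the wrapper trained in the Examples above into a native LdaModel, so that inference no longer needs the vw binary:

>>> from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
>>>
>>> lda_model = vwmodel2ldamodel(model, iterations=50)
>>> print(lda_model[document_bow])  # document_bow as defined in the Examples above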

gensim.models.wrappers.ldavowpalwabbit.write_corpus_as_vw(corpus, filename)

Convert corpus to Vowpal Wabbit format and save it to filename.

Parameters
  • corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

  • filename (str) – Path to output file.

Returns

Number of lines in filename.

Return type

int
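
A usage sketch with the test corpus and a temporary output path:

>>> from gensim.test.utils import common_corpus, get_tmpfile
>>> from gensim.models.wrappers.ldavowpalwabbit import write_corpus_as_vw
>>>
>>> corpus_path = get_tmpfile("corpus.vw")
>>> num_lines = write_corpus_as_vw(common_corpus, corpus_path)  # returns the number of lines written
>>> print(num_lines)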