models.wrappers.ldavowpalwabbit – Latent Dirichlet Allocation via Vowpal Wabbit

Python wrapper for Vowpal Wabbit's Latent Dirichlet Allocation.

This uses Matt Hoffman's online algorithm, i.e. the same algorithm that Gensim's LdaModel is based on.
To install Vowpal Wabbit, use the official guide, or build it from source:
git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make
make test
sudo make install
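If the build succeeded, a quick sanity check from Python (a minimal sketch; it assumes the vw binary is on your PATH after make install):

>>> import subprocess
>>>
>>> # Print the installed Vowpal Wabbit version; any output confirms the binary runs.
>>> print(subprocess.check_output(["vw", "--version"]).decode().strip())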
Warning
Currently working and tested with Vowpal Wabbit versions 7.10 to 8.1.1. Vowpal Wabbit’s API isn’t currently stable, so this may or may not work with older/newer versions. The aim will be to ensure this wrapper always works with the latest release of Vowpal Wabbit.
Examples
Train model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import LdaVowpalWabbit
>>>
>>> path_to_wv_binary = "/path/to/vw/binary"
>>> model = LdaVowpalWabbit(path_to_wv_binary, corpus=common_corpus, num_topics=20, id2word=common_dictionary)
Update existing model
>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> model.update(another_corpus)
Get topic probability distributions for a document
>>> document_bow = [(1, 1)]
>>> print(model[document_bow])
Print topics
>>> print(model.print_topics())
Save/load the trained model
>>> from gensim.test.utils import get_tmpfile
>>>
>>> temp_path = get_tmpfile("vw_lda.model")
>>> model.save(temp_path)
>>>
>>> loaded_lda = LdaVowpalWabbit.load(temp_path)
Calculate log-perplexity on the given corpus
>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> print(model.log_perplexity(another_corpus))
Vowpal Wabbit works on files, so this wrapper maintains a temporary directory while it’s around, reading/writing there as necessary.
class gensim.models.wrappers.ldavowpalwabbit.LdaVowpalWabbit(vw_path, corpus=None, num_topics=100, id2word=None, chunksize=256, passes=1, alpha=0.1, eta=0.1, decay=0.5, offset=1, gamma_threshold=0.001, random_seed=None, cleanup_files=True, tmp_prefix='tmp')

Bases: gensim.utils.SaveLoad
Python wrapper using Vowpal Wabbit’s online LDA.
Communication between Vowpal Wabbit and Python takes place by passing around data files on disk and calling the ‘vw’ binary with the subprocess module.
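For illustration, the round trip looks roughly like this (a minimal sketch, not the wrapper's actual code; the temporary paths and the exact flag set shown are assumptions):

>>> import subprocess
>>>
>>> # Hypothetical sketch: write one document per line in VW format,
>>> # then invoke the binary and let it write its model files back to disk.
>>> with open("/tmp/corpus.vw", "w") as f:
...     f.write("| 1:1 2:1\n")
>>> cmd = ["/path/to/vw/binary", "--lda", "20", "--minibatch", "256",
...        "--passes", "1", "--readable_model", "/tmp/topics.txt", "/tmp/corpus.vw"]
>>> subprocess.check_call(cmd)  # the wrapper then parses the files VW wrote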
Warning
This is only a Python wrapper for Vowpal Wabbit's online LDA; you need to install the original implementation first and pass the path to its binary via vw_path.
Parameters:

vw_path (str) – Path to Vowpal Wabbit's binary.
corpus (iterable of list of (int, int), optional) – Collection of texts in BoW format. If given, training will start immediately; otherwise, call train() or update() manually to train.
num_topics (int, optional) – Number of requested latent topics to be extracted from the training corpus. Corresponds to VW's --lda <num_topics> argument.
id2word (Dictionary, optional) – Mapping from word ids (integers) to words (strings).
chunksize (int, optional) – Number of documents examined in each batch. Corresponds to VW's --minibatch <batch_size> argument.
passes (int, optional) – Number of passes over the dataset to use. Corresponds to VW's --passes <passes> argument.
alpha (float, optional) – Float affecting the sparsity of per-document topic weights. This is applied symmetrically, and should be set higher when documents are thought to look more similar. Corresponds to VW's --lda_alpha <alpha> argument.
eta (float, optional) – Affects the sparsity of topic distributions. This is applied symmetrically, and should be set higher when topics are thought to look more similar. Corresponds to VW's --lda_rho <rho> argument.
decay (float, optional) – Learning rate decay; affects how quickly learnt values are forgotten. Should be set to a value between 0.5 and 1.0 to guarantee convergence. Corresponds to VW's --power_t <tau> argument.
offset (int, optional) – Learning offset; set to higher values to slow down learning on early iterations of the algorithm. Corresponds to VW's --initial_t <tau> argument.
gamma_threshold (float, optional) – Affects when the learning loop is exited; higher values result in earlier loop completion. Corresponds to VW's --epsilon <eps> argument.
random_seed (int, optional) – Random seed used when learning. Corresponds to VW's --random_seed <seed> argument.
cleanup_files (bool, optional) – Whether to delete the temporary directory and files used by this wrapper. Setting this to False can be useful for debugging, or for re-using Vowpal Wabbit files elsewhere.
tmp_prefix (str, optional) – Prefix for the temporary working directory name.
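To make the mapping to VW's command line concrete, a constructor call like the following (illustrative values; the binary path is a placeholder) translates into --lda 50 --minibatch 128 --passes 2 --lda_alpha 0.1 --lda_rho 0.1 when the vw binary is invoked:

>>> model = LdaVowpalWabbit(
...     "/path/to/vw/binary",
...     corpus=common_corpus,
...     id2word=common_dictionary,
...     num_topics=50,   # --lda 50
...     chunksize=128,   # --minibatch 128
...     passes=2,        # --passes 2
...     alpha=0.1,       # --lda_alpha 0.1
...     eta=0.1,         # --lda_rho 0.1
... )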
get_topics()

Get the topics × words matrix.

Returns: num_topics x vocabulary_size array of floats which represents the learned term-topic matrix.

Return type: numpy.ndarray
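For example, with the model trained in the Examples section above (the printed shape is illustrative):

>>> topics = model.get_topics()
>>> print(topics.shape)  # (num_topics, vocabulary_size), e.g. (20, 12) for the model above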
classmethod load(fname, *args, **kwargs)

Load model from fname.

Parameters: fname (str) – Path to file with a saved LdaVowpalWabbit model.
log_perplexity(chunk)

Get the per-word lower bound on log perplexity.

Parameters: chunk (iterable of list of (int, int)) – Collection of texts in BoW format.

Returns: bound – Per-word lower bound on log perplexity.

Return type: float
print_topic(topicid, topn=10)

Get a text representation of a topic.

Parameters:

topicid (int) – Id of the topic.

topn (int, optional) – Number of top words in the topic.

Returns: Topic topicid in text representation.

Return type: str
print_topics(num_topics=10, num_words=10)

Alias for show_topics().

Parameters:

num_topics (int, optional) – Number of topics to return; set -1 to get all topics.

num_words (int, optional) – Number of words per topic.

Returns: Topics as a list of strings.

Return type: list of str
save(fname, *args, **kwargs)

Save model to file.

Parameters: fname (str) – Path to output file.
show_topic(topicid, topn=10)

Get the topn most probable words for the given topicid.

Parameters:

topicid (int) – Id of the topic.

topn (int, optional) – Number of most probable words to return.

Returns: Sequence of probable words, as a list of (word, word_probability) pairs for topic topicid.

Return type: list of (str, float)
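Usage sketch with the model from the Examples section (the values in the comment are illustrative, not real model weights):

>>> word_probs = model.show_topic(topicid=0, topn=3)
>>> # e.g. [('system', 0.31), ('graph', 0.22), ('trees', 0.09)]
>>> for word, prob in word_probs:
...     print(word, prob)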
show_topics(num_topics=10, num_words=10, log=False, formatted=True)

Get the num_words most probable words for num_topics number of topics.

Parameters:

num_topics (int, optional) – Number of topics to return; set -1 to get all topics.

num_words (int, optional) – Number of words per topic.

log (bool, optional) – If True, also write the topics to the logger.

formatted (bool, optional) – If True, return the topics as a list of strings, otherwise as lists of (weight, word) pairs.

Returns:

list of str – Topics as a list of strings (if formatted=True) OR

list of (float, str) – Topics as lists of (weight, word) pairs (if formatted=False)
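The formatted flag switches the return shape; a short sketch with the model from the Examples section:

>>> # Formatted: one string per topic.
>>> for topic_str in model.show_topics(num_topics=2, num_words=3):
...     print(topic_str)
>>>
>>> # Unformatted: per-topic lists of (weight, word) pairs.
>>> for topic in model.show_topics(num_topics=2, num_words=3, formatted=False):
...     print(topic)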
train(corpus)

Clear any existing model state, and train on the given corpus.

Parameters: corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
update(corpus)

Update an existing model with corpus.

Parameters: corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
gensim.models.wrappers.ldavowpalwabbit.corpus_to_vw(corpus)

Convert corpus to Vowpal Wabbit format.

Parameters: corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

Notes

Vowpal Wabbit format:

| 4:7 14:1 22:8 6:3
| 14:22 22:4 0:1 1:3
| 7:2 8:2

Yields: str – Corpus in Vowpal Wabbit format, line by line.
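For example, converting the two-document corpus from the Examples section, one line per document (output shown in the format described in the Notes):

>>> from gensim.models.wrappers.ldavowpalwabbit import corpus_to_vw
>>>
>>> for line in corpus_to_vw([[(1, 1), (2, 1)], [(3, 5)]]):
...     print(line)
| 1:1 2:1
| 3:5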
gensim.models.wrappers.ldavowpalwabbit.vwmodel2ldamodel(vw_model, iterations=50)

Convert LdaVowpalWabbit to LdaModel.

This works by simply copying the trained model weights (alpha, beta…) from the trained vw_model into the gensim model.

Parameters:

vw_model (LdaVowpalWabbit) – Trained Vowpal Wabbit model.

iterations (int) – Number of iterations to be used for inference of the new LdaModel.

Returns: Gensim native LDA.

Return type: LdaModel
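A typical use is to train with VW and then hand the result off to gensim's native LdaModel for downstream work, reusing the model trained in the Examples section above:

>>> from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
>>>
>>> lda_model = vwmodel2ldamodel(model, iterations=50)
>>> print(lda_model.print_topics())  # same API as any native LdaModel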
gensim.models.wrappers.ldavowpalwabbit.write_corpus_as_vw(corpus, filename)

Convert corpus to Vowpal Wabbit format and save it to filename.

Parameters:

corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

filename (str) – Path to output file.

Returns: Number of lines written to filename.

Return type: int
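Usage sketch, reusing the temporary-file helper from the Examples section above (two documents in, two lines out):

>>> from gensim.test.utils import get_tmpfile
>>> from gensim.models.wrappers.ldavowpalwabbit import write_corpus_as_vw
>>>
>>> corpus_path = get_tmpfile("corpus.vw")
>>> num_lines = write_corpus_as_vw([[(1, 1), (2, 1)], [(3, 5)]], corpus_path)
>>> print(num_lines)
2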