models.atmodel – Author-topic models

Author-topic model in Python.

This module trains the author-topic model on documents and corresponding author-document dictionaries. The training is online and is constant in memory w.r.t. the number of documents. The model is not constant in memory w.r.t. the number of authors.

The model can be updated with additional documents after training has been completed. It is also possible to continue training on the existing data.

The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class inherits from the LdaModel class, and its usage is thus similar.

Distributed computation and multiprocessing are not implemented at the moment, but may be added in the future.

The model was introduced by Rosen-Zvi and co-authors in 2004 (https://mimno.infosci.cornell.edu/info6150/readings/398.pdf).

A tutorial can be found at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.
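
A minimal end-to-end sketch is shown below; the toy corpus, author names and variable names are illustrative only, not part of the API.

>>> from gensim.corpora import Dictionary
>>> from gensim.models.atmodel import AuthorTopicModel
>>>
>>> # Toy data: three tokenized documents and the authors who wrote them.
>>> texts = [['topic', 'model', 'inference'],
...          ['author', 'topic', 'model'],
...          ['online', 'variational', 'inference']]
>>> author2doc = {'alice': [0, 1], 'bob': [2]}  # author name -> indexes into the corpus
>>>
>>> dictionary = Dictionary(texts)  # id2word mapping
>>> corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
>>>
>>> model = AuthorTopicModel(corpus, num_topics=2, id2word=dictionary, author2doc=author2doc)
>>> model.get_author_topics('alice')  # [(topic_id, topic_probability), ...]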

class gensim.models.atmodel.AuthorTopicModel(corpus=None, num_topics=100, id2word=None, author2doc=None, doc2author=None, chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, gamma_threshold=0.001, serialized=False, serialization_path=None, minimum_probability=0.01, random_state=None)

Bases: gensim.models.ldamodel.LdaModel

The constructor estimates the author-topic model parameters based on a training corpus:

>>> model = AuthorTopicModel(corpus, num_topics=10, author2doc=author2doc, id2word=id2word)

The model can be updated (trained) with new documents via

>>> model.update(other_corpus, other_author2doc)

Model persistency is achieved through its load/save methods.

If the iterable corpus and one of the author2doc/doc2author dictionaries are given, training starts immediately. If not, the model is left untrained (presumably because you want to call the update method manually).

num_topics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

author2doc is a dictionary where the keys are the names of authors, and the values are lists of document IDs (indexes into corpus) that the author contributed to.

doc2author is a dictionary where the keys are document IDs (indexes into corpus) and the values are lists of author names, i.e. the reverse mapping of author2doc. Only one of the two, author2doc or doc2author, has to be supplied.
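
For example (illustrative author names and a three-document corpus), the two mappings describe the same authorship structure:

>>> # Either of these can be passed to the constructor; the other is derived internally.
>>> author2doc = {'alice': [0, 1], 'bob': [1, 2]}                 # author name -> document indexes
>>> doc2author = {0: ['alice'], 1: ['alice', 'bob'], 2: ['bob']}  # document index -> author names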

passes is the number of times the model makes a pass over the entire training data.

iterations is the maximum number of times the model loops over each document (M-step). The iterations stop when convergence is reached.

chunksize controls the size of the mini-batches.

alpha and eta are hyperparameters that affect sparsity of the author-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.

alpha can be set to an explicit array (a prior of your choice). It also supports the special values ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

eta can be a scalar for a symmetric prior over topic/word distributions, or a vector of shape num_words, which can be used to impose (user defined) asymmetric priors over the word distribution. It also supports the special value ‘auto’, which learns an asymmetric prior over words directly from your data. eta can also be a matrix of shape num_topics x num_words, which can be used to impose asymmetric priors over the word distribution on a per-topic basis (can not be learned from data).
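
A rough sketch of the shapes these priors can take (assuming corpus, author2doc and a Dictionary named dictionary as in the earlier example; the numeric values are arbitrary):

>>> import numpy as np
>>>
>>> num_topics, num_words = 10, len(dictionary)
>>>
>>> # Explicit asymmetric priors: one alpha entry per topic, one eta entry per word.
>>> alpha_prior = np.full(num_topics, 0.1)
>>> eta_prior = np.full(num_words, 0.01)
>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary,
...                          num_topics=num_topics, alpha=alpha_prior, eta=eta_prior)
>>>
>>> # Learned priors: 'auto' asks the model to estimate asymmetric priors from the data.
>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary,
...                          num_topics=num_topics, alpha='auto', eta='auto')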

eval_every controls how often perplexity is estimated: a perplexity estimate is calculated and logged from the latest mini-batch every eval_every model updates. Set to None to disable perplexity estimation.

decay and offset parameters are the same as Kappa and Tau_0 in Hoffman et al, respectively. decay controls how quickly old documents are forgotten, while offset down-weights early iterations.

minimum_probability controls filtering the topics returned for a document (bow).

random_state can be an integer or a numpy.random.RandomState object. Set the state of the random number generator inside the author-topic model, to ensure reproducibility of your experiments, for example.

serialized indicates whether the input corpora to the model are simple in-memory lists (serialized = False) or saved to the hard drive (serialized = True). Note that this behaviour is quite different from other Gensim models. If your data is too large to fit into memory, use this functionality. Note that calling AuthorTopicModel.update with new data may be cumbersome, as it requires all the existing data to be re-serialized.

serialization_path must be set to a filepath, if serialized = True is used. Use, for example, serialization_path = /tmp/serialized_model.mm or use your working directory by setting serialization_path = serialized_model.mm. An existing file cannot be overwritten; either delete the old file or choose a different name.
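
A sketch of constructing the model in serialized mode (the path is illustrative):

>>> # The corpus is kept on disk as an MmCorpus at serialization_path instead of in memory.
>>> model = AuthorTopicModel(corpus, num_topics=10, author2doc=author2doc, id2word=id2word,
...                          serialized=True, serialization_path='/tmp/serialized_model.mm')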

Example:

>>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word)  # train model
>>> model.update(corpus2)  # update the author-topic model with additional documents
>>> model = AuthorTopicModel(corpus, num_topics=50, author2doc=author2doc, id2word=id2word, alpha='auto', eval_every=5)  # train asymmetric alpha from data
bound(chunk, chunk_doc_idx=None, subsample_ratio=1.0, author2doc=None, doc2author=None)

Estimate the variational bound of documents from corpus: E_q[log p(corpus)] - E_q[log q(corpus)]

There are basically two use cases of this method:

  1. chunk is a subset of the training corpus, and chunk_doc_idx is provided, indicating the indexes of the documents in the training corpus.
  2. chunk is a test set (held-out data), and the author2doc and doc2author corresponding to this test set are provided. There must not be any new authors passed to this method. chunk_doc_idx is not needed in this case.

To obtain the per-word bound, compute:

>>> corpus_words = sum(cnt for document in corpus for _, cnt in document)
>>> model.bound(corpus, author2doc=author2doc, doc2author=doc2author) / corpus_words
clear()

Clear model state (free up some memory). Used in the distributed algo.

compute_phinorm(expElogthetad, expElogbetad)

Efficiently computes the normalizing factor in phi.

diff(other, distance='kullback_leibler', num_words=100, n_ann_terms=10, diagonal=False, annotation=True, normed=True)

Calculate the topic-to-topic difference between this model and another LDA model (other must be an instance of LdaModel or LdaMulticore).

Parameters:
  • distance – function applied to calculate the difference between any topic pair. Available values: kullback_leibler, hellinger, jaccard and jensen_shannon.
  • num_words (int) – number of the most relevant words used when distance == jaccard (also used for annotation).
  • n_ann_terms (int) – maximum number of words in the intersection/symmetric difference between topics (used for annotation).
  • diagonal (bool) – set to True if the difference is required only between identical topic numbers (returns the diagonal of the difference matrix).
  • annotation (bool) – whether the intersection or difference of words between two topics should be returned.
  • normed (bool) – if True, the matrix Z is normalized.

Returns a matrix Z with shape (m1.num_topics, m2.num_topics), where Z[i][j] is the difference between topic_i and topic_j, and (if annotation is True) a matrix of annotations with shape (m1.num_topics, m2.num_topics, 2, None), where annotation[i][j] = [[int_1, int_2, …], [diff_1, diff_2, …]]; int_k is a word from the intersection of topic_i and topic_j, and diff_l is a word from their symmetric difference.

Example:

>>> m1, m2 = LdaMulticore.load(path_1), LdaMulticore.load(path_2)
>>> mdiff, annotation = m1.diff(m2)
>>> print(mdiff) # get matrix with difference for each topic pair from `m1` and `m2`
>>> print(annotation) # get array with positive/negative words for each topic pair from `m1` and `m2`
do_estep(chunk, author2doc, doc2author, rhot, state=None, chunk_doc_idx=None)

Perform inference on a chunk of documents, and accumulate the collected sufficient statistics in state (or self.state if None).

do_mstep(rho, other, extra_pass=False)

M step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.

extend_corpus(corpus)

Add new documents in corpus to self.corpus. If serialization is used, then the entire corpus (self.corpus) is re-serialized and the new documents are added in the process. If serialization is not used, the corpus, as a list of documents, is simply extended.

get_author_topics(author_name, minimum_probability=None)

Return the topic distribution for the given author, as a list of (topic_id, topic_probability) 2-tuples. Ignore topics with very low probability (below minimum_probability).

Obtaining topic probabilities of each word, as in LDA (via per_word_topics), is not supported.
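
For example (assuming a trained model and an author name present in author2doc):

>>> author_topics = model.get_author_topics('alice')  # [(topic_id, topic_probability), ...]
>>> # Lower the cut-off to also include topics with small probabilities.
>>> author_topics = model.get_author_topics('alice', minimum_probability=1e-8)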

get_document_topics(word_id, minimum_probability=None)

This method overrides LdaModel.get_document_topics and simply raises an exception. get_document_topics is not valid for the author-topic model; use get_author_topics instead.

get_term_topics(word_id, minimum_probability=None)
Parameters:
  • word_id (int) – ID of the word to get topic probabilities for.
  • minimum_probability (float) – Only include topic probabilities above this value (None by default). If set to None, use 1e-8 to prevent including 0s.
Returns:

The most likely topics for the given word. Each topic is represented as a tuple of (topic_id, term_probability).

Return type:

list

get_topic_terms(topicid, topn=10)
Parameters:topn (int) – Only return 2-tuples for the topn most probable words (ignore the rest).
Returns:(word_id, probability) 2-tuples for the most probable words in topic with id topicid.
Return type:list
get_topics()
Returns:num_topics x vocabulary_size array of floats which represents the term topic matrix learned during inference.
Return type:np.ndarray
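A brief sketch of inspecting the learned topics with these accessors (trained model assumed):

>>> word_topics = model.get_term_topics(word_id=0)        # topics in which word 0 is prominent
>>> top_words = model.get_topic_terms(topicid=0, topn=5)  # [(word_id, probability), ...]
>>> topic_term = model.get_topics()                       # ndarray, shape (num_topics, vocabulary_size)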
inference(chunk, author2doc, doc2author, rhot, collect_sstats=False, chunk_doc_idx=None)

Given a chunk of sparse document vectors, update gamma (parameters controlling the topic weights) for each author corresponding to the documents in the chunk.

The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline.

If collect_sstats is True, also collect sufficient statistics needed to update the model’s topic-word distributions, and return a 2-tuple (gamma_chunk, sstats). Otherwise, return (gamma_chunk, None). gamma_chunk is of shape len(chunk_authors) x self.num_topics, where chunk_authors is the set of authors appearing in the documents of the current chunk.

Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.

init_dir_prior(prior, name)
init_empty_corpus()

Initialize an empty corpus. If the corpora are to be treated as lists, simply initialize an empty list. If serialization is used, initialize an empty corpus of the class gensim.corpora.MmCorpus.

load(fname, *args, **kwargs)

Load a previously saved object from file (also see save).

Large arrays can be memmap’ed back as read-only (shared memory) by setting mmap=’r’:

>>> LdaModel.load(fname, mmap='r')
log_perplexity(chunk, chunk_doc_idx=None, total_docs=None)

Calculate and return the per-word likelihood bound, using the chunk of documents as the evaluation corpus. Also output the calculated statistics, including perplexity=2^(-bound), to the log at INFO level.

print_topic(topicno, topn=10)

Return a single topic as a formatted string. See show_topic() for parameters.

>>> lsimodel.print_topic(10, topn=5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
print_topics(num_topics=20, num_words=10)

Alias for show_topics() that prints the num_words most probable words for num_topics topics to the log. Set num_topics=-1 to print all topics.

save(fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs)

Save the model to file.

Large internal arrays may be stored into separate files, with fname as prefix.

separately can be used to define which arrays should be stored in separate files.

The ignore parameter can be used to define which variables should be ignored, i.e. left out from the pickled lda model. By default the internal state is ignored, as it uses its own serialisation, not the one provided by LdaModel. The state and dispatcher will be added to any ignore parameter defined.

Note: do not save as a compressed file if you intend to load the file back with mmap.

Note: If you intend to use models across Python 2/3 versions there are a few things to keep in mind:

  1. The pickled Python dictionaries will not work across Python versions
  2. The save method does not automatically save all numpy arrays separately, only those that exceed sep_limit set in gensim.utils.SaveLoad.save. The main concern here is the alpha array, if for instance alpha=’auto’ is used.

Please refer to the wiki recipes section (https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q9-how-do-i-load-a-model-in-python-3-that-was-trained-and-saved-using-python-2) for an example on how to work around these issues.
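
A minimal save/load round trip (file name illustrative; see the mmap note above):

>>> model.save('/tmp/author_topic.model')  # large arrays may be stored in companion files
>>> loaded = AuthorTopicModel.load('/tmp/author_topic.model')
>>> # Memory-map large arrays read-only instead of loading them fully into RAM.
>>> loaded = AuthorTopicModel.load('/tmp/author_topic.model', mmap='r')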

show_topic(topicid, topn=10)
Parameters:topn (int) – Only return 2-tuples for the topn most probable words (ignore the rest).
Returns:(word, probability) 2-tuples for the most probable words in topic topicid.
Return type:list
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
Parameters:
  • num_topics (int) – show results for first num_topics topics. Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.
  • num_words (int) – include top num_words with highest probabilities in topic.
  • log (bool) – If True, log output in addition to returning it.
  • formatted (bool) – If True, format topics as strings, otherwise return them as (word, probability) 2-tuples.
Returns:

num_words most significant words for num_topics number of topics (10 words for top 10 topics, by default).

Return type:

list

sync_state()
top_topics(corpus=None, texts=None, dictionary=None, window_size=None, coherence='u_mass', topn=20, processes=-1)

Calculate the coherence for each topic; default is Umass coherence.

See the gensim.models.CoherenceModel constructor for more info on the parameters and the different coherence metrics.

Returns:tuples with (topic_repr, coherence_score), where topic_repr is a list of representations of the topn terms for the topic. The terms are represented as tuples of (membership_in_topic, token). The coherence_score is a float.
Return type:list
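For example, ranking the learned topics by their u_mass coherence on the training corpus (variable names as in the earlier examples):

>>> ranked = model.top_topics(corpus=corpus, coherence='u_mass', topn=10)
>>> best_topic, best_score = ranked[0]  # topics are sorted from most to least coherent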
update(corpus=None, author2doc=None, doc2author=None, chunksize=None, decay=None, offset=None, passes=None, update_every=None, eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False)

Train the model with new documents, by EM-iterating over corpus until the topics converge (or until the maximum number of allowed iterations is reached). corpus must be an iterable (a repeatable stream of documents).

This update also supports updating an already trained model (self) with new documents from corpus; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Hoffman et al. and is guaranteed to converge for any decay in (0.5, 1.0]. Additionally, for smaller corpus sizes, an increasing offset may be beneficial (see Table 1 in Hoffman et al.).

If update is called with authors that already exist in the model, it will resume training on not only new documents for that author, but also the previously seen documents. This is necessary for those authors’ topic distributions to converge.

Every time update(corpus, author2doc) is called, the new documents are appended to all the previously seen documents, and author2doc is combined with the previously seen authors.

To resume training on all the data seen by the model, simply call update().

It is not possible to add new authors to existing documents, as all documents in corpus are assumed to be new documents.

Parameters:
  • corpus (gensim corpus) – The corpus with which the author-topic model should be updated.
  • author2doc (dictionary) – author to document mapping corresponding to indexes in input corpus.
  • doc2author (dictionary) – document to author mapping corresponding to indexes in input corpus.
  • chunks_as_numpy (bool) – Whether each chunk passed to the inference step should be a numpy array or not. Numpy can in some settings turn the term IDs into floats; these will be converted back into integers in inference, which incurs a performance hit. For distributed computing it may be desirable to keep the chunks as numpy arrays.

For other parameter settings, see AuthorTopicModel constructor.
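
A sketch of a typical update call (new_corpus and new_author2doc are illustrative names for a fresh batch of documents and its authorship mapping):

>>> # new_author2doc indexes into new_corpus and may contain both new and already known authors.
>>> model.update(new_corpus, author2doc=new_author2doc)
>>> # Resume training on all data seen so far, without adding anything new.
>>> model.update()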

update_alpha(gammat, rho)

Update parameters for the Dirichlet prior on the per-document topic weights alpha given the last gammat.

update_eta(lambdat, rho)

Update parameters for the Dirichlet prior on the per-topic word weights eta given the last lambdat.

class gensim.models.atmodel.AuthorTopicState(eta, lambda_shape, gamma_shape)

Bases: gensim.models.ldamodel.LdaState

NOTE: distributed mode is not available yet in the author-topic model. This AuthorTopicState object is kept so that it will be easier when the time comes to implement it.

Encapsulate information for distributed computation of AuthorTopicModel objects.

Objects of this class are sent over the network, so try to keep them lean to reduce traffic.

blend(rhot, other, targetsize=None)

Given LdaState other, merge it with the current state. Stretch both to targetsize documents before merging, so that they are of comparable magnitude.

Merging is done by average weighting: in the extremes, rhot=0.0 means other is completely ignored; rhot=1.0 means self is completely ignored.

This procedure corresponds to the stochastic gradient update from Hoffman et al., algorithm 2 (eq. 14).

blend2(rhot, other, targetsize=None)

Alternative, simpler blend.

get_Elogbeta()
get_lambda()
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

merge(other)

Merge the result of an E step from one node with that of another node (summing up sufficient statistics).

The merging is trivial and after merging all cluster nodes, we have the exact same result as if the computation was run on a single node (no approximation).

reset()

Prepare the state for a new EM iteration (reset sufficient stats).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

gensim.models.atmodel.construct_author2doc(doc2author)

Make a mapping from author IDs to document IDs.

gensim.models.atmodel.construct_doc2author(corpus, author2doc)

Make a mapping from document IDs to author IDs.
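
For example (illustrative two-document corpus and authorship mapping, with corpus as in the earlier examples), each helper constructs one mapping from the other:

>>> from gensim.models.atmodel import construct_author2doc, construct_doc2author
>>>
>>> author2doc = {'alice': [0, 1], 'bob': [1]}
>>> doc2author = construct_doc2author(corpus, author2doc)  # document index -> list of author names
>>> author2doc_again = construct_author2doc(doc2author)    # invert the mapping back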