models.atmodel – Author-topic models

Author-topic model.

This module trains the author-topic model on documents and corresponding author-document dictionaries. The training is online and is constant in memory w.r.t. the number of documents. The model is not constant in memory w.r.t. the number of authors.

The model can be updated with additional documents after training has been completed. It is also possible to continue training on the existing data.

The model is closely related to LdaModel. The AuthorTopicModel class inherits LdaModel, and its usage is thus similar.

The model was introduced by Rosen-Zvi and co-authors: “The Author-Topic Model for Authors and Documents”. The model correlates authorship information with topics to give better insight into the subject knowledge of an author.

Example

>>> from gensim.models import AuthorTopicModel
>>> from gensim.corpora import mmcorpus
>>> from gensim.test.utils import common_dictionary, datapath, temporary_file
>>> author2doc = {
...     'john': [0, 1, 2, 3, 4, 5, 6],
...     'jane': [2, 3, 4, 5, 6, 7, 8],
...     'jack': [0, 2, 4, 6, 8]
... }
>>>
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
>>>
>>> with temporary_file("serialized") as s_path:
...     model = AuthorTopicModel(
...          corpus, author2doc=author2doc, id2word=common_dictionary, num_topics=4,
...          serialized=True, serialization_path=s_path
...     )
...
...     model.update(corpus, author2doc)  # update the author-topic model with additional documents
>>>
>>> # construct vectors for authors
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
class gensim.models.atmodel.AuthorTopicModel(corpus=None, num_topics=100, id2word=None, author2doc=None, doc2author=None, chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, gamma_threshold=0.001, serialized=False, serialization_path=None, minimum_probability=0.01, random_state=None)

Bases: gensim.models.ldamodel.LdaModel

The constructor estimates the author-topic model parameters based on a training corpus.

Parameters:
  • corpus (iterable of list of (int, float), optional) – Corpus in BoW format
  • num_topics (int, optional) – Number of topics to be extracted from the training corpus.
  • id2word (Dictionary, optional) – A mapping from word ids (integers) to words (strings).
  • author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
  • doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
  • chunksize (int, optional) – Controls the size of the mini-batches.
  • passes (int, optional) – Number of times the model makes a pass over the entire training data.
  • iterations (int, optional) – Maximum number of times the model loops over each document.
  • decay (float, optional) – Controls how old documents are forgotten.
  • offset (float, optional) – Controls down-weighting of iterations.
  • alpha (float, optional) – Hyperparameters for author-topic model. Supports special values of ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.
  • eta (float, optional) – Hyperparameters for author-topic model.
  • update_every (int, optional) – Make updates in topic probability for latest mini-batch.
  • eval_every (int, optional) – Calculate and estimate log perplexity for latest mini-batch.
  • gamma_threshold (float, optional) – Threshold value of gamma (topic difference between consecutive two topics) until which the iterations continue.
  • serialized (bool, optional) – Indicates whether the input corpora to the model are simple lists or saved to the hard-drive.
  • serialization_path (str, optional) – Must be set to a filepath, if serialized = True is used.
  • minimum_probability (float, optional) – Controls filtering the topics returned for a document (bow).
  • random_state ({int, numpy.random.RandomState}, optional) – Set the state of the random number generator inside the author-topic model.
bound(chunk, chunk_doc_idx=None, subsample_ratio=1.0, author2doc=None, doc2author=None)

Estimate the variational bound of documents from corpus.

\mathbb{E}_{q}[\log p(\mathrm{corpus})] - \mathbb{E}_{q}[\log q(\mathrm{corpus})]

Notes

There are basically two use cases of this method:

  1. chunk is a subset of the training corpus, and chunk_doc_idx is provided, indicating the indexes of the documents in the training corpus.
  2. chunk is a test set (held-out data), and author2doc and doc2author corresponding to this test set are provided. No new authors may be passed to this method. chunk_doc_idx is not needed in this case.
Parameters:
  • chunk (iterable of list of (int, float)) – Corpus in BoW format.
  • chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.
  • subsample_ratio (float, optional) – Used for calculation of word score for estimation of variational bound.
  • author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of documents that the author contributes to.
  • doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
Returns:

Value of variational bound score.

Return type:

float

clear()

Clear the model’s state to free some memory. Used in the distributed implementation.

compute_phinorm(expElogthetad, expElogbetad)

Efficiently computes the normalizing factor in phi.

Parameters:
  • expElogthetad (numpy.ndarray) – Value of variational distribution q(\theta|\gamma).
  • expElogbetad (numpy.ndarray) – Value of variational distribution q(\beta|\lambda).
Returns:

Value of normalizing factor.

Return type:

float
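As a rough illustration of why this normalizer can be computed cheaply, the sketch below assumes (this is not stated above) that the normalizing factor for each word is the sum, over the document's authors and over topics, of the product of the two expectation terms. Under that assumption, summing over authors first and then taking one dot product with the topic-word matrix gives the same values as the full double sum. The function names `phinorm_naive` and `phinorm_fast` and the toy numbers are hypothetical, and plain lists stand in for numpy arrays.

```python
def phinorm_naive(exp_elog_theta, exp_elog_beta):
    """Double sum over authors and topics for every word (slow reference)."""
    n_words = len(exp_elog_beta[0])
    return [
        sum(
            theta_a[k] * exp_elog_beta[k][w]
            for theta_a in exp_elog_theta
            for k in range(len(theta_a))
        )
        for w in range(n_words)
    ]


def phinorm_fast(exp_elog_theta, exp_elog_beta):
    """Sum over authors once, then take one dot product per word."""
    n_topics = len(exp_elog_beta)
    n_words = len(exp_elog_beta[0])
    # Collapse the author dimension first: one value per topic.
    theta_sum = [sum(theta_a[k] for theta_a in exp_elog_theta) for k in range(n_topics)]
    return [
        sum(theta_sum[k] * exp_elog_beta[k][w] for k in range(n_topics))
        for w in range(n_words)
    ]


# Toy input (hypothetical values): 2 authors x 2 topics, and 2 topics x 3 words.
theta = [[0.5, 0.25], [0.125, 0.5]]
beta = [[0.5, 0.25, 0.25], [0.125, 0.5, 0.375]]

naive = phinorm_naive(theta, beta)
fast = phinorm_fast(theta, beta)
```

Both functions return one normalizer per word in the document; the second does so with a single pass over the authors.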

diff(other, distance='kullback_leibler', num_words=100, n_ann_terms=10, diagonal=False, annotation=True, normed=True)

Calculate the difference in topic distributions between two models: self and other.

Parameters:
  • other (LdaModel) – The model which will be compared against the current object.
  • distance ({'kullback_leibler', 'hellinger', 'jaccard', 'jensen_shannon'}) – The distance metric to calculate the difference with.
  • num_words (int, optional) – The number of most relevant words used if distance == ‘jaccard’. Also used for annotating topics.
  • n_ann_terms (int, optional) – Max number of words in intersection/symmetric difference between topics. Used for annotation.
  • diagonal (bool, optional) – Whether we need the difference between identical topics (the diagonal of the difference matrix).
  • annotation (bool, optional) – Whether the intersection or difference of words between two topics should be returned.
  • normed (bool, optional) – Whether the matrix should be normalized or not.
Returns:

  • numpy.ndarray – A difference matrix. Each element corresponds to the difference between the two topics, shape (self.num_topics, other.num_topics)
  • numpy.ndarray, optional – Annotation matrix where for each pair we include the word from the intersection of the two topics, and the word from the symmetric difference of the two topics. Only included if annotation == True. Shape (self.num_topics, other_model.num_topics, 2).

Examples

Get the differences between each pair of topics inferred by two models

>>> from gensim.models.ldamulticore import LdaMulticore
>>> from gensim.test.utils import datapath
>>>
>>> m1, m2 = LdaMulticore.load(datapath("lda_3_0_1_model")), LdaMulticore.load(datapath("ldamodel_python_3_5"))
>>> mdiff, annotation = m1.diff(m2)
>>> topic_diff = mdiff # get matrix with difference for each topic pair from `m1` and `m2`
do_estep(chunk, author2doc, doc2author, rhot, state=None, chunk_doc_idx=None)

Performs inference (E-step) on a chunk of documents, and accumulates the collected sufficient statistics.

Parameters:
  • chunk (iterable of list of (int, float)) – Corpus in BoW format.
  • author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
  • doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
  • rhot (float) – Value of rho for conducting inference on documents.
  • state (int, optional) – Initializes the state for a new E iteration.
  • chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.
Returns:

Value of gamma for training of model.

Return type:

float

do_mstep(rho, other, extra_pass=False)

Maximization step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.

Parameters:
  • rho (float) – Learning rate.
  • other (LdaModel) – The model whose sufficient statistics will be used to update the topics.
  • extra_pass (bool, optional) – Whether this step required an additional pass over the corpus.
extend_corpus(corpus)

Add new documents from corpus to self.corpus.

If serialization is used, then the entire corpus (self.corpus) is re-serialized and the new documents are added in the process. If serialization is not used, the corpus, as a list of documents, is simply extended.

Parameters: corpus (iterable of list of (int, float)) – Corpus in BoW format
Raises: AssertionError – If serialized == False and corpus isn’t a list.
get_author_topics(author_name, minimum_probability=None)

Get the topic distribution for the given author.

Parameters:
  • author_name (str) – Name of the author for which the topic distribution needs to be estimated.
  • minimum_probability (float, optional) – Sets the minimum probability value for showing the topics of a given author, topics with probability < minimum_probability will be ignored.
Returns:

Topic distribution of an author.

Return type:

list of (int, float)
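The effect of minimum_probability can be pictured with a small standalone sketch. It assumes a dense list of per-topic probabilities for one author and keeps only the (topic ID, probability) pairs at or above the threshold, matching the description above. The helper name `filter_topics` and the numbers are hypothetical; this is not gensim's own code.

```python
def filter_topics(topic_probs, minimum_probability=0.01):
    """Return (topic_id, probability) pairs at or above the threshold."""
    return [
        (topic_id, prob)
        for topic_id, prob in enumerate(topic_probs)
        if prob >= minimum_probability
    ]


# Hypothetical dense distribution over 4 topics for one author.
dense = [0.6, 0.005, 0.39, 0.005]
sparse = filter_topics(dense, minimum_probability=0.01)
```

Here topics 1 and 3 fall below the threshold, so only topics 0 and 2 remain in `sparse`.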

Example

>>> from gensim.models import AuthorTopicModel
>>> from gensim.corpora import mmcorpus
>>> from gensim.test.utils import common_dictionary, datapath, temporary_file
>>> author2doc = {
...     'john': [0, 1, 2, 3, 4, 5, 6],
...     'jane': [2, 3, 4, 5, 6, 7, 8],
...     'jack': [0, 2, 4, 6, 8]
... }
>>>
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
>>>
>>> with temporary_file("serialized") as s_path:
...     model = AuthorTopicModel(
...          corpus, author2doc=author2doc, id2word=common_dictionary, num_topics=4,
...          serialized=True, serialization_path=s_path
...     )
...
...     model.update(corpus, author2doc)  # update the author-topic model with additional documents
>>>
>>> # construct vectors for authors
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
get_document_topics(word_id, minimum_probability=None)

Overrides get_document_topics() and simply raises an exception.

Warning

This method is invalid for this model; use get_author_topics() or get_new_author_topics() instead.

Raises: NotImplementedError – Always.
get_new_author_topics(corpus, minimum_probability=None)

Infers topics for new author.

Infers a topic distribution for a new author over the passed corpus of docs, assuming that all documents are from this single new author.

Parameters:
  • corpus (iterable of list of (int, float)) – Corpus in BoW format.
  • minimum_probability (float, optional) – Ignore topics with probability below this value. If None, 1e-8 is used.
Returns:

Topic distribution for the given corpus.

Return type:

list of (int, float)

get_term_topics(word_id, minimum_probability=None)

Get the most relevant topics to the given word.

Parameters:
  • word_id (int) – The word for which the topic distribution will be computed.
  • minimum_probability (float, optional) – Topics with an assigned probability below this threshold will be discarded.
Returns:

The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

Return type:

list of (int, float)

get_topic_terms(topicid, topn=10)

Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic(), which represents words by the actual strings.

Parameters:
  • topicid (int) – The ID of the topic to be returned
  • topn (int, optional) – Number of the most significant words that are associated with the topic.
Returns:

Word ID - probability pairs for the most relevant words generated by the topic.

Return type:

list of (int, float)

get_topics()

Get the term-topic matrix learned during inference.

Returns: The probability for each word in each topic, shape (num_topics, vocabulary_size).
Return type: numpy.ndarray
inference(chunk, author2doc, doc2author, rhot, collect_sstats=False, chunk_doc_idx=None)

Given a chunk of sparse document vectors, update gamma for each author corresponding to the chunk.

Warning

The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline.

Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: “Algorithms for non-negative matrix factorization”, NIPS 2001.

Parameters:
  • chunk (iterable of list of (int, float)) – Corpus in BoW format.
  • author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
  • doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
  • rhot (float) – Value of rho for conducting inference on documents.
  • collect_sstats (boolean, optional) – If True - collect sufficient statistics needed to update the model’s topic-word distributions, and return (gamma_chunk, sstats). Otherwise, return (gamma_chunk, None). gamma_chunk is of shape len(chunk_authors) x self.num_topics, where chunk_authors is the number of authors in the documents in the current chunk.
  • chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.
Returns:

gamma_chunk and sstats (if collect_sstats == True, otherwise - None)

Return type:

(numpy.ndarray, numpy.ndarray)

init_dir_prior(prior, name)

Initialize priors for the Dirichlet distribution.

Parameters:
  • prior ({str, list of float, numpy.ndarray of float, float}) –

    A-priori belief on word probability. If name == ‘eta’ then the prior can be:

    • scalar for a symmetric prior over topic/word probability,
    • vector of length num_words to denote an asymmetric user defined probability for each word,
    • matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,
    • the string ‘auto’ to learn the asymmetric prior from the data.

    If name == ‘alpha’, then the prior can be:

    • a 1D array of length equal to the number of expected topics,
    • ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.
    • ’auto’: Learns an asymmetric prior from the corpus.
  • name ({'alpha', 'eta'}) – Whether the prior is parameterized by the alpha vector (1 parameter per topic) or by the eta (1 parameter per unique term in the vocabulary).
init_empty_corpus()

Initialize an empty corpus. If the corpora are to be treated as lists, simply initialize an empty list. If serialization is used, initialize an empty corpus using MmCorpus.

load(fname, *args, **kwargs)

Load a previously saved gensim.models.ldamodel.LdaModel from file.

See also

save()
Save model.
Parameters:
  • fname (str) – Path to the file where the model is stored.
  • *args – Positional arguments propagated to load().
  • **kwargs – Key word arguments propagated to load().

Examples

Large arrays can be memmap’ed back as read-only (shared memory) by setting mmap=’r’:

>>> from gensim.models import LdaModel
>>> from gensim.test.utils import datapath
>>>
>>> fname = datapath("lda_3_0_1_model")
>>> lda = LdaModel.load(fname, mmap='r')
log_perplexity(chunk, chunk_doc_idx=None, total_docs=None)

Calculate per-word likelihood bound, using the chunk of documents as evaluation corpus.

Parameters:
  • chunk (iterable of list of (int, float)) – Corpus in BoW format.
  • chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.
  • total_docs (int, optional) – Initializes the value for total number of documents.
Returns:

Value of per-word likelihood bound.

Return type:

float
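To make "per-word likelihood bound" concrete, the sketch below assumes that the total bound for a chunk is divided by the number of word tokens in that chunk, and that a perplexity figure can then be derived as 2 raised to the negated per-word bound (a common convention, taken here as an assumption). The helper names and the bound value of -12.0 are hypothetical; in practice the bound would come from bound().

```python
def per_word_bound(bound, chunk):
    """Divide a total bound by the number of word tokens in a BoW chunk."""
    n_tokens = sum(count for doc in chunk for _, count in doc)
    return bound / n_tokens


def perplexity(per_word):
    """Convert a per-word bound into a perplexity value, using base 2."""
    return 2.0 ** (-per_word)


# Two documents in BoW format, 4 word tokens in total.
chunk = [[(0, 2.0), (3, 1.0)], [(1, 1.0)]]
pwb = per_word_bound(-12.0, chunk)  # hypothetical total bound of -12.0
ppl = perplexity(pwb)
```

A per-word bound closer to zero corresponds to a lower perplexity, which is why the value is useful for comparing models on the same held-out chunk.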

print_topic(topicno, topn=10)

Get a single topic as a formatted string.

Parameters:
  • topicno (int) – Topic id.
  • topn (int) – Number of words from topic that will be used.
Returns:

String representation of topic, like ‘-0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘.

Return type:

str

print_topics(num_topics=20, num_words=10)

Get the most significant topics (alias for show_topics() method).

Parameters:
  • num_topics (int, optional) – The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
  • num_words (int, optional) – The number of words to be included per topics (ordered by significance).
Returns:

Sequence with (topic_id, [(word, value), … ]).

Return type:

list of (int, list of (str, float))

save(fname, ignore=('state', 'dispatcher'), separately=None, *args, **kwargs)

Save the model to a file.

Large internal arrays may be stored into separate files, with fname as prefix.

Notes

If you intend to use models across Python 2/3 versions there are a few things to keep in mind:

  1. The pickled Python dictionaries will not work across Python versions
  2. The save method does not automatically save all numpy arrays separately, only those that exceed sep_limit set in save(). The main concern here is the alpha array if for instance using alpha=’auto’.

Please refer to the wiki recipes section for an example on how to work around these issues.

See also

load()
Load model.
Parameters:
  • fname (str) – Path to the system file where the model will be persisted.
  • ignore (tuple of str, optional) – The named attributes in the tuple will be left out of the pickled model. The reason why the internal state is ignored by default is that it uses its own serialisation rather than the one provided by this method.
  • separately ({list of str, None}, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - these attributes will be stored in separate files, the automatic check is not performed in this case.
  • *args – Positional arguments propagated to save().
  • **kwargs – Key word arguments propagated to save().
show_topic(topicid, topn=10)

Get the representation for a single topic. Words here are the actual strings, in contrast to get_topic_terms(), which represents words by their vocabulary ID.

Parameters:
  • topicid (int) – The ID of the topic to be returned
  • topn (int, optional) – Number of the most significant words that are associated with the topic.
Returns:

Word - probability pairs for the most relevant words generated by the topic.

Return type:

list of (str, float)

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

Get a representation for selected topics.

Parameters:
  • num_topics (int, optional) – Number of topics to be returned. Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of topics is therefore arbitrary and may change between two LDA training runs.
  • num_words (int, optional) – Number of words to be presented for each topic. These will be the most relevant words (assigned the highest probability for each topic).
  • log (bool, optional) – Whether the output is also logged, besides being returned.
  • formatted (bool, optional) – Whether the topic representations should be formatted as strings. If False, they are returned as 2-tuples of (word, probability).
Returns:

a list of topics, each represented either as a string (when formatted == True) or word-probability pairs.

Return type:

list of {str, tuple of (str, float)}

sync_state()

Propagate the state’s topic probabilities to the inner object’s attribute.

top_topics(corpus=None, texts=None, dictionary=None, window_size=None, coherence='u_mass', topn=20, processes=-1)

Get the topics with the highest coherence score, together with the coherence for each topic.

Parameters:
  • corpus (iterable of list of (int, float), optional) – Corpus in BoW format.
  • texts (list of list of str, optional) – Tokenized texts, needed for coherence models that use a sliding-window-based probability estimator (i.e. coherence=`c_something`).
  • dictionary (Dictionary, optional) – Gensim dictionary mapping of id word to create corpus. If model.id2word is present, this is not needed. If both are provided, passed dictionary will be used.
  • window_size (int, optional) – Is the size of the window to be used for coherence measures using boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None - the default window sizes are used which are: ‘c_v’ - 110, ‘c_uci’ - 10, ‘c_npmi’ - 10.
  • coherence ({'u_mass', 'c_v', 'c_uci', 'c_npmi'}, optional) – Coherence measure to be used. Fastest method - ‘u_mass’, ‘c_uci’ also known as c_pmi. For ‘u_mass’ corpus should be provided, if texts is provided, it will be converted to corpus using the dictionary. For ‘c_v’, ‘c_uci’ and ‘c_npmi’ texts should be provided (corpus isn’t needed)
  • topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
  • processes (int, optional) – Number of processes to use for probability estimation phase, any value less than 1 will be interpreted as num_cpus - 1.
Returns:

Each element in the list is a pair of a topic representation and its coherence score. Topic representations are distributions of words, represented as a list of pairs of word IDs and their probabilities.

Return type:

list of (list of (int, str), float)

update(corpus=None, author2doc=None, doc2author=None, chunksize=None, decay=None, offset=None, passes=None, update_every=None, eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False)

Train the model with new documents, by EM-iterating over corpus until the topics converge (or until the maximum number of allowed iterations is reached).

Notes

This update also supports updating an already trained model (self) with new documents from corpus: the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Hoffman et al., “Stochastic Variational Inference”, and is guaranteed to converge for any decay in (0.5, 1.0]. Additionally, for smaller corpus sizes, an increasing offset may be beneficial (see Table 1 in Hoffman et al.).
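The roles of decay and offset can be sketched with the step-size schedule used in online variational inference, assuming the form rho_t = (offset + t) ** (-decay) from Hoffman et al. How gensim counts the step t internally is not described here, so the helper `learning_rate` below and its use of a plain integer step are illustrative assumptions.

```python
def learning_rate(step, decay=0.5, offset=1.0):
    """Step size rho_t = (offset + step) ** (-decay)."""
    return (offset + step) ** (-decay)


# With decay inside (0.5, 1.0], the step sizes shrink steadily.
rates = [learning_rate(t, decay=0.75, offset=1.0) for t in range(5)]

# A larger offset down-weights the earliest updates.
early_small_offset = learning_rate(0, decay=0.5, offset=1.0)
early_large_offset = learning_rate(0, decay=0.5, offset=64.0)
```

A larger decay makes the model forget older mini-batches more slowly, because later updates get smaller weights; a larger offset reduces the influence of the first few updates.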

If update is called with authors that already exist in the model, it will resume training on not only new documents for that author, but also the previously seen documents. This is necessary for those authors’ topic distributions to converge.

Every time update(corpus, author2doc) is called, the new documents are appended to all the previously seen documents, and author2doc is combined with the previously seen authors.

To resume training on all the data seen by the model, simply call update().

It is not possible to add new authors to existing documents, as all documents in corpus are assumed to be new documents.

Parameters:
  • corpus (iterable of list of (int, float)) – The corpus in BoW format.
  • author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
  • doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
  • chunksize (int, optional) – Controls the size of the mini-batches.
  • decay (float, optional) – Controls how old documents are forgotten.
  • offset (float, optional) – Controls down-weighting of iterations.
  • passes (int, optional) – Number of times the model makes a pass over the entire training data.
  • update_every (int, optional) – Make updates in topic probability for latest mini-batch.
  • eval_every (int, optional) – Calculate and estimate log perplexity for latest mini-batch.
  • iterations (int, optional) – Maximum number of times the model loops over each document
  • gamma_threshold (float, optional) – Threshold value of gamma (topic difference between consecutive two topics) until which the iterations continue.
  • chunks_as_numpy (bool, optional) – Whether each chunk passed to inference() should be a numpy array or not. Numpy can in some settings turn the term IDs into floats; these will be converted back into integers in inference, which incurs a performance hit. For distributed computing (not supported now) it may be desirable to keep the chunks as numpy arrays.
update_alpha(gammat, rho)

Update parameters for the Dirichlet prior on the per-document topic weights.

Parameters:
  • gammat (numpy.ndarray) – Previous topic weight parameters.
  • rho (float) – Learning rate.
Returns:

Sequence of alpha parameters.

Return type:

numpy.ndarray

update_eta(lambdat, rho)

Update parameters for the Dirichlet prior on the per-topic word weights.

Parameters:
  • lambdat (numpy.ndarray) – Previous lambda parameters.
  • rho (float) – Learning rate.
Returns:

The updated eta parameters.

Return type:

numpy.ndarray

class gensim.models.atmodel.AuthorTopicState(eta, lambda_shape, gamma_shape)

Bases: gensim.models.ldamodel.LdaState

Encapsulate information for computation of AuthorTopicModel.

Parameters:
  • eta (numpy.ndarray) – Dirichlet topic parameter for sparsity.
  • lambda_shape ((int, int)) – Initialize topic parameters.
  • gamma_shape (int) – Initialize topic parameters.
blend(rhot, other, targetsize=None)

Merge the current state with another one using a weighted average for the sufficient statistics.

The number of documents is stretched in both state objects, so that they are of comparable magnitude. This procedure corresponds to the stochastic gradient update from Hoffman et al., “Online Learning for Latent Dirichlet Allocation”, see equations (5) and (9).

Parameters:
  • rhot (float) – Weight of the other state in the computed average. A value of 0.0 means that other is completely ignored. A value of 1.0 means self is completely ignored.
  • other (LdaState) – The state object with which the current one will be merged.
  • targetsize (int, optional) – The number of documents to stretch both states to.
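The weighted average described above can be sketched as follows. The sketch assumes that each state's sufficient statistics are first rescaled by targetsize divided by that state's document count (the "stretching"), and are then combined with weights 1 - rhot and rhot. The function `blend_sstats`, its flat-list representation of the statistics, and the numbers are hypothetical stand-ins for the state objects.

```python
def blend_sstats(self_sstats, self_numdocs, other_sstats, other_numdocs,
                 rhot, targetsize=None):
    """Weighted average of two sets of sufficient statistics after stretching."""
    if targetsize is None:
        targetsize = self_numdocs

    def scale(numdocs):
        # Stretch a state so that it represents `targetsize` documents.
        if numdocs == 0 or numdocs == targetsize:
            return 1.0
        return targetsize / numdocs

    s_self, s_other = scale(self_numdocs), scale(other_numdocs)
    return [
        (1.0 - rhot) * s_self * a + rhot * s_other * b
        for a, b in zip(self_sstats, other_sstats)
    ]


# Old state built from 100 documents, new state from a 10-document chunk.
old = [10.0, 20.0]
new = [1.0, 3.0]
half = blend_sstats(old, 100, new, 10, rhot=0.5)
only_self = blend_sstats(old, 100, new, 10, rhot=0.0)
only_other = blend_sstats(old, 100, new, 10, rhot=1.0)
```

With rhot=0.0 the result equals the old statistics, and with rhot=1.0 it equals the new statistics stretched to 100 documents, matching the description of rhot above.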
blend2(rhot, other, targetsize=None)

Merge the current state with another one using a weighted sum for the sufficient statistics.

In contrast to blend(), the sufficient statistics are not scaled prior to aggregation.

Parameters:
  • rhot (float) – Unused.
  • other (LdaState) – The state object with which the current one will be merged.
  • targetsize (int, optional) – The number of documents to stretch both states to.
get_Elogbeta()

Get the log (posterior) probabilities for each topic.

Returns: Posterior probabilities for each topic.
Return type: numpy.ndarray
get_lambda()

Get the parameters of the posterior over the topics, also referred to as “the topics”.

Returns: Parameters of the posterior probability over topics.
Return type: numpy.ndarray
load(fname, *args, **kwargs)

Load a previously stored state from disk.

Overrides load by enforcing the dtype parameter to ensure backwards compatibility.

Parameters:
  • fname (str) – Path to file that contains the needed object.
  • args (object) – Positional parameters to be propagated to gensim.utils.SaveLoad.load()
  • kwargs (object) – Key-word parameters to be propagated to gensim.utils.SaveLoad.load()
Returns:

The state loaded from the given file.

Return type:

LdaState

merge(other)

Merge the result of an E step from one node with that of another node (summing up sufficient statistics).

The merging is trivial and after merging all cluster nodes, we have the exact same result as if the computation was run on a single node (no approximation).

Parameters: other (LdaState) – The state object with which the current one will be merged.
reset()

Prepare the state for a new EM iteration (reset sufficient stats).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
gensim.models.atmodel.construct_author2doc(doc2author)

Make a mapping from author IDs to document IDs.

Parameters: doc2author (dict of (int, list of str)) – Mapping of document id to authors.
Returns: Mapping of authors to document ids.
Return type: dict of (str, list of int)
gensim.models.atmodel.construct_doc2author(corpus, author2doc)

Create a mapping from document IDs to author IDs.

Parameters:
  • corpus (iterable of list of (int, float)) – Corpus in BoW format.
  • author2doc (dict of (str, list of int)) – Mapping of authors to documents.
Returns:

Document to Author mapping.

Return type:

dict of (int, list of str)
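The relationship between the two mappings can be illustrated with a standalone sketch that inverts one dictionary into the other. The helpers `invert_author2doc` and `invert_doc2author` are hypothetical and are not gensim's implementations; in particular, construct_doc2author() also takes the corpus, which this sketch does not use.

```python
def invert_author2doc(author2doc):
    """Build a document ID -> author names mapping from author -> document IDs."""
    doc2author = {}
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            doc2author.setdefault(doc_id, []).append(author)
    return doc2author


def invert_doc2author(doc2author):
    """Build an author -> document IDs mapping from document ID -> author names."""
    author2doc = {}
    for doc_id, authors in doc2author.items():
        for author in authors:
            author2doc.setdefault(author, []).append(doc_id)
    return author2doc


author2doc = {'john': [0, 1, 2], 'jane': [2, 3], 'jack': [0, 3]}
doc2author = invert_author2doc(author2doc)
round_trip = invert_doc2author(doc2author)
```

Inverting twice recovers the original mapping, which is why only one of author2doc or doc2author needs to be supplied by hand.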