sklearn_api.atmodel – Scikit learn wrapper for Author-topic model

Scikit learn interface for AuthorTopicModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_texts, common_dictionary, common_corpus
>>> from gensim.sklearn_api.atmodel import AuthorTopicTransformer
>>>
>>> # Pass a mapping from authors to the documents they contributed to.
>>> author2doc = {
...     'john': [0, 1, 2, 3, 4, 5, 6],
...     'jane': [2, 3, 4, 5, 6, 7, 8],
...     'jack': [0, 2, 4, 6, 8]
... }
>>>
>>> # Let's use the model to discover 2 different topics.
>>> model = AuthorTopicTransformer(id2word=common_dictionary, author2doc=author2doc, num_topics=2, passes=100)
>>>
>>> # To which of those 2 topics does jack mostly contribute?
>>> topic_dist = model.fit(common_corpus).transform('jack')
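
The fitted model can also be used to compare authors by their topic distributions. A minimal sketch (assuming transform returns one row per author, so row 0 is the distribution itself; scipy is used for the cosine distance):

>>> from scipy.spatial.distance import cosine
>>>
>>> # Row 0 holds each author's distribution over the 2 topics.
>>> john_topics = model.transform('john')[0]
>>> jane_topics = model.transform('jane')[0]
>>> # Cosine similarity close to 1 means the authors write about similar topics.
>>> similarity = 1 - cosine(john_topics, jane_topics)
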
class gensim.sklearn_api.atmodel.AuthorTopicTransformer(num_topics=100, id2word=None, author2doc=None, doc2author=None, chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, gamma_threshold=0.001, serialized=False, serialization_path=None, minimum_probability=0.01, random_state=None)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Author Topic module, wraps AuthorTopicModel.

The model’s internal workings are heavily based on “The Author-Topic Model for Authors and Documents”, Rosen-Zvi et al., 2004.

Parameters:
  • num_topics (int, optional) – Number of requested latent topics to be extracted from the training corpus.
  • id2word (Dictionary, optional) – Mapping from a word’s ID to the word itself. Used to determine the vocabulary size, as well as for debugging and topic printing.
  • author2doc (dict of (str, list of int), optional) – Maps an author’s name to a list of IDs of the documents they contributed to. Either author2doc or doc2author must be supplied.
  • doc2author (dict of (int, list of str)) – Maps a document (using its ID) to a list of author names that contributed to it. Either author2doc or doc2author must be supplied.
  • chunksize (int, optional) – Number of documents to be processed by the model in each mini-batch.
  • passes (int, optional) – Number of times the model can make a pass over the corpus during training.
  • iterations (int, optional) – Maximum number of times the model loops over each document before convergence during the M step of the EM algorithm.
  • decay (float, optional) –

    A number in the range (0.5, 1] that weights what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from “The Author-Topic Model for Authors and Documents”, Rosen-Zvi et al., 2004.

  • offset (float, optional) –

    Hyper-parameter that controls how much we will slow down the first steps of the first few iterations. Corresponds to Tau_0 from “The Author-Topic Model for Authors and Documents”, Rosen-Zvi et al., 2004.

  • alpha ({np.ndarray, str}, optional) –

    Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief about each topic’s probability. Alternatively, default prior-selecting strategies can be employed by supplying a string:

    • ’symmetric’: Uses a fixed symmetric prior of 1.0 / num_topics (the default).
    • ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.
    • ’default’: Learns an asymmetric prior from the corpus.
  • eta ({float, np.array, str}, optional) –

    A-priori belief on word probability. This can be:

    • scalar for a symmetric prior over topic/word probability,
    • vector of length num_words to denote an asymmetric user-defined probability for each word,
    • matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,
    • the string ‘auto’ to learn the asymmetric prior from the data.
  • update_every (int, optional) – Number of mini-batches between each model update.
  • eval_every (int, optional) – Number of updates between two log perplexity estimates. Set to None to disable perplexity estimation.
  • gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.
  • serialized (bool, optional) – Indicates whether the input corpora to the model are simple in-memory lists (serialized = False) or saved to the hard drive (serialized = True). Note that this behaviour is quite different from other Gensim models. If your data is too large to fit into memory, use this functionality; see the sketch after this parameter list.
  • serialization_path (str, optional) – Path to the file used for storing the serialized object; must be supplied if serialized = True. An existing file cannot be overwritten; either delete the old file or choose a different name.
  • minimum_probability (float, optional) – Topics with a probability lower than this threshold will be filtered out.
  • random_state ({np.random.RandomState, int}, optional) – Either a randomState object or a seed to generate one. Useful for reproducibility.
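
As noted for serialized above, corpora too large for main memory can be serialized to disk during training. A minimal sketch (the serialization path below is hypothetical, and any existing file at that location must be removed first):

>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.sklearn_api.atmodel import AuthorTopicTransformer
>>>
>>> model = AuthorTopicTransformer(
...     id2word=common_dictionary,
...     author2doc={'john': [0, 1, 2, 3, 4], 'jane': [5, 6, 7, 8]},
...     num_topics=2,
...     serialized=True,
...     serialization_path='/tmp/serialized_corpus.mm',  # hypothetical path
... )
>>> model = model.fit(common_corpus)
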
fit(X, y=None)

Fit the model according to the given training data.

Parameters: X (iterable of list of (int, number)) – Sequence of documents in BoW format.
Returns: The trained model.
Return type: AuthorTopicTransformer
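
For example, reusing common_dictionary, common_corpus and the author2doc mapping from the module-level example above (fit returns the transformer itself, so calls can be chained):

>>> model = AuthorTopicTransformer(id2word=common_dictionary, author2doc=author2doc, num_topics=2)
>>> model = model.fit(common_corpus)
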
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns: X_new – Transformed array.
Return type: numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
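
For example (standard scikit-learn behaviour):

>>> model = AuthorTopicTransformer(num_topics=2)
>>> model.get_params()['num_topics']
2
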
partial_fit(X, author2doc=None, doc2author=None)

Train model over a potentially incomplete set of documents.

This method can be used in two ways:

  • On an unfitted model, in which case the model is initialized and trained on X.
  • On an already fitted model, in which case the model is updated by X.

Parameters:
  • X (iterable of list of (int, number)) – Sequence of documents in BoW format.
  • author2doc (dict of (str, list of int), optional) – Maps an author’s name to a list of IDs of the documents they contributed to. Either author2doc or doc2author must be supplied.
  • doc2author (dict of (int, list of str)) – Maps a document (using its ID) to a list of author names that contributed to it. Either author2doc or doc2author must be supplied.
Returns: The trained model.
Return type: AuthorTopicTransformer
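
A minimal sketch of incremental training on two batches of gensim’s bundled test corpus, assuming the document IDs in each author2doc mapping index into the batch passed in that same call:

>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.sklearn_api.atmodel import AuthorTopicTransformer
>>>
>>> model = AuthorTopicTransformer(id2word=common_dictionary, num_topics=2)
>>> # First call: the model is initialized and trained on the first batch.
>>> model = model.partial_fit(common_corpus[:5], author2doc={'john': [0, 1, 2], 'jane': [3, 4]})
>>> # Second call: the already fitted model is updated with the remaining documents.
>>> model = model.partial_fit(common_corpus[5:], author2doc={'jack': [0, 1, 2, 3]})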

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self
Return type: AuthorTopicTransformer
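
For example, updating two parameters at once (set_params returns the estimator, so the call can be chained):

>>> model = AuthorTopicTransformer(num_topics=2)
>>> model = model.set_params(num_topics=5, passes=10)
>>> model.get_params()['num_topics']
5
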
transform(author_names)

Infer the topic probabilities for each author.

Parameters: author_names ({iterable of str, str}) – Author name or sequence of author names whose topics will be identified.
Returns: Topic distribution for each input author.
Return type: numpy.ndarray
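
Continuing the module-level example above, both a single name and a sequence of names are accepted (a sketch; the model is assumed to be fitted already):

>>> # One author: the distribution over the model's topics.
>>> jack_topics = model.transform('jack')
>>> # Several authors at once: one distribution per author.
>>> all_topics = model.transform(['john', 'jane', 'jack'])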