sklearn_api.ldamodel – Scikit learn wrapper for Latent Dirichlet Allocation

Scikit-learn interface for LdaModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.


>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>> # Reduce each document to 2 dimensions (topics) using the sklearn interface.
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary, iterations=20, random_state=1)
>>> docvecs = model.fit_transform(common_corpus)
class gensim.sklearn_api.ldamodel.LdaTransformer(num_topics=100, id2word=None, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, scorer='perplexity', dtype=numpy.float32)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LDA module, wraps LdaModel.

The inner workings of this class depend heavily on Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation”, NIPS’10, and David M. Blei, Andrew Y. Ng, Michael I. Jordan: “Latent Dirichlet Allocation”.

  • num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

  • id2word (Dictionary, optional) – Mapping from integer ID to words in the corpus. Used to determine vocabulary size and logging.

  • chunksize (int, optional) – Number of documents in batch.

  • passes (int, optional) – Number of passes through the corpus during training.

  • update_every (int, optional) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.

  • alpha ({np.ndarray, str}, optional) –

    Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief for each topic’s probability. Alternatively, default prior-selecting strategies can be employed by supplying a string:

    • ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.

    • ’auto’: Learns an asymmetric prior from the corpus.

  • eta ({float, np.array, str}, optional) –

    A-priori belief on word probability, this can be:

    • scalar for a symmetric prior over topic/word probability,

    • vector of length num_words to denote an asymmetric user defined probability for each word,

    • matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

    • the string ‘auto’ to learn the asymmetric prior from the data.

  • decay (float, optional) –

    A number in the interval (0.5, 1] that weights what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation”, NIPS’10.

  • offset (float, optional) –

    Hyper-parameter that controls how much the first few iterations are slowed down. Corresponds to Tau_0 from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation”, NIPS’10.

  • eval_every (int, optional) – Log perplexity is estimated every eval_every updates. Setting this to 1 slows down training by ~2x.

  • iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

  • gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.

  • minimum_probability (float, optional) – Topics with a probability lower than this threshold will be filtered out.

  • random_state ({np.random.RandomState, int}, optional) – Either a RandomState object or a seed to generate one. Useful for reproducibility.

  • scorer (str, optional) –

    Method to compute a score reflecting how well the model has fit the input corpus. Allowed values are:

    • ’perplexity’: Perplexity of the language model.

    • ’u_mass’: Uses CoherenceModel to compute the topic coherence.

  • dtype ({numpy.float16, numpy.float32, numpy.float64}, optional) – Data-type to use during calculations inside model. All inputs are also converted.
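The decay and offset parameters above are described as Kappa and Tau_0 from Hoffman et al. In that paper the weight given to the t-th update follows the schedule rho_t = (tau_0 + t)^(-kappa). The following is a minimal sketch of that formula only; the function name step_size is hypothetical, and the exact update counter gensim uses internally may differ.

```python
def step_size(t, decay=1.0, offset=1.0):
    """Weight of the t-th online update: (offset + t) ** -decay.

    `decay` plays the role of kappa and `offset` the role of tau_0.
    This illustrates the schedule from the paper; it is not gensim code.
    """
    return (offset + t) ** (-decay)


# A larger offset shrinks the early step sizes, which slows down
# the first few updates.
early_default = [step_size(t) for t in range(3)]
early_damped = [step_size(t, offset=64.0) for t in range(3)]

print(step_size(0))  # 1.0
print(step_size(1))  # 0.5
print(step_size(3))  # 0.25
print(early_damped[0] < early_default[0])  # True
```

A decay closer to 0.5 makes the weights shrink more slowly, so later chunks keep more influence on the topics.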


Configure passes and update_every params to choose the mode among:
  • online (single-pass): update_every != None and passes == 1

  • online (multi-pass): update_every != None and passes > 1

  • batch: update_every == None

By default, ‘online (single-pass)’ mode is used for training the LDA model.
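The mode selection described above can be sketched as a small helper. The function training_mode is hypothetical and not part of gensim. It assumes that any falsy update_every (None or 0) selects batch mode, which reconciles the list above with the update_every parameter description ("Set to 0 for batch learning").

```python
def training_mode(update_every, passes):
    """Name the training mode implied by `update_every` and `passes`.

    Assumption: both None and 0 for `update_every` mean batch learning.
    """
    if not update_every:
        return "batch"
    if passes == 1:
        return "online (single-pass)"
    return "online (multi-pass)"


# The constructor defaults are update_every=1 and passes=1.
print(training_mode(update_every=1, passes=1))     # online (single-pass)
print(training_mode(update_every=1, passes=5))     # online (multi-pass)
print(training_mode(update_every=None, passes=5))  # batch
```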

fit(X, y=None)

Fit the model according to the given training data.


X ({iterable of iterable of (int, int), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
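For readers unfamiliar with the BOW format, the sketch below builds it by hand from tokenized text: each document becomes a list of (token_id, count) pairs. The helper to_bow and the sample documents are hypothetical; in practice gensim's Dictionary.doc2bow produces this representation.

```python
from collections import Counter


def to_bow(tokenized_docs):
    """Convert tokenized documents to lists of (token_id, count) pairs.

    Token ids are assigned in order of first appearance. This is a
    stand-in for illustration, not gensim's Dictionary implementation.
    """
    token2id = {}
    corpus = []
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
        counts = Counter(token2id[token] for token in tokens)
        corpus.append(sorted(counts.items()))
    return corpus, token2id


docs = [["human", "computer", "human"], ["computer", "graph"]]
corpus, token2id = to_bow(docs)
print(token2id)  # {'human': 0, 'computer': 1, 'graph': 2}
print(corpus)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```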


The trained model.

Return type

LdaTransformer


fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.


X_new – Transformed array.

Return type

numpy array of shape [n_samples, n_features_new]


get_params(deep=True)

Get parameters for this estimator.


deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.


params – Parameter names mapped to their values.

Return type

mapping of string to any


partial_fit(X)

Train model over a potentially incomplete set of documents.

Uses the parameters set in the constructor. This method can be used in two ways:

  • On an unfitted model, in which case the model is initialized and trained on X.

  • On an already fitted model, in which case the model is updated by X.


X ({iterable of iterable of (int, int), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.


The trained model.

Return type

LdaTransformer


score(X, y=None)

Compute a score reflecting how well the model fits the input data.

The scoring method is set using the scorer argument in LdaTransformer(). Higher score is better.
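Perplexity is normally a lower-is-better quantity, so a "higher is better" score has to flip its sign. The sketch below shows one way this could work, assuming the 'perplexity' scorer returns the negated perplexity and that perplexity is computed as 2 raised to the negative per-word bound. The function name and the numbers are hypothetical and not taken from the gensim source.

```python
def negative_perplexity(total_bound, total_words):
    """Return -perplexity so that a higher value means a better fit.

    Assumption: perplexity = 2 ** (-per-word bound), where the bound is
    a (negative) log-likelihood estimate summed over the corpus.
    """
    per_word_bound = total_bound / total_words
    return -(2.0 ** (-per_word_bound))


# Hypothetical bounds for two models scored on the same 3-word corpus.
better = negative_perplexity(total_bound=-18.0, total_words=3)
worse = negative_perplexity(total_bound=-24.0, total_words=3)
print(better)          # -64.0
print(worse)           # -256.0
print(better > worse)  # True
```

Under this convention, model selection tools that maximize score() end up choosing the model with the lowest perplexity.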


X (iterable of list of (int, number)) – Sequence of documents in BOW format.


The score computed based on the selected method.

Return type

float



set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
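The <component>__<parameter> naming convention can be illustrated without scikit-learn itself. The sketch below splits flat keyword arguments at the first double underscore and groups them by component. The helper group_params, the step name "lda", and the "verbose" key are all hypothetical; this mimics the convention and is not scikit-learn's implementation.

```python
def group_params(flat_params):
    """Split flat parameters into own and per-component parameters.

    Keys without "__" belong to the estimator itself; keys of the form
    "<component>__<parameter>" are routed to the named component.
    """
    own, nested = {}, {}
    for key, value in flat_params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = group_params(
    {"verbose": True, "lda__num_topics": 5, "lda__passes": 2}
)
print(own)     # {'verbose': True}
print(nested)  # {'lda': {'num_topics': 5, 'passes': 2}}
```

With a real pipeline whose step is named "lda", a call such as set_params(lda__num_topics=5) would be expected to update that step's num_topics.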


Return type

self



transform(docs)

Infer the topic distribution for docs.


docs ({iterable of list of (int, number), list of (int, number)}) – Document or sequence of documents in BoW format.


The topic distribution for each input document.

Return type

numpy.ndarray of shape [len(docs), num_topics]
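The returned array has one row per document and one column per topic. The sketch below shows how sparse per-document topic lists map onto that dense layout, assuming that topics filtered out by minimum_probability show up as 0.0. The helper sparse_to_dense and the sample values are hypothetical, and plain lists stand in for the numpy array.

```python
def sparse_to_dense(sparse_doc, num_topics):
    """Expand [(topic_id, probability), ...] into a row of length num_topics."""
    row = [0.0] * num_topics
    for topic_id, prob in sparse_doc:
        row[topic_id] = prob
    return row


num_topics = 2
# Hypothetical sparse output for two documents; the second document's
# topic 0 is assumed to have fallen below minimum_probability.
sparse_docs = [[(0, 0.9), (1, 0.1)], [(1, 0.995)]]
dense = [sparse_to_dense(doc, num_topics) for doc in sparse_docs]

print(dense)                      # [[0.9, 0.1], [0.0, 0.995]]
print(len(dense), len(dense[0]))  # 2 2
```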