sklearn_api.ldamodel – Scikit learn wrapper for Latent Dirichlet Allocation

Scikit learn interface for LdaModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> # Reduce each document to 2 dimensions (topics) using the sklearn interface.
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary, iterations=20, random_state=1)
>>> docvecs = model.fit_transform(common_corpus)
class gensim.sklearn_api.ldamodel.LdaTransformer(num_topics=100, id2word=None, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, scorer='perplexity', dtype=numpy.float32)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LDA module, wraps LdaModel.

The inner workings of this class depend heavily on Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS‘10” and David M. Blei, Andrew Y. Ng, Michael I. Jordan: “Latent Dirichlet Allocation”.

Parameters:
  • num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.
  • id2word (Dictionary, optional) – Mapping from integer ID to words in the corpus. Used to determine vocabulary size and logging.
  • chunksize (int, optional) – Number of documents in batch.
  • passes (int, optional) – Number of passes through the corpus during training.
  • update_every (int, optional) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.
  • alpha ({np.ndarray, str}, optional) –

    Can be set to a 1D array of length equal to the number of expected topics, expressing our a-priori belief about each topic’s probability. Alternatively, default prior-selecting strategies can be employed by supplying a string (see the sketch after this parameter list):

    • ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.
    • ’default’: Learns an asymmetric prior from the corpus.
  • eta ({float, np.array, str}, optional) –

    A-priori belief on word probability, this can be:

    • scalar for a symmetric prior over topic/word probability,
    • vector of length num_words to denote an asymmetric user defined probability for each word,
    • matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,
    • the string ‘auto’ to learn the asymmetric prior from the data.
  • decay (float, optional) –

    A number in the range (0.5, 1] weighting what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS‘10”.

  • offset (float, optional) –

    Hyper-parameter that controls how much we slow down the first steps of the first few iterations. Corresponds to Tau_0 from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS‘10”.

  • eval_every (int, optional) – Log perplexity is estimated every eval_every updates; setting this to 1 slows down training by ~2x.
  • iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution.
  • gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.
  • minimum_probability (float, optional) – Topics with a probability lower than this threshold will be filtered out.
  • random_state ({np.random.RandomState, int}, optional) – Either a randomState object or a seed to generate one. Useful for reproducibility.
  • scorer (str, optional) –
    Method to compute a score reflecting how well the model has fit the input corpus. Allowed values are:
    • ’perplexity’: Perplexity of the language model.
    • ’u_mass’: Use CoherenceModel to compute the topics’ coherence.
  • dtype ({numpy.float16, numpy.float32, numpy.float64}, optional) – Data-type to use during calculations inside model. All inputs are also converted.
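
For instance, both priors can be set explicitly. A minimal sketch using the bundled toy corpus; the alpha values are purely illustrative:

>>> import numpy as np
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> # Explicit asymmetric document-topic prior (one weight per topic, illustrative values),
>>> # combined with a word prior learned from the data.
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary,
...                        alpha=np.array([0.7, 0.3]), eta='auto')
>>> docvecs = model.fit_transform(common_corpus)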

Notes

Configure the passes and update_every parameters to choose among the training modes:
  • online (single-pass): update_every != None and passes == 1
  • online (multi-pass): update_every != None and passes > 1
  • batch: update_every == None

By default, ‘online (single-pass)’ mode is used for training the LDA model.
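
For example, batch training on the toy corpus (a minimal sketch):

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> # Batch mode: disable online updates and take several passes over the corpus.
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary,
...                        update_every=None, passes=5)
>>> docvecs = model.fit_transform(common_corpus)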

fit(X, y=None)

Fit the model according to the given training data.

Parameters:X ({iterable of iterable of (int, int), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
Returns:The trained model.
Return type:LdaTransformer
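
A minimal usage sketch with the bundled toy corpus:

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary, random_state=1)
>>> model = model.fit(common_corpus)  # returns the fitted LdaTransformer itself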
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:X_new – Transformed array.
Return type:numpy array of shape [n_samples, n_features_new]
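
Because the wrapper follows scikit-learn conventions, it can also be chained with other estimators in a Pipeline. A sketch assuming scikit-learn’s LogisticRegression as the downstream classifier; the labels are purely illustrative:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> # Illustrative binary labels, one per document in the toy corpus.
>>> labels = [i % 2 for i in range(len(common_corpus))]
>>> pipe = Pipeline([
...     ('features', LdaTransformer(num_topics=2, id2word=common_dictionary, random_state=1)),
...     ('classifier', LogisticRegression()),
... ])
>>> pipe = pipe.fit(common_corpus, labels)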

get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
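
For example:

>>> from gensim.test.utils import common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary)
>>> model.get_params()['num_topics']
2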
partial_fit(X)

Train model over a potentially incomplete set of documents.

Uses the parameters set in the constructor. This method can be used in two ways:
  • On an unfitted model, in which case the model is initialized and trained on X.
  • On an already fitted model, in which case the model is updated by X.

Parameters:X ({iterable of iterable of (int, int), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
Returns:The trained model.
Return type:LdaTransformer
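
A sketch of incremental training, feeding the toy corpus in two chunks:

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary)
>>> model = model.partial_fit(common_corpus[:4])  # initializes and trains the model
>>> model = model.partial_fit(common_corpus[4:])  # updates the already fitted model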
score(X, y=None)

Compute a score reflecting how well the model has fit the input data.

The scoring method is set using the scorer argument in LdaTransformer(). Higher score is better.

Parameters:X (iterable of list of (int, number)) – Sequence of documents in BOW format.
Returns:The score computed based on the selected method.
Return type:float
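
For example, scoring the training corpus itself with the default perplexity scorer (a minimal sketch):

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary, scorer='perplexity')
>>> model = model.fit(common_corpus)
>>> score = model.score(common_corpus)  # a float; higher is better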
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:The estimator instance itself.
Return type:self
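
For example, changing hyperparameters before refitting (a sketch; the new values are illustrative):

>>> from gensim.test.utils import common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary)
>>> model = model.set_params(num_topics=5, iterations=100)  # takes effect on the next fit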
transform(docs)

Infer the topic distribution for docs.

Parameters:docs ({iterable of list of (int, number), list of (int, number)}) – Document or sequence of documents in BoW format.
Returns:The topic distribution for each input document.
Return type:numpy.ndarray of shape [len(docs), num_topics]
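
A sketch of inferring the topic distribution of an unseen document (the token IDs are illustrative):

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LdaTransformer
>>>
>>> model = LdaTransformer(num_topics=2, id2word=common_dictionary).fit(common_corpus)
>>> unseen = [(0, 1), (2, 1), (5, 2)]  # a single document in BoW format
>>> topic_dist = model.transform(unseen)  # numpy.ndarray of shape (1, 2)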