sklearn_api.ldaseqmodel – Scikit learn wrapper for LdaSeq model

Scikit learn interface for LdaSeqModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api.ldaseqmodel import LdaSeqTransformer
>>>
>>> # Create a sequential LDA transformer to extract 2 topics from the common corpus.
>>> # Divide the work into 3 unequal time slices.
>>> model = LdaSeqTransformer(id2word=common_dictionary, num_topics=2, time_slice=[3, 4, 2], initialize='gensim')
>>>
>>> # Each document almost entirely belongs to one of the two topics.
>>> transformed_corpus = model.fit_transform(common_corpus)
class gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer(time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Sequential LDA module; wraps LdaSeqModel.

For more information take a look at David M. Blei, John D. Lafferty: “Dynamic Topic Models”.

Parameters:
  • time_slice (list of int, optional) – Number of documents in each time-slice.
  • id2word (Dictionary, optional) – Mapping from an ID to the word it represents in the vocabulary.
  • alphas (float, optional) – The prior probability of each topic.
  • num_topics (int, optional) – Number of latent topics to be discovered in the corpus.
  • initialize ({'gensim', 'own', 'lda_model'}, optional) –
    Controls the initialization of the DTM model. Supports three different modes:
    • 'gensim': Uses gensim's own LDA initialization.
    • 'own': Uses your own initialization matrix of an LDA model that has been previously trained.
    • 'lda_model': Uses a previously trained LDA model, passed in through the lda_model argument.
  • sstats (np.ndarray of shape [vocab_len, num_topics], optional) – If initialize is set to 'own', this will be used to initialize the DTM model.
  • lda_model (LdaModel, optional) – If initialize is set to 'lda_model', this object will be used to create the sstats initialization matrix.
  • obs_variance (float, optional) – Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: "Dynamic Topic Models".
  • chain_variance (float, optional) – Gaussian parameter, defined in the beta distribution, that dictates how the beta values evolve over time.
  • passes (int, optional) – Number of passes over the corpus for the initial LdaModel.
  • random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.
  • lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.
  • em_min_iter (int, optional) – Minimum number of iterations until convergence of the Expectation-Maximization algorithm.
  • em_max_iter (int, optional) – Maximum number of iterations until convergence of the Expectation-Maximization algorithm.
  • chunksize (int, optional) – Number of documents in the corpus to be processed in each chunk.
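The time_slice parameter assumes the corpus is ordered chronologically: each entry gives the number of consecutive documents belonging to that epoch, so the entries must sum to the corpus length. A minimal sketch of that partitioning (plain Python, independent of gensim's internals):

```python
# Partition a chronologically ordered corpus according to time_slice.
# Each entry in time_slice is the number of consecutive documents
# belonging to one epoch, so sum(time_slice) must equal len(corpus).

def split_by_time_slice(corpus, time_slice):
    assert sum(time_slice) == len(corpus), "slices must cover the corpus exactly"
    slices, start = [], 0
    for count in time_slice:
        slices.append(corpus[start:start + count])
        start += count
    return slices

# Nine documents, split 3 / 4 / 2 as in the example above.
docs = ["doc%d" % i for i in range(9)]
epochs = split_by_time_slice(docs, [3, 4, 2])
print([len(e) for e in epochs])  # → [3, 4, 2]
```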
fit(X, y=None)

Fit the model according to the given training data.

Parameters:X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
Returns:The trained model.
Return type:LdaSeqTransformer
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:X_new – Transformed array.
Return type:numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:The estimator instance itself.
Return type:LdaSeqTransformer
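The nested `<component>__<parameter>` convention can be illustrated with a small stand-alone sketch (plain Python; this is not gensim's or scikit-learn's actual implementation, and the class names are purely hypothetical):

```python
# Sketch of scikit-learn's nested-parameter naming: a name such as
# 'lda__num_topics' is split at the first '__'; the prefix selects the
# nested component, and the remainder is applied to it recursively.

def set_nested_param(obj, name, value):
    if "__" in name:
        component, remainder = name.split("__", 1)
        set_nested_param(getattr(obj, component), remainder, value)
    else:
        setattr(obj, name, value)

class Inner:                      # hypothetical nested estimator
    def __init__(self):
        self.num_topics = 10

class Outer:                      # hypothetical container, e.g. a pipeline
    def __init__(self):
        self.lda = Inner()

pipeline = Outer()
set_nested_param(pipeline, "lda__num_topics", 5)
print(pipeline.lda.num_topics)  # → 5
```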
transform(docs)

Infer the topic distribution for docs.

Parameters:docs ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format to be transformed.
Returns:The topic representation of each document.
Return type:numpy.ndarray of shape [len(docs), num_topics]