sklearn_api.lsimodel – Scikit learn wrapper for Latent Semantic Indexing

Scikit-learn interface for gensim.models.lsimodel.LsiModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.


Integrate with sklearn Pipelines:

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import linear_model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LsiTransformer
>>>
>>> # Create stages for our pipeline (including gensim and sklearn models alike).
>>> model = LsiTransformer(num_topics=15, id2word=common_dictionary)
>>> clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
>>> pipe = Pipeline([('features', model), ('classifier', clf)])
>>>
>>> # Create some random binary labels for our documents.
>>> labels = np.random.choice([0, 1], len(common_corpus))
>>>
>>> # How well does our pipeline perform on the training set?
>>> score = pipe.fit(common_corpus, labels).score(common_corpus, labels)

class gensim.sklearn_api.lsimodel.LsiTransformer(num_topics=200, id2word=None, chunksize=20000, decay=1.0, onepass=True, power_iters=2, extra_samples=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LSI module, wraps LsiModel.

For more information, see Latent semantic analysis.

Parameters:
  • num_topics (int, optional) – Number of requested factors (latent dimensions).
  • id2word (Dictionary, optional) – Mapping from term IDs to words.
  • chunksize (int, optional) – Number of documents to be used in each training chunk.
  • decay (float, optional) – Weight of existing observations relative to new ones.
  • onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.
  • power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but slows down training.
  • extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
fit(X, y=None)

Fit the model according to the given training data.

Parameters:X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format to be transformed.
Returns:The trained model.
Return type:LsiTransformer
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:X_new – Transformed array.
Return type:numpy array of shape [n_samples, n_features_new]


get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any

partial_fit(X)

Train model over a potentially incomplete set of documents.

This method can be used in two ways:
  1. On an unfitted model in which case the model is initialized and trained on X.
  2. On an already fitted model in which case the model is further trained on X.
Parameters:X ({iterable of list of (int, number), scipy.sparse matrix}) – Stream of document vectors or sparse matrix of shape: [num_terms, num_documents].
Returns:The trained model.
Return type:LsiTransformer

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Return type:self

transform(docs)

Compute the latent factors for docs.

Parameters:docs ({iterable of list of (int, number), list of (int, number), scipy.sparse matrix}) – Document or collection of documents in BOW format to be transformed.
Returns:Topic distribution matrix.
Return type:numpy.ndarray of shape [len(docs), num_topics]