sklearn_api.lsimodel – Scikit learn wrapper for Latent Semantic Indexing

Scikit learn interface for gensim.models.lsimodel.LsiModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

Integrate with sklearn Pipelines:

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import linear_model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LsiTransformer
>>>
>>> # Create stages for our pipeline (including gensim and sklearn models alike).
>>> model = LsiTransformer(num_topics=15, id2word=common_dictionary)
>>> clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
>>> pipe = Pipeline([('features', model,), ('classifier', clf)])
>>>
>>> # Create some random binary labels for our documents.
>>> labels = np.random.choice([0, 1], len(common_corpus))
>>>
>>> # How well does our pipeline perform on the training set?
>>> score = pipe.fit(common_corpus, labels).score(common_corpus, labels)

class gensim.sklearn_api.lsimodel.LsiTransformer(num_topics=200, id2word=None, chunksize=20000, decay=1.0, onepass=True, power_iters=2, extra_samples=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LSI module, wraps LsiModel.

For more information, see Latent semantic analysis.

Parameters
  • num_topics (int, optional) – Number of requested factors (latent dimensions).

  • id2word (Dictionary, optional) – Mapping from token IDs to words.

  • chunksize (int, optional) – Number of documents to be used in each training chunk.

  • decay (float, optional) – Weight of existing observations relative to new ones.

  • onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.

  • power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but slows down training.

  • extra_samples (int, optional) – Extra samples to be used in addition to the requested rank k (num_topics). Can improve accuracy.

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training.

Returns

The trained model.

Return type

LsiTransformer

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns

X_new – Transformed array.

Return type

numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

partial_fit(X)

Train model over a potentially incomplete set of documents.

This method can be used in two ways:
  1. On an unfitted model, in which case the model is initialized and trained on X.

  2. On an already fitted model, in which case the model is further trained on X.

Parameters

X ({iterable of list of (int, number), scipy.sparse matrix}) – Stream of document vectors or sparse matrix of shape: [num_terms, num_documents].

Returns

The trained model.

Return type

LsiTransformer

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

The estimator instance.

Return type

self

transform(docs)

Computes the latent factors for docs.

Parameters

docs ({iterable of list of (int, number), list of (int, number), scipy.sparse matrix}) – Document or collection of documents in BOW format to be transformed.

Returns

Matrix of latent factor (topic) weights, one row per document.

Return type

numpy.ndarray of shape [len(docs), num_topics]