sklearn_api.lsimodel – Scikit-learn wrapper for Latent Semantic Indexing

Scikit-learn interface for gensim.models.lsimodel.LsiModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
Integrate with sklearn Pipelines:
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import linear_model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LsiTransformer
>>>
>>> # Create stages for our pipeline (including gensim and sklearn models alike).
>>> model = LsiTransformer(num_topics=15, id2word=common_dictionary)
>>> clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
>>> pipe = Pipeline([('features', model,), ('classifier', clf)])
>>>
>>> # Create some random binary labels for our documents.
>>> labels = np.random.choice([0, 1], len(common_corpus))
>>>
>>> # How well does our pipeline perform on the training set?
>>> score = pipe.fit(common_corpus, labels).score(common_corpus, labels)
class gensim.sklearn_api.lsimodel.LsiTransformer(num_topics=200, id2word=None, chunksize=20000, decay=1.0, onepass=True, power_iters=2, extra_samples=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LSI module, wraps LsiModel.
For more information, see Latent semantic analysis.
Parameters
num_topics (int, optional) – Number of requested factors (latent dimensions).
id2word (Dictionary, optional) – Mapping from token IDs to words.
chunksize (int, optional) – Number of documents to be used in each training chunk.
decay (float, optional) – Weight of existing observations relative to new ones.
onepass (bool, optional) – Whether the one-pass algorithm should be used for training; pass False to force a multi-pass stochastic algorithm.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but lowers performance.
extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
fit(X, y=None)
Fit the model according to the given training data.

Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.

Returns
LsiTransformer – The trained model.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.

Returns
X_new (numpy array of shape [n_samples, n_features_new]) – Transformed array.
get_params(deep=True)
Get parameters for this estimator.

Parameters
deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params (mapping of string to any) – Parameter names mapped to their values.
partial_fit(X)
Train the model over a potentially incomplete set of documents.

This method can be used in two ways:
* On an unfitted model, in which case the model is initialized and trained on X.
* On an already fitted model, in which case the model is further trained on X.

Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – Stream of document vectors or a sparse matrix of shape [num_terms, num_documents].

Returns
LsiTransformer – The trained model.
set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns
self
transform(docs)
Computes the latent factors for docs.

Parameters
docs ({iterable of list of (int, number), list of (int, number), scipy.sparse matrix}) – Document or collection of documents in BOW format to be transformed.

Returns
numpy.ndarray of shape [len(docs), num_topics] – Topic distribution matrix.