sklearn_api.lsimodel – Scikit-learn wrapper for Latent Semantic Indexing

Scikit-learn interface for gensim.models.lsimodel.LsiModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
Integrate with sklearn Pipelines:
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import linear_model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LsiTransformer
>>>
>>> # Create stages for our pipeline (including gensim and sklearn models alike).
>>> model = LsiTransformer(num_topics=15, id2word=common_dictionary)
>>> clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
>>> pipe = Pipeline([('features', model,), ('classifier', clf)])
>>>
>>> # Create some random binary labels for our documents.
>>> labels = np.random.choice([0, 1], len(common_corpus))
>>>
>>> # How well does our pipeline perform on the training set?
>>> score = pipe.fit(common_corpus, labels).score(common_corpus, labels)
class gensim.sklearn_api.lsimodel.LsiTransformer(num_topics=200, id2word=None, chunksize=20000, decay=1.0, onepass=True, power_iters=2, extra_samples=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base LSI module, wraps LsiModel.
For more information, see Latent semantic analysis.
Parameters
num_topics (int, optional) – Number of requested factors (latent dimensions).
id2word (Dictionary, optional) – Mapping from token IDs to words.
chunksize (int, optional) – Number of documents to be used in each training chunk.
decay (float, optional) – Weight of existing observations relative to new ones.
onepass (bool, optional) – Whether the one-pass algorithm should be used for training; pass False to force a multi-pass stochastic algorithm.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but lowers performance.
extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
fit(X, y=None)
Fit the model according to the given training data.

Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.

Returns
LsiTransformer – The trained model.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.

Returns
X_new (numpy array of shape [n_samples, n_features_new]) – Transformed array.
get_params(deep=True)
Get parameters for this estimator.

Parameters
deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params (mapping of string to any) – Parameter names mapped to their values.
partial_fit(X)
Train the model over a potentially incomplete set of documents.

This method can be used in two ways:
* On an unfitted model, in which case the model is initialized and trained on X.
* On an already fitted model, in which case the model is further trained on X.

Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – Stream of document vectors or a sparse matrix of shape [num_terms, num_documents].

Returns
LsiTransformer – The trained model.
set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Returns
self
transform(docs)
Computes the latent factors for docs.

Parameters
docs ({iterable of list of (int, number), list of (int, number), scipy.sparse matrix}) – Document or collection of documents in BOW format to be transformed.

Returns
numpy.ndarray of shape [len(docs), num_topics] – Topic distribution matrix.