sklearn_api.text2bow – Scikit-learn wrapper for word<->id mapping
Scikit-learn interface for Dictionary.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.sklearn_api import Text2BowTransformer
>>>
>>> # Get a corpus as an iterable of unicode strings.
>>> texts = [u'compiler system computer', u'loading computer system']
>>>
>>> # Create a transformer.
>>> model = Text2BowTransformer()
>>>
>>> # Use sklearn-style `fit_transform` to get the BOW representation of each document.
>>> model.fit_transform(texts)
[[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
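To read the output above: each document becomes a list of (token_id, count) pairs. The following is a rough pure-Python stand-in for that mapping, not gensim's implementation. The helper name `to_bow` is hypothetical, and giving unseen tokens ids in sorted order within each document is an assumption chosen so that the result matches the output shown.

```python
from collections import Counter


def to_bow(texts):
    """Hypothetical helper: map each document to sorted (token_id, count) pairs."""
    token2id = {}
    bows = []
    for text in texts:
        counts = Counter(text.split())
        # Assumption: unseen tokens receive the next free ids in sorted order.
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        bows.append(sorted((token2id[t], n) for t, n in counts.items()))
    return bows, token2id


bows, token2id = to_bow([u'compiler system computer', u'loading computer system'])
print(bows)
```

Here `token2id` ends up holding four entries, and the second document reuses the ids already assigned to "computer" and "system".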
gensim.sklearn_api.text2bow.Text2BowTransformer(prune_at=2000000, tokenizer=<function tokenize>)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Text2Bow module, wraps Dictionary.
For more information, see the Bag-of-words model.
prune_at (int, optional) – Maximum number of unique words. The Dictionary keeps no more than prune_at words.
tokenizer (callable (str -> list of str), optional) – A callable that splits a document into a list of terms. Defaults to gensim.utils.tokenize().
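The tokenizer contract (str -> list of str) can be illustrated with a plain function. The name `whitespace_tokenizer` is hypothetical and not part of gensim. Going by the signature above, such a callable would be passed as `Text2BowTransformer(tokenizer=whitespace_tokenizer)`; the sketch below only exercises the callable itself.

```python
def whitespace_tokenizer(document):
    """Hypothetical tokenizer: lowercase the text and split on whitespace."""
    return document.lower().split()


tokens = whitespace_tokenizer(u'Loading Computer System')
print(tokens)
```

A tokenizer like this keeps punctuation attached to words, so it is only a reasonable choice for text that has already been cleaned.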
fit(X, y=None)
Fit the model according to the given training data.
X (iterable of str) – A collection of documents used for training the model.
Returns the trained model.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
Returns X_new, the transformed array (numpy array of shape [n_samples, n_features_new]).
get_params(deep=True)
Get parameters for this estimator.
deep (boolean, optional) – If True, return the parameters for this estimator and for contained subobjects that are estimators.
Returns params, a mapping of parameter names (str) to their values.
partial_fit(X)
Train the model over a potentially incomplete set of documents.
The method can be called on an unfitted model, in which case the dictionary is initialized and trained on X, or on an already fitted model, in which case the dictionary is expanded by X.
X (iterable of str) – A collection of documents used to train the model.
Returns the trained model.
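The two situations described for partial_fit (unfitted versus already fitted) can be sketched with a small stand-in class. `TinyVocabulary` is hypothetical and is not gensim code; it only mimics the "initialize on first call, expand on later calls" behaviour, using a plain dict in place of the wrapped Dictionary.

```python
class TinyVocabulary:
    """Hypothetical stand-in for the wrapped dictionary; not gensim code."""

    def __init__(self):
        self.token2id = None  # None marks the unfitted state

    def partial_fit(self, X):
        if self.token2id is None:
            # Unfitted model: start a fresh mapping before training on X.
            self.token2id = {}
        # In both cases, tokens not seen before get the next free id.
        for document in X:
            for token in sorted(set(document.split())):
                self.token2id.setdefault(token, len(self.token2id))
        return self


vocab = TinyVocabulary()
vocab.partial_fit([u'compiler system computer'])
size_after_first_batch = len(vocab.token2id)
vocab.partial_fit([u'loading computer system'])
size_after_second_batch = len(vocab.token2id)
print(size_after_first_batch, size_after_second_batch)
```

The second call adds only "loading", since the other two tokens were already known from the first batch.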
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.
Returns self.
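The <component>__<parameter> form can be tried on any scikit-learn pipeline. The sketch below uses stock scikit-learn components (CountVectorizer and Normalizer) rather than Text2BowTransformer so that it runs without gensim; the step names 'vectorizer' and 'normalizer' are arbitrary choices made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Step names are arbitrary labels; they become the <component> prefix.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('normalizer', Normalizer()),
])

# Update one parameter on each nested component in a single call.
returned = pipeline.set_params(vectorizer__lowercase=False, normalizer__norm='l1')

params = pipeline.get_params()
print(params['vectorizer__lowercase'], params['normalizer__norm'])
```

By the same convention, if this transformer were a pipeline step named, say, `bow`, its prune_at parameter would be addressed as `bow__prune_at`.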
transform(docs)
Get the BOW format for the docs.
docs ({iterable of str, str}) – A collection of documents to be transformed.
Returns the BOW representation of each document (iterable of list of (int, int) 2-tuples).