sklearn_api.text2bow – Scikit-learn wrapper for word<->id mapping

Scikit-learn interface for Dictionary.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.


>>> from gensim.sklearn_api import Text2BowTransformer
>>> # Get a corpus as an iterable of unicode strings.
>>> texts = [u'compiler system computer', u'loading computer system']
>>> # Create a transformer.
>>> model = Text2BowTransformer()
>>> # Use sklearn-style `fit_transform` to get the BOW representation of each document.
>>> model.fit_transform(texts)
[[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]

class gensim.sklearn_api.text2bow.Text2BowTransformer(prune_at=2000000, tokenizer=<function tokenize>)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Text2Bow module; wraps Dictionary.

For more information, see the Bag-of-words model.

Parameters

  • prune_at (int, optional) – Maximum number of unique words. The Dictionary keeps no more than prune_at words.

  • tokenizer (callable (str -> list of str), optional) – A callable that splits a document into a list of terms. Defaults to gensim.utils.tokenize().
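To make the role of the tokenizer callable and the word<->id mapping concrete, here is a minimal pure-Python sketch. It is not gensim's implementation: `whitespace_tokenizer`, `build_vocab`, and `to_bow` are hypothetical helpers, and the sketch assumes that new tokens in each document receive ids in sorted order. It uses two short documents like those in the module example.

```python
from collections import Counter


def whitespace_tokenizer(doc):
    """Hypothetical tokenizer (str -> list of str): lowercase, split on whitespace."""
    return doc.lower().split()


def build_vocab(texts, tokenizer):
    """Assign integer ids to tokens, document by document.

    Assumption for this sketch: unseen tokens of a document get ids in sorted order.
    """
    token2id = {}
    for doc in texts:
        for token in sorted(set(tokenizer(doc))):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id, tokenizer):
    """Return the document as a sorted list of (token_id, count) pairs."""
    counts = Counter(tokenizer(doc))
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


texts = ["compiler system computer", "loading computer system"]
token2id = build_vocab(texts, whitespace_tokenizer)
bows = [to_bow(doc, token2id, whitespace_tokenizer) for doc in texts]
print(bows)  # [[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
```

A callable with the same shape as `whitespace_tokenizer` could presumably be supplied through the `tokenizer` parameter, for example `Text2BowTransformer(tokenizer=whitespace_tokenizer)`.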

fit(X, y=None)

Fit the model according to the given training data.


Parameters

X (iterable of str) – A collection of documents used for training the model.

Returns

The trained model.

Return type

Text2BowTransformer


fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

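Fits the transformer to X and y with optional parameters fit_params, then returns a transformed version of X. The parameter descriptions below are the generic scikit-learn ones; for this transformer, X is an iterable of str, as in fit(), and the result is the BOW representation produced by transform().

Parameters

  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns

X_new – Transformed array.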

Return type

numpy array of shape [n_samples, n_features_new]


get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, return the parameters for this estimator and for contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any


partial_fit(X)

Train the model over a potentially incomplete set of documents.

This method can be used in two ways:
  1. On an unfitted model, in which case the dictionary is initialized and trained on X.

  2. On an already fitted model, in which case the dictionary is expanded by X.
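The two cases can be illustrated with a small pure-Python sketch. `TinyDictionary` and `TinyBowModel` are hypothetical stand-ins written for this illustration, not gensim classes; the sketch only assumes that an unfitted model holds no dictionary and that expanding a dictionary never changes ids that were already assigned.

```python
class TinyDictionary:
    """Illustrative word<->id mapping (a stand-in, not gensim's Dictionary)."""

    def __init__(self):
        self.token2id = {}

    def add_documents(self, texts):
        # Only tokens that are not yet known receive a new id.
        for doc in texts:
            for token in sorted(set(doc.split())):
                self.token2id.setdefault(token, len(self.token2id))
        return self


class TinyBowModel:
    """Illustrative model with an incremental training method."""

    def __init__(self):
        self.dictionary = None  # None means "not fitted yet"

    def partial_fit(self, texts):
        if self.dictionary is None:
            # Case 1: unfitted model, so initialize the dictionary first.
            self.dictionary = TinyDictionary()
        # Case 2 (and the rest of case 1): expand the dictionary with new documents.
        self.dictionary.add_documents(texts)
        return self


model = TinyBowModel()
model.partial_fit(["compiler system computer"])
size_after_first = len(model.dictionary.token2id)
model.partial_fit(["loading computer system"])
size_after_second = len(model.dictionary.token2id)
print(size_after_first, size_after_second)  # 3 4
```

Returning `self` from `partial_fit` mirrors the scikit-learn convention and allows calls to be chained.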


Parameters

X (iterable of str) – A collection of documents used to train the model.

Returns

The trained model.

Return type

Text2BowTransformer



set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
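The `<component>__<parameter>` naming convention can be sketched in a few lines of pure Python. This is a simplified illustration of the routing idea, not scikit-learn's implementation; `split_nested_params` is a hypothetical helper, and the step names `bow` and `clf` are made up for the example.

```python
def split_nested_params(params):
    """Group `<component>__<parameter>` keys by component name.

    Keys without a double underscore are collected under None,
    meaning they belong to the outer object itself.
    """
    grouped = {}
    for name, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # (e.g. "a__b__c") is passed on to component "a" as "b__c".
        component, sep, sub_name = name.partition("__")
        if sep:
            grouped.setdefault(component, {})[sub_name] = value
        else:
            grouped.setdefault(None, {})[name] = value
    return grouped


grouped = split_nested_params({
    "bow__prune_at": 100000,  # would go to the hypothetical "bow" step
    "clf__C": 0.5,            # would go to the hypothetical "clf" step
    "verbose": True,          # would stay on the outer object
})
print(grouped)
```

In a pipeline whose first step is named `bow`, a key such as `bow__prune_at` would therefore address the `prune_at` parameter of that step.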


Returns

self – The estimator instance.

Return type

Text2BowTransformer



transform(docs)

Get the BOW format for the docs.

Parameters

docs ({iterable of str, str}) – A collection of documents to be transformed.

Returns

The BOW representation of each document.

Return type

iterable of list of (int, int) 2-tuples
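Because docs may be either a single str or an iterable of str, it can help to see how both inputs lead to the same output shape. The following pure-Python sketch is not the library code: it uses a hypothetical, already fitted `token2id` mapping, splits on whitespace, and assumes that tokens missing from the mapping are ignored.

```python
from collections import Counter

# Hypothetical mapping, standing in for an already fitted dictionary.
token2id = {"compiler": 0, "computer": 1, "system": 2, "loading": 3}


def transform(docs):
    """Return one list of (token_id, count) pairs per document."""
    if isinstance(docs, str):
        # A single document is treated as a collection of one.
        docs = [docs]
    result = []
    for doc in docs:
        counts = Counter(doc.split())
        # Assumption: tokens that are not in the mapping are skipped.
        result.append(sorted((token2id[t], n) for t, n in counts.items() if t in token2id))
    return result


single = transform("computer system system")
batch = transform(["loading computer", "unknown word"])
print(single)  # [[(1, 1), (2, 2)]]
print(batch)   # [[(1, 1), (3, 1)], []]
```

In both cases the result is a list with one BOW entry per input document, so downstream code does not need to special-case a single string.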