
sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection

Scikit learn interface for gensim.models.phrases.Phrases.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.


>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>> # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
...     ['I', 'love', 'computer', 'science'],
...     ['computer', 'science', 'is', 'my', 'passion'],
...     ['I', 'studied', 'computer', 'science']
... ]
>>> # Use sklearn fit_transform to see the transformation.
>>> # Since computer and science were seen together 3+ times they are considered a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Phrases module, wraps Phrases.

For more information, see Mikolov, et. al: “Distributed Representations of Words and Phrases and their Compositionality” and Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Parameters

  • min_count (int, optional) – Terms with a count lower than this will be ignored.

  • threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.

  • max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.

  • delimiter (str, optional) – Character used to join collocation tokens, should be a byte string (e.g. b’_’).

  • progress_per (int, optional) – Progress will be reported to the logger after every that many learned phrases.

  • scoring (str or function, optional) –

    Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set either with a string that refers to a built-in scoring function, or with a custom function that has the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    ’default’ uses the original formula from Mikolov, et. al: (bigram_count - min_count) / worda_count / wordb_count * len_vocab.

    ’npmi’ uses normalized pointwise mutual information and is more robust when dealing with common words that form part of common bigrams; it ranges from -1 to 1, but is slower to calculate than the default.

    To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function (see original_scorer() for an example). The function must declare all of the parameters below, even if it only uses some of them:

    • worda_count: number of occurrences in sentences of the first token in the phrase being scored

    • wordb_count: number of occurrences in sentences of the second token in the phrase being scored

    • bigram_count: number of occurrences in sentences of the phrase being scored

    • len_vocab: the number of unique tokens in sentences

    • min_count: the min_count setting of the Phrases class

    • corpus_word_count: the total number of (non-unique) tokens in sentences

    A scoring function that does not declare all of these parameters (even the unused ones) will raise a ValueError on initialization of the Phrases class. The scoring function must also be pickleable.

  • common_terms (set of str, optional) – Set of “stop words” that will not affect the frequency count of expressions containing them. This allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
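For illustration, a custom scorer with the required signature might look like the sketch below. The function name and the plain pointwise-mutual-information formula are hypothetical examples, not gensim built-ins; only the parameter names are prescribed by the Phrases API.

```python
import math

def pmi_scorer(worda_count, wordb_count, bigram_count,
               len_vocab, min_count, corpus_word_count):
    """Plain pointwise mutual information: log p(a,b) / (p(a) * p(b))."""
    p_a = worda_count / corpus_word_count
    p_b = wordb_count / corpus_word_count
    p_ab = bigram_count / corpus_word_count
    return math.log(p_ab / (p_a * p_b))

# A module-level function is pickleable, as required, and could be
# passed as the scoring parameter, e.g.:
# m = PhrasesTransformer(min_count=1, threshold=1.0, scoring=pmi_scorer)
score = pmi_scorer(worda_count=3, wordb_count=3, bigram_count=3,
                   len_vocab=10, min_count=1, corpus_word_count=14)
```

Note that all six parameters must appear in the signature even though this scorer ignores len_vocab and min_count.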

fit(X, y=None)

Fit the model according to the given training data.


Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

gensim.sklearn_api.phrases.PhrasesTransformer

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns

X_new – Transformed array.

Return type

numpy array of shape [n_samples, n_features_new]


get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any


partial_fit(X)

Train model over a potentially incomplete set of sentences.

This method can be used in two ways:
  1. On an unfitted model in which case the model is initialized and trained on X.

  2. On an already fitted model in which case the X sentences are added to the vocabulary.


Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

gensim.sklearn_api.phrases.PhrasesTransformer

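The two cases above can be sketched with a toy stand-in. This is illustrative only, not gensim's implementation; the class and attribute names are made up.

```python
from collections import Counter

class ToyPartialFit:
    """Toy stand-in illustrating the partial_fit contract described above."""
    def __init__(self):
        self.vocab = None  # None marks an unfitted model

    def partial_fit(self, X):
        if self.vocab is None:
            # Case 1: unfitted model -- initialize it first.
            self.vocab = Counter()
        # Case 2 (and the remainder of case 1): fold X's sentences
        # into the vocabulary, accumulating across calls.
        for sentence in X:
            self.vocab.update(sentence)
        return self  # the trained model

model = ToyPartialFit()
model.partial_fit([['computer', 'science']])
model.partial_fit([['computer', 'vision']])  # counts accumulate
```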

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self

Return type

estimator instance
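The flat vs. nested key convention can be sketched with toy classes (illustrative only; the real behavior is inherited from scikit-learn's BaseEstimator):

```python
class ToyScaler:
    def __init__(self, factor=1.0):
        self.factor = factor

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

class ToyPipeline:
    def __init__(self, scaler):
        self.scaler = scaler

    def set_params(self, **params):
        for key, value in params.items():
            if '__' in key:
                # "<component>__<parameter>": route to the nested estimator.
                component, _, sub_key = key.partition('__')
                getattr(self, component).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self

pipe = ToyPipeline(ToyScaler())
pipe.set_params(scaler__factor=2.5)  # updates the nested ToyScaler
```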

transform(docs)

Transform the input documents into phrase tokens.

Words in the sentence will be joined by self.delimiter.


Parameters

docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.

Returns

Phrase representation for each of the input sentences.

Return type

iterable of list of str
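What “joined by self.delimiter” means for a detected phrase, as a minimal sketch (the tokens mirror the example at the top of this page; b'_' is the default delimiter):

```python
# Tokens that the model scored as a phrase are merged into a single
# token, joined by the delimiter byte string:
delimiter = b'_'
tokens = ['computer', 'science']
phrase = delimiter.join(t.encode('utf8') for t in tokens).decode('utf8')
```

So a transformed sentence contains 'computer_science' as one token where the input had two.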