sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection

Scikit-learn interface for gensim.models.phrases.Phrases.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.


>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>> # Create the model. min_count=1 keeps every term; threshold=3 accepts bigrams whose score exceeds 3.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
...     ['I', 'love', 'computer', 'science'],
...     ['computer', 'science', 'is', 'my', 'passion'],
...     ['I', 'studied', 'computer', 'science']
... ]
>>> # Use sklearn fit_transform to see the transformation.
>>> # 'computer' and 'science' always occur together, so their score clears the threshold and they form a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', common_terms=frozenset([]))

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Phrases module, wraps Phrases.

For more information, see Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, and Gerlof Bouma, “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Parameters:
  • min_count (int, optional) – Terms with a count lower than this will be ignored.
  • threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.
  • max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.
  • delimiter (str, optional) – Character used to join collocation tokens, should be a byte string (e.g. b’_’).
  • progress_per (int, optional) – Training reports to the logger each time this many phrases have been learned.
  • scoring (str or function, optional) –

    Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set either to a string that names a built-in scoring function or to a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    • ’default’: the scorer from Mikolov et al., implemented by original_scorer().
    • ’npmi’: normalized pointwise mutual information, from Gerlof Bouma, implemented by npmi_scorer().

    ’npmi’ is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.

    To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function; see original_scorer() for an example. The function must define all of the parameters, but need not use all of them:

    • worda_count: number of occurrences in sentences of the first token in the phrase being scored
    • wordb_count: number of occurrences in sentences of the second token in the phrase being scored
    • bigram_count: number of occurrences in sentences of the phrase being scored
    • len_vocab: the number of unique tokens in sentences
    • min_count: the min_count setting of the Phrases class
    • corpus_word_count: the total number of (non-unique) tokens in sentences

    A scoring function that is missing any of these parameters (even ones it does not use) will raise a ValueError on initialization of the Phrases class. The scoring function must be pickleable.

  • common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
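
The custom-scorer contract described for the scoring parameter can be sketched in plain Python. The two functions below are illustrative stand-ins, not gensim's own code: each accepts the six required parameter names, and they compute a Mikolov-style score and an NPMI score respectively. The function names and the counts passed to them are hypothetical.

```python
from math import log


def mikolov_style_score(worda_count, wordb_count, bigram_count,
                        len_vocab, min_count, corpus_word_count):
    # (bigram_count - min_count) * len_vocab / (worda_count * wordb_count).
    # corpus_word_count is declared but unused, which the contract allows.
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


def npmi_style_score(worda_count, wordb_count, bigram_count,
                     len_vocab, min_count, corpus_word_count):
    # Normalized pointwise mutual information; values lie in [-1, 1].
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return log(pab / (pa * pb)) / -log(pab)


# Hypothetical counts, chosen only to exercise both functions.
counts = dict(worda_count=50, wordb_count=40, bigram_count=30,
              len_vocab=1000, min_count=5, corpus_word_count=10000)
default_score = mikolov_style_score(**counts)
npmi_score = npmi_style_score(**counts)
```

A function of this shape could be passed as the scoring argument. Defining it at module level, rather than as a lambda or nested function, keeps it pickleable.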
fit(X, y=None)

Fit the model according to the given training data.

Parameters:X (iterable of list of str) – Sequence of sentences to be used for training the model.
Returns:The trained model.
Return type:PhrasesTransformer
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:X_new – Transformed array.
Return type:numpy array of shape [n_samples, n_features_new]


get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any

partial_fit(X)

Train model over a potentially incomplete set of sentences.

This method can be used in two ways:
  1. On an unfitted model, in which case the model is initialized and trained on X.
  2. On an already fitted model, in which case the sentences in X are added to the vocabulary.
Parameters:X (iterable of list of str) – Sequence of sentences to be used for training the model.
Returns:The trained model.
Return type:PhrasesTransformer
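
The two usage modes amount to incremental counting: the first batch initializes the counts and later batches add to them. A minimal pure-Python sketch of that idea follows. ToyPhraseCounter is a hypothetical stand-in and does not reproduce gensim's vocabulary pruning or phrase scoring.

```python
from collections import Counter


class ToyPhraseCounter:
    """Hypothetical stand-in that only accumulates unigram and bigram counts."""

    def __init__(self):
        self.counts = None  # None marks the unfitted state

    def partial_fit(self, X):
        # Mode 1: first call on an unfitted object initializes the counts.
        if self.counts is None:
            self.counts = Counter()
        # Mode 2: later calls add the new sentences to the existing counts.
        for sentence in X:
            self.counts.update(sentence)                     # unigrams
            self.counts.update(zip(sentence, sentence[1:]))  # adjacent pairs
        return self


counter = ToyPhraseCounter()
counter.partial_fit([['I', 'love', 'computer', 'science']])
counter.partial_fit([['computer', 'science', 'is', 'my', 'passion']])
```

After both calls, the counts reflect every sentence seen so far, regardless of which batch it arrived in.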

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Return type:self
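
The <component>__<parameter> naming can be illustrated with a small sketch of how such a name splits at the first double underscore. The helper split_param_name and the step name 'phrases' are hypothetical, and this is not scikit-learn's implementation.

```python
def split_param_name(name):
    """Split '<component>__<parameter>' at the first double underscore."""
    component, sep, parameter = name.partition('__')
    if not sep:
        # No double underscore: the name refers to the estimator itself.
        return None, name
    return component, parameter


simple = split_param_name('min_count')
nested = split_param_name('phrases__min_count')
```

Here simple addresses a parameter of the estimator itself, while nested addresses min_count on a component named 'phrases'. Anything after the first double underscore is left intact so that the component can resolve deeper nesting on its own.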

transform(docs)

Transform the input documents into phrase tokens.

Words that form a detected phrase are joined by self.delimiter.

Parameters:docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.
Returns:Phrase representation for each of the input sentences.
Return type:list of list of str
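
The joining step can be sketched in plain Python, assuming the set of accepted bigrams is already known. The helper join_phrases and the accepted set are hypothetical; the real transformer decides which pairs to join from the scores learned during fitting.

```python
def join_phrases(sentence, phrases, delimiter='_'):
    """Greedily join adjacent token pairs found in `phrases` (a set of tuples)."""
    result = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Accepted bigram: emit one joined token and skip both words.
            result.append(delimiter.join(pair))
            i += 2
        else:
            result.append(sentence[i])
            i += 1
    return result


accepted = {('computer', 'science')}  # hypothetical set of accepted bigrams
joined = join_phrases(['I', 'love', 'computer', 'science'], accepted)
untouched = join_phrases(['I', 'love', 'music'], accepted)
```

Sentences that contain no accepted pair pass through unchanged, so the output always has one token list per input sentence.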