sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection

Scikit learn interface for gensim.models.phrases.Phrases.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>>
>>> # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
...     ['I', 'love', 'computer', 'science'],
...     ['computer', 'science', 'is', 'my', 'passion'],
...     ['I', 'studied', 'computer', 'science']
... ]
>>>
>>> # Use sklearn fit_transform to see the transformation.
>>> # Since computer and science were seen together 3+ times they are considered a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Wrapper around gensim.models.phrases.Phrases, exposed as a scikit-learn transformer.

For more information, please have a look at Mikolov et al.: “Distributed Representations of Words and Phrases and their Compositionality” and Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Parameters
  • min_count (int, optional) – Terms with a count lower than this will be ignored.

  • threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.

  • max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.

  • delimiter (str, optional) – Character used to join collocation tokens, should be a byte string (e.g. b’_’).

  • progress_per (int, optional) – Progress will be reported to the logger once for every progress_per phrases learned during training.

  • scoring (str or function, optional) –

    Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    ’default’: from “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al.: (count(worda followed by wordb) - min_count) * N / (count(worda) * count(wordb)), where N is the total vocabulary size.

    ’npmi’: normalized pointwise mutual information, from “Normalized (Pointwise) Mutual Information in Collocation Extraction” by Gerlof Bouma: ln(prob(worda followed by wordb) / (prob(worda) * prob(wordb))) / -ln(prob(worda followed by wordb)), where prob(n) is the count of n divided by the total number of tokens in the corpus.

    ’npmi’ is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.

    To use a custom scoring function, create a function with the parameters listed below and set the scoring parameter to that function; see original_scorer() as an example. The function must accept all of these parameters, even if it only uses some of them:

    • worda_count: number of occurrences in sentences of the first token in the phrase being scored

    • wordb_count: number of occurrences in sentences of the second token in the phrase being scored

    • bigram_count: number of occurrences in sentences of the phrase being scored

    • len_vocab: the number of unique tokens in sentences

    • min_count: the min_count setting of the Phrases class

    • corpus_word_count: the total number of (non-unique) tokens in sentences

    A scoring function that does not accept all of these parameters (even unused ones) will raise a ValueError when the Phrases class is initialized. The scoring function must also be picklable; a minimal sketch follows this parameter list.

  • common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
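
For illustration, here is a minimal sketch of a custom scorer. The function name and scoring formula are invented for this example; the only requirements are the six parameter names above and picklability (in real use, define the function at module level):

>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>>
>>> def ratio_scorer(worda_count, wordb_count, bigram_count,
...                  len_vocab, min_count, corpus_word_count):
...     # Score a bigram by how often it appears relative to its parts.
...     # All six parameters must be declared, even the unused ones.
...     return bigram_count / (worda_count * wordb_count)
>>>
>>> # threshold is compared against the custom score's scale, so it is small here.
>>> m = PhrasesTransformer(min_count=1, threshold=0.001, scoring=ratio_scorer)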

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

PhrasesTransformer
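
A short usage sketch, reusing texts from the example at the top of this page:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.fit(texts)  # fit returns the transformer itself
>>> m.transform([['I', 'love', 'computer', 'science']])[0]
['I', 'love', 'computer_science']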

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns

X_new – Transformed array.

Return type

numpy array of shape [n_samples, n_features_new]
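
For this transformer, X is in practice an iterable of tokenized documents rather than a numpy array (see the example at the top of this page); the description above is the generic scikit-learn contract. fit_transform(X) is equivalent to fit(X).transform(X):

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m.fit_transform(texts) == m.fit(texts).transform(texts)
True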

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any
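
For example:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m.get_params()['threshold']
3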

partial_fit(X)

Train model over a potentially incomplete set of sentences.

This method can be used in two ways:
  1. On an unfitted model in which case the model is initialized and trained on X.

  2. On an already fitted model in which case the X sentences are added to the vocabulary.

Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

PhrasesTransformer
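
A sketch of incremental training; each call folds the new sentences into the model’s vocabulary:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.partial_fit([['I', 'love', 'computer', 'science']])
>>> m = m.partial_fit([['computer', 'science', 'is', 'fun']])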

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self
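
For example, parameters can be changed before (re)fitting:

>>> m = PhrasesTransformer()
>>> m = m.set_params(min_count=1, threshold=3)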

transform(docs)

Transform the input documents into phrase tokens.

Words in the sentence will be joined by self.delimiter.

Parameters

docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.

Returns

Phrase representation for each of the input sentences.

Return type

iterable of list of str
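
For example, transforming an unseen document with a model fitted on texts from the example at the top of this page:

>>> m = PhrasesTransformer(min_count=1, threshold=3).fit(texts)
>>> m.transform([['computer', 'science', 'everywhere']])[0]
['computer_science', 'everywhere']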