sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection
Scikit learn interface for gensim.models.phrases.Phrases.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>>
>>> # Create the model. Make sure no term is ignored (min_count=1) and phrases scoring above 3 are accepted.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
... ['I', 'love', 'computer', 'science'],
... ['computer', 'science', 'is', 'my', 'passion'],
... ['I', 'studied', 'computer', 'science']
... ]
>>>
>>> # Use sklearn fit_transform to see the transformation.
>>> # Since 'computer' and 'science' co-occur often enough to score above the threshold, they are joined into a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Phrases module, wraps gensim.models.phrases.Phrases.
For more information, please have a look at Mikolov, et al: “Distributed Representations of Words and Phrases and their Compositionality” and Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.
Parameters:
min_count (int, optional) – Terms with a count lower than this will be ignored.
threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.
max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.
delimiter (str, optional) – Character used to join collocation tokens, should be a byte string (e.g. b'_').
progress_per (int, optional) – Training will report progress to the logger every progress_per phrases learned.
scoring (str or function, optional) –
Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:
’default’ – the original scoring from Mikolov, et al: “Distributed Representations of Words and Phrases and their Compositionality”.
’npmi’ – normalized pointwise mutual information, from Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”. It is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.
To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function; see original_scorer() for an example.
You must define all of the following parameters, even if your function uses only some of them:
worda_count: number of occurrences in sentences of the first token in the phrase being scored
wordb_count: number of occurrences in sentences of the second token in the phrase being scored
bigram_count: number of occurrences in sentences of the phrase being scored
len_vocab: the number of unique tokens in sentences
min_count: the min_count setting of the Phrases class
corpus_word_count: the total number of (non-unique) tokens in sentences
A scoring function that does not accept all of these parameters (even if some of them go unused) will raise a ValueError on initialization of the Phrases class. The scoring function must also be pickleable (see the sketch after this parameter list).
common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
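As an illustration, the sketch below defines a custom scorer. The count-ratio formula is only illustrative, not one of the built-in scorers; note that the function accepts all six required parameters even though it ignores most of them:

>>> # Hypothetical custom scorer: a simple count-ratio score.
>>> # In real use, define it at module level so it stays pickleable.
>>> def my_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
...     return (bigram_count - min_count) / (worda_count * wordb_count)
>>>
>>> m = PhrasesTransformer(min_count=1, threshold=0.1, scoring=my_scorer)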
fit(X, y=None)
Fit the model according to the given training data.
Parameters: X (iterable of list of str) – Sequence of sentences to be used for training the model.
Returns: The trained model.
Return type: PhrasesTransformer
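A minimal usage sketch, reusing the texts corpus from the Examples section above:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.fit(texts)  # returns the fitted transformer itself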
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X. (This generic description is inherited from scikit-learn; for this transformer, X is an iterable of lists of str, as in fit().)
Parameters:
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
Returns: X_new – Transformed array.
Return type: numpy array of shape [n_samples, n_features_new]
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any
partial_fit(X)
Train the model over a potentially incomplete set of sentences. This method can be used in two ways:
On an unfitted model, in which case the model is initialized and trained on X.
On an already fitted model, in which case the sentences in X are added to the vocabulary.
Parameters: X (iterable of list of str) – Sequence of sentences to be used for training the model.
Returns: The trained model.
Return type: PhrasesTransformer
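A sketch of incremental training, reusing texts from the Examples section; the second batch of sentences is hypothetical:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.partial_fit(texts)  # unfitted: initializes and trains the model
>>> more_texts = [['machine', 'learning'], ['machine', 'learning', 'rocks']]
>>> m = m.partial_fit(more_texts)  # fitted: adds the new sentences to the vocabulary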
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: self
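For illustration, hyperparameters can be read back and updated through these inherited scikit-learn accessors:

>>> m = PhrasesTransformer(min_count=1)
>>> m.get_params()['min_count']
1
>>> m = m.set_params(threshold=5.0)  # returns self, so calls can be chained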
transform(docs)
Transform the input documents into phrase tokens.
Tokens that form a detected phrase are joined by self.delimiter.
Parameters: docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.
Returns: Phrase representation for each of the input sentences.
Return type: iterable of list of str
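Continuing the Examples section above, a sketch of a fitted transformer applied to an unseen sentence (hypothetical input):

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.fit(texts)
>>> m.transform([['computer', 'science', 'is', 'fun']])[0]
['computer_science', 'is', 'fun']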