models.phrases – Phrase (collocation) detection

Automatically detect common phrases – also known as multi-word expressions or word n-gram collocations – from a stream of sentences.

Inspired by:

  • Mikolov, et al: “Distributed Representations of Words and Phrases and their Compositionality”.

  • Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Create training corpus. Must be a sequence of sentences (e.g. an iterable or a generator).
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # Each sentence must be a list of string tokens:
>>> first_sentence = next(iter(sentences))
>>> print(first_sentence[:10])
['computer', 'human', 'interface', 'computer', 'response', 'survey', 'system', 'time', 'user', 'interface']
>>>
>>> # Train a toy phrase model on our training corpus.
>>> phrase_model = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Apply the trained phrases model to a new, unseen sentence.
>>> new_sentence = ['trees', 'graph', 'minors']
>>> phrase_model[new_sentence]
['trees_graph', 'minors']
>>> # The toy model considered "trees graph" a single phrase => joined the two
>>> # tokens into a single "phrase" token, using our selected `_` delimiter.
>>>
>>> # Apply the trained model to each sentence of a corpus, using the same [] syntax:
>>> for sent in phrase_model[sentences]:
...     pass
>>>
>>> # Update the model with two new sentences on the fly.
>>> phrase_model.add_vocab([["hello", "world"], ["meow"]])
>>>
>>> # Export the trained model to use less RAM and get faster processing. Model updates are no longer possible.
>>> frozen_model = phrase_model.freeze()
>>> # Apply the frozen model; same results as before:
>>> frozen_model[new_sentence]
['trees_graph', 'minors']
>>>
>>> # Save / load models.
>>> frozen_model.save("/tmp/my_phrase_model.pkl")
>>> model_reloaded = Phrases.load("/tmp/my_phrase_model.pkl")
>>> model_reloaded[['trees', 'graph', 'minors']]  # apply the reloaded model to a sentence
['trees_graph', 'minors']
class gensim.models.phrases.FrozenPhrases(phrases_model)

Bases: gensim.models.phrases._PhrasesTransformation

Minimal state & functionality exported from a trained Phrases model.

The goal of this class is to cut down memory consumption of Phrases, by discarding model state not strictly needed for the phrase detection task.

Use this instead of Phrases if you do not need to update the bigram statistics with new documents any more.

Parameters

phrases_model (Phrases) – Trained phrases instance, to extract all phrases from.

Notes

After the one-time initialization, a FrozenPhrases will be much smaller and faster than using the full Phrases model.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Load corpus and train a model.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Export a FrozenPhrases object that is more efficient but doesn't allow further training.
>>> frozen_phrases = phrases.freeze()
>>> sent = [u'trees', u'graph', u'minors']
>>> print(frozen_phrases[sent])
[u'trees_graph', u'minors']
analyze_sentence(sentence)

Analyze a sentence, concatenating any detected phrases into a single token.

Parameters

sentence (iterable of str) – Token sequence representing the sentence to be analyzed.

Yields

(str, {float, None}) – Iterate through the input sentence tokens and yield 2-tuples of:

  • (concatenated_phrase_tokens, score) for token sequences that form a phrase;

  • (word, None) if the token is not part of a phrase.

find_phrases(sentences)

Get all unique phrases (multi-word expressions) that appear in sentences, and their scores.

Parameters

sentences (iterable of list of str) – Text corpus.

Returns

Unique phrases found in sentences, mapped to their scores.

Return type

dict(str, float)

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> for phrase, score in phrases.find_phrases(sentences).items():
...     print(phrase, score)
classmethod load(*args, **kwargs)

Load a previously saved Phrases / FrozenPhrases model.

Handles backward compatibility with models saved by older versions that did not support pluggable scoring functions.

Parameters
  • args (object) – See load.

  • kwargs (object) – See load.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this value (in bytes) separately.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

score_candidate(word_a, word_b, in_between)

Score a single phrase candidate.

Returns

2-tuple of (delimiter-joined phrase, phrase score) for a phrase, or (None, None) if not a phrase.

Return type

(str, float)

gensim.models.phrases.Phraser

alias of gensim.models.phrases.FrozenPhrases

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', connector_words=frozenset({}))

Bases: gensim.models.phrases._PhrasesTransformation

Detect phrases based on collocation counts.

Parameters
  • sentences (iterable of list of str, optional) – The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence for such examples.

  • min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.

  • threshold (float, optional) – Score threshold for forming phrases (higher means fewer phrases). A phrase of word a followed by word b is accepted if the phrase’s score is greater than threshold. The effect depends heavily on the chosen scoring function; see the scoring parameter.

  • max_vocab_size (int, optional) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.

  • delimiter (str, optional) – Glue character used to join collocation tokens.

  • scoring ({'default', 'npmi', function}, optional) –

    Specify how potential phrases are scored. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    1. "default" – original_scorer().

    2. "npmi" – npmi_scorer().

  • connector_words (set of str, optional) –

    Set of words that may be included within a phrase without affecting its scoring. No phrase can start or end with a connector word; a phrase may contain any number of connector words in the middle.

    If your texts are in English, set connector_words=phrases.ENGLISH_CONNECTOR_WORDS.

    This will cause phrases to include common English articles, prepositions and conjunctions, such as bank_of_america or eye_of_the_beholder.

    For other languages or specific application domains, use custom connector_words that make sense there: connector_words=frozenset("der die das".split()), etc.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Load corpus and train a model.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> # Use the model to detect phrases in a new sentence.
>>> sent = [u'trees', u'graph', u'minors']
>>> print(phrases[sent])
[u'trees_graph', u'minors']
>>>
>>> # Or transform multiple sentences at once.
>>> sents = [[u'trees', u'graph', u'minors'], [u'graph', u'minors']]
>>> for phrase in phrases[sents]:
...     print(phrase)
[u'trees_graph', u'minors']
[u'graph_minors']
>>>
>>> # Export a FrozenPhrases object that is more efficient but doesn't allow any more training.
>>> frozen_phrases = phrases.freeze()
>>> print(frozen_phrases[sent])
[u'trees_graph', u'minors']

Notes

Setting scoring="npmi" is more robust when dealing with common words that form part of common bigrams, and its scores range from -1 to 1, but it is slower to calculate than the default scoring="default". The default is the PMI-like scoring described in Mikolov, et al: “Distributed Representations of Words and Phrases and their Compositionality”.

To use your own custom scoring function, pass in a function with the following signature:

  • worda_count - number of corpus occurrences in sentences of the first token in the bigram being scored

  • wordb_count - number of corpus occurrences in sentences of the second token in the bigram being scored

  • bigram_count - number of occurrences in sentences of the whole bigram

  • len_vocab - the number of unique tokens in sentences

  • min_count - the min_count setting of the Phrases class

  • corpus_word_count - the total number of tokens (non-unique) in sentences

The scoring function must accept all these parameters, even if it doesn’t use them in its scoring.

The scoring function must be pickleable.
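As a minimal, hypothetical sketch (this scorer and the counts are illustrative only, not part of gensim), a custom scoring function matching the required signature could look like:

```python
# Hypothetical custom scorer: score a bigram by its raw corpus frequency.
# It must accept all six parameters, even the unused ones, and must be
# defined at module level so that it remains pickleable.
def raw_count_scorer(worda_count, wordb_count, bigram_count,
                     len_vocab, min_count, corpus_word_count):
    return float(bigram_count)

# Pass it to Phrases via the `scoring` parameter, e.g.:
#   phrases = Phrases(sentences, min_count=1, threshold=1, scoring=raw_count_scorer)
score = raw_count_scorer(50, 40, 30, 1000, 5, 10000)  # -> 30.0
```

A lambda would not work here, since lambdas cannot be pickled along with the model.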

add_vocab(sentences)

Update model parameters with new sentences.

Parameters

sentences (iterable of list of str) – Text corpus to update this model’s parameters from.

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> # Train a phrase detector from a text corpus.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, connector_words=ENGLISH_CONNECTOR_WORDS)  # train model
>>> assert len(phrases.vocab) == 37
>>>
>>> more_sentences = [
...     [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
...     [u'machine', u'learning', u'can', u'be', u'new', u'york', u'sometimes'],
... ]
>>>
>>> phrases.add_vocab(more_sentences)  # add new sentences to model
>>> assert len(phrases.vocab) == 60
analyze_sentence(sentence)

Analyze a sentence, concatenating any detected phrases into a single token.

Parameters

sentence (iterable of str) – Token sequence representing the sentence to be analyzed.

Yields

(str, {float, None}) – Iterate through the input sentence tokens and yield 2-tuples of:

  • (concatenated_phrase_tokens, score) for token sequences that form a phrase;

  • (word, None) if the token is not part of a phrase.

export_phrases()

Extract all found phrases.

Returns

Mapping between phrases and their scores.

Return type

dict(str, float)

find_phrases(sentences)

Get all unique phrases (multi-word expressions) that appear in sentences, and their scores.

Parameters

sentences (iterable of list of str) – Text corpus.

Returns

Unique phrases found in sentences, mapped to their scores.

Return type

dict(str, float)

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1, connector_words=ENGLISH_CONNECTOR_WORDS)
>>>
>>> for phrase, score in phrases.find_phrases(sentences).items():
...     print(phrase, score)
freeze()

Return an object that contains the bare minimum of information while still allowing phrase detection. See FrozenPhrases.

Use this “frozen model” to dramatically reduce RAM footprint if you don’t plan to make any further changes to your Phrases model.

Returns

Exported object that’s smaller, faster, but doesn’t support model updates.

Return type

FrozenPhrases

classmethod load(*args, **kwargs)

Load a previously saved Phrases / FrozenPhrases model.

Handles backward compatibility with models saved by older versions that did not support pluggable scoring functions.

Parameters
  • args (object) – See load.

  • kwargs (object) – See load.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this value (in bytes) separately.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

score_candidate(word_a, word_b, in_between)

Score a single phrase candidate.

Returns

2-tuple of (delimiter-joined phrase, phrase score) for a phrase, or (None, None) if not a phrase.

Return type

(str, float)

gensim.models.phrases.npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Calculate the NPMI score, based on “Normalized (Pointwise) Mutual Information in Collocation Extraction” by Gerlof Bouma.

Parameters
  • worda_count (int) – Number of occurrences for first word.

  • wordb_count (int) – Number of occurrences for second word.

  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.

  • len_vocab (int) – Not used.

  • min_count (int) – Ignore all bigrams with total collected count lower than this value.

  • corpus_word_count (int) – Total number of words in the corpus.

Returns

If bigram_count >= min_count, return the collocation score, in the range -1 to 1. Otherwise return -inf.

Return type

float

Notes

Formula: \frac{\ln\left(\frac{prob(word_a, word_b)}{prob(word_a)\,prob(word_b)}\right)}{-\ln\left(prob(word_a, word_b)\right)}, where prob(word) = \frac{word\_count}{corpus\_word\_count}.
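The formula can be sketched in plain Python (illustrative only; the counts below are made up, and this is not gensim's actual implementation):

```python
from math import log

# Plain-Python sketch of the NPMI formula. All six parameters are kept
# to mirror the documented signature; len_vocab is unused, as noted.
def npmi(worda_count, wordb_count, bigram_count, len_vocab, min_count,
         corpus_word_count):
    if bigram_count < min_count:
        return float('-inf')  # bigram too rare: score is -inf
    prob_a = worda_count / corpus_word_count
    prob_b = wordb_count / corpus_word_count
    prob_ab = bigram_count / corpus_word_count
    return log(prob_ab / (prob_a * prob_b)) / -log(prob_ab)

# A bigram whose words always co-occur scores the maximum, approximately 1.0:
perfect = npmi(100, 100, 100, 12000, 5, 10000)
# A bigram below min_count scores -inf:
rare = npmi(100, 100, 3, 12000, 5, 10000)
```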

gensim.models.phrases.original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Bigram scoring function, based on the original paper by Mikolov, et al: “Distributed Representations of Words and Phrases and their Compositionality”.

Parameters
  • worda_count (int) – Number of occurrences for first word.

  • wordb_count (int) – Number of occurrences for second word.

  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.

  • len_vocab (int) – Size of vocabulary.

  • min_count (int) – Minimum collocation count threshold.

  • corpus_word_count (int) – Not used in this particular scoring technique.

Returns

Score for given phrase. Can be negative.

Return type

float

Notes

Formula: \frac{(bigram\_count - min\_count) * len\_vocab }{ (worda\_count * wordb\_count)}.
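This formula can be checked with a small plain-Python sketch (made-up counts, not gensim's actual implementation):

```python
# Plain-Python sketch of the default scoring formula above.
def original_score(worda_count, wordb_count, bigram_count, len_vocab,
                   min_count, corpus_word_count):
    # (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)
    return (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)

# Worked example with hypothetical counts:
# (30 - 5) * 1000 / (50 * 40) = 25000 / 2000 = 12.5
score = original_score(50, 40, 30, 1000, 5, 10000)  # -> 12.5
```

Note how the score falls as the individual word counts grow relative to the bigram count, which is what makes frequent-but-unrelated word pairs score below the threshold.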