models.phrases – Phrase (collocation) detection

Automatically detect common phrases – aka multi-word expressions, word n-gram collocations – from a stream of sentences.

Inspired by:

  • Mikolov et al.: “Distributed Representations of Words and Phrases and their Compositionality”.

  • Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> # Load training data.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # The training corpus must be a sequence (stream, generator) of sentences,
>>> # with each sentence a list of tokens:
>>> print(list(sentences)[0][:10])
['computer', 'human', 'interface', 'computer', 'response', 'survey', 'system', 'time', 'user', 'interface']
>>>
>>> # Train a toy bigram model.
>>> phrases = Phrases(sentences, min_count=1, threshold=1)
>>> # Apply the trained phrases model to a new, unseen sentence.
>>> phrases[['trees', 'graph', 'minors']]
['trees_graph', 'minors']
>>> # The toy model considered "trees graph" a single phrase => joined the two
>>> # tokens into a single token, `trees_graph`.
>>>
>>> # Update the model with two new sentences on the fly.
>>> phrases.add_vocab([["hello", "world"], ["meow"]])
>>>
>>> # Export the trained model: uses less RAM, faster processing. Model updates are no longer possible.
>>> bigram = Phraser(phrases)
>>> bigram[['trees', 'graph', 'minors']]  # apply the exported model to a sentence
['trees_graph', 'minors']
>>>
>>> # Apply the exported model to each sentence of a corpus:
>>> for sent in bigram[sentences]:
...     pass
>>>
>>> # Save / load an exported collocation model.
>>> bigram.save("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded = Phraser.load("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded[['trees', 'graph', 'minors']]  # apply the exported model to a sentence
['trees_graph', 'minors']
class gensim.models.phrases.Phraser(phrases_model)

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.models.phrases.PhrasesTransformation

Minimal state & functionality exported from Phrases.

The goal of this class is to cut down memory consumption of Phrases, by discarding model state not strictly needed for the bigram detection task.

Use this instead of Phrases if you do not need to update the bigram statistics with new documents any more.

Parameters

phrases_model (Phrases) – Trained phrases instance.

Notes

After the one-time initialization, a Phraser will be much smaller and somewhat faster than using the full Phrases model.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1)
>>>
>>> bigram = Phraser(phrases)
>>> sent = [u'trees', u'graph', u'minors']
>>> print(bigram[sent])
[u'trees_graph', u'minors']
analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.

  • threshold (float) – The minimum score for a bigram to be taken into account.

  • common_terms (list of object) – List of common terms (“stop words”); these receive special treatment and may appear inside detected phrases.

  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().

Yields

(str, score) – If a bigram was detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single token and the second element is None.
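
For illustration, a hedged sketch of calling the analyzer directly on an exported model (normally you would just use bigram[sent]; passing scorer=None relies on Phraser looking scores up in its precomputed phrasegrams table):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
>>> # Each item is (token(s), score); score is None for plain unigrams.
>>> for words, score in bigram.analyze_sentence(
...         ['trees', 'graph', 'minors'],
...         threshold=bigram.threshold,
...         common_terms=bigram.common_terms,
...         scorer=None):
...     pass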

classmethod load(*args, **kwargs)

Load a previously saved Phrases / Phraser class. Handles backwards compatibility from older Phrases / Phraser versions which did not support pluggable scoring functions.

Parameters
  • args (object) – Sequence of arguments, see load for more information.

  • kwargs (object) – Keyword arguments, see load for more information.

pseudocorpus(phrases_model)

Alias for gensim.models.phrases.pseudocorpus().

Parameters

phrases_model (Phrases) – Phrases instance.

Returns

Generator with phrases.

Return type

generator

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

score_item(worda, wordb, components, scorer)

Score a bigram.

Parameters
  • worda (str) – First word for comparison.

  • wordb (str) – Second word for comparison.

  • components (generator) – Generator of the tokens that make up the candidate phrase.

  • scorer ({'default', 'npmi'}) – Not used by Phraser; the score is looked up in the precomputed phrasegrams instead.

Returns

Score for the given bigram, or -1 if the bigram is not present in the dictionary.

Return type

float

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.models.phrases.PhrasesTransformation

Detect phrases based on collocation counts.

Parameters
  • sentences (iterable of list of str, optional) – The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence for such examples.

  • min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.

  • threshold (float, optional) – Score threshold for forming phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on the concrete scoring function; see the scoring parameter.

  • max_vocab_size (int, optional) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.

  • delimiter (str, optional) – Glue character used to join collocation tokens; should be a byte string (e.g. b'_').

  • progress_per (int, optional) – Log progress once every progress_per sentences.

  • scoring ({'default', 'npmi', function}, optional) –

    Specify how potential phrases are scored. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    1. 'default' - original_scorer().

    2. 'npmi' - npmi_scorer().

  • common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detecting expressions like “bank_of_america” or “eye_of_the_beholder”; see the sketch below.
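
For instance, a hedged sketch of common_terms in action (the stopword choice is illustrative; tokens are converted to the vocabulary encoding internally):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # "of" and "the" may now appear inside detected phrases ("eye_of_the_beholder")
>>> # without counting against the surrounding words' statistics.
>>> phrases = Phrases(sentences, min_count=1, threshold=1, common_terms=frozenset(['of', 'the']))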

Notes

‘npmi’ is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default. The default is the PMI-like scoring described in Mikolov et al.: “Distributed Representations of Words and Phrases and their Compositionality”.
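
A minimal sketch of switching to the 'npmi' scorer; its scores live on the -1 to 1 scale, so the threshold must be chosen accordingly (the 0.5 below is illustrative, not a recommendation):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.5, scoring='npmi')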

To use a custom scoring function, pass in a function with the following signature:

  • worda_count - number of corpus occurrences in sentences of the first token in the bigram being scored

  • wordb_count - number of corpus occurrences in sentences of the second token in the bigram being scored

  • bigram_count - number of occurrences in sentences of the whole bigram

  • len_vocab - the number of unique tokens in sentences

  • min_count - the min_count setting of the Phrases class

  • corpus_word_count - the total number of tokens (non-unique) in sentences

The scoring function must accept all these parameters, even if it doesn’t use them in its scoring. The scoring function must be pickleable.
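
A hedged sketch of such a custom function (the name count_scorer and its simple count-based logic are illustrative, not part of gensim):

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> def count_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
...     """Rank bigrams purely by raw co-occurrence count; the other stats are ignored."""
...     return float(bigram_count)
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # With this scorer, threshold=1 accepts any bigram seen more than once.
>>> phrases = Phrases(sentences, min_count=1, threshold=1, scoring=count_scorer)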

add_vocab(sentences)

Update model with new sentences.

Parameters

sentences (iterable of list of str) – Text corpus.

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>> # Create corpus and use it for phrase detector
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences)  # train model
>>> assert len(phrases.vocab) == 37
>>>
>>> more_sentences = [
...     [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
...     [u'machine', u'learning', u'can', u'be', u'new', u'york', u'sometimes']
... ]
>>>
>>> phrases.add_vocab(more_sentences)  # add new sentences to model
>>> assert len(phrases.vocab) == 60
analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.

  • threshold (float) – The minimum score for a bigram to be taken into account.

  • common_terms (list of object) – List of common terms (“stop words”); these receive special treatment and may appear inside detected phrases.

  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().

Yields

(str, score) – If a bigram was detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single token and the second element is None.

export_phrases(sentences, out_delimiter=b' ', as_tuples=False)

Get all phrases that appear in ‘sentences’ that pass the bigram threshold.

Parameters
  • sentences (iterable of list of str) – Text corpus.

  • out_delimiter (str, optional) – Delimiter used to “glue” together words that form a bigram phrase.

  • as_tuples (bool, optional) – Yield (tuple(words), score) instead of (out_delimiter.join(words), score)?

Yields

((str, str), float) or (str, float) – Phrases detected in sentences. The yielded type depends on the as_tuples parameter.

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1)
>>>
>>> for phrase, score in phrases.export_phrases(sentences):
...     pass
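
For instance, a hedged follow-up gathering the distinct phrases and their scores into a dict (in this version the phrases come back as byte strings, hence the decode):

>>> phrase_scores = {phrase.decode('utf8'): score for phrase, score in phrases.export_phrases(sentences)}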
static learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000, common_terms=frozenset({}))

Collect unigram/bigram counts from the sentences iterable.

Parameters
  • sentences (iterable of list of str) – The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence for such examples.

  • max_vocab_size (int) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.

  • delimiter (str, optional) – Glue character used to join collocation tokens; should be a byte string (e.g. b'_').

  • progress_per (int) – Log progress once every progress_per sentences.

  • common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detecting expressions like “bank_of_america” or “eye_of_the_beholder”.

Returns

Number of pruned words, counters for each word/bi-gram and total number of words.

Return type

(int, dict of (str, int), int)

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> pruned_words, counters, total_words = Phrases.learn_vocab(sentences, 100)
>>> (pruned_words, total_words)
(1, 29)
>>> counters['computer']
2
>>> counters['response_time']
1
classmethod load(*args, **kwargs)

Load a previously saved Phrases class. Handles backwards compatibility from older Phrases versions which did not support pluggable scoring functions.

Parameters
  • args (object) – Sequence of arguments, see load for more information.

  • kwargs (object) – Keyword arguments, see load for more information.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

score_item(worda, wordb, components, scorer)

Get bi-gram score statistics.

Parameters
  • worda (str) – First word of bi-gram.

  • wordb (str) – Second word of bi-gram.

  • components (generator) – Generator of the tokens that make up the candidate phrase.

  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().

Returns

Score for the given bigram, or -1 if the bigram is not present in the vocabulary.

Return type

float

class gensim.models.phrases.PhrasesTransformation

Bases: gensim.interfaces.TransformationABC

Base util class for Phrases and Phraser.

classmethod load(*args, **kwargs)

Load a previously saved Phrases / Phraser class. Handles backwards compatibility from older Phrases / Phraser versions which did not support pluggable scoring functions.

Parameters
  • args (object) – Sequence of arguments, see load for more information.

  • kwargs (object) – Keyword arguments, see load for more information.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.models.phrases.SentenceAnalyzer

Bases: object

Base util class for Phrases and Phraser.

analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.

  • threshold (float) – The minimum score for a bigram to be taken into account.

  • common_terms (list of object) – List of common terms (“stop words”); these receive special treatment and may appear inside detected phrases.

  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().

Yields

(str, score) – If a bigram was detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single token and the second element is None.

score_item(worda, wordb, components, scorer)

Get bi-gram score statistics.

Parameters
  • worda (str) – First word of bi-gram.

  • wordb (str) – Second word of bi-gram.

  • components (generator) – Generator of the tokens that make up the candidate phrase.

  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().

Returns

Score for the given bigram, or -1 if the bigram is not present in the vocabulary.

Return type

float

gensim.models.phrases.npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Calculate the NPMI score, based on “Normalized (Pointwise) Mutual Information in Collocation Extraction” by Gerlof Bouma.

Parameters
  • worda_count (int) – Number of occurrences for first word.

  • wordb_count (int) – Number of occurrences for second word.

  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.

  • len_vocab (int) – Not used.

  • min_count (int) – Ignore all bigrams with total collected count lower than this value.

  • corpus_word_count (int) – Total number of words in the corpus.

Returns

Score for given bi-gram, in the range -1 to 1.

Return type

float

Notes

Formula: \frac{\ln\left(\frac{prob(word_a, word_b)}{prob(word_a) \cdot prob(word_b)}\right)}{-\ln\left(prob(word_a, word_b)\right)}, where prob(word) = \frac{word\_count}{corpus\_word\_count}.
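
A small worked example, calling the scorer directly: two words that each occur 10 times in a 1000-word corpus and always co-occur together score the maximum of 1, because the numerator ln(0.01 / (0.01 * 0.01)) and the denominator -ln(0.01) are both ln(100):

>>> from gensim.models.phrases import npmi_scorer
>>> # args: worda_count, wordb_count, bigram_count, len_vocab (unused), min_count, corpus_word_count
>>> print("%.2f" % npmi_scorer(10, 10, 10, 1000, 5, 1000))
1.00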

gensim.models.phrases.original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Bigram scoring function, based on the original Mikolov et al.: “Distributed Representations of Words and Phrases and their Compositionality”.

Parameters
  • worda_count (int) – Number of occurrences for first word.

  • wordb_count (int) – Number of occurrences for second word.

  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.

  • len_vocab (int) – Size of vocabulary.

  • min_count (int) – Minimum collocation count threshold.

  • corpus_word_count (int) – Not used in this particular scoring technique.

Returns

Score for given bi-gram, greater than or equal to 0.

Return type

float

Notes

Formula: \frac{(bigram\_count - min\_count) \cdot len\_vocab}{worda\_count \cdot wordb\_count}.
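
A small worked example, calling the scorer directly: with worda_count = wordb_count = bigram_count = 10, len_vocab = 1000 and min_count = 5, the formula gives (10 - 5) * 1000 / (10 * 10) = 50:

>>> from gensim.models.phrases import original_scorer
>>> # args: worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count (unused)
>>> print("%.1f" % original_scorer(10, 10, 10, 1000, 5, 10000))
50.0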

gensim.models.phrases.pseudocorpus(source_vocab, sep, common_terms=frozenset({}))

Feeds source_vocab’s compound keys back to it, to discover phrases.

Parameters
  • source_vocab (iterable of str) – Token vocabulary, e.g. the keys of a trained Phrases model’s vocab; compound keys contain the separator.

  • sep (str) – Separator element.

  • common_terms (set, optional) – Immutable set of stopwords.

Yields

list of str – Phrase.
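
A toy sketch (the three-key vocabulary is illustrative; keys and the separator are byte strings, matching what Phrases stores internally):

>>> from gensim.models.phrases import pseudocorpus
>>> vocab = {b'computer': 2, b'human': 1, b'computer_human': 1}
>>> # Only the compound key is used; it is fed back split into its components.
>>> list(pseudocorpus(vocab, b'_'))
[[b'computer', b'human']]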