models.phrases – Phrase (collocation) detection

Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.

Inspired by:

  • Mikolov et al.: “Distributed Representations of Words and Phrases and their Compositionality”.
  • Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1)  # train model
>>> phrases[[u'trees', u'graph', u'minors']]  # apply model to sentence
[u'trees_graph', u'minors']
>>>
>>> phrases.add_vocab([["hello", "world"], ["meow"]])  # update model with new sentences
>>>
>>> bigram = Phraser(phrases)  # construct faster model (this is only a wrapper)
>>> bigram[[u'trees', u'graph', u'minors']]  # apply model to sentence
[u'trees_graph', u'minors']
>>>
>>> for sent in bigram[sentences]:  # apply model to text corpus
...     pass
class gensim.models.phrases.Phraser(phrases_model)

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.models.phrases.PhrasesTransformation

Minimal state & functionality exported from Phrases.

The goal of this class is to cut down memory consumption of Phrases, by discarding model state not strictly needed for the bigram detection task.

Use this instead of Phrases if you do not need to update the bigram statistics with new documents any more.

Parameters:phrases_model (Phrases) – Trained phrases instance.

Notes

After the one-time initialization, a Phraser will be much smaller and somewhat faster than using the full Phrases model.

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=1)
>>>
>>> bigram = Phraser(phrases)
>>> sent = [u'trees', u'graph', u'minors']
>>> print(bigram[sent])
[u'trees_graph', u'minors']
analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters:
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.
  • threshold (float) – The minimum score for a bigram to be taken into account.
  • common_terms (list of object) – List of common terms; they receive special treatment.
  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().
Yields:

(str, score) – If a bigram is detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single word and the second is None.

load(*args, **kwargs)

Load a previously saved Phrases / Phraser class. Handles backwards compatibility from older Phrases / Phraser versions which did not support pluggable scoring functions.

Parameters:
  • args (object) – Sequence of arguments, see load for more information.
  • kwargs (object) – Sequence of arguments, see load for more information.
pseudocorpus(phrases_model)

Alias for gensim.models.phrases.pseudocorpus().

Parameters:phrases_model (Phrases) – Phrases instance.
Returns:Generator with phrases.
Return type:generator
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
score_item(worda, wordb, components, scorer)

Score a bigram.

Parameters:
  • worda (str) – First word for comparison.
  • wordb (str) – Second word for comparison.
  • components (generator) – Contains phrases.
  • scorer ({'default', 'npmi'}) – NOT USED.
Returns:

Score for the given bigram; if the bigram is not present in the dictionary, -1 is returned.

Return type:

float

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', common_terms=frozenset([]))

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.models.phrases.PhrasesTransformation

Detect phrases based on collocation counts.

Parameters:
  • sentences (iterable of list of str, optional) – The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, See BrownCorpus, Text8Corpus or LineSentence for such examples.
  • min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.
  • threshold (float, optional) – Represents a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on the concrete scoring function; see the scoring parameter.
  • max_vocab_size (int, optional) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.
  • delimiter (str, optional) – Glue character used to join collocation tokens, should be a byte string (e.g. b’_’).
  • scoring ({'default', 'npmi', function}, optional) –

    Specify how potential phrases are scored. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    1. ”default” - original_scorer().
    2. ”npmi” - npmi_scorer().
  • common_terms (set of str, optional) – List of “stop words” that won’t affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.

Notes

‘npmi’ is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default. The default is the PMI-like scoring described by Mikolov et al. in “Distributed Representations of Words and Phrases and their Compositionality”.

To use a custom scoring function, pass in a function with the following signature:

  • worda_count - number of corpus occurrences in sentences of the first token in the bigram being scored
  • wordb_count - number of corpus occurrences in sentences of the second token in the bigram being scored
  • bigram_count - number of occurrences in sentences of the whole bigram
  • len_vocab - the number of unique tokens in sentences
  • min_count - the min_count setting of the Phrases class
  • corpus_word_count - the total number of tokens (non-unique) in sentences

The scoring function must accept all these parameters, even if it doesn’t use them in its scoring. The scoring function must be pickleable.
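For example, a custom scoring function with the required signature could look like the following sketch. The function name ratio_scorer and its scoring logic are illustrative assumptions, not part of gensim:

```python
# Hypothetical custom scorer with the required six-parameter signature.
# The scoring logic (observed co-occurrence relative to what independent
# words would predict) is illustrative; it is not a gensim built-in.
def ratio_scorer(worda_count, wordb_count, bigram_count,
                 len_vocab, min_count, corpus_word_count):
    """Return a collocation score; higher means a stronger phrase."""
    if bigram_count < min_count:
        return -1.0  # too rare to be considered a phrase
    return (bigram_count * corpus_word_count) / (worda_count * wordb_count)

# Example call with made-up counts:
score = ratio_scorer(worda_count=50, wordb_count=40, bigram_count=30,
                     len_vocab=1000, min_count=5, corpus_word_count=10000)
# score == 150.0
```

Such a function could then be passed as scoring=ratio_scorer when constructing Phrases; note it must be defined at module level so it stays pickleable.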

add_vocab(sentences)

Update model with new sentences.

Parameters:sentences (iterable of list of str) – Text corpus.

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>> # Create a corpus and use it for phrase detection
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences)  # train model
>>> assert len(phrases.vocab) == 37
>>>
>>> more_sentences = [
...    [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
...    [u'machine', u'learning', u'can', u'be', u'new', u'york', u'sometimes']
... ]
>>>
>>> phrases.add_vocab(more_sentences)  # add new sentences to model
>>> assert len(phrases.vocab) == 60
analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters:
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.
  • threshold (float) – The minimum score for a bigram to be taken into account.
  • common_terms (list of object) – List of common terms; they receive special treatment.
  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().
Yields:

(str, score) – If a bigram is detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single word and the second is None.

export_phrases(sentences, out_delimiter=' ', as_tuples=False)

Get all phrases that appear in ‘sentences’ that pass the bigram threshold.

Parameters:
  • sentences (iterable of list of str) – Text corpus.
  • out_delimiter (str, optional) – Delimiter used to “glue” together words that form a bigram phrase.
  • as_tuples (bool, optional) – Yield (tuple(words), score) instead of (out_delimiter.join(words), score)?
Yields:

((str, str), float) or (str, float) – Phrases detected in sentences. The yield type depends on the as_tuples parameter.

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> phrases = Phrases(sentences, min_count=1, threshold=0.1)
>>>
>>> for phrase, score in phrases.export_phrases(sentences):
...     pass
static learn_vocab(sentences, max_vocab_size, delimiter='_', progress_per=10000, common_terms=frozenset([]))

Collect unigram/bigram counts from the sentences iterable.

Parameters:
  • sentences (iterable of list of str) – The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, See BrownCorpus, Text8Corpus or LineSentence for such examples.
  • max_vocab_size (int) – Maximum size (number of tokens) of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM. Increase/decrease max_vocab_size depending on how much available memory you have.
  • delimiter (str, optional) – Glue character used to join collocation tokens, should be a byte string (e.g. b’_’).
  • progress_per (int) – Progress will be logged every progress_per sentences.
  • common_terms (set of str, optional) – List of “stop words” that won’t affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
Returns:

Number of pruned words, counters for each word/bi-gram and total number of words.

Return type:

(int, dict of (str, int), int)

Example

>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases
>>>
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> pruned_words, counters, total_words = Phrases.learn_vocab(sentences, 100)
>>> (pruned_words, total_words)
(1, 29)
>>> counters['computer']
2
>>> counters['response_time']
1
classmethod load(*args, **kwargs)

Load a previously saved Phrases class. Handles backwards compatibility from older Phrases versions which did not support pluggable scoring functions.

Parameters:
  • args (object) – Sequence of arguments, see load for more information.
  • kwargs (object) – Sequence of arguments, see load for more information.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
score_item(worda, wordb, components, scorer)

Get bi-gram score statistics.

Parameters:
  • worda (str) – First word of bi-gram.
  • wordb (str) – Second word of bi-gram.
  • components (generator) – Contains all phrases.
  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().
Returns:

Score for the given bigram; if the bigram is not present in the dictionary, -1 is returned.

Return type:

float

class gensim.models.phrases.PhrasesTransformation

Bases: gensim.interfaces.TransformationABC

Base util class for Phrases and Phraser.

classmethod load(*args, **kwargs)

Load a previously saved Phrases / Phraser class. Handles backwards compatibility from older Phrases / Phraser versions which did not support pluggable scoring functions.

Parameters:
  • args (object) – Sequence of arguments, see load for more information.
  • kwargs (object) – Sequence of arguments, see load for more information.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
class gensim.models.phrases.SentenceAnalyzer

Bases: object

Base util class for Phrases and Phraser.

analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence, detecting any bigrams that should be concatenated.

Parameters:
  • sentence (iterable of str) – Token sequence representing the sentence to be analyzed.
  • threshold (float) – The minimum score for a bigram to be taken into account.
  • common_terms (list of object) – List of common terms; they receive special treatment.
  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().
Yields:

(str, score) – If a bigram is detected, a tuple whose first element is the detected bigram and whose second element is its score. Otherwise, the first element is a single word and the second is None.
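The detection loop can be illustrated with a simplified standalone sketch (this is not gensim's implementation: common_terms handling is omitted, and phrase_scores is a hypothetical dict standing in for the trained model state):

```python
# Simplified sketch of the bigram-detection loop. `phrase_scores` maps a
# joined bigram string to its precomputed score; common_terms are ignored.
def analyze_sentence_sketch(sentence, threshold, phrase_scores, delimiter='_'):
    """Yield (bigram, score) for detected bigrams, (word, None) otherwise."""
    tokens = list(sentence)
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            bigram = tokens[i] + delimiter + tokens[i + 1]
            score = phrase_scores.get(bigram, -1)
            if score > threshold:
                yield bigram, score
                i += 2  # consume both tokens of the detected bigram
                continue
        yield tokens[i], None
        i += 1

scores = {'trees_graph': 5.0}
result = list(analyze_sentence_sketch(['trees', 'graph', 'minors'], 1.0, scores))
# result == [('trees_graph', 5.0), ('minors', None)]
```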

score_item(worda, wordb, components, scorer)

Get bi-gram score statistics.

Parameters:
  • worda (str) – First word of bi-gram.
  • wordb (str) – Second word of bi-gram.
  • components (generator) – Contains all phrases.
  • scorer (function) – Scorer function, as given to Phrases. See npmi_scorer() and original_scorer().
Returns:

Score for the given bigram; if the bigram is not present in the dictionary, -1 is returned.

Return type:

float

gensim.models.phrases.npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Calculate the NPMI score, based on “Normalized (Pointwise) Mutual Information in Collocation Extraction” by Gerlof Bouma.

Parameters:
  • worda_count (int) – Number of occurrences for first word.
  • wordb_count (int) – Number of occurrences for second word.
  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.
  • len_vocab (int) – Not used.
  • min_count (int) – Not used.
  • corpus_word_count (int) – Total number of words in the corpus.

Notes

Formula: \frac{\ln(prob(word_a, word_b) / (prob(word_a) \cdot prob(word_b)))}{-\ln(prob(word_a, word_b))}, where prob(word) = \frac{word\_count}{corpus\_word\_count}
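The formula can be computed directly as a sanity check (a standalone sketch, not gensim's implementation; the unused len_vocab and min_count parameters are omitted here):

```python
from math import log

# Standalone sketch of the NPMI formula above; not gensim's implementation.
def npmi(worda_count, wordb_count, bigram_count, corpus_word_count):
    prob_a = worda_count / corpus_word_count
    prob_b = wordb_count / corpus_word_count
    prob_ab = bigram_count / corpus_word_count
    return log(prob_ab / (prob_a * prob_b)) / -log(prob_ab)

# Perfect co-occurrence scores 1, independence scores 0:
npmi(10, 10, 10, 1000)    # approximately 1.0
npmi(100, 100, 1, 10000)  # approximately 0.0
```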

gensim.models.phrases.original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

Bigram scoring function, based on the original Mikolov et al. paper “Distributed Representations of Words and Phrases and their Compositionality”.

Parameters:
  • worda_count (int) – Number of occurrences for first word.
  • wordb_count (int) – Number of occurrences for second word.
  • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.
  • len_vocab (int) – Size of vocabulary.
  • min_count (int) – Minimum collocation count threshold.
  • corpus_word_count (int) – Not used in this particular scoring technique.

Notes

Formula: \frac{(bigram\_count - min\_count) * len\_vocab }{ (worda\_count * wordb\_count)}.
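A direct computation of this formula (a standalone sketch, not gensim's implementation; the unused corpus_word_count parameter is omitted):

```python
# Standalone sketch of the original Mikolov et al. formula above;
# not gensim's implementation.
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    return (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)

# With made-up counts: (30 - 5) * 1000 / (50 * 40) == 12.5
score = original_score(worda_count=50, wordb_count=40, bigram_count=30,
                       len_vocab=1000, min_count=5)
```

With the default threshold of 10.0, a score of 12.5 would be enough for this hypothetical bigram to be accepted as a phrase.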

gensim.models.phrases.pseudocorpus(source_vocab, sep, common_terms=frozenset([]))

Feeds source_vocab’s compound keys back to it, to discover phrases.

Parameters:
  • source_vocab (iterable of list of str) – Tokens vocabulary.
  • sep (str) – Separator element.
  • common_terms (set, optional) – Immutable set of stopwords.
Yields:

list of str – Phrase.
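The idea can be illustrated with a simplified sketch (not gensim's implementation: common_terms handling is omitted and the function name is illustrative):

```python
# Simplified sketch of pseudocorpus: split each compound vocabulary key on
# the separator and yield its component tokens (common_terms handling omitted).
def pseudocorpus_sketch(source_vocab, sep='_'):
    for key in source_vocab:
        if sep in key:
            yield key.split(sep)

vocab = {'new_york': 5, 'machine': 3, 'machine_learning': 4}
phrases = list(pseudocorpus_sketch(vocab))
# phrases == [['new', 'york'], ['machine', 'learning']]
```

Feeding these component lists back through the scorer is how Phrases rediscovers which of its stored compound keys qualify as phrases.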