models.phrases – Phrase (collocation) detection

Automatically detect common phrases (multiword expressions) from a stream of sentences.

The phrases are collocations (frequently co-occurring tokens). See [1] for the exact formula.

For example, if your input stream (=an iterable, with each value a list of token strings) looks like:

>>> print(list(sentence_stream))
[[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
 [u'machine', u'learning', u'can', u'be', u'useful', u'sometimes'],
 ...,
]

you’d train the detector with:

>>> phrases = Phrases(sentence_stream)

and then create a performant Phraser object to transform any sentence (list of token strings) using the standard gensim syntax:

>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']

(Note that new_york became a single token.) As usual, you can also transform an entire sentence stream using:

>>> print(list(bigram[any_sentence_stream]))
[[u'the', u'mayor', u'of', u'new_york', u'was', u'there'],
 [u'machine_learning', u'can', u'be', u'useful', u'sometimes'],
 ...,
]

You can also continue updating the collocation counts with new sentences:

>>> bigram.add_vocab(new_sentence_stream)

These phrase streams are meant to be used during text preprocessing, before converting the resulting tokens into vectors using `Dictionary`. See the gensim.models.word2vec module for an example application of using phrase detection.
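
For instance, a minimal sketch of that preprocessing step, assuming the `bigram` model and a re-iterable `sentence_stream` from the examples above:

>>> from gensim.corpora import Dictionary
>>> # build a Dictionary over the phrase-joined tokens, then vectorize each sentence
>>> dictionary = Dictionary(bigram[sentence_stream])
>>> bows = [dictionary.doc2bow(tokens) for tokens in bigram[sentence_stream]]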

The detection can also be run repeatedly, to get phrases longer than two tokens (e.g. new_york_times):

>>> trigram = Phrases(bigram[sentence_stream])
>>> sent = [u'the', u'new', u'york', u'times', u'is', u'a', u'newspaper']
>>> print(trigram[bigram[sent]])
[u'the', u'new_york_times', u'is', u'a', u'newspaper']

The common_terms parameter adds a way to give special treatment to common terms (aka stop words), so that their presence between two words won’t prevent bigram detection. This allows detecting expressions like “bank of america” or “eye of the beholder”.

>>> common_terms = ["of", "with", "without", "and", "or", "the", "a"]
>>> ct_phrases = Phrases(sentence_stream, common_terms=common_terms)

The phraser will of course inherit the common_terms from Phrases.

>>> ct_bigram = Phraser(ct_phrases)
>>> sent = [u'the', u'mayor', u'shows', u'his', u'lack', u'of', u'interest']
>>> print(ct_bigram[sent])
[u'the', u'mayor', u'shows', u'his', u'lack_of_interest']

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
class gensim.models.phrases.Phraser(phrases_model)

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.interfaces.TransformationABC

Minimal state & functionality to apply results of a Phrases model to tokens.

After the one-time initialization, a Phraser will be much smaller and somewhat faster than using the full Phrases model.

Reflects the results of the source model’s min_count, threshold, and scoring settings. (You can tamper with those & create a new Phraser to try other values.)
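
For example, a minimal sketch (assuming a `phrases` model trained as above): tighten the threshold on the full Phrases model, then build a fresh Phraser from it:

>>> phrases.threshold = 50.0  # illustrative value; higher means fewer phrases
>>> stricter_bigram = Phraser(phrases)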

analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence.

sentence: a token list representing the sentence to be analyzed.

threshold: the minimum score for a bigram to be taken into account.

common_terms: the list of common terms; these receive special treatment.

scorer: the scoring function, as passed to Phrases.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

pseudocorpus(phrases_model)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
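
For example, a minimal sketch of the save/load round trip (the file path is illustrative); large arrays stored separately can be memory-mapped back on load:

>>> bigram.save('/tmp/bigram_phraser.pkl')
>>> bigram_loaded = Phraser.load('/tmp/bigram_phraser.pkl', mmap='r')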

score_item(worda, wordb, components, scorer)

Score is retained from the original dataset.

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', common_terms=frozenset([]))

Bases: gensim.models.phrases.SentenceAnalyzer, gensim.interfaces.TransformationABC

Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than expected are joined together with the _ character.

It can be used to generate phrases on the fly, using the phrases[sentence] and phrases[corpus] syntax.

Initialize the model from an iterable of sentences. Each sentence must be a list of words (unicode strings) that will be used for training.

The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, without storing everything in RAM. See BrownCorpus, Text8Corpus or LineSentence in the gensim.models.word2vec module for such examples.

min_count ignore all words and bigrams with total collected count lower than this.

threshold represents a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. See the scoring setting.

max_vocab_size is the maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM; increase/decrease max_vocab_size depending on how much available memory you have.

delimiter is the glue character used to join collocation tokens, and should be a byte string (e.g. b’_’).

scoring specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

‘default’: from “Efficient Estimation of Word Representations in Vector Space” by Mikolov, et al.: (count(worda followed by wordb) - min_count) * N / (count(worda) * count(wordb)) > threshold, where N is the total vocabulary size.

‘npmi’: normalized pointwise mutual information, from “Normalized (Pointwise) Mutual Information in Collocation Extraction” by Gerlof Bouma: ln(prop(worda followed by wordb) / (prop(worda) * prop(wordb))) / -ln(prop(worda followed by wordb)), where prop(n) is the count of n divided by the count of everything in the entire corpus.

‘npmi’ is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.
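
For example, a minimal sketch of switching to NPMI scoring; note that the threshold must then be chosen on the NPMI scale of -1 to 1 (the 0.5 below is purely illustrative):

>>> npmi_phrases = Phrases(sentence_stream, min_count=5, threshold=0.5, scoring='npmi')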

To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function. The function must accept all of these parameters, even if it does not use them all.

worda_count: number of occurrences in sentences of the first token in the phrase being scored

wordb_count: number of occurrences in sentences of the second token in the phrase being scored

bigram_count: number of occurrences in sentences of the phrase being scored

len_vocab: the number of unique tokens in sentences

min_count: the min_count setting of the Phrases class

corpus_word_count: the total number of (non-unique) tokens in sentences

A scoring function that does not accept all of these parameters (even if some of them are unused) will raise a ValueError when the Phrases class is initialized. The scoring function must also be picklable.
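
For example, a minimal sketch of a custom scoring function, a plain co-occurrence ratio chosen only for illustration; it accepts the full required signature even though it ignores several parameters:

>>> def ratio_scorer(worda_count, wordb_count, bigram_count, len_vocab,
...                  min_count, corpus_word_count):
...     # only the three count arguments are actually used here
...     return float(bigram_count) / (worda_count * wordb_count)
...
>>> # the threshold must be chosen on the same scale as the custom scorer's output;
>>> # the 0.001 here is purely illustrative
>>> custom_phrases = Phrases(sentence_stream, scoring=ratio_scorer, threshold=0.001)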

common_terms is an optional list of “stop words” that won’t affect the frequency count of expressions containing them.

add_vocab(sentences)

Merge the collected counts vocab into this phrase detector.

analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence.

sentence: a token list representing the sentence to be analyzed.

threshold: the minimum score for a bigram to be taken into account.

common_terms: the list of common terms; these receive special treatment.

scorer: the scoring function, as passed to Phrases.

export_phrases(sentences, out_delimiter=' ', as_tuples=False)

Generate an iterator over all phrases detected in the given sentences, together with their scores.

Example:

>>> sentences = Text8Corpus(path_to_corpus)
>>> bigram = Phrases(sentences, min_count=5, threshold=100)
>>> for phrase, score in bigram.export_phrases(sentences):
...     print(u'{0}   {1}'.format(phrase, score))

You can then redirect this output to a file and use the resulting TSV to debug or tune the threshold.
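
For instance, a minimal sketch of dumping the detected phrases and scores to a TSV file for inspection (the file path is illustrative):

>>> with open('/tmp/phrases.tsv', 'w') as fout:
...     for phrase, score in bigram.export_phrases(sentences):
...         # depending on the gensim version, `phrase` may be a bytes object
...         if isinstance(phrase, bytes):
...             phrase = phrase.decode('utf-8')
...         fout.write(u'{0}\t{1}\n'.format(phrase, score))
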
static learn_vocab(sentences, max_vocab_size, delimiter='_', progress_per=10000, common_terms=frozenset([]))

Collect unigram/bigram counts from the sentences iterable.

classmethod load(*args, **kwargs)

Load a previously saved Phrases class. Handles backwards compatibility from older Phrases versions which did not support pluggable scoring functions. Otherwise, relies on utils.load.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

score_item(worda, wordb, components, scorer)

class gensim.models.phrases.SentenceAnalyzer

Bases: object

analyze_sentence(sentence, threshold, common_terms, scorer)

Analyze a sentence.

sentence: a token list representing the sentence to be analyzed.

threshold: the minimum score for a bigram to be taken into account.

common_terms: the list of common terms; these receive special treatment.

scorer: the scoring function, as passed to Phrases.

score_item(worda, wordb, components, scorer)

gensim.models.phrases.npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

gensim.models.phrases.original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)

gensim.models.phrases.pseudocorpus(source_vocab, sep, common_terms=frozenset([]))

Feeds source_vocab’s compound keys back to it, to discover phrases