topic_coherence.probability_estimation – Probability estimation module

This module contains functions to perform probability estimation for topic coherence: given segmented topics and a corpus, they estimate word occurrence and co-occurrence probabilities.

gensim.topic_coherence.probability_estimation.p_boolean_document(corpus, segmented_topics)

Perform the boolean document probability estimation. Boolean document estimates the probability of a single word as the number of documents in which the word occurs divided by the total number of documents.
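For intuition, here is a minimal plain-Python sketch of the estimate itself (an illustration only, not the gensim API):

>>> docs = [{'human', 'interface'}, {'user', 'system'}, {'system', 'human'}]
>>> sum('system' in d for d in docs) / len(docs)  # 'system' occurs in 2 of the 3 docs
0.6666666666666666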

Parameters
  • corpus (iterable of list of (int, int)) – The corpus of documents.

  • segmented_topics (list of (int, int)) – Topic segments to estimate probabilities for. Each element of a tuple (word_id_set1, word_id_set2) is either a single integer or a numpy.ndarray of integers.

Returns

Word occurrence accumulator instance that can be used to lookup token frequencies and co-occurrence frequencies.

Return type

CorpusAccumulator

Examples

>>> from gensim.topic_coherence import probability_estimation
>>> from gensim.corpora.hashdictionary import HashDictionary
>>>
>>>
>>> texts = [
...     ['human', 'interface', 'computer'],
...     ['eps', 'user', 'interface', 'system'],
...     ['system', 'human', 'system', 'eps'],
...     ['user', 'response', 'time'],
...     ['trees'],
...     ['graph', 'trees']
... ]
>>> dictionary = HashDictionary(texts)
>>> w2id = dictionary.token2id
>>>
>>> # create segmented_topics
>>> segmented_topics = [
...     [
...         (w2id['system'], w2id['graph']),
...         (w2id['computer'], w2id['graph']),
...         (w2id['computer'], w2id['system'])
...     ],
...     [
...         (w2id['computer'], w2id['graph']),
...         (w2id['user'], w2id['graph']),
...         (w2id['user'], w2id['computer'])
...     ]
... ]
>>> # create corpus
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>>
>>> result = probability_estimation.p_boolean_document(corpus, segmented_topics)
>>> result.index_to_dict()
{10608: {0}, 12736: {1, 3}, 18451: {5}, 5798: {1, 2}}
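The keys above are HashDictionary ids of the four words appearing in the segments, each mapped to the set of ids of the documents containing that word. A word's boolean document probability is therefore the size of its set divided by the number of documents; continuing the session above:

>>> # 'graph' occurs in 1 of the 6 documents
>>> len(result.index_to_dict()[w2id['graph']]) / len(corpus)
0.16666666666666666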
gensim.topic_coherence.probability_estimation.p_boolean_sliding_window(texts, segmented_topics, dictionary, window_size, processes=1)

Perform the boolean sliding window probability estimation.

Parameters
  • texts (iterable of iterable of str) – Input texts, i.e. tokenized documents.

  • segmented_topics (list of (int, int)) – Topic segments to estimate probabilities for. Each element of a tuple (word_id_set1, word_id_set2) is either a single integer or a numpy.ndarray of integers.

  • dictionary (Dictionary) – Gensim dictionary mapping of the tokens and ids.

  • window_size (int) – Size of the sliding window; a size of 110 has been found to work well for large corpora.

  • processes (int, optional) – Number of processes to use for the ParallelWordOccurrenceAccumulator.

Notes

Boolean sliding window determines word counts using a sliding window. The window moves over the documents one word token per step. Each step defines a new virtual document by copying the window content. Boolean document estimation is then applied to these virtual documents to compute word probabilities.
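A minimal sketch of the virtual-document construction for a window of size 2 (an illustration, not the gensim implementation; treating a document shorter than the window as a single virtual document is an assumption consistent with the example output below):

>>> def virtual_docs(doc, window_size):
...     # a document shorter than the window forms one virtual document
...     if len(doc) <= window_size:
...         return [doc]
...     return [doc[i:i + window_size] for i in range(len(doc) - window_size + 1)]
>>> virtual_docs(['eps', 'user', 'interface', 'system'], 2)
[['eps', 'user'], ['user', 'interface'], ['interface', 'system']]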

Returns

Word occurrence accumulator instance that can be used to lookup token frequencies and co-occurrence frequencies.

Return type

WordOccurrenceAccumulator

Examples

>>> from gensim.topic_coherence import probability_estimation
>>> from gensim.corpora.hashdictionary import HashDictionary
>>>
>>>
>>> texts = [
...     ['human', 'interface', 'computer'],
...     ['eps', 'user', 'interface', 'system'],
...     ['system', 'human', 'system', 'eps'],
...     ['user', 'response', 'time'],
...     ['trees'],
...     ['graph', 'trees']
... ]
>>> dictionary = HashDictionary(texts)
>>> w2id = dictionary.token2id
>>>
>>> # create segmented_topics
>>> segmented_topics = [
...     [
...         (w2id['system'], w2id['graph']),
...         (w2id['computer'], w2id['graph']),
...         (w2id['computer'], w2id['system'])
...     ],
...     [
...         (w2id['computer'], w2id['graph']),
...         (w2id['user'], w2id['graph']),
...         (w2id['user'], w2id['computer'])
...     ]
... ]
>>> # create corpus
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> accumulator = probability_estimation.p_boolean_sliding_window(texts, segmented_topics, dictionary, 2)
>>>
>>> (accumulator[w2id['computer']], accumulator[w2id['user']], accumulator[w2id['system']])
(1, 3, 4)
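With window_size=2, the six texts above induce 2, 3, 3, 2, 1 and 1 virtual documents respectively (as sketched in the Notes above, a document shorter than the window counts as a single virtual document): 'computer' then occurs in 1 virtual document, 'user' in 3, and 'system' in 4, which is exactly the tuple returned.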
gensim.topic_coherence.probability_estimation.p_word2vec(texts, segmented_topics, dictionary, window_size=None, processes=1, model=None)

Train a word2vec model on texts if model is None; otherwise, use the supplied pre-trained model to accumulate context vectors.

Parameters
  • texts (iterable of iterable of str) – Input texts, i.e. tokenized documents.

  • segmented_topics (iterable of iterable of str) – Output from the segmentation of topics; may also simply be the topics themselves.

  • dictionary (Dictionary) – Gensim dictionary mapping of the tokens and ids.

  • window_size (int, optional) – Size of the sliding window.

  • processes (int, optional) – Number of processes to use.

  • model (Word2Vec or KeyedVectors, optional) – If None, a new Word2Vec model is trained on the given text corpus. Otherwise, it should be a model holding pre-trained Word2Vec context vectors.

Returns

Text accumulator with trained context vectors.

Return type

WordVectorsAccumulator

Examples

>>> from gensim.topic_coherence import probability_estimation
>>> from gensim.corpora.hashdictionary import HashDictionary
>>> from gensim.models import word2vec
>>>
>>> texts = [
...     ['human', 'interface', 'computer'],
...     ['eps', 'user', 'interface', 'system'],
...     ['system', 'human', 'system', 'eps'],
...     ['user', 'response', 'time'],
...     ['trees'],
...     ['graph', 'trees']
... ]
>>> dictionary = HashDictionary(texts)
>>> w2id = dictionary.token2id
>>>
>>> # create segmented_topics
>>> segmented_topics = [
...     [
...         (w2id['system'], w2id['graph']),
...         (w2id['computer'], w2id['graph']),
...         (w2id['computer'], w2id['system'])
...     ],
...     [
...         (w2id['computer'], w2id['graph']),
...         (w2id['user'], w2id['graph']),
...         (w2id['user'], w2id['computer'])
...     ]
... ]
>>> # create corpus
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> sentences = [
...     ['human', 'interface', 'computer'],
...     ['survey', 'user', 'computer', 'system', 'response', 'time']
... ]
>>> model = word2vec.Word2Vec(sentences, size=100, min_count=1)  # in gensim >= 4.0, use vector_size instead of size
>>> accumulator = probability_estimation.p_word2vec(texts, segmented_topics, dictionary, 2, 1, model)
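The returned accumulator wraps the (possibly pre-trained) context vectors and can be queried for the similarity between word ids, which is how the indirect confirmation measures use it. A brief follow-up; the exact value depends on the randomly initialized toy model trained above, so no fixed output is shown:

>>> # cosine similarity between the vectors for 'computer' and 'system';
>>> # varies run to run with this tiny model
>>> sim = accumulator.ids_similarity(w2id['computer'], w2id['system'])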
gensim.topic_coherence.probability_estimation.unique_ids_from_segments(segmented_topics)

Return the set of all unique ids in a list of segmented topics.

Parameters

segmented_topics (list of (int, int)) – Each element of a tuple (word_id_set1, word_id_set2) is either a single integer or a numpy.ndarray of integers.

Returns

Set of unique ids across all topic segments.

Return type

set

Example

>>> from gensim.topic_coherence import probability_estimation
>>>
>>> segmentation = [[(1, 2)]]
>>> probability_estimation.unique_ids_from_segments(segmentation)
{1, 2}
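Segment elements may also be numpy.ndarray instances (see the parameter description above); the ids inside an array are unpacked into the result. A quick sketch, sorting and converting to plain ints to keep the output deterministic:

>>> import numpy as np
>>> ids = probability_estimation.unique_ids_from_segments([[(1, np.array([2, 3]))]])
>>> sorted(int(i) for i in ids)
[1, 2, 3]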