topic_coherence.probability_estimation – Probability estimation module

This module contains functions to estimate word occurrence and co-occurrence probabilities from a corpus, given a list of segmented topics.

gensim.topic_coherence.probability_estimation.p_boolean_document(corpus, segmented_topics)

Perform boolean document probability estimation. The probability of a single word is estimated as the number of documents in which the word occurs divided by the total number of documents.

Parameters:
  • corpus – The corpus of documents.
  • segmented_topics – Output from the segmentation of topics. Could be simply topics too.
Returns:
  Word occurrence accumulator instance that can be used to look up token frequencies and co-occurrence frequencies.
Return type:
  accumulator
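
The boolean document estimate can be illustrated with a minimal pure-Python sketch. It does not call gensim and is not the library's implementation. The helper name `boolean_document_counts`, the toy corpus, and the treatment of word pairs (counting documents that contain both words) are assumptions made for illustration. The corpus is assumed to be in bag-of-words form, that is, each document is a list of (word_id, count) pairs.

```python
from itertools import combinations


def boolean_document_counts(corpus, relevant_ids):
    """Count the documents containing each relevant id and each pair of relevant ids.

    Hypothetical helper for illustration only; not part of gensim.
    """
    relevant_ids = set(relevant_ids)
    occurrences = {word_id: 0 for word_id in relevant_ids}
    co_occurrences = {}
    num_docs = 0
    for document in corpus:
        num_docs += 1
        # Only the presence of a word matters, so its in-document count is ignored.
        present = sorted({word_id for word_id, _ in document} & relevant_ids)
        for word_id in present:
            occurrences[word_id] += 1
        # Pairs are stored with the smaller id first because `present` is sorted.
        for pair in combinations(present, 2):
            co_occurrences[pair] = co_occurrences.get(pair, 0) + 1
    return num_docs, occurrences, co_occurrences


# Toy bag-of-words corpus (assumed data): four documents over word ids 0, 1 and 2.
corpus = [
    [(0, 1), (1, 2)],
    [(0, 3), (2, 1)],
    [(1, 1), (2, 1)],
    [(0, 1), (1, 1), (2, 4)],
]
num_docs, occurrences, co_occurrences = boolean_document_counts(corpus, {0, 1, 2})

p_word_0 = occurrences[0] / num_docs  # word 0 occurs in 3 of 4 documents -> 0.75
p_words_0_1 = co_occurrences[(0, 1)] / num_docs  # both occur in 2 of 4 documents -> 0.5
```

Restricting the counts to `relevant_ids` mirrors the role of segmented_topics: only words that appear in the topics need to be tracked.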

gensim.topic_coherence.probability_estimation.p_boolean_sliding_window(texts, segmented_topics, dictionary, window_size, processes=1)

Perform boolean sliding window probability estimation. Word counts are determined with a sliding window that moves over each document one word token per step. Each step defines a new virtual document by copying the window content. Boolean document estimation is then applied to these virtual documents to compute word probabilities.

Parameters:
  • texts – Tokenized documents, each given as a list of string tokens.
  • segmented_topics – Output from the segmentation of topics. Could be simply topics too.
  • dictionary – Gensim dictionary mapping tokens to ids.
  • window_size – Size of the sliding window. A size of 110 has been found to work well for large corpora.
  • processes – Number of processes to use for the estimation. Defaults to 1.
Returns:
  Word occurrence accumulator instance that can be used to look up token frequencies and co-occurrence frequencies.
Return type:
  accumulator
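
The sliding window procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than gensim's implementation: the helper names `sliding_windows` and `boolean_sliding_window_counts` are hypothetical, tokens are counted directly instead of being mapped to ids through a dictionary, and a document no longer than the window is assumed to yield a single window holding the whole document.

```python
def sliding_windows(tokens, window_size):
    """Yield each window position as a virtual document (hypothetical helper)."""
    if len(tokens) <= window_size:
        # Assumption: a short document contributes exactly one window.
        yield tokens
        return
    # The window advances one token per step.
    for start in range(len(tokens) - window_size + 1):
        yield tokens[start:start + window_size]


def boolean_sliding_window_counts(texts, relevant_tokens, window_size):
    """Apply boolean document counting to the virtual documents."""
    relevant_tokens = set(relevant_tokens)
    occurrences = {token: 0 for token in relevant_tokens}
    num_windows = 0
    for tokens in texts:
        for window in sliding_windows(tokens, window_size):
            num_windows += 1
            # A token counts at most once per window, however often it repeats.
            for token in set(window) & relevant_tokens:
                occurrences[token] += 1
    return num_windows, occurrences


# Toy tokenized texts (assumed data).
texts = [
    ["human", "computer", "interface", "system"],
    ["graph", "trees"],
]
num_windows, occurrences = boolean_sliding_window_counts(
    texts, {"computer", "graph"}, window_size=3
)

# Probability of a token = windows containing it / total number of windows.
p_computer = occurrences["computer"] / num_windows
```

Here the first text produces two windows and the second produces one, so "computer" is seen in 2 of 3 virtual documents. Co-occurrence counts could be gathered the same way by counting windows that contain both words of a pair.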

gensim.topic_coherence.probability_estimation.unique_ids_from_segments(segmented_topics)

Return the set of all unique ids in a list of segmented topics.

Parameters:
  • segmented_topics – List of tuples of (word_id_set1, word_id_set2). Each word_id_set is either a single integer or a numpy.ndarray of integers.
Returns:
  Set of unique ids across all topic segments.
Return type:
  set
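
A rough sketch of how such a set could be collected is shown below. It is not the library's code. The function name `collect_unique_ids` is hypothetical, and the input is assumed to hold one list of (word_id_set1, word_id_set2) pairs per topic. Plain lists stand in for numpy arrays so that the example has no dependencies; any iterable of integers is handled the same way.

```python
def collect_unique_ids(segmented_topics):
    """Gather every word id that appears in any segment (hypothetical helper)."""
    unique_ids = set()
    for topic_segments in segmented_topics:
        for word_id_set1, word_id_set2 in topic_segments:
            for id_set in (word_id_set1, word_id_set2):
                if hasattr(id_set, "__iter__"):
                    # An array-like collection of ids.
                    unique_ids.update(int(word_id) for word_id in id_set)
                else:
                    # A single integer id.
                    unique_ids.add(int(id_set))
    return unique_ids


# Two toy topics (assumed data): the first uses single-id pairs,
# the second pairs a single id with a collection of ids.
segmented_topics = [
    [(1, 2), (1, 3)],
    [(4, [1, 5])],
]
ids = collect_unique_ids(segmented_topics)
```

For this input the result is the set {1, 2, 3, 4, 5}: id 1 appears in several segments but is kept once.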