
topic_coherence.indirect_confirmation_measure – Indirect confirmation measure module

This module contains functions to compute confirmation on a pair of words or word subsets.

The advantage of an indirect confirmation measure is that it computes the similarity of the word sets W’ and W* with respect to their direct confirmations to all other words. For example, suppose x and z are both competing brands of cars, which semantically support each other. The two brands are seldom mentioned together in documents of the reference corpus, but their confirmations to other words like “road” or “speed” strongly correlate. An indirect confirmation measure reflects this correlation, and can thus capture semantic support that direct measures would miss.

The formula used to compute indirect confirmation measure is

\tilde{m}_{sim(m,\gamma)}(W', W^*) = s_{sim}\big(\vec{v}_{m,\gamma}(W'),\ \vec{v}_{m,\gamma}(W^*)\big)

where s_sim can be cosine, dice or jaccard similarity and

\vec{v}_{m,\gamma}(W') = \Bigg\{\sum_{w_i \in W'} m(w_i, w_j)^{\gamma}\Bigg\}_{j = 1,\ldots,|W|}

Here ‘m’ is the direct confirmation measure used.
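The formula above can be sketched in a few lines of plain Python. The direct confirmation table m(w_i, w_j) below uses hypothetical toy values (not values produced by gensim); the point is only to show how the context vectors and the indirect cosine measure are assembled from it.

```python
import math

# Hypothetical direct confirmation values m(w_i, w_j) over a toy
# top-word set W = [0, 1, 2, 3] (word ids). Words 0 and 1 confirm
# the same neighbours, so their context vectors should be similar.
W = [0, 1, 2, 3]
m = {
    (0, 0): 1.0, (0, 1): 0.8, (0, 2): 0.1, (0, 3): 0.3,
    (1, 0): 0.8, (1, 1): 1.0, (1, 2): 0.2, (1, 3): 0.4,
    (2, 0): 0.1, (2, 1): 0.2, (2, 2): 1.0, (2, 3): 0.7,
    (3, 0): 0.3, (3, 1): 0.4, (3, 2): 0.7, (3, 3): 1.0,
}

def context_vector(w_prime, gamma=1):
    """v_j = sum over w_i in W' of m(w_i, w_j)^gamma, for j = 1..|W|."""
    return [sum(m[(w_i, w_j)] ** gamma for w_i in w_prime) for w_j in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Indirect confirmation for the segment (W', W*) = ({0}, {1}):
sim = cosine(context_vector([0]), context_vector([1]))
```

Even though m(0, 1) alone says little, the two context vectors point in nearly the same direction, so `sim` comes out close to 1 — exactly the "indirect" support the paragraph above describes.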

class gensim.topic_coherence.indirect_confirmation_measure.ContextVectorComputer(measure, topics, accumulator, gamma)

Bases: object

Lazily compute context vectors for topic segments.

compute_context_vector(segment_word_ids, topic_word_ids)

Step 1. Check whether the context vector for (segment_word_ids, topic_word_ids) has already been cached. Step 2. If yes, return the cached context vector; otherwise compute it, cache it, and return it.
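The cache-then-compute pattern described in these two steps can be sketched as follows. This is a minimal illustration of the lazy-caching idea, not gensim's actual `ContextVectorComputer` implementation; the `measure` callable standing in for the direct confirmation measure is hypothetical.

```python
class LazyContextVectors:
    """Minimal sketch of lazily computed, cached context vectors."""

    def __init__(self, measure, gamma=1):
        self.measure = measure  # direct confirmation m(w_i, w_j) as a callable
        self.gamma = gamma
        self._cache = {}

    def compute_context_vector(self, segment_word_ids, topic_word_ids):
        key = (segment_word_ids, topic_word_ids)
        if key not in self._cache:          # Step 1: check the cache
            self._cache[key] = tuple(       # Step 2: compute, cache, return
                sum(self.measure(w_i, w_j) ** self.gamma
                    for w_i in segment_word_ids)
                for w_j in topic_word_ids
            )
        return self._cache[key]

# Toy direct measure (hypothetical): inverse distance between word ids.
cv = LazyContextVectors(measure=lambda i, j: 1.0 / (1 + abs(i - j)))
v1 = cv.compute_context_vector((0,), (0, 1, 2))
v2 = cv.compute_context_vector((0,), (0, 1, 2))  # served from the cache
```

The tuple keys make the cache lookup cheap, and repeated segments across topics are computed only once.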

gensim.topic_coherence.indirect_confirmation_measure.cosine_similarity(segmented_topics, accumulator, topics, measure='nlr', gamma=1, with_std=False, with_support=False)

This function calculates the indirect cosine measure.

Given context vectors u = V(W’) and w = V(W*) for the word sets of a pair S_i = (W’, W*), the indirect cosine measure is computed as the cosine similarity between u and w.

The formula used is

\tilde{m}_{sim(m,\gamma)}(W', W^*) = s_{sim}\big(\vec{v}_{m,\gamma}(W'),\ \vec{v}_{m,\gamma}(W^*)\big)

where each vector

\vec{v}_{m,\gamma}(W') = \Bigg\{\sum_{w_i \in W'} m(w_i, w_j)^{\gamma}\Bigg\}_{j = 1,\ldots,|W|}
Parameters:
  • segmented_topics – Output from the segmentation module: a list of lists of (W’, W*) tuples.
  • accumulator – Output from the probability_estimation module: an accumulator of word occurrences (see the text_analysis module).
  • topics – Topics obtained from the trained topic model.
  • measure (str) – Direct confirmation measure to be used. The only supported value is “nlr” (normalized log ratio).
  • gamma – Gamma value (exponent) for computing the W’ and W* context vectors; default is 1.
  • with_std (bool) – True to also include the standard deviation across topic segment sets, in addition to the mean coherence for each topic; default is False.
  • with_support (bool) – True to also include support across topic segments. The support is defined as the number of pairwise similarity comparisons used to compute the overall topic coherence; default is False.
Returns:

List of indirect cosine similarity measures, one per topic.

Return type:

list
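The `with_std` and `with_support` flags above only change how each topic's per-segment similarities are reduced to the reported value. The aggregation can be sketched as follows; this is a hedged illustration of the flag semantics as documented, not gensim's internal aggregation code, and the `aggregate` helper and the sample similarities are hypothetical.

```python
import statistics

def aggregate(similarities, with_std=False, with_support=False):
    """Reduce one topic's per-segment similarities to its reported coherence:
    the mean, optionally with standard deviation and/or support appended."""
    mean = statistics.fmean(similarities)
    std = statistics.pstdev(similarities)
    support = len(similarities)  # number of pairwise comparisons made
    if with_std and with_support:
        return mean, std, support
    if with_std:
        return mean, std
    if with_support:
        return mean, support
    return mean

# Hypothetical per-segment similarities for a single topic.
topic_sims = [0.9, 0.8, 0.7]
result = aggregate(topic_sims, with_std=True, with_support=True)
```

With both flags set, each topic yields a (mean, std, support) tuple instead of a bare mean.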

gensim.topic_coherence.indirect_confirmation_measure.word2vec_similarity(segmented_topics, accumulator, with_std=False, with_support=False)

For each topic segmentation, compute average cosine similarity using a WordVectorsAccumulator.

Parameters:
  • segmented_topics (list) – Output from the segmentation module: a list of lists of (W’, W*) tuples.
  • accumulator – Word occurrence accumulator from the probability_estimation module.
  • with_std (bool) – True to also include the standard deviation across topic segment sets, in addition to the mean coherence for each topic; default is False.
  • with_support (bool) – True to also include support across topic segments. The support is defined as the number of pairwise similarity comparisons used to compute the overall topic coherence; default is False.
Returns:

List of word2vec cosine similarities, one per topic.

Return type:

list
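The per-segment computation behind this function can be sketched directly: average the cosine similarity of the word vectors over all word pairs of a segment (W’, W*). The two-dimensional toy vectors below are hypothetical stand-ins for a trained word2vec model, not output of a WordVectorsAccumulator.

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical toy word vectors standing in for a trained word2vec model.
vectors = {
    "road":  [0.9, 0.1],
    "speed": [0.8, 0.2],
    "cake":  [0.1, 0.9],
}

def segment_similarity(w_prime, w_star):
    """Average cosine similarity over all word pairs of a segment (W', W*)."""
    sims = [cos(vectors[a], vectors[b]) for a in w_prime for b in w_star]
    return sum(sims) / len(sims)

high = segment_similarity(["road"], ["speed"])  # semantically close pair
low = segment_similarity(["road"], ["cake"])    # semantically distant pair
```

A topic's coherence is then the mean of these per-segment averages, with std and support added when the corresponding flags are set.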