`models.coherencemodel` – Topic coherence pipeline

Calculate topic coherence for topic models. This is the implementation of the four-stage topic coherence pipeline from the paper Michael Roeder, Andreas Both and Alexander Hinneburg: “Exploring the space of topic coherence measures”. Typically, CoherenceModel is used for the evaluation of topic models.

The four-stage pipeline consists of:

Segmentation

Probability Estimation

Confirmation Measure

Aggregation

This implementation lets the user, in essence, “make” a coherence measure of their choice by choosing a method for each stage of the pipeline.
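As a rough illustration of how the four stages fit together, here is a minimal pure-Python sketch of a u_mass-style measure. It does not call Gensim and is not Gensim's implementation: the helper names (`segment_one_pre`, `estimate_document_counts`, `log_conditional_probability`, `arithmetic_mean`, `coherence`), the smoothing constant and the handling of unseen words are all assumptions made for this example.

```python
import math

EPSILON = 1e-12  # assumed smoothing constant that keeps log() away from zero


def segment_one_pre(topic):
    """Stage 1, segmentation: pair each word with every word preceding it."""
    return [(topic[i], topic[j]) for i in range(1, len(topic)) for j in range(i)]


def estimate_document_counts(texts, words):
    """Stage 2, probability estimation: count the documents that contain
    each word and each unordered pair of words."""
    words = set(words)
    single, pair = {}, {}
    for text in texts:
        present = sorted(words & set(text))
        for w in present:
            single[w] = single.get(w, 0) + 1
        for a in present:
            for b in present:
                if a < b:
                    pair[(a, b)] = pair.get((a, b), 0) + 1
    return single, pair


def log_conditional_probability(segments, single, pair, num_docs):
    """Stage 3, confirmation measure: log P(w_star | w_prime) per segment."""
    scores = []
    for w_star, w_prime in segments:
        prior = single.get(w_prime, 0) / num_docs
        if prior == 0:
            scores.append(0.0)  # assumption: unseen conditioning word scores 0
            continue
        joint = pair.get(tuple(sorted((w_star, w_prime))), 0) / num_docs
        scores.append(math.log((joint + EPSILON) / prior))
    return scores


def arithmetic_mean(values):
    """Stage 4, aggregation: plain arithmetic mean."""
    return sum(values) / len(values)


def coherence(topics, texts):
    """Run the four stages; every topic needs at least two words."""
    vocabulary = {w for topic in topics for w in topic}
    single, pair = estimate_document_counts(texts, vocabulary)
    per_topic = []
    for topic in topics:
        segments = segment_one_pre(topic)
        scores = log_conditional_probability(segments, single, pair, len(texts))
        per_topic.append(arithmetic_mean(scores))
    return arithmetic_mean(per_topic)


texts = [
    ['human', 'computer'],
    ['human', 'interface'],
    ['computer', 'interface', 'human'],
]
score = coherence([['human', 'computer', 'interface']], texts)
```

Swapping any one of the four functions for a different method (for example, a sliding-window estimator in stage 2) yields a different coherence measure, which is the flexibility the pipeline is designed to offer.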

See also

gensim.topic_coherence: Internal functions for pipelines.

class gensim.models.coherencemodel.CoherenceModel(model=None, topics=None, texts=None, corpus=None, dictionary=None, window_size=None, keyed_vectors=None, coherence='c_v', topn=20, processes=-1)

Bases: TransformationABC

Objects of this class allow for building and maintaining a model for topic coherence.

Examples

One way of using this feature is by providing a trained topic model. A dictionary has to be explicitly provided if the model does not already contain one:

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.models.coherencemodel import CoherenceModel
>>>
>>> model = LdaModel(common_corpus, 5, common_dictionary)
>>>
>>> cm = CoherenceModel(model=model, corpus=common_corpus, coherence='u_mass')
>>> coherence = cm.get_coherence()  # get coherence value

Another way of using this feature is by providing tokenized topics, such as:

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.coherencemodel import CoherenceModel
>>> topics = [
...     ['human', 'computer', 'system', 'interface'],
...     ['graph', 'minors', 'trees', 'eps']
... ]
>>>
>>> cm = CoherenceModel(topics=topics, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')
>>> coherence = cm.get_coherence()  # get coherence value

Parameters

model (BaseTopicModel, optional) – Pre-trained topic model, should be provided if topics is not provided. Currently supports LdaModel, LdaMulticore. Use topics parameter to plug in an as yet unsupported model.
topics (list of list of str, optional) – List of tokenized topics. If this is used instead of model, dictionary should be provided.
texts (list of list of str, optional) – Tokenized texts, needed for coherence measures that use a sliding-window-based probability estimator (i.e. coherence=`c_something`).
corpus (iterable of list of (int, number), optional) – Corpus in BoW format.
dictionary (Dictionary, optional) – Gensim dictionary mapping word ids to words, used to create the corpus. If model.id2word is present, this is not needed. If both are provided, the passed dictionary will be used.
window_size (int, optional) – Size of the window to be used for coherence measures that use a boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None, the default window sizes are used: ‘c_v’ - 110, ‘c_uci’ - 10, ‘c_npmi’ - 10.
coherence ({'u_mass', 'c_v', 'c_uci', 'c_npmi'}, optional) – Coherence measure to be used. The fastest method is ‘u_mass’; ‘c_uci’ is also known as c_pmi. For ‘u_mass’, corpus should be provided; if texts is provided instead, it will be converted to corpus using the dictionary. For ‘c_v’, ‘c_uci’ and ‘c_npmi’, texts should be provided (corpus isn’t needed).
topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
processes (int, optional) – Number of processes to use for the probability estimation phase; any value less than 1 will be interpreted as num_cpus - 1.
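To make the role of window_size more concrete: a boolean sliding window estimator treats every run of window_size consecutive tokens as a virtual document and records only whether a word occurs in it, not how often. The following plain-Python sketch illustrates the idea under that assumption; `sliding_windows` and `window_probability` are hypothetical helpers written for this example, not part of Gensim.

```python
def sliding_windows(tokens, window_size):
    """Yield each run of `window_size` consecutive tokens. A text that is
    not longer than the window yields one window holding the whole text."""
    if len(tokens) <= window_size:
        yield list(tokens)
        return
    for start in range(len(tokens) - window_size + 1):
        yield tokens[start:start + window_size]


def window_probability(texts, window_size, *words):
    """Fraction of all windows that contain every word in `words`."""
    wanted = set(words)
    total = 0
    hits = 0
    for tokens in texts:
        for window in sliding_windows(tokens, window_size):
            total += 1
            if wanted <= set(window):  # boolean occurrence, counts ignored
                hits += 1
    return hits / total if total else 0.0


texts = [
    ['human', 'machine', 'interface', 'computer'],
    ['graph', 'trees'],
]
p_human = window_probability(texts, 2, 'human')
p_pair = window_probability(texts, 2, 'human', 'machine')
```

A larger window lets words that are further apart count as co-occurring, which is why the chosen size affects the sliding-window measures but not ‘u_mass’.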

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters

event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) – Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

This method will automatically add the following key-values to event, so you don’t have to specify them:
- datetime: the current date & time
- gensim: the current Gensim version
- python: the current Python version
- platform: the current platform
- event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
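The shape of a recorded event can be sketched with the standard library alone. The function below is a hypothetical stand-in, not Gensim's code: `build_lifecycle_event` and its `library_version` parameter are assumptions introduced so that the example runs without Gensim installed, and the exact value formats (for instance of the datetime string) are guesses.

```python
import json
import platform
import sys
from copy import deepcopy
from datetime import datetime


def build_lifecycle_event(event_name, library_version, **event):
    """Return a copy of `event` with the automatically added keys filled in."""
    record = deepcopy(event)
    record['datetime'] = datetime.now().isoformat()
    record['gensim'] = library_version  # hypothetical: passed in by the caller
    record['python'] = sys.version
    record['platform'] = platform.platform()
    record['event'] = event_name
    json.dumps(record)  # raises TypeError if a value is not JSON-serializable
    return record


# A list standing in for the lifecycle_events attribute.
lifecycle_events = []
lifecycle_events.append(
    build_lifecycle_event('created', '0.0.0', msg='illustrative event only')
)
```

Keeping the values JSON-serializable, as the parameter description asks, is what allows the whole record to be logged and persisted without special handling.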

aggregate_measures(topic_coherences)

Aggregate the individual topic coherence measures using the pipeline’s aggregation function. Uses self.measure.aggr(topic_coherences).

Parameters: topic_coherences (list of float) – List of confirmation measures calculated on each set in the segmented topics.
Returns: Arithmetic mean of all the values contained in the confirmation measures.
Return type: float

compare_model_topics(model_topics)

Perform the coherence evaluation for each of the models.

Parameters: model_topics (list of list of str) – List of list of words for the model trained with that number of topics.
Returns: Sequence of pairs of average topic coherence and average coherence.
Return type: list of (float, float)

Notes

This first precomputes the probabilities once, then evaluates coherence for each model.

Since the probabilities have already been precomputed, this simply involves using the accumulated stats in the CoherenceModel to perform the evaluations, which should be fast.

compare_models(models)

Compare topic models by coherence value.

Parameters: models (BaseTopicModel) – Sequence of topic models.
Returns: Sequence of pairs of average topic coherence and average coherence.
Return type: list of (float, float)

estimate_probabilities(segmented_topics=None)

Accumulate word occurrences and co-occurrences from texts or corpus using the optimal method for the chosen coherence metric.

Notes

This operation may take considerable time for the sliding-window-based coherence methods.

Parameters: segmented_topics (list of list of pair, optional) – Segmented topics, typically produced by segment_topics().
Returns: Corpus accumulator.
Return type: CorpusAccumulator

classmethod for_models(models, dictionary, topn=20, **kwargs)

Initialize a CoherenceModel with estimated probabilities for all of the given models. Uses the for_topics() method.

Parameters

models (list of BaseTopicModel) – List of models to evaluate the coherence of; each of them should implement the get_topics() method.
dictionary (Dictionary) – Gensim dictionary mapping word ids to words.
topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
kwargs (object) – Sequence of arguments, see for_topics().

Returns

CoherenceModel with estimated probabilities for all of the given models.

Return type

CoherenceModel

Example

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.models.coherencemodel import CoherenceModel
>>>
>>> m1 = LdaModel(common_corpus, 3, common_dictionary)
>>> m2 = LdaModel(common_corpus, 5, common_dictionary)
>>>
>>> cm = CoherenceModel.for_models([m1, m2], common_dictionary, corpus=common_corpus, coherence='u_mass')

classmethod for_topics(topics_as_topn_terms, **kwargs)

Initialize a CoherenceModel with estimated probabilities for all of the given topics.

Parameters: topics_as_topn_terms (list of list of str) – Each element in the top-level list should be the list of topics for a model. The topics for the model should be a list of top-N words, one per topic.
Returns: CoherenceModel with estimated probabilities for all of the given models.
Return type: CoherenceModel
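Following the parameter description, the argument is nested three levels deep: models, then the topics of each model, then the top-N words of each topic. The sketch below shows that shape and one plausible way to derive the combined vocabulary for which probabilities would need to be estimated once. `collect_vocabulary` and `largest_topn` are hypothetical helpers, and treating the union of all words as the estimation target is an assumption about the approach, not a statement of Gensim's internals.

```python
# Outer list: models. Middle list: topics of one model.
# Inner list: top-N words of one topic.
topics_as_topn_terms = [
    [['human', 'computer', 'system'], ['graph', 'trees', 'minors']],
    [['human', 'interface'], ['graph', 'eps'], ['system', 'user', 'time']],
]


def collect_vocabulary(topics_per_model):
    """Return every word used by any topic of any model, in first-seen order."""
    seen = {}
    for model_topics in topics_per_model:
        for topic in model_topics:
            for word in topic:
                seen.setdefault(word, None)
    return list(seen)


def largest_topn(topics_per_model):
    """Return the length of the longest topic across all models."""
    return max(
        len(topic)
        for model_topics in topics_per_model
        for topic in model_topics
    )


vocabulary = collect_vocabulary(topics_as_topn_terms)
topn = largest_topn(topics_as_topn_terms)
```

Estimating probabilities for the whole vocabulary up front means that each model's topics can later be scored from the same accumulated statistics.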

get_coherence()

Get coherence value based on pipeline parameters.

Returns: Value of coherence.
Return type: float

get_coherence_per_topic(segmented_topics=None, with_std=False, with_support=False)

Get list of coherence values for each topic based on pipeline parameters.

Parameters

segmented_topics (list of list of (int, number)) – Topics.
with_std (bool, optional) – True to also include standard deviation across topic segment sets in addition to the mean coherence for each topic.
with_support (bool, optional) – True to also include support across topic segments. The support is defined as the number of pairwise similarity comparisons that were used to compute the overall topic coherence.

Returns

Sequence of similarity measures, one for each topic.

Return type

list of float
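The effect of with_std and with_support can be sketched for a single topic as a summary over its per-segment confirmation values. This is an illustration only: `summarize_segment_scores` is a hypothetical helper, and both the use of the population standard deviation and the tuple layout of the result are assumptions of this sketch.

```python
from statistics import fmean, pstdev


def summarize_segment_scores(segment_scores, with_std=False, with_support=False):
    """Summarize one topic's per-segment confirmation values.

    Returns the mean alone, or a tuple (mean[, std][, support]) when
    extra statistics are requested.
    """
    stats = [fmean(segment_scores)]
    if with_std:
        stats.append(pstdev(segment_scores))  # spread across segments
    if with_support:
        stats.append(len(segment_scores))  # number of pairwise comparisons
    return stats[0] if len(stats) == 1 else tuple(stats)


scores = [-0.4, -0.6, -0.8]
mean_only = summarize_segment_scores(scores)
mean_std_support = summarize_segment_scores(
    scores, with_std=True, with_support=True
)
```

A low support value signals that a topic's coherence rests on few comparisons, which is worth checking before comparing it with topics scored on many more.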

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
