models.coherencemodel – Topic coherence pipeline

Module for calculating topic coherence in Python. It implements the four-stage topic coherence pipeline described in [1]. The four stages are:

Segmentation -> Probability Estimation -> Confirmation Measure -> Aggregation.

Because each stage is pluggable, you can in effect build a coherence measure of your own by choosing a method for each stage of the pipeline.
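To make the four stages concrete, here is a minimal pure-Python sketch of a u_mass-style measure. It is an illustration only, not gensim's implementation: the function names, the `EPSILON` value, and the toy documents are assumptions made for this example. Segmentation pairs each topic word with the words that precede it, probability estimation counts documents, the confirmation measure is a smoothed log conditional probability, and aggregation is an arithmetic mean.

```python
import math

EPSILON = 1e-12  # smoothing constant; the value is an assumption for this sketch


def segment_one_pre(topic):
    """Stage 1, segmentation: pair each word with every word that precedes it."""
    return [(topic[i], topic[j]) for i in range(1, len(topic)) for j in range(i)]


def log_conditional_probability(pair, doc_sets):
    """Stages 2 and 3: estimate probabilities from document counts, then confirm.

    Assumes every topic word occurs in at least one document.
    """
    w_prime, w_star = pair
    num_docs = len(doc_sets)
    star_count = sum(w_star in doc for doc in doc_sets)
    joint_count = sum(w_prime in doc and w_star in doc for doc in doc_sets)
    return math.log((joint_count / num_docs + EPSILON) / (star_count / num_docs))


def model_coherence(topics, documents):
    """Stage 4, aggregation: average pair scores per topic, then average the topics."""
    doc_sets = [set(doc) for doc in documents]
    per_topic = []
    for topic in topics:
        scores = [log_conditional_probability(p, doc_sets) for p in segment_one_pre(topic)]
        per_topic.append(sum(scores) / len(scores))
    return sum(per_topic) / len(per_topic), per_topic


documents = [
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]
topics = [['graph', 'trees', 'minors'], ['system', 'human', 'eps']]
overall, per_topic = model_coherence(topics, documents)
```

Swapping any one function (for example, a different segmentation or a different confirmation measure) yields a different coherence measure, which is the point of the pipeline design.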

[1] Michael Roeder, Andreas Both and Alexander Hinneburg. Exploring the space of topic coherence measures. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

class gensim.models.coherencemodel.CoherenceModel(model=None, topics=None, texts=None, corpus=None, dictionary=None, window_size=None, keyed_vectors=None, coherence='c_v', topn=20, processes=-1)

Bases: gensim.interfaces.TransformationABC

Objects of this class allow for building and maintaining a model for topic coherence.

The main methods are:

  1. constructor, which initializes the four stage pipeline by accepting a coherence measure,
  2. the get_coherence() method, which returns the topic coherence.

Pipeline phases can also be executed individually. Methods for doing this are:

  1. segment_topics(), which performs segmentation of the given topics into their comparison sets.
  2. estimate_probabilities(), which accumulates word occurrence stats from the given corpus or texts.
    The output of this is also cached on the CoherenceModel, so calling this method can be used as a precomputation step for the next phase.
  3. get_coherence_per_topic(), which uses the segmented topics and estimated probabilities to compute
    the coherence of each topic. This output can be used to rank topics in order of most coherent to least. Such a ranking is useful if the intended use case of a topic model is document exploration by a human. It is also useful for filtering out incoherent topics (keep top-n from ranked list).
  4. aggregate_measures(topic_coherences), which uses the pipeline’s aggregation method to compute
    the overall coherence from the topic coherences.
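The ranking and filtering use of per-topic coherence can be sketched in plain Python. The helper `rank_topics` and the score values below are hypothetical; the scores merely stand in for whatever get_coherence_per_topic() would return for a real model.

```python
def rank_topics(topics, coherences, keep=None):
    """Sort topics from most to least coherent; optionally keep only the top `keep`."""
    ranked = sorted(zip(coherences, topics), key=lambda pair: pair[0], reverse=True)
    return ranked if keep is None else ranked[:keep]


topics = [
    ['graph', 'trees', 'minors'],
    ['system', 'human', 'eps'],
    ['user', 'survey', 'time'],
]
# Made-up scores for illustration, in the same order as `topics`.
coherences = [-0.64, -0.05, -2.31]

best = rank_topics(topics, coherences, keep=2)
best_topics = [topic for _, topic in best]
```

Here `best_topics` holds the two most coherent topics, most coherent first, which is the "keep top-n from ranked list" filter described above.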

One way of using this feature is through providing a trained topic model. A dictionary has to be explicitly provided if the model does not contain a dictionary already:

cm = CoherenceModel(model=tm, corpus=corpus, coherence='u_mass')  # tm is the trained topic model
cm.get_coherence()

Another way of using this feature is through providing tokenized topics such as:

topics = [['human', 'computer', 'system', 'interface'],
          ['graph', 'minors', 'trees', 'eps']]
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass') # note that a dictionary has to be provided.
cm.get_coherence()

Model persistence is provided by the load/save methods.

Parameters:
  • model – Pre-trained topic model. Should be provided if topics is not provided. Currently supports LdaModel, the LdaMallet wrapper and the LdaVowpalWabbit wrapper. Use the topics parameter to plug in a model that is not yet supported.
  • topics

    List of tokenized topics. If topics is passed instead of model, dictionary must also be provided, e.g.:

    topics = [['human', 'machine', 'computer', 'interface'],
              ['graph', 'trees', 'binary', 'widths']]
    
  • texts

    Tokenized texts. Needed for coherence measures that use a sliding-window-based probability estimator, e.g.:

    texts = [['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]
    
  • corpus – Gensim document corpus.
  • dictionary – Gensim dictionary mapping word ids to words, used to create the corpus. Not needed if model.id2word is present. If both are provided, dictionary will be used.
  • window_size

    Size of the window used by coherence measures whose probability estimator is a boolean sliding window. It has no effect for 'u_mass'. If left as None, the default window sizes are used:

    'c_v': 110
    'c_uci': 10
    'c_npmi': 10
  • coherence – Coherence measure to be used. Supported values are:

    'u_mass'
    'c_v'
    'c_uci' (also popularly known as c_pmi)
    'c_npmi'

    For 'u_mass', corpus should be provided. If texts is provided instead, it will be converted to a corpus using the dictionary. For 'c_v', 'c_uci' and 'c_npmi', texts should be provided; corpus is not needed.
  • topn – Integer corresponding to the number of top words to be extracted from each topic.
  • processes – Number of processes to use for the probability estimation phase. Any value less than 1 is interpreted as num_cpus - 1. Default is -1.
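The effect of window_size on a boolean sliding window estimator can be illustrated with a small pure-Python sketch. This shows the general idea only, under the assumption that each window of consecutive tokens is treated as one "virtual document"; the function names are hypothetical, and gensim's actual window handling may differ in detail.

```python
def boolean_sliding_windows(text, window_size):
    """Yield every run of `window_size` consecutive tokens as a set of words."""
    if len(text) <= window_size:
        # A text no longer than the window forms a single virtual document.
        yield set(text)
        return
    for start in range(len(text) - window_size + 1):
        yield set(text[start:start + window_size])


def count_windows(texts, window_size, word_a, word_b):
    """Return (total windows, windows with word_a, windows with word_b, windows with both)."""
    total = with_a = with_b = with_both = 0
    for text in texts:
        for window in boolean_sliding_windows(text, window_size):
            total += 1
            with_a += word_a in window
            with_b += word_b in window
            with_both += word_a in window and word_b in window
    return total, with_a, with_b, with_both


text = ['graph', 'minors', 'trees', 'survey']
narrow = count_windows([text], 2, 'graph', 'trees')  # (3, 1, 2, 0)
wide = count_windows([text], 3, 'graph', 'trees')    # (2, 1, 2, 1)
```

With a window of 2 the words 'graph' and 'trees' never share a window, while a window of 3 captures one co-occurrence. A larger window therefore counts looser, longer-range co-occurrences.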
aggregate_measures(topic_coherences)

Aggregate the individual topic coherence measures using the pipeline’s aggregation function.

compare_model_topics(model_topics)

Perform the coherence evaluation for each of the models.

This first precomputes the probabilities once, then evaluates coherence for each model.

Because the probabilities are already precomputed, each evaluation only reads the statistics accumulated in the CoherenceModel, so it should be fast.

Parameters: model_topics (list) – List of topic lists, one per model (each model trained with a different number of topics); each topic is a list of its top-N words.
Returns: List of (avg_topic_coherences, avg_coherence) tuples, one per model. These are the coherence values per topic and the overall model coherence.
Return type: list
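The "precompute once, evaluate many" idea can be sketched in pure Python. This is not gensim's code: the function names, the `EPSILON` value, and the u_mass-style scoring are assumptions for illustration. A single pass over the documents builds an inverted index for every word used by any model, and each model is then scored from that shared index.

```python
import math

EPSILON = 1e-12  # smoothing constant; the value is an assumption for this sketch


def build_inverted_index(documents, vocabulary):
    """Map each relevant word to the ids of the documents that contain it."""
    index = {word: set() for word in vocabulary}
    for doc_id, doc in enumerate(documents):
        for word in vocabulary.intersection(doc):
            index[word].add(doc_id)
    return index


def topic_score(topic, index, num_docs):
    """u_mass-style score: mean smoothed log conditional probability over word pairs."""
    scores = []
    for i in range(1, len(topic)):
        for j in range(i):
            joint = len(index[topic[i]] & index[topic[j]])
            prior = len(index[topic[j]])  # assumed non-zero: the word occurs somewhere
            scores.append(math.log((joint / num_docs + EPSILON) / (prior / num_docs)))
    return sum(scores) / len(scores)


def compare_model_topics_sketch(model_topics, documents):
    """Return one (per-topic coherences, average coherence) tuple per model."""
    vocabulary = {word for topics in model_topics for topic in topics for word in topic}
    index = build_inverted_index(documents, vocabulary)  # the expensive pass, done once
    results = []
    for topics in model_topics:
        per_topic = [topic_score(topic, index, len(documents)) for topic in topics]
        results.append((per_topic, sum(per_topic) / len(per_topic)))
    return results


documents = [
    ['system', 'human', 'system', 'eps'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]
model_topics = [
    [['graph', 'trees']],                         # topics of a 1-topic model
    [['graph', 'minors'], ['system', 'human']],   # topics of a 2-topic model
]
results = compare_model_topics_sketch(model_topics, documents)
```

Each entry of `results` corresponds to one model, so the average coherences can be compared directly, for example to choose the number of topics.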
compare_models(models)
estimate_probabilities(segmented_topics=None)

Accumulate word occurrences and co-occurrences from texts or corpus using the optimal method for the chosen coherence metric. This operation may take quite some time for the sliding window based coherence methods.

classmethod for_models(models, dictionary, topn=20, **kwargs)

Initialize a CoherenceModel with estimated probabilities for all of the given models.

Parameters: models (list) – List of models to evaluate the coherence of; the only requirement is that each has a get_topics method.
classmethod for_topics(topics_as_topn_terms, **kwargs)

Initialize a CoherenceModel with estimated probabilities for all of the given topics.

Parameters: topics_as_topn_terms (list of lists) – Each element in the top-level list should be the list of topics for one model. Each model's topics should be given as lists of top-N words, one list per topic.
get_coherence()

Return coherence value based on pipeline parameters.

get_coherence_per_topic(segmented_topics=None, with_std=False, with_support=False)

Return list of coherence values for each topic based on pipeline parameters.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if a compressed file is loaded with mmap set.

measure
model
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

segment_topics()
static top_topics_as_word_lists(model, dictionary, topn=20)
topics
topn