models.ensemblelda – Ensemble Latent Dirichlet Allocation

Ensemble Latent Dirichlet Allocation (eLDA), an algorithm for extracting reliable topics.

The aim of topic modelling is to find a set of topics that represent the global structure of a corpus of documents. One issue with topics extracted from an NMF or LDA model is reproducibility: if the topic model is trained repeatedly, allowing only the random seed to change, will the same (or a similar) topic representation be reliably learned? Unreliable topics are undesirable because they are not a good representation of the corpus.

Ensemble LDA addresses this issue by training an ensemble of topic models and discarding topics that do not recur across the ensemble. The extracted topics are therefore more reliable and, unlike with many topic models, the user does not need to know the exact number of topics ahead of time.

For more information, see the citation section below, watch our Machine Learning Prague 2019 talk, or view our Machine Learning Summer School poster.

Usage examples

Train an ensemble of LdaModels using a Gensim corpus:

>>> from gensim.test.utils import common_texts
>>> from gensim.corpora.dictionary import Dictionary
>>> from gensim.models import EnsembleLda
>>>
>>> # Create a corpus from a list of texts
>>> common_dictionary = Dictionary(common_texts)
>>> common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
>>>
>>> # Train the model on the corpus. The corpus has to be provided as a
>>> # keyword argument, because keyword arguments are passed through to the children.
>>> elda = EnsembleLda(corpus=common_corpus, id2word=common_dictionary, num_topics=10, num_models=4)

Save a model to disk, or reload a pre-trained model:

>>> from gensim.test.utils import datapath
>>>
>>> # Save model to disk.
>>> temp_file = datapath("model")
>>> elda.save(temp_file)
>>>
>>> # Load a potentially pretrained model from disk.
>>> elda = EnsembleLda.load(temp_file)

Query the model using new, unseen documents:

>>> # Create a new corpus, made of previously unseen documents.
>>> other_texts = [
...     ['computer', 'time', 'graph'],
...     ['survey', 'response', 'eps'],
...     ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = elda[unseen_doc]  # get topic probability distribution for a document

Increase the ensemble size by adding a new model. Make sure it uses the same dictionary:

>>> from gensim.models import LdaModel
>>> elda.add_model(LdaModel(common_corpus, id2word=common_dictionary, num_topics=10))
>>> elda.recluster()
>>> vector = elda[unseen_doc]

To optimize the ensemble for your specific case, the children can be clustered again using different hyperparameters:

>>> elda.recluster(eps=0.2)

Citation

BRIGL, Tobias, 2019, Extracting Reliable Topics using Ensemble Latent Dirichlet Allocation [Bachelor Thesis]. Technische Hochschule Ingolstadt. Munich: Data Reply GmbH. Available from: https://www.sezanzeb.de/machine_learning/ensemble_LDA/

class gensim.models.ensemblelda.CBDBSCAN(eps, min_samples)

Bases: object

A Variation of the DBSCAN algorithm called Checkback DBSCAN (CBDBSCAN).

The algorithm works based on DBSCAN-like parameters ‘eps’ and ‘min_samples’ that respectively define how far a “nearby” point is, and the minimum number of nearby points needed to label a candidate datapoint a core of a cluster. (See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).

The algorithm works as follows:

  1. An (a)symmetric distance matrix is provided at fit time (called ‘amatrix’). For the sake of the example below, assume there are only five topics (amatrix contains distances with dim 5x5): T_1, T_2, T_3, T_4, T_5.

  2. Start by scanning a candidate topic with respect to a parent topic (e.g. T_1 with respect to parent None)

  3. Check which topics are nearby the candidate topic using ‘self.eps’ as a threshold and call them neighbours (e.g. assume T_3, T_4, and T_5 are nearby and become neighbours)

  4. If there are at least ‘self.min_samples’ neighbours, the candidate topic becomes a core candidate for a cluster (e.g. if ‘min_samples’=1, then T_1 becomes the first core of a cluster)

  5. If candidate is a core, CheckBack (CB) to find the fraction of neighbours that are either the parent or the parent’s neighbours. If this fraction is more than 75%, give the candidate the same label as its parent. (e.g. in the trivial case there is no parent (or neighbours of that parent), a new incremental label is given)

  6. If candidate is a core, recursively scan the next nearby topic (e.g. scan T_3) labeling the previous topic as the parent and the previous neighbours as the parent_neighbours - repeat steps 2-6:

    1. (e.g. Scan candidate T_3 with respect to parent T_1 that has parent_neighbours T_3, T_4, and T_5)

    2. (e.g. T_5 is the only neighbour)

    3. (e.g. number of neighbours is 1, therefore candidate T_3 becomes a core)

    4. (e.g. CheckBack finds that two of the four parent and parent neighbours are neighbours of candidate T_3. Therefore the candidate T_3 does NOT get the same label as its parent T_1)

    5. (e.g. Scan candidate T_5 with respect to parent T_3 that has parent_neighbours T_5)

The CB step has the effect that it enforces cluster compactness and allows the model to avoid creating clusters for unstable topics made of a composition of multiple stable topics.
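The neighbour search and the CheckBack test can be sketched in plain Python. This is a minimal illustration that follows the wording of steps 3 to 5 above; it is not taken from the Gensim source. The helper names `find_neighbours` and `checkback_fraction`, the toy distance matrix, and the zero-based topic indices are assumptions made for this example.

```python
def find_neighbours(amatrix, candidate, eps):
    """Return indices whose distance from `candidate` is below `eps` (self excluded)."""
    return {
        other
        for other, distance in enumerate(amatrix[candidate])
        if other != candidate and distance < eps
    }


def checkback_fraction(neighbours, parent, parent_neighbours):
    """Fraction of `neighbours` that are the parent or one of the parent's neighbours."""
    if not neighbours:
        return 0.0
    parent_group = set(parent_neighbours) | {parent}
    return len(neighbours & parent_group) / len(neighbours)


# Hypothetical asymmetric distances between five topics (row -> column).
amatrix = [
    [0.0, 0.05, 0.08, 0.9, 0.9],
    [0.04, 0.0, 0.06, 0.9, 0.9],
    [0.5, 0.07, 0.0, 0.03, 0.02],
    [0.9, 0.9, 0.05, 0.0, 0.04],
    [0.9, 0.9, 0.06, 0.05, 0.0],
]
eps, min_samples = 0.1, 1

parent = 0
parent_neighbours = find_neighbours(amatrix, parent, eps)

# For each candidate, record (is_core, keeps_parent_label).
results = {}
for candidate in (1, 2):
    neighbours = find_neighbours(amatrix, candidate, eps)
    is_core = len(neighbours) >= min_samples
    fraction = checkback_fraction(neighbours, parent, parent_neighbours)
    results[candidate] = (is_core, is_core and fraction > 0.75)
```

In this toy setup, all neighbours of topic 1 belong to the parent's group, so it would inherit the parent's label. Only one of the three neighbours of topic 2 does, so it would start a new cluster.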

Create a new CBDBSCAN object. Call fit in order to train it on an asymmetric distance matrix.

Parameters
  • eps (float) – epsilon for the CBDBSCAN algorithm, having the same meaning as in classic DBSCAN clustering.

  • min_samples (int) – The minimum number of samples in the neighborhood of a topic to be considered a core in CBDBSCAN.

fit(amatrix)

Apply the algorithm to an asymmetric distance matrix.

class gensim.models.ensemblelda.Cluster(max_num_neighboring_labels: int, neighboring_labels: List[Set[int]], label: int, num_cores: int)

Bases: object

label: int
max_num_neighboring_labels: int
neighboring_labels: List[Set[int]]
num_cores: int

class gensim.models.ensemblelda.EnsembleLda(topic_model_class='ldamulticore', num_models=3, min_cores=None, epsilon=0.1, ensemble_workers=1, memory_friendly_ttda=True, min_samples=None, masking_method=<function mass_masking>, masking_threshold=None, distance_workers=1, random_state=None, **gensim_kw_args)

Bases: gensim.utils.SaveLoad

Ensemble Latent Dirichlet Allocation (eLDA), a method of training a topic model ensemble.

Extracts stable topics that are consistently learned across multiple LDA models. eLDA has the added benefit that the user does not need to know the exact number of topics the topic model should extract ahead of time.

Create and train a new EnsembleLda model.

Will start training immediately, except if iterations, passes or num_models is 0 or if the corpus is missing.

Parameters
  • topic_model_class (str, topic model, optional) –

    Examples:
    • ’ldamulticore’ (default, recommended)

    • ’lda’

    • ldamodel.LdaModel

    • ldamulticore.LdaMulticore

  • ensemble_workers (int, optional) –

    Spawns that many processes and distributes the models from the ensemble to those as evenly as possible. num_models should be a multiple of ensemble_workers.

    Setting it to 0 or 1 uses the non-multiprocessing version. Default: 1

  • num_models (int, optional) – How many LDA models to train in this ensemble. Default: 3

  • min_cores (int, optional) – Minimum number of cores a cluster of topics has to contain so that it is recognized as a stable topic.

  • epsilon (float, optional) – Defaults to 0.1. Epsilon for the CBDBSCAN clustering that generates the stable topics.

  • memory_friendly_ttda (boolean, optional) –

    If True, the models in the ensemble are deleted after training and only a concatenation of each model’s topic term distribution (called ttda) is kept to save memory.

    Defaults to True. When False, trained models are stored in a list in self.tms, and only models of a gensim model type can be added to this ensemble using the add_model function.

    If True, any topic term matrix can be supplied to add_model.

  • min_samples (int, optional) – Required number of nearby topics for a topic to be considered as ‘core’ in the CBDBSCAN clustering.

  • masking_method (function, optional) –

    Choose one of mass_masking() (default) or rank_masking() (percentile, faster).

    For clustering, distances between topic-term distributions are asymmetric. In particular, the distance (technically a divergence) from distribution A to B measures the extent to which A is contained in B. At a high level, this involves using distribution A to mask distribution B and then calculating the cosine distance between the two. The masking can be done in two ways:

    1. mass: forms mask by taking the top ranked terms until their cumulative mass reaches the ‘masking_threshold’

    2. rank: forms mask by taking the top ranked terms (by mass) until the fraction of terms given by ‘masking_threshold’ is reached. For example, a ranking threshold of 0.11 means the top 11% of terms by weight are used to form a mask.

  • masking_threshold (float, optional) – Default: None, which uses 0.95 for “mass”, and 0.11 for masking_method “rank”. In general, too small a mask threshold leads to inaccurate calculations (no signal) and too big a mask leads to noisy distance calculations. Defaults are often a good sweet spot for this hyperparameter.

  • distance_workers (int, optional) – Default is 1, which is not multiprocessed. Set to > 1 to enable multiprocessing. When distance_workers is None, os.cpu_count() is used for maximum performance.

  • **gensim_kw_args – Parameters for each gensim model (e.g. gensim.models.LdaModel) in the ensemble.
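The two masking strategies and the resulting asymmetric distance described under masking_method can be sketched in plain Python. This is an illustration of the idea only, not the library's mass_masking() or rank_masking() functions: the names `mass_mask`, `rank_mask` and `masked_cosine_distance`, the toy distributions, and the choice to always keep at least one term in the rank mask are assumptions made for this example.

```python
import math


def mass_mask(dist, threshold=0.95):
    """Keep top-ranked terms until their cumulative mass reaches `threshold`."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    mask = [False] * len(dist)
    cumulative = 0.0
    for i in order:
        if cumulative >= threshold:
            break
        mask[i] = True
        cumulative += dist[i]
    return mask


def rank_mask(dist, threshold=0.11):
    """Keep the top `threshold` fraction of terms by mass (assumed: at least one term)."""
    keep = max(1, int(len(dist) * threshold))
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    mask = [False] * len(dist)
    for i in order[:keep]:
        mask[i] = True
    return mask


def masked_cosine_distance(a, b, mask):
    """Cosine distance between `a` and `b`, restricted to the terms selected by `mask`."""
    a_m = [x for x, keep in zip(a, mask) if keep]
    b_m = [x for x, keep in zip(b, mask) if keep]
    dot = sum(x * y for x, y in zip(a_m, b_m))
    norm = math.sqrt(sum(x * x for x in a_m)) * math.sqrt(sum(y * y for y in b_m))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Toy topic-term distributions: topic_a is concentrated on two terms,
# topic_b spreads its mass evenly over all four terms.
topic_a = [0.5, 0.5, 0.0, 0.0]
topic_b = [0.25, 0.25, 0.25, 0.25]

# The mask is always built from the first ("from") distribution.
distance_a_to_b = masked_cosine_distance(topic_a, topic_b, mass_mask(topic_a))
distance_b_to_a = masked_cosine_distance(topic_b, topic_a, mass_mask(topic_b))
```

Here the distance from topic_a to topic_b is close to zero, because on the terms topic_a cares about the two distributions point in the same direction, while the distance from topic_b to topic_a is clearly larger. This is the asymmetry that the clustering has to deal with.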

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

add_model(target, num_new_models=None)

Add the topic term distribution array (ttda) of another model to the ensemble.

This way, multiple topic models can be connected to an ensemble manually. Make sure that all the models use the exact same dictionary/id2word mapping.

In order to generate new stable topics afterwards, call self.recluster().

The ttda of another ensemble can also be used. In that case, set num_new_models to the num_models parameter of that ensemble, i.e. the number of classic models in the ensemble that generated the ttda. This is important because that information is used to estimate “min_samples” for _generate_topic_clusters.

If you trained this ensemble in the past with a certain Dictionary that you want to reuse for other models, you can get it from: self.id2word.

Parameters
  • target ({see description}) –

    1. A single EnsembleLda object

    2. List of EnsembleLda objects

    3. A single Gensim topic model (e.g. gensim.models.LdaModel)

    4. List of Gensim topic models

    If memory_friendly_ttda is True, target can also be:

    5. A topic-term distribution array (ttda)

    Example: [[0.1, 0.1, 0.8], […], …]

    The layout is [topic1, topic2, …], with each topic being an array of probabilities: [token1, token2, …]

    The token probabilities in a single topic sum to one; therefore, all entries of the ttda sum to len(ttda).

  • num_new_models (integer, optional) –

    The model keeps track of how many models were used in this ensemble. Set this higher if the ttda contains topics from more than one model. Default: None, which takes care of it automatically.

    If target is a 2D-array of float values, it assumes 1.

    If the ensemble has memory_friendly_ttda set to False, then it will always use the number of models in the target parameter.
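The expected shape of a topic-term distribution array can be checked before passing it to add_model. The following is a minimal sketch in plain Python; the helper `is_valid_ttda`, its tolerance, and the toy values are assumptions made for this example and are not part of Gensim.

```python
def is_valid_ttda(ttda, tolerance=1e-9):
    """Check that every topic is a probability distribution over the same vocabulary."""
    if not ttda:
        return False
    vocabulary_size = len(ttda[0])
    return all(
        len(topic) == vocabulary_size
        and all(p >= 0 for p in topic)
        and abs(sum(topic) - 1.0) <= tolerance
        for topic in ttda
    )


# Two hypothetical topics over a three-token vocabulary.
ttda = [
    [0.5, 0.25, 0.25],     # topic 1: probabilities of token 1, token 2, token 3
    [0.125, 0.125, 0.75],  # topic 2
]

# Each row sums to one, so the whole array sums to the number of topics.
total_mass = sum(sum(topic) for topic in ttda)
```

A check like this is cheap and catches the most common mistakes, such as rows of different lengths or unnormalized topic weights.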

convert_to_memory_friendly()

Remove the stored gensim models and only keep their ttdas.

This frees up memory, but you won’t have access to the individual models anymore if you intended to use them outside of the ensemble.

generate_gensim_representation()

Create a gensim model from the stable topics.

The returned representation is a Gensim LdaModel (gensim.models.LdaModel) that has been instantiated with an a priori belief on word probability, eta, that represents the topic-term distributions of any stable topics that were found by clustering over the ensemble of topic distributions.

When no stable topics have been detected, None is returned.

Returns

A Gensim LDA Model classic_model_representation for which: classic_model_representation.get_topics() == self.get_topics()

Return type

gensim.models.LdaModel

get_topic_model_class()

Get the class that is used for gensim.models.EnsembleLda.generate_gensim_representation().

get_topics()

Return only the stable topics from the ensemble.

Returns

List of stable topic term distributions

Return type

2D numpy.ndarray of floats

property id2word

Return the gensim.corpora.dictionary.Dictionary object used in the model.

inference(*posargs, **kwargs)

See gensim.models.LdaModel.inference().

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

log_perplexity(*posargs, **kwargs)

See gensim.models.LdaModel.log_perplexity().

print_topics(*posargs, **kwargs)

See gensim.models.LdaModel.print_topics().

recluster(eps=0.1, min_samples=None, min_cores=None)

Reapply CBDBSCAN clustering and stable topic generation.

Stable topics can be retrieved using get_topics().

Parameters
  • eps (float) – epsilon for the CBDBSCAN algorithm, having the same meaning as in classic DBSCAN clustering. default: 0.1

  • min_samples (int) – The minimum number of samples in the neighborhood of a topic to be considered a core in CBDBSCAN. default: int(self.num_models / 2)

  • min_cores (int) – how many cores a cluster has to have to be treated as a stable topic, i.e. how many similar-looking topics have to be present so that their average is used as a stable topic. default: min(3, max(1, int(self.num_models / 4 + 1)))
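To see how the documented defaults scale with the ensemble size, the two formulas above can be evaluated directly. This is a small sketch in plain Python; the function names `default_min_samples` and `default_min_cores` are hypothetical helpers introduced for this example, and only the formulas themselves come from the parameter descriptions.

```python
def default_min_samples(num_models):
    """Default from the parameter description: int(num_models / 2)."""
    return int(num_models / 2)


def default_min_cores(num_models):
    """Default from the parameter description: min(3, max(1, int(num_models / 4 + 1)))."""
    return min(3, max(1, int(num_models / 4 + 1)))


# Map each ensemble size to (min_samples, min_cores).
defaults = {
    num_models: (default_min_samples(num_models), default_min_cores(num_models))
    for num_models in (1, 3, 4, 8, 16)
}
```

Under these formulas min_samples grows linearly with the number of models, while min_cores is capped at 3, so large ensembles need proportionally more nearby topics per core but never more than three cores per stable topic.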

save(*args, **kwargs)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.models.ensemblelda.Topic(is_core: bool, neighboring_labels: Set[int], neighboring_topic_indices: Set[int], label: Union[int, NoneType], num_neighboring_labels: int, valid_neighboring_labels: Set[int])

Bases: object

is_core: bool
label: Optional[int]
neighboring_labels: Set[int]
neighboring_topic_indices: Set[int]
num_neighboring_labels: int
valid_neighboring_labels: Set[int]

gensim.models.ensemblelda.mass_masking(a, threshold=None)

Original masking method. Returns a new binary mask.

gensim.models.ensemblelda.rank_masking(a, threshold=None)

Faster masking method. Returns a new binary mask.