
models.logentropy_model – LogEntropy model

This module allows a corpus in the simple bag-of-words (BoW) representation to be transformed into log entropy space. It implements the Log Entropy model, which produces an entropy-weighted logarithmic term frequency representation.

An empirical study by Lee et al. (2005) [1] suggests that the log entropy-weighted model yields better results than other forms of representation.

References

[1] Lee et al. 2005. An Empirical Evaluation of Models of Text Document Similarity. https://escholarship.org/uc/item/48g155nq
class gensim.models.logentropy_model.LogEntropyModel(corpus, normalize=True)

Bases: gensim.interfaces.TransformationABC

Objects of this class transform a word-document co-occurrence matrix (integer counts) into a locally and globally weighted matrix (positive floats).

This is done through a log entropy normalization, optionally normalizing the resulting documents to unit length. The following formulas show how to compute the log entropy weight for term i in document j:

local\_weight_{i,j} = \log(frequency_{i,j} + 1)

P_{i,j} = \frac{frequency_{i,j}}{\sum_j frequency_{i,j}}

global\_weight_i = 1 + \frac{\sum_j P_{i,j} \log(P_{i,j})}{\log(number\_of\_documents + 1)}

final\_weight_{i,j} = local\_weight_{i,j} \cdot global\_weight_i
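
The weights can be verified by hand. A minimal NumPy sketch (illustrative only, not gensim's internal implementation) that applies the formulas above to a toy term-document count matrix:

>>> import numpy as np
>>>
>>> counts = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]])  # rows = terms i, columns = documents j
>>> n_docs = counts.shape[1]
>>>
>>> local_weight = np.log(counts + 1)  # log(frequency_{i,j} + 1)
>>> p = counts / counts.sum(axis=1, keepdims=True)  # P_{i,j}
>>> plogp = np.zeros_like(p)
>>> mask = p > 0  # treat 0 * log(0) as 0
>>> plogp[mask] = p[mask] * np.log(p[mask])
>>> global_weight = 1 + plogp.sum(axis=1) / np.log(n_docs + 1)
>>> final_weight = local_weight * global_weight[:, np.newaxis]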

Examples

>>> from gensim.models import LogEntropyModel
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>>
>>> dct = Dictionary(common_texts)  # fit dictionary
>>> corpus = [dct.doc2bow(row) for row in common_texts]  # convert to BoW format
>>> model = LogEntropyModel(corpus)  # fit model
>>> vector = model[corpus[1]]  # apply model to document
Parameters:
  • corpus (iterable of iterable of (int, int)) – Input corpus in BoW format.
  • normalize (bool, optional) – If True, the resulting log entropy-weighted vector is normalized to unit length; if False, no normalization is performed (see the sketch below).
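
A short sketch of the normalize parameter and of lazy application to a whole corpus (reusing dct, corpus and model from the example above):

>>> model_raw = LogEntropyModel(corpus, normalize=False)  # keep unnormalized weights
>>> vector_raw = model_raw[corpus[1]]  # same document, without unit-length scaling
>>>
>>> corpus_logent = model[corpus]  # transform the whole corpus, lazily
>>> vectors = [vec for vec in corpus_logent]  # iterate to materialize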
initialize(corpus)

Calculates the global weighting for all terms in a given corpus and transforms the simple count representation into the log entropy normalized space.

Parameters: corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
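
initialize() is invoked by the constructor when a corpus is given, so calling it directly is rarely needed; a sketch (assuming a second call simply recomputes the global weights) of refitting on another corpus:

>>> other_corpus = [dct.doc2bow(row) for row in common_texts[:2]]  # any BoW corpus
>>> model.initialize(other_corpus)  # recompute the global entropy weights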
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When this method is called on an instance (it should be called on the class).
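
A minimal persistence round trip, assuming the fitted model from the example above and gensim's get_tmpfile test helper:

>>> from gensim.test.utils import get_tmpfile
>>>
>>> fname = get_tmpfile("logentropy.model")  # temporary file path
>>> model.save(fname)  # persist to disk
>>> loaded_model = LogEntropyModel.load(fname)  # load back via the classmethod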
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load, efficiently. If a list of str, these attributes will be stored in separate files and the automatic check is not performed.
  • sep_limit (int) – Don't store arrays smaller than this separately, in bytes.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized (stored).
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()
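
A sketch of the optional arguments (the attribute name passed to ignore is hypothetical, purely for illustration):

>>> model.save(fname, ignore=frozenset(['cached_stats']), pickle_protocol=2)  # skip the hypothetical 'cached_stats' attribute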