interfaces – Core gensim interfaces
Basic interfaces used across the whole Gensim package.
These interfaces are used for building corpora, model transformation and similarity queries.
The interfaces are realized as abstract base classes. This means some functionality is already provided in the interface itself, and subclasses should inherit from these interfaces and implement the missing methods.
- class gensim.interfaces.CorpusABC
Bases: SaveLoad
Interface for corpus classes from gensim.corpora.
A corpus is simply an iterable object, where each iteration step yields one document:
>>> from gensim.corpora import MmCorpus  # inherits from the CorpusABC class
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> for doc in corpus:
...     pass  # do something with the doc...
A document is represented in the bag-of-words (BoW) format, i.e. a list of (attr_id, attr_value) pairs, like [(1, 0.2), (4, 0.6), ...].
>>> from gensim.corpora import MmCorpus  # inherits from the CorpusABC class
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> doc = next(iter(corpus))
>>> print(doc)
[(0, 1.0), (1, 1.0), (2, 1.0)]
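To make the "iterable of BoW documents" contract concrete, here is a minimal dependency-free sketch. `LineCorpus` and its `id:value` text format are hypothetical and are not part of Gensim; a real implementation would typically subclass `CorpusABC`. The point is only that a corpus streams one document per iteration step and can be iterated more than once.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical streamed corpus: one document per line, stored as 'id:value' pairs."""

    def __init__(self, path):
        self.path = path  # only the path is kept in memory, not the documents

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                pairs = (item.split(":") for item in line.split())
                yield [(int(attr_id), float(value)) for attr_id, value in pairs]


# Write a tiny two-document corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("0:1.0 1:1.0 2:1.0\n1:0.2 4:0.6\n")

corpus = LineCorpus(tmp.name)
docs = list(corpus)        # first pass over the stream
docs_again = list(corpus)  # a second pass yields the same documents
os.remove(tmp.name)
```

Each yielded document is a list of `(attr_id, attr_value)` tuples, matching the BoW format described above.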
Remember that the save/load methods only pickle the corpus object, not the (streamed) corpus data itself! To save the corpus data, please use this pattern:
>>> from gensim.corpora import MmCorpus  # MmCorpus inherits from CorpusABC
>>> from gensim.test.utils import datapath, get_tmpfile
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> tmp_path = get_tmpfile("temp_corpus.mm")
>>>
>>> MmCorpus.serialize(tmp_path, corpus)  # serialize corpus to disk in the MmCorpus format
>>> loaded_corpus = MmCorpus(tmp_path)  # load corpus through constructor
>>> for (doc_1, doc_2) in zip(corpus, loaded_corpus):
...     assert doc_1 == doc_2  # no change between the original and loaded corpus
See also
gensim.corpora
Corpora in different formats.
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
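The shape of a recorded event can be sketched without Gensim. `EventLog` below is a hypothetical stand-in that mimics the behaviour described above; it is not the library's implementation. It omits the `gensim` version key (the sketch does not import Gensim) and does no logging.

```python
import copy
import platform
import sys
from datetime import datetime


class EventLog:
    """Hypothetical stand-in for an object that records lifecycle events."""

    def __init__(self):
        self.lifecycle_events = []

    def add_lifecycle_event(self, event_name, **event):
        record = copy.deepcopy(event)  # caller-supplied key-values
        # Keys added automatically, so the caller does not have to pass them.
        record["datetime"] = datetime.now().isoformat()
        record["python"] = sys.version
        record["platform"] = platform.platform()
        record["event"] = event_name
        # Recording is skipped once lifecycle_events has been set to None.
        if self.lifecycle_events is not None:
            self.lifecycle_events.append(record)
        return record


obj = EventLog()
created = obj.add_lifecycle_event("created", msg="built from 9 documents")
recorded = len(obj.lifecycle_events)

obj.lifecycle_events = None        # disable recording
obj.add_lifecycle_event("stored")  # no longer recorded
```

Keeping the extra key-values to plain strings and numbers, as in `msg` here, is what keeps the event JSON-serializable.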
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(*args, **kwargs)
Saves the in-memory state of the corpus (pickles the object).
Warning
This saves only the “internal state” of the corpus object, not the corpus data!
To save the corpus data, use the serialize method of your desired output format instead, e.g. gensim.corpora.mmcorpus.MmCorpus.serialize().
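The warning is easy to see in a small sketch. `PathCorpus` is a hypothetical streamed corpus (not a Gensim class) whose only instance attribute is a file path. Default pickling stores an object's instance attributes, so the pickled state contains the path and none of the documents.

```python
import os
import pickle
import tempfile


class PathCorpus:
    """Hypothetical streamed corpus: the documents live on disk, only the path is in memory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                pairs = (item.split(":") for item in line.split())
                yield [(int(attr_id), float(value)) for attr_id, value in pairs]


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "docs.txt")
    with open(data_path, "w", encoding="utf-8") as handle:
        handle.write("0:1.0 2:3.0\n1:0.5\n")

    corpus = PathCorpus(data_path)
    docs = list(corpus)  # the actual corpus data, streamed from disk

    # Round-trip the instance attributes through pickle: only the path survives.
    state = pickle.loads(pickle.dumps(vars(corpus)))
```

If the data file is later moved or deleted, an object restored from such a pickle has nothing to iterate over, which is why the corpus data needs its own serialization step.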
- static save_corpus(fname, corpus, id2word=None, metadata=False)
Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).
In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.
Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?
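As an illustration of what a format-specific `save_corpus()` might look like, here is a sketch for a made-up plain-text format. `PlainTextCorpus`, the `id:value` line layout and the `.vocab` side file are all assumptions for this example, and a plain `dict` stands in for a Gensim `Dictionary`. It does not build an index, so it corresponds to the non-indexed case described in the notes above.

```python
import os
import tempfile


class PlainTextCorpus:
    """Hypothetical corpus format: one document per line as 'id:value' pairs."""

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        # `metadata` is accepted to mirror the documented signature but ignored in this sketch.
        with open(fname, "w", encoding="utf-8") as out:
            for doc in corpus:
                out.write(" ".join(f"{attr_id}:{value}" for attr_id, value in doc) + "\n")
        if id2word is not None:
            # Store the feature_id -> word mapping next to the data file.
            with open(fname + ".vocab", "w", encoding="utf-8") as out:
                for attr_id in sorted(id2word):
                    out.write(f"{attr_id}\t{id2word[attr_id]}\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, "corpus.txt")
    PlainTextCorpus.save_corpus(
        fname,
        [[(0, 1.0), (1, 2.0)], [(1, 0.5)]],
        id2word={0: "human", 1: "computer"},
    )
    with open(fname, encoding="utf-8") as handle:
        data_lines = handle.read().splitlines()
    with open(fname + ".vocab", encoding="utf-8") as handle:
        vocab_lines = handle.read().splitlines()
```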
- class gensim.interfaces.SimilarityABC(corpus)
Bases: SaveLoad
Interface for similarity search over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search. For each similarity search, the input is a document or a corpus, and the output is the similarities to the individual corpus documents.
Examples
>>> from gensim.similarities import MatrixSimilarity
>>> from gensim.test.utils import common_corpus
>>>
>>> index = MatrixSimilarity(common_corpus)
>>> similarities = index.get_similarities(common_corpus[1])  # get similarities between query and corpus
Notes
There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e. the query is each corpus document in turn).
See also
gensim.similarities
Different index implementations of this interface.
- Parameters
corpus (iterable of list of (int, number)) – Corpus in sparse Gensim bag-of-words format.
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- get_similarities(doc)
Get similarities of the given document or corpus against this index.
- Parameters
doc ({list of (int, number), iterable of list of (int, number)}) – Document in the sparse Gensim bag-of-words format, or a streamed corpus of such documents.
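A rough, dependency-free sketch of what an index behind this interface could do. `TinyCosineIndex` is hypothetical and uses cosine similarity as an assumed measure; real Gensim indexes return array types and also accept a whole corpus as the query, which this sketch does not handle. The `__iter__` method mirrors the convenience wrapper mentioned in the class notes, where each indexed document in turn serves as the query.

```python
import math


class TinyCosineIndex:
    """Hypothetical similarity index: cosine similarity over sparse BoW documents."""

    def __init__(self, corpus):
        # Store every corpus document as a unit-length sparse vector.
        self.index = [self._unit(doc) for doc in corpus]

    @staticmethod
    def _unit(doc):
        vec = dict(doc)
        norm = math.sqrt(sum(value * value for value in vec.values()))
        return {attr_id: value / norm for attr_id, value in vec.items()} if norm else {}

    def get_similarities(self, doc):
        # Dot product of unit vectors = cosine similarity, one score per indexed document.
        query = self._unit(doc)
        return [
            sum(weight * indexed.get(attr_id, 0.0) for attr_id, weight in query.items())
            for indexed in self.index
        ]

    def __iter__(self):
        # Each indexed document in turn is used as the query.
        for indexed in self.index:
            yield self.get_similarities(list(indexed.items()))


corpus = [[(0, 1.0)], [(0, 1.0), (1, 1.0)], [(1, 2.0)]]
index = TinyCosineIndex(corpus)
sims = index.get_similarities([(0, 3.0)])  # one similarity per corpus document
all_sims = list(index)                     # corpus-vs-corpus similarities
```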
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- class gensim.interfaces.TransformationABC
Bases: SaveLoad
Transformation interface.
A ‘transformation’ is any object which accepts a document in BoW format via __getitem__ (the [] notation) and returns another sparse document in its stead:
>>> from gensim.models import LsiModel
>>> from gensim.test.utils import common_dictionary, common_corpus
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> bow_vector = model[common_corpus[0]]  # model applied through __getitem__ on one document from corpus.
>>> bow_corpus = model[common_corpus]  # also, we can apply model on the full corpus
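The `[]` protocol itself needs no trained model to demonstrate. The sketch below uses a hypothetical `BinarizeTransform` (not a Gensim class) that maps every non-zero weight to 1.0. For simplicity it materializes a transformed corpus as a list, and its document-versus-corpus check is a simplified assumption rather than Gensim's own detection logic.

```python
def _is_single_document(items):
    # A document's items are (id, value) tuples; a corpus's items are whole documents (lists).
    return all(isinstance(item, tuple) and len(item) == 2 for item in items)


class BinarizeTransform:
    """Hypothetical transformation: replaces every non-zero weight with 1.0."""

    def __getitem__(self, bow):
        items = list(bow)
        if _is_single_document(items):
            # One sparse document in, another sparse document out.
            return [(attr_id, 1.0) for attr_id, value in items if value != 0]
        # Otherwise treat the input as a corpus and transform each document.
        return [self[doc] for doc in items]


model = BinarizeTransform()
one_doc = model[[(0, 2.0), (3, 0.0), (5, 0.5)]]
whole_corpus = model[[[(0, 2.0)], [(1, 3.0), (2, 0.0)]]]
```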
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
- class gensim.interfaces.TransformedCorpus(obj, corpus, chunksize=None, **kwargs)
Bases: CorpusABC
Interface for corpora that are the result of an online (streamed) transformation.
- Parameters
obj (object) – A transformation (TransformationABC) object that will be applied to each document from corpus during iteration.
corpus (iterable of list of (int, number)) – Corpus in bag-of-words format.
chunksize (int, optional) – If provided, slightly more efficient processing will be performed by grouping documents from corpus.
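The idea of applying a transformation lazily during iteration, optionally in chunks, can be sketched as follows. `LazyTransformedCorpus` and `DoubleWeights` are hypothetical names for this illustration and do not reproduce Gensim's implementation; in particular, how a real transformation benefits from receiving a chunk is model-specific.

```python
from itertools import islice


class DoubleWeights:
    """Hypothetical transformation that doubles every weight."""

    def __getitem__(self, bow):
        if bow and isinstance(bow[0], list):  # a chunk, i.e. a list of documents
            return [self[doc] for doc in bow]
        return [(attr_id, 2 * value) for attr_id, value in bow]


class LazyTransformedCorpus:
    """Hypothetical wrapper: applies `obj` to documents only while iterating."""

    def __init__(self, obj, corpus, chunksize=None):
        self.obj, self.corpus, self.chunksize = obj, corpus, chunksize

    def __iter__(self):
        if self.chunksize:
            docs = iter(self.corpus)
            while True:
                # Group up to `chunksize` documents and transform them in one call.
                chunk = list(islice(docs, self.chunksize))
                if not chunk:
                    break
                yield from self.obj[chunk]
        else:
            for doc in self.corpus:
                yield self.obj[doc]


corpus = [[(0, 1.0)], [(1, 0.5), (2, 2.0)], [(0, 3.0)]]
streamed = list(LazyTransformedCorpus(DoubleWeights(), corpus))
chunked = list(LazyTransformedCorpus(DoubleWeights(), corpus, chunksize=2))
```

Both iteration modes yield the same documents; chunking only changes how many documents are handed to the transformation per call.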
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(*args, **kwargs)
Saves the in-memory state of the corpus (pickles the object).
Warning
This saves only the “internal state” of the corpus object, not the corpus data!
To save the corpus data, use the serialize method of your desired output format instead, e.g. gensim.corpora.mmcorpus.MmCorpus.serialize().
- static save_corpus(fname, corpus, id2word=None, metadata=False)
Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).
In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.
Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?