interfaces – Core gensim interfaces

This module contains implementations of the basic interfaces used across the whole gensim package. These interfaces are used for building corpus, transformation and similarity classes.

All interfaces are realized as abstract base classes: some optional functionality is provided in the interface itself, so the interfaces are meant to be inherited from.

class gensim.interfaces.CorpusABC

Bases: gensim.utils.SaveLoad

Interface for corpus classes from gensim.corpora.

A corpus is simply an iterable object, where each iteration step yields one document:

>>> from gensim.corpora import MmCorpus  # MmCorpus inherits from CorpusABC
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> for doc in corpus:
...     pass # do something with the doc...

A document is represented in bag-of-words (BoW) format, i.e. as a list of (attr_id, attr_value) pairs, like [(1, 0.2), (4, 0.6), ...].

>>> from gensim.corpora import MmCorpus  # MmCorpus inherits from CorpusABC
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> doc = next(iter(corpus))
>>> print(doc)
[(0, 1.0), (1, 1.0), (2, 1.0)]
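
To make the format concrete, here is a minimal pure-Python sketch that builds BoW documents from tokenized text and streams them one at a time. It does not use gensim: the class name `TinyCorpus` and its `token2id` attribute are hypothetical stand-ins, and a real corpus class would inherit from CorpusABC.

```python
from collections import Counter


class TinyCorpus:
    """Hypothetical stand-in for a corpus: an iterable that yields BoW documents."""

    def __init__(self, texts):
        self.texts = texts
        # Give each distinct token an integer id, in order of first appearance.
        self.token2id = {}
        for tokens in texts:
            for token in tokens:
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Each iteration step yields one document as a sorted list of
        # (attr_id, attr_value) pairs.
        for tokens in self.texts:
            counts = Counter(self.token2id[t] for t in tokens)
            yield sorted((i, float(c)) for i, c in counts.items())


corpus = TinyCorpus([["human", "computer", "human"], ["computer", "graph"]])
docs = list(corpus)
print(docs)
```

Because the documents are produced inside `__iter__`, the corpus can be iterated more than once and never needs to hold all BoW vectors in memory at the same time.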

Remember that the save/load methods store only the corpus object itself, not the corpus data. To save and load the data, use this pattern:

>>> from gensim.corpora import MmCorpus  # MmCorpus inherits from CorpusABC
>>> from gensim.test.utils import datapath, get_tmpfile
>>>
>>> corpus = MmCorpus(datapath("testcorpus.mm"))
>>> tmp_path = get_tmpfile("temp_corpus.mm")
>>>
>>> MmCorpus.serialize(tmp_path, corpus)  # serialize the corpus to disk in MmCorpus format
>>> # MmCorpus.save_corpus(tmp_path, corpus)  # also possible, but prefer serialize() when it is available
>>> loaded_corpus = MmCorpus(tmp_path)  # load the corpus through the constructor
>>> for (doc_1, doc_2) in zip(corpus, loaded_corpus):
...     assert doc_1 == doc_2  # check that the corpora are exactly the same

See also

gensim.corpora
Corpora in different formats
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
save(*args, **kwargs)

Save the in-memory state of the corpus.

Warning

This saves only the “state” of the corpus class, not the corpus data. To save the data, use save_corpus() instead.

Parameters:
  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
static save_corpus(fname, corpus, id2word=None, metadata=False)

Save the given corpus to disk. Should be overridden in the inheriting class.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling save_corpus().

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • id2word (Dictionary, optional) – Dictionary of corpus.
  • metadata (bool, optional) – If True, will write some meta-information to fname too.
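
As a rough illustration of what an overriding implementation might do, the sketch below writes a BoW corpus to a plain text file and streams it back. The on-disk layout (one document per line as `id:value` pairs, plus a `.vocab` side file for id2word) and the function names are invented for this example; they are not a format that gensim defines.

```python
import os
import tempfile


def save_corpus_sketch(fname, corpus, id2word=None):
    """Write each BoW document as one line of 'id:value' pairs (hypothetical format)."""
    with open(fname, "w", encoding="utf8") as fout:
        for doc in corpus:
            fout.write(" ".join(f"{i}:{v}" for i, v in doc) + "\n")
    if id2word is not None:
        # Store the feature_id -> word mapping next to the data.
        with open(fname + ".vocab", "w", encoding="utf8") as fout:
            for feature_id, word in sorted(id2word.items()):
                fout.write(f"{feature_id}\t{word}\n")


def load_corpus_sketch(fname):
    """Stream the documents back from disk, one per line."""
    with open(fname, encoding="utf8") as fin:
        for line in fin:
            pairs = (item.split(":") for item in line.split())
            yield [(int(i), float(v)) for i, v in pairs]


corpus = [[(0, 1.0), (2, 3.0)], [], [(1, 0.5)]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
save_corpus_sketch(path, corpus, id2word={0: "human", 1: "graph", 2: "tree"})
restored = list(load_corpus_sketch(path))
```

Note that the empty document survives the round trip as an empty line, which keeps document positions aligned between the original and the restored corpus.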
class gensim.interfaces.SimilarityABC(corpus)

Bases: gensim.utils.SaveLoad

Interface for similarity search over a corpus.

In all instances, there is a corpus against which we want to perform the similarity search. For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.

Examples

>>> from gensim.similarities import MatrixSimilarity
>>> from gensim.test.utils import common_dictionary, common_corpus
>>>
>>> index = MatrixSimilarity(common_corpus)
>>> similarities = index.get_similarities(common_corpus[1])  # get similarities between query and corpus

Notes

There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e. the query is each corpus document in turn).

See also

gensim.similarities
Different types of indexes for similarity search.

Initialize the object. Should be overridden in the inheriting class.

Parameters: corpus (iterable of list of (int, number)) – Corpus in BoW format.
Raises: NotImplementedError – This is an abstract class, so the method must be implemented in the inheriting class.
get_similarities(doc)

Get the similarities of the corpus documents to the given doc. Should be overridden in the inheriting class.

Parameters: doc (list of (int, number)) – Document in BoW format.
Raises: NotImplementedError – This is an abstract class, so the method must be implemented in the inheriting class.
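
A minimal sketch of what an overriding class could look like, assuming cosine similarity as the measure. The class `CosineIndexSketch` is hypothetical and written in pure Python; it does not inherit from SimilarityABC and is not how gensim's own indexes are implemented.

```python
import math


class CosineIndexSketch:
    """Hypothetical index: stores the corpus and compares a query with every document."""

    def __init__(self, corpus):
        # Keep each BoW document as an {id: value} mapping for fast lookups.
        self.corpus = [dict(doc) for doc in corpus]

    @staticmethod
    def _norm(vec):
        return math.sqrt(sum(v * v for v in vec.values()))

    def get_similarities(self, doc):
        query = dict(doc)
        query_norm = self._norm(query)
        sims = []
        for other in self.corpus:
            denom = query_norm * self._norm(other)
            dot = sum(v * other.get(i, 0.0) for i, v in query.items())
            # An empty vector has no direction, so report zero similarity.
            sims.append(dot / denom if denom else 0.0)
        return sims

    def __iter__(self):
        # Convenience: use each indexed document as the query in turn.
        for doc in self.corpus:
            yield self.get_similarities(list(doc.items()))


index = CosineIndexSketch([[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(2, 1.0)]])
sims = index.get_similarities([(1, 1.0)])
all_sims = list(index)
```

Here `sims` holds one score per indexed document, and iterating over `index` yields one such list for every document in the corpus.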
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes are stored in separate files and the automatic check is not performed.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that should not be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()
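
The round trip between save() and load() can be pictured with a small pure-Python sketch. It assumes plain pickle, leaves out the separate storage of large arrays entirely, and uses hypothetical names (`IndexState`, `save_sketch`, `load_sketch`); it illustrates the roles of the ignore and pickle_protocol parameters and is not gensim's actual implementation.

```python
import os
import pickle
import tempfile


class IndexState:
    """Hypothetical object with one attribute worth storing and one scratch attribute."""

    def __init__(self):
        self.num_features = 12
        self.cache = {"tmp": 1}


def save_sketch(obj, fname, ignore=frozenset(), pickle_protocol=2):
    # Pickle only the attributes that are not listed in `ignore`.
    state = {k: v for k, v in obj.__dict__.items() if k not in ignore}
    with open(fname, "wb") as fout:
        pickle.dump(state, fout, protocol=pickle_protocol)


def load_sketch(cls, fname):
    # Rebuild an instance without calling __init__, then restore its attributes.
    with open(fname, "rb") as fin:
        state = pickle.load(fin)
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


path = os.path.join(tempfile.mkdtemp(), "index.pkl")
save_sketch(IndexState(), path, ignore=frozenset(["cache"]))
restored = load_sketch(IndexState, path)
```

After loading, `restored.num_features` is available again, while the ignored `cache` attribute is absent and would have to be rebuilt by the caller.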

class gensim.interfaces.TransformationABC

Bases: gensim.utils.SaveLoad

Transformation interface.

A ‘transformation’ is any object that accepts a document in BoW format via __getitem__ (the [] notation) and returns another sparse document in its stead:

>>> from gensim.models import LsiModel
>>> from gensim.test.utils import common_dictionary, common_corpus
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> bow_vector = model[common_corpus[0]]  # apply the model to one document via __getitem__
>>> bow_corpus = model[common_corpus]  # the model can also be applied to the whole corpus
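
The `__getitem__` convention can be sketched without gensim. The class below is a hypothetical toy transformation that rescales feature weights; the way it tells a single document from a corpus is an assumption made for this example, not the check gensim itself performs.

```python
class ScaleTransformSketch:
    """Hypothetical transformation: multiplies every feature weight by a constant."""

    def __init__(self, factor):
        self.factor = factor

    @staticmethod
    def _is_document(bow):
        # A document is a list of (id, value) tuples; anything else is
        # treated as a corpus, i.e. an iterable of documents.
        return isinstance(bow, list) and all(
            isinstance(item, tuple) and len(item) == 2 for item in bow
        )

    def __getitem__(self, bow):
        if self._is_document(bow):
            return [(i, v * self.factor) for i, v in bow]
        # Transform a corpus lazily, one document at a time.
        return (self[doc] for doc in bow)


model = ScaleTransformSketch(2.0)
vector = model[[(0, 1.0), (3, 0.5)]]
transformed = list(model[[[(0, 1.0)], [(1, 2.0)]]])
```

Applying the object to a corpus returns a generator, so no document is transformed until the result is iterated.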
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes are stored in separate files and the automatic check is not performed.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that should not be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.interfaces.TransformedCorpus(obj, corpus, chunksize=None, **kwargs)

Bases: gensim.interfaces.CorpusABC

Interface for corpora that support transformations.

Parameters:
  • obj (object) – Transformation object that is applied to each document of corpus during iteration.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • chunksize (int, optional) – If provided, documents are processed in groups of this size, which is more efficient.
  • kwargs – Arbitrary keyword arguments.
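
How these parameters could fit together is sketched below in pure Python, assuming that obj is applied through the [] notation either to single documents or to groups of chunksize documents. `LazyTransformedSketch` and `Doubler` are hypothetical names used only for this illustration.

```python
from itertools import islice


class Doubler:
    """Toy transformation that accepts one document or a list of documents."""

    def __getitem__(self, bow):
        if bow and isinstance(bow[0], list):  # a chunk of documents
            return [self[doc] for doc in bow]
        return [(i, 2 * v) for i, v in bow]


class LazyTransformedSketch:
    """Hypothetical wrapper: applies obj[...] to documents only while iterating."""

    def __init__(self, obj, corpus, chunksize=None):
        self.obj, self.corpus, self.chunksize = obj, corpus, chunksize

    def __iter__(self):
        if self.chunksize:
            # Hand the transformation a group of documents at a time.
            documents = iter(self.corpus)
            while True:
                chunk = list(islice(documents, self.chunksize))
                if not chunk:
                    break
                for transformed in self.obj[chunk]:
                    yield transformed
        else:
            for doc in self.corpus:
                yield self.obj[doc]


corpus = [[(0, 1.0)], [(1, 2.0), (2, 3.0)], [(0, 0.5)]]
one_by_one = list(LazyTransformedSketch(Doubler(), corpus))
chunked = list(LazyTransformedSketch(Doubler(), corpus, chunksize=2))
```

Both iteration modes yield the same documents; chunking only changes how many documents the transformation sees per call.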
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
save(*args, **kwargs)

Save the in-memory state of the corpus.

Warning

This saves only the “state” of the corpus class, not the corpus data. To save the data, use save_corpus() instead.

Parameters:
  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
static save_corpus(fname, corpus, id2word=None, metadata=False)

Save the given corpus to disk. Should be overridden in the inheriting class.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling save_corpus().

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • id2word (Dictionary, optional) – Dictionary of corpus.
  • metadata (bool, optional) – If True, will write some meta-information to fname too.