interfaces – Core gensim interfaces

This module contains basic interfaces used throughout the whole gensim package.

The interfaces are realized as abstract base classes (i.e., some optional functionality is provided in the interface itself, so that the interfaces can be subclassed).

class gensim.interfaces.CorpusABC

Bases: gensim.utils.SaveLoad

Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:

>>> for doc in corpus:
...     pass  # do something with the doc...

A document is a sequence of (fieldId, fieldValue) 2-tuples:

>>> for attr_id, attr_value in doc:
...     pass  # do something with the attribute

Note that although a default len() method is provided, it is very inefficient (performs a linear scan through the corpus to determine its length). Wherever the corpus size is needed and known in advance (or at least doesn’t change so that it can be cached), the len() method should be overridden.
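The corpus contract described above (an iterable of documents, each a sequence of 2-tuples, with an efficient len()) can be sketched in plain Python. This is a minimal illustration only: the class name ListBackedCorpus and the helper scan_length are hypothetical, and the class deliberately does not subclass gensim.interfaces.CorpusABC so that the snippet runs without gensim installed.

```python
class ListBackedCorpus:
    """Hypothetical corpus: yields one sparse document per iteration step."""

    def __init__(self, documents):
        # Each document is a list of (field_id, field_value) 2-tuples.
        self._documents = list(documents)

    def __iter__(self):
        for doc in self._documents:
            yield doc

    def __len__(self):
        # The size is known up front, so no linear scan is needed.
        return len(self._documents)


def scan_length(corpus):
    """Count documents by iterating, like the slow default len() described above."""
    return sum(1 for _ in corpus)


corpus = ListBackedCorpus([[(0, 1.0), (2, 3.0)], [(1, 2.0)], []])
for doc in corpus:
    for attr_id, attr_value in doc:
        print(attr_id, attr_value)
```

Both len(corpus) and scan_length(corpus) give the same count here; the difference is that the overridden len() does not have to walk the whole stream.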

See the gensim.corpora.svmlightcorpus module for an example of a corpus.

Saving the corpus with the save method (inherited from utils.SaveLoad) will only store the in-memory (binary, pickled) object representation, i.e. the stream state, and not the documents themselves. See the save_corpus static method for serializing the actual stream content.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load raises an IOError if a compressed file is loaded with mmap set.

save(*args, **kwargs)
static save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). For these corpora, serialize calls save_corpus internally and saves the index at the same time, so store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().
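To illustrate why an index of document start positions gives O(1) access, here is a minimal sketch in plain Python. It is not gensim's implementation and does not use the Matrix Market format: the function names serialize_with_index and read_document and the one-document-per-line text layout are assumptions made for this example.

```python
import os
import tempfile


def serialize_with_index(path, corpus):
    """Write one document per line and return the byte offset of each document."""
    offsets = []
    with open(path, "wb") as fout:
        for doc in corpus:
            offsets.append(fout.tell())  # where this document begins
            line = " ".join(f"{field_id}:{value}" for field_id, value in doc)
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def read_document(path, offsets, docno):
    """Seek straight to document `docno` instead of scanning the file."""
    with open(path, "rb") as fin:
        fin.seek(offsets[docno])
        line = fin.readline().decode("utf8").strip()
    doc = []
    for pair in line.split():
        field_id, value = pair.split(":")
        doc.append((int(field_id), float(value)))
    return doc


corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)], [(0, 0.5), (1, 0.5)]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
offsets = serialize_with_index(path, corpus)
third = read_document(path, offsets, 2)
```

Without the offsets list, fetching the third document would mean reading and discarding the first two; with it, a single seek is enough.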

class gensim.interfaces.SimilarityABC(corpus)

Bases: gensim.utils.SaveLoad

Abstract interface for similarity searches over a corpus.

In all instances, there is a corpus against which we want to perform the similarity search.

For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.

Similarity queries are realized by calling self[query_document].

There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
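The query protocol described above can be sketched as follows. This is an illustrative stand-in written in plain Python, not a subclass of SimilarityABC: the class name CosineSimilarityIndex and the choice of cosine similarity are assumptions for the example, and the corpus is held in memory for simplicity.

```python
import math


class CosineSimilarityIndex:
    """Hypothetical index: cosine similarity of a query against each corpus document."""

    def __init__(self, corpus):
        # Store each sparse document as a {field_id: value} dict.
        self.corpus = [dict(doc) for doc in corpus]

    @staticmethod
    def _norm(vec):
        return math.sqrt(sum(value * value for value in vec.values()))

    def get_similarities(self, doc):
        query = dict(doc)
        query_norm = self._norm(query)
        sims = []
        for other in self.corpus:
            denom = query_norm * self._norm(other)
            dot = sum(value * other.get(field_id, 0.0) for field_id, value in query.items())
            sims.append(dot / denom if denom else 0.0)
        return sims

    def __getitem__(self, query_document):
        # Similarity queries go through the [] notation.
        return self.get_similarities(query_document)

    def __iter__(self):
        # Convenience wrapper: each corpus document in turn is the query.
        for doc in self.corpus:
            yield self[doc.items()]


index = CosineSimilarityIndex([[(0, 1.0)], [(1, 1.0)], [(0, 1.0), (1, 1.0)]])
sims = index[[(0, 2.0)]]      # one similarity per corpus document
all_sims = list(index)        # every corpus document queried against the corpus
```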

get_similarities(doc)
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load raises an IOError if a compressed file is loaded with mmap set.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If fname_or_handle is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
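The effect of the ignore parameter (skipped attributes come back as None after loading) can be illustrated with a small standalone sketch built on the standard pickle module. The names Model, save_state and load_state are hypothetical and this is not how gensim.utils.SaveLoad is implemented; it only mirrors the behaviour described above.

```python
import io
import pickle


class Model:
    """Hypothetical object with one attribute that should not be persisted."""

    def __init__(self):
        self.weights = [0.1, 0.2, 0.3]
        self.log_handle = io.StringIO()  # stands in for an open file handle


def save_state(obj, handle, ignore=frozenset(), pickle_protocol=2):
    """Pickle all attributes except those named in `ignore`."""
    state = {name: value for name, value in vars(obj).items() if name not in ignore}
    payload = {"state": state, "ignored": sorted(ignore)}
    pickle.dump(payload, handle, protocol=pickle_protocol)


def load_state(obj, handle):
    """Restore attributes; the ignored ones are set to None."""
    payload = pickle.load(handle)
    vars(obj).update(payload["state"])
    for name in payload["ignored"]:
        setattr(obj, name, None)
    return obj


buffer = io.BytesIO()
save_state(Model(), buffer, ignore=frozenset(["log_handle"]))
buffer.seek(0)
restored = load_state(Model.__new__(Model), buffer)
```

After the round trip, restored.weights holds the saved list while restored.log_handle is None, so the caller has to reopen or rebuild any ignored resources.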

class gensim.interfaces.TransformationABC

Bases: gensim.utils.SaveLoad

Interface for transformations. A ‘transformation’ is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead:

>>> transformed_doc = transformation[doc]

or also:

>>> transformed_corpus = transformation[corpus]

See the gensim.models.tfidfmodel module for an example of a transformation.
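A transformation that accepts either a single document or a whole corpus through [] can be sketched as below. The names ScaleTransformation and LazyTransformedCorpus are hypothetical, the classes do not subclass TransformationABC, and the rule used to tell a document from a corpus (a list whose items are all 2-tuples is a document) is an assumption of this sketch that expects list inputs.

```python
class LazyTransformedCorpus:
    """Hypothetical wrapper: transforms documents one at a time during iteration."""

    def __init__(self, transformation, corpus):
        self.transformation = transformation
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transformation[doc]


class ScaleTransformation:
    """Hypothetical transformation: multiplies every field value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, bow):
        # Assumption for this sketch: a document is a list of 2-tuples
        # (an empty list counts as an empty document); anything else is a corpus.
        is_document = all(isinstance(item, tuple) and len(item) == 2 for item in bow)
        if is_document:
            return [(field_id, value * self.factor) for field_id, value in bow]
        return LazyTransformedCorpus(self, bow)


transformation = ScaleTransformation(2.0)
transformed_doc = transformation[[(0, 1.0), (3, 0.5)]]
transformed_corpus = transformation[[[(0, 1.0)], []]]
```

Returning a lazy wrapper for the corpus case means no document is transformed until the result is iterated, which keeps memory use independent of corpus size.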

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load raises an IOError if a compressed file is loaded with mmap set.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If fname_or_handle is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.interfaces.TransformedCorpus(obj, corpus, chunksize=None, **kwargs)

Bases: gensim.interfaces.CorpusABC

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load raises an IOError if a compressed file is loaded with mmap set.

save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). For these corpora, serialize calls save_corpus internally and saves the index at the same time, so store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().