corpora.indexedcorpus – Random access to corpus documents

An indexed corpus provides random access to the documents of a corpus.

While the standard corpus interface in gensim only allows iterating over a corpus with for doc in corpus: pass, an indexed corpus also allows accessing individual documents with corpus[docno] (in O(1) look-up time).

This is achieved by keeping an extra file (by default named the same as the corpus file, plus an ‘.index’ suffix) that stores the byte offset of the beginning of each document.
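
The byte-offset idea can be illustrated with plain Python file operations. This is only a minimal sketch of the general technique, not gensim's implementation; the file name, the newline-delimited record format and the helper fetch are all hypothetical.

```python
import os
import tempfile

# Hypothetical toy data: one record (document) per line.
records = [b"first document\n", b"second, longer document\n", b"third\n"]
path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")

# Write the records, noting the byte offset at which each one begins.
offsets = []
with open(path, "wb") as fout:
    for record in records:
        offsets.append(fout.tell())
        fout.write(record)


def fetch(docno):
    """Jump straight to record `docno` without reading the earlier ones."""
    with open(path, "rb") as fin:
        fin.seek(offsets[docno])
        return fin.readline()
```

Here fetch(2) seeks directly to the stored offset of the third record, so the cost of a look-up does not depend on how many records precede it.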

class gensim.corpora.indexedcorpus.IndexedCorpus(fname, index_fname=None)

Bases: gensim.interfaces.CorpusABC

Initialize this abstract base class by loading a previously saved index from index_fname (or fname.index if index_fname is not set). This index allows subclasses to support the corpus[docno] syntax (random access to document #`docno` in O(1)).

>>> import gensim
>>> # save corpus in SvmLightCorpus format with an index
>>> corpus = [[(1, 0.5)], [(0, 1.0), (1, 2.0)]]
>>> gensim.corpora.SvmLightCorpus.serialize('testfile.svmlight', corpus)
>>> # load back as a document stream (*not* plain Python list)
>>> corpus_with_random_access = gensim.corpora.SvmLightCorpus('testfile.svmlight')
>>> print(corpus_with_random_access[1])
[(0, 1.0), (1, 2.0)]
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. By default, mmap is not used and large arrays are loaded as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set; otherwise load raises an IOError.

save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). For these corpora, serialize calls save_corpus internally and saves the index at the same time, so you should store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().

classmethod serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).

This relies on the underlying corpus class serializer providing (in addition to standard iteration):

  • a save_corpus method that returns a sequence of byte offsets, one for each saved document,
  • a docbyoffset(offset) method, which returns the document positioned at offset bytes within the persistent storage (file).

If metadata is set to True, serialize also writes out article titles to a pickle file.

Example:

>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
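
The contract that serialize relies on (a save_corpus that returns byte offsets, plus a docbyoffset method) can be sketched with a toy class. This is a hedged illustration in plain Python, not gensim code: ToyLineCorpus, its one-document-per-line 'id:weight' format and the pickled offsets list are all assumptions made for the example.

```python
import os
import pickle
import tempfile


class ToyLineCorpus:
    """Hypothetical corpus format: one document per line, as 'id:weight' pairs."""

    def __init__(self, fname, index_fname=None):
        self.fname = fname
        # Load the saved offsets index, defaulting to fname + '.index'.
        with open(index_fname or fname + ".index", "rb") as fin:
            self.index = pickle.load(fin)

    @staticmethod
    def save_corpus(fname, corpus):
        """Write the documents and return the byte offset of each one."""
        offsets = []
        with open(fname, "wb") as fout:
            for doc in corpus:
                offsets.append(fout.tell())
                pairs = " ".join("%d:%r" % (i, w) for i, w in doc)
                fout.write(pairs.encode("utf8") + b"\n")
        return offsets

    @classmethod
    def serialize(cls, fname, corpus, index_fname=None):
        """Call save_corpus, then persist the offsets index next to the data."""
        offsets = cls.save_corpus(fname, corpus)
        with open(index_fname or fname + ".index", "wb") as fout:
            pickle.dump(offsets, fout)

    def docbyoffset(self, offset):
        """Return the document that starts `offset` bytes into the file."""
        with open(self.fname, "rb") as fin:
            fin.seek(offset)
            fields = fin.readline().decode("utf8").split()
        return [(int(i), float(w)) for i, w in (f.split(":") for f in fields)]

    def __getitem__(self, docno):
        # Random access: look up the offset, then read just that document.
        return self.docbyoffset(self.index[docno])

    def __len__(self):
        return len(self.index)


fname = os.path.join(tempfile.mkdtemp(), "toy.txt")
ToyLineCorpus.serialize(fname, [[(1, 0.5)], [(0, 1.0), (1, 2.0)]])
toy = ToyLineCorpus(fname)
```

Calling save_corpus alone would write the documents but leave no index behind, which is why serialize is the call to use when random access is wanted.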