An indexed corpus is a mechanism for random access to corpora.
While the standard corpus interface in gensim only allows iterating over a corpus with for doc in corpus: pass, an indexed corpus also allows accessing individual documents with corpus[docno] (in O(1) look-up time).
This is achieved by writing an extra file (by default named the same as the corpus file, plus an ‘.index’ suffix) that stores the byte offset of the beginning of each document.
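The byte-offset idea can be illustrated without gensim at all. The following is a minimal sketch, not gensim's actual implementation: the helper names `write_with_offsets` and `read_at`, and the one-JSON-document-per-line format, are assumptions made up for this example.

```python
import json
import os
import tempfile


def write_with_offsets(path, docs):
    """Write one JSON-encoded document per line; return each line's starting byte offset."""
    offsets = []
    with open(path, "wb") as fout:
        for doc in docs:
            offsets.append(fout.tell())  # position where this document begins
            fout.write(json.dumps(doc).encode("utf8") + b"\n")
    return offsets


def read_at(path, offset):
    """Seek directly to `offset` and decode the single document stored there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return json.loads(fin.readline().decode("utf8"))


docs = [[(1, 0.5)], [(0, 1.0), (1, 2.0)]]
path = os.path.join(tempfile.mkdtemp(), "toy.corpus")
offsets = write_with_offsets(path, docs)
second = read_at(path, offsets[1])  # fetched without scanning the first document
```

Because the offsets are known up front, fetching a document costs one seek and one read regardless of its position in the file. Note that JSON turns the `(id, weight)` tuples into lists on the way back.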
Initialize this abstract base class by loading a previously saved index from index_fname (or fname.index if index_fname is not set). This index allows subclasses to support the corpus[docno] syntax (random access to document #`docno` in O(1)).
>>> import gensim
>>> # save corpus in SvmLightCorpus format with an index
>>> corpus = [[(1, 0.5)], [(0, 1.0), (1, 2.0)]]
>>> gensim.corpora.SvmLightCorpus.serialize('testfile.svmlight', corpus)
>>> # load back as a document stream (*not* plain Python list)
>>> corpus_with_random_access = gensim.corpora.SvmLightCorpus('testfile.svmlight')
>>> print(corpus_with_random_access[1])
[(0, 1.0), (1, 2.0)]
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save an existing corpus to disk.
Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.
>>> MmCorpus.save_corpus('file.mm', corpus)
Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). For these corpora, serialize calls save_corpus internally and saves the index at the same time, so you should store the corpus with:
>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents
Calling serialize() is preferred to calling save_corpus().
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
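The division of labour described above can be sketched with a toy class pair. This is an illustration only, not gensim's code: `ToyIndexedCorpus`, `ToyLineCorpus`, and the `id:weight` line format are hypothetical, and the sketch assumes the index is simply a pickled list of offsets.

```python
import os
import pickle
import tempfile


class ToyIndexedCorpus(object):
    """Hypothetical base class: owns the index file, knows nothing about the format."""

    def __init__(self, fname, index_fname=None):
        self.fname = fname
        with open(index_fname or fname + ".index", "rb") as fin:
            self.index = pickle.load(fin)

    @classmethod
    def serialize(cls, fname, corpus, index_fname=None):
        offsets = cls.save_corpus(fname, corpus)  # subclass writes docs, reports offsets
        with open(index_fname or fname + ".index", "wb") as fout:
            pickle.dump(list(offsets), fout)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, docno):
        return self.docbyoffset(self.index[docno])  # subclass decodes one document


class ToyLineCorpus(ToyIndexedCorpus):
    """Hypothetical format: one document per line, as 'id:weight' pairs."""

    @staticmethod
    def save_corpus(fname, corpus):
        offsets = []
        with open(fname, "wb") as fout:
            for doc in corpus:
                offsets.append(fout.tell())
                line = " ".join("%i:%r" % (i, w) for i, w in doc)
                fout.write(line.encode("ascii") + b"\n")
        return offsets

    def docbyoffset(self, offset):
        with open(self.fname, "rb") as fin:
            fin.seek(offset)
            parts = fin.readline().decode("ascii").split()
        return [(int(p.split(":")[0]), float(p.split(":")[1])) for p in parts]


fname = os.path.join(tempfile.mkdtemp(), "toy.txt")
ToyLineCorpus.serialize(fname, [[(1, 0.5)], [(0, 1.0), (1, 2.0)]])
toy = ToyLineCorpus(fname)
```

The base class never parses the corpus file itself. It only maps a document number to an offset and delegates to the subclass, which is why any format that can report offsets while writing and decode a document from an offset can gain `corpus[docno]` access.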
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.