corpora.ucicorpus – Corpus in UCI bag-of-words format

University of California, Irvine (UCI) Bag-of-Words format.

class gensim.corpora.ucicorpus.UciCorpus(fname, fname_vocab=None)

Bases: gensim.corpora.ucicorpus.UciReader, gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in the UCI bag-of-words format.


create_dictionary()

Utility method to generate a gensim-style Dictionary directly from the corpus and vocabulary data.


docbyoffset(offset)

Return the document located at byte offset offset in the file.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), mmap must be left as None. load raises an IOError if mmap is set for a compressed file.

save(*args, **kwargs)

static save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)

Save a corpus in the UCI Bag-of-Words format.

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.

This function is automatically called by UciCorpus.serialize; don’t call it directly, call serialize instead.
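The two-file layout can be illustrated without gensim. The sketch below is not the library's implementation: write_vocab and read_vocab are hypothetical helpers, and the one-word-per-line layout (line position equals term id, ids contiguous from 0) is an assumption based on the vocabulary files distributed with the UCI data sets.

```python
import os
import tempfile


def write_vocab(fname, id2word):
    """Write one word per line to fname + '.vocab', ordered by term id.

    Hypothetical helper; assumes term ids are contiguous and start at 0.
    """
    vocab_fname = fname + ".vocab"
    with open(vocab_fname, "w", encoding="utf8") as fout:
        for term_id in sorted(id2word):
            fout.write(id2word[term_id] + "\n")
    return vocab_fname


def read_vocab(vocab_fname):
    """Rebuild the id-to-word mapping from line positions."""
    with open(vocab_fname, encoding="utf8") as fin:
        return {term_id: line.rstrip("\n") for term_id, line in enumerate(fin)}


id2word = {0: "human", 1: "interface", 2: "computer"}
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "corpus.uci")
    vocab_fname = write_vocab(fname, id2word)
    restored = read_vocab(vocab_fname)
```

Keeping the vocabulary in a separate file means the main file only needs to store integer ids and counts.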

serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or to fname.index if index_fname is not set).

This relies on the underlying corpus class serializer providing (in addition to standard iteration):

  • a save_corpus method that returns a sequence of byte offsets, one for each saved document,
  • a docbyoffset(offset) method, which returns the document positioned at offset bytes within the persistent storage (file).

If metadata is set to True, serialize also writes article titles to a pickle file.


>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('', corpus)
>>> mm = MmCorpus('') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
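
The byte-offset contract described above can be sketched in plain Python. This is only an illustration of the idea, not gensim's code: save_with_offsets, doc_by_offset, and the "id:weight" line layout are all hypothetical and chosen for brevity.

```python
import os
import tempfile


def save_with_offsets(fname, docs):
    """Write one document per line and return the byte offset of each one."""
    offsets = []
    with open(fname, "wb") as fout:
        for doc in docs:
            # Record where this document starts before writing it.
            offsets.append(fout.tell())
            line = " ".join(f"{term_id}:{weight}" for term_id, weight in doc)
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def doc_by_offset(fname, offset):
    """Seek straight to a document instead of scanning the whole file."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        line = fin.readline().decode("utf8")
    pairs = (pair.split(":") for pair in line.split())
    return [(int(term_id), float(weight)) for term_id, weight in pairs]


docs = [[(0, 1.0), (3, 2.0)], [(1, 4.0)], [(2, 0.5), (4, 1.5)]]
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "corpus.txt")
    index = save_with_offsets(fname, docs)   # plays the role of the saved index
    second = doc_by_offset(fname, index[1])  # random access to document no. 1
```

Storing the list of offsets alongside the corpus is what turns a sequential document stream into one that supports lookups by document number.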
class gensim.corpora.ucicorpus.UciReader(input)

Bases: gensim.matutils.MmReader

Initialize the reader.

The input parameter refers to a file on the local filesystem, which is expected to be in the UCI Bag-of-Words format.


docbyoffset(offset)

Return the document located at byte offset offset in the file.

class gensim.corpora.ucicorpus.UciWriter(fname)

Bases: gensim.matutils.MmWriter

Store a corpus in UCI Bag-of-Words format.

This corpus format is identical to the Matrix Market (MM) format except for the file header: there is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.
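
As a rough illustration of this layout, the following sketch writes and parses a tiny corpus using only the standard library. It is not the UciWriter implementation: write_uci and read_uci are hypothetical helpers, and the 1-based "docID wordID count" body lines are an assumption taken from the description of the UCI Bag of Words data sets.

```python
import io


def write_uci(corpus, num_terms):
    """Render a list of sparse documents as UCI bag-of-words text."""
    body = []
    for docno, doc in enumerate(corpus):
        for term_id, count in sorted(doc):
            # Assumption: ids in the file are 1-based.
            body.append(f"{docno + 1} {term_id + 1} {count}")
    # Three header lines: num_docs, num_terms, num_nnz.
    header = [str(len(corpus)), str(num_terms), str(len(body))]
    return "\n".join(header + body) + "\n"


def read_uci(text):
    """Parse UCI bag-of-words text back into 0-based sparse documents."""
    lines = io.StringIO(text)
    num_docs = int(next(lines))
    num_terms = int(next(lines))
    num_nnz = int(next(lines))
    docs = [[] for _ in range(num_docs)]
    for line in lines:
        docno, term_id, count = line.split()
        docs[int(docno) - 1].append((int(term_id) - 1, int(count)))
    return num_docs, num_terms, num_nnz, docs


corpus = [[(0, 2), (2, 1)], [(1, 3)]]
text = write_uci(corpus, num_terms=3)
parsed = read_uci(text)
```

Note that num_nnz is only known after the whole corpus has been seen, which is why a streaming writer has to reserve the header lines first and fill them in afterwards.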

This implementation is based on matutils.MmWriter, and works the same way.

HEADER_LINE = '%%MatrixMarket matrix coordinate real general\n'
fake_headers(num_docs, num_terms, num_nnz)
update_headers(num_docs, num_terms, num_nnz)

Update headers with actual values.

static write_corpus(fname, corpus, progress_cnt=1000, index=False)

write_headers()

Write blank header lines. These are updated later, once the corpus statistics are known.

write_vector(docno, vector)

Write a single sparse vector to the file.

A sparse vector is any iterable yielding (field id, field value) pairs.
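
To make "any iterable" concrete, the sketch below passes a list, a generator, and a dictionary view through the same formatting function. format_vector is a hypothetical stand-in used for illustration only; it does not reproduce what write_vector actually writes.

```python
from collections import Counter


def format_vector(docno, vector):
    """Render one sparse vector as 'docno field_id value' lines (hypothetical)."""
    return [f"{docno} {field_id} {value}" for field_id, value in sorted(vector)]


# Three different iterables describing the same sparse vector.
as_list = [(2, 1), (0, 3)]
as_generator = ((i, v) for i, v in enumerate([3, 0, 1]) if v != 0)
as_counter = Counter([0, 0, 0, 2]).items()

lines = [format_vector(0, vec) for vec in (as_list, as_generator, as_counter)]
```

All three inputs produce the same lines, since only the (field id, field value) pairs matter and zero entries are simply left out.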