corpora.mmcorpus – Corpus in Matrix Market format

Corpus in the Matrix Market format.

class gensim.corpora.mmcorpus.MmCorpus(fname)

Bases: gensim.corpora._mmreader.MmReader, gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in Matrix Market format.

Wraps a term-document matrix stored on disk in Matrix Market format and presents it as an object that supports iteration over its rows (documents).


num_docs (int) – Number of documents in the Matrix Market file.

num_terms (int) – Number of terms.

num_nnz (int) – Number of non-zero entries.


Note that the file is read into memory one document at a time, not the whole matrix at once (unlike scipy.io.mmread()). This makes it possible to process corpora that are larger than the available RAM.
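To illustrate the one-document-at-a-time reading described above, here is a minimal standard-library sketch. It is not gensim's implementation: the function name `iter_mm_documents` and the sample data are hypothetical, and the sketch assumes a coordinate-format file whose rows are documents, with entries sorted by document id.

```python
import io

# Hypothetical sample: 3 documents, 3 terms, 4 non-zero entries (document 2 is empty).
SAMPLE_MM = """%%MatrixMarket matrix coordinate real general
3 3 4
1 1 1.0
1 3 2.0
3 2 0.5
3 3 1.5
"""

def iter_mm_documents(lines):
    """Yield (doc_id, bow) pairs one document at a time from Matrix Market lines."""
    lines = iter(lines)
    # Skip the banner and any comment lines (they start with '%').
    for line in lines:
        if not line.startswith("%"):
            break
    else:
        return  # no size line found: nothing to yield
    # The first non-comment line holds: num_docs num_terms num_nnz.
    num_docs, _num_terms, _num_nnz = (int(x) for x in line.split())

    current_id, current_doc = 0, []
    for line in lines:
        if not line.strip():
            continue
        doc_field, term_field, value_field = line.split()
        doc_id, term_id = int(doc_field) - 1, int(term_field) - 1  # indices on disk are 1-based
        while current_id < doc_id:
            # Emit the finished document, plus empty ones for ids without entries.
            yield current_id, current_doc
            current_id, current_doc = current_id + 1, []
        current_doc.append((term_id, float(value_field)))
    # Emit whatever remains, including trailing empty documents.
    while current_id < num_docs:
        yield current_id, current_doc
        current_id, current_doc = current_id + 1, []

for doc_id, bow in iter_mm_documents(io.StringIO(SAMPLE_MM)):
    print(doc_id, bow)
```

Only the entries of the current document are held in memory at any point, which is the property that lets a streamed corpus exceed available RAM.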


>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>> import gensim.downloader as api
>>> corpus = MmCorpus(datapath(''))
>>> for document in corpus:
...     pass
Parameters:fname ({str, file-like object}) – Path to file in MM format or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
docbyoffset(self, offset)

Get document at file offset offset (in bytes).

Parameters:offset (int) – Offset, in bytes, of desired document.
Returns:Document in BoW format.
Return type:list of (int, number)
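The idea of fetching a document by byte offset can be sketched with the standard library alone. This is an illustration under stated assumptions, not gensim's code: `doc_at_offset` is a hypothetical helper, the stream is opened in binary mode, and each document's entries are stored contiguously.

```python
import io

def doc_at_offset(stream, offset):
    """Read the single document whose first entry starts at byte `offset`."""
    stream.seek(offset)
    doc_id, bow = None, []
    for raw in stream:
        fields = raw.decode("ascii").split()
        if not fields:
            continue
        row, term_id, value = int(fields[0]), int(fields[1]) - 1, float(fields[2])
        if doc_id is None:
            doc_id = row          # remember which document we started in
        elif row != doc_id:
            break                 # the next document has begun; stop reading
        bow.append((term_id, value))
    return bow

# Hypothetical file contents, built in memory so the offsets are known exactly.
header = b"%%MatrixMarket matrix coordinate real general\n2 3 3\n"
doc0 = b"1 1 1.0\n1 3 2.0\n"
doc1 = b"2 2 0.5\n"
stream = io.BytesIO(header + doc0 + doc1)
offsets = [len(header), len(header) + len(doc0)]

print(doc_at_offset(stream, offsets[1]))
print(doc_at_offset(stream, offsets[0]))
```

Because the reader seeks straight to the offset, the cost of a lookup does not depend on how many documents precede it in the file.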

input – object

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).

num_docs – ‘int’


num_nnz – ‘int’


num_terms – ‘int’

save(*args, **kwargs)

Saves corpus in-memory state.


This saves only the “state” of the corpus class, not the corpus data. To save the data, use save_corpus() instead.

  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
static save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)

Save a corpus in the Matrix Market format to disk.

  • fname (str) – Path to file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – WordId -> Word.
  • progress_cnt (int, optional) – Progress counter.
  • metadata (bool, optional) – If true, writes out additional metadata.


This function is automatically called by MmCorpus.serialize; don’t call it directly, call serialize instead.


>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>> import gensim.downloader as api
>>> corpus = MmCorpus(datapath(''))
>>> MmCorpus.save_corpus("random", corpus)  # Discouraged: use `serialize` instead.
[97, 121, 169, 201, 225, 249, 258, 276, 303]
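The list printed above holds byte offsets, one per document. The following standard-library sketch shows how a writer could produce such offsets while emitting Matrix Market lines. It is an assumption-laden illustration, not gensim's writer: `write_mm` is a hypothetical name, and the corpus is materialised in memory so the header counts are known before writing.

```python
import io

def write_mm(stream, corpus):
    """Write a BoW corpus to a binary stream in coordinate format; return byte offsets."""
    docs = [list(doc) for doc in corpus]  # materialised so the counts are known up front
    num_docs = len(docs)
    num_terms = 1 + max((term_id for doc in docs for term_id, _ in doc), default=-1)
    num_nnz = sum(len(doc) for doc in docs)

    stream.write(b"%%MatrixMarket matrix coordinate real general\n")
    stream.write(f"{num_docs} {num_terms} {num_nnz}\n".encode("ascii"))

    offsets = []
    for doc_no, doc in enumerate(docs, start=1):
        offsets.append(stream.tell())  # where this document's first entry begins
        for term_id, weight in sorted(doc):
            # Document and term indices are 1-based on disk.
            stream.write(f"{doc_no} {term_id + 1} {weight}\n".encode("ascii"))
    return offsets

corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
stream = io.BytesIO()
offsets = write_mm(stream, corpus)
print(offsets)
print(stream.getvalue().decode("ascii"))
```

A writer that cannot hold the corpus in memory would instead need to reserve space for the header and fill in the counts after the last document is written.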
classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.

  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – Mapping id -> word.
  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored to fname.index.
  • progress_cnt (int, optional) – Number of documents after which progress info is printed.
  • labels (bool, optional) – If True, ignore the first column (class labels).
  • metadata (bool, optional) – If True, serialize also writes out article titles to a pickle file.


>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("")
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (zero-based) by index
[(1, 0.1)]
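As a rough picture of what "offset metadata" buys, the sketch below writes per-document byte offsets to a side file named fname.index and uses them to fetch one document without scanning the rest. The helpers `save_with_index` and `load_document`, the pickle-based index format, and the sample bytes are all assumptions made for illustration; they are not gensim's API or file layout.

```python
import os
import pickle
import tempfile

HEADER = b"%%MatrixMarket matrix coordinate real general\n3 3 4\n"
# One byte chunk per document, matching the corpus in the example above.
DOC_CHUNKS = [b"1 2 0.3\n1 3 0.1\n", b"2 2 0.1\n", b"3 3 0.3\n"]

def save_with_index(fname, header, doc_chunks, index_fname=None):
    """Write the data file and a side file holding each document's byte offset."""
    index_fname = index_fname or fname + ".index"
    offsets = []
    with open(fname, "wb") as fout:
        fout.write(header)
        for chunk in doc_chunks:
            offsets.append(fout.tell())
            fout.write(chunk)
    with open(index_fname, "wb") as fout:
        pickle.dump(offsets, fout)
    return index_fname

def load_document(fname, docno, index_fname=None):
    """Fetch document `docno` by seeking to its stored offset."""
    with open(index_fname or fname + ".index", "rb") as fin:
        offsets = pickle.load(fin)
    if not 0 <= docno < len(offsets):
        raise IndexError(docno)
    start = offsets[docno]
    with open(fname, "rb") as fin:
        fin.seek(start)
        if docno + 1 < len(offsets):
            chunk = fin.read(offsets[docno + 1] - start)
        else:
            chunk = fin.read()  # last document: read to end of file
    bow = []
    for line in chunk.decode("ascii").splitlines():
        _, term, value = line.split()
        bow.append((int(term) - 1, float(value)))  # back to 0-based term ids
    return bow

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "corpus.mm")
    index_fname = save_with_index(fname, HEADER, DOC_CHUNKS)
    second_doc = load_document(fname, 1)

print(second_doc)
```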
skip_headers(self, input_file)

Skip file headers that appear before the first document.

Parameters:input_file (iterable of str) – Iterable taken from file in MM format.

transposed – ‘bool’