corpora.mmcorpus – Corpus in Matrix Market format

Corpus in the Matrix Market format.

class gensim.corpora.mmcorpus.MmCorpus(fname)

Bases: gensim.corpora._mmreader.MmReader, gensim.corpora.indexedcorpus.IndexedCorpus

Corpus serialized using the sparse coordinate Matrix Market format.

Wrap a term-document matrix on disk (in Matrix Market format) and present it as an object that supports iteration over the matrix rows (i.e. documents).

Notes

The file is read into memory one document at a time, not the whole matrix at once, unlike e.g. scipy.io.mmread and other implementations. This allows you to process corpora which are larger than the available RAM, in a streamed manner.
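The streaming idea can be illustrated without Gensim. The following is a minimal sketch, not the library's reader: the function name iter_mm_documents is hypothetical, and it assumes a coordinate-format file whose entries are sorted by document id and use 1-based indices. It holds only one document's entries in memory at a time and simply skips rows that have no entries.

```python
import io
from itertools import groupby


def iter_mm_documents(lines):
    """Yield (doc_id, [(term_id, value), ...]) one document at a time.

    Hypothetical helper: assumes entries are sorted by document id and
    that ids in the file are 1-based (converted to 0-based here).
    """
    it = iter(lines)
    # Skip the banner and comment lines; the first non-comment line is
    # the size line ("num_docs num_terms num_nnz"), which is consumed here.
    for line in it:
        if not line.startswith("%"):
            break
    # Lazily split the remaining lines; nothing is read ahead of need.
    entries = (line.split() for line in it if line.strip())
    for doc_id, group in groupby(entries, key=lambda entry: int(entry[0])):
        yield doc_id - 1, [(int(term) - 1, float(value)) for _, term, value in group]


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 3 4\n"
    "1 2 0.3\n"
    "1 3 0.1\n"
    "2 2 0.1\n"
    "3 3 0.3\n"
)
docs = list(iter_mm_documents(sample))
```

Passing an open file object instead of the in-memory sample gives the same behaviour, because iterating over a file yields one line at a time.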

Example

>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>> for document in corpus:
...     pass

Parameters

fname ({str, file-like object}) – Path to file in MM format or a file-like object that supports seek() (e.g. a compressed file opened by smart_open).

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. After that, calls to add_lifecycle_event() no longer record events into self.lifecycle_events.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
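The shape of a recorded event can be sketched in plain Python. This is an illustration of the key-values listed above, not Gensim's code: build_event and the module-level lifecycle_events list are hypothetical names, and the gensim version key is left out so that the sketch runs without the library installed.

```python
import platform
import sys
from copy import deepcopy
from datetime import datetime


def build_event(event_name, **event):
    """Return a copy of `event` with the automatic keys filled in (hypothetical helper)."""
    record = deepcopy(event)  # leave the caller's values untouched
    record["datetime"] = datetime.now().isoformat()
    record["python"] = sys.version
    record["platform"] = platform.platform()
    record["event"] = event_name
    return record


# Stand-in for the lifecycle_events attribute of an object.
lifecycle_events = []
lifecycle_events.append(build_event("created", msg="corpus opened"))
```

Every value in the record is a string, which keeps the event JSON-serializable as the parameter description recommends.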

docbyoffset(self, offset)

Get the document at file offset offset (in bytes).

Parameters

offset (int) – File offset, in bytes, of the desired document.

Returns

Document in sparse bag-of-words format.

Return type

list of (int, float)
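Offset-based access can be sketched with nothing more than seek() and tell(). The code below is an assumption-laden illustration, not Gensim's implementation: build_offsets and doc_by_offset are hypothetical helpers, the file is assumed to be opened in binary mode, and entries are assumed to be sorted by document id with 1-based indices.

```python
import io


def build_offsets(fileobj):
    """Return the byte offset of the first entry of each document (hypothetical helper)."""
    offsets, previous_doc, size_line_seen = [], None, False
    fileobj.seek(0)
    while True:
        position = fileobj.tell()  # offset of the line about to be read
        line = fileobj.readline()
        if not line:
            break
        if line.startswith(b"%"):
            continue  # banner or comment
        if not size_line_seen:
            size_line_seen = True  # "num_docs num_terms num_nnz"
            continue
        doc_id = int(line.split()[0])
        if doc_id != previous_doc:
            offsets.append(position)
            previous_doc = doc_id
    return offsets


def doc_by_offset(fileobj, offset):
    """Read one document starting at `offset` (in bytes), as (term_id, value) pairs."""
    fileobj.seek(offset)
    document, current_doc = [], None
    for line in iter(fileobj.readline, b""):
        doc_id, term_id, value = line.split()
        if current_doc is None:
            current_doc = doc_id
        elif doc_id != current_doc:
            break  # reached the next document
        document.append((int(term_id) - 1, float(value)))
    return document


data = io.BytesIO(
    b"%%MatrixMarket matrix coordinate real general\n"
    b"3 3 4\n"
    b"1 2 0.3\n"
    b"1 3 0.1\n"
    b"2 2 0.1\n"
    b"3 3 0.3\n"
)
offsets = build_offsets(data)
second_document = doc_by_offset(data, offsets[1])
```

Storing one offset per document is what makes random access cheap: fetching a document costs one seek plus reading that document's own lines.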

input

Type

object

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

num_docs

Type

long long

num_nnz

Type

long long

num_terms

Type

long long

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)

Save a corpus to disk in the sparse coordinate Matrix Market format.

Parameters
  • fname (str) – Path to file.

  • corpus (iterable of list of (int, number)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping between word_id -> word. Used to retrieve the total vocabulary size if provided. Otherwise, the total vocabulary size is estimated based on the highest feature id encountered in corpus.

  • progress_cnt (int, optional) – How often to report (log) progress.

  • metadata (bool, optional) – If True, write out additional metadata.
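The role of id2word in sizing the vocabulary can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: vocabulary_size is a hypothetical helper, the size is taken as len(id2word) when a mapping is given, and feature ids are assumed to be zero-based, so the estimate is the highest id plus one.

```python
def vocabulary_size(corpus, id2word=None):
    """Return the number of terms, from id2word if given, else estimated from corpus.

    Hypothetical helper; assumes zero-based feature ids.
    """
    if id2word is not None:
        return len(id2word)
    highest_id = -1  # so that an empty corpus yields a size of 0
    for document in corpus:
        for term_id, _ in document:
            highest_id = max(highest_id, term_id)
    return highest_id + 1


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
estimated = vocabulary_size(corpus)
from_mapping = vocabulary_size(corpus, id2word={0: "a", 1: "b", 2: "c", 3: "d", 4: "e"})
```

The two results differ when the mapping contains words that never occur in the corpus, which is why supplying id2word gives a more reliable vocabulary size than the estimate.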

Warning

This function is called automatically by serialize. Don’t call it directly; call serialize instead.

Example

>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>>
>>> MmCorpus.save_corpus("random", corpus)  # Do not do it, use `serialize` instead.
[97, 121, 169, 201, 225, 249, 258, 276, 303]

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize corpus with offset metadata, which allows documents to be accessed directly by index after loading.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word.

  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored to fname.index.

  • progress_cnt (int, optional) – Number of documents after which progress info is printed.

  • labels (bool, optional) – If True, ignore the first column (class labels).

  • metadata (bool, optional) – If True, ensure that serialize will write out article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (indexing is zero-based)
[(1, 0.1)]

skip_headers(self, input_file)

Skip file headers that appear before the first document.

Parameters

input_file (iterable of str) – Iterable taken from file in MM format.

transposed

Type

bool