corpora._mmreader – Reader for corpus in the Matrix Market format.

Reader for corpus in the Matrix Market format.

class gensim.corpora._mmreader.MmReader(input, transposed=True)

Bases: object

Matrix Market file reader (fast Cython version), used for MmCorpus.

Wraps a term-document matrix stored on disk in Matrix Market format and presents it as an object that supports iteration over its rows (documents).

num_docs

int – Number of documents in the Matrix Market file.

num_terms

int – Number of terms.

num_nnz

int – Number of non-zero entries in the matrix.

Notes

Note that the file is read into memory one document at a time, not the whole matrix at once (unlike scipy.io.mmread). This allows us to process corpora which are larger than the available RAM.
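The streaming behaviour described in the note can be sketched in plain Python. This is an illustration of the idea only, not gensim's Cython implementation: `iter_documents` and `SAMPLE_MM` are hypothetical names, the entries of a document are assumed to sit on consecutive lines, and (unlike a full reader) the sketch does not emit empty documents for ids that have no entries.

```python
import io
from itertools import groupby

# A tiny coordinate-format Matrix Market file: 3 documents, 4 terms, 5 entries.
SAMPLE_MM = """%%MatrixMarket matrix coordinate real general
3 4 5
1 1 1.0
1 3 2.0
2 2 0.5
3 1 4.0
3 4 1.5
"""

def iter_documents(lines):
    """Yield (doc_id, bow) pairs one document at a time (hypothetical helper)."""
    body = (ln for ln in lines if ln.strip() and not ln.startswith("%"))
    next(body)  # skip the "rows cols nnz" dimensions line
    entries = (ln.split() for ln in body)
    # Assumption: all entries of one document are on consecutive lines.
    for doc_id, group in groupby(entries, key=lambda parts: int(parts[0])):
        # Matrix Market indices are 1-based; shift them to 0-based ids.
        yield doc_id - 1, [(int(t) - 1, float(v)) for _, t, v in group]

docs = list(iter_documents(io.StringIO(SAMPLE_MM)))
```

Only the lines of the current document are held in memory at any point; `list(...)` is used here purely to make the result easy to inspect.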

Parameters:
  • input ({str, file-like object}) – Path to input file in MM format or a file-like object that supports seek() (e.g. GzipFile, BZ2File).
  • transposed (bool, optional) – If True, each line is expected to hold doc_id, term_id, value; otherwise term_id, doc_id, value.
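
The effect of `transposed` on how a single line is interpreted can be sketched as follows. `parse_entry` is a hypothetical helper written for illustration, not part of gensim; the shift from 1-based to 0-based ids reflects the Matrix Market convention of 1-based indices and is an assumption about how a reader would normalize them.

```python
def parse_entry(line, transposed=True):
    """Return (doc_id, term_id, value) with 0-based ids (hypothetical helper)."""
    first, second, value = line.split()
    if transposed:
        # Line layout: doc_id term_id value
        doc_id, term_id = int(first), int(second)
    else:
        # Line layout: term_id doc_id value
        term_id, doc_id = int(first), int(second)
    # Assumption: ids in the file are 1-based, so shift to 0-based.
    return doc_id - 1, term_id - 1, float(value)

as_doc_first = parse_entry("2 5 0.75", transposed=True)
as_term_first = parse_entry("2 5 0.75", transposed=False)
```

The same line therefore describes term 4 of document 1 in one layout and term 1 of document 4 in the other, which is why the flag has to match how the file was written.
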
docbyoffset(self, offset)

Get the document that starts at byte position offset in the file.

Parameters: offset (int) – Offset, in bytes, of the desired document.
Returns: Document in BoW format.
Return type: list of (int, float)
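
Offset-based access can be sketched in plain Python as a seek followed by reading lines until the document id changes. This is a minimal illustration under stated assumptions, not gensim's implementation: `doc_at_offset` and `SAMPLE_MM` are hypothetical, the offset is assumed to point at the first entry of a document, and lines are assumed to be in doc_id, term_id, value order.

```python
import io

SAMPLE_MM = (
    b"%%MatrixMarket matrix coordinate real general\n"
    b"2 3 3\n"
    b"1 1 1.0\n"
    b"1 3 2.0\n"
    b"2 2 0.5\n"
)

def doc_at_offset(stream, offset):
    """Read the document whose first entry starts at byte `offset` (hypothetical helper)."""
    stream.seek(offset)
    doc, current_id = [], None
    for raw in stream:
        doc_id, term_id, value = raw.split()
        if current_id is None:
            current_id = doc_id
        elif doc_id != current_id:
            break  # reached the first entry of the next document
        # Assumption: ids in the file are 1-based, so shift to 0-based.
        doc.append((int(term_id) - 1, float(value)))
    return doc

stream = io.BytesIO(SAMPLE_MM)
second_doc = doc_at_offset(stream, SAMPLE_MM.index(b"2 2 0.5"))
first_doc = doc_at_offset(stream, SAMPLE_MM.index(b"1 1 1.0"))
```

In practice the offsets would come from an index built while the corpus was written, rather than from searching the raw bytes as done here for brevity.
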
input

input – object

num_docs

num_docs – int

num_nnz

num_nnz – int

num_terms

num_terms – int

skip_headers(self, input_file)

Skip file headers that appear before the first document.

Parameters: input_file (iterable of str) – Iterable taken from file in MM format.
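
Header skipping can be sketched as below. The function here is a hypothetical stand-in written for illustration and operates on text lines; it assumes the header consists of zero or more lines starting with `%`, followed by a single dimensions line, and that the dimensions line is consumed as well so the iterator is left at the first entry.

```python
import io

def skip_headers(lines):
    """Advance `lines` past comment lines and the dimensions line (hypothetical sketch)."""
    for line in lines:
        if line.startswith("%"):
            continue  # banner or comment line
        break  # first non-comment line holds "rows cols nnz"; consume it and stop

handle = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "% generated for illustration\n"
    "2 3 3\n"
    "1 1 1.0\n"
)
skip_headers(handle)
first_entry = handle.readline()
```

After the call, the next line read from `handle` is the first matrix entry.
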
transposed

transposed – bool