corpora.mmcorpus – Corpus in Matrix Market format
Corpus in the Matrix Market format.
gensim.corpora.mmcorpus.MmCorpus(fname)
Bases: gensim.corpora._mmreader.MmReader, gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in matrix market format.
Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the rows (~documents).
num_docs
int – Number of documents in the Matrix Market file.
num_terms
int – Number of terms.
num_nnz
int – Number of non-zero entries.
Notes
Note that the file is read into memory one document at a time, not the whole matrix at once (unlike mmread()). This allows us to process corpora which are larger than the available RAM.
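As a rough illustration of the streaming idea (not gensim's actual reader), the sketch below groups Matrix Market coordinate lines into documents one at a time, using only the standard library. The function name iter_mm_documents and the sample data are hypothetical. It assumes each entry is a `docno termid value` triple with 1-based indices, and that entries are sorted by document.

```python
import io


def iter_mm_documents(lines):
    """Yield one bag-of-words document at a time from Matrix Market lines.

    Hypothetical helper for illustration only. Assumes `docno termid value`
    entries (1-based), sorted by docno, after a `num_docs num_terms num_nnz`
    size line.
    """
    num_docs = None
    doc, doc_id = [], 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner line and any comments
        if num_docs is None:
            num_docs = int(line.split()[0])  # first field of the size line
            continue
        docno, termid, value = line.split()
        docno, termid = int(docno) - 1, int(termid) - 1
        # A new document number means the previous document is complete;
        # documents with no entries are yielded as empty lists.
        while doc_id < docno:
            yield doc
            doc, doc_id = [], doc_id + 1
        doc.append((termid, float(value)))
    # Flush the last document and any trailing empty ones.
    while num_docs is not None and doc_id < num_docs:
        yield doc
        doc, doc_id = [], doc_id + 1


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 3 4\n"
    "1 2 0.3\n"
    "1 3 0.1\n"
    "2 2 0.1\n"
    "3 3 0.3\n"
)
docs = list(iter_mm_documents(sample))
```

Only the current document is held in memory at any point, which is why this style of reading scales to files larger than RAM.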
Example
>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>> for document in corpus:
...     pass
Parameters: fname ({str, file-like object}) – Path to a file in MM format, or a file-like object that supports seek() (e.g. gzip.GzipFile, bz2.BZ2File).
docbyoffset(self, offset)
Get the document at the given file offset (in bytes).
Parameters: offset (int) – Offset, in bytes, of the desired document.
Returns: Document in BoW format.
Return type: list of (int, float)
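To make the offset idea concrete, here is a minimal standard-library sketch of recording a byte offset per document while writing, then seeking straight to one document when reading. It is not gensim's implementation: write_with_offsets and doc_by_offset are hypothetical names, the header is written in a single pass for simplicity, and every document is assumed to be non-empty.

```python
import os
import tempfile


def write_with_offsets(path, corpus):
    """Write `docno termid value` lines; return the byte offset of each document.

    Hypothetical helper; assumes an in-memory corpus of non-empty documents.
    """
    num_terms = 1 + max((t for doc in corpus for t, _ in doc), default=-1)
    num_nnz = sum(len(doc) for doc in corpus)
    offsets = []
    with open(path, "wb") as fout:
        fout.write(b"%%MatrixMarket matrix coordinate real general\n")
        fout.write(f"{len(corpus)} {num_terms} {num_nnz}\n".encode("ascii"))
        for docno, doc in enumerate(corpus):
            offsets.append(fout.tell())  # where this document starts
            for termid, value in doc:
                line = f"{docno + 1} {termid + 1} {value}\n"
                fout.write(line.encode("ascii"))
    return offsets


def doc_by_offset(path, offset):
    """Seek to `offset` and read entries until the document number changes."""
    doc, first_docno = [], None
    with open(path, "rb") as fin:
        fin.seek(offset)
        for raw in fin:
            docno, termid, value = raw.decode("ascii").split()
            if first_docno is None:
                first_docno = docno
            elif docno != first_docno:
                break  # reached the next document
            doc.append((int(termid) - 1, float(value)))
    return doc


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
path = os.path.join(tempfile.mkdtemp(), "sketch.mm")
offsets = write_with_offsets(path, corpus)
second = doc_by_offset(path, offsets[1])
```

Because the offsets are plain integers, they can be stored in a small side index and reused later, so fetching one document does not require scanning the file from the start.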
input
input – object
load(fname, mmap=None)
Load a previously saved object (using save()) from file.
Parameters: |
|
---|
Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
num_docs
num_docs – ‘int’
num_nnz
num_nnz – ‘int’
num_terms
num_terms – ‘int’
save(*args, **kwargs)
Saves the in-memory state of the corpus.
Warning
This saves only the “state” of the corpus class, not the corpus data itself.
To save the data, use save_corpus() instead.
Parameters: |
|
---|
save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)
Save a corpus in the Matrix Market format to disk.
Parameters: |
|
---|
Notes
This function is automatically called by MmCorpus.serialize; don’t call it directly, call serialize instead.
Example
>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>>
>>> MmCorpus.save_corpus("random", corpus)  # Don't do this; use `serialize` instead.
[97, 121, 169, 201, 225, 249, 258, 276, 303]
serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.
Parameters: |
|
---|
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]
skip_headers(self, input_file)
Skip file headers that appear before the first document.
Parameters: input_file (iterable of str) – Iterable taken from a file in MM format.
transposed
transposed – ‘bool’