corpora.malletcorpus – Corpus in Mallet format
gensim.corpora.malletcorpus.MalletCorpus(fname, id2word=None, metadata=False)
Bases: gensim.corpora.lowcorpus.LowCorpus
Corpus that handles input in Mallet format.
Format description
One file with one instance (document) per line. The data is assumed to be in the following format:
[URL] [language] [text of the page...]
Or, more generally,
[document #1 id] [label] [text of the document...]
[document #2 id] [label] [text of the document...]
...
[document #N id] [label] [text of the document...]
Note that Gensim does not use the language/label field; __unknown__ is used as its default value.
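To make the line layout concrete, here is a minimal sketch of splitting one line into its three fields. It is plain Python and not Gensim's own parser; the function name parse_mallet_line and the sample lines are hypothetical, and fields are assumed to be separated by whitespace.

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, label, tokens).

    Assumes whitespace-separated fields: id, label, then the text.
    """
    parts = line.strip().split()
    if len(parts) < 2:
        raise ValueError("expected at least an id and a label: %r" % line)
    doc_id, label, tokens = parts[0], parts[1], parts[2:]
    return doc_id, label, tokens


# Hypothetical sample lines, made up for illustration.
sample = [
    "doc1 en human interface computer",
    "doc2 __unknown__ graph trees",
]
parsed = [parse_mallet_line(line) for line in sample]
```

Each entry of parsed is a tuple of the document id, the label, and the list of tokens that make up the document text.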
Examples
>>> from gensim.test.utils import get_tmpfile, common_texts
>>> from gensim.corpora import MalletCorpus
>>> from gensim.corpora import Dictionary
>>>
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>>
>>> # Write corpus in Mallet format to disk
>>> output_fname = get_tmpfile("corpus.mallet")
>>> MalletCorpus.serialize(output_fname, corpus, dictionary)
>>>
>>> # Read corpus
>>> loaded_corpus = MalletCorpus(output_fname)
Parameters
fname (str) – Path to file in Mallet format.
id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings).
If not provided, the mapping is constructed directly from fname.
metadata (bool, optional) – If True, return additional information (“document id” and “lang”) when you call
line2doc(), __iter__() or docbyoffset().
docbyoffset(offset)
Get the document stored in the file at the given offset position.
Parameters
offset (int) – Offset (in bytes) to the beginning of the document.
Returns
Document in BoW format (plus “document_id” and “lang” if metadata=True).
Return type
list of (int, int)
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> data = MalletCorpus(datapath("testcorpus.mallet"))
>>> data.docbyoffset(1) # end of first line
[(3, 1), (4, 1)]
>>> data.docbyoffset(4) # start of second line
[(4, 1)]
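The idea behind offset-based access can be sketched in plain Python: record the byte position of each line while writing, then seek to that position and read a single line. This is an illustration under stated assumptions, not Gensim's implementation; the helper read_line_at, the file name toy.mallet and the two sample lines are hypothetical.

```python
import os
import tempfile

# Two hypothetical documents in Mallet format, as raw bytes.
lines = [
    b"0 __unknown__ human interface\n",
    b"1 __unknown__ graph trees minors\n",
]

# Write the lines and remember the byte offset at which each one starts.
path = os.path.join(tempfile.mkdtemp(), "toy.mallet")
offsets = []
with open(path, "wb") as fout:
    for line in lines:
        offsets.append(fout.tell())
        fout.write(line)


def read_line_at(path, offset):
    """Seek to a byte offset and return the rest of that line as text."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


second = read_line_at(path, offsets[1])
```

Seeking to an offset that does not coincide with the start of a line returns only the remainder of that line, so offsets are best taken from the list produced when the file was written.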
id2word
Get mapping between words and their ids.
line2doc(line)
Convert a line into a document in BoW format.
Parameters
line (str) – Line from input file.
Returns
Document in BoW format (plus “document_id” and “lang” if metadata=True).
Return type
list of (int, int)
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> corpus = MalletCorpus(datapath("testcorpus.mallet"))
>>> corpus.line2doc("en computer human interface")
[(3, 1), (4, 1)]
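As a rough illustration of the conversion, the sketch below turns a line into (word_id, count) pairs in plain Python. It is not Gensim's code: the function line_to_bow and the vocabulary word2id are hypothetical, the first two fields are assumed to be the document id and the language, and words missing from the vocabulary are assumed to be skipped.

```python
from collections import Counter


def line_to_bow(line, word2id, metadata=False):
    """Convert a Mallet-format line into a sorted list of (word_id, count).

    Assumes the line has at least two fields (id and language) before the text.
    """
    parts = line.strip().split()
    doc_id, lang, words = parts[0], parts[1], parts[2:]
    # Count only words that are present in the vocabulary.
    counts = Counter(word2id[w] for w in words if w in word2id)
    bow = sorted(counts.items())
    if metadata:
        return bow, (doc_id, lang)
    return bow


# Hypothetical vocabulary, made up for illustration.
word2id = {"human": 0, "interface": 1, "graph": 2}
bow = line_to_bow("doc7 en human graph human unseen", word2id)
bow_meta = line_to_bow("doc7 en human graph human unseen", word2id, metadata=True)
```

Here bow holds only the counts, while bow_meta pairs the same counts with the document id and language, which mirrors the effect the metadata flag is described as having.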
load(fname, mmap=None)
Load an object previously saved using save() from a file.
Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save() – Save object to file.
Returns
Object loaded from fname.
Return type
object
Raises
AttributeError – When called on an object instance instead of the class (this is a class method).
save(*args, **kwargs)
Save the in-memory state of the corpus.
Warning
This saves only the “state” of a corpus class, not the corpus data!
To save the data, use the serialize method of the output format you’d like to use
(e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the Mallet format.
Warning
This function is automatically called by gensim.corpora.malletcorpus.MalletCorpus.serialize().
Don’t call it directly; call gensim.corpora.malletcorpus.MalletCorpus.serialize() instead.
Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings).
If not provided, the mapping is constructed directly from corpus.
metadata (bool, optional) – If True - ????
Returns
List of offsets (in bytes) in the resulting file, one per document;
these can be passed to docbyoffset().
Return type
list of int
Notes
The document id is generated by enumerating the corpus, so it ranges from 0 to the number of documents in the corpus minus one.
Since Mallet has a language field in the format, this defaults to the string ‘__unknown__’. If the language needs to be saved, post-processing will be required.
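The notes above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than Gensim's implementation: the function write_mallet and the toy id2word and corpus are hypothetical, each word is assumed to be repeated according to its integer count, and an in-memory buffer stands in for the output file.

```python
import io


def write_mallet(fout, corpus, id2word):
    """Write a BoW corpus as Mallet-format lines; return each line's byte offset."""
    offsets = []
    for doc_id, doc in enumerate(corpus):  # ids are 0 .. len(corpus) - 1
        words = []
        for word_id, count in doc:
            # Repeat each word as many times as it occurs in the document.
            words.extend([id2word[word_id]] * int(count))
        offsets.append(fout.tell())
        line = "%d __unknown__ %s\n" % (doc_id, " ".join(words))
        fout.write(line.encode("utf8"))
    return offsets


# Hypothetical mapping and corpus, made up for illustration.
id2word = {0: "human", 1: "interface", 2: "graph"}
corpus = [[(0, 2), (1, 1)], [(2, 1)]]

buf = io.BytesIO()
offsets = write_mallet(buf, corpus, id2word)
text = buf.getvalue().decode("utf8")
```

The language field is always written as __unknown__ in this sketch, so a different label would have to be substituted in a later pass over the output.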
serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize the corpus with offset metadata, which allows direct indexing after loading.
Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index; if None, the index is stored in fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True, ignore the first column (class labels).
metadata (bool, optional) – If True, ensure that serialize will write out article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1]) # retrieve the document at index 1
[(1, 0.1)]