corpora.malletcorpus – Corpus in Mallet format

Corpus in Mallet format.

class gensim.corpora.malletcorpus.MalletCorpus(fname, id2word=None, metadata=False)

Bases: gensim.corpora.lowcorpus.LowCorpus

Corpus handles input in Mallet format.

Format description

One file with one instance (document) per line. Each line is assumed to have the following format:

[URL] [language] [text of the page...]

Or, more generally,

[document #1 id] [label] [text of the document...]
[document #2 id] [label] [text of the document...]
...
[document #N id] [label] [text of the document...]

Note that Gensim does not use the language/label field; when writing, it uses __unknown__ as the default value.
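
The line layout described above can be illustrated with a small standard-library sketch. The helper name `parse_mallet_line` is hypothetical and is not part of Gensim; it only assumes that fields are separated by whitespace, with the document id first and the label second.

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, label, tokens).

    Hypothetical helper for illustration only, not Gensim's parser.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected at least a document id and a label")
    # First field: document id, second field: label, rest: document text.
    return parts[0], parts[1], parts[2:]


doc_id, label, tokens = parse_mallet_line("doc1 en human computer interface")
print(doc_id, label, tokens)
```

A line that carries only an id and a label yields an empty token list under this sketch.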

Examples

>>> from gensim.test.utils import datapath, get_tmpfile, common_texts
>>> from gensim.corpora import MalletCorpus
>>> from gensim.corpora import Dictionary
>>>
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>>
>>> # Write corpus in Mallet format to disk
>>> output_fname = get_tmpfile("corpus.mallet")
>>> MalletCorpus.serialize(output_fname, corpus, dictionary)
>>>
>>> # Read corpus
>>> loaded_corpus = MalletCorpus(output_fname)

Parameters:
  • fname (str) – Path to file in Mallet format.
  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from fname.
  • metadata (bool, optional) – If True, return additional information ("document id" and "lang") when you call line2doc(), __iter__() or docbyoffset().

docbyoffset(offset)

Get the document stored in file by offset position.

Parameters: offset (int) – Offset (in bytes) of the beginning of the document.
Returns: Document in BoW format (plus "document_id" and "lang" if metadata=True).
Return type: list of (int, int)

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> data = MalletCorpus(datapath("testcorpus.mallet"))
>>> data.docbyoffset(1)  # end of first line
[(3, 1), (4, 1)]
>>> data.docbyoffset(4)  # start of second line
[(4, 1)]
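
The idea of addressing a document by its byte offset can be sketched with plain file operations. This is a minimal illustration using only the standard library; it is not how Gensim implements docbyoffset(), and the helper `read_line_at` and the sample lines are assumptions made for the example.

```python
import os
import tempfile

# Two sample documents in Mallet format (made up for this sketch).
lines = [b"0 __unknown__ human interface\n", b"1 __unknown__ computer\n"]

# Record the byte position at which each line starts while writing.
offsets = []
with tempfile.NamedTemporaryFile(delete=False) as fout:
    for line in lines:
        offsets.append(fout.tell())
        fout.write(line)
    path = fout.name


def read_line_at(path, offset):
    """Hypothetical helper: seek to a byte offset and read one line."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").rstrip("\n")


second = read_line_at(path, offsets[1])
os.remove(path)
print(second)
```

Because the offset is a byte position rather than a line number, the reader can jump straight to a document without scanning the lines before it.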

id2word

Get mapping between words and their ids.

line2doc(line)

Convert line into document in BoW format.

Parameters: line (str) – Line from input file.
Returns: Document in BoW format (plus "document_id" and "lang" if metadata=True).
Return type: list of (int, int)

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> corpus = MalletCorpus(datapath("testcorpus.mallet"))
>>> corpus.line2doc("en computer human interface")
[(3, 1), (4, 1)]
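
As a rough illustration of what converting a line to BoW involves, the sketch below skips the first two fields (document id and label) and counts the remaining words. The helper `line_to_bow` and the `word2id` mapping are hypothetical and do not reproduce Gensim's implementation or the ids in its test corpus. Skipping two fields would be consistent with the two-entry result in the example above, where "en" and "computer" are presumably consumed as the id and the label.

```python
from collections import Counter


def line_to_bow(line, word2id):
    """Hypothetical helper: Mallet-format line -> sorted (word_id, count) pairs."""
    words = line.split()[2:]  # skip the document id and the label
    # Count only words that are present in the mapping.
    counts = Counter(word2id[w] for w in words if w in word2id)
    return sorted(counts.items())


word2id = {"computer": 0, "human": 1, "interface": 2}  # made-up mapping
bow = line_to_bow("doc1 en human interface human", word2id)
print(bow)  # [(1, 2), (2, 1)]
```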

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()
Save object to file.
Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the "state" of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save a corpus in the Mallet format.

Warning

This function is called automatically by gensim.corpora.malletcorpus.MalletCorpus.serialize(). Do not call it directly; use serialize() instead.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from corpus.
  • metadata (bool, optional) – If True - ????
Returns: List of offsets in resulting file for each document (in bytes), can be used for docbyoffset().
Return type: list of int

Notes

The document id is generated by enumerating the corpus. That is, ids range from 0 to one less than the number of documents in the corpus.

The Mallet format includes a language field, which is written as the string '__unknown__' by default. If the language needs to be saved, post-processing will be required.
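
The notes above can be sketched as a small writer that enumerates documents for their ids, fills the language field with '__unknown__', and collects byte offsets. The function `write_mallet` is hypothetical and is not Gensim's save_corpus(); repeating each word according to its count is an assumption made for this illustration.

```python
import io


def write_mallet(fout, corpus, id2word):
    """Hypothetical writer: one line per document, returns byte offsets."""
    offsets = []
    for doc_id, doc in enumerate(corpus):  # ids come from enumeration
        words = []
        for word_id, count in doc:
            # Assumption: a word with count n is written n times.
            words.extend([id2word[word_id]] * int(count))
        offsets.append(fout.tell())  # byte position where this document starts
        line = "%s %s %s\n" % (doc_id, "__unknown__", " ".join(words))
        fout.write(line.encode("utf8"))
    return offsets


id2word = {0: "human", 1: "interface"}  # made-up mapping
corpus = [[(0, 2), (1, 1)], [(1, 1)]]
buffer = io.BytesIO()
offsets = write_mallet(buffer, corpus, id2word)
print(buffer.getvalue().decode("utf8"))
print(offsets)  # [0, 36]
```

The returned offsets are the kind of values that a byte-offset lookup such as docbyoffset() expects.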

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows direct indexing after loading.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – Mapping id -> word.
  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored to fname.index.
  • progress_cnt (int, optional) – Number of documents after which progress info is printed.
  • labels (bool, optional) – If True - ignore first column (class labels).
  • metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1]) # retrieve the document at index 1
[(1, 0.1)]