
corpora.lowcorpus – Corpus in GibbsLda++ format

Corpus in GibbsLda++ format.

class gensim.corpora.lowcorpus.LowCorpus(fname, id2word=None, line2words=<function split_on_space>)

Bases: gensim.corpora.indexedcorpus.IndexedCorpus

Corpus handles input in GibbsLda++ format.

Format description

Both the data for training/estimating the model and new data (i.e. previously unseen data) have the same format, as follows:

[M]
[document_1]
[document_2]
...
[document_M]

in which the first line is the total number of documents [M]. Each line after that is one document; [document_i] is the i-th document of the dataset, which consists of a list of N_i words/terms:

[document_i] = [word_i1] [word_i2] ... [word_iN_i]

in which all [word_ij] (i=1..M, j=1..N_i) are text strings, separated by the blank character.
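To make the layout concrete, here is a minimal, dependency-free sketch of reading this format (`parse_gibbslda` is a hypothetical helper for illustration; the actual parsing is done by LowCorpus itself):

```python
def parse_gibbslda(text):
    """Parse a GibbsLda++ corpus: the first line is the document count [M],
    each following line is one whitespace-separated document."""
    lines = text.strip().splitlines()
    m = int(lines[0])                      # first line: total number of documents [M]
    docs = [line.split() for line in lines[1:1 + m]]
    return docs

sample = "2\nhuman interface computer\nsurvey user computer system\n"
docs = parse_gibbslda(sample)
# docs[0] == ['human', 'interface', 'computer']
```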


>>> from gensim.test.utils import get_tmpfile, common_texts
>>> from gensim.corpora import LowCorpus
>>> from gensim.corpora import Dictionary
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>> # Write corpus in GibbsLda++ format to disk
>>> output_fname = get_tmpfile("corpus.low")
>>> LowCorpus.serialize(output_fname, corpus, dictionary)
>>> # Read corpus
>>> loaded_corpus = LowCorpus(output_fname)
Parameters

  • fname (str) – Path to file in GibbsLda++ format.

  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from fname.

  • line2words (callable, optional) – Function which converts a line (str) into tokens (list of str); split_on_space() is used by default.
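The line2words hook only needs to be a callable mapping one line to a list of tokens. For illustration, a hypothetical comma-based tokenizer (split_on_comma is not part of gensim) that could be passed as the line2words argument:

```python
def split_on_comma(line):
    """Alternative tokenizer: split a line on commas instead of spaces."""
    return [tok.strip() for tok in line.split(",") if tok.strip()]

tokens = split_on_comma("human, interface, computer")
# ['human', 'interface', 'computer']
```

It would then be used as `LowCorpus(fname, line2words=split_on_comma)`.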


docbyoffset(offset)

Get the document stored in the file at a given byte offset.


Parameters

offset (int) – Offset (in bytes) to the beginning of the document.


Returns

Document in BoW format.

Return type

list of (int, int)


>>> from gensim.test.utils import datapath
>>> from gensim.corpora import LowCorpus
>>> data = LowCorpus(datapath("testcorpus.low"))
>>> data.docbyoffset(1)  # end of first line
[]
>>> data.docbyoffset(2)  # start of second line
[(0, 1), (3, 1), (4, 1)]
property id2word

Get mapping between words and their ids.


line2doc(line)

Convert a line into a document in BoW format.


Parameters

line (str) – Line from the input file.


Returns

Document in BoW format.

Return type

list of (int, int)
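Conceptually, this tokenizes the line and counts vocabulary ids. A minimal stand-alone sketch of the idea (line2bow and word2id are hypothetical names for illustration, not gensim API):

```python
from collections import Counter

def line2bow(line, word2id):
    """Convert one document line into a sparse BoW list of (word_id, count),
    skipping tokens missing from the vocabulary."""
    counts = Counter(word2id[w] for w in line.split() if w in word2id)
    return sorted(counts.items())

word2id = {"human": 0, "computer": 1}
bow = line2bow("human computer human unknown", word2id)
# [(0, 2), (1, 1)]
```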

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Save the corpus's in-memory state.


Warning

This saves only the "state" of the corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save a corpus in the GibbsLda++ format.


Warning

This function is automatically called by gensim.corpora.lowcorpus.LowCorpus.serialize(); don't call it directly, call serialize() instead.

Parameters

  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, int)) – Corpus in BoW format.

  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from corpus.

  • metadata (bool, optional) – This parameter is ignored.


Returns

List of offsets (in bytes) of each document in the resulting file; can be used by docbyoffset().

Return type

list of int
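These byte offsets are what make docbyoffset() possible. A small stdlib-only sketch of the idea, recording f.tell() while writing each document and seeking straight back to it later (assuming the GibbsLda++ layout described above):

```python
import os
import tempfile

# Record the byte offset of each document while writing, then use seek()
# for random access later -- the same idea behind save_corpus()'s offsets.
docs = ["human interface computer", "survey user system"]
path = os.path.join(tempfile.mkdtemp(), "corpus.low")

offsets = []
with open(path, "w", encoding="utf8", newline="\n") as f:
    f.write("%d\n" % len(docs))     # first line: total number of documents [M]
    for doc in docs:
        offsets.append(f.tell())    # byte offset where this document starts
        f.write(doc + "\n")

with open(path, encoding="utf8", newline="\n") as f:
    f.seek(offsets[1])              # jump straight to the second document
    second = f.readline().strip()
# second == "survey user system"
```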

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize corpus with offset metadata, allowing the use of direct indexing after loading.

Parameters

  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word.

  • index_fname (str, optional) – Path for the resulting index; if None, the index is stored to fname.index.

  • progress_cnt (int, optional) – Number of documents after which progress info is printed.

  • labels (bool, optional) – If True, ignore the first column (class labels).

  • metadata (bool, optional) – If True, serialize will also write out article titles to a pickle file.


>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1
[(1, 0.1)]

gensim.corpora.lowcorpus.split_on_space(s)

Split a line by spaces; used in gensim.corpora.lowcorpus.LowCorpus.


Parameters

s (str) – Line of text to split.


Returns

List of tokens from s.

Return type

list of str
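In behaviour this is presumably close to a plain split on the blank character. A stdlib-only sketch (not gensim's exact implementation, which also handles unicode decoding):

```python
def split_on_space_sketch(s):
    """Split a line on spaces, dropping empty tokens and surrounding whitespace."""
    return [word for word in s.strip().split(" ") if word]

tokens = split_on_space_sketch(" human interface computer ")
# ['human', 'interface', 'computer']
```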