gensim logo

gensim
gensim tagline

Get Expert Help From The Gensim Authors

Consulting in Machine Learning & NLP

• Commercial document similarity engine: ScaleText.ai

Corporate trainings in Python Data Science and Deep Learning

corpora.csvcorpus – Corpus in CSV format

corpora.csvcorpus – Corpus in CSV format

Corpus in CSV format.

class gensim.corpora.csvcorpus.CsvCorpus(fname, labels)

Bases: gensim.interfaces.CorpusABC

Corpus in CSV format.

Notes

The CSV delimiter, headers etc. are guessed automatically based on the file content. All row values are expected to be ints/floats.

Parameters:
  • fname (str) – Path to corpus.
  • labels (bool) – If True - ignore first column (class labels).
load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None must be set.

See also

save()
Save object to file.
Returns:Object loaded from fname.
Return type:object
Raises:AttributeError – When called on an object instance instead of class (this is a class method).
save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This save only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

save_corpus(fname, corpus, id2word=None, metadata=False)

Save corpus to disk.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).

In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling :meth:`gensim.interfaces.CorpusABC.save_corpus().

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • id2word (Dictionary, optional) – Dictionary of corpus.
  • metadata (bool, optional) – Write additional metadata to a separate too?