corpora.svmlightcorpus – Corpus in SVMlight format

Corpus in SVMlight format.

class gensim.corpora.svmlightcorpus.SvmLightCorpus(fname, store_labels=True)

Bases: gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in SVMlight format.

Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

The “qid” feature (used for SVMlight ranking), if present, is ignored.

Notes

Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert features to 0-based ids internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
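The format and the id shift described above can be illustrated with a small standalone parser. This is only a sketch of the idea, not gensim's implementation: the helper name parse_svmlight_line is hypothetical, and it assumes well-formed input with whitespace-separated feature:value pairs.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line into (bow, target); return None for comments and blank lines.

    Hypothetical helper for illustration only, not part of gensim.
    """
    # Drop the trailing "# <info>" part, if any; a pure comment line becomes empty.
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    target, *fields = content.split()
    bow = []
    for field in fields:
        feature, value = field.rsplit(":", 1)
        if feature == "qid":
            continue  # the ranking feature is ignored
        # SVMlight ids are 1-based; shift them to 0-based ids.
        bow.append((int(feature) - 1, float(value)))
    return bow, target


print(parse_svmlight_line("+1 qid:3 1:0.5 4:2.0 # example"))
# ([(0, 0.5), (3, 2.0)], '+1')
```

Note that the target is kept as a string here, matching the return type documented for line2doc().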

Parameters:
  • fname (str) – Path to corpus.
  • store_labels (bool, optional) – Whether to store labels (~SVM target class). They currently have no application, but are stored in self.labels for convenience by default.
static doc2line(doc, label=0)

Convert the BoW representation of a document to SVMlight format. This method is the inverse of line2doc().

Parameters:
  • doc (list of (int, float)) – Document in BoW format.
  • label (int, optional) – Document label (if provided).
Returns:

doc in SVMlight format.

Return type:

str
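As a rough illustration of the conversion, the sketch below formats a BoW document as one SVMlight line, shifting the 0-based ids back to 1-based. The function name bow_to_svmlight_line is hypothetical, and the trailing newline is an assumption of this sketch rather than a documented guarantee of doc2line().

```python
def bow_to_svmlight_line(doc, label=0):
    """Format a BoW document as one SVMlight line (illustrative sketch, not gensim code)."""
    # Internal ids are 0-based; SVMlight expects 1-based feature ids.
    pairs = " ".join(f"{term_id + 1}:{value}" for term_id, value in doc)
    return f"{label} {pairs}\n"


print(repr(bow_to_svmlight_line([(0, 0.5), (3, 2.0)], label=1)))
# '1 1:0.5 4:2.0\n'
```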

docbyoffset(offset)

Get the document stored at file position offset.

Parameters:offset (int) – Document’s position.
Returns:Document in BoW format.
Return type:list of (int, float)
line2doc(line)

Get a document from a single line in SVMlight format. This method is the inverse of doc2line().

Parameters:line (str) – Line in SVMlight format.
Returns:Document in BoW format and target class label.
Return type:(list of (int, float), str)
classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()
Save object to file.
Returns:Object loaded from fname.
Return type:object
Raises:AttributeError – When called on an object instance instead of class (this is a class method).
save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, labels=False, metadata=False)

Save a corpus in the SVMlight format.

The SVMlight <target> class tag is taken from the labels array, or set to 0 for all documents if labels is not supplied.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – Mapping id -> word.
  • labels (list or False) – SVMlight <target> class tags, or False if not present.
  • metadata (bool) – ARGUMENT WILL BE IGNORED.
Returns:

Offsets for each line in file (in bytes).

Return type:

list of int
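The returned byte offsets are what make random access possible: a reader can seek straight to the start of a document's line, as docbyoffset() does. The sketch below shows the general technique with plain file I/O. The helper names are hypothetical and this is not gensim's code.

```python
import os
import tempfile


def write_lines_with_offsets(path, lines):
    """Write lines to path and return the byte offset at which each line starts."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before writing = start of this line
            fout.write(line.encode("utf8"))
    return offsets


def read_line_at(path, offset):
    """Seek to a byte offset and read exactly one line."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


lines = ["0 2:0.3 3:0.1\n", "0 2:0.1\n", "0 3:0.3\n"]
fd, path = tempfile.mkstemp(suffix=".svmlight")
os.close(fd)
try:
    offsets = write_lines_with_offsets(path, lines)
    second = read_line_at(path, offsets[1])
finally:
    os.remove(path)

print(offsets)       # [0, 14, 22]
print(repr(second))  # '0 2:0.1\n'
```

Opening the file in binary mode keeps tell() and seek() in plain byte units, which is why the offsets are reported in bytes.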

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows direct indexing after loading.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – Mapping id -> word.
  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.
  • progress_cnt (int, optional) – Number of documents after which progress info is printed.
  • labels (bool, optional) – If True, ignore the first column (class labels).
  • metadata (bool, optional) – If True, serialize also writes article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1]) # retrieve document no. 1 (0-based), etc.
[(1, 0.1)]