corpora.svmlightcorpus – Corpus in SVMlight format
Corpus in SVMlight format.
- class gensim.corpora.svmlightcorpus.SvmLightCorpus(fname, store_labels=True)
Bases: IndexedCorpus
Corpus in SVMlight format.
Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
The “qid” feature (used for SVMlight ranking), if present, is ignored.
Notes
Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert features to 0-based ids internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
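The grammar and the id conversion described above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not Gensim's implementation: the function name parse_svmlight_line is hypothetical, and the sketch assumes everything after the first # on a line is a comment.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line into (doc, target); return None for comment or blank lines.

    Hypothetical helper for illustration only.
    """
    # Drop the trailing "# <info>" part; a line starting with "#" becomes empty.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    parts = line.split()
    target = parts[0]
    fields = [part.split(":", 1) for part in parts[1:]]
    # Ignore the "qid" ranking feature and shift 1-based ids to 0-based ids.
    doc = [(int(fid) - 1, float(val)) for fid, val in fields if fid != "qid"]
    return doc, target


parsed = parse_svmlight_line("+1 1:0.5 qid:3 4:2 # some info")
# parsed == ([(0, 0.5), (3, 2.0)], "+1")
skipped = parse_svmlight_line("# a comment line")
# skipped is None
```

Note that the target is kept as a string here, which matches the return type documented for line2doc() below.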
- Parameters
fname (str) – Path to corpus.
store_labels (bool, optional) – Whether to store labels (~SVM target class). They currently have no application, but are stored in self.labels for convenience by default.
- add_lifecycle_event(event_name, log_level=20, **event)
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
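To make the automatically added keys concrete, the following sketch assembles an event record with the same five keys. It is an illustration only: build_lifecycle_event and its library_version parameter are hypothetical names, the exact value formats Gensim stores are not stated in this page, and logging is left out.

```python
import platform
import sys
from datetime import datetime


def build_lifecycle_event(event_name, library_version="unknown", **event):
    """Return a JSON-serializable event record (hypothetical helper)."""
    record = dict(event)  # copy the caller's key-value pairs
    record["datetime"] = datetime.now().isoformat()  # current date & time
    record["gensim"] = library_version  # placeholder for the library version
    record["python"] = sys.version  # current Python version
    record["platform"] = platform.platform()  # current platform
    record["event"] = event_name  # name of this event
    return record


lifecycle_events = []
lifecycle_events.append(build_lifecycle_event("created", msg="corpus initialised"))
```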
- static doc2line(doc, label=0)
Convert a document in BoW representation to SVMlight format. This method is the inverse of line2doc().
- Parameters
doc (list of (int, float)) – Document in BoW format.
label (int, optional) – Document label (if provided).
- Returns
doc in SVMlight format.
- Return type
str
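The conversion can be sketched as follows. This is an illustrative stand-in rather than the library code: format_svmlight_line is a hypothetical name, and the trailing newline and the default float formatting are assumptions of the sketch.

```python
def format_svmlight_line(doc, label=0):
    """Render a BoW document as one SVMlight line (hypothetical helper)."""
    # Shift 0-based ids back to the 1-based ids that SVMlight expects.
    pairs = " ".join(f"{term_id + 1}:{value}" for term_id, value in doc)
    return f"{label} {pairs}\n"


line = format_svmlight_line([(0, 0.5), (3, 2.0)], label=1)
# line == "1 1:0.5 4:2.0\n"
```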
- docbyoffset(offset)
Get the document stored at file position offset.
- Parameters
offset (int) – Position of the document in the file (in bytes).
- Return type
list of (int, float)
- line2doc(line)
Get a document from a single line in SVMlight format. This method is the inverse of doc2line().
- Parameters
line (str) – Line in SVMlight format.
- Returns
Document in BoW format and target class label.
- Return type
(list of (int, float), str)
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(*args, **kwargs)
Saves the in-memory state of the corpus (pickles the object).
Warning
This saves only the “internal state” of the corpus object, not the corpus data!
To save the corpus data, use the serialize method of your desired output format instead, e.g. gensim.corpora.mmcorpus.MmCorpus.serialize().
- static save_corpus(fname, corpus, id2word=None, labels=False, metadata=False)
Save a corpus in the SVMlight format.
The SVMlight <target> class tag is taken from the labels array, or set to 0 for all documents if labels is not supplied.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
labels (list or False) – SVMlight <target> class tags, one per document, or False if not present.
metadata (bool) – ARGUMENT WILL BE IGNORED.
- Returns
Offsets for each line in file (in bytes).
- Return type
list of int
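The returned byte offsets are what make random access such as docbyoffset() possible: a reader can seek straight to the start of a document's line. The sketch below shows the idea with plain file I/O. The helper names write_lines_with_offsets and read_line_at are hypothetical, and the sketch writes pre-formatted lines rather than calling Gensim.

```python
import os
import tempfile


def write_lines_with_offsets(path, lines):
    """Write lines to path and return the byte offset where each line starts."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before writing this line
            fout.write(line.encode("utf8"))
    return offsets


def read_line_at(path, offset):
    """Seek to a byte offset and read the single line stored there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


lines = ["0 1:0.5 4:2.0\n", "0 2:1.0\n", "0 3:0.25\n"]
path = os.path.join(tempfile.mkdtemp(), "corpus.svmlight")
offsets = write_lines_with_offsets(path, lines)
# offsets == [0, 14, 22]
second = read_line_at(path, offsets[1])
# second == "0 2:1.0\n"
```

Opening the file in binary mode keeps the offsets in bytes, so they stay valid even when a line contains multi-byte characters.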
- classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True, ignore the first column (class labels).
metadata (bool, optional) – If True, ensure that serialize writes article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (indexing is 0-based)
[(1, 0.1)]