`corpora.bleicorpus` – Corpus in Blei’s LDA-C format

Corpus in Blei’s LDA-C format.

class gensim.corpora.bleicorpus.BleiCorpus(fname, fname_vocab=None)

Bases: IndexedCorpus

Corpus in Blei’s LDA-C format.

The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.

Each document is one line:

N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN

The vocabulary is a file with words, one word per line; word at line K has an implicit id=K.
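To make the two layouts concrete, here is a minimal sketch of reading them in plain Python. It is not Gensim's implementation: parse_ldac_line and load_vocab are hypothetical helper names, and the sketch assumes that N counts the id:value pairs on the line and that implicit word ids are zero-based line positions.

```python
def parse_ldac_line(line):
    """Parse one non-empty LDA-C document line into (field_id, value) pairs."""
    parts = line.split()
    declared = int(parts[0])  # N: assumed to be the number of id:value pairs
    if declared != len(parts) - 1:
        raise ValueError(
            "expected %d fields, found %d: %r" % (declared, len(parts) - 1, line)
        )
    doc = []
    for part in parts[1:]:
        field_id, value = part.rsplit(":", 1)
        doc.append((int(field_id), float(value)))
    return doc


def load_vocab(lines):
    """Map implicit ids (zero-based line positions, an assumption) to words."""
    return {i: word.strip() for i, word in enumerate(lines)}


print(parse_ldac_line("2 0:1 3:2"))      # [(0, 1.0), (3, 2.0)]
print(load_vocab(["human\n", "user\n"]))  # {0: 'human', 1: 'user'}
```

A line whose leading count disagrees with the number of pairs raises ValueError in this sketch, which makes truncated lines easy to spot.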

Parameters

fname (str) – Path to corpus.
fname_vocab (str, optional) –
Path to the vocabulary file. If fname_vocab is None, the following locations are searched:
- fname.vocab
- fname/vocab.txt
- fname_without_ext.vocab
- fname_folder/vocab.txt

Raises

IOError – If the vocabulary file doesn’t exist.
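The vocabulary lookup described above can be sketched as follows. This is an illustration only, not Gensim's code: candidate_vocab_paths and find_vocab are hypothetical names, and trying the locations in the listed order is an assumption of the sketch.

```python
import os


def candidate_vocab_paths(fname):
    """List the vocabulary locations named in the documentation."""
    base, _ext = os.path.splitext(fname)
    folder = os.path.dirname(fname)
    return [
        fname + ".vocab",                   # fname.vocab
        os.path.join(fname, "vocab.txt"),   # fname/vocab.txt
        base + ".vocab",                    # fname_without_ext.vocab
        os.path.join(folder, "vocab.txt"),  # fname_folder/vocab.txt
    ]


def find_vocab(fname):
    """Return the first candidate that exists (order is an assumption)."""
    for candidate in candidate_vocab_paths(fname):
        if os.path.isfile(candidate):
            return candidate
    raise IOError("could not find a vocabulary file for %r" % fname)
```

Passing fname_vocab explicitly avoids any dependence on this search.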

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will then not record events into self.lifecycle_events.

Parameters

event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

This method will automatically add the following key-values to event, so you don’t have to specify them:
- datetime: the current date & time
- gensim: the current Gensim version
- python: the current Python version
- platform: the current platform
- event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

docbyoffset(offset)

Get the document stored at the given offset. Offsets are returned by save_corpus().

Parameters: offset (int) – Position of the document in the file (in bytes).
Returns: Document in BoW format.
Return type: list of (int, float)
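The idea behind byte offsets can be illustrated with a small standalone sketch: record the file position before each line is written, then seek straight to that position to read one document back. The helper names write_lines_with_offsets and read_line_at are hypothetical and are not part of Gensim.

```python
import os
import tempfile


def write_lines_with_offsets(path, lines):
    """Write one document per line and return each line's byte offset."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position where this line starts
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def read_line_at(path, offset):
    """Jump to a byte offset and read the single line stored there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").rstrip("\n")


path = os.path.join(tempfile.mkdtemp(), "toy.lda-c")
offsets = write_lines_with_offsets(path, ["2 0:1 3:2", "1 5:4"])
print(offsets)                          # [0, 10]
print(read_line_at(path, offsets[1]))   # 1 5:4
```

Because the lookup seeks directly to the stored position, reading a document does not require scanning the lines before it.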

line2doc(line)

Convert a line in Blei’s LDA-C format to a document (BoW representation).

Parameters: line (str) – Line in Blei’s LDA-C format.
Returns: Document’s BoW representation.
Return type: list of (int, float)

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to the file that contains the saved object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save(): Save object to file.

Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Save the in-memory state of the corpus (pickles the object).

Warning

This saves only the “internal state” of the corpus object, not the corpus data!

To save the corpus data, use the serialize method of your desired output format instead, e.g. gensim.corpora.mmcorpus.MmCorpus.serialize().

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save a corpus in the LDA-C format.

Notes

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.

Parameters

fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Input corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word for corpus.
metadata (bool, optional) – This parameter is ignored.

Returns

Offsets for each line in file (in bytes).

Return type

list of int
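As a rough illustration of what the two output files contain, the sketch below renders BoW documents as LDA-C lines and lays out a vocabulary so that the word for id K sits at position K. It is not Gensim's implementation: format_ldac_line and vocab_lines are hypothetical names, and the "---" placeholder for ids without a word is an assumption of this sketch.

```python
def format_ldac_line(doc):
    """Render one BoW document as 'N id1:value1 ... idN:valueN'."""
    parts = ["%i:%g" % (field_id, value) for field_id, value in doc]
    return " ".join([str(len(parts))] + parts)


def vocab_lines(id2word, placeholder="---"):
    """List words so that the word for id K is at position K.

    The placeholder for ids that have no word is an assumption.
    """
    if not id2word:
        return []
    return [id2word.get(i, placeholder) for i in range(max(id2word) + 1)]


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
print([format_ldac_line(doc) for doc in corpus])
# ['2 1:0.3 2:0.1', '1 1:0.1', '1 2:0.3']
print(vocab_lines({0: "human", 2: "user"}))
# ['human', '---', 'user']
```

Writing each list one entry per line yields a document file and a vocabulary file in the layout described at the top of this page.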

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus with offset metadata, which allows direct indexing after loading.

Parameters

fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index. If None, the index is stored to fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True, ignore the first column (class labels).
metadata (bool, optional) – If True, ensure that serialize will write out article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]
