corpora.ucicorpus – Corpus in UCI format

Corpus in UCI format.

class gensim.corpora.ucicorpus.UciCorpus(fname, fname_vocab=None)

Bases: UciReader, IndexedCorpus

Corpus in the UCI bag-of-words format.

Parameters
  • fname (str) – Path to corpus in UCI format.

  • fname_vocab (str, optional) – Path to the vocabulary file. If None, fname with the '.vocab' extension appended is used.

Examples

>>> from gensim.corpora import UciCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = UciCorpus(datapath('testcorpus.uci'))
>>> for document in corpus:
...     pass
add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

create_dictionary()

Generate gensim.corpora.dictionary.Dictionary directly from the corpus and vocabulary data.

Returns

Dictionary, based on corpus.

Return type

gensim.corpora.dictionary.Dictionary

Examples

>>> from gensim.corpora.ucicorpus import UciCorpus
>>> from gensim.test.utils import datapath
>>> ucc = UciCorpus(datapath('testcorpus.uci'))
>>> dictionary = ucc.create_dictionary()
docbyoffset(self, offset)

Get the document at file offset offset (in bytes).

Parameters

offset (int) – File offset, in bytes, of the desired document.

Returns

Document in sparse bag-of-words format.

Return type

list of (int, float)
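
To illustrate what a byte offset makes possible, here is a minimal pure-Python sketch. It is not gensim's implementation. It assumes documents are stored as `docid termid count` lines with 1-based ids, grouped by document, that every document is non-empty, and that the offset of each document's first line was recorded while writing. The helper names write_docs and doc_at_offset are hypothetical.

```python
import io


def write_docs(buf, corpus):
    """Write `docid termid count` lines; return the byte offset of each document."""
    offsets = []
    for docno, doc in enumerate(corpus):
        offsets.append(buf.tell())  # position where this document starts
        for termid, count in doc:
            # ids are written 1-based (assumption stated in the lead-in)
            buf.write(b"%d %d %d\n" % (docno + 1, termid + 1, count))
    return offsets


def doc_at_offset(buf, offset):
    """Seek to `offset` and read lines until the document id changes."""
    buf.seek(offset)
    document, current = [], None
    for line in buf:
        docid, termid, value = line.split()
        if current is None:
            current = docid
        elif docid != current:
            break  # reached the first line of the next document
        document.append((int(termid) - 1, float(value)))
    return document


buf = io.BytesIO()
corpus = [[(0, 2), (1, 1)], [(1, 3)], [(0, 1), (2, 4)]]
offsets = write_docs(buf, corpus)
second = doc_at_offset(buf, offsets[1])
print(offsets, second)
```

Because the reader jumps straight to the stored position, fetching one document does not require scanning the documents before it.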

input

Type

object

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

num_docs

Type

long long

num_nnz

Type

long long

num_terms

Type

long long

save(*args, **kwargs)

Saves the in-memory state of the corpus (pickles the object).

Warning

This saves only the “internal state” of the corpus object, not the corpus data!

To save the corpus data, use the serialize method of your desired output format instead, e.g. gensim.corpora.mmcorpus.MmCorpus.serialize().

static save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)

Save a corpus in the UCI Bag-of-Words format.

Warning

This function is called automatically by gensim.corpora.ucicorpus.UciCorpus.serialize(). Don’t call it directly; call gensim.corpora.ucicorpus.UciCorpus.serialize() instead.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, int)) – Corpus in BoW format.

  • id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}, optional) – Mapping between words and their ids. If None - will be inferred from corpus.

  • progress_cnt (int, optional) – Progress counter, write log message each progress_cnt documents.

  • metadata (bool, optional) – Ignored by this method.

Notes

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
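
As a rough illustration of the two-file layout, the sketch below writes and reads a vocabulary file by hand. It assumes the vocabulary file holds one word per line, with the zero-based line number serving as the word id. The helper names write_vocab and read_vocab are hypothetical, and this is not gensim's own code.

```python
import os
import tempfile


def write_vocab(fname_vocab, id2word):
    """Write one word per line, ordered by word id."""
    with open(fname_vocab, "w", encoding="utf8") as fout:
        for word_id in range(len(id2word)):
            fout.write(id2word[word_id] + "\n")


def read_vocab(fname_vocab):
    """Rebuild the id -> word mapping from line positions."""
    with open(fname_vocab, encoding="utf8") as fin:
        return {word_id: line.rstrip("\n") for word_id, line in enumerate(fin)}


with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "corpus.uci")
    fname_vocab = fname + ".vocab"  # vocabulary sits next to the corpus file
    write_vocab(fname_vocab, {0: "human", 1: "computer", 2: "interface"})
    vocab = read_vocab(fname_vocab)

print(vocab)
```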

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows direct indexing of documents after loading.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word.

  • index_fname (str, optional) – Where to save resulting index, if None - store index to fname.index.

  • progress_cnt (int, optional) – Number of documents after which progress info is printed.

  • labels (bool, optional) – If True - ignore first column (class labels).

  • metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (indexing is zero-based)
[(1, 0.1)]
skip_headers(input_file)

Skip headers in input_file.

Parameters

input_file (file) – File object.

transposed

Type

bool

class gensim.corpora.ucicorpus.UciReader(input)

Bases: MmReader

Reader of UCI format for gensim.corpora.ucicorpus.UciCorpus.

Parameters

input (str) – Path to file in UCI format.

docbyoffset(self, offset)

Get the document at file offset offset (in bytes).

Parameters

offset (int) – File offset, in bytes, of the desired document.

Returns

Document in sparse bag-of-words format.

Return type

list of (int, float)

input

Type

object

num_docs

Type

long long

num_nnz

Type

long long

num_terms

Type

long long

skip_headers(input_file)

Skip headers in input_file.

Parameters

input_file (file) – File object.

transposed

Type

bool

class gensim.corpora.ucicorpus.UciWriter(fname)

Bases: MmWriter

Writer of UCI format for gensim.corpora.ucicorpus.UciCorpus.

Notes

This corpus format is identical to the Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html), except for different file headers. There is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.

Parameters

fname (str) – Path to output file.
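
The layout described in the notes can be sketched by hand. The example below is an illustration only, with hypothetical helper names (dump_uci, parse_uci). It assumes 1-based document and term ids, as in Matrix Market, and integer counts.

```python
import io


def dump_uci(corpus, num_terms):
    """Render a BoW corpus as UCI text: three header lines, then one triple per line."""
    num_docs = len(corpus)
    num_nnz = sum(len(doc) for doc in corpus)
    lines = [str(num_docs), str(num_terms), str(num_nnz)]
    for docno, doc in enumerate(corpus):
        for termid, count in sorted(doc):
            lines.append("%d %d %d" % (docno + 1, termid + 1, count))
    return "\n".join(lines) + "\n"


def parse_uci(text):
    """Read the three header values, then group the triples back into documents."""
    stream = io.StringIO(text)
    num_docs, num_terms, num_nnz = (int(stream.readline()) for _ in range(3))
    corpus = [[] for _ in range(num_docs)]
    for line in stream:
        docid, termid, count = line.split()
        corpus[int(docid) - 1].append((int(termid) - 1, int(count)))
    return (num_docs, num_terms, num_nnz), corpus


corpus = [[(0, 2), (1, 1)], [(1, 3)]]
text = dump_uci(corpus, num_terms=2)
header, parsed = parse_uci(text)
print(text)
```

Unlike Matrix Market, no `%%MatrixMarket` format line precedes the counts, and each count occupies its own line.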

FAKE_HEADER = b'                    \n'
HEADER_LINE = b'%%MatrixMarket matrix coordinate real general\n'
MAX_HEADER_LENGTH = 20
close()

Close self.fout file.

fake_headers(num_docs, num_terms, num_nnz)

Write “fake” headers to file, to be rewritten once we’ve scanned the entire corpus.

Parameters
  • num_docs (int) – Number of documents in corpus.

  • num_terms (int) – Number of terms in corpus.

  • num_nnz (int) – Number of non-zero elements in corpus.

update_headers(num_docs, num_terms, num_nnz)

Update headers with actual values.
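
The placeholder-then-rewrite approach can be sketched with an in-memory buffer. This is an illustration rather than gensim's code: it reuses the FAKE_HEADER value listed above, assumes one placeholder line per header value, and uses hypothetical function names.

```python
import io

FAKE_HEADER = b" " * 20 + b"\n"  # same value as the FAKE_HEADER attribute


def write_placeholder_headers(fout):
    """Reserve three fixed-width lines for num_docs, num_terms and num_nnz."""
    for _ in range(3):
        fout.write(FAKE_HEADER)


def rewrite_headers(fout, num_docs, num_terms, num_nnz):
    """Overwrite the reserved lines in place once the real counts are known."""
    offset = 0
    for number in (num_docs, num_terms, num_nnz):
        value = str(number).encode("ascii")
        if len(value) >= len(FAKE_HEADER):
            raise ValueError("header value does not fit in the reserved line")
        fout.seek(offset)
        fout.write(value)  # the placeholder's trailing spaces remain
        offset += len(FAKE_HEADER)


buf = io.BytesIO()
write_placeholder_headers(buf)
buf.write(b"1 1 2\n1 2 1\n2 2 3\n")  # document body written in a single pass
rewrite_headers(buf, num_docs=2, num_terms=2, num_nnz=3)
header_lines = buf.getvalue().split(b"\n")[:3]
header_values = [int(line.decode("ascii")) for line in header_lines]
print(header_values)
```

Fixed-width placeholders let the writer stream a corpus of unknown size once, then patch the counts without shifting the rest of the file.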

static write_corpus(fname, corpus, progress_cnt=1000, index=False)

Write corpus to file.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of list of (int, int)) – Corpus in BoW format.

  • progress_cnt (int, optional) – Progress counter, write log message each progress_cnt documents.

  • index (bool, optional) – If True - return offsets, otherwise - nothing.

Returns

Sequence of offsets to documents (in bytes), only if index=True.

Return type

list of int

write_headers()

Write blank header lines. Will be updated later, once corpus stats are known.

write_vector(docno, vector)

Write a single sparse vector to the file.

Parameters
  • docno (int) – Number of document.

  • vector (list of (int, number)) – Document in BoW format.

Returns

Max word index in vector and len of vector. If vector is empty, return (-1, 0).

Return type

(int, int)
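
The return value can be illustrated with a small stand-in function. This is a hypothetical sketch, not the library's implementation: vector_stats only reproduces the documented contract (maximum word index and vector length, or (-1, 0) for an empty vector), and the aggregation shown afterwards is one plausible way a writer might use it.

```python
def vector_stats(vector):
    """Return (max word index, number of entries), or (-1, 0) for an empty vector."""
    if not vector:
        return -1, 0
    return max(word_id for word_id, _ in vector), len(vector)


stats = [vector_stats(doc) for doc in ([(0, 2), (5, 1)], [(3, 1)], [])]
# Fold the per-document results into corpus-level counts.
num_terms = 1 + max(max_id for max_id, _ in stats)
num_nnz = sum(length for _, length in stats)
print(stats, num_terms, num_nnz)
```

Returning -1 for an empty vector means that `1 + max_id` contributes zero terms, so empty documents do not inflate the term count.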