corpora.ucicorpus – Corpus in UCI format

Corpus in UCI format.

class gensim.corpora.ucicorpus.UciCorpus(fname, fname_vocab=None)

Bases: gensim.corpora.ucicorpus.UciReader, gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in the UCI bag-of-words format.

Parameters:
  • fname (str) – Path to corpus in UCI format.
  • fname_vocab (str, optional) – Path to the vocabulary file.

Examples

>>> from gensim.corpora import UciCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = UciCorpus(datapath('testcorpus.uci'))
>>> for document in corpus:
...     pass
create_dictionary()

Generate gensim.corpora.dictionary.Dictionary directly from the corpus and vocabulary data.

Returns:Dictionary, based on corpus.
Return type:gensim.corpora.dictionary.Dictionary

Examples

>>> from gensim.corpora.ucicorpus import UciCorpus
>>> from gensim.test.utils import datapath
>>> ucc = UciCorpus(datapath('testcorpus.uci'))
>>> dictionary = ucc.create_dictionary()
docbyoffset(self, offset)

Get the document stored at the given file offset (in bytes).

Parameters:offset (int) – Offset, in bytes, of desired document.
Returns:Document in BoW format.
Return type:list of (int, float)
input

input – object

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When called on an instance (this method must be called on the class).
num_docs

num_docs – ‘int’

num_nnz

num_nnz – ‘int’

num_terms

num_terms – ‘int’

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the "state" of the corpus class, not the corpus data. To save the data, use save_corpus() instead.

Parameters:
  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
static save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)

Save a corpus in the UCI Bag-of-Words format.

Warning

This function is called automatically by gensim.corpora.ucicorpus.UciCorpus.serialize(). Do not call it directly; call gensim.corpora.ucicorpus.UciCorpus.serialize() instead.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
  • id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}, optional) – Mapping between words and their ids. If None, it is inferred from the corpus.
  • progress_cnt (int, optional) – Progress counter; a log message is written every progress_cnt documents.
  • metadata (bool, optional) – THIS PARAMETER WILL BE IGNORED.

Notes

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
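
This reference does not spell out the layout of the vocabulary file. The sketch below assumes the common UCI convention of one word per line, with the line order giving the word id. The helper name vocab_lines and the '---' placeholder for ids without a word are hypothetical, chosen only for illustration.

```python
def vocab_lines(id2word, num_terms, placeholder="---"):
    """Return one vocabulary line per term id, in id order (assumed layout)."""
    # Ids missing from the mapping get a placeholder so line numbers stay aligned.
    return [id2word.get(term_id, placeholder) + "\n" for term_id in range(num_terms)]


lines = vocab_lines({0: "human", 2: "computer"}, 3)
```

Keeping a line for every id, even unused ones, is what lets a reader recover the id from the line position alone.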

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows documents to be accessed directly by index after loading.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
  • id2word (dict of (int, str), optional) – Mapping id -> word.
  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.
  • progress_cnt (int, optional) – Number of documents after which progress info is printed.
  • labels (bool, optional) – If True, ignore the first column (class labels).
  • metadata (bool, optional) – If True, serialize also writes article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname) # `mm` document stream now has random access
>>> print(mm[1]) # retrieve document no. 1 (zero-based), etc.
[(1, 0.1)]
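
The random access that serialize enables rests on a list of byte offsets, one per document. The following is a minimal stdlib-only sketch of that idea, not gensim's implementation. The one-document-per-line text layout and the helper names write_with_offsets and read_at are assumptions made for illustration.

```python
import os
import tempfile


def write_with_offsets(path, corpus):
    """Write one document per line and return the byte offset of each line."""
    offsets = []
    with open(path, "wb") as fout:
        for doc in corpus:
            offsets.append(fout.tell())  # position where this document starts
            line = " ".join("%d:%s" % (term_id, weight) for term_id, weight in doc)
            fout.write((line + "\n").encode("utf8"))
    return offsets


def read_at(path, offset):
    """Seek to a byte offset and parse the single document stored there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        fields = fin.readline().decode("utf8").split()
    return [(int(t), float(w)) for t, w in (f.split(":") for f in fields)]


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
index = write_with_offsets(path, corpus)
second_doc = read_at(path, index[1])
```

Reading a document this way costs one seek and one line read, regardless of how many documents precede it in the file.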
skip_headers(input_file)

Skip headers in input_file.

Parameters:input_file (file) – File object.
transposed

transposed – ‘bool’

class gensim.corpora.ucicorpus.UciReader(input)

Bases: gensim.corpora._mmreader.MmReader

Reader of UCI format for gensim.corpora.ucicorpus.UciCorpus.

Parameters:input (str) – Path to file in UCI format.
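
For orientation, a UCI bag-of-words file can be parsed with a few lines of plain Python. This sketch is not the UciReader code. It assumes three header lines holding num_docs, num_terms, and num_nnz, followed by one "doc_id word_id count" triple per line, and it further assumes that ids in the file are 1-based, as in the UCI repository files. The function name parse_uci is hypothetical.

```python
import io


def parse_uci(stream):
    """Return (num_docs, num_terms, num_nnz, docs) from a UCI-format text stream."""
    # The first three lines are the header, one integer per line.
    num_docs, num_terms, num_nnz = (int(next(stream)) for _ in range(3))
    docs = [[] for _ in range(num_docs)]
    for line in stream:
        if not line.strip():
            continue
        doc_id, word_id, count = (int(x) for x in line.split())
        # Shift the assumed 1-based ids in the file to 0-based ids in memory.
        docs[doc_id - 1].append((word_id - 1, count))
    return num_docs, num_terms, num_nnz, docs


sample = io.StringIO("2\n3\n3\n1 1 2\n1 3 1\n2 2 4\n")
num_docs, num_terms, num_nnz, docs = parse_uci(sample)
```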
docbyoffset(self, offset)

Get the document stored at the given file offset (in bytes).

Parameters:offset (int) – Offset, in bytes, of desired document.
Returns:Document in BoW format.
Return type:list of (int, float)
input

input – object

num_docs

num_docs – ‘int’

num_nnz

num_nnz – ‘int’

num_terms

num_terms – ‘int’

skip_headers(input_file)

Skip headers in input_file.

Parameters:input_file (file) – File object.
transposed

transposed – ‘bool’

class gensim.corpora.ucicorpus.UciWriter(fname)

Bases: gensim.matutils.MmWriter

Writer of UCI format for gensim.corpora.ucicorpus.UciCorpus.

Notes

This corpus format is identical to the Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html), except for different file headers. There is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.

Parameters:fname (str) – Path to output file.
FAKE_HEADER = ' \n'
HEADER_LINE = '%%MatrixMarket matrix coordinate real general\n'
MAX_HEADER_LENGTH = 20
close()

Close self.fout file.

fake_headers(num_docs, num_terms, num_nnz)

Write “fake” headers to file.

Parameters:
  • num_docs (int) – Number of documents in corpus.
  • num_terms (int) – Number of terms in corpus.
  • num_nnz (int) – Number of non-zero elements in corpus.
update_headers(num_docs, num_terms, num_nnz)

Update headers with actual values.

static write_corpus(fname, corpus, progress_cnt=1000, index=False)

Write corpus to file.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of list of (int, int)) – Corpus in BoW format.
  • progress_cnt (int, optional) – Progress counter; a log message is written every progress_cnt documents.
  • index (bool, optional) – If True, return offsets; otherwise return nothing.
Returns:

Sequence of offsets to documents (in bytes), only if index=True.

Return type:

list of int

write_headers()

Write blank header lines. Will be updated later, once corpus stats are known.
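
The placeholder-then-update pattern can be illustrated without gensim. The sketch below is an illustration under stated assumptions, not the UciWriter source: it reserves three fixed-width blank lines, writes the body, then seeks back and overwrites the placeholders with the real counts. The width of 20 bytes mirrors MAX_HEADER_LENGTH above, but how gensim pads the values is not shown in this reference, so the padding here is a guess. The function name write_uci and the 1-based ids are likewise assumptions.

```python
import os
import tempfile

HEADER_WIDTH = 20  # bytes reserved per header line, newline included


def write_uci(path, corpus):
    """Write (doc, term, count) triples with headers that are filled in afterwards."""
    num_terms = num_nnz = 0
    with open(path, "wb") as fout:
        # Reserve space: three lines of spaces, each exactly HEADER_WIDTH bytes.
        fout.write((b" " * (HEADER_WIDTH - 1) + b"\n") * 3)
        for doc_no, doc in enumerate(corpus):
            for term_id, count in doc:
                # Ids are written 1-based here (an assumption about the format).
                fout.write(b"%d %d %d\n" % (doc_no + 1, term_id + 1, count))
                num_terms = max(num_terms, term_id + 1)
                num_nnz += 1
        # Now that the statistics are known, overwrite the placeholders in place.
        fout.seek(0)
        for value in (len(corpus), num_terms, num_nnz):
            fout.write(str(value).ljust(HEADER_WIDTH - 1).encode("ascii") + b"\n")


path = os.path.join(tempfile.mkdtemp(), "corpus.uci")
write_uci(path, [[(0, 2), (2, 1)], [(1, 4)]])
with open(path, "rb") as fin:
    lines = [line.strip() for line in fin]
```

Fixed-width placeholders matter because overwriting in place must not change the file length; otherwise the body that follows would be shifted or clobbered.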

write_vector(docno, vector)

Write a single sparse vector to the file.

Parameters:
  • docno (int) – Number of document.
  • vector (list of (int, number)) – Document in BoW format.
Returns:

Max word index in vector and len of vector. If vector is empty, return (-1, 0).

Return type:

(int, int)
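
The return contract is easy to misread, so here is a small stand-alone sketch of it. The function vector_stats is hypothetical and only mirrors the documented return values; it does not write anything to a file.

```python
def vector_stats(vector):
    """Return (max word index, number of entries), or (-1, 0) for an empty vector."""
    if not vector:
        return -1, 0
    return max(term_id for term_id, _ in vector), len(vector)


stats_full = vector_stats([(0, 2), (5, 1), (3, 4)])
stats_empty = vector_stats([])
```

A caller can accumulate these pairs across documents to obtain the largest term index and the total number of non-zero entries needed for the file headers.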