Corpus in the GibbsLda++ List-of-Words format.
List_Of_Words corpus handles input in GibbsLda++ format.
Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format, as follows:

    [M]
    [document1]
    [document2]
    ...
    [documentM]

The first line is the total number of documents [M]. Each line after that is one document. [documenti] is the ith document of the dataset and consists of a list of Ni words/terms:

    [documenti] = [wordi1] [wordi2] ... [wordiNi]

All [wordij] (i=1..M, j=1..Ni) are text strings, separated by the blank character.
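As a small illustration of this layout, the following sketch writes a two-document file and reads it back with plain Python. The file name, the helper names write_low and read_low, and the sample words are hypothetical; this is not the library's own reader.

```python
import os
import tempfile


def write_low(path, documents):
    """Write documents (lists of word strings) in the List-of-Words layout."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("%d\n" % len(documents))      # first line: document count M
        for words in documents:
            fout.write(" ".join(words) + "\n")   # one document per line


def read_low(path):
    """Read the file back, checking the declared count against the body."""
    with open(path, encoding="utf-8") as fin:
        declared = int(fin.readline())
        documents = [line.split() for line in fin]
    if declared != len(documents):
        raise ValueError(
            "header says %d documents, found %d" % (declared, len(documents))
        )
    return documents


# Hypothetical sample data, written to a temporary directory.
docs = [["human", "interface", "computer"], ["graph", "trees"]]
path = os.path.join(tempfile.mkdtemp(), "sample.low")
write_low(path, docs)
restored = read_low(path)
```

The count check in read_low is a design choice of this sketch: it catches a file whose header and body disagree early, before any model is trained on it.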
Initialize the corpus from a file.
id2word and line2words are optional parameters. If provided, id2word is a dictionary mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed from the documents.
line2words is a function which converts lines into tokens. Defaults to simple splitting on spaces.
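To make the two optional parameters concrete, here is a sketch of a custom tokenizer and of how a word_id mapping could be derived from tokenized lines when none is supplied. The names lowercase_split and build_id2word are hypothetical, and assigning ids in sorted vocabulary order is an assumption of this sketch, not a statement about the library's internals.

```python
def lowercase_split(line):
    """A line2words-style callable: lowercase the line, then split on whitespace."""
    return line.lower().split()


def build_id2word(lines, line2words=lowercase_split):
    """Collect the vocabulary of all lines and number the words.

    Assumption of this sketch: ids follow sorted word order.
    """
    vocabulary = set()
    for line in lines:
        vocabulary.update(line2words(line))
    return dict(enumerate(sorted(vocabulary)))


# Hypothetical document lines (without the leading count line).
lines = ["Graph minors trees", "graph Trees survey"]
id2word = build_id2word(lines)
```

Any callable that takes one line and returns a list of tokens can play the role of line2words here, so the default space-splitting behaviour can be swapped for lowercasing, stemming, or similar preprocessing.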
Return the document stored at file position offset.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save a corpus in the List-of-words format.
This function is automatically called by LowCorpus.serialize; don’t call it directly, call serialize instead.
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
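The idea behind this contract can be illustrated without the library: a writer that returns one byte offset per document, and a reader that seeks straight to an offset. The function names save_with_offsets and doc_by_offset and the one-document-per-line file layout are assumptions made for this sketch.

```python
import os
import tempfile


def save_with_offsets(path, documents):
    """Write one document per line and return the byte offset of each line."""
    offsets = []
    with open(path, "wb") as fout:
        for words in documents:
            offsets.append(fout.tell())   # position before writing the document
            fout.write((" ".join(words) + "\n").encode("utf-8"))
    return offsets


def doc_by_offset(path, offset):
    """Return the document that starts `offset` bytes into the file."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").split()


# Hypothetical sample data, written to a temporary directory.
docs = [["human", "interface"], ["graph", "minors", "survey"], ["trees"]]
path = os.path.join(tempfile.mkdtemp(), "docs.txt")
index = save_with_offsets(path, docs)
third = doc_by_offset(path, index[2])   # random access, no scan of earlier lines
```

The file is opened in binary mode so that tell() and seek() work in plain byte counts, which is what an offset index needs.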
>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm')  # `mm` document stream now has random access
>>> print(mm[42])  # retrieve document no. 42, etc.