corpora.textcorpus – Building corpora with dictionaries

Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word->integer id mapping), which is then used to construct sparse bag-of-words vectors (= sequences of (word_id, word_weight) 2-tuples).

This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override the TextCorpus.get_texts method to read one line (= one document) at a time, process it (lowercase, tokenize, etc.) and yield it as a sequence of words.

Overriding get_texts is enough; you can then initialize the corpus with e.g. MyTextCorpus(bz2.BZ2File('mycorpus.txt.bz2')) and it will behave like any other corpus of sparse vectors. The __iter__ method is set up automatically, and the dictionary is automatically populated with all word->id mappings.
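For instance, a minimal sketch of such a subclass might look like the following (the file name is hypothetical, and a real subclass will usually do more preprocessing):

import bz2

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus


class MyTextCorpus(TextCorpus):
    """Treat each line of the underlying file as one document."""

    def get_texts(self):
        for line in self.getstream():
            # decode to unicode, lowercase, and split on whitespace
            yield utils.to_unicode(line).lower().split()


corpus = MyTextCorpus(bz2.BZ2File('mycorpus.txt.bz2'))
for bow in corpus:  # each item is a sparse bag-of-words vector
    pass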

The resulting object can be used as input to all gensim models (TF-IDF, LSI, ...) and serialized to any of the supported formats (Matrix Market, SVMlight, Blei's LDA-C format, etc.).

See the gensim.test.test_miislita.CorpusMiislita class for a simple example.

class gensim.corpora.textcorpus.TextCorpus(input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None)

Bases: gensim.interfaces.CorpusABC

Helper class to simplify the pipeline of getting bag-of-words vectors (= a gensim corpus) from plain text.

This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.

Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support iteration over the corpus. You can use this class either by subclassing it or by constructing it with different preprocessing arguments.

The __iter__ method converts the lists of tokens produced by get_texts to BoW format using Dictionary.doc2bow. get_texts does the following:

  1. Calls getstream to get a generator over the texts. It yields each document in turn from the underlying text file or files.
  2. For each document from the stream, calls preprocess_text to produce a list of tokens; if metadata is enabled, it yields a 2-tuple with the document number as the second element.

Preprocessing consists of 0+ character_filters, a tokenizer, and 0+ token_filters.

The preprocessing consists of calling each filter in character_filters with the document text; the input text is not guaranteed to be unicode, so if unicode is required, the first filter should perform the conversion. The output of each character filter should be another string. The output from the final filter is fed to the tokenizer, which should split the string into a list of tokens (strings). Afterwards, the list of tokens is fed through each filter in token_filters. The final output returned from preprocess_text is the output of the final token filter.
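As a rough sketch (not the library's exact code), the data flow described above amounts to:

def apply_preprocessing(text, character_filters, tokenizer, token_filters):
    # each character filter maps string -> string
    for char_filter in character_filters:
        text = char_filter(text)
    # the tokenizer maps the filtered string to a list of tokens (strings)
    tokens = tokenizer(text)
    # each token filter maps a token list to another token list
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens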

So to use this class, you can either pass in different preprocessing functions using the character_filters, tokenizer, and token_filters arguments, or you can subclass it. If subclassing: override getstream to take text from different input sources in different formats. Override preprocess_text if you need different initial preprocessing, then call the TextCorpus.preprocess_text method to apply the normal preprocessing. You can also override get_texts in order to tag the documents (token lists) with different metadata.

The default preprocessing consists of:

  1. lowercase and convert to unicode; assumes utf8 encoding
  2. deaccent (asciifolding)
  3. collapse multiple whitespaces into a single one
  4. tokenize by splitting on whitespace
  5. remove words less than 3 characters long
  6. remove stopwords; see gensim.parsing.preprocessing for the list of stopwords
Parameters:
  • input (str) – path to top-level directory to traverse for corpus documents.
  • dictionary (Dictionary) – if a dictionary is provided, it will not be updated with the given corpus on initialization. If none is provided, a new dictionary will be built for the given corpus. If no corpus is given, the dictionary will remain uninitialized.
  • metadata (bool) – True to yield metadata with each document, else False (default).
  • character_filters (iterable of callable) – each will be applied to the text of each document in order, and should return a single string with the modified text. For Python 2, the original text will not be unicode, so it may be useful to convert to unicode as the first character filter. The default character filters lowercase, convert to unicode (strict utf8), perform ASCII-folding, then collapse multiple whitespaces.
  • tokenizer (callable) – takes as input the document text, preprocessed by all filters in character_filters; should return an iterable of tokens (strings).
  • token_filters (iterable of callable) – each will be applied to the iterable of tokens in order, and should return another iterable of tokens. These filters can add, remove, or replace tokens, or do nothing at all. The default token filters remove tokens less than 3 characters long and remove stopwords using the list in gensim.parsing.preprocessing.STOPWORDS.
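As an illustration of the constructor arguments above, here is a hedged sketch of building a TextCorpus with custom preprocessing callables; the file path and helper names are made up for the example, and note that passing explicit filters replaces the defaults described above:

import re

from gensim.corpora.textcorpus import TextCorpus


def lowercase(text):
    return text.lower()

def word_tokenize(text):
    return re.findall(r'\w+', text, re.UNICODE)

def drop_numbers(tokens):
    return [token for token in tokens if not token.isdigit()]

corpus = TextCorpus(
    'mycorpus.txt',                  # hypothetical path, one document per line
    character_filters=[lowercase],   # replaces the default character filters
    tokenizer=word_tokenize,
    token_filters=[drop_numbers],
)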
get_texts()

Iterate over the collection, yielding one document at a time. A document is a sequence of words (strings) that can be fed into Dictionary.doc2bow. Each document will be fed through preprocess_text. That method should be overridden to provide different preprocessing steps. This method will need to be overridden if the metadata you’d like to yield differs from the line number.

Returns:generator of lists of tokens (strings); each list corresponds to a preprocessed document from the corpus input.
getstream()

Yield documents from the underlying plain text collection (of one or more files). Each item yielded from this method will be considered a document by subsequent preprocessing methods.

init_dictionary(dictionary)

If dictionary is None, initialize to an empty Dictionary, and then if there is an input for the corpus, add all documents from that input. If the dictionary is already initialized, simply set it as the corpus’s dictionary.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

preprocess_text(text)

Apply preprocessing to a single text document. This should perform tokenization in addition to any other desired preprocessing steps.

Parameters:text (str) – document text read from plain-text file.
Returns:tokens produced from text as a result of preprocessing.
Return type:iterable of str
sample_texts(n, seed=None, length=None)

Yield n random documents from the corpus without replacement.

Given the number of remaining documents in the corpus, we need to choose n elements. The probability of choosing the current element is n/remaining; if we choose it, we decrease n and move on to the next element. Computing the corpus length may be a costly operation, so you can pass the optional parameter length instead.

Parameters:
  • n (int) – number of documents we want to sample.
  • seed (int|None) – if specified, use it as a seed for local random generator.
  • length (int|None) – if specified, use it as a guess of the corpus length. It must be positive and not greater than the actual corpus length.
Yields: list[str] – document represented as a list of tokens. See get_texts method.

Raises: ValueError – when n is invalid or length was set incorrectly.
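A hedged sketch of the selection rule described above (not the library's implementation): each document is kept with probability n/remaining, and n is decremented on every hit.

import random

def sample_without_replacement(docs, n, length, seed=None):
    rng = random.Random(seed)
    remaining = length
    for doc in docs:
        if n <= 0:
            break
        # keep this document with probability n / remaining
        if rng.random() < float(n) / remaining:
            yield doc
            n -= 1
        remaining -= 1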

save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, save_corpus is automatically called internally by serialize, which does save_corpus plus saves the index at the same time, so you will usually want to store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().

step_through_preprocess(text)

Yield tuples of functions and their output for each stage of preprocessing. This is useful for debugging issues with the corpus preprocessing pipeline.
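A small usage sketch, assuming corpus is any TextCorpus instance and that each yielded tuple is (preprocessing function, its output at that stage):

for func, output in corpus.step_through_preprocess(u'A Sample Document to debug.'):
    print(getattr(func, '__name__', func), '->', output)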

class gensim.corpora.textcorpus.TextDirectoryCorpus(input, dictionary=None, metadata=False, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None, lines_are_documents=False, **kwargs)

Bases: gensim.corpora.textcorpus.TextCorpus

Read documents recursively from a directory, where each file (or line of each file) is interpreted as a plain text document.

Parameters:
  • min_depth (int) – minimum depth in directory tree at which to begin searching for files. The default is 0, which means files starting in the top-level directory input will be considered.
  • max_depth (int) – max depth in directory tree at which files will no longer be considered. The default is None, which means recurse through all subdirectories.
  • pattern (str or Pattern) – regex to use for file name inclusion; all those files not matching this pattern will be ignored.
  • exclude_pattern (str or Pattern) – regex to use for file name exclusion; all files matching this pattern will be ignored.
  • lines_are_documents (bool) – if True, each line of each file is considered to be a document. If False (default), each file is considered to be a document.
  • kwargs – keyword arguments passed through to the TextCorpus constructor. This is in addition to the non-kwargs input, dictionary, and metadata. See TextCorpus.__init__ docstring for more details on these.
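For example, a hedged sketch of reading all *.txt files up to two directory levels deep, treating each line as a document (the directory path is hypothetical):

from gensim.corpora.textcorpus import TextDirectoryCorpus

corpus = TextDirectoryCorpus(
    './articles',                 # hypothetical top-level directory
    pattern=r'.*\.txt$',
    max_depth=2,
    lines_are_documents=True,
)

print(len(corpus.dictionary))     # vocabulary built from all matching files
for bow in corpus:                # one bag-of-words vector per line/document
    pass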
exclude_pattern
get_texts()

Iterate over the collection, yielding one document at a time. A document is a sequence of words (strings) that can be fed into Dictionary.doc2bow. Each document will be fed through preprocess_text. That method should be overridden to provide different preprocessing steps. This method will need to be overridden if the metadata you’d like to yield differs from the line number.

Returns:generator of lists of tokens (strings); each list corresponds to a preprocessed document from the corpus input.
getstream()

Yield documents from the underlying plain text collection (of one or more files). Each item yielded from this method will be considered a document by subsequent preprocessing methods.

If lines_are_documents was set to True, items will be lines from files. Otherwise there will be one item per file, containing the entire contents of the file.

init_dictionary(dictionary)

If dictionary is None, initialize to an empty Dictionary, and then if there is an input for the corpus, add all documents from that input. If the dictionary is already initialized, simply set it as the corpus’s dictionary.

iter_filepaths()

Lazily yield paths to each file in the directory structure within the specified range of depths. If a filename pattern to match was given, further filter to only those filenames that match.

lines_are_documents
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

max_depth
min_depth
pattern
preprocess_text(text)

Apply preprocessing to a single text document. This should perform tokenization in addition to any other desired preprocessing steps.

Parameters:text (str) – document text read from plain-text file.
Returns:tokens produced from text as a result of preprocessing.
Return type:iterable of str
sample_texts(n, seed=None, length=None)

Yield n random documents from the corpus without replacement.

Given the number of remaining documents in the corpus, we need to choose n elements. The probability of choosing the current element is n/remaining; if we choose it, we decrease n and move on to the next element. Computing the corpus length may be a costly operation, so you can pass the optional parameter length instead.

Parameters:
  • n (int) – number of documents we want to sample.
  • seed (int|None) – if specified, use it as a seed for local random generator.
  • length (int|None) – if specified, use it as a guess of the corpus length. It must be positive and not greater than the actual corpus length.
Yields: list[str] – document represented as a list of tokens. See get_texts method.

Raises: ValueError – when n is invalid or length was set incorrectly.

save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, save_corpus is automatically called internally by serialize, which does save_corpus plus saves the index at the same time, so you will usually want to store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().

step_through_preprocess(text)

Yield tuples of functions and their output for each stage of preprocessing. This is useful for debugging issues with the corpus preprocessing pipeline.

gensim.corpora.textcorpus.lower_to_unicode(text, encoding='utf8', errors='strict')

Lowercase text and convert to unicode.

gensim.corpora.textcorpus.remove_short(tokens, minsize=3)

Remove tokens shorter than minsize characters (3 by default).

gensim.corpora.textcorpus.remove_stopwords(tokens, stopwords=STOPWORDS)

Remove stopwords using the list from gensim.parsing.preprocessing.STOPWORDS.

gensim.corpora.textcorpus.strip_multiple_whitespaces(s)

Collapse multiple whitespace characters into a single space.
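Taken together, these helpers reproduce most of the default preprocessing pipeline; a small sketch of chaining them by hand:

from gensim.corpora.textcorpus import (
    lower_to_unicode, strip_multiple_whitespaces, remove_short, remove_stopwords,
)

text = strip_multiple_whitespaces(lower_to_unicode('The  Quick   Brown Foxes of Gensim'))
tokens = remove_stopwords(remove_short(text.split()))
# tokens is now something like ['quick', 'brown', 'foxes', 'gensim']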

gensim.corpora.textcorpus.walk(top, topdown=True, onerror=None, followlinks=False, depth=0)

This is a mostly copied version of os.walk from the Python 2 source code. The only difference is that it also returns the depth in the directory tree at which each yield takes place.