corpora.textcorpus – Tools for building corpora with dictionaries¶
This module provides some code scaffolding to simplify the use of a built dictionary for constructing bag-of-words (BoW) vectors.
Notes
Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word -> integer id mapping), which is then used to construct sparse bag-of-words vectors (= iterable of (word_id, word_weight) pairs).
This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override gensim.corpora.textcorpus.TextCorpus.get_texts() to read one line (= one document) at a time, process it (lowercase, tokenize, etc.) and yield it as a sequence of words. Overriding gensim.corpora.textcorpus.TextCorpus.get_texts() is enough; you can then initialize the corpus with e.g. MyTextCorpus("mycorpus.txt.bz2") and it will behave correctly as a corpus of sparse vectors. The __iter__() method is set up automatically, and the dictionary is automatically populated with all word -> id mappings.
The resulting object can be used as input to some of gensim's models (TfidfModel, LsiModel, LdaModel, …), and serialized in any supported format (Matrix Market, SVMlight, Blei's LDA-C format, etc.).
See also
gensim.test.test_miislita.CorpusMiislita
Good simple example.
gensim.corpora.textcorpus.TextCorpus(input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None)¶Bases: gensim.interfaces.CorpusABC
Helper class to simplify the pipeline of getting BoW vectors from plain text.
Notes
This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.
Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support the __iter__() corpus method. You can utilize this class in a few different ways: by subclassing it, or by constructing it with different preprocessing arguments.
The __iter__() method converts the lists of tokens produced by get_texts() to BoW format using gensim.corpora.dictionary.Dictionary.doc2bow().
get_texts() does the following:
1. Calls getstream() to get a generator over the texts, yielding each document in turn from the underlying text file or files.
2. For each document from the stream, calls preprocess_text() to produce a list of tokens. If metadata=True, it yields a 2-tuple with the document number as the second element.
Preprocessing consists of 0+ character_filters, a tokenizer, and 0+ token_filters. Preprocessing first calls each filter in character_filters with the document text; unicode is not guaranteed, so if desired, the first filter should convert to unicode. The output of each character filter should be another string. The output from the final character filter is fed to the tokenizer, which should split the string into a list of tokens (strings). Afterwards, the list of tokens is fed through each filter in token_filters. The final output of preprocess_text() is the output from the final token filter.
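The chaining described above can be sketched in plain Python (a minimal illustration of the filter ordering only, not gensim's actual implementation; the filter functions here are hypothetical stand-ins):

```python
def preprocess(text, character_filters, tokenizer, token_filters):
    """Chain character filters, then tokenize, then chain token filters."""
    for cfilter in character_filters:
        text = cfilter(text)       # each character filter: str -> str
    tokens = tokenizer(text)       # tokenizer: str -> list of str
    for tfilter in token_filters:
        tokens = tfilter(tokens)   # each token filter: list -> list
    return tokens

# Hypothetical filters, for illustration only:
lowercase = str.lower
split_on_space = str.split
drop_short = lambda toks: [t for t in toks if len(t) >= 3]

print(preprocess("The QUICK fox", [lowercase], split_on_space, [drop_short]))
# ['the', 'quick', 'fox']
```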
To use this class, either pass in different preprocessing functions via the character_filters, tokenizer, and token_filters arguments, or subclass it. If subclassing: override getstream() to take text from different input sources in different formats. Override preprocess_text() if you must provide different initial preprocessing, then call the parent preprocess_text() method to apply the normal preprocessing. You can also override get_texts() in order to tag the documents (token lists) with different metadata.
The default preprocessing consists of:
lower_to_unicode() - lowercase and convert to unicode (assumes utf8 encoding)
deaccent() - remove accents (ASCII folding)
strip_multiple_whitespaces() - collapse multiple whitespaces into a single one
simple_tokenize() - tokenize by splitting on whitespace
remove_short() - remove words less than 3 characters long
remove_stopwords() - remove stopwords
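The default chain can be approximated in plain Python (a rough sketch of the documented behavior, not the actual gensim implementation; the stopword set here is a tiny illustrative subset):

```python
import re
import unicodedata

def default_preprocess(raw, stopwords=frozenset({'the', 'and', 'to'})):
    # lower_to_unicode: decode (if bytes, assuming utf8) and lowercase
    text = raw.decode('utf8') if isinstance(raw, bytes) else raw
    text = text.lower()
    # deaccent: strip combining marks (ASCII folding)
    text = ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))
    # strip_multiple_whitespaces: collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)
    # simple_tokenize: split on whitespace
    tokens = text.split()
    # remove_short: drop tokens shorter than 3 characters
    tokens = [t for t in tokens if len(t) >= 3]
    # remove_stopwords: drop tokens found in the stopword set
    return [t for t in tokens if t not in stopwords]

print(default_preprocess(b'The  Caf\xc3\xa9 and   a DOG'))
# ['cafe', 'dog']
```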
input (str, optional) – Path to the top-level directory (file) to traverse for corpus documents.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
character_filters (iterable of callable, optional) – Each will be applied to the text of each document in order, and should return a single string with the modified text. For Python 2, the original text will not be unicode, so it may be useful to convert to unicode as the first character filter. If None, defaults to lower_to_unicode(), deaccent() and strip_multiple_whitespaces().
tokenizer (callable, optional) – Tokenizer for documents. If None, defaults to simple_tokenize().
token_filters (iterable of callable, optional) – Each will be applied to the iterable of tokens in order, and should return another iterable of tokens. These filters can add, remove, or replace tokens, or do nothing at all. If None, defaults to remove_short() and remove_stopwords().
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>>
>>>
>>> class CorpusMiislita(TextCorpus):
... stopwords = set('for a of the and to in on'.split())
...
... def get_texts(self):
... for doc in self.getstream():
... yield [word for word in utils.to_unicode(doc).lower().split() if word not in self.stopwords]
...
... def __len__(self):
... self.length = sum(1 for _ in self.get_texts())
... return self.length
>>>
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> len(corpus)
250
>>> document = next(iter(corpus.get_texts()))
get_texts()¶Generate documents from the corpus.
list of str – Document as a sequence of tokens (+ lineno if self.metadata).
getstream()¶Generate documents from the underlying plain text collection (of one or more files).
str – Document read from plain-text file.
Notes
Once the generator is exhausted, the self.length attribute is initialized.
init_dictionary(dictionary)¶Initialize/update the dictionary.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
load(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
preprocess_text(text)¶Apply self.character_filters, self.tokenizer and self.token_filters to a single text document.
text (str) – Document read from plain-text file.
List of tokens extracted from text.
list of str
sample_texts(n, seed=None, length=None)¶Generate n random documents from the corpus without replacement.
n (int) – Number of documents to sample.
seed (int, optional) – If specified, used as a seed for the local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, __len__ will be called.
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If we choose it, we decrease n and move to the next element.
list of str – Sampled document as a sequence of tokens.
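The selection scheme described in the Notes can be sketched in plain Python (an illustration of the algorithm, not gensim's actual code; the function name is hypothetical):

```python
import random

def sample_without_replacement(stream, n, length, seed=None):
    """Selection sampling: choose exactly n items from a stream of
    known length, each with equal probability, in a single pass."""
    rng = random.Random(seed)
    remaining = length
    picked = []
    for item in stream:
        if n <= 0:
            break
        # the current item is chosen with probability n / remaining
        if rng.random() < n / remaining:
            picked.append(item)
            n -= 1
        remaining -= 1
    return picked

docs = sample_without_replacement(iter(range(100)), 10, 100, seed=1)
# exactly 10 distinct values drawn from range(100)
```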
save(*args, **kwargs)¶Save the corpus's in-memory state.
Warning
This saves only the “state” of the corpus class, not the corpus data!
To save the data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus(fname, corpus, id2word=None, metadata=False)¶Save a corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided via the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time. Calling serialize() is preferred to calling save_corpus() directly.
fname (str) – Path to the output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of the corpus.
metadata (bool, optional) – If True, write additional metadata to a separate file as well.
step_through_preprocess(text)¶Apply the preprocessors one by one and yield the result after each step.
Notes
This is useful for debugging issues with the corpus preprocessing pipeline.
text (str) – Document text read from plain-text file.
(callable, object) – The preprocessor and its output (based on text).
gensim.corpora.textcorpus.TextDirectoryCorpus(input, dictionary=None, metadata=False, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None, lines_are_documents=False, **kwargs)¶Bases: gensim.corpora.textcorpus.TextCorpus
Read documents recursively from a directory. Each file or line (depending on lines_are_documents) is interpreted as a plain-text document.
input (str) – Path to the input file/folder.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
min_depth (int, optional) – Minimum depth in the directory tree at which to begin searching for files.
max_depth (int, optional) – Maximum depth in the directory tree beyond which files will no longer be considered. If None, depth is not limited.
pattern (str, optional) – Regex to use for file name inclusion; all files not matching this pattern will be ignored.
exclude_pattern (str, optional) – Regex to use for file name exclusion; all files matching this pattern will be ignored.
lines_are_documents (bool, optional) – If True, each line is considered a document; otherwise, each file is one document.
kwargs – Keyword arguments passed through to the TextCorpus constructor. See the gensim.corpora.textcorpus.TextCorpus.__init__() docstring for more details.
exclude_pattern¶
get_texts()¶Generate documents from the corpus.
list of str – Document as a sequence of tokens (+ lineno if self.metadata).
getstream()¶Generate documents from the underlying plain text collection (of one or more files).
str – One line if lines_are_documents is True; otherwise, the content of one file.
init_dictionary(dictionary)¶Initialize/update the dictionary.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
iter_filepaths()¶Lazily generate paths to each file in the directory structure within the specified range of depths. If a filename pattern to match was given, further filter to only those filenames that match.
str – Path to file.
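The traversal can be sketched with os.walk and a regex (an illustration under the documented semantics, not gensim's actual implementation; here depth 0 is the top directory itself):

```python
import os
import re

def iter_filepaths(top, min_depth=0, max_depth=None, pattern=None):
    """Lazily yield paths to files within a depth range, optionally
    keeping only filenames that match a regex pattern."""
    for dirpath, dirnames, filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        if max_depth is not None and depth > max_depth:
            dirnames[:] = []  # prune: don't descend any further
            continue
        if depth < min_depth:
            continue  # too shallow: skip files, but keep descending
        for name in filenames:
            if pattern is None or re.search(pattern, name):
                yield os.path.join(dirpath, name)
```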
lines_are_documents¶
load(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
max_depth¶
min_depth¶
pattern¶
preprocess_text(text)¶Apply self.character_filters, self.tokenizer and self.token_filters to a single text document.
text (str) – Document read from plain-text file.
List of tokens extracted from text.
list of str
sample_texts(n, seed=None, length=None)¶Generate n random documents from the corpus without replacement.
n (int) – Number of documents to sample.
seed (int, optional) – If specified, used as a seed for the local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, __len__ will be called.
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If we choose it, we decrease n and move to the next element.
list of str – Sampled document as a sequence of tokens.
save(*args, **kwargs)¶Save the corpus's in-memory state.
Warning
This saves only the “state” of the corpus class, not the corpus data!
To save the data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus(fname, corpus, id2word=None, metadata=False)¶Save a corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided via the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time. Calling serialize() is preferred to calling save_corpus() directly.
fname (str) – Path to the output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of the corpus.
metadata (bool, optional) – If True, write additional metadata to a separate file as well.
step_through_preprocess(text)¶Apply the preprocessors one by one and yield the result after each step.
Notes
This is useful for debugging issues with the corpus preprocessing pipeline.
text (str) – Document text read from plain-text file.
(callable, object) – The preprocessor and its output (based on text).
gensim.corpora.textcorpus.lower_to_unicode(text, encoding='utf8', errors='strict')¶Lowercase text and convert it to unicode, using gensim.utils.any2unicode().
text (str) – Input text.
encoding (str, optional) – Encoding that will be used for conversion.
errors (str, optional) – Error handling behaviour; passed through to the unicode conversion (Python 2 only).
Unicode version of text.
str
See also
gensim.utils.any2unicode()
Convert any string to unicode-string.
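The documented behavior can be approximated in plain Python (a sketch, not the actual gensim implementation; the function name is illustrative):

```python
def lower_to_unicode_sketch(text, encoding='utf8', errors='strict'):
    """Decode bytes to a unicode string if necessary, then lowercase."""
    if isinstance(text, bytes):
        text = text.decode(encoding, errors)
    return text.lower()

print(lower_to_unicode_sketch(b'Hello WORLD'))  # hello world
```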
gensim.corpora.textcorpus.remove_short(tokens, minsize=3)¶Remove tokens shorter than minsize characters.
tokens (iterable of str) – Sequence of tokens.
minsize (int, optional) – Minimal length of a token (inclusive).
List of tokens without short tokens.
list of str
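The behavior amounts to a simple length filter; sketched below (not gensim's actual code, the function name is illustrative):

```python
def remove_short_sketch(tokens, minsize=3):
    # keep only tokens that are at least minsize characters long
    return [token for token in tokens if len(token) >= minsize]

print(remove_short_sketch(['a', 'an', 'ant', 'hello']))  # ['ant', 'hello']
```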
gensim.corpora.textcorpus.remove_stopwords(tokens, stopwords=frozenset({'a', 'about', 'above', …}))¶Remove stopwords using the list from gensim.parsing.preprocessing.STOPWORDS (the default stopwords argument is that full stopword set).
tokens (iterable of str) – Sequence of tokens.
stopwords (iterable of str, optional) – Sequence of stopwords.
List of tokens without stopwords.
list of str
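The behavior amounts to a set-membership filter; sketched below (not gensim's actual code; the default set here is a tiny illustrative subset, not the full STOPWORDS):

```python
def remove_stopwords_sketch(tokens, stopwords=frozenset({'the', 'a', 'of'})):
    # drop every token found in the stopword set
    return [token for token in tokens if token not in stopwords]

print(remove_stopwords_sketch(['the', 'cat', 'of', 'doom']))  # ['cat', 'doom']
```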
gensim.corpora.textcorpus.strip_multiple_whitespaces(s)¶Collapse multiple whitespace characters into a single space.
s (str) – Input string.
String with collapsed whitespaces.
str
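The behavior can be expressed as one regex substitution; sketched below (not gensim's actual code, the function name is illustrative):

```python
import re

def strip_multiple_whitespaces_sketch(s):
    # replace every run of whitespace (spaces, tabs, newlines) with one space
    return re.sub(r'\s+', ' ', s)

print(strip_multiple_whitespaces_sketch('a  b\t\tc\nd'))  # a b c d
```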
gensim.corpora.textcorpus.walk(top, topdown=True, onerror=None, followlinks=False, depth=0)¶Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 4-tuple (depth, dirpath, dirnames, filenames).
top (str) – Root directory.
topdown (bool, optional) – If True, walk the tree top-down; dirnames can then be modified in-place to prune the search.
onerror (function, optional) – Some function, will be called with one argument, an OSError instance. It can report the error to continue with the walk, or raise the exception to abort the walk. Note that the filename is available as the filename attribute of the exception object.
followlinks (bool, optional) – If True - visit directories pointed to by symlinks, on systems that support them.
depth (int, optional) – Current depth in the file tree; don't pass it manually (it is used as an accumulator for the recursion).
Notes
This is a mostly copied version of os.walk from the Python 2 source code. The only difference is that it returns the depth in the directory tree structure at which each yield is taking place.
(int, str, list of str, list of str) – Depth, current path, visited directories, visited non-directories.
See also