corpora.textcorpus – Tools for building corpora with dictionaries¶
This module provides some code scaffolding to simplify the use of a built dictionary for constructing bag-of-words (BoW) vectors.
Notes
Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word -> integer id mapping), which is then used to construct sparse bag-of-words vectors (= iterable of (word_id, word_weight) pairs).
This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override gensim.corpora.textcorpus.TextCorpus.get_texts() to read one line (= one document) at a time, process it (lowercase, tokenize, etc.) and yield it as a sequence of words. Overriding gensim.corpora.textcorpus.TextCorpus.get_texts() is enough; you can then initialize the corpus with e.g. MyTextCorpus("mycorpus.txt.bz2") and it will behave correctly as a corpus of sparse vectors. The __iter__() method is set up automatically, and the dictionary is automatically populated with all word -> id mappings.
The resulting object can be used as input to some of gensim's models (TfidfModel, LsiModel, LdaModel, …), and serialized in any supported format (Matrix Market, SVMlight, Blei's LDA-C format, etc.).
See also
gensim.test.test_miislita.CorpusMiislita
Good simple example.
gensim.corpora.textcorpus.TextCorpus(input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None)¶Bases: gensim.interfaces.CorpusABC
Helper class to simplify the pipeline of getting BoW vectors from plain text.
Notes
This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.
Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support the __iter__() corpus method. You can utilize this class in a few different ways: by subclassing it, or by constructing it with different preprocessing arguments.
The __iter__() method converts the lists of tokens produced by get_texts() to BoW format using gensim.corpora.dictionary.Dictionary.doc2bow().
get_texts() does the following:
1. Calls getstream() to get a generator over the texts, yielding each document in turn from the underlying text file or files.
2. For each document from the stream, calls preprocess_text() to produce a list of tokens. If metadata=True, it yields a 2-tuple with the document number as the second element.
Preprocessing consists of 0+ character_filters, a tokenizer, and 0+ token_filters. Preprocessing first calls each filter in character_filters with the document text; unicode is not guaranteed, so if desired, the first filter should convert to unicode. The output of each character filter should be another string. The output from the final character filter is fed to the tokenizer, which should split the string into a list of tokens (strings). Afterwards, the list of tokens is fed through each filter in token_filters. The final output of preprocess_text() is the output from the final token filter.
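The chaining described above can be sketched in plain Python (a minimal illustration of the filter ordering only, not gensim's actual implementation; the filter functions here are hypothetical stand-ins):

```python
def preprocess(text, character_filters, tokenizer, token_filters):
    """Chain character filters, then tokenize, then chain token filters."""
    for cfilter in character_filters:
        text = cfilter(text)       # each character filter: str -> str
    tokens = tokenizer(text)       # tokenizer: str -> list of str
    for tfilter in token_filters:
        tokens = tfilter(tokens)   # each token filter: list -> list
    return tokens

# Hypothetical filters, for illustration only:
lowercase = str.lower
split_on_space = str.split
drop_short = lambda toks: [t for t in toks if len(t) >= 3]

print(preprocess("The QUICK fox", [lowercase], split_on_space, [drop_short]))
# ['the', 'quick', 'fox']
```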
To use this class, either pass in different preprocessing functions via the character_filters, tokenizer, and token_filters arguments, or subclass it. If subclassing: override getstream() to take text from different input sources in different formats. Override preprocess_text() if you must provide different initial preprocessing, then call the parent preprocess_text() method to apply the normal preprocessing. You can also override get_texts() in order to tag the documents (token lists) with different metadata.
The default preprocessing consists of:
lower_to_unicode() - lowercase and convert to unicode (assumes utf8 encoding)
deaccent() - remove accents (ASCII folding)
strip_multiple_whitespaces() - collapse multiple whitespaces into a single one
simple_tokenize() - tokenize by splitting on whitespace
remove_short() - remove words less than 3 characters long
remove_stopwords() - remove stopwords
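The default chain can be approximated in plain Python (a rough sketch of the documented behavior, not the actual gensim implementation; the stopword set here is a tiny illustrative subset):

```python
import re
import unicodedata

def default_preprocess(raw, stopwords=frozenset({'the', 'and', 'to'})):
    # lower_to_unicode: decode (if bytes, assuming utf8) and lowercase
    text = raw.decode('utf8') if isinstance(raw, bytes) else raw
    text = text.lower()
    # deaccent: strip combining marks (ASCII folding)
    text = ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))
    # strip_multiple_whitespaces: collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)
    # simple_tokenize: split on whitespace
    tokens = text.split()
    # remove_short: drop tokens shorter than 3 characters
    tokens = [t for t in tokens if len(t) >= 3]
    # remove_stopwords: drop tokens found in the stopword set
    return [t for t in tokens if t not in stopwords]

print(default_preprocess(b'The  Caf\xc3\xa9 and   a DOG'))
# ['cafe', 'dog']
```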
input (str, optional) – Path to the top-level directory (file) to traverse for corpus documents.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
character_filters (iterable of callable, optional) – Each will be applied to the text of each document in order, and should return a single string with the modified text. For Python 2, the original text will not be unicode, so it may be useful to convert to unicode as the first character filter. If None, defaults to lower_to_unicode(), deaccent() and strip_multiple_whitespaces().
tokenizer (callable, optional) – Tokenizer for documents. If None, defaults to simple_tokenize().
token_filters (iterable of callable, optional) – Each will be applied to the iterable of tokens in order, and should return another iterable of tokens. These filters can add, remove, or replace tokens, or do nothing at all. If None, defaults to remove_short() and remove_stopwords().
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>>
>>>
>>> class CorpusMiislita(TextCorpus):
... stopwords = set('for a of the and to in on'.split())
...
... def get_texts(self):
... for doc in self.getstream():
... yield [word for word in utils.to_unicode(doc).lower().split() if word not in self.stopwords]
...
... def __len__(self):
... self.length = sum(1 for _ in self.get_texts())
... return self.length
>>>
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> len(corpus)
250
>>> document = next(iter(corpus.get_texts()))
get_texts()¶Generate documents from the corpus.
list of str – Document as a sequence of tokens (+ lineno if self.metadata).
getstream()¶Generate documents from the underlying plain text collection (of one or more files).
str – Document read from plain-text file.
Notes
Once the generator is exhausted, the self.length attribute is initialized.
init_dictionary(dictionary)¶Initialize/update the dictionary.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
load(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
preprocess_text(text)¶Apply self.character_filters, self.tokenizer and self.token_filters to a single text document.
text (str) – Document read from plain-text file.
List of tokens extracted from text.
list of str
sample_texts(n, seed=None, length=None)¶Generate n random documents from the corpus without replacement.
n (int) – Number of documents to sample.
seed (int, optional) – If specified, used as a seed for the local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, __len__ will be called.
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If we choose it, we decrease n and move to the next element.
list of str – Sampled document as a sequence of tokens.
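The selection scheme described in the Notes can be sketched in plain Python (an illustration of the algorithm, not gensim's actual code; the function name is hypothetical):

```python
import random

def sample_without_replacement(stream, n, length, seed=None):
    """Selection sampling: choose exactly n items from a stream of
    known length, each with equal probability, in a single pass."""
    rng = random.Random(seed)
    remaining = length
    picked = []
    for item in stream:
        if n <= 0:
            break
        # the current item is chosen with probability n / remaining
        if rng.random() < n / remaining:
            picked.append(item)
            n -= 1
        remaining -= 1
    return picked

docs = sample_without_replacement(iter(range(100)), 10, 100, seed=1)
# exactly 10 distinct values drawn from range(100)
```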
save(*args, **kwargs)¶Save the corpus's in-memory state.
Warning
This saves only the “state” of the corpus class, not the corpus data!
To save the data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus(fname, corpus, id2word=None, metadata=False)¶Save a corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided via the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time. Calling serialize() is preferred to calling save_corpus() directly.
fname (str) – Path to the output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of the corpus.
metadata (bool, optional) – If True, write additional metadata to a separate file as well.
step_through_preprocess(text)¶Apply the preprocessors one by one and yield the result after each step.
Notes
This is useful for debugging issues with the corpus preprocessing pipeline.
text (str) – Document text read from plain-text file.
(callable, object) – The preprocessor and its output (based on text).
gensim.corpora.textcorpus.TextDirectoryCorpus(input, dictionary=None, metadata=False, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None, lines_are_documents=False, **kwargs)¶Bases: gensim.corpora.textcorpus.TextCorpus
Read documents recursively from a directory. Each file or line (depending on lines_are_documents) is interpreted as a plain-text document.
input (str) – Path to the input file/folder.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
min_depth (int, optional) – Minimum depth in the directory tree at which to begin searching for files.
max_depth (int, optional) – Maximum depth in the directory tree beyond which files will no longer be considered. If None, depth is not limited.
pattern (str, optional) – Regex to use for file name inclusion; all files not matching this pattern will be ignored.
exclude_pattern (str, optional) – Regex to use for file name exclusion; all files matching this pattern will be ignored.
lines_are_documents (bool, optional) – If True, each line is considered a document; otherwise, each file is one document.
kwargs – Keyword arguments passed through to the TextCorpus constructor. See the gensim.corpora.textcorpus.TextCorpus.__init__() docstring for more details.
exclude_pattern¶
get_texts()¶Generate documents from the corpus.
list of str – Document as a sequence of tokens (+ lineno if self.metadata).
getstream()¶Generate documents from the underlying plain text collection (of one or more files).
str – One line if lines_are_documents is True; otherwise, the content of one file.
init_dictionary(dictionary)¶Initialize/update the dictionary.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
iter_filepaths()¶Lazily generate paths to each file in the directory structure within the specified range of depths. If a filename pattern to match was given, further filter to only those filenames that match.
str – Path to file.
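The traversal can be sketched with os.walk and a regex (an illustration under the documented semantics, not gensim's actual implementation; here depth 0 is the top directory itself):

```python
import os
import re

def iter_filepaths(top, min_depth=0, max_depth=None, pattern=None):
    """Lazily yield paths to files within a depth range, optionally
    keeping only filenames that match a regex pattern."""
    for dirpath, dirnames, filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        if max_depth is not None and depth > max_depth:
            dirnames[:] = []  # prune: don't descend any further
            continue
        if depth < min_depth:
            continue  # too shallow: skip files, but keep descending
        for name in filenames:
            if pattern is None or re.search(pattern, name):
                yield os.path.join(dirpath, name)
```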
lines_are_documents¶
load(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
max_depth¶
min_depth¶
pattern¶
preprocess_text(text)¶Apply self.character_filters, self.tokenizer and self.token_filters to a single text document.
text (str) – Document read from plain-text file.
List of tokens extracted from text.
list of str
sample_texts(n, seed=None, length=None)¶Generate n random documents from the corpus without replacement.
n (int) – Number of documents to sample.
seed (int, optional) – If specified, used as a seed for the local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, __len__ will be called.
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If we choose it, we decrease n and move to the next element.
list of str – Sampled document as a sequence of tokens.
save(*args, **kwargs)¶Save the corpus's in-memory state.
Warning
This saves only the “state” of the corpus class, not the corpus data!
To save the data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus(fname, corpus, id2word=None, metadata=False)¶Save a corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided via the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time. Calling serialize() is preferred to calling save_corpus() directly.
fname (str) – Path to the output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of the corpus.
metadata (bool, optional) – If True, write additional metadata to a separate file as well.
step_through_preprocess(text)¶Apply the preprocessors one by one and yield the result after each step.
Notes
This is useful for debugging issues with the corpus preprocessing pipeline.
text (str) – Document text read from plain-text file.
(callable, object) – The preprocessor and its output (based on text).
gensim.corpora.textcorpus.lower_to_unicode(text, encoding='utf8', errors='strict')¶Lowercase text and convert it to unicode, using gensim.utils.any2unicode().
text (str) – Input text.
encoding (str, optional) – Encoding that will be used for conversion.
errors (str, optional) – Error handling behaviour; passed through to the unicode conversion (Python 2 only).
Unicode version of text.
str
See also
gensim.utils.any2unicode()
Convert any string to unicode-string.
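The documented behavior can be approximated in plain Python (a sketch, not the actual gensim implementation; the function name is illustrative):

```python
def lower_to_unicode_sketch(text, encoding='utf8', errors='strict'):
    """Decode bytes to a unicode string if necessary, then lowercase."""
    if isinstance(text, bytes):
        text = text.decode(encoding, errors)
    return text.lower()

print(lower_to_unicode_sketch(b'Hello WORLD'))  # hello world
```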
gensim.corpora.textcorpus.remove_short(tokens, minsize=3)¶Remove tokens shorter than minsize characters.
tokens (iterable of str) – Sequence of tokens.
minsize (int, optional) – Minimal length of a token (inclusive).
List of tokens without short tokens.
list of str
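The behavior amounts to a simple length filter; sketched below (not gensim's actual code, the function name is illustrative):

```python
def remove_short_sketch(tokens, minsize=3):
    # keep only tokens that are at least minsize characters long
    return [token for token in tokens if len(token) >= minsize]

print(remove_short_sketch(['a', 'an', 'ant', 'hello']))  # ['ant', 'hello']
```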
gensim.corpora.textcorpus.remove_stopwords(tokens, stopwords=frozenset({'a', 'about', 'above', …}))¶Remove stopwords using the list from gensim.parsing.preprocessing.STOPWORDS (the default stopwords argument is that full stopword set).
tokens (iterable of str) – Sequence of tokens.
stopwords (iterable of str, optional) – Sequence of stopwords.
List of tokens without stopwords.
list of str
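The behavior amounts to a set-membership filter; sketched below (not gensim's actual code; the default set here is a tiny illustrative subset, not the full STOPWORDS):

```python
def remove_stopwords_sketch(tokens, stopwords=frozenset({'the', 'a', 'of'})):
    # drop every token found in the stopword set
    return [token for token in tokens if token not in stopwords]

print(remove_stopwords_sketch(['the', 'cat', 'of', 'doom']))  # ['cat', 'doom']
```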
gensim.corpora.textcorpus.strip_multiple_whitespaces(s)¶Collapse multiple whitespace characters into a single space.
s (str) – Input string.
String with collapsed whitespaces.
str
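The behavior can be expressed as one regex substitution; sketched below (not gensim's actual code, the function name is illustrative):

```python
import re

def strip_multiple_whitespaces_sketch(s):
    # replace every run of whitespace (spaces, tabs, newlines) with one space
    return re.sub(r'\s+', ' ', s)

print(strip_multiple_whitespaces_sketch('a  b\t\tc\nd'))  # a b c d
```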
gensim.corpora.textcorpus.walk(top, topdown=True, onerror=None, followlinks=False, depth=0)¶Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 4-tuple (depth, dirpath, dirnames, filenames).
top (str) – Root directory.
topdown (bool, optional) – If True, walk the tree top-down; dirnames can then be modified in-place to prune the search.
onerror (function, optional) – Some function, will be called with one argument, an OSError instance. It can report the error to continue with the walk, or raise the exception to abort the walk. Note that the filename is available as the filename attribute of the exception object.
followlinks (bool, optional) – If True - visit directories pointed to by symlinks, on systems that support them.
depth (int, optional) – Current depth in the file tree; don't pass it manually (it is used as an accumulator for the recursion).
Notes
This is a mostly copied version of os.walk from the Python 2 source code. The only difference is that it returns the depth in the directory tree structure at which each yield is taking place.
(int, str, list of str, list of str) – Depth, current path, visited directories, visited non-directories.
See also