corpora.textcorpus – Tools for building corpora with dictionaries¶
This module provides some code scaffolding to simplify the use of a built dictionary for constructing BoW vectors.
Notes
Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word->integer id mapping), which is then used to construct sparse bag-of-words vectors (= iterable of (word_id, word_weight)).
This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override gensim.corpora.textcorpus.TextCorpus.get_texts() to read one line (= document) at a time, process it (lowercase, tokenize, etc.) and yield it as a sequence of words.
Overriding gensim.corpora.textcorpus.TextCorpus.get_texts() is enough; you can then initialize the corpus with e.g. MyTextCorpus("mycorpus.txt.bz2") and it will behave correctly, like a corpus of sparse vectors.
The __iter__() method is automatically set up, and the dictionary is automatically populated with all word->id mappings. The resulting object can be used as input to some of gensim's models (TfidfModel, LsiModel, LdaModel, …) and serialized in any supported format (Matrix Market, SvmLight, Blei's LDA-C format, etc.).
See also
gensim.test.test_miislita.CorpusMiislita
Good simple example.
- class gensim.corpora.textcorpus.TextCorpus(input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None)¶
Bases: CorpusABC
Helper class to simplify the pipeline of getting BoW vectors from plain text.
Notes
This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.
Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support the __iter__() corpus method. You have a few different ways of utilizing this class: via subclassing, or by construction with different preprocessing arguments.
The __iter__() method converts the lists of tokens produced by get_texts() to BoW format using gensim.corpora.dictionary.Dictionary.doc2bow().
get_texts() does the following:
1. Calls getstream() to get a generator over the texts, yielding each document in turn from the underlying text file or files.
2. For each document from the stream, calls preprocess_text() to produce a list of tokens. If metadata=True, it yields a 2-tuple with the document number as the second element.
Preprocessing consists of 0+ character_filters, a tokenizer, and 0+ token_filters.
The preprocessing consists of calling each filter in character_filters with the document text. Unicode is not guaranteed, and if desired, the first filter should convert to unicode. The output of each character filter should be another string. The output from the final filter is fed to the tokenizer, which should split the string into a list of tokens (strings). Afterwards, the list of tokens is fed through each filter in token_filters. The final output returned from preprocess_text() is the output from the final token filter.
So to use this class, you can either pass in different preprocessing functions using the character_filters, tokenizer, and token_filters arguments, or you can subclass it.
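As a sketch, the order in which these pieces are applied can be illustrated in plain Python (no gensim; the helper name preprocess and the sample filters below are illustrative, not gensim's):

```python
def preprocess(text, character_filters, tokenizer, token_filters):
    """Apply each character filter to the string, then tokenize,
    then apply each token filter to the token list."""
    for cf in character_filters:
        text = cf(text)            # str -> str
    tokens = tokenizer(text)       # str -> list of str
    for tf in token_filters:
        tokens = tf(tokens)        # list of str -> list of str
    return tokens


tokens = preprocess(
    "The  QUICK brown fox",
    character_filters=[str.lower, lambda s: " ".join(s.split())],
    tokenizer=str.split,
    token_filters=[lambda toks: [t for t in toks if t not in {"the"}]],
)
# tokens == ['quick', 'brown', 'fox']
```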
If subclassing: override getstream() to take text from different input sources in different formats. Override preprocess_text() if you must provide different initial preprocessing, then call the preprocess_text() method to apply the normal preprocessing. You can also override get_texts() in order to tag the documents (token lists) with different metadata.
The default preprocessing consists of:
- lower_to_unicode() - lowercase and convert to unicode (assumes utf8 encoding)
- deaccent() - deaccent (asciifolding)
- strip_multiple_whitespaces() - collapse multiple whitespaces into one
- simple_tokenize() - tokenize by splitting on whitespace
- remove_short_tokens() - remove words less than 3 characters long
- remove_stopword_tokens() - remove stopwords
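These defaults can be approximated in plain Python as follows (a sketch only: the stopword set is an illustrative subset, not gensim's actual list, and the real functions live in gensim.parsing/gensim.utils):

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "of", "and", "to", "in", "on"}  # illustrative subset


def lower_to_unicode(text):
    # assumes the input is already a decoded (utf8) str; just lowercase it
    return text.lower()


def deaccent(text):
    # asciifolding: drop combining marks after Unicode decomposition
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if unicodedata.category(ch) != "Mn")


def strip_multiple_whitespaces(text):
    return re.sub(r"\s+", " ", text)


def simple_tokenize(text):
    return text.split()


def remove_short_tokens(tokens, minsize=3):
    return [t for t in tokens if len(t) >= minsize]


def remove_stopword_tokens(tokens, stopwords=STOPWORDS):
    return [t for t in tokens if t not in stopwords]


text = strip_multiple_whitespaces(deaccent(lower_to_unicode("Études  of the  CAFÉ")))
tokens = remove_stopword_tokens(remove_short_tokens(simple_tokenize(text)))
# tokens == ['etudes', 'cafe']
```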
- Parameters
input (str, optional) – Path to top-level directory (file) to traverse for corpus documents.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
character_filters (iterable of callable, optional) – Each will be applied to the text of each document in order, and should return a single string with the modified text. For Python 2, the original text will not be unicode, so it may be useful to convert to unicode as the first character filter. If None, uses lower_to_unicode(), deaccent() and strip_multiple_whitespaces().
tokenizer (callable, optional) – Tokenizer for documents. If None, uses simple_tokenize().
token_filters (iterable of callable, optional) – Each will be applied to the iterable of tokens in order, and should return another iterable of tokens. These filters can add, remove, or replace tokens, or do nothing at all. If None, uses remove_short_tokens() and remove_stopword_tokens().
Examples
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>>
>>> class CorpusMiislita(TextCorpus):
...     stopwords = set('for a of the and to in on'.split())
...
...     def get_texts(self):
...         for doc in self.getstream():
...             yield [word for word in utils.to_unicode(doc).lower().split()
...                    if word not in self.stopwords]
...
...     def __len__(self):
...         self.length = sum(1 for _ in self.get_texts())
...         return self.length
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> len(corpus)
250
>>> document = next(iter(corpus.get_texts()))
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object's save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
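The bookkeeping described above can be sketched roughly as follows (plain Python, not gensim's actual implementation; the gensim-version key is omitted here, and the class name LifecycleMixin is illustrative):

```python
import datetime
import logging
import platform
import sys


class LifecycleMixin:
    """Rough sketch of the lifecycle-event bookkeeping described above."""

    def __init__(self):
        self.lifecycle_events = []

    def add_lifecycle_event(self, event_name, log_level=logging.INFO, **event):
        # auto-filled keys, so callers don't have to specify them
        event["event"] = event_name
        event["datetime"] = datetime.datetime.now().isoformat()
        event["python"] = sys.version
        event["platform"] = platform.platform()
        # setting lifecycle_events to None disables recording
        if self.lifecycle_events is not None:
            self.lifecycle_events.append(event)
        if log_level:
            logging.log(log_level, "lifecycle event: %r", event)


corpus_like = LifecycleMixin()
corpus_like.add_lifecycle_event("created", log_level=0, fname="mycorpus.txt")
```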
- get_texts()¶
Generate documents from corpus.
- Yields
list of str – Document as sequence of tokens (+ lineno if self.metadata)
- getstream()¶
Generate documents from the underlying plain text collection (of one or more files).
- Yields
str – Document read from plain-text file.
Notes
After the generator is exhausted, the self.length attribute is initialized.
- init_dictionary(dictionary)¶
Initialize/update dictionary.
- Parameters
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- preprocess_text(text)¶
Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.
- Parameters
text (str) – Document read from plain-text file.
- Returns
List of tokens extracted from text.
- Return type
list of str
- sample_texts(n, seed=None, length=None)¶
Generate n random documents from the corpus without replacement.
- Parameters
n (int) – Number of documents we want to sample.
seed (int, optional) – If specified, use it as a seed for local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of a corpus can be a costly operation). If not specified, __len__ will be called.
- Raises
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If it is chosen, we decrease n and move on to the next element.
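This is selection sampling (Knuth's Algorithm S). A standalone sketch (the function name and signature below are illustrative, not gensim's):

```python
import random


def sample_without_replacement(docs, n, length, seed=None):
    """Yield exactly n documents from an iterable of known length.

    Each element is chosen with probability n_remaining / items_remaining,
    which produces every possible n-subset with equal probability."""
    if n < 0 or n > length:
        raise ValueError("n must be between 0 and the corpus size")
    rng = random.Random(seed)
    remaining = length
    for doc in docs:
        if n <= 0:
            break
        if rng.random() < n / remaining:
            yield doc
            n -= 1          # one fewer element still to pick
        remaining -= 1      # one fewer element left in the stream


sample = list(sample_without_replacement(range(100), 10, length=100, seed=1))
```

Note that the sample preserves the original stream order, since elements are decided in a single pass.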
- Yields
list of str – Sampled document as sequence of tokens.
- save(*args, **kwargs)¶
Saves the in-memory state of the corpus (pickles the object).
Warning
This saves only the “internal state” of the corpus object, not the corpus data!
To save the corpus data, use the serialize method of your desired output format instead, e.g.
gensim.corpora.mmcorpus.MmCorpus.serialize()
.
- static save_corpus(fname, corpus, id2word=None, metadata=False)¶
Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).
In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.
Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?
- step_through_preprocess(text)¶
Apply preprocessor one by one and generate result.
Warning
This is useful for debugging issues with the corpus preprocessing pipeline.
- Parameters
text (str) – Document text read from plain-text file.
- Yields
(callable, object) – Pre-processor, output from pre-processor (based on text)
- class gensim.corpora.textcorpus.TextDirectoryCorpus(input, dictionary=None, metadata=False, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None, lines_are_documents=False, encoding='utf-8', **kwargs)¶
Bases: TextCorpus
Read documents recursively from a directory. Each file/line (depends on lines_are_documents) is interpreted as a plain text document.
- Parameters
input (str) – Path to input file/folder.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.
metadata (bool, optional) – If True, yield metadata with each document.
min_depth (int, optional) – Minimum depth in directory tree at which to begin searching for files.
max_depth (int, optional) – Max depth in directory tree at which files will no longer be considered. If None - not limited.
pattern (str, optional) – Regex to use for file name inclusion, all those files not matching this pattern will be ignored.
exclude_pattern (str, optional) – Regex to use for file name exclusion, all files matching this pattern will be ignored.
lines_are_documents (bool, optional) – If True - each line is considered a document, otherwise - each file is one document.
encoding (str, optional) – Encoding used to read the specified file or files in the specified directory.
kwargs – Keyword arguments passed through to the TextCorpus constructor; see the gensim.corpora.textcorpus.TextCorpus.__init__() docstring for more details.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object's save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- property exclude_pattern¶
- get_texts()¶
Generate documents from corpus.
- Yields
list of str – Document as sequence of tokens (+ lineno if self.metadata)
- getstream()¶
Generate documents from the underlying plain text collection (of one or more files).
- Yields
str – One document: a single line if lines_are_documents is True, otherwise the content of one file.
- init_dictionary(dictionary)¶
Initialize/update dictionary.
- Parameters
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, this does nothing.
- iter_filepaths()¶
Generate (lazily) paths to each file in the directory structure within the specified range of depths. If a filename pattern to match was given, further filter to only those filenames that match.
- Yields
str – Path to file
- property lines_are_documents¶
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- property max_depth¶
- property min_depth¶
- property pattern¶
- preprocess_text(text)¶
Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.
- Parameters
text (str) – Document read from plain-text file.
- Returns
List of tokens extracted from text.
- Return type
list of str
- sample_texts(n, seed=None, length=None)¶
Generate n random documents from the corpus without replacement.
- Parameters
n (int) – Number of documents we want to sample.
seed (int, optional) – If specified, use it as a seed for local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of a corpus can be a costly operation). If not specified, __len__ will be called.
- Raises
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n / remaining. If it is chosen, we decrease n and move on to the next element.
- Yields
list of str – Sampled document as sequence of tokens.
- save(*args, **kwargs)¶
Saves the in-memory state of the corpus (pickles the object).
Warning
This saves only the “internal state” of the corpus object, not the corpus data!
To save the corpus data, use the serialize method of your desired output format instead, e.g.
gensim.corpora.mmcorpus.MmCorpus.serialize()
.
- static save_corpus(fname, corpus, id2word=None, metadata=False)¶
Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).
In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.
Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?
- step_through_preprocess(text)¶
Apply preprocessor one by one and generate result.
Warning
This is useful for debugging issues with the corpus preprocessing pipeline.
- Parameters
text (str) – Document text read from plain-text file.
- Yields
(callable, object) – Pre-processor, output from pre-processor (based on text)
- gensim.corpora.textcorpus.walk(top, topdown=True, onerror=None, followlinks=False, depth=0)¶
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 4-tuple (depth, dirpath, dirnames, filenames).
- Parameters
top (str) – Root directory.
topdown (bool, optional) – If True, walk the tree top-down; dirnames can then be modified in-place to prune the search.
onerror (function, optional) – Some function, will be called with one argument, an OSError instance. It can report the error to continue with the walk, or raise the exception to abort the walk. Note that the filename is available as the filename attribute of the exception object.
followlinks (bool, optional) – If True - visit directories pointed to by symlinks, on systems that support them.
depth (int, optional) – Current depth in the file tree; don't pass it manually (it is used as an accumulator for the recursion).
Notes
This is a mostly copied version of os.walk from the Python 2 source code. The only difference is that it returns the depth in the directory tree structure at which each yield is taking place.
- Yields
(int, str, list of str, list of str) – Depth, current path, visited directories, visited non-directories.