summarization.textcleaner – Summarization pre-processing
This module contains functions and processors used for processing text: extracting sentences from text and working with acronyms and abbreviations.
SEPARATOR - Special separator used in abbreviations.
RE_SENTENCE - Pattern to split text into sentences.
AB_SENIOR - Pattern for detecting abbreviations (example: Sgt. Pepper).
AB_ACRONYM - Pattern for detecting acronyms.
AB_ACRONYM_LETTERS - Pattern for detecting acronyms (example: P.S. I love you).
UNDO_AB_SENIOR - Pattern like AB_SENIOR but with SEPARATOR between abbreviation and next word.
UNDO_AB_ACRONYM - Pattern like AB_ACRONYM but with SEPARATOR between abbreviation and next word.
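Example
A quick inspection of the module-level SEPARATOR constant (a minimal sketch; the '@' value shown is inferred from the replace_abbreviations description below and may differ between gensim versions):
>>> from gensim.summarization import textcleaner
>>> print(textcleaner.SEPARATOR)  # special separator inserted after abbreviations
@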
gensim.summarization.textcleaner.clean_text_by_sentences(text)
Tokenize a given text into sentences, applying filters and lemmatizing them.
Parameters: text (str) – Given text.
Returns: Sentences of the given text.
Return type: list of SyntacticUnit
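Example
A minimal sketch of calling clean_text_by_sentences; it assumes each returned SyntacticUnit exposes the original sentence as .text, and the exact filtered forms depend on gensim's default filters:
>>> from gensim.summarization.textcleaner import clean_text_by_sentences
>>> units = clean_text_by_sentences("Beautiful is better than ugly. Explicit is better than implicit.")
>>> for unit in units:
...     print(unit.text)
Beautiful is better than ugly.
Explicit is better than implicit.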
gensim.summarization.textcleaner.clean_text_by_word(text, deacc=True)
Tokenize a given text into words, applying filters and lemmatizing them.
Parameters:
text (str) – Given text.
deacc (bool, optional) – Remove accentuation if True.
Returns: Words as keys, SyntacticUnit as values.
Return type: dict
Example
>>> from gensim.summarization.textcleaner import clean_text_by_word
>>> clean_text_by_word("God helps those who help themselves")
{'god': Original unit: 'god' *-*-*-* Processed unit: 'god',
'help': Original unit: 'help' *-*-*-* Processed unit: 'help',
'helps': Original unit: 'helps' *-*-*-* Processed unit: 'help'}
gensim.summarization.textcleaner.get_sentences(text)
Sentence generator from provided text. The sentence pattern is set in RE_SENTENCE.
Parameters: text (str) – Input text.
Yields: str – Single sentence extracted from text.
Example
>>> from gensim.summarization.textcleaner import get_sentences
>>> text = "Does this text contain two sentences? Yes, it does."
>>> for sentence in get_sentences(text):
...     print(sentence)
Does this text contain two sentences?
Yes, it does.
gensim.summarization.textcleaner.join_words(words, separator=' ')
Concatenate words with a separator between elements.
Parameters:
words (list of str) – Given words.
separator (str, optional) – The separator between elements.
Returns: String of merged words with separator between elements.
Return type: str
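Example
A small illustrative use of join_words with the default and a custom separator (output shown as expected from the description above):
>>> from gensim.summarization.textcleaner import join_words
>>> print(join_words(['veni', 'vidi', 'vici']))
veni vidi vici
>>> print(join_words(['veni', 'vidi', 'vici'], separator='-'))
veni-vidi-vici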
gensim.summarization.textcleaner.merge_syntactic_units(original_units, filtered_units, tags=None)
Process the given sentences and their filtered (tokenized) copies into SyntacticUnit objects. Tags, if provided, are added to the produced units.
Parameters:
original_units (list) – List of original sentences.
filtered_units (list) – List of tokenized sentences.
tags (list of str, optional) – List of strings used as tags for each unit. None by default.
Returns: List of syntactic units (sentences).
Return type: list of SyntacticUnit
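Example
A minimal sketch pairing one original sentence with a hand-written filtered form (the filtered string here is only a stand-in for whatever the pipeline's filters would actually produce); it assumes the resulting SyntacticUnit exposes .text and .token:
>>> from gensim.summarization.textcleaner import merge_syntactic_units
>>> units = merge_syntactic_units(['Beautiful is better than ugly.'], ['beauti ugli'])
>>> print(units[0].text)
Beautiful is better than ugly.
>>> print(units[0].token)
beauti ugli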
gensim.summarization.textcleaner.replace_abbreviations(text)
Replace the blank space between an abbreviation and the next word with the '@' separator.
Parameters: text (str) – Input sentence.
Returns: Sentence with changed separator.
Return type: str
Example
>>> from gensim.summarization.textcleaner import replace_abbreviations
>>> print(replace_abbreviations("God bless you, please, Mrs. Robinson"))
God bless you, please, Mrs.@Robinson
gensim.summarization.textcleaner.replace_with_separator(text, separator, regexs)
Return the text with the separator substituted wherever the provided regular expressions match.
Parameters:
text (str) – Input text.
separator (str) – The separator between words to be replaced.
regexs (list of _sre.SRE_Pattern) – Regular expressions used in processing text.
Returns: Text with replaced separators.
Return type: str
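Example
A minimal sketch, assuming (as with the module's own AB_SENIOR/AB_ACRONYM patterns) that each regex captures two groups, which are re-joined with the separator in between:
>>> import re
>>> from gensim.summarization.textcleaner import replace_with_separator
>>> pattern = re.compile(r'(\w+)-(\w+)')  # capture the words on both sides of a hyphen
>>> print(replace_with_separator('twenty-one', ' ', [pattern]))
twenty one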
gensim.summarization.textcleaner.split_sentences(text)
Split the given text into a list of sentences, preserving abbreviations as set in AB_SENIOR and AB_ACRONYM.
Parameters: text (str) – Input text.
Returns: Sentences of the given text.
Return type: list of str
Example
>>> from gensim.summarization.textcleaner import split_sentences
>>> text = '''Beautiful is better than ugly.
... Explicit is better than implicit. Simple is better than complex.'''
>>> split_sentences(text)
['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.']
gensim.summarization.textcleaner.tokenize_by_word(text)
Tokenize input text. Before tokenizing, the text is transformed to lower case, and accentuation and acronyms (as set in AB_ACRONYM_LETTERS) are removed.
Parameters: text (str) – Given text.
Returns: Generator that yields the words of the given text in sequence.
Return type: generator
Example
>>> from gensim.summarization.textcleaner import tokenize_by_word
>>> g = tokenize_by_word('Veni. Vidi. Vici.')
>>> print(next(g))
veni
>>> print(next(g))
vidi
>>> print(next(g))
vici
gensim.summarization.textcleaner.undo_replacement(sentence)
Replace the '@' separator after each abbreviation back with a blank space.
Parameters: sentence (str) – Input sentence.
Returns: Sentence with changed separator.
Return type: str
Example
>>> from gensim.summarization.textcleaner import undo_replacement
>>> print(undo_replacement("God bless you, please, Mrs.@Robinson"))
God bless you, please, Mrs. Robinson