summarization.textcleaner – Summarization pre-processing
This module contains functions and processors for processing text, extracting sentences from text, and working with acronyms and abbreviations.
SEPARATOR - Special separator used in abbreviations.
RE_SENTENCE - Pattern to split text into sentences.
AB_SENIOR - Pattern for detecting abbreviations (example: Sgt. Pepper).
AB_ACRONYM - Pattern for detecting acronyms.
AB_ACRONYM_LETTERS - Pattern for detecting acronyms (example: P.S. I love you).
UNDO_AB_SENIOR - Pattern like AB_SENIOR but with SEPARATOR between abbreviation and next word.
UNDO_AB_ACRONYM - Pattern like AB_ACRONYM but with SEPARATOR between abbreviation and next word.
gensim.summarization.textcleaner.clean_text_by_sentences(text)
Tokenize the given text into sentences, apply filters, and lemmatize them.
text (str) – Given text.
Sentences of the given text.
list of SyntacticUnit
gensim.summarization.textcleaner.clean_text_by_word(text, deacc=True)
Tokenize the given text into words, apply filters, and lemmatize them.
text (str) – Given text.
deacc (bool, optional) – Remove accentuation if True.
Words as keys, SyntacticUnit as values.
dict
Example
>>> from gensim.summarization.textcleaner import clean_text_by_word
>>> clean_text_by_word("God helps those who help themselves")
{'god': Original unit: 'god' *-*-*-* Processed unit: 'god',
'help': Original unit: 'help' *-*-*-* Processed unit: 'help',
'helps': Original unit: 'helps' *-*-*-* Processed unit: 'help'}
gensim.summarization.textcleaner.get_sentences(text)
Generate sentences from the provided text. The sentence pattern is defined by RE_SENTENCE.
text (str) – Input text.
str – Single sentence extracted from text.
Example
>>> from gensim.summarization.textcleaner import get_sentences
>>> text = "Does this text contain two sentences? Yes, it does."
>>> for sentence in get_sentences(text):
...     print(sentence)
Does this text contain two sentences?
Yes, it does.
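The generator behaviour described above can be illustrated without the library. The following is a minimal sketch, assuming a simplified sentence pattern; `SENTENCE_PATTERN` and `iter_sentences` are hypothetical names for this example, and the pattern is not guaranteed to match RE_SENTENCE.

```python
import re

# Simplified stand-in for RE_SENTENCE (an assumption of this sketch):
# text ending in '.', '!' or '?' followed by whitespace or end of input,
# or else whatever remains on the current line.
SENTENCE_PATTERN = re.compile(r"(\S.+?[.!?])(?=\s+|$)|(\S.+?)(?=\n|$)")


def iter_sentences(text):
    """Lazily yield one matched sentence at a time."""
    for match in SENTENCE_PATTERN.finditer(text):
        yield match.group()


sentences = list(iter_sentences("Does this text contain two sentences? Yes, it does."))
print(sentences)
# prints ['Does this text contain two sentences?', 'Yes, it does.']
```

Because the function is a generator, nothing is matched until the caller starts iterating, which keeps memory use flat for long inputs.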
gensim.summarization.textcleaner.join_words(words, separator=' ')
Concatenate words with a separator between elements.
words (list of str) – Given words.
separator (str, optional) – The separator between elements.
String of merged words with separator between elements.
str
gensim.summarization.textcleaner.merge_syntactic_units(original_units, filtered_units, tags=None)
Combine the given sentences and their filtered (tokenized) copies into SyntacticUnit objects. Tags, if provided, are added to the produced units.
original_units (list) – List of original sentences.
filtered_units (list) – List of tokenized sentences.
tags (list of str, optional) – List of strings used as tags for each unit. Defaults to None.
List of syntactic units (sentences).
list of SyntacticUnit
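To make the pairing of original and filtered units concrete, here is a minimal sketch. `Unit` is a hypothetical stand-in for SyntacticUnit, and `merge_units` is an illustrative name; skipping units whose filtered form is empty and using one tag string per unit are assumptions of this sketch, not confirmed library behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for SyntacticUnit."""
    text: str                  # original sentence
    token: str                 # filtered (tokenized) copy
    tag: Optional[str] = None
    index: int = -1            # position in the original list


def merge_units(original_units, filtered_units, tags=None):
    """Pair each original unit with its filtered copy, position by position."""
    units = []
    for i, (text, token) in enumerate(zip(original_units, filtered_units)):
        if not token:
            # Assumption: a unit with nothing left after filtering is dropped.
            continue
        tag = tags[i] if tags else None
        units.append(Unit(text=text, token=token, tag=tag, index=i))
    return units


merged = merge_units(["Dogs bark.", "The.", "Cats purr."], ["dog bark", "", "cat purr"])
print([(u.index, u.token) for u in merged])
# prints [(0, 'dog bark'), (2, 'cat purr')]
```

Storing the original index on each unit lets later steps map a processed sentence back to its position in the source text.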
gensim.summarization.textcleaner.replace_abbreviations(text)
Replace the blank space between an abbreviation and the next word with the ‘@’ separator.
text (str) – Input sentence.
Sentence with changed separator.
str
Example
>>> from gensim.summarization.textcleaner import replace_abbreviations
>>> replace_abbreviations("God bless you, please, Mrs. Robinson")
'God bless you, please, Mrs.@Robinson'
gensim.summarization.textcleaner.replace_with_separator(text, separator, regexs)
Return the text with the separator substituted wherever one of the provided regular expressions matches.
text (str) – Input text.
separator (str) – The separator to place between words where a pattern matches.
regexs (list of _sre.SRE_Pattern) – Regular expressions used in processing text.
Text with replaced separators.
str
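A separator substitution of this kind can be sketched with two capture groups: one for the abbreviation and one for the start of the next word. `TITLE_PATTERN` and `replace_between_groups` are hypothetical names for this example, and the two-group convention is an assumption about how the patterns are shaped.

```python
import re

# Hypothetical pattern for this sketch: a capitalised title of two or three
# letters ending in a period, one whitespace character, then a word character.
TITLE_PATTERN = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")


def replace_between_groups(text, separator, patterns):
    """Rejoin the two captured groups of every match with `separator`."""
    result = text
    for pattern in patterns:
        # A callable replacement keeps the separator from being read as a regex escape.
        result = pattern.sub(lambda m: m.group(1) + separator + m.group(2), result)
    return result


replaced = replace_between_groups("Thank you, Mrs. Robinson", "@", [TITLE_PATTERN])
print(replaced)
# prints Thank you, Mrs.@Robinson
```

Text that matches none of the patterns is returned unchanged, so the function is safe to apply to every input.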
gensim.summarization.textcleaner.split_sentences(text)
Split the given text into a list of sentences, preserving the abbreviations matched by AB_SENIOR and AB_ACRONYM.
text (str) – Input text.
Sentences of given text.
list of str
Example
>>> from gensim.summarization.textcleaner import split_sentences
>>> text = '''Beautiful is better than ugly.
... Explicit is better than implicit. Simple is better than complex.'''
>>> split_sentences(text)
['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.']
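Preserving abbreviations while splitting can be seen as three steps: protect the space after an abbreviation, split, then restore the space. The sketch below follows that idea under stated assumptions; `SEP`, `TITLE`, `UNDO_TITLE`, `SENTENCE` and `split_keeping_titles` are hypothetical names, and the patterns are simplified and cover titles only.

```python
import re

SEP = "@"  # placeholder character, mirroring the '@' separator
# Hypothetical, simplified patterns for this sketch only.
TITLE = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")
UNDO_TITLE = re.compile(r"([A-Z][a-z]{1,2}\.)" + re.escape(SEP) + r"(\w)")
SENTENCE = re.compile(r"(\S.+?[.!?])(?=\s+|$)|(\S.+?)(?=\n|$)")


def split_keeping_titles(text):
    """Protect title abbreviations, split into sentences, then restore spaces."""
    # Step 1: "Mrs. Robinson" becomes "Mrs.@Robinson", so the period no longer
    # looks like the end of a sentence.
    protected = TITLE.sub(lambda m: m.group(1) + SEP + m.group(2), text)
    # Step 2: split on sentence-final punctuation.
    pieces = [m.group() for m in SENTENCE.finditer(protected)]
    # Step 3: put the blank space back in each sentence.
    return [UNDO_TITLE.sub(lambda m: m.group(1) + " " + m.group(2), p) for p in pieces]


result = split_keeping_titles("Please welcome Mrs. Robinson. She arrived early.")
print(result)
# prints ['Please welcome Mrs. Robinson.', 'She arrived early.']
```

Without step 1, the same sentence pattern would break the text after "Mrs." and produce three fragments instead of two sentences.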
gensim.summarization.textcleaner.tokenize_by_word(text)
Tokenize the input text. Before tokenizing, the text is converted to lower case, and accentuation and the acronyms matched by AB_ACRONYM_LETTERS are removed.
text (str) – Given text.
Generator that yields the words of the given text in sequence.
generator
Example
>>> from gensim.summarization.textcleaner import tokenize_by_word
>>> g = tokenize_by_word('Veni. Vedi. Vici.')
>>> print(next(g))
veni
>>> print(next(g))
vedi
>>> print(next(g))
vici
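The preprocessing order described above (handle acronyms, strip accents, lower-case, then tokenize) can be sketched with the standard library alone. `LETTER_ACRONYM`, `WORD`, `strip_accents` and `iter_words` are hypothetical names; dropping dotted acronyms outright is an assumption of this sketch, and the library may treat them differently.

```python
import re
import unicodedata

# Hypothetical pattern: two or more single letters, each followed by a period ("P.S.").
LETTER_ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")
WORD = re.compile(r"[^\W\d_]+")  # runs of letters only


def strip_accents(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def iter_words(text):
    """Yield lower-cased, accent-free words, skipping dotted acronyms."""
    without_acronyms = LETTER_ACRONYM.sub("", text)
    for match in WORD.finditer(strip_accents(without_acronyms).lower()):
        yield match.group()


words = list(iter_words("P.S. Café déjà vu."))
print(words)
# prints ['cafe', 'deja', 'vu']
```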
gensim.summarization.textcleaner.undo_replacement(sentence)
Restore the blank space in place of the @ separator after each abbreviation.
sentence (str) – Input sentence.
Sentence with changed separator.
str
Example
>>> from gensim.summarization.textcleaner import undo_replacement
>>> undo_replacement("God bless you, please, Mrs.@Robinson")
'God bless you, please, Mrs. Robinson'