summarization.textcleaner – Summarization pre-processing

This module contains functions and processors for cleaning text, extracting sentences from it, and handling acronyms and abbreviations.

Data

SEPARATOR - Special separator used in abbreviations.

RE_SENTENCE - Pattern to split text to sentences.

AB_SENIOR - Pattern for detecting abbreviations (example: Sgt. Pepper).

AB_ACRONYM - Pattern for detecting acronyms.

AB_ACRONYM_LETTERS - Pattern for detecting acronyms (example: P.S. I love you).

UNDO_AB_SENIOR - Pattern like AB_SENIOR but with SEPARATOR between abbreviation and next word.

UNDO_AB_ACRONYM - Pattern like AB_ACRONYM but with SEPARATOR between abbreviation and next word.

gensim.summarization.textcleaner.clean_text_by_sentences(text)

Tokenize the given text into sentences, applying filters and lemmatizing them.

Parameters: text (str) – Given text.
Returns: Sentences of the given text.
Return type: list of SyntacticUnit

gensim.summarization.textcleaner.clean_text_by_word(text, deacc=True)

Tokenize the given text into words, applying filters and lemmatizing them.

Parameters

text (str) – Given text.
deacc (bool, optional) – Remove accentuation if True.

Returns

Words as keys, SyntacticUnit as values.

Return type

dict

Example

>>> from gensim.summarization.textcleaner import clean_text_by_word
>>> clean_text_by_word("God helps those who help themselves")
{'god': Original unit: 'god' *-*-*-* Processed unit: 'god',
'help': Original unit: 'help' *-*-*-* Processed unit: 'help',
'helps': Original unit: 'helps' *-*-*-* Processed unit: 'help'}

gensim.summarization.textcleaner.get_sentences(text)

Generate sentences from the provided text. The sentence pattern is defined in RE_SENTENCE.

Parameters: text (str) – Input text.
Yields: str – Single sentence extracted from text.

Example

>>> from gensim.summarization.textcleaner import get_sentences
>>> text = "Does this text contain two sentences? Yes, it does."
>>> for sentence in get_sentences(text):
...     print(sentence)
Does this text contain two sentences?
Yes, it does.
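
The generator behaviour can be illustrated without gensim. The following is a minimal sketch, assuming a simplified sentence pattern; `SENTENCE_PATTERN` and `iter_sentences` are hypothetical names, and the pattern is not necessarily the one stored in RE_SENTENCE.

```python
import re

# Hypothetical pattern for illustration only: a run of text that starts at a
# non-space character and ends at '.', '!' or '?' followed by whitespace or
# the end of the text.
SENTENCE_PATTERN = re.compile(r"\S.*?[.!?](?=\s|$)")


def iter_sentences(text):
    """Yield each substring matched by SENTENCE_PATTERN, one at a time."""
    for match in SENTENCE_PATTERN.finditer(text):
        yield match.group()


sentences = list(iter_sentences("Does this text contain two sentences? Yes, it does."))
print(sentences)
```

Because the function is a generator, sentences are produced lazily; wrapping the call in `list()` is only needed when all of them are required at once.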

gensim.summarization.textcleaner.join_words(words, separator=' ')

Concatenate words, placing the separator between elements.

Parameters

words (list of str) – Given words.
separator (str, optional) – The separator between elements.

Returns

String of merged words with separator between elements.

Return type

str
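
The documented behaviour matches Python's built-in `str.join`. The sketch below uses a hypothetical stand-in, `join_words_sketch`, rather than the library function itself, so it runs without gensim.

```python
def join_words_sketch(words, separator=" "):
    """Stand-in for the documented behaviour: one separator between elements."""
    return separator.join(words)


joined_default = join_words_sketch(["veni", "vidi", "vici"])
joined_custom = join_words_sketch(["veni", "vidi", "vici"], separator="-")
print(joined_default)
print(joined_custom)
```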

gensim.summarization.textcleaner.merge_syntactic_units(original_units, filtered_units, tags=None)

Combine the given sentences and their filtered (tokenized) copies into SyntacticUnit objects. If tags are provided, they are added to the produced units.

Parameters

original_units (list) – List of original sentences.
filtered_units (list) – List of tokenized sentences.
tags (list of str, optional) – List of strings used as tags for each unit. Defaults to None.

Returns

List of syntactic units (sentences).

Return type

list of SyntacticUnit
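
The pairing of original units, filtered units and optional tags can be sketched as follows. This is an illustration only: `Unit` is a hypothetical stand-in for SyntacticUnit, `merge_units_sketch` is not the library function, and the tag values are invented. Any extra filtering the real function may perform is not modelled here.

```python
from collections import namedtuple

# Hypothetical stand-in for SyntacticUnit, used only in this sketch.
Unit = namedtuple("Unit", ["text", "token", "tag"])


def merge_units_sketch(original_units, filtered_units, tags=None):
    """Pair each original unit with its filtered copy and, if given, its tag."""
    units = []
    for i, (text, token) in enumerate(zip(original_units, filtered_units)):
        tag = tags[i] if tags else None  # no tags supplied means tag stays None
        units.append(Unit(text, token, tag))
    return units


untagged = merge_units_sketch(["helps", "help"], ["help", "help"])
tagged = merge_units_sketch(["helps", "help"], ["help", "help"], tags=["VB", "VB"])
print(untagged[0])
print(tagged[0])
```

The inputs reuse the words from the clean_text_by_word example, where 'helps' is processed to 'help'.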

gensim.summarization.textcleaner.replace_abbreviations(text)

Replace the blank space between an abbreviation and the next word with the ‘@’ separator.

Parameters: text (str) – Input sentence.
Returns: Sentence with changed separator.
Return type: str

Example

>>> from gensim.summarization.textcleaner import replace_abbreviations
>>> replace_abbreviations("God bless you, please, Mrs. Robinson")
'God bless you, please, Mrs.@Robinson'

gensim.summarization.textcleaner.replace_with_separator(text, separator, regexs)

Return the text with the separator substituted wherever one of the provided regular expressions matches.

Parameters

text (str) – Input text.
separator (str) – The separator between words to be replaced.
regexs (list of _sre.SRE_Pattern) – Regular expressions used in processing text.

Returns

Text with replaced separators.

Return type

str
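
One way to picture this substitution is with two-group patterns, where each match is rewritten as group 1, then the separator, then group 2. The sketch below rests on that assumption; `replace_with_separator_sketch` and `ABBREVIATION` are hypothetical names, and the pattern is illustrative rather than the library's own.

```python
import re


def replace_with_separator_sketch(text, separator, regexs):
    """Rewrite each two-group match as group 1 + separator + group 2."""
    for regex in regexs:
        # A callable replacement avoids backslash handling in the separator.
        text = regex.sub(lambda m: m.group(1) + separator + m.group(2), text)
    return text


# Hypothetical pattern: a short capitalised abbreviation ending in a dot,
# one whitespace character, then the first character of the next word.
ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")

protected = replace_with_separator_sketch(
    "God bless you, please, Mrs. Robinson", "@", [ABBREVIATION]
)
print(protected)
```

Passing a list of patterns lets several abbreviation styles be handled in one call, each applied in turn to the result of the previous one.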

gensim.summarization.textcleaner.split_sentences(text)

Split the given text and return a list of its sentences. Abbreviations matched by AB_SENIOR and AB_ACRONYM are preserved.

Parameters: text (str) – Input text.
Returns: Sentences of given text.
Return type: list of str

Example

>>> from gensim.summarization.textcleaner import split_sentences
>>> text = '''Beautiful is better than ugly.
... Explicit is better than implicit. Simple is better than complex.'''
>>> split_sentences(text)
['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.']
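
To see why abbreviations need protecting, the sketch below splits a text in three steps: mask the space after an abbreviation, split into sentences, then restore the space. It is a minimal illustration under stated assumptions: all names and patterns are hypothetical, and the '@' separator is taken from the replace_abbreviations example.

```python
import re

SEP = "@"  # assumed separator character
# Hypothetical patterns, chosen for this sketch only.
ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")
UNDO_ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)" + SEP + r"(\w)")
SENTENCE = re.compile(r"\S.*?[.!?](?=\s|$)")


def split_sentences_sketch(text):
    # 1. Mask the gap in "Mrs. Robinson" so it no longer looks like a sentence end.
    protected = ABBREVIATION.sub(r"\1" + SEP + r"\2", text)
    # 2. Split on sentence-ending punctuation followed by whitespace or end of text.
    # 3. Restore the blank space inside each extracted sentence.
    return [
        UNDO_ABBREVIATION.sub(r"\1 \2", match.group())
        for match in SENTENCE.finditer(protected)
    ]


text = "God bless you, please, Mrs. Robinson. Yes, it does."
naive = [match.group() for match in SENTENCE.finditer(text)]
result = split_sentences_sketch(text)
print(naive)
print(result)
```

The naive split breaks after "Mrs.", producing three fragments, while the protected version returns the two intended sentences.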

gensim.summarization.textcleaner.tokenize_by_word(text)

Tokenize the input text. Before tokenizing, the text is converted to lower case, and accentuation and the acronyms matched by AB_ACRONYM_LETTERS are removed.

Parameters: text (str) – Given text.
Returns: Generator that yields the words of the given text in order.
Return type: generator

Example

>>> from gensim.summarization.textcleaner import tokenize_by_word
>>> g = tokenize_by_word('Veni. Vedi. Vici.')
>>> print(next(g))
veni
>>> print(next(g))
vedi
>>> print(next(g))
vici
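
The preprocessing steps can be approximated with the standard library. This sketch assumes that handling an acronym means dropping the dots between its letters, which the text above does not spell out; `ACRONYM_LETTERS` and `tokenize_by_word_sketch` are hypothetical names, and the token pattern is an assumption.

```python
import re
import unicodedata

# Hypothetical pattern for two-letter dotted acronyms such as "P.S."
ACRONYM_LETTERS = re.compile(r"([a-zA-Z])\.([a-zA-Z])\.")


def tokenize_by_word_sketch(text):
    """Yield lower-cased, accent-free word tokens."""
    # Assumption: drop the dots of a dotted acronym so "P.S." becomes "PS".
    text = ACRONYM_LETTERS.sub(r"\1\2", text)
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Runs of letters count as words; digits, underscores and punctuation do not.
    for match in re.finditer(r"[^\W\d_]+", stripped):
        yield match.group()


tokens = list(tokenize_by_word_sketch("P.S. I love you, café."))
print(tokens)
```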

gensim.summarization.textcleaner.undo_replacement(sentence)

Replace the ‘@’ separator after each abbreviation with a blank space again.

Parameters: sentence (str) – Input sentence.
Returns: Sentence with changed separator.
Return type: str

Example

>>> from gensim.summarization.textcleaner import undo_replacement
>>> undo_replacement("God bless you, please, Mrs.@Robinson")
'God bless you, please, Mrs. Robinson'
