summarization.textcleaner – Summarization pre-processing


This module contains functions and processors for processing text, extracting sentences from text, and working with acronyms and abbreviations.

Data

SEPARATOR - Special separator used in abbreviations.
RE_SENTENCE - Pattern to split text to sentences.
AB_SENIOR - Pattern for detecting abbreviations (example: Sgt. Pepper).
AB_ACRONYM - Pattern for detecting acronyms.
AB_ACRONYM_LETTERS - Pattern for detecting acronyms (example: P.S. I love you).
UNDO_AB_SENIOR - Pattern like AB_SENIOR but with SEPARATOR between abbreviation and next word.
UNDO_AB_ACRONYM - Pattern like AB_ACRONYM but with SEPARATOR between abbreviation and next word.
gensim.summarization.textcleaner.clean_text_by_sentences(text)

Tokenize a given text into sentences, apply filters, and lemmatize them.

Parameters

text (str) – Given text.

Returns

Sentences of the given text.

Return type

list of SyntacticUnit

gensim.summarization.textcleaner.clean_text_by_word(text, deacc=True)

Tokenize a given text into words, apply filters, and lemmatize them.

Parameters
  • text (str) – Given text.

  • deacc (bool, optional) – Remove accentuation if True.

Returns

Words as keys, SyntacticUnit as values.

Return type

dict

Example

>>> from gensim.summarization.textcleaner import clean_text_by_word
>>> clean_text_by_word("God helps those who help themselves")
{'god': Original unit: 'god' *-*-*-* Processed unit: 'god',
'help': Original unit: 'help' *-*-*-* Processed unit: 'help',
'helps': Original unit: 'helps' *-*-*-* Processed unit: 'help'}
gensim.summarization.textcleaner.get_sentences(text)

Generate sentences from the provided text, using the sentence pattern set in RE_SENTENCE.

Parameters

text (str) – Input text.

Yields

str – Single sentence extracted from text.

Example

>>> from gensim.summarization.textcleaner import get_sentences
>>> text = "Does this text contain two sentences? Yes, it does."
>>> for sentence in get_sentences(text):
...     print(sentence)
...
Does this text contain two sentences?
Yes, it does.
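
The lazy, one-sentence-at-a-time behaviour can be sketched with the standard library alone. This is only an illustration: `SENTENCE_PATTERN` is a simplified stand-in for RE_SENTENCE (the module's actual pattern is not shown on this page), and `iter_sentences` is a hypothetical name.

```python
import re

# Simplified stand-in for RE_SENTENCE (an assumption, not the module's pattern):
# either text ending in . ! or ? that is followed by whitespace or end of text,
# or whatever remains up to the end of the line.
SENTENCE_PATTERN = re.compile(r"\S.*?[.!?](?=\s|$)|\S.*?(?=\n|$)")


def iter_sentences(text):
    """Yield sentences lazily, one regex match at a time."""
    for match in SENTENCE_PATTERN.finditer(text):
        sentence = match.group().strip()
        if sentence:
            yield sentence


sentences = list(iter_sentences("Does this text contain two sentences? Yes, it does."))
print(sentences)
```

Because the function is a generator, nothing is scanned until the caller starts iterating, which keeps memory use flat for long inputs.
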
gensim.summarization.textcleaner.join_words(words, separator=' ')

Concatenates words with separator between elements.

Parameters
  • words (list of str) – Given words.

  • separator (str, optional) – The separator between elements.

Returns

String of merged words with separator between elements.

Return type

str
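
The described behaviour corresponds to Python's built-in `str.join`. A minimal sketch, with `join_words_sketch` as a hypothetical stand-in rather than the module's own code:

```python
def join_words_sketch(words, separator=" "):
    """Concatenate words, placing `separator` between neighbouring elements."""
    return separator.join(words)


joined_default = join_words_sketch(["summarization", "pre", "processing"])
joined_dashed = join_words_sketch(["summarization", "pre", "processing"], separator="-")
print(joined_default)
print(joined_dashed)
```
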

gensim.summarization.textcleaner.merge_syntactic_units(original_units, filtered_units, tags=None)

Process the given sentences and their filtered (tokenized) copies into SyntacticUnit objects. If tags are provided, they are added to the produced units.

Parameters
  • original_units (list) – List of original sentences.

  • filtered_units (list) – List of tokenized sentences.

  • tags (list of str, optional) – List of strings used as tags for each unit. None by default.

Returns

List of syntactic units (sentences).

Return type

list of SyntacticUnit
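
The pairing step can be pictured as follows. This is a sketch under stated assumptions: `Unit` is a hypothetical stand-in for SyntacticUnit (the real class may carry other attributes), and `merge_units_sketch` assumes one tag string per unit.

```python
from collections import namedtuple

# Hypothetical stand-in for SyntacticUnit, used only for illustration.
Unit = namedtuple("Unit", ["text", "token", "tag"])


def merge_units_sketch(original_units, filtered_units, tags=None):
    """Pair each original unit with its filtered copy and an optional tag."""
    units = []
    for i, (text, token) in enumerate(zip(original_units, filtered_units)):
        tag = tags[i] if tags else None  # assumption: one tag per unit
        units.append(Unit(text, token, tag))
    return units


units = merge_units_sketch(
    ["God helps those who help themselves."],
    ["god help help"],
)
print(units[0].text)
print(units[0].token)
```
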

gensim.summarization.textcleaner.replace_abbreviations(text)

Replace the blank space between an abbreviation and the next word with the ‘@’ separator.

Parameters

text (str) – Input sentence.

Returns

Sentence with changed separator.

Return type

str

Example

>>> from gensim.summarization.textcleaner import replace_abbreviations
>>> replace_abbreviations("God bless you, please, Mrs. Robinson")
'God bless you, please, Mrs.@Robinson'
gensim.summarization.textcleaner.replace_with_separator(text, separator, regexs)

Return the text with the separator inserted wherever one of the provided regular expressions matches.

Parameters
  • text (str) – Input text.

  • separator (str) – The separator to place between the matched words.

  • regexs (list of _sre.SRE_Pattern) – Regular expressions used in processing text.

Returns

Text with replaced separators.

Return type

str
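
One way such a function can work is with two capturing groups per pattern, rejoined around the separator. The sketch below assumes exactly that; `ABBREVIATION` is an illustrative pattern and `replace_with_separator_sketch` a hypothetical name, neither taken from the module.

```python
import re

# Illustrative pattern (assumption): a short capitalised abbreviation such as
# "Mrs." (group 1), one whitespace character, then the first character of the
# next word (group 2).
ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")


def replace_with_separator_sketch(text, separator, regexs):
    """Rejoin the two captured groups of every match with `separator`."""
    replacement = r"\1" + separator + r"\2"
    for regex in regexs:
        text = regex.sub(replacement, text)
    return text


protected = replace_with_separator_sketch(
    "God bless you, please, Mrs. Robinson", "@", [ABBREVIATION]
)
print(protected)
```

Text that matches none of the patterns is returned unchanged, so the function is safe to apply to every input.
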

gensim.summarization.textcleaner.split_sentences(text)

Split and get list of sentences from given text. It preserves abbreviations set in AB_SENIOR and AB_ACRONYM.

Parameters

text (str) – Input text.

Returns

Sentences of given text.

Return type

list of str

Example

>>> from gensim.summarization.textcleaner import split_sentences
>>> text = '''Beautiful is better than ugly.
... Explicit is better than implicit. Simple is better than complex.'''
>>> split_sentences(text)
['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.']
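
Preserving abbreviations amounts to a protect, split, restore sequence: the space after an abbreviation is swapped for the separator so the splitter does not see a sentence boundary there, and the space is put back afterwards. The sketch below uses assumed, simplified patterns in place of AB_SENIOR, UNDO_AB_SENIOR and RE_SENTENCE; all names are hypothetical.

```python
import re

SEPARATOR = "@"
# Assumed, simplified patterns; the module's own definitions may differ.
ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")
UNDO_ABBREVIATION = re.compile(r"([A-Z][a-z]{1,2}\.)" + re.escape(SEPARATOR) + r"(\w)")
# A sentence ends at . ! or ? only when whitespace or end of text follows.
SENTENCE = re.compile(r"\S.*?[.!?](?=\s|$)|\S.*?(?=\n|$)")


def split_sentences_sketch(text):
    """Protect abbreviations, split into sentences, then restore the spaces."""
    protected = ABBREVIATION.sub(r"\1" + SEPARATOR + r"\2", text)
    sentences = []
    for match in SENTENCE.finditer(protected):
        sentence = UNDO_ABBREVIATION.sub(r"\1 \2", match.group().strip())
        if sentence:
            sentences.append(sentence)
    return sentences


result = split_sentences_sketch("Please welcome Mrs. Robinson. She is here.")
print(result)
```

Without the protection step, the period in "Mrs." would be followed by whitespace and would be treated as the end of a sentence.
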
gensim.summarization.textcleaner.tokenize_by_word(text)

Tokenize the input text. Before tokenizing, the text is converted to lower case, and accentuation and the acronyms set in AB_ACRONYM_LETTERS are removed.

Parameters

text (str) – Given text.

Returns

Generator that yields the words of the given text in sequence.

Return type

generator

Example

>>> from gensim.summarization.textcleaner import tokenize_by_word
>>> g = tokenize_by_word('Veni. Vidi. Vici.')
>>> print(next(g))
veni
>>> print(next(g))
vidi
>>> print(next(g))
vici
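
The lower-casing and accent-removal steps can be approximated with the standard library. This sketch leaves out the acronym step because AB_ACRONYM_LETTERS is not shown on this page; `WORD`, `deaccent` and `tokenize_by_word_sketch` are hypothetical names, and the token shape (runs of letters) is an assumption.

```python
import re
import unicodedata

# Assumed token shape: runs of alphabetic characters (no digits or underscores).
WORD = re.compile(r"[^\W\d_]+")


def deaccent(text):
    """Strip combining accent marks, so that 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def tokenize_by_word_sketch(text):
    """Lazily yield lower-cased, accent-free word tokens."""
    for match in WORD.finditer(deaccent(text.lower())):
        yield match.group()


tokens = list(tokenize_by_word_sketch("Crème brûlée, déjà vu!"))
print(tokens)
```
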
gensim.summarization.textcleaner.undo_replacement(sentence)

Replace the ‘@’ separator after each abbreviation with a blank space again.

Parameters

sentence (str) – Input sentence.

Returns

Sentence with changed separator.

Return type

str

Example

>>> from gensim.summarization.textcleaner import undo_replacement
>>> undo_replacement("God bless you, please, Mrs.@Robinson")
'God bless you, please, Mrs. Robinson'
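
Replacement and undo are meant to be inverse steps. The round trip can be checked with a small sketch; `AB` and `UNDO_AB` are assumed, simplified stand-ins for AB_SENIOR and UNDO_AB_SENIOR, not the module's definitions.

```python
import re

SEPARATOR = "@"
# Assumed, simplified stand-ins for AB_SENIOR and UNDO_AB_SENIOR.
AB = re.compile(r"([A-Z][a-z]{1,2}\.)\s(\w)")
UNDO_AB = re.compile(r"([A-Z][a-z]{1,2}\.)" + re.escape(SEPARATOR) + r"(\w)")

original = "God bless you, please, Mrs. Robinson"
protected = AB.sub(r"\1" + SEPARATOR + r"\2", original)  # space -> '@'
restored = UNDO_AB.sub(r"\1 \2", protected)              # '@' -> space
print(protected)
print(restored)
```

In this sketch the undo step always writes a single space, so an abbreviation originally followed by a tab or newline would come back with a space instead.
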