summarization.summarizer – TextRank Summariser

This module provides functions for summarizing texts. Sentences are ranked using a variation of the TextRank algorithm [1], and the highest-ranked sentences form the summary.

[1] Federico Barrios, Federico López, Luis Argerich, Rosita Wachenchauzer (2016). Variations of the Similarity Function of TextRank for Automated Summarization, https://arxiv.org/abs/1602.03606

Data

INPUT_MIN_LENGTH - Minimum number of sentences in the input text.
WEIGHT_THRESHOLD - Minimum weight of an edge between graph nodes. Smaller weights are set to zero.
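
The effect of WEIGHT_THRESHOLD can be pictured with a minimal sketch. The threshold value of 1e-3, the helper name apply_weight_threshold and the edge dictionary below are assumptions made for illustration; they are not taken from the module itself.

```python
# Assumed value, for illustration only; check the module for the real constant.
WEIGHT_THRESHOLD = 1e-3


def apply_weight_threshold(weights, threshold=WEIGHT_THRESHOLD):
    """Return a copy of `weights` where every edge weight below `threshold` is zero.

    `weights` maps a (node, node) pair to the similarity between the two nodes.
    This helper is hypothetical and only mirrors the description above.
    """
    return {
        edge: (weight if weight >= threshold else 0.0)
        for edge, weight in weights.items()
    }


# Hypothetical edge weights between three sentence nodes.
edges = {("s0", "s1"): 0.42, ("s0", "s2"): 0.0004, ("s1", "s2"): 0.05}
filtered = apply_weight_threshold(edges)
```

In this sketch only the edge between s0 and s2 falls under the threshold, so it is the only weight that becomes zero.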

Example

>>> from gensim.summarization.summarizer import summarize
>>> text = '''Rice Pudding - Poem by Alan Alexander Milne
... What is the matter with Mary Jane?
... She's crying with all her might and main,
... And she won't eat her dinner - rice pudding again -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... I've promised her dolls and a daisy-chain,
... And a book about animals - all in vain -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... She's perfectly well, and she hasn't a pain;
... But, look at her, now she's beginning again! -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... I've promised her sweets and a ride in the train,
... And I've begged her to stop for a bit and explain -
... What is the matter with Mary Jane?
... What is the matter with Mary Jane?
... She's perfectly well and she hasn't a pain,
... And it's lovely rice pudding for dinner again!
... What is the matter with Mary Jane?'''
>>> print(summarize(text))
And she won't eat her dinner - rice pudding again -
I've promised her dolls and a daisy-chain,
I've promised her sweets and a ride in the train,
And it's lovely rice pudding for dinner again!
gensim.summarization.summarizer.summarize(text, ratio=0.2, word_count=None, split=False)

Get a summarized version of the given text.

The output summary will consist of the most representative sentences and will be returned as a string, divided by newlines.

Note

The input should be a string containing at least INPUT_MIN_LENGTH sentences for the summary to make sense. The text is split into sentences using the split_sentences method in the gensim.summarization.textcleaner module. Note that newlines also divide sentences.
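
To make the note concrete, here is a rough sketch of newline-aware sentence splitting. It is not the library's split_sentences implementation: the regular expression, the helper names split_into_sentences and has_enough_sentences, and the minimum of 10 are all assumptions chosen for this example.

```python
import re

# Assumed value, for illustration only; check the module for the real constant.
INPUT_MIN_LENGTH = 10


def split_into_sentences(text):
    """Split `text` on newlines first, then after '.', '!' or '?' within each line."""
    sentences = []
    for line in text.split("\n"):
        for piece in re.split(r"(?<=[.!?])\s+", line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences


def has_enough_sentences(text, minimum=INPUT_MIN_LENGTH):
    """Return True if `text` contains at least `minimum` sentences."""
    return len(split_into_sentences(text)) >= minimum


sample = "What is the matter?\nShe is crying. She will not eat\nrice pudding again"
sentences = split_into_sentences(sample)
```

Because a newline always ends a sentence in this sketch, each line of a poem counts as its own sentence even without closing punctuation, which matches how the example above is summarized line by line.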

Parameters
  • text (str) – Given text.

  • ratio (float, optional) – Number between 0 and 1 that determines the fraction of the original text's sentences to keep in the summary.

  • word_count (int or None, optional) – Determines how many words the output will contain. If both word_count and ratio are provided, ratio is ignored.

  • split (bool, optional) – If True, a list of sentences will be returned. Otherwise a single joined string will be returned.

Returns

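  • list of str – If split is True, the most representative sentences of the given text, OR

  • str – Otherwise, the most representative sentences of the given text joined into one newline-separated string.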
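
The interaction between ratio, word_count and split can be sketched as follows. This is a simplified illustration and not gensim's code: the function select_sentences, its (position, sentence) input format, the simple word budget rule, and the restoring of document order are assumptions made for this example.

```python
def select_sentences(ranked, ratio=0.2, word_count=None, split=False):
    """Pick summary sentences from `ranked`.

    `ranked` is a hypothetical list of (position, sentence) pairs, ordered from
    most to least representative. `position` is the sentence's index in the text.
    """
    if word_count is None:
        # ratio selects a fraction of the available sentences.
        chosen = ranked[: int(len(ranked) * ratio)]
    else:
        # word_count takes precedence over ratio: add sentences while the
        # running total stays within the requested number of words.
        chosen, total = [], 0
        for position, sentence in ranked:
            n_words = len(sentence.split())
            if total + n_words > word_count:
                break
            chosen.append((position, sentence))
            total += n_words
    # Assumption: present the chosen sentences in their original order.
    ordered = [sentence for _, sentence in sorted(chosen)]
    return ordered if split else "\n".join(ordered)


ranked = [
    (3, "Graphs model sentence similarity."),
    (0, "TextRank ranks sentences."),
    (2, "Edges carry weights."),
    (1, "Summaries keep the top ones."),
]
by_ratio = select_sentences(ranked, ratio=0.5, split=True)
by_words = select_sentences(ranked, ratio=0.5, word_count=7)
```

Here by_ratio is a list because split is True, while by_words is a single newline-separated string whose length is governed by word_count rather than ratio.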

gensim.summarization.summarizer.summarize_corpus(corpus, ratio=0.2)

Get a list of the most important documents of a corpus using a variation of the TextRank algorithm [1].

Used as a helper for summarize().

Note

The input must have at least INPUT_MIN_LENGTH documents for the summary to make sense.

Parameters
  • corpus (list of list of (int, int)) – Given corpus.

  • ratio (float, optional) – Number between 0 and 1 that determines the fraction of the corpus documents to be chosen for the summary.

Returns

Most important documents of the given corpus, sorted by document score, highest first.

Return type

list of list of (int, int)
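
The general idea of ranking bag-of-words documents with a TextRank-style procedure can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the module's implementation: it swaps in cosine similarity for the similarity function studied in [1], runs a fixed number of PageRank power iterations, and the names cosine and rank_documents, the damping factor of 0.85 and the iteration count are all choices made for this example.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine similarity between two documents given as lists of (token_id, count)."""
    a, b = dict(doc_a), dict(doc_b)
    dot = sum(count * b.get(token_id, 0) for token_id, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(corpus, ratio=0.2, damping=0.85, iterations=50):
    """Return the top `ratio` fraction of documents, highest score first."""
    n = len(corpus)
    if n == 0:
        return []
    # Weighted, undirected similarity graph without self-loops.
    weights = [
        [cosine(corpus[i], corpus[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # Weighted PageRank by power iteration; each pass uses the previous scores.
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / totals[j]
                for j in range(n)
                if totals[j] > 0
            )
            for i in range(n)
        ]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in order[: int(n * ratio)]]


# Hypothetical corpus: the third document shares a token with every other one.
corpus = [
    [(0, 1)],
    [(1, 1)],
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(2, 1)],
    [(3, 1)],
]
top = rank_documents(corpus, ratio=0.2)
```

With five documents and ratio=0.2, the sketch keeps a single document: the one connected to all the others, since it collects the most score from its neighbours.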