summarization.keywords – Keywords for TextRank summarization algorithm

This module contains functions to find keywords in a text and to build a graph over the tokens of that text.

Examples

Extract keywords from text

>>> from gensim.summarization import keywords
>>> text = '''Challenges in natural language processing frequently involve
... speech recognition, natural language understanding, natural language
... generation (frequently from formal, machine-readable logical forms),
... connecting language and machine perception, dialog systems, or some
... combination thereof.'''
>>> keywords(text).split('\n')
[u'natural language', u'machine', u'frequently']

Notes

Check the part-of-speech tags listed at http://www.clips.ua.ac.be/pages/mbsp-tags and use only the first two letters of each tag for INCLUDING_FILTER and EXCLUDING_FILTER.
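
The two-letter rule above can be illustrated with a minimal sketch. The helper `keep_by_pos` and the hand-tagged word list are hypothetical and are not part of gensim; they only show how truncating a tag to its first two letters lets a single filter entry such as 'NN' match 'NN', 'NNS', and similar variants.

```python
def keep_by_pos(tagged_words, include=("NN", "JJ"), exclude=()):
    """Hypothetical helper: keep words whose two-letter tag prefix passes the filters."""
    kept = []
    for word, tag in tagged_words:
        prefix = tag[:2]  # e.g. "NNS" -> "NN", "JJR" -> "JJ"
        if include and prefix not in include:
            continue  # an include filter is set and this tag is not in it
        if exclude and prefix in exclude:
            continue  # this tag is explicitly excluded
        kept.append(word)
    return kept


# Hand-tagged toy input (tags chosen by hand for illustration only).
tagged = [("systems", "NNS"), ("connecting", "VBG"), ("formal", "JJ"), ("thereof", "RB")]
print(keep_by_pos(tagged))  # ['systems', 'formal']
```

Whether the library applies its filters in exactly this order is an assumption of the sketch; only the prefix matching is taken from the note above.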

Data:

WINDOW_SIZE - Size of the window: the number of consecutive tokens considered together during processing.
INCLUDING_FILTER - Part-of-speech tags to include.
EXCLUDING_FILTER - Part-of-speech tags to exclude.

gensim.summarization.keywords.get_graph(text)

Creates and returns a graph from the given text. The text is cleaned and tokenized before the graph is built.

Parameters

text (str) – Input text from which the graph is built.

Returns

Created graph.

Return type

Graph
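
To make the role of a token window concrete, here is a minimal sketch of a co-occurrence graph built with a sliding window. It is not gensim's Graph class or its construction code: `cooccurrence_graph` is a hypothetical helper, the graph is a plain dict of sets, the tokens are assumed to be already cleaned, and `window_size=2` is an illustrative value.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window_size=2):
    """Hypothetical helper: link distinct tokens that share a sliding window."""
    graph = defaultdict(set)
    for start in range(len(tokens)):
        window = tokens[start:start + window_size]
        first = window[0]
        graph[first]  # touch the key so isolated tokens still appear as nodes
        for other in window[1:]:
            if other != first:  # no self-loops
                graph[first].add(other)
                graph[other].add(first)  # edges are undirected
    return dict(graph)


tokens = ["natural", "language", "processing", "natural", "language", "generation"]
graph = cooccurrence_graph(tokens)
print(sorted(graph["language"]))  # ['generation', 'natural', 'processing']
```

A larger window links tokens that are further apart, which yields a denser graph.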

gensim.summarization.keywords.keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=('NN', 'JJ'), lemmatize=False, deacc=True)

Get the highest-ranked words of the provided text and/or their combinations (multi-word keywords).

Parameters
  • text (str) – Input text.

  • ratio (float, optional) – If “words” is not given, the number of returned keywords is the number of candidate words multiplied by this ratio; otherwise the ratio is ignored.

  • words (int, optional) – Number of returned words.

  • split (bool, optional) – If True, return the keywords as a list instead of a single string.

  • scores (bool, optional) – If True, return each keyword together with its score.

  • pos_filter (tuple, optional) – Part-of-speech tags to keep, given as the first two letters of each tag.

  • lemmatize (bool, optional) – If True, lemmatize words.

  • deacc (bool, optional) – If True, remove accent marks.

Returns

  • result (list of (str, float)) – If scores is True, keywords with their scores, OR

  • result (list of str) – If split is True, keywords only, OR

  • result (str) – Otherwise, keywords joined by newlines.
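
The interaction of ratio, words, split, and scores can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `select_and_format` is a hypothetical helper, the scores in `scored` are invented toy numbers rather than gensim output, and the sketch assumes that scores takes precedence over split, as the order of the Returns list suggests.

```python
def select_and_format(scored, ratio=0.2, words=None, split=False, scores=False):
    """Hypothetical helper mirroring the documented parameters and return shapes."""
    # Rank candidates from highest to lowest score.
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    # `words` fixes the count directly; otherwise `ratio` scales the candidate count.
    count = int(len(ranked) * ratio) if words is None else words
    top = ranked[:count]
    if scores:
        return top  # list of (str, float)
    names = [word for word, _ in top]
    if split:
        return names  # list of str
    return "\n".join(names)  # single newline-joined str


# Invented toy scores, for illustration only.
scored = {"natural language": 0.9, "machine": 0.6, "frequently": 0.5,
          "dialog": 0.2, "speech": 0.1}

print(select_and_format(scored, words=2, scores=True))
# [('natural language', 0.9), ('machine', 0.6)]
print(select_and_format(scored, ratio=0.4, split=True))
# ['natural language', 'machine']
print(repr(select_and_format(scored, words=2)))
# 'natural language\nmachine'
```

With the default string return, calling `.split('\n')` on the result, as in the example at the top of this page, recovers the same list that split=True would return.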