summarization.keywords – Keywords for TextRank summarization algorithm

This module provides functions for extracting keywords from a text and for building a graph over the text's tokens.

Examples

Extract keywords from text

>>> from gensim.summarization import keywords
>>> text='''Challenges in natural language processing frequently involve
... speech recognition, natural language understanding, natural language
... generation (frequently from formal, machine-readable logical forms),
... connecting language and machine perception, dialog systems, or some
... combination thereof.'''
>>> keywords(text).split('\n')
[u'natural language', u'machine', u'frequently']

Notes

The part-of-speech tags are listed at http://www.clips.ua.ac.be/pages/mbsp-tags. Use only the first two letters of a tag in INCLUDING_FILTER and EXCLUDING_FILTER.
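The two-letter rule can be illustrated with a small filter. This is a minimal sketch, not gensim's implementation: the helper name passes_pos_filter and the hand-tagged tokens are assumptions made for illustration only.

```python
def passes_pos_filter(tag, including=("NN", "JJ"), excluding=()):
    """Return True if a POS tag passes the include/exclude filters.

    Only the first two letters of the tag are compared, so 'NNS' and 'NNP'
    both match the filter value 'NN'. (Hypothetical helper, not gensim's API.)
    """
    prefix = tag[:2]
    if excluding and prefix in excluding:
        return False
    # An empty include filter accepts every tag that was not excluded.
    return not including or prefix in including


# Hand-tagged tokens, for illustration only (not the output of a real tagger).
tagged = [("languages", "NNS"), ("natural", "JJ"), ("involve", "VBP"), ("frequently", "RB")]
kept = [word for word, tag in tagged if passes_pos_filter(tag)]
print(kept)  # ['languages', 'natural']
```

Comparing only the two-letter prefix means one filter value covers a whole family of tags, such as singular, plural, and proper nouns.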

Data:

WINDOW_SIZE - Window size: the number of consecutive tokens processed together.
INCLUDING_FILTER - Part-of-speech tags to include.
EXCLUDING_FILTER - Part-of-speech tags to exclude.
gensim.summarization.keywords.get_graph(text)

Cleans and tokenizes the given text, then builds and returns a graph from its tokens.

Parameters: text (str) – Input text.
Returns: Created graph.
Return type: Graph
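To make the idea of a token graph concrete, the sketch below links tokens that occur within a sliding window of each other. It is an illustration under stated assumptions, not the code behind get_graph: the function build_cooccurrence_graph, the WINDOW value, and the plain dict-of-sets representation are all hypothetical, and the cleaning and filtering steps are skipped.

```python
from collections import defaultdict

WINDOW = 2  # assumed window size for this sketch; not necessarily gensim's WINDOW_SIZE


def build_cooccurrence_graph(tokens, window=WINDOW):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, token in enumerate(tokens):
        graph[token]  # touch the key so isolated tokens still appear as nodes
        # A window of N consecutive tokens covers the next N - 1 tokens.
        for other in tokens[i + 1:i + window]:
            if other != token:
                graph[token].add(other)
                graph[other].add(token)
    return dict(graph)


tokens = "natural language processing natural language generation".split()
graph = build_cooccurrence_graph(tokens)
print(sorted(graph["language"]))  # ['generation', 'natural', 'processing']
```

A ranking algorithm such as TextRank would then score the nodes of a graph like this one. A larger window adds edges between tokens that are farther apart.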
gensim.summarization.keywords.keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=('NN', 'JJ'), lemmatize=False, deacc=True)

Returns the highest-ranked words of the given text and/or their combinations (multi-word keywords).

Parameters:
  • text (str) – Input text.
  • ratio (float, optional) – If words is not set, the fraction of candidate words to return as keywords; otherwise ignored.
  • words (int, optional) – Number of returned words.
  • split (bool, optional) – If True, return the keywords as a list instead of a single string.
  • scores (bool, optional) – If True, return each keyword together with its score.
  • pos_filter (tuple, optional) – Part-of-speech tags of the words to keep.
  • lemmatize (bool, optional) – If True, lemmatize words.
  • deacc (bool, optional) – If True, remove accents.
Returns:

  • result (list of (str, float)) – If scores is True, keywords with their scores, OR
  • result (list of str) – If split is True, keywords only, OR
  • result (str) – Otherwise, keywords joined by newline characters.
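How ratio, words, split, and scores interact can be sketched as follows. This is a hypothetical helper, not gensim's code: the name select_and_format is invented, the scores are placeholder values, and the precedence of scores over split is an assumption taken from the order in which the return types are listed above.

```python
def select_and_format(ranked, ratio=0.2, words=None, split=False, scores=False):
    """Pick the top keywords and shape the result as described above.

    `ranked` is a list of (keyword, score) pairs, highest score first.
    (Hypothetical helper for illustration; not part of gensim.)
    """
    # `words` takes precedence; otherwise keep a `ratio` fraction of the candidates.
    count = int(len(ranked) * ratio) if words is None else words
    top = ranked[:count]
    if scores:
        return top  # list of (str, float)
    if split:
        return [keyword for keyword, _ in top]  # list of str
    return "\n".join(keyword for keyword, _ in top)  # newline-joined str


# Placeholder scores, invented for illustration only.
ranked = [("natural language", 0.9), ("machine", 0.5), ("frequently", 0.4), ("dialog", 0.2)]

print(select_and_format(ranked, words=2).split("\n"))     # ['natural language', 'machine']
print(select_and_format(ranked, ratio=0.5, split=True))   # ['natural language', 'machine']
print(select_and_format(ranked, words=1, scores=True))    # [('natural language', 0.9)]
```

With the default flags the result is one string, which is why the example at the top of this page calls .split('\n') on it.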