summarization.keywords – Keywords for TextRank summarization algorithm
This module contains functions for finding the keywords of a text and for building a graph over the tokens of that text.
Examples
Extract keywords from text
>>> from gensim.summarization import keywords
>>> text = '''Challenges in natural language processing frequently involve
... speech recognition, natural language understanding, natural language
... generation (frequently from formal, machine-readable logical forms),
... connecting language and machine perception, dialog systems, or some
... combination thereof.'''
>>> keywords(text).split('\n')
[u'natural language', u'machine', u'frequently']
Notes
See http://www.clips.ua.ac.be/pages/mbsp-tags for the list of part-of-speech tags, and use only the first two letters of a tag in INCLUDING_FILTER and EXCLUDING_FILTER.
WINDOW_SIZE - Size of the window: the number of consecutive tokens processed together.
INCLUDING_FILTER - Part-of-speech tags to include.
EXCLUDING_FILTER - Part-of-speech tags to exclude.
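The two-letter rule for the filters can be illustrated with a small sketch. It is not gensim's implementation: `passes_pos_filter` and the sample tagged tokens are hypothetical, and the default filter values are assumptions chosen to mirror the `pos_filter=('NN', 'JJ')` default in the `keywords` signature.

```python
# Hypothetical helper, not part of gensim: a token passes if the first two
# letters of its part-of-speech tag satisfy the include/exclude filters.
def passes_pos_filter(tag, including=("NN", "JJ"), excluding=()):
    prefix = tag[:2]  # "NNS" -> "NN", "VBP" -> "VB"
    if excluding and prefix in excluding:
        return False
    # An empty include filter is treated here as "accept everything".
    return not including or prefix in including

# Made-up (word, tag) pairs for illustration only.
tagged = [("natural", "JJ"), ("languages", "NNS"),
          ("involve", "VBP"), ("quickly", "RB")]
kept = [word for word, tag in tagged if passes_pos_filter(tag)]
print(kept)  # ['natural', 'languages']
```

Comparing only the two-letter prefix means one filter entry such as `NN` covers the whole family of noun tags (`NN`, `NNS`, `NNP`, and so on).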
gensim.summarization.keywords.get_graph(text)
Creates and returns a graph from the given text. The text is cleaned and tokenized before the graph is built.
text (str) – Input text.
Created graph.
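As a rough illustration of building a graph on tokens with a sliding window, the sketch below links every pair of distinct tokens that fall inside the same window. It is a simplified stand-in, not the code behind get_graph: `build_cooccurrence_graph` is a hypothetical helper, the window size of 2 is an illustrative value, and the token list is assumed to be already cleaned and tokenized.

```python
WINDOW_SIZE = 2  # illustrative value, assumed for this sketch


def build_cooccurrence_graph(tokens, window_size=WINDOW_SIZE):
    """Return an undirected graph as a dict mapping token -> set of neighbours."""
    graph = {token: set() for token in tokens}
    for i, left in enumerate(tokens):
        # Tokens that share a window of `window_size` consecutive tokens
        # starting at position i.
        for right in tokens[i + 1:i + window_size]:
            if left != right:
                graph[left].add(right)
                graph[right].add(left)
    return graph


tokens = ["natural", "language", "processing", "language", "generation"]
graph = build_cooccurrence_graph(tokens)
print(sorted(graph["language"]))  # ['generation', 'natural', 'processing']
```

A ranking algorithm such as TextRank can then score the nodes of a graph like this one; words that co-occur with many other words end up with more edges.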
gensim.summarization.keywords.keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=('NN', 'JJ'), lemmatize=False, deacc=True)
Gets the highest-ranked words of the provided text and/or their combinations.
text (str) – Input text.
ratio (float, optional) – If the “words” option is not set, the fraction of candidate words to keep as keywords; otherwise the ratio is ignored.
words (int, optional) – Number of returned words.
split (bool, optional) – If True, return the keywords as a list instead of a single joined string.
scores (bool, optional) – If True, return each keyword together with its score.
pos_filter (tuple, optional) – Part of speech filters.
lemmatize (bool, optional) – If True, lemmatize words.
deacc (bool, optional) – If True, remove accentuation.
result (list of (str, float)) – If scores is True, keywords with their scores, OR
result (list of str) – If split is True, keywords only, OR
result (str) – Otherwise, keywords joined by newline characters.
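The three possible return shapes can be summarized with a short sketch. This is an assumption-laden simplification rather than the library's code: `format_keywords` is a hypothetical helper, the ranked list and its scores are made up, and the ratio is applied directly to the ranked list purely to show how `words` takes precedence over `ratio`.

```python
# Hypothetical helper, not gensim's implementation.
def format_keywords(ranked, ratio=0.2, words=None, split=False, scores=False):
    """`ranked` is a list of (keyword, score) pairs, best first."""
    # `words` wins over `ratio` when it is given.
    count = int(len(ranked) * ratio) if words is None else words
    selected = ranked[:count]
    if scores:
        return selected                      # list of (str, float)
    names = [keyword for keyword, _ in selected]
    if split:
        return names                         # list of str
    return "\n".join(names)                  # single newline-joined str


# Made-up scores for illustration only.
ranked = [("natural language", 0.35), ("machine", 0.22), ("frequently", 0.18),
          ("speech", 0.10), ("dialog", 0.05)]

print(format_keywords(ranked, words=2, scores=True))
# [('natural language', 0.35), ('machine', 0.22)]
print(format_keywords(ranked, words=2, split=True))
# ['natural language', 'machine']
print(repr(format_keywords(ranked, words=2)))
# 'natural language\nmachine'
```

Because the default result is a single string, callers who want a list either pass `split=True` or split the string on newlines, as the module example does with `keywords(text).split('\n')`.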