summarization.mz_entropy – Keywords for the Montemurro and Zanette entropy algorithm
gensim.summarization.mz_entropy.count_freqs_by_blocks(words, vocab, blocksize)
Count word frequencies in chunks.
Parameters
words (list(str)) – List of all words.
vocab (list(str)) – List of words in vocabulary.
blocksize (int) – Size of blocks to use for count.
Returns
results – Array of lists of word frequencies, one list per chunk. Within each list, the frequencies follow the same order as the words in vocab.
Return type
numpy.array(list(double))
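The block-wise counting described above can be illustrated with a minimal pure-Python sketch. The helper name count_by_blocks_sketch is hypothetical and this is not the library's implementation: it returns nested lists rather than a numpy array, and it assumes that a shorter trailing block is counted as its own chunk.

```python
def count_by_blocks_sketch(words, vocab, blocksize):
    """Count how often each vocabulary word occurs in each block of words.

    Hypothetical illustration only; returns a list of lists of floats.
    """
    # Map each vocabulary word to its column position.
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for start in range(0, len(words), blocksize):
        counts = [0.0] * len(vocab)
        # Assumption: a shorter final slice still forms its own block.
        for word in words[start:start + blocksize]:
            counts[index[word]] += 1.0
        rows.append(counts)
    return rows


words = "a b a c b a".split()
vocab = sorted(set(words))  # ['a', 'b', 'c']
blocks = count_by_blocks_sketch(words, vocab, blocksize=3)
print(blocks)  # [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
```

Each row corresponds to one block and each column to one vocabulary word, which matches the ordering guarantee stated for the return value.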
gensim.summarization.mz_entropy.mz_keywords(text, blocksize=1024, scores=False, split=False, weighted=True, threshold=0.0)
Extract keywords from text using the Montemurro and Zanette entropy algorithm [1].
Parameters
text (str) – Document for summarization.
blocksize (int, optional) – Size of blocks to use in analysis.
scores (bool, optional) – Whether to return score with keywords.
split (bool, optional) – Whether to return results as list.
weighted (bool, optional) – Whether to weight scores by word frequency. Setting this to False can be useful for shorter texts, and allows automatic thresholding.
threshold (float or 'auto', optional) – Minimum score for returned keywords. 'auto' calculates the threshold as n_blocks / (n_blocks + 1.0) + 1e-8; use 'auto' with weighted=False.
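As a small worked example of the 'auto' threshold formula given above, the following sketch evaluates it for a few block counts. The function name auto_threshold is hypothetical, and n_blocks is taken here simply as the number of blocks the text was divided into.

```python
def auto_threshold(n_blocks):
    """Evaluate the documented 'auto' formula: n_blocks / (n_blocks + 1.0) + 1e-8."""
    return n_blocks / (n_blocks + 1.0) + 1e-8


# The threshold rises towards 1 as the number of blocks grows.
for n_blocks in (1, 4, 99):
    print(n_blocks, auto_threshold(n_blocks))
```

The formula depends only on the number of blocks, so a longer text or a smaller blocksize yields a stricter threshold.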
Returns
results (str) – Newline-separated keywords if scores == False and split == False, OR
results (list(str)) – List of keywords if scores == False and split == True, OR
results (list(tuple(str, float))) – List of (keyword, score) tuples if scores == True.
Results are returned in descending order of score regardless of the format.
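The three return formats can be sketched as follows. This is an illustration of the description above, not the library's code: the helper format_keywords and the sample scores are hypothetical, and the sketch assumes that scores=True takes precedence over split.

```python
def format_keywords(weights, scores=False, split=False):
    """Shape (keyword, score) pairs into one of the three documented formats.

    Hypothetical helper; assumes scores=True takes precedence over split.
    """
    # Descending order of score, regardless of the output format.
    ranked = sorted(weights, key=lambda pair: -pair[1])
    if scores:
        return ranked
    keywords = [word for word, _ in ranked]
    return keywords if split else "\n".join(keywords)


# Made-up scores, for illustration only.
sample = [("whale", 0.2), ("ship", 0.5), ("sea", 0.3)]

print(format_keywords(sample))               # newline-separated string
print(format_keywords(sample, split=True))   # ['ship', 'sea', 'whale']
print(format_keywords(sample, scores=True))  # [('ship', 0.5), ('sea', 0.3), ('whale', 0.2)]
```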
Note
This algorithm looks for keywords that contribute to the structure of the text on scales of blocksize words or larger. It is suitable for extracting keywords representing the major themes of long texts.
References
[1] Marcello A. Montemurro, Damian Zanette, “Towards the quantification of the semantic information encoded in written language”. Advances in Complex Systems, Volume 13, Issue 2 (2010), pp. 135-153, DOI: 10.1142/S0219525910002530, https://arxiv.org/abs/0907.1558