summarization.bm25 – BM25 ranking function

This module contains a function for computing rank scores for the documents in a corpus, and the helper class BM25 used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].

[1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
[2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus)

Data:

PARAM_K1 - Free smoothing parameter for BM25.
PARAM_B - Free smoothing parameter for BM25.
EPSILON - Constant used for negative idf of document in corpus.
class gensim.summarization.bm25.BM25(corpus)

Bases: object

Implementation of Best Matching 25 ranking function.

corpus_size

int – Size of corpus (number of documents).

avgdl

float – Average document length in the corpus.

corpus

list of list of str – Corpus of documents.

f

list of dict – Term frequencies for each document in the corpus, one dictionary per document. Words are used as keys and frequencies as values.

df

dict – Document frequencies for the whole corpus. Words are used as keys and the number of documents containing each word as values.

idf

dict – Inverse document frequencies for the whole corpus. Words are used as keys and idf values as values.

doc_len

list of int – List of document lengths.

Parameters: corpus (list of list of str) – Given corpus.
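
The attributes listed above can be derived from a tokenized corpus in a few lines. The following is a minimal sketch in plain Python, not the class's actual code; it only mirrors the attribute names and descriptions given here.

```python
from collections import Counter

# Same toy corpus as in the module example.
corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]

corpus_size = len(corpus)                    # number of documents
doc_len = [len(doc) for doc in corpus]       # number of tokens in each document
avgdl = sum(doc_len) / corpus_size           # average document length
f = [dict(Counter(doc)) for doc in corpus]   # per-document term frequencies

print(corpus_size, doc_len, avgdl)
print(f[0])
```

Each entry of `f` lines up by position with the corresponding document in `corpus`, which is what allows a document to be selected by index later on.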
get_score(document, index, average_idf)

Computes the BM25 score of the given document relative to the corpus item selected by index.

Parameters:
  • document (list of str) – Document to be scored.
  • index (int) – Index of the corpus document to score against document.
  • average_idf (float) – Average idf in corpus.
Returns:

BM25 score.

Return type:

float
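
To make the scoring step concrete, here is a hedged sketch of an Okapi BM25 score for one query document against one corpus document, following the general form given in [2]. The function name `bm25_score`, its argument list, the constant values (`K1 = 1.5`, `B = 0.75`, `EPS = 0.25`), and the rule of replacing a negative idf with `EPS * average_idf` are all assumptions made for illustration; this page does not state them.

```python
K1 = 1.5    # assumed value for the free parameter k1
B = 0.75    # assumed value for the free parameter b
EPS = 0.25  # assumed floor factor applied to negative idf values


def bm25_score(query, term_freqs, idf, doc_len, avgdl, average_idf):
    """Score `query` (list of str) against one document.

    `term_freqs` maps word -> frequency in the scored document,
    `idf` maps word -> inverse document frequency.
    """
    # Length normalization: longer-than-average documents are penalized.
    norm = K1 * (1 - B + B * doc_len / avgdl)
    score = 0.0
    for word in query:
        tf = term_freqs.get(word, 0)
        if tf == 0:
            continue  # words absent from the document contribute nothing
        # Assumption: a negative idf is replaced by a small positive value.
        word_idf = idf[word] if idf[word] >= 0 else EPS * average_idf
        score += word_idf * tf * (K1 + 1) / (tf + norm)
    return score


# Made-up toy values, chosen so the arithmetic is easy to follow.
term_freqs = {"cat": 1, "outer": 1, "space": 1}
idf = {"cat": -0.5, "outer": 0.5, "space": 0.5}
score = bm25_score(["outer", "space"], term_freqs, idf,
                   doc_len=3, avgdl=3.0, average_idf=0.2)
print(score)
```

With a document of exactly average length, the normalization term reduces to `K1`, so each matching word with frequency 1 contributes its idf unchanged.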

get_scores(document, average_idf)

Computes and returns the BM25 scores of the given document relative to every item in the corpus.

Parameters:
  • document (list of str) – Document to be scored.
  • average_idf (float) – Average idf in corpus.
Returns:

BM25 scores.

Return type:

list of float

initialize()

Calculates the frequencies of terms in each document and in the whole corpus. Also computes the inverse document frequencies.
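
The corpus-level part of this initialization can be sketched as follows. The helper names `document_frequencies` and `inverse_document_frequencies` are hypothetical, and the idf formula used is the classic Robertson/Sparck Jones form, log(N - n + 0.5) - log(n + 0.5), which is an assumption here rather than something this page specifies.

```python
import math


def document_frequencies(corpus):
    """Map each word to the number of documents that contain it."""
    df = {}
    for doc in corpus:
        for word in set(doc):  # count each word once per document
            df[word] = df.get(word, 0) + 1
    return df


def inverse_document_frequencies(df, corpus_size):
    """Assumed idf form: log(N - n + 0.5) - log(n + 0.5)."""
    return {
        word: math.log(corpus_size - n + 0.5) - math.log(n + 0.5)
        for word, n in df.items()
    }


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
df = document_frequencies(corpus)
idf = inverse_document_frequencies(df, len(corpus))
print(df["cat"], idf["cat"], idf["dog"])
```

Under this formula a word that occurs in more than half of the documents, such as "cat" in the toy corpus, receives a negative idf. That is the situation the EPSILON constant is described as handling.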

gensim.summarization.bm25.get_bm25_weights(corpus)

Returns the BM25 scores (weights) of the documents in the corpus. Each document is scored against every document in the given corpus, including itself.

Parameters: corpus (list of list of str) – Corpus of documents.
Returns: BM25 scores.
Return type: list of list of float

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus)
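
To illustrate the shape of the returned value, the pairwise weighting described above can be approximated in plain Python. This is a self-contained sketch under stated assumptions, not the library's code: the function name `bm25_weights`, the parameter values, the idf formula, and the handling of negative idf values are all illustrative choices.

```python
import math
from collections import Counter

K1, B, EPS = 1.5, 0.75, 0.25  # assumed parameter values, not taken from this page


def bm25_weights(corpus):
    """Score every document against every document; returns an n x n list."""
    n = len(corpus)
    doc_len = [len(doc) for doc in corpus]
    avgdl = sum(doc_len) / n
    tf = [Counter(doc) for doc in corpus]
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in corpus for word in set(doc))
    # Assumed idf form; it can go negative for very common words.
    idf = {w: math.log(n - c + 0.5) - math.log(c + 0.5) for w, c in df.items()}
    average_idf = sum(idf.values()) / len(idf)

    weights = []
    for query in corpus:
        row = []
        for i in range(n):
            score = 0.0
            for word in query:
                if word not in tf[i]:
                    continue
                # Assumption: negative idf is replaced by a small positive value.
                w_idf = idf[word] if idf[word] >= 0 else EPS * average_idf
                score += w_idf * tf[i][word] * (K1 + 1) / (
                    tf[i][word] + K1 * (1 - B + B * doc_len[i] / avgdl))
            row.append(score)
        weights.append(row)
    return weights


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
result = bm25_weights(corpus)
for row in result:
    print(row)
```

Row `i` of the result holds the scores of document `i` used as a query against each corpus document, so documents with no words in common (the first and third here) get a score of zero.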