summarization.bm25 – BM25 ranking function

This module contains functions for computing rank scores for documents in a corpus, and the helper class BM25 used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].

[1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf

[2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)

Data:

PARAM_K1 - Free smoothing parameter for BM25.
PARAM_B - Free smoothing parameter for BM25.
EPSILON - Constant used to handle negative idf values in the corpus.
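
For orientation, the textbook Okapi BM25 score, as given on the Wikipedia page referenced above, is shown below. It is an assumption here, not something this page confirms, that PARAM_K1 and PARAM_B play the roles of k_1 and b. The exact idf variant used by the module and its EPSILON handling are not shown.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here Q is the query with terms q_1, ..., q_n, f(q_i, D) is the frequency of q_i in document D, |D| is the length of D in words, and avgdl is the average document length in the corpus.
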
class gensim.summarization.bm25.BM25(corpus)

Bases: object

Implementation of the Best Matching 25 (BM25) ranking function.

corpus_size

Size of corpus (number of documents).

Type

int

avgdl

Average document length in the corpus.

Type

float

doc_freqs

List of dictionaries, one per document in the corpus, holding term frequencies. Words are used as keys and their frequencies in that document as values.

Type

list of dicts of int

idf

Dictionary with inverse document frequencies for the whole corpus. Words are used as keys and their idf values as values.

Type

dict

doc_len

List of document lengths.

Type

list of int

Parameters

corpus (list of list of str) – Given corpus.
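
The attributes above can be derived from a tokenized corpus in a few lines. The following is a minimal standalone sketch, not gensim's code; the helper name corpus_statistics and the extra containing counter (number of documents that contain each term, the usual input to idf) are assumptions made for illustration.

```python
from collections import Counter


def corpus_statistics(corpus):
    """Collect per-corpus statistics (hypothetical helper, not part of gensim)."""
    corpus_size = len(corpus)                       # number of documents
    doc_len = [len(doc) for doc in corpus]          # length of each document
    avgdl = sum(doc_len) / corpus_size              # average document length
    doc_freqs = [dict(Counter(doc)) for doc in corpus]  # term frequencies per document
    # Number of documents containing each term (each document counted once per term)
    containing = dict(Counter(term for doc in corpus for term in set(doc)))
    return corpus_size, avgdl, doc_len, doc_freqs, containing


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
corpus_size, avgdl, doc_len, doc_freqs, containing = corpus_statistics(corpus)
```

The sketch assumes a non-empty corpus; an empty list would raise ZeroDivisionError when computing avgdl.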

get_score(document, index)

Computes the BM25 score of the given document in relation to the corpus item selected by index.

Parameters
  • document (list of str) – Document to be scored.

  • index (int) – Index of the corpus document that document is scored against.

Returns

BM25 score.

Return type

float
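
To make the scoring step concrete, here is a standalone sketch of textbook Okapi BM25 for one query document against one corpus item. It is not gensim's implementation: the function name bm25_score, the values 1.5 and 0.75 for k1 and b, and the always-positive idf smoothing are assumptions, and the module's EPSILON handling of negative idf is not modelled.

```python
import math
from collections import Counter

K1 = 1.5   # assumed value, stands in for PARAM_K1
B = 0.75   # assumed value, stands in for PARAM_B


def bm25_score(document, index, corpus, k1=K1, b=B):
    """Score `document` (list of str) against corpus[index] with textbook BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    target = corpus[index]
    freqs = Counter(target)
    score = 0.0
    for term in document:
        tf = freqs.get(term, 0)
        if tf == 0:
            continue  # terms absent from the target contribute nothing
        n_containing = sum(1 for doc in corpus if term in doc)
        # Smoothed idf that cannot go negative (an assumption, see lead-in)
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
        # Term-frequency saturation with document-length normalization
        norm = tf + k1 * (1.0 - b + b * len(target) / avgdl)
        score += idf * tf * (k1 + 1.0) / norm
    return score


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
cat_vs_first = bm25_score(["cat"], 0, corpus)
cat_vs_second = bm25_score(["cat"], 1, corpus)
```

In this sketch a term that appears more often in the target raises the score, while a longer-than-average target lowers it; k1 controls how quickly repeated occurrences saturate and b controls the strength of the length normalization.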

get_scores(document)

Computes and returns the BM25 scores of the given document in relation to every item in the corpus.

Parameters

document (list of str) – Document to be scored.

Returns

BM25 scores.

Return type

list of float

get_scores_bow(document)

Computes and returns the BM25 scores of the given document in relation to every item in the corpus.

Parameters

document (list of str) – Document to be scored.

Returns

BM25 scores.

Return type

list of float

gensim.summarization.bm25.get_bm25_weights(corpus, n_jobs=1)

Returns BM25 scores (weights) of the documents in the corpus. Each document is scored against every document in the given corpus.

Parameters
  • corpus (list of list of str) – Corpus of documents.

  • n_jobs (int) – The number of processes to use for computing BM25.

Returns

BM25 scores.

Return type

list of list of float

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
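
The shape of the result, one row per document with one score per corpus item, can be sketched as follows. This is an illustration only: pairwise_weights and shared_token_count are hypothetical names, the toy scorer is a stand-in and not BM25, and the row/column orientation is assumed from the description above.

```python
def pairwise_weights(corpus, score_fn):
    """Score every document against every corpus item.

    score_fn(document, index) is a hypothetical callable returning a float.
    Row i of the result holds the scores of corpus[i] against each item.
    """
    return [
        [score_fn(document, index) for index in range(len(corpus))]
        for document in corpus
    ]


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]


def shared_token_count(document, index):
    # Toy stand-in scorer, NOT BM25: counts document tokens found in corpus[index]
    target = set(corpus[index])
    return float(sum(1 for token in document if token in target))


weights = pairwise_weights(corpus, shared_token_count)
```

For a corpus of N documents the result has N rows of N floats, so memory grows quadratically with corpus size.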
gensim.summarization.bm25.iter_bm25_bow(corpus, n_jobs=1)

Yields BM25 scores (weights) of the documents in the corpus. Each document is scored against every document in the given corpus.

Parameters
  • corpus (list of list of str) – Corpus of documents.

  • n_jobs (int) – The number of processes to use for computing BM25.

Yields

list of (index, float) – BM25 scores in bag of weights format.

Examples

>>> from gensim.summarization.bm25 import iter_bm25_bow
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = iter_bm25_bow(corpus, n_jobs=-1)
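
The "list of (index, float)" format mentioned under Yields can be illustrated by converting a dense score list into sparse pairs. This is a sketch under the assumption that non-positive scores are dropped from the sparse form; to_sparse is a hypothetical helper, not a gensim function.

```python
def to_sparse(scores):
    """Convert dense scores into (index, score) pairs, keeping positive scores only."""
    return [(index, score) for index, score in enumerate(scores) if score > 0]


dense = [0.0, 1.2, 0.0, 0.4]
sparse = to_sparse(dense)
```

A sparse representation like this saves space when most documents share no terms and therefore score zero against each other.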