summarization.bm25 – BM25 ranking function
This module contains functions for computing rank scores for documents in a corpus and the helper class BM25 used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].
[1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
[2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
PARAM_K1 - Free smoothing parameter for BM25.
PARAM_B - Free smoothing parameter for BM25.
EPSILON - Constant used for negative idf of document in corpus.
class gensim.summarization.bm25.BM25(corpus)
Bases: object
Implementation of the Best Matching 25 ranking function.
corpus_size
Size of the corpus (number of documents).
Type: int
avgdl
Average length of a document in the corpus.
Type: float
doc_freqs
Term frequencies for each document in the corpus, one dictionary per document. Words are used as keys and frequencies as values.
Type: list of dicts of int
idf
Dictionary with inverse document frequencies for the whole corpus. Words are used as keys and inverse document frequencies as values.
Type: dict
doc_len
List of document lengths.
Type: list of int
Parameters:
corpus (list of list of str) – Given corpus.
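The attributes listed for the class can be illustrated with a minimal pure-Python sketch. It is not gensim's implementation: the helper name `corpus_statistics` is hypothetical, and the idf variant used, log((N - n + 0.5) / (n + 0.5)), is the classic form and is only assumed here. The library may additionally adjust negative idf values using EPSILON.

```python
import math
from collections import Counter


def corpus_statistics(corpus):
    """Hypothetical helper: compute BM25-style statistics for a tokenized corpus."""
    corpus_size = len(corpus)
    doc_len = [len(doc) for doc in corpus]
    avgdl = sum(doc_len) / corpus_size
    # One {word: frequency} dictionary per document.
    doc_freqs = [dict(Counter(doc)) for doc in corpus]

    # Number of documents that contain each word at least once.
    containing = Counter(word for doc in corpus for word in set(doc))
    # Assumed idf variant; it is negative for words found in more than half the documents.
    idf = {
        word: math.log((corpus_size - n + 0.5) / (n + 0.5))
        for word, n in containing.items()
    }
    return corpus_size, avgdl, doc_freqs, idf, doc_len


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
corpus_size, avgdl, doc_freqs, idf, doc_len = corpus_statistics(corpus)
```

In this toy corpus "cat" occurs in two of the three documents, so its idf under the assumed formula is negative, which is the situation the EPSILON constant is described as handling.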
get_score(document, index)
Computes the BM25 score of the given document in relation to the corpus item selected by index.
Parameters:
document (list of str) – Document to be scored.
index (int) – Index of the corpus document to score against document.
Returns: BM25 score.
Return type: float
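A minimal sketch of how such a score can be computed from the corpus statistics, using the standard Okapi BM25 form described in the references. The function name `bm25_score` is hypothetical, the idf values are illustrative rather than computed, and the defaults `k1=1.5` and `b=0.75` are commonly used values assumed here, not values confirmed for PARAM_K1 and PARAM_B.

```python
def bm25_score(document, index, doc_freqs, idf, doc_len, avgdl, k1=1.5, b=0.75):
    """Hypothetical sketch: Okapi BM25 score of `document` against corpus item `index`."""
    freqs = doc_freqs[index]
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len[index] / avgdl)
    score = 0.0
    for word in document:
        if word not in freqs:
            continue  # words absent from the corpus item contribute nothing
        tf = freqs[word]
        score += idf[word] * tf * (k1 + 1) / (tf + norm)
    return score


# Toy statistics for a three-document corpus; the idf values are made up for illustration.
doc_freqs = [
    {"black": 1, "cat": 2, "white": 1},
    {"cat": 1, "outer": 1, "space": 1},
    {"wag": 1, "dog": 1},
]
doc_len = [4, 3, 2]
avgdl = 3.0
idf = {"black": 1.0, "white": 1.0, "cat": 0.2, "outer": 1.0,
       "space": 1.0, "wag": 1.0, "dog": 1.0}

query = ["cat", "outer", "space"]
score_same_doc = bm25_score(query, 1, doc_freqs, idf, doc_len, avgdl)
score_unrelated = bm25_score(query, 2, doc_freqs, idf, doc_len, avgdl)
```

The query shares all of its words with corpus item 1 and none with item 2, so the second score is zero.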
get_scores(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus.
Parameters:
document (list of str) – Document to be scored.
Returns: BM25 scores.
Return type: list of float
get_scores_bow(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus, in bag-of-words style as (index, score) pairs.
Parameters:
document (list of str) – Document to be scored.
Returns: BM25 scores.
Return type: list of (int, float)
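The difference between a dense score list and a bag-of-words style result can be sketched as follows. This is an illustration under assumptions, not the library code: the helper names are hypothetical, the scores are toy values, and the sparse form is assumed to keep only positive scores as (index, score) pairs.

```python
def dense_scores(score_for_index, corpus_size):
    """Hypothetical helper: one score per corpus item, in corpus order."""
    return [score_for_index(index) for index in range(corpus_size)]


def bow_scores(score_for_index, corpus_size):
    """Hypothetical helper: (index, score) pairs, keeping positive scores only (assumed)."""
    pairs = []
    for index in range(corpus_size):
        score = score_for_index(index)
        if score > 0:
            pairs.append((index, score))
    return pairs


# Toy per-item scores standing in for real BM25 values.
toy_scores = [2.2, 0.0, 0.4]
dense = dense_scores(toy_scores.__getitem__, len(toy_scores))
sparse = bow_scores(toy_scores.__getitem__, len(toy_scores))
```

The sparse form drops corpus items that do not match at all, which can save memory when most scores are zero.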
gensim.summarization.bm25.get_bm25_weights(corpus, n_jobs=1)
Returns BM25 scores (weights) of documents in the corpus. Each document is scored against every document in the given corpus.
Parameters:
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing BM25.
Returns: BM25 scores.
Return type: list of list of float
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
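Since each document is scored against every document, the result for N documents can be read as an N x N table with one row per document. The sketch below shows only that shape. The helper `pairwise_weights` is hypothetical, and the stand-in scorer `overlap` counts shared tokens rather than computing BM25.

```python
def pairwise_weights(corpus, score):
    """Hypothetical helper: score every document against every corpus item."""
    return [
        [score(doc, index) for index in range(len(corpus))]
        for doc in corpus
    ]


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]


def overlap(doc, index):
    """Stand-in scorer (not BM25): number of tokens of `doc` found in corpus[index]."""
    return sum(1 for word in doc if word in corpus[index])


weights = pairwise_weights(corpus, overlap)
```

Row i holds the scores of document i against every document, so `weights[i][j]` is the weight of document i relative to document j.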
gensim.summarization.bm25.iter_bm25_bow(corpus, n_jobs=1)
Yields BM25 scores (weights) of documents in the corpus. Each document is scored against every document in the given corpus.
Parameters:
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing BM25.
Yields: list of (index, float) – BM25 scores in bag of weights format.
Examples
>>> from gensim.summarization.bm25 import iter_bm25_bow
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = iter_bm25_bow(corpus, n_jobs=-1)
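Because the scores are yielded rather than returned, rows can be consumed one at a time without holding the full table in memory. A minimal sketch of that pattern, with a hypothetical generator `iter_sparse_rows` and a token-overlap stand-in scorer in place of BM25:

```python
def iter_sparse_rows(corpus, score):
    """Hypothetical generator: yield one list of (index, score) pairs per document."""
    for doc in corpus:
        row = []
        for index in range(len(corpus)):
            value = score(doc, index)
            if value > 0:  # keep only matching corpus items (assumed sparse format)
                row.append((index, value))
        yield row


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]


def overlap(doc, index):
    """Stand-in scorer (not BM25): number of tokens of `doc` found in corpus[index]."""
    return sum(1 for word in doc if word in corpus[index])


rows = iter_sparse_rows(corpus, overlap)
first_row = next(rows)        # computed on demand
remaining_rows = list(rows)   # exhausts the generator
```

A generator can be iterated only once, so wrap it in `list(...)` if the rows are needed more than once.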