models.tfidfmodel – TF-IDF model

This module implements the Term Frequency - Inverse Document Frequency (TF-IDF) <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> family of vector space bag-of-words models.

For a more in-depth exposition of TF-IDF and its various SMART variants (normalization, weighting schemes), see the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/

class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs=None, pivot=None, slope=0.25)

Bases: gensim.interfaces.TransformationABC

Objects of this class realize the transformation from a word-document co-occurrence matrix (int) into a locally/globally weighted TF-IDF matrix (positive floats).

Examples

>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)  # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
>>>
>>> model = TfidfModel(corpus)  # fit model
>>> vector = model[corpus[0]]  # apply model to the first corpus document

TF-IDF is computed by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. The non-normalized weight of term $i$ in document $j$ in a corpus of $D$ documents is

$weight_{i,j} = frequency_{i,j} * \log_2 \frac{D}{document\_freq_{i}}$

or, more generally

$weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)$

so you can plug in your own custom $wlocal$ and $wglobal$ functions.
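
As a rough illustration of the general formula above, the following pure-Python sketch computes non-normalized weights for one bag-of-words document and then scales them to unit length. It does not call Gensim; the helper names (`identity_tf`, `log2_idf`, `tfidf_weights`, `unit_normalize`) and the toy document frequencies are assumptions made up for this example.

```python
import math


def identity_tf(tf):
    """Local weight: the raw term frequency (hypothetical stand-in for the default wlocal)."""
    return tf


def log2_idf(docfreq, totaldocs):
    """Global weight: log2(D / document_freq), as in the first formula above."""
    return math.log2(totaldocs / docfreq)


def tfidf_weights(bow, docfreqs, totaldocs, wlocal=identity_tf, wglobal=log2_idf):
    """Return [(term_id, wlocal(tf) * wglobal(df, D))] for one BoW document."""
    return [
        (term_id, wlocal(tf) * wglobal(docfreqs[term_id], totaldocs))
        for term_id, tf in bow
    ]


def unit_normalize(vector):
    """Scale a sparse vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(weight ** 2 for _, weight in vector))
    if norm == 0.0:
        return list(vector)
    return [(term_id, weight / norm) for term_id, weight in vector]


# Toy data (assumed): 4 documents; term 0 occurs in all of them,
# term 1 in two of them, term 2 in one of them.
docfreqs = {0: 4, 1: 2, 2: 1}
totaldocs = 4
doc = [(0, 2), (1, 1), (2, 3)]  # (term_id, term_frequency) pairs

raw = tfidf_weights(doc, docfreqs, totaldocs)
sublinear = tfidf_weights(doc, docfreqs, totaldocs, wlocal=math.sqrt)
normalized = unit_normalize(raw)
```

A term that occurs in every document (term 0 here) gets a global weight of zero, so it contributes nothing to the vector regardless of its frequency. Swapping `wlocal` for `math.sqrt` dampens high term frequencies without touching the global component.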

Parameters

corpus (iterable of iterable of (int, int), optional) – Input corpus
id2word ({dict, Dictionary}, optional) – Mapping between tokens and ids that was used to convert the input data to bag-of-words format.
dictionary (Dictionary) – If dictionary is specified, it must be a corpora.Dictionary object, and it will be used to directly construct the inverse document frequency mapping (corpus, if specified, is then ignored).
wlocal (callable, optional) – Function for local weighting. Default is identity() (other options: numpy.sqrt(), lambda tf: 0.5 + (0.5 * tf / tf.max()), etc.).
wglobal (callable, optional) – Function for global weighting, default is df2idf().
normalize ({bool, callable}, optional) – Whether to normalize document vectors to unit Euclidean length. You can also pass your own function as normalize.
smartirs (str, optional) –
SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the three letters denote, in order, the term frequency weighting, the document frequency weighting, and the document normalization applied to the document vector.
Term frequency weighting:
- b - binary,
- t or n - raw,
- a - augmented,
- l - logarithm,
- d - double logarithm,
- L - log average.
Document frequency weighting:
- x or n - none,
- f - idf,
- t - zero-corrected idf,
- p - probabilistic idf.
Document normalization:
- x or n - none,
- c - cosine,
- u - pivoted unique,
- b - pivoted character length.
Default is ‘nfc’. For more information visit SMART Information Retrieval System.
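
To make the mnemonic concrete, here is a small sketch that splits a three-letter scheme into its components using the letter lists above. The function `describe_smartirs` and the lookup dictionaries are hypothetical helpers written for this example, not part of Gensim.

```python
# Letter tables copied from the lists above.
TERM_FREQUENCY = {
    "b": "binary", "t": "raw", "n": "raw", "a": "augmented",
    "l": "logarithm", "d": "double logarithm", "L": "log average",
}
DOCUMENT_FREQUENCY = {
    "x": "none", "n": "none", "f": "idf",
    "t": "zero-corrected idf", "p": "probabilistic idf",
}
NORMALIZATION = {
    "x": "none", "n": "none", "c": "cosine",
    "u": "pivoted unique", "b": "pivoted character length",
}


def describe_smartirs(scheme):
    """Translate a mnemonic such as 'ntc' into three human-readable names."""
    if len(scheme) != 3:
        raise ValueError(f"expected a 3-letter scheme, got {scheme!r}")
    tables = (TERM_FREQUENCY, DOCUMENT_FREQUENCY, NORMALIZATION)
    names = []
    for letter, table in zip(scheme, tables):
        if letter not in table:
            raise ValueError(f"unknown letter {letter!r} in scheme {scheme!r}")
        names.append(table[letter])
    return tuple(names)


print(describe_smartirs("ntc"))  # ('raw', 'zero-corrected idf', 'cosine')
print(describe_smartirs("bpn"))  # ('binary', 'probabilistic idf', 'none')
```

Note that the same letter can mean different things depending on its position: `n` is raw term frequency in the first slot but "none" in the second and third.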
pivot (float or None, optional) –
In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

You can either set the pivot by hand, or you can let Gensim figure it out automatically with the following two steps:
1. Set either the u or b document normalization in the smartirs parameter.
2. Set either the corpus or dictionary parameter. The pivot will be automatically determined from the properties of the corpus or dictionary.
If pivot is None and you don’t follow steps 1 and 2, then pivoted document length normalization will be disabled. Default is None.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
slope (float, optional) –
In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results. Default is 0.25.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
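
The norm adjustment described for pivot and slope can be sketched in a few lines of plain Python. This is only an illustration of the formula slope * old_norm + (1.0 - slope) * pivot. The function `pivoted_normalize` and the sample values are assumptions for this example, as is the choice of the Euclidean norm for old_norm; Gensim's internal implementation may differ in detail.

```python
import math


def pivoted_normalize(vector, pivot, slope=0.25):
    """Divide each weight by slope * old_norm + (1.0 - slope) * pivot.

    Assumption: old_norm is taken to be the Euclidean norm of the vector.
    """
    old_norm = math.sqrt(sum(weight ** 2 for _, weight in vector))
    new_norm = slope * old_norm + (1.0 - slope) * pivot
    if new_norm == 0.0:
        return list(vector)  # nothing sensible to scale by
    return [(term_id, weight / new_norm) for term_id, weight in vector]


vector = [(0, 3.0), (1, 4.0)]  # Euclidean norm is 5.0

# slope=1.0: the new norm equals the old norm, i.e. plain unit-length scaling.
cosine_like = pivoted_normalize(vector, pivot=10.0, slope=1.0)

# slope=0.0: only the pivot is used as the norm.
pivot_only = pivoted_normalize(vector, pivot=10.0, slope=0.0)

# Default slope: new norm = 0.25 * 5.0 + 0.75 * 10.0 = 8.75.
blended = pivoted_normalize(vector, pivot=10.0)
```

Here the document's norm (5.0) is below the pivot (10.0), so the blended norm is larger than the old one and the weights shrink more than under plain cosine normalization. A document whose norm exceeds the pivot gets a smaller norm than before and is penalized less, which is how the bias against long documents is reduced.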
