models.tfidfmodel – TF-IDF model

This module implements functionality related to the Term Frequency - Inverse Document Frequency (TF-IDF) vector space bag-of-words model: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

For a more in-depth exposition of TF-IDF and its various SMART variants (normalization, weighting schemes), see the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/

class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs=None, pivot=None, slope=0.65)

Bases: gensim.interfaces.TransformationABC

Objects of this class realize the transformation from a word-document co-occurrence matrix (integer counts) into a locally/globally weighted TF-IDF matrix (positive floats).

Examples

>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)  # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
>>>
>>> model = TfidfModel(corpus)  # fit model
>>> vector = model[corpus[0]]  # apply model to the first corpus document

Compute TF-IDF by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. The formula for the non-normalized weight of term i in document j, in a corpus of D documents, is

weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}

or, more generally

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.
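As a sketch of how custom wlocal and wglobal functions compose (the helper names below are illustrative, not part of the gensim API), the non-normalized weight is simply the product of the two components:

```python
import math

def wlocal(tf):
    # hypothetical custom local weight: damp raw term frequency sub-linearly
    return math.sqrt(tf)

def wglobal(docfreq, total_docs):
    # the default-style global weight: base-2 inverse document frequency
    return math.log(total_docs / docfreq, 2)

def weight(tf, docfreq, total_docs):
    # non-normalized weight of one term in one document, per the formula above
    return wlocal(tf) * wglobal(docfreq, total_docs)
```

Any callables with these shapes (local takes a term frequency, global takes a document frequency and the corpus size) can be passed as wlocal and wglobal.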

Parameters:
  • corpus (iterable of iterable of (int, int), optional) – Input corpus
  • id2word ({dict, Dictionary}, optional) – Mapping between tokens and ids that was used for converting the input data to the bag-of-words format.
  • dictionary (Dictionary, optional) – If specified, it must be a corpora.Dictionary object and it will be used to directly construct the inverse document frequency mapping (then corpus, if specified, is ignored).
  • wlocal (function, optional) – Function for local weighting; default is identity() (other options: math.sqrt(), math.log1p(), etc.).
  • wglobal (function, optional) – Function for global weighting; default is df2idf().
  • normalize ({bool, function}, optional) – Normalize document vectors to unit Euclidean length? You can also inject your own normalization function here.
  • smartirs (str, optional) –

    SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’ or ‘bpn’, where each letter represents one weighting component of the document vector.

    Term frequency weighting:
    • n - natural,
    • l - logarithm,
    • a - augmented,
    • b - boolean,
    • L - log average.
    Document frequency weighting:
    • n - none,
    • t - idf,
    • p - prob idf.
    Document normalization:
    • n - none,
    • c - cosine.

    For more information visit SMART Information Retrieval System.

  • pivot (float, optional) –

    See the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

    Pivot is the point around which the regular normalization curve is tilted to get the new pivoted normalization curve. In the paper Amit Singhal, Chris Buckley, Mandar Mitra: “Pivoted Document Length Normalization” it is the point where the retrieval and relevance curves intersect.

    This parameter along with slope is used for pivoted document length normalization. Only when pivot is not None will pivoted document length normalization be applied. Otherwise, regular TfIdf is used.

  • slope (float, optional) – Parameter required by pivoted document length normalization which determines the slope to which the old normalization can be tilted. This parameter only works when pivot is defined.
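A minimal sketch of how pivot and slope interact, assuming the pivoted norm (1 - slope) * pivot + slope * ||v|| described in the blog post linked above (the function name is hypothetical, not gensim's internal code):

```python
import math

def pivoted_length_normalize(vec, pivot, slope=0.65):
    # vec: one sparse document as [(term_id, weight), ...]
    norm = math.sqrt(sum(w * w for _, w in vec))
    # tilt the regular normalization curve around the pivot point
    pivoted_norm = (1.0 - slope) * pivot + slope * norm
    return [(term_id, w / pivoted_norm) for term_id, w in vec]
```

With slope=1.0 the pivot term vanishes and this reduces to plain cosine normalization; smaller slopes penalize long documents less aggressively.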
__getitem__(bow, eps=1e-12)

Get the tf-idf representation of an input vector and/or corpus.

Parameters:
  • bow ({list of (int, int), iterable of iterable of (int, int)}) – Input document in the sparse Gensim bag-of-words format, or a streamed corpus of such documents.
  • eps (float) – Threshold value; all entries with a tf-idf value less than eps will be removed.
Returns:
  • vector (list of (int, float)) – TfIdf vector, if bow is a single document
  • TransformedCorpus – TfIdf corpus, if bow is a corpus.
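Conceptually, for a single document the transformation multiplies each term frequency by its precomputed idf, optionally normalizes the result, and drops near-zero entries. A hypothetical re-implementation of that per-document step (not gensim's actual code):

```python
import math

def transform_bow(bow, idfs, eps=1e-12, normalize=True):
    # bow: [(term_id, tf), ...]; idfs: {term_id: idf} precomputed from the corpus
    vec = [(term_id, tf * idfs.get(term_id, 0.0)) for term_id, tf in bow]
    if normalize:
        length = math.sqrt(sum(w * w for _, w in vec)) or 1.0
        vec = [(term_id, w / length) for term_id, w in vec]
    # drop positions whose weight fell below the threshold
    return [(term_id, w) for term_id, w in vec if abs(w) > eps]
```

Terms with zero idf (e.g. terms appearing in every document under the default scheme) end up with zero weight and are filtered out by the eps threshold.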
initialize(corpus)

Compute inverse document weights, which will be used to modify term frequencies for documents.

Parameters:corpus (iterable of iterable of (int, int)) – Input corpus.
classmethod load(*args, **kwargs)

Load a previously saved TfidfModel object. Handles backwards compatibility with older TfidfModel versions that did not use pivoted document normalization.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to a file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()
Load object from file.
gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

Compute inverse-document-frequency for a term with the given document frequency docfreq: idf = add + log_{log\_base} \frac{totaldocs}{docfreq}

Parameters:
  • docfreq ({int, float}) – Document frequency.
  • totaldocs (int) – Total number of documents.
  • log_base (float, optional) – Base of logarithm.
  • add (float, optional) – Offset.
Returns:

Inverse document frequency.

Return type:

float
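The formula can be reproduced in a few lines of plain Python; this is a re-implementation of the formula above for illustration, not an import of gensim's function:

```python
import math

def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    # idf = add + log_{log_base}(totaldocs / docfreq)
    return add + math.log(totaldocs / docfreq, log_base)

# a term appearing in 10 of 1000 documents gets idf = log2(100)
idf = df2idf(10, 1000)
```

Note that a term appearing in every document (docfreq == totaldocs) gets idf = add, which is 0 by default, so such terms are effectively removed by the eps threshold in __getitem__.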

gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)

Pre-compute the inverse document frequency mapping for all terms.

Parameters:
  • wglobal (function) – Custom function for calculating the “global” weighting function. See for example the SMART alternatives under smartirs_wglobal().
  • dfs (dict) – Dictionary mapping each term_id to the number of documents the term appeared in.
  • total_docs (int) – Total number of documents.
Returns:

Inverse document frequencies in the format {term_id_1: idfs_1, term_id_2: idfs_2, …}.

Return type:

dict of (int, float)
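In essence this applies the global weighting function to every term's document frequency once, up front; a minimal sketch, assuming the same inputs as above:

```python
import math

def precompute_idfs(wglobal, dfs, total_docs):
    # map every term id to its global (idf) weight, computed once
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}

# example: base-2 idf over a toy document-frequency dictionary
idfs = precompute_idfs(lambda df, D: math.log(D / df, 2), {1: 10, 2: 100}, 1000)
```

Precomputing the mapping avoids recomputing the global weight for every occurrence of a term during corpus transformation.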

gensim.models.tfidfmodel.resolve_weights(smartirs)

Check the validity of smartirs parameters.

Parameters:smartirs (str) –

SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, where each letter represents one weighting component of the document vector. For more information visit the SMART Information Retrieval System.

Returns:
  • 3-tuple (local_letter, global_letter, normalization_letter)
  • local_letter (str) –
    Term frequency weighting, one of:
    • n - natural,
    • l - logarithm,
    • a - augmented,
    • b - boolean,
    • L - log average.
  • global_letter (str) –
    Document frequency weighting, one of:
    • n - none,
    • t - idf,
    • p - prob idf.
  • normalization_letter (str) –
    Document normalization, one of:
    • n - none,
    • c - cosine.
Raises:ValueError – If smartirs is not a string of length 3, or if any of its letters is not among the permissible values.
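The validation logic amounts to checking length and letter membership; a hypothetical re-implementation for illustration (not gensim's exact code):

```python
def resolve_weights(smartirs):
    # validate an 'ntc'-style mnemonic and split it into its three letters
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("expected a string of length 3, got %r" % (smartirs,))
    w_tf, w_df, w_n = smartirs
    if w_tf not in 'nlabL' or w_df not in 'ntp' or w_n not in 'nc':
        raise ValueError("unknown SMART letter in %r" % smartirs)
    return w_tf, w_df, w_n
```

For example, resolve_weights('ntc') yields ('n', 't', 'c'): natural term frequency, idf document weighting, cosine normalization.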
gensim.models.tfidfmodel.smartirs_normalize(x, norm_scheme, return_norm=False)

Normalize a vector using the normalization scheme specified in norm_scheme.

Parameters:
  • x (numpy.ndarray) – Input array.
  • norm_scheme ({'n', 'c'}) – Normalization scheme to use: 'n' for no normalization, 'c' for unit L2 norm (scale x to unit Euclidean length).
  • return_norm (bool, optional) – Also return the L2 norm of x?
Returns:

  • numpy.ndarray – Normalized array.
  • float (only if return_norm is set) – L2 norm of x.
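A plain-Python sketch of the two schemes over a list of floats (gensim's version operates on numpy arrays; this re-implementation only illustrates the math):

```python
import math

def smartirs_normalize(x, norm_scheme, return_norm=False):
    # 'n' leaves the vector untouched, 'c' rescales it to unit L2 length
    length = math.sqrt(sum(v * v for v in x))
    if norm_scheme == 'c' and length > 0:
        x = [v / length for v in x]
    return (x, length) if return_norm else x
```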

gensim.models.tfidfmodel.smartirs_wglobal(docfreq, totaldocs, global_scheme)

Calculate global document weight based on the weighting scheme specified in global_scheme.

Parameters:
  • docfreq (int) – Document frequency.
  • totaldocs (int) – Total number of documents.
  • global_scheme ({'n', 't', 'p'}) – Global transformation scheme.
Returns:

Calculated global weight.

Return type:

float
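The three global schemes can be sketched as follows (this mirrors common SMART definitions; the 'p' variant is clamped at zero here, which may differ in detail from gensim's internals):

```python
import math

def smartirs_wglobal(docfreq, totaldocs, global_scheme):
    if global_scheme == 'n':    # none: no global weighting
        return 1.0
    if global_scheme == 't':    # idf: log2(D / df)
        return math.log2(totaldocs / docfreq)
    if global_scheme == 'p':    # prob idf: log2((D - df) / df), clamped at 0
        return max(0.0, math.log2((totaldocs - docfreq) / docfreq))
    raise ValueError("unknown global scheme %r" % global_scheme)
```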

gensim.models.tfidfmodel.smartirs_wlocal(tf, local_scheme)

Calculate local term weight for a term using the weighting scheme specified in local_scheme.

Parameters:
  • tf (int) – Term frequency.
  • local_scheme ({'n', 'l', 'a', 'b', 'L'}) – Local transformation scheme.
Returns:

Calculated local weight.

Return type:

float
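The five local schemes can be sketched over one document's term frequencies (the 'a' and 'L' schemes need the whole document for their maximum and mean, so this sketch takes a list rather than a single frequency; it mirrors common SMART definitions rather than reproducing gensim's exact code):

```python
import math

def smartirs_wlocal(tfs, local_scheme):
    # tfs: list of raw term frequencies for one document
    if local_scheme == 'n':      # natural: raw frequency
        return list(tfs)
    if local_scheme == 'l':      # logarithm: 1 + log2(tf)
        return [1.0 + math.log2(tf) for tf in tfs]
    if local_scheme == 'a':      # augmented: 0.5 + 0.5 * tf / max(tf)
        m = max(tfs)
        return [0.5 + 0.5 * tf / m for tf in tfs]
    if local_scheme == 'b':      # boolean: presence/absence
        return [1 if tf > 0 else 0 for tf in tfs]
    if local_scheme == 'L':      # log average: (1 + log2(tf)) / (1 + log2(mean tf))
        avg = sum(tfs) / len(tfs)
        return [(1.0 + math.log2(tf)) / (1.0 + math.log2(avg)) for tf in tfs]
    raise ValueError("unknown local scheme %r" % local_scheme)
```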