models.tfidfmodel – TF-IDF model

class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs=None)

Bases: gensim.interfaces.TransformationABC

Objects of this class transform a word-document co-occurrence matrix (int) into a locally/globally weighted TF-IDF matrix (positive floats).

Examples

>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)  # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset]  # convert dataset to BoW format
>>>
>>> model = TfidfModel(corpus)  # fit model
>>> vector = model[corpus[0]]  # apply model

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. The formula for the non-normalized weight of term i in document j in a corpus of D documents is

weight_{i,j} = frequency_{i,j} * log_2 \frac{D}{document\_freq_{i}}

or, more generally

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document\_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.
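
As a rough illustration of the general formula, the following sketch computes non-normalized weights for a toy bag-of-words corpus in plain Python and then scales each document to unit length. It does not call gensim; the helper names `tfidf_weights` and `unit_length` and the toy corpus are assumptions made for this example, and the default global weighting simply mirrors the log base 2 formula above.

```python
import math
from collections import Counter


def tfidf_weights(corpus, wlocal=lambda tf: tf, wglobal=None):
    """Return non-normalized tf-idf weights for a BoW corpus (illustrative helper)."""
    total_docs = len(corpus)
    if wglobal is None:
        # Default global weight from the formula above: log_2(D / document_freq).
        wglobal = lambda docfreq, totaldocs: math.log2(totaldocs / docfreq)
    # Document frequency: number of documents in which each term id appears.
    # Assumes each term id occurs at most once per document, as in BoW format.
    dfs = Counter(term_id for doc in corpus for term_id, _ in doc)
    return [
        [(term_id, wlocal(tf) * wglobal(dfs[term_id], total_docs)) for term_id, tf in doc]
        for doc in corpus
    ]


def unit_length(vector):
    """Scale a sparse vector to Euclidean length 1 (left unchanged if all zeros)."""
    norm = math.sqrt(sum(w * w for _, w in vector))
    return [(t, w / norm) for t, w in vector] if norm > 0 else vector


# Toy corpus: three documents as lists of (term_id, term_frequency).
toy_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(0, 1)]]
weights = tfidf_weights(toy_corpus)
sqrt_weights = tfidf_weights(toy_corpus, wlocal=math.sqrt)  # custom local weighting
normalized = [unit_length(doc) for doc in weights]
```

In this toy corpus, term 0 occurs in every document, so its global component is log_2(3/3) = 0 and its weight is 0 regardless of its frequency.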

Parameters:
  • corpus (iterable of iterable of (int, int), optional) – Input corpus
  • id2word ({dict, Dictionary}, optional) – Mapping between tokens and ids that was used to convert the input data to bag-of-words format.
  • dictionary (Dictionary) – If dictionary is specified, it must be a corpora.Dictionary object, and it will be used to directly construct the inverse document frequency mapping (corpus, if specified, is then ignored).
  • wlocal (function, optional) – Function for local weighting; the default is identity() (other options: math.sqrt(), math.log1p(), etc.).
  • wglobal (function, optional) – Function for global weighting, default is df2idf().
  • normalize (bool, optional) – Controls how the final transformed vectors are normalized. normalize=True scales each vector to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.
  • smartirs (str, optional) –

    SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the letters represent the term weighting of the document vector.

    Term frequency weighting:
    • n - natural,
    • l - logarithm,
    • a - augmented,
    • b - boolean,
    • L - log average.
    Document frequency weighting:
    • n - none,
    • t - idf,
    • p - prob idf.
    Document normalization:
    • n - none,
    • c - cosine.

    For more information visit [1].

__getitem__(bow, eps=1e-12)

Get tf-idf representation of the input vector and/or corpus.

bow : {list of (int, int), iterable of iterable of (int, int)}
Input document or corpus in BoW format.
eps : float
Threshold value; all positions whose tf-idf value is less than eps are removed.
Returns:
  • vector (list of (int, float)) – TfIdf vector, if bow is document OR
  • TransformedCorpus – TfIdf corpus, if bow is corpus.
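
The effect of eps can be pictured with a short plain-Python sketch. The helper `drop_small` is hypothetical and only illustrates the thresholding described above; it assumes that entries are compared by absolute value, which the description does not state.

```python
def drop_small(vector, eps=1e-12):
    """Keep only (term_id, weight) pairs whose absolute weight is at least eps."""
    return [(term_id, weight) for term_id, weight in vector if abs(weight) >= eps]


# A made-up tf-idf vector with one zero entry and one negligible entry.
tfidf_vector = [(0, 0.0), (3, 0.42), (7, 1e-15), (9, 0.91)]
filtered = drop_small(tfidf_vector)
```

With the default threshold, only the entries for term ids 3 and 9 survive, which keeps the resulting sparse vector free of effectively-zero positions.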
initialize(corpus)

Compute inverse document weights, which will be used to modify term frequencies for documents.

Parameters: corpus (iterable of iterable of (int, int)) – Input corpus.
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files, and the automatic check is not performed.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

Compute default inverse-document-frequency for a term with document frequency: idf = add + log_{log\_base} \frac{totaldocs}{doc\_freq}

Parameters:
  • docfreq (float) – Document frequency.
  • totaldocs (int) – Total number of documents.
  • log_base (float, optional) – Base of logarithm.
  • add (float, optional) – Offset.
Returns:

Inverse document frequency.

Return type:

float
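
The formula above can be reproduced in a few lines of plain Python. This is a sketch of the stated formula only, not gensim's source; the name `df2idf_sketch` is made up for the example.

```python
import math


def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Compute add + log_{log_base}(totaldocs / docfreq)."""
    return add + math.log(float(totaldocs) / docfreq, log_base)


idf_rare = df2idf_sketch(2, 8)              # term in 2 of 8 documents
idf_common = df2idf_sketch(8, 8)            # term in every document
idf_shifted = df2idf_sketch(2, 8, add=1.0)  # same term, offset by 1
```

A term that appears in every document gets an idf of 0 unless a positive `add` offset is supplied.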

gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)

Pre-compute the inverse document frequency mapping for all terms.

Parameters:
  • wglobal (function) – Custom function for calculating idf; see the “universal” updated_wglobal().
  • dfs (dict) – Dictionary mapping each term_id to the number of documents in which that token appears.
  • total_docs (int) – Total number of documents.
Returns:

Precomputed idfs in format {term_id_1: idfs_1, term_id_2: idfs_2, …}

Return type:

dict
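
Conceptually, the pre-computation applies wglobal to every document frequency once. The sketch below shows that idea in plain Python; `precompute_idfs_sketch` and `log2_idf` are hypothetical names for this example, and the real function may differ in details.

```python
import math


def precompute_idfs_sketch(wglobal, dfs, total_docs):
    """Map each term id to wglobal(document_frequency, total_docs)."""
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}


def log2_idf(docfreq, totaldocs):
    # Stand-in global weighting: log_2(totaldocs / docfreq).
    return math.log2(totaldocs / docfreq)


dfs = {0: 4, 1: 1, 2: 2}  # term_id -> number of documents containing the term
idfs = precompute_idfs_sketch(log2_idf, dfs, total_docs=4)
```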

gensim.models.tfidfmodel.resolve_weights(smartirs)

Checks for validity of smartirs parameter.

Parameters: smartirs (str) – SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System mnemonic, a scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd, where the letters represent the term weighting of the document vector. For more information visit [1].
Returns:
  • w_tf (str) –
    Term frequency weighting:
    • n - natural,
    • l - logarithm,
    • a - augmented,
    • b - boolean,
    • L - log average.
  • w_df (str) –
    Document frequency weighting:
    • n - none,
    • t - idf,
    • p - prob idf.
  • w_n (str) –
    Document normalization:
    • n - none,
    • c - cosine.
Raises: ValueError – If smartirs is not a string of length 3, or if one of the decomposed values is not in the list of permissible values.
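
The validation described above could look roughly like the following. This is a minimal sketch based only on the permissible letters listed here; `resolve_weights_sketch` is a hypothetical name, and the error messages are invented for the example.

```python
def resolve_weights_sketch(smartirs):
    """Split a SMART mnemonic into (w_tf, w_df, w_n), raising ValueError if invalid."""
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("Expected a string of length 3, got %r" % (smartirs,))
    w_tf, w_df, w_n = smartirs  # one character per weighting component
    if w_tf not in "nlabL":
        raise ValueError("Unknown term frequency weighting: %r" % w_tf)
    if w_df not in "ntp":
        raise ValueError("Unknown document frequency weighting: %r" % w_df)
    if w_n not in "nc":
        raise ValueError("Unknown document normalization: %r" % w_n)
    return w_tf, w_df, w_n


parts = resolve_weights_sketch("ntc")  # natural tf, idf, cosine normalization
```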

References

[1] https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
gensim.models.tfidfmodel.updated_normalize(x, n_n)

Normalizes the final tf-idf value according to the value of n_n.

Parameters:
  • x (numpy.ndarray) – Input array
  • n_n ({'n', 'c'}) – Parameter that decides the normalizing function to be used.
Returns:

Normalized array.

Return type:

numpy.ndarray

gensim.models.tfidfmodel.updated_wglobal(docfreq, totaldocs, n_df)

Transform docfreq (the document frequency) according to the scheme selected by n_df.

Parameters:
  • docfreq (int) – Document frequency.
  • totaldocs (int) – Total number of documents.
  • n_df ({'n', 't', 'p'}) – Parameter to decide the current transformation scheme.
Returns:

Calculated wglobal.

Return type:

float

gensim.models.tfidfmodel.updated_wlocal(tf, n_tf)

Transform tf (the term frequency) according to the scheme selected by n_tf.

Parameters:
  • tf (int) – Term frequency.
  • n_tf ({'n', 'l', 'a', 'b', 'L'}) – Parameter to decide the current transformation scheme.
Returns:

Calculated wlocal.

Return type:

float