models.tfidfmodel – TF-IDF model
This module implements functionality related to the Term Frequency - Inverse Document Frequency <https://en.wikipedia.org/wiki/Tf%E2%80%93idf> vector space bag-of-words models.
For a more in-depth exposition of TF-IDF and its various SMART variants (normalization, weighting schemes), see the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/
gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs=None, pivot=None, slope=0.65)
Bases: gensim.interfaces.TransformationABC
Objects of this class transform a word-document co-occurrence matrix (int) into a locally/globally weighted TF-IDF matrix (positive floats).
Examples
>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset) # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset] # convert corpus to BoW format
>>>
>>> model = TfidfModel(corpus) # fit model
>>> vector = model[corpus[0]] # apply model to the first corpus document
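Compute TF-IDF by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length.
Formula for non-normalized weight of term i in document j in a corpus of D documents:
weight_{i,j} = frequency_{i,j} \cdot \log_2 \frac{D}{document\_freq_i}
or, more generally
weight_{i,j} = wlocal(frequency_{i,j}) \cdot wglobal(document\_freq_i, D)
so you can plug in your own custom wlocal and wglobal functions.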
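The weighting described above can be sketched in plain Python, without gensim. This is a minimal illustration, not the library's implementation: it assumes the default identity local weight and a base-2 logarithm for the global weight, and it assumes that near-zero weights are dropped (suggested by the eps argument of __getitem__). The helper name tfidf_weights and the toy corpus are hypothetical.

```python
import math


def tfidf_weights(bow, dfs, total_docs, normalize=True, eps=1e-12):
    """Weight one bag-of-words document as tf * log2(D / df).

    bow is a list of (term_id, term_frequency) pairs and dfs maps
    term_id to document frequency (both layouts are assumptions).
    """
    weights = [
        (term_id, tf * math.log2(total_docs / dfs[term_id]))
        for term_id, tf in bow
    ]
    # Assumption: terms with negligible weight are removed from the result.
    weights = [(t, w) for t, w in weights if abs(w) > eps]
    if normalize and weights:
        # Scale the document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for _, w in weights))
        weights = [(t, w / norm) for t, w in weights]
    return weights


# Hypothetical toy corpus: three documents over three term ids.
corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3), (1, 1)]]
dfs = {0: 3, 1: 2, 2: 1}

raw = tfidf_weights(corpus[0], dfs, total_docs=len(corpus), normalize=False)
unit = tfidf_weights(corpus[0], dfs, total_docs=len(corpus))
```

Term 0 occurs in every document, so its inverse document frequency is zero and it disappears from the weighted vector of the first document; only term 1 remains.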
__getitem__(bow, eps=1e-12)
Get the tf-idf representation of an input vector and/or corpus.
initialize(corpus)
Compute inverse document weights, which will be used to modify term frequencies for documents.
Parameters: corpus (iterable of iterable of (int, int)) – Input corpus.
load(*args, **kwargs)
Load a previously saved TfidfModel object. Handles backwards compatibility with older TfidfModel versions that did not use pivoted document normalization.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)
Save the object to a file.
See also
load()
gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)
Compute inverse-document-frequency for a term with the given document frequency docfreq:
idf = add + \log_{log\_base} \frac{totaldocs}{docfreq}
Returns: Inverse document frequency.
Return type: float
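As a rough sketch of the computation described for df2idf, the function below adds an offset to the logarithm of totaldocs / docfreq in the given base. The name df2idf_sketch is hypothetical, and the code only mirrors the documented signature; it is not the library source.

```python
import math


def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Return add + log_{log_base}(totaldocs / docfreq)."""
    return add + math.log(totaldocs / docfreq) / math.log(log_base)


# A term found in 1 of 1024 documents is weighted far more heavily
# than a term found in half of them.
rare = df2idf_sketch(1, 1024)
common = df2idf_sketch(512, 1024)
shifted = df2idf_sketch(512, 1024, add=1.0)
```

With the default base 2, halving the document frequency raises the result by one.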
gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)
Pre-compute the inverse document frequency mapping for all terms.
Returns: Inverse document frequencies in the format {term_id_1: idfs_1, term_id_2: idfs_2, …}.
Return type: dict of (int, float)
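The pre-computation can be pictured as applying the global weighting function once per term. The sketch below assumes that dfs is a dict mapping term ids to document frequencies and that wglobal takes (docfreq, totaldocs); the names precompute_idfs_sketch and log2_idf are hypothetical.

```python
import math


def precompute_idfs_sketch(wglobal, dfs, total_docs):
    """Map every term id to its global weight."""
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}


def log2_idf(docfreq, totaldocs):
    """Hypothetical global weight: base-2 log of totaldocs / docfreq."""
    return math.log2(totaldocs / docfreq)


idfs = precompute_idfs_sketch(log2_idf, {0: 4, 1: 2, 2: 1}, total_docs=4)
```

Computing the mapping once keeps the per-document transformation down to a dictionary lookup and a multiplication per term.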
gensim.models.tfidfmodel.resolve_weights(smartirs)
Check the validity of smartirs parameters.
Parameters: smartirs (str) – SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System mnemonic, a scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd, where the letters represent the term weighting of the document vector. For more information, see the SMART Information Retrieval System.
Raises: ValueError – If smartirs is not a string of length 3, or if one of the decomposed values is not in the list of permissible values.
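The validation described above might look like the sketch below. The permissible letter sets are assumptions chosen for illustration and may differ between gensim versions, and returning the three letters as a tuple is also an assumption, since the Returns entry is not preserved here. The name resolve_weights_sketch is hypothetical.

```python
def resolve_weights_sketch(smartirs, local_letters="nlabL",
                           global_letters="ntp", norm_letters="ncb"):
    """Split a three-letter SMART mnemonic and validate each letter.

    The default letter sets are illustrative assumptions.
    """
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("expected a string of length 3, got %r" % (smartirs,))
    w_tf, w_df, w_n = smartirs
    checks = (
        (w_tf, local_letters, "term frequency"),
        (w_df, global_letters, "document frequency"),
        (w_n, norm_letters, "normalization"),
    )
    for letter, allowed, role in checks:
        if letter not in allowed:
            raise ValueError(
                "unsupported %s letter %r, expected one of %r"
                % (role, letter, allowed)
            )
    return w_tf, w_df, w_n


parts = resolve_weights_sketch("ntc")
```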
gensim.models.tfidfmodel.smartirs_normalize(x, norm_scheme, return_norm=False)
Normalize a vector using the normalization scheme specified in norm_scheme.
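A minimal sketch of scheme-driven normalization follows. It assumes that x is a list of (term_id, weight) pairs, that "n" means no normalization and "c" means cosine (unit Euclidean length), and that the norm reported for "n" is 1.0; all of these are assumptions of the sketch. The name normalize_sketch is hypothetical.

```python
import math


def normalize_sketch(x, norm_scheme, return_norm=False):
    """Normalize a sparse vector given as (term_id, weight) pairs."""
    if norm_scheme == "n":
        # Assumed meaning: leave the vector unchanged.
        result, length = list(x), 1.0
    elif norm_scheme == "c":
        # Assumed meaning: divide by the Euclidean length.
        length = math.sqrt(sum(w * w for _, w in x))
        result = [(t, w / length) for t, w in x] if length > 0 else list(x)
    else:
        raise ValueError("unsupported normalization scheme %r" % (norm_scheme,))
    return (result, length) if return_norm else result


vec = [(0, 3.0), (1, 4.0)]
unit, length = normalize_sketch(vec, "c", return_norm=True)
unchanged = normalize_sketch(vec, "n")
```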
gensim.models.tfidfmodel.smartirs_wglobal(docfreq, totaldocs, global_scheme)
Calculate global document weight based on the weighting scheme specified in global_scheme.
Returns: Calculated global weight.
Return type: float
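To illustrate how a scheme letter could select the global weight, the sketch below assumes three letters in the usual SMART notation: "n" for no global weighting, "t" for plain idf, and "p" for probabilistic idf. The letter set and the exact formulas are assumptions and may not match a given gensim version; wglobal_sketch is a hypothetical name.

```python
import math


def wglobal_sketch(docfreq, totaldocs, global_scheme):
    """Return a global weight for one term (letters are assumed)."""
    if global_scheme == "n":
        return 1.0
    if global_scheme == "t":
        return math.log2(totaldocs / docfreq)
    if global_scheme == "p":
        if docfreq >= totaldocs:
            # The ratio would be zero, so clamp instead of taking log2(0).
            return 0.0
        return max(0.0, math.log2((totaldocs - docfreq) / docfreq))
    raise ValueError("unsupported global scheme %r" % (global_scheme,))


none_weight = wglobal_sketch(2, 8, "n")
idf_weight = wglobal_sketch(2, 8, "t")
prob_weight = wglobal_sketch(2, 8, "p")
```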
gensim.models.tfidfmodel.smartirs_wlocal(tf, local_scheme)
Calculate local term weight for a term using the weighting scheme specified in local_scheme.
Returns: Calculated local weight.
Return type: float
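The local side can be sketched in the same way. The version below works on the list of term counts of a whole document, an assumption made so that the augmented scheme can see the largest count, and it assumes the letters "n" (natural), "l" (logarithmic), "a" (augmented) and "b" (boolean) from the usual SMART notation. The name wlocal_sketch is hypothetical.

```python
import math


def wlocal_sketch(tf, local_scheme):
    """Apply a local weighting scheme to positive term counts."""
    if local_scheme == "n":  # natural: raw counts
        return [float(f) for f in tf]
    if local_scheme == "l":  # logarithmic damping of large counts
        return [1.0 + math.log2(f) for f in tf]
    if local_scheme == "a":  # augmented: scale by the largest count
        top = max(tf)
        return [0.5 + 0.5 * f / top for f in tf]
    if local_scheme == "b":  # boolean: presence only
        return [1.0 if f > 0 else 0.0 for f in tf]
    raise ValueError("unsupported local scheme %r" % (local_scheme,))


counts = [1, 2, 4]
log_weights = wlocal_sketch(counts, "l")
aug_weights = wlocal_sketch(counts, "a")
bool_weights = wlocal_sketch(counts, "b")
```

The logarithmic and augmented schemes both reduce the influence of terms that are repeated many times within a single document.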