models.tfidfmodel – TF-IDF model

class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True)

Bases: gensim.interfaces.TransformationABC

Objects of this class realize the transformation from a word-document co-occurrence matrix (integers) into a locally/globally weighted TF-IDF matrix (positive floats).

The main methods are:

  1. constructor, which calculates inverse document frequencies for all terms in the training corpus.
  2. the [] method, which transforms a simple count representation into the TF-IDF space.

>>> from gensim.models import TfidfModel
>>> tfidf = TfidfModel(corpus)  # corpus: iterable of bag-of-words documents
>>> print(tfidf[some_doc])  # some_doc: a list of (term_id, count) pairs
>>> tfidf.save('/tmp/foo.tfidf_model')

Model persistence is achieved via its load/save methods.

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

or, more generally:

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.

Default for wlocal is identity (other options: math.sqrt, math.log1p, …) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
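The generalized formula can be illustrated with a small pure-Python sketch. It does not call gensim: the helper names idf_log2 and weigh_document, and the toy document frequencies, are assumptions made only for illustration.

```python
import math

def idf_log2(doc_freq, total_docs):
    # Global weight matching the default formula: log_2(D / document_freq).
    return math.log(total_docs / doc_freq, 2)

def weigh_document(doc, dfs, total_docs, wlocal=lambda tf: tf, wglobal=idf_log2):
    # doc: list of (term_id, frequency) pairs; dfs: term_id -> document frequency.
    # Each weight is wlocal(frequency) * wglobal(document_freq, D), unnormalized.
    return [(term_id, wlocal(freq) * wglobal(dfs[term_id], total_docs))
            for term_id, freq in doc]

dfs = {0: 1, 1: 4}      # hypothetical document frequencies in a 4-document corpus
doc = [(0, 2), (1, 3)]  # term 0 occurs twice, term 1 three times

default_weights = weigh_document(doc, dfs, total_docs=4)
sqrt_weights = weigh_document(doc, dfs, total_docs=4, wlocal=math.sqrt)
```

In this toy setup term 1 appears in all four documents, so its global component is log_2(4 / 4) = 0 and its weight vanishes regardless of the local weighting chosen.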

normalize dictates how the final transformed vectors will be normalized. normalize=True means set to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.
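As a rough sketch of what a normalizer looks like, the functions below accept and return a sparse vector, represented here as a list of (term_id, weight) pairs. Both names are hypothetical: unit_length illustrates scaling to unit Euclidean length, and max_scaled is an invented alternative; neither is gensim's own code.

```python
import math

def unit_length(vec):
    # vec: sparse vector as a list of (term_id, weight) pairs.
    norm = math.sqrt(sum(weight * weight for _, weight in vec))
    if norm == 0.0:
        return list(vec)  # nothing to scale in an all-zero vector
    return [(term_id, weight / norm) for term_id, weight in vec]

def max_scaled(vec):
    # Hypothetical custom normalizer: scale so the largest weight becomes 1.0.
    largest = max((abs(weight) for _, weight in vec), default=0.0)
    if largest == 0.0:
        return list(vec)
    return [(term_id, weight / largest) for term_id, weight in vec]

normalized = unit_length([(0, 3.0), (1, 4.0)])
rescaled = max_scaled([(0, 3.0), (1, 4.0)])
```

A function with the same shape as max_scaled is the kind of callable the normalize parameter is described as accepting.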

If dictionary is specified, it must be a corpora.Dictionary object and it will be used to directly construct the inverse document frequency mapping (then corpus, if specified, is ignored).

initialize(corpus)

Compute inverse document weights, which will be used to modify term frequencies for documents.
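The statistics this step depends on are the per-term document frequencies and the total number of documents. A minimal sketch of collecting them from a bag-of-words corpus is shown below; the function name is hypothetical, and it assumes each term id appears at most once per document.

```python
from collections import defaultdict

def count_document_frequencies(corpus):
    # corpus: iterable of documents, each a list of (term_id, frequency) pairs.
    dfs = defaultdict(int)
    num_docs = 0
    for doc in corpus:
        num_docs += 1
        for term_id, _ in doc:
            dfs[term_id] += 1  # counts documents containing the term, not occurrences
    return dict(dfs), num_docs

corpus = [[(0, 1), (1, 2)], [(1, 1)], [(1, 3), (2, 1)]]
dfs, num_docs = count_document_frequencies(corpus)
```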

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load raises an IOError if mmap is requested for a compressed file.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If fname_or_handle is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names that should not be serialized (file handles, caches, etc.). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

Compute default inverse-document-frequency for a term with document frequency docfreq, taking the logarithm in base log_base:

idf = add + log_{log_base}(totaldocs / docfreq)

gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)

Precompute the inverse document frequency mapping for all terms.
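How the two module-level functions fit together can be sketched in plain Python. The names below carry a _sketch suffix to make clear they are illustrative stand-ins written from the signatures and formula on this page, not gensim's source; dfs is assumed to map term ids to document frequencies.

```python
import math

def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    # Follows the documented formula: add + log_{log_base}(totaldocs / docfreq).
    return add + math.log(totaldocs / docfreq, log_base)

def precompute_idfs_sketch(wglobal, dfs, total_docs):
    # Apply the global weighting function to every term's document frequency.
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}

toy_dfs = {0: 1, 1: 2, 2: 8}  # hypothetical document frequencies
idfs = precompute_idfs_sketch(df2idf_sketch, toy_dfs, total_docs=8)

# A custom global weight can be built by fixing the optional parameters.
smoothed = precompute_idfs_sketch(
    lambda df, total: df2idf_sketch(df, total, add=1.0), toy_dfs, total_docs=8)
```

With add=1.0, a term that occurs in every document keeps a weight of 1 instead of dropping to 0, which is one reason to supply a custom wglobal.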