sklearn_api.tfidf – Scikit-learn wrapper for TF-IDF model

Scikit-learn interface for TfidfModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import TfIdfTransformer
>>>
>>> # Transform the word counts inversely to their global frequency using the sklearn interface.
>>> model = TfIdfTransformer(dictionary=common_dictionary)
>>> tfidf_corpus = model.fit_transform(common_corpus)
class gensim.sklearn_api.tfidf.TfIdfTransformer(id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs='ntc', pivot=None, slope=0.65)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base TfIdf module, wraps TfidfModel.

For more information, see tf-idf.

Parameters:
  • id2word ({dict, Dictionary}, optional) – Mapping from int id to word token, that was used for converting input data to bag of words format.
  • dictionary (Dictionary, optional) – If specified, it is used to directly construct the inverse document frequency mapping.
  • wlocal (function, optional) – Function for local weighting. The default, identity(), leaves the term frequency unchanged. Other options include math.sqrt() and math.log1p().
  • wglobal (function, optional) – Function for global weighting. The default is df2idf().
  • normalize (bool, optional) – Controls how the final transformed vectors are normalized. True (the default) scales each vector to unit length; False disables normalization. You can also pass your own function that accepts and returns a sparse vector.
  • smartirs (str, optional) –

    SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’ or ‘bpn’, where the three letters denote, in order, the term frequency weighting, the document frequency weighting, and the document normalization applied to the document vector.

    Term frequency weighting:
    • n - natural,
    • l - logarithm,
    • a - augmented,
    • b - boolean,
    • L - log average.
    Document frequency weighting:
    • n - none,
    • t - idf,
    • p - prob idf.
    Document normalization:
    • n - none,
    • c - cosine.

    For more information, see the Wikipedia article on the SMART Information Retrieval System.

  • pivot (float, optional) – The point around which the regular normalization curve is tilted to obtain the pivoted normalization curve. In the paper by Amit Singhal, Chris Buckley, and Mandar Mitra, “Pivoted Document Length Normalization”, it is the point where the retrieval and relevance curves intersect. It is used together with slope for pivoted document length normalization, which is applied only when pivot is not None; otherwise regular TfIdf is used.
  • slope (float, optional) – Parameter of pivoted document length normalization that determines the slope to which the original normalization curve is tilted. It has an effect only when pivot is set by the user and is not None.
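
To make the default ‘ntc’ scheme and the pivot and slope parameters more concrete, here is a minimal pure-Python sketch. It is not gensim's implementation. It assumes that idf is computed as log2(N / df), that each term id appears at most once per document, and that pivoted normalization divides by (1 - slope) * pivot + slope * old_norm, following the cited paper. The function names ntc_weights and normalize_vector are hypothetical.

```python
import math


def normalize_vector(vec, pivot=None, slope=0.65):
    """Scale a sparse vector by its cosine norm or, if pivot is set, by the pivoted norm."""
    old_norm = math.sqrt(sum(w * w for _, w in vec))
    if old_norm == 0.0:
        # Nothing to scale; return a copy unchanged.
        return list(vec)
    if pivot is None:
        # 'c': plain cosine normalization (unit length).
        norm = old_norm
    else:
        # Assumed pivoted norm, following Singhal, Buckley and Mitra.
        norm = (1.0 - slope) * pivot + slope * old_norm
    return [(term_id, w / norm) for term_id, w in vec]


def ntc_weights(corpus, pivot=None, slope=0.65):
    """Weight a BoW corpus with natural tf ('n'), idf ('t') and cosine normalization ('c')."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term id.
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    # 't': inverse document frequency; the base-2 logarithm is an assumption.
    idf = {term_id: math.log2(n_docs / df) for term_id, df in doc_freq.items()}
    weighted = []
    for doc in corpus:
        # 'n': the raw count is used as the term frequency.
        vec = [(term_id, count * idf[term_id]) for term_id, count in doc]
        weighted.append(normalize_vector(vec, pivot, slope))
    return weighted


# Toy corpus: three documents over three term ids.
corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(1, 1), (2, 3)]]
for doc in ntc_weights(corpus):
    print(doc)
```

With pivot left as None every output vector has unit length. With a pivot, vectors whose original norm is above the pivot are scaled down less than cosine normalization would scale them, and vectors below it are scaled down more.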
fit(X, y=None)

Fit the model according to the given training data.

Parameters:X (iterable of iterable of (int, int)) – Input corpus
Returns:The trained model.
Return type:TfIdfTransformer
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:

X_new – Transformed array.

Return type:

numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:self
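
The <component>__<parameter> naming can be illustrated with a small pure-Python sketch. It only mimics the naming convention and is not scikit-learn's code; the helper split_param and the step name "tfidf" are hypothetical.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter).

    Returns (None, name) when the name has no '__' separator, meaning the
    parameter belongs to the estimator itself.
    """
    component, delim, sub_name = name.partition("__")
    if not delim:
        return None, name
    return component, sub_name


# A parameter of a pipeline step hypothetically named "tfidf".
print(split_param("tfidf__normalize"))
# A parameter set directly on the estimator.
print(split_param("slope"))
```

Only the first double underscore separates the component from the rest, so deeper nesting stays in the parameter part and can be resolved by the component in turn.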
transform(docs)

Get the tf-idf scores for docs in BoW representation.

Parameters:docs ({iterable of list of (int, number), list of (int, number)}) – Document or corpus in BoW format.
Returns:The BoW representation of each document. Will have the same shape as docs.
Return type:iterable of list of (int, float) 2-tuples.
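
Because docs may be either a single BoW document or a corpus of them, calling code sometimes needs to tell the two shapes apart. The sketch below shows one way to do that in plain Python. The helper as_corpus is hypothetical and not part of gensim, and it assumes that BoW entries are tuples and that documents are lists.

```python
def as_corpus(docs):
    """Return docs as a list of BoW documents.

    A single document (a list of (id, value) tuples) is wrapped in a list;
    a corpus (a list of such lists) is returned unchanged.
    """
    if docs and isinstance(docs[0], tuple):
        return [docs]
    return docs


single_doc = [(0, 1), (1, 2)]
corpus = [[(0, 1)], [(1, 2), (2, 1)]]

print(as_corpus(single_doc))
print(as_corpus(corpus))
```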