sklearn_api.tfidf – Scikit learn wrapper for TF-IDF model

Scikit learn interface for TfidfModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import TfIdfTransformer
>>>
>>> # Transform the word counts inversely to their global frequency using the sklearn interface.
>>> model = TfIdfTransformer(dictionary=common_dictionary)
>>> tfidf_corpus = model.fit_transform(common_corpus)
class gensim.sklearn_api.tfidf.TfIdfTransformer(id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs='nfc', pivot=None, slope=0.65)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base TfIdf module, wraps TfidfModel.

For more information, see tf-idf.

Parameters
  • id2word ({dict, Dictionary}, optional) – Mapping from int id to word token, that was used for converting input data to bag of words format.

  • dictionary (Dictionary, optional) – If specified it will be used to directly construct the inverse document frequency mapping.

  • wlocal (function, optional) – Function for local weighting. The default is identity(), which leaves the term frequency unchanged. Other options include math.sqrt(), math.log1p(), etc.

  • wglobal (function, optional) – Function for global weighting, default is df2idf().

  • normalize (bool, optional) – It dictates how the final transformed vectors will be normalized. normalize=True means set to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.

  • smartirs (str, optional) –

    SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the three letters represent the term weighting, the document frequency weighting and the normalization of the document vector, in that order.

    local_letter (str)
    Term frequency weighting, one of:
    • b - binary,

    • t or n - raw,

    • a - augmented,

    • l - logarithm,

    • d - double logarithm,

    • L - log average.

    global_letter (str)
    Document frequency weighting, one of:
    • x or n - none,

    • f - idf,

    • t - zero-corrected idf,

    • p - probabilistic idf.

    normalization_letter (str)
    Document normalization, one of:
    • x or n - none,

    • c - cosine,

    • u - pivoted unique,

    • b - pivoted character length.

    Default is ‘nfc’. For more information, see the Wikipedia article on the SMART Information Retrieval System.

  • pivot (float, optional) – The point around which the regular normalization curve is tilted to obtain the pivoted normalization curve. In “Pivoted Document Length Normalization” (Amit Singhal, Chris Buckley, Mandar Mitra), it is the point where the retrieval and relevance curves intersect. This parameter is used together with slope for pivoted document length normalization. If pivot is None, smartirs specifies the pivoted unique document normalization scheme, and either corpus or dictionary is specified, the pivot is determined automatically. Otherwise, no pivoted document length normalization is applied.

  • slope (float, optional) – The parameter of pivoted document length normalization that determines how far the old normalization is tilted. It only takes effect when pivot is set by the user and is not None.
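How wlocal, wglobal, normalize, pivot and slope interact can be sketched in plain Python. This is an illustrative sketch only, not gensim's implementation: the helper names tfidf_weights and pivoted_norm are hypothetical, identity and df2idf are simplified stand-ins for the documented defaults, and the pivoted factor follows the formula from the Singhal, Buckley and Mitra paper rather than any particular library code path.

```python
import math


def identity(tf):
    """Local weight that leaves the raw term frequency unchanged."""
    return tf


def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Stand-in global weight: log of (total documents / document frequency)."""
    return add + math.log(totaldocs / docfreq) / math.log(log_base)


def tfidf_weights(bow, dfs, num_docs, wlocal=identity, wglobal=df2idf, normalize=True):
    """Weight one bag-of-words document as wlocal(tf) * wglobal(df, num_docs)."""
    weighted = [(term_id, wlocal(tf) * wglobal(dfs[term_id], num_docs)) for term_id, tf in bow]
    if normalize is True:
        # Scale the vector to unit Euclidean length.
        length = math.sqrt(sum(w * w for _, w in weighted))
        if length > 0.0:
            weighted = [(term_id, w / length) for term_id, w in weighted]
    elif normalize:
        # A user-supplied callable that accepts and returns a sparse vector.
        weighted = normalize(weighted)
    return weighted


def pivoted_norm(old_norm, pivot, slope=0.65):
    """Pivoted normalization factor: tilt the old norm around the pivot."""
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical statistics: term 0 occurs in 1 of 4 documents, term 1 in 2 of 4.
dfs = {0: 1, 1: 2}
doc = [(0, 2), (1, 1)]

raw = tfidf_weights(doc, dfs, num_docs=4, normalize=False)
unit = tfidf_weights(doc, dfs, num_docs=4)
damped = tfidf_weights(doc, dfs, num_docs=4, wlocal=math.sqrt, normalize=False)
```

With normalize=False the weights are the plain products of local and global weight; with the default, the same vector is scaled to unit length. At old_norm equal to pivot, the pivoted factor equals the pivot itself, which is what makes that point the pivot of the tilt.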

See also

gensim.models.tfidfmodel.TfidfModel – Class that also uses the SMART scheme.

gensim.models.tfidfmodel.resolve_weights – Function that also uses the SMART scheme.
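To make the three-letter smartirs mnemonic concrete, the following sketch decodes a mnemonic using only the letter tables listed for the smartirs parameter. The function describe_smartirs is a hypothetical helper written for illustration; it is not part of gensim and does not reproduce what resolve_weights does internally.

```python
# Letter tables copied from the smartirs parameter description.
LOCAL_LETTERS = {
    "b": "binary", "t": "raw", "n": "raw", "a": "augmented",
    "l": "logarithm", "d": "double logarithm", "L": "log average",
}
GLOBAL_LETTERS = {
    "x": "none", "n": "none", "f": "idf",
    "t": "zero-corrected idf", "p": "probabilistic idf",
}
NORMALIZATION_LETTERS = {
    "x": "none", "n": "none", "c": "cosine",
    "u": "pivoted unique", "b": "pivoted character length",
}


def describe_smartirs(mnemonic):
    """Return (local, global, normalization) descriptions for a mnemonic such as 'nfc'."""
    if len(mnemonic) != 3:
        raise ValueError("expected exactly three letters, got %r" % (mnemonic,))
    local_letter, global_letter, normalization_letter = mnemonic
    try:
        return (
            LOCAL_LETTERS[local_letter],
            GLOBAL_LETTERS[global_letter],
            NORMALIZATION_LETTERS[normalization_letter],
        )
    except KeyError as err:
        raise ValueError("unknown letter %s in %r" % (err, mnemonic))


default_scheme = describe_smartirs("nfc")
```

Under these tables the default ‘nfc’ reads as raw term frequency, idf document frequency weighting and cosine normalization.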

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X (iterable of iterable of (int, int)) – Input corpus

Returns

The trained model.

Return type

TfIdfTransformer
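The statistics a tf-idf model needs from the training corpus can be sketched as follows. This is a simplified illustration under the assumption that fitting gathers per-term document frequencies and the total document count; the helper collect_document_frequencies is hypothetical and is not the wrapper's actual code.

```python
from collections import Counter


def collect_document_frequencies(corpus):
    """Count, for every term id, the number of documents that contain it."""
    dfs = Counter()
    num_docs = 0
    for bow in corpus:
        num_docs += 1
        # A term counts once per document, however often it occurs in it.
        dfs.update({term_id for term_id, _ in bow})
    return dict(dfs), num_docs


# X in the documented format: an iterable of documents, each a list of (term id, count) pairs.
X = [
    [(0, 1), (1, 2)],
    [(1, 1)],
    [(1, 3), (2, 1)],
]
dfs, num_docs = collect_document_frequencies(X)
```

Here term 1 appears in all three documents, so an idf-style global weight would rank it as the least informative term.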

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns

X_new – Transformed array.

Return type

numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

Return type

self
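The <component>__<parameter> naming convention can be illustrated with a small sketch that separates an estimator's own parameters from those addressed to a nested component. The helper split_nested_params and the component name "tfidf" are hypothetical, chosen for illustration; this is not scikit-learn's implementation of set_params.

```python
def split_nested_params(params):
    """Split parameters into own ones and ones routed to named components."""
    own, nested = {}, {}
    for key, value in params.items():
        # Everything before the first double underscore names the component.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params({
    "verbose": True,        # applies to the outer object itself
    "tfidf__slope": 0.3,    # routed to the component named "tfidf"
    "tfidf__pivot": 10.0,
})
```

In a pipeline whose tf-idf step is named "tfidf", a call such as set_params(tfidf__slope=0.3) would therefore update only that step's slope.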

transform(docs)

Get the tf-idf scores in BoW representation for docs.

Parameters

docs ({iterable of list of (int, number), list of (int, number)}) – Document or corpus in BoW format.

Returns

The BoW representation of each document, with tf-idf weights in place of counts. Will have the same shape as docs.

Return type

iterable of list of (int, float) 2-tuples
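Because docs may be either a single document or a whole corpus, a caller or wrapper has to tell the two shapes apart. One possible way to do this is sketched below; the helper as_corpus is hypothetical, and the check assumes that a single document is a list whose first element is an (id, number) tuple.

```python
def as_corpus(docs):
    """Wrap a single BoW document in a list so it can be handled like a corpus."""
    docs = list(docs)
    if docs and isinstance(docs[0], tuple):
        # First element is an (id, number) pair, so docs is one document.
        return [docs]
    return docs


single_document = [(0, 1), (1, 2)]
corpus = [[(0, 1)], [(1, 2), (2, 1)]]

wrapped = as_corpus(single_document)
unchanged = as_corpus(corpus)
```

After this step every input can be processed document by document, yielding one list of (int, float) pairs per input document.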