sklearn_api.tfidf – Scikit-learn wrapper for TF-IDF model
Scikit-learn interface for TfidfModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import TfIdfTransformer
>>>
>>> # Transform the word counts inversely to their global frequency using the sklearn interface.
>>> model = TfIdfTransformer(dictionary=common_dictionary)
>>> tfidf_corpus = model.fit_transform(common_corpus)
gensim.sklearn_api.tfidf.TfIdfTransformer(id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs='nfc', pivot=None, slope=0.65)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base TfIdf module, wraps TfidfModel.
For more information, see tf-idf.
id2word ({dict, Dictionary}, optional) – Mapping from int id to word token, that was used for converting input data to bag of words format.
dictionary (Dictionary, optional) – If specified it will be used to directly construct the inverse document frequency mapping.
wlocal (function, optional) – Function for local weighting. The default is identity(), which leaves the term frequency unchanged. Other options include math.sqrt(), math.log1p(), etc.
wglobal (function, optional) – Function for global weighting. The default is df2idf().
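The interplay of wlocal and wglobal can be sketched in plain Python. This is a minimal illustration, not gensim's code: identity and df2idf below are stand-ins modelled on the defaults named above (the base-2 logarithm is an assumption about df2idf's default), and weigh and the document frequencies in dfs are hypothetical.

```python
import math


def identity(tf):
    """Local weighting that leaves the raw term frequency unchanged."""
    return tf


def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Stand-in global weighting: add + log_{log_base}(totaldocs / docfreq)."""
    return add + math.log(totaldocs / docfreq) / math.log(log_base)


def weigh(bow, dfs, totaldocs, wlocal=identity, wglobal=df2idf):
    """Return unnormalized weights wlocal(tf) * wglobal(df, totaldocs) for one BoW document."""
    return [(term_id, wlocal(tf) * wglobal(dfs[term_id], totaldocs)) for term_id, tf in bow]


# Hypothetical document frequencies for a 4-document corpus: term id -> number of documents.
dfs = {0: 1, 1: 2, 2: 4}
doc = [(0, 3), (1, 1), (2, 2)]

weights = weigh(doc, dfs, totaldocs=4)
# Swapping in math.sqrt dampens the influence of high raw counts.
sqrt_weights = weigh(doc, dfs, totaldocs=4, wlocal=math.sqrt)
```

A term that occurs in every document (term 2 here) gets a global weight of zero, regardless of the local weighting.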
normalize (bool, optional) – Determines how the final transformed vectors are normalized. normalize=True scales each vector to unit length (default); False disables normalization. You can also set normalize to your own function that accepts and returns a sparse vector.
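As a rough sketch of what the normalize options amount to, the snippet below scales a sparse vector to unit length and also shows a custom function of the kind that could be passed instead. Both unit_length and scale_by_max are hypothetical helpers written for this illustration, not gensim functions.

```python
import math


def unit_length(vec):
    """Scale a sparse vector of (term_id, weight) pairs to Euclidean length 1."""
    norm = math.sqrt(sum(weight * weight for _, weight in vec))
    if norm == 0.0:
        return list(vec)  # nothing to scale in an all-zero vector
    return [(term_id, weight / norm) for term_id, weight in vec]


def scale_by_max(vec):
    """Hypothetical custom normalizer: divide every weight by the largest absolute weight."""
    largest = max((abs(weight) for _, weight in vec), default=0.0)
    if largest == 0.0:
        return list(vec)
    return [(term_id, weight / largest) for term_id, weight in vec]


vec = [(0, 3.0), (1, 4.0)]
unit = unit_length(vec)    # Euclidean norm of vec is 5
scaled = scale_by_max(vec)
```

Both helpers accept and return a sparse vector, which is the contract the parameter description asks of a user-supplied function.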
smartirs (str, optional) – SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the letters represent, in order, the term frequency weighting, the document frequency weighting, and the normalization of the document vector.
Term frequency weighting:
b - binary,
t or n - raw,
a - augmented,
l - logarithm,
d - double logarithm,
L - log average.
Document frequency weighting:
x or n - none,
f - idf,
t - zero-corrected idf,
p - probabilistic idf.
Document normalization:
x or n - none,
c - cosine,
u - pivoted unique,
b - pivoted character length.
Default is ‘nfc’. For more information, see the Wikipedia article on the SMART Information Retrieval System.
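To make the mnemonic concrete, here is a small lookup sketch that decodes a three-letter string using the letter lists above. It assumes the three letters stand, in order, for term frequency weighting, document frequency weighting, and document normalization; describe_smartirs and the table names are hypothetical and are not part of gensim.

```python
# Letter tables copied from the lists above; one table per position in the mnemonic.
TERM_FREQUENCY = {
    "b": "binary", "t": "raw", "n": "raw", "a": "augmented",
    "l": "logarithm", "d": "double logarithm", "L": "log average",
}
DOCUMENT_FREQUENCY = {
    "x": "none", "n": "none", "f": "idf",
    "t": "zero-corrected idf", "p": "probabilistic idf",
}
NORMALIZATION = {
    "x": "none", "n": "none", "c": "cosine",
    "u": "pivoted unique", "b": "pivoted character length",
}


def describe_smartirs(mnemonic):
    """Split a three-letter SMART mnemonic into the names of its three weightings."""
    if len(mnemonic) != 3:
        raise ValueError("expected a three-letter mnemonic, got %r" % (mnemonic,))
    tf_letter, df_letter, norm_letter = mnemonic
    return (
        TERM_FREQUENCY[tf_letter],
        DOCUMENT_FREQUENCY[df_letter],
        NORMALIZATION[norm_letter],
    )


default_scheme = describe_smartirs("nfc")
```

Note that the same letter can mean different things depending on its position: b is binary term frequency in the first slot but pivoted character length normalization in the third.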
pivot (float, optional) – The point around which the regular normalization curve is tilted to get the new pivoted normalization curve. In the paper Amit Singhal, Chris Buckley, Mandar Mitra: “Pivoted Document Length Normalization” it is the point where the retrieval and relevance curves intersect. This parameter is used together with slope for pivoted document length normalization. If pivot is None, smartirs specifies the pivoted unique document normalization scheme, and either corpus or dictionary is specified, the pivot is determined automatically. Otherwise, no pivoted document length normalization is applied.
slope (float, optional) – The parameter required by pivoted document length normalization that determines the slope to which the old normalization can be tilted. This parameter only takes effect when pivot is defined by the user and is not None.
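The tilt described by pivot and slope can be illustrated with the formula from the cited paper, pivoted norm = (1 - slope) * pivot + slope * old norm. The sketch below assumes the old normalization is the Euclidean norm of the vector; pivoted_normalize is a hypothetical helper, not a gensim function.

```python
import math


def pivoted_normalize(vec, pivot, slope=0.65):
    """Divide each weight by the pivoted norm instead of the plain Euclidean norm."""
    old_norm = math.sqrt(sum(weight * weight for _, weight in vec))
    # Tilt the normalization around the pivot: vectors whose norm equals the
    # pivot are normalized exactly as before; others are pulled toward it.
    pivoted_norm = (1.0 - slope) * pivot + slope * old_norm
    return [(term_id, weight / pivoted_norm) for term_id, weight in vec]


vec = [(0, 3.0), (1, 4.0)]  # Euclidean norm is 5

at_pivot = pivoted_normalize(vec, pivot=5.0)               # same as unit-length scaling
tilted = pivoted_normalize(vec, pivot=10.0, slope=0.5)     # pivoted norm is 7.5
```

With slope=1.0 the pivot drops out of the formula entirely and the result is ordinary unit-length normalization.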
See also
gensim.models.tfidfmodel.TfidfModel: Class that also uses the SMART scheme.
gensim.models.tfidfmodel.resolve_weights: Function that also uses the SMART scheme.
fit(X, y=None)
Fit the model according to the given training data.
X (iterable of iterable of (int, int)) – Input corpus.
The trained model.
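For intuition about what fitting on a BoW corpus involves, the sketch below gathers the statistics a tf-idf model needs from X: the number of documents and, per term id, the number of documents containing it. This is an illustration under that assumption, not the wrapper's actual implementation; document_frequencies and the toy corpus are hypothetical.

```python
from collections import Counter


def document_frequencies(corpus):
    """Return (number of documents, {term_id: number of documents containing it})."""
    dfs = Counter()
    num_docs = 0
    for bow in corpus:
        num_docs += 1
        # A set guards against a term id being listed twice in one document.
        dfs.update({term_id for term_id, _ in bow})
    return num_docs, dict(dfs)


# Toy corpus in the expected input shape: iterable of iterable of (int, int).
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 3)],
    [(1, 4)],
]
num_docs, dfs = document_frequencies(corpus)
```

The term counts themselves are ignored here; only the presence of a term in a document contributes to its document frequency.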
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
X_new – Transformed array.
numpy array of shape [n_samples, n_features_new]
get_params(deep=True)
Get parameters for this estimator.
deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
params – Parameter names mapped to their values.
mapping of string to any
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
self
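The <component>__<parameter> convention is inherited from scikit-learn, so it can be demonstrated with scikit-learn's own estimators. The sketch below assumes scikit-learn is installed and uses StandardScaler and LogisticRegression purely as stand-ins; the step names "scale" and "clf" are arbitrary choices for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: a pipeline with two named components.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Address each component's parameter as <component>__<parameter>.
returned = pipe.set_params(clf__C=10.0, scale__with_mean=False)

# get_params(deep=True) reports the nested parameters under the same names.
params = pipe.get_params(deep=True)
```

Because set_params returns the estimator itself, calls can be chained, and the same parameter names can be used as keys in a scikit-learn parameter grid.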
transform(docs)
Get the tf-idf scores in BoW representation for docs.
docs ({iterable of list of (int, number), list of (int, number)}) – Document or corpus in BoW format.
The BoW representation of each document. Will have the same shape as docs.
iterable of list of (int, float) 2-tuples.