sklearn_api.ldaseqmodel – Scikit learn wrapper for LdaSeq model
Scikit-learn interface for LdaSeqModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api.ldaseqmodel import LdaSeqTransformer
>>>
>>> # Create a sequential LDA transformer to extract 2 topics from the common corpus.
>>> # Divide the work into 3 unequal time slices.
>>> model = LdaSeqTransformer(id2word=common_dictionary, num_topics=2, time_slice=[3, 4, 2], initialize='gensim')
>>>
>>> # Each document almost entirely belongs to one of the two topics.
>>> transformed_corpus = model.fit_transform(common_corpus)
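The time_slice value in the example above can be read as consecutive document ranges. The following is a minimal sketch in plain Python, not part of gensim: slice_bounds is a hypothetical helper, and it assumes the corpus holds nine documents in chronological order, so that the slice sizes add up to the corpus length.

```python
from itertools import accumulate


def slice_bounds(time_slice):
    """Return one (start, end) document index range per time slice (hypothetical helper)."""
    ends = list(accumulate(time_slice))   # running totals: end index of each slice
    starts = [0] + ends[:-1]              # each slice starts where the previous one ended
    return list(zip(starts, ends))


time_slice = [3, 4, 2]   # same value as in the example above
num_docs = 9             # assumed corpus length; expected to equal sum(time_slice)

bounds = slice_bounds(time_slice)
for index, (start, end) in enumerate(bounds):
    print(f"slice {index}: documents {start}..{end - 1}")
```

Under these assumptions the first slice covers documents 0 to 2, the second 3 to 6, and the third 7 to 8.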
gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer(time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Sequential LDA module, wraps the LdaSeqModel model.
For more information take a look at David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
time_slice (list of int, optional) – Number of documents in each time-slice.
id2word (Dictionary, optional) – Mapping from an ID to the word it represents in the vocabulary.
alphas (float, optional) – The prior probability of each topic.
num_topics (int, optional) – Number of latent topics to be discovered in the corpus.
initialize ({'gensim', 'own', 'ldamodel'}, optional) – Controls how the model is initialized:
'gensim': Uses gensim's own LDA initialization.
'own': Uses your own initialization matrix of an LDA model that has been previously trained.
'ldamodel': Uses a previously trained LDA model, passed through the lda_model argument.
sstats (np.ndarray of shape [vocab_len, num_topics], optional) – If initialize is set to ‘own’ this will be used to initialize the DTM model.
lda_model (LdaModel, optional) – If initialize is set to 'ldamodel', this object will be used to create the sstats initialization matrix.
obs_variance (float, optional) – Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.
passes (int, optional) – Number of passes over the corpus for the initial LdaModel.
random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.
lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.
em_min_iter (int, optional) – Minimum number of iterations of the Expectation-Maximization algorithm before convergence is checked.
em_max_iter (int, optional) – Maximum number of iterations of the Expectation-Maximization algorithm if convergence is not reached earlier.
chunksize (int, optional) – Number of documents in the corpus to be processed in each chunk.
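The interplay of em_min_iter and em_max_iter can be illustrated with a generic bounded iteration loop. This is only a sketch under stated assumptions, not gensim's actual training loop: run_em, the step callable, and the tol threshold are hypothetical names introduced for illustration.

```python
def run_em(step, em_min_iter=6, em_max_iter=20, tol=1e-4):
    """Call step() repeatedly and return (iterations_run, last_bound).

    Stops once the bound changes by less than tol between iterations,
    but never before em_min_iter and never after em_max_iter iterations.
    """
    prev_bound = None
    bound = None
    for iteration in range(1, em_max_iter + 1):
        bound = step()  # hypothetical: one EM pass returning the current bound
        converged = prev_bound is not None and abs(bound - prev_bound) < tol
        if converged and iteration >= em_min_iter:
            return iteration, bound
        prev_bound = bound
    return em_max_iter, bound


# A step whose bound never changes "converges" immediately,
# yet the loop still runs the minimum number of iterations.
flat_result = run_em(lambda: -10.0)

# A step whose bound keeps moving is cut off at the maximum.
counter = iter(range(1000))
moving_result = run_em(lambda: float(next(counter)))
```

The minimum guards against stopping on an early plateau, while the maximum caps the training cost when the bound keeps changing.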
fit(X, y=None)
Fit the model according to the given training data.
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
The trained model.
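For readers unfamiliar with the BOW format expected by fit, each document is a list of (token_id, count) pairs. The sketch below builds such documents in plain Python. The to_bow helper and the token2id vocabulary are hypothetical and for illustration only; in gensim itself, Dictionary.doc2bow produces this representation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a token list to a sorted list of (token_id, count) pairs.

    Tokens missing from the vocabulary are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"human": 0, "computer": 1, "system": 2}   # hypothetical vocabulary
docs = [["human", "computer", "human"], ["system", "unknown"]]

# X is an iterable of list of (int, number), the shape fit() is documented to accept.
X = [to_bow(doc, token2id) for doc in docs]
```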
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
X_new – Transformed array.
numpy array of shape [n_samples, n_features_new]
get_params(deep=True)
Get parameters for this estimator.
deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
params – Parameter names mapped to their values.
mapping of string to any
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
self
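The <component>__<parameter> naming convention can be illustrated by splitting each key at the first double underscore. This is a simplified sketch of the idea rather than scikit-learn's implementation: split_nested_params and the step name "ldaseq" are hypothetical.

```python
def split_nested_params(params):
    """Group {'component__param': value} keys by component name.

    Keys without '__' belong to the estimator itself and are
    collected under the empty-string component.
    """
    grouped = {}
    for key, value in params.items():
        component, delim, sub_key = key.partition("__")
        if not delim:                      # plain parameter, no nesting
            component, sub_key = "", key
        grouped.setdefault(component, {})[sub_key] = value
    return grouped


# "ldaseq" stands for a hypothetical pipeline step holding an LdaSeqTransformer.
grouped = split_nested_params(
    {"ldaseq__num_topics": 5, "ldaseq__passes": 20, "verbose": True}
)
```

Here the two prefixed keys would be routed to the "ldaseq" component, while verbose stays with the outer object.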
transform(docs)
Infer the topic distribution for docs.
docs ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format to be transformed.
The topic representation of each document.
numpy.ndarray of shape [len(docs), num_topics]
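Since each row of the result holds one document's topic weights, a common follow-up is to pick the dominant topic per document. The sketch below uses plain Python lists with made-up values standing in for the returned array. dominant_topics is a hypothetical helper; on a real numpy.ndarray, numpy.argmax with axis=1 gives the same result.

```python
def dominant_topics(doc_topics):
    """Return the index of the highest-weight topic for each document row."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topics]


# Made-up stand-in for transform() output: 3 documents, num_topics=2.
doc_topics = [
    [0.98, 0.02],
    [0.03, 0.97],
    [0.60, 0.40],
]

labels = dominant_topics(doc_topics)
```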