`models.nmf` – Non-Negative Matrix Factorization

Online Non-Negative Matrix Factorization. An implementation of the efficient incremental algorithm of Renbo Zhao, Vincent Y. F. Tan et al.

This NMF implementation updates in a streaming fashion and works best with sparse corpora.

W is a word-topic matrix
h is a topic-document matrix
v is an input corpus batch, word-document matrix
A, B - matrices that accumulate information from every consecutive chunk. A = h.dot(ht), B = v.dot(ht), where ht is the transpose of h.

The idea of the algorithm is as follows:

Initialize W, A and B matrices

Input the corpus
Split the corpus into batches

for v in batches:
    infer h:
        do coordinate gradient descent step to find h that minimizes the l2 norm of (v - Wh)

        bound h so that it is non-negative

    update A and B:
        A = h.dot(ht)
        B = v.dot(ht)

    update W:
        do gradient descent step to find W that minimizes 0.5*trace(WtWA) - trace(WtB)
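
The loop above can be sketched in plain NumPy. This is only an illustrative sketch under stated assumptions, not the Gensim implementation: it uses projected gradient descent for both h and W (instead of the coordinate descent mentioned above), dense arrays, plain running sums for A and B, and step sizes derived from spectral norms. The function name `online_nmf_sketch` and its parameters are hypothetical.

```python
import numpy as np


def online_nmf_sketch(batches, num_topics, n_inner=50, seed=0):
    """Illustrative online NMF loop (an assumption-laden sketch, not Gensim's code)."""
    rng = np.random.default_rng(seed)
    batches = list(batches)
    n_words = batches[0].shape[0]

    W = rng.random((n_words, num_topics))    # word-topic matrix
    A = np.zeros((num_topics, num_topics))   # accumulates h.dot(ht)
    B = np.zeros((n_words, num_topics))      # accumulates v.dot(ht)

    for v in batches:                        # v has shape (n_words, n_docs)
        # Infer h: projected gradient steps on 0.5 * ||v - W h||^2.
        h = rng.random((num_topics, v.shape[1]))
        step_h = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-12)
        for _ in range(n_inner):
            grad_h = W.T @ (W @ h - v)
            h = np.maximum(h - step_h * grad_h, 0.0)   # bound h to be non-negative

        # Update A and B (assumption: simple running sums over batches).
        A += h @ h.T
        B += v @ h.T

        # Update W: projected gradient steps on 0.5*trace(WtWA) - trace(WtB),
        # whose gradient with respect to W is W A - B (A is symmetric).
        step_w = 1.0 / (np.linalg.norm(A, 2) + 1e-12)
        for _ in range(n_inner):
            grad_w = W @ A - B
            W = np.maximum(W - step_w * grad_w, 0.0)   # keep W non-negative
    return W


# Usage on random non-negative data: 20 words, 30 documents, batches of 10.
data = np.random.default_rng(1).random((20, 30))
batches = [data[:, i:i + 10] for i in range(0, 30, 10)]
W = online_nmf_sketch(batches, num_topics=3)
```

Because every update is followed by a projection onto the non-negative orthant, W and h stay non-negative throughout, which is the constraint that distinguishes NMF from an unconstrained factorization.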

Examples

Train an NMF model using a Gensim corpus

>>> from gensim.models import Nmf
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora.dictionary import Dictionary
>>>
>>> # Create a corpus from a list of texts
>>> common_dictionary = Dictionary(common_texts)
>>> common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
>>>
>>> # Train the model on the corpus.
>>> nmf = Nmf(common_corpus, num_topics=10)

Save a model to disk, or reload a pre-trained model

>>> from gensim.test.utils import datapath
>>>
>>> # Save model to disk.
>>> temp_file = datapath("model")
>>> nmf.save(temp_file)
>>>
>>> # Load a potentially pretrained model from disk.
>>> nmf = Nmf.load(temp_file)

Infer vectors for new documents

>>> # Create a new corpus, made of previously unseen documents.
>>> other_texts = [
...     ['computer', 'time', 'graph'],
...     ['survey', 'response', 'eps'],
...     ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = nmf[unseen_doc]  # get topic probability distribution for a document

Update the model by incrementally training on the new corpus

>>> nmf.update(other_corpus)
>>> vector = nmf[unseen_doc]

A lot of parameters can be tuned to optimize training for your specific case

>>> nmf = Nmf(common_corpus, num_topics=50, kappa=0.1, eval_every=5)  # decrease training step size

NMF should be used whenever one needs an extremely fast and memory-optimized topic model.

class gensim.models.nmf.Nmf(corpus=None, num_topics=100, id2word=None, chunksize=2000, passes=1, kappa=1.0, minimum_probability=0.01, w_max_iter=200, w_stop_condition=0.0001, h_max_iter=50, h_stop_condition=0.001, eval_every=10, normalize=True, random_state=None)

Bases: TransformationABC, BaseTopicModel

Online Non-Negative Matrix Factorization.

Renbo Zhao et al.: “Online Nonnegative Matrix Factorization with Outliers”

Parameters

corpus (iterable of list of (int, float) or csc_matrix with the shape (n_tokens, n_documents), optional) – Training corpus. Can be either an iterable of documents, which are lists of (word_id, word_count), or a sparse csc matrix of BOWs for each document. If not specified, the model is left uninitialized (presumably, to be trained later with update()).
num_topics (int, optional) – Number of topics to extract.
id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
chunksize (int, optional) – Number of documents to be used in each training chunk.
passes (int, optional) – Number of full passes over the training corpus. Leave at default passes=1 if your input is an iterator.
kappa (float, optional) – Gradient descent step size. Larger value makes the model train faster, but could lead to non-convergence if set too large.
minimum_probability (float, optional) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.
w_max_iter (int, optional) – Maximum number of iterations to train W per batch.
w_stop_condition (float, optional) – If the error difference falls below this value, training of W stops for the current batch.
h_max_iter (int, optional) – Maximum number of iterations to train h per batch.
h_stop_condition (float, optional) – If the error difference falls below this value, training of h stops for the current batch.
eval_every (int, optional) – Number of batches after which l2 norm of (v - Wh) is computed. Decreases performance if set too low.
normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.
random_state ({np.random.RandomState, int}, optional) – Seed for random generator. Needed for reproducibility.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters

event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

This method will automatically add the following key-values to event, so you don’t have to specify them:
- datetime: the current date & time
- gensim: the current Gensim version
- python: the current Python version
- platform: the current platform
- event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

get_document_topics(bow, minimum_probability=None, normalize=None)

Get the topic distribution for the given document.

Parameters

bow (list of (int, float)) – The document in BOW format.
minimum_probability (float) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.
normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.

Return type

list of (int, float)

get_term_topics(word_id, minimum_probability=None, normalize=None)

Get the most relevant topics to the given word.

Parameters

word_id (int) – The word for which the topic distribution will be computed.
minimum_probability (float, optional) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.
normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

Return type

list of (int, float)

get_topic_terms(topicid, topn=10, normalize=None)

Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic(), which represents words by the actual strings.

Parameters

topicid (int) – The ID of the topic to be returned.
topn (int, optional) – Number of the most significant words that are associated with the topic.
normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

Word ID - probability pairs for the most relevant words generated by the topic.

Return type

list of (int, float)

get_topics(normalize=None)

Get the term-topic matrix learned during inference.

Parameters

normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

The probability for each word in each topic, shape (num_topics, vocabulary_size).

Return type

numpy.ndarray

l2_norm(v)

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
