models.lsimodel – Latent Semantic Indexing¶
Module for Latent Semantic Analysis (aka Latent Semantic Indexing) in Python.
Implements fast truncated SVD (Singular Value Decomposition). The SVD decomposition can be updated with new observations at any time, for an online, incremental, memory-efficient training.
This module actually contains several algorithms for decomposition of large corpora, a combination of which effectively and transparently allows building LSI models for:

- corpora much larger than RAM: only constant memory is needed, independent of the corpus size,
- corpora that are streamed: documents are only accessed sequentially, never random-accessed,
- corpora that cannot even be temporarily stored: each document can only be seen once and must be processed immediately (one-pass algorithm),
- distributed computing for very large corpora, making use of a cluster of machines.
Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:
| algorithm | serial | distributed |
| one-pass merge algorithm | 5h14m | 1h41m |
| multi-pass stochastic algo (with 2 power iterations) | 5h39m | N/A |
serial = Core 2 Duo MacBook Pro 2.53Ghz, 4GB RAM, libVec
distributed = cluster of four logical nodes on three physical machines, each with dual core Xeon 2.0GHz, 4GB RAM, ATLAS
Note: The stochastic algo could be distributed too, but most time is already spent reading/decompressing the input from disk in its 4 passes. The extra network traffic due to data distribution across cluster nodes would likely make it slower.
LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100)¶
Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).
The main methods are:

1. the constructor, which initializes the projection into the latent topic space,
2. the [] method, which returns the representation of any input document in the latent space,
3. the add_documents() method, which allows for incrementally updating the model with new documents.
The left singular vectors are stored in lsi.projection.u, singular values in lsi.projection.s. Right singular vectors can be reconstructed from the output of lsi[training_corpus], if needed. See also the FAQ.
Model persistency is achieved via its load/save methods.
num_topics is the number of requested factors (latent dimensions).
After the model has been trained, you can estimate topics for an arbitrary, unseen document, using the topics = self[document] dictionary notation. You can also add new training documents via add_documents(), so that training can be stopped and resumed at any time, and the LSI transformation is available at any point.
If you specify a corpus, it will be used to train the model. See the method add_documents for a description of the chunksize and decay parameters.
Turn onepass off to force a multi-pass stochastic algorithm.
power_iters and extra_samples affect the accuracy of the stochastic multi-pass algorithm, which is used either internally (onepass=True) or as the front-end algorithm (onepass=False). Increasing the number of power iterations improves accuracy, but lowers performance. See the wall-clock benchmarks above for some hard numbers.
Turn on distributed to enable distributed computing.
>>> lsi = LsiModel(corpus, num_topics=10)
>>> print(lsi[doc_tfidf])  # project some document into LSI space
>>> lsi.add_documents(corpus2)  # update LSI on additional documents
>>> print(lsi[doc_tfidf])
add_documents(corpus, chunksize=None, decay=None)¶
Update singular value decomposition to take into account a new corpus of documents.
Training proceeds in chunks of chunksize documents at a time. The size of chunksize is a tradeoff between increased speed (bigger chunksize) vs. lower memory footprint (smaller chunksize). If the distributed mode is on, each chunk is sent to a different worker/computer.
Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually “forget” old observations (documents) and give more preference to new ones.
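The chunked processing described above can be pictured with a small stdlib-only helper that splits a document stream into batches of chunksize items. This is an illustration of the consumption pattern, not gensim's actual implementation, and the name iter_chunks is made up:

```python
from itertools import islice

def iter_chunks(stream, chunksize):
    """Yield successive lists of at most `chunksize` items from `stream`,
    mirroring how add_documents() consumes the corpus one chunk at a time."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk

# a stream of 5 "documents" processed in chunks of 2
chunks = list(iter_chunks(range(5), 2))  # [[0, 1], [2, 3], [4]]
```

In distributed mode, each such chunk would be dispatched to a different worker.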
load(fname, *args, **kwargs)¶
Load a previously saved object from file (also see save).
Large arrays can be memmap'ed back as read-only (shared memory) by setting mmap='r':
>>> LsiModel.load(fname, mmap='r')
print_debug(num_topics=5, num_words=10)¶
Print (to log) the most salient words of the first num_topics topics.
Unlike print_topics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
print_topic(topicno, topn=10)¶
Return a single topic as a formatted string. See show_topic() for parameters.
>>> lsimodel.print_topic(10, topn=5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
print_topics(num_topics=5, num_words=10)¶
Alias for show_topics() that prints the num_words most probable words for the first num_topics topics to log. Set num_topics=-1 to print all topics.
save(fname, *args, **kwargs)¶
Save the model to file.
Large internal arrays may be stored into separate files, with fname as prefix.
Note: do not save as a compressed file if you intend to load the file back with mmap.
show_topic(topicno, topn=10)¶
Return a specified topic (= left singular vector), 0 <= topicno < self.num_topics, as a list of (word, weight) 2-tuples.
Return only the topn words which contribute the most to the direction of the topic (both negative and positive).
>>> lsimodel.show_topic(10, topn=5)
[("category", -0.340), ("$M$", 0.298), ("algebra", 0.183), ("functor", -0.174), ("operator", -0.168)]
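The formatted string produced by print_topic() is a simple join over these (word, weight) pairs. A minimal sketch of that formatting (the helper name format_topic is hypothetical, not part of gensim):

```python
def format_topic(pairs, topn=10):
    """Format (word, weight) pairs in the style of print_topic(),
    e.g. '-0.340 * "category" + 0.298 * "$M$"'. Illustrative only."""
    return " + ".join('%.3f * "%s"' % (weight, word) for word, weight in pairs[:topn])

pairs = [("category", -0.340), ("$M$", 0.298), ("algebra", 0.183)]
topic_str = format_topic(pairs, topn=2)  # '-0.340 * "category" + 0.298 * "$M$"'
```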
show_topics(num_topics=-1, num_words=10, log=False, formatted=True)¶
Return num_topics most significant topics (return all by default). For each topic, show num_words most significant words (10 words by default).
The topics are returned as a list – a list of strings if formatted is True, or a list of (word, weight) 2-tuples if False.
If log is True, also output this result to log.
Projection(m, k, docs=None, use_svdlibc=False, power_iters=2, extra_dims=100)¶
Construct the (U, S) projection from a corpus docs. The projection can be later updated by merging it with another Projection via self.merge().
This is the class taking care of the ‘core math’; interfacing with corpora, splitting large corpora into chunks and merging them etc. is done through the higher-level LsiModel class.
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
merge(other, decay=1.0)¶
Merge this Projection with another.
The content of other is destroyed in the process, so pass this function a copy of other if you need it further.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset(), pickle_protocol=2)¶
Save the object to file (also see load).
fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
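The automatic detection can be modeled as a size threshold over the object's array attributes. Below is a simplified stdlib-only sketch (the helper pick_separately and the byte counts are hypothetical; the real code inspects numpy/scipy.sparse arrays):

```python
SEP_LIMIT = 10485760  # 10 MB, the default sep_limit

def pick_separately(array_sizes, sep_limit=SEP_LIMIT):
    """Given a mapping of attribute name -> size in bytes, return the
    attributes large enough to be stored in separate files (simplified)."""
    return [name for name, nbytes in array_sizes.items() if nbytes > sep_limit]

# e.g. a large projection.u versus a tiny projection.s
chosen = pick_separately({"projection.u": 50 * 1024 * 1024, "projection.s": 1600})
```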
ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.
pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
clip_spectrum(s, k, discard=0.001)¶
Given eigenvalues s, return how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.
This will ignore the tail of the spectrum with relative combined mass < min(discard, 1/k).
The returned value is clipped against k (= never return more than k).
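The clipping rule can be reimplemented in a few lines of plain Python. This is a sketch of the logic described above, not gensim's numpy-based code:

```python
def clip_spectrum_sketch(s, k, discard=0.001):
    """Return how many of the eigenvalues `s` (sorted descending) to keep:
    drop the tail whose relative combined mass is < min(discard, 1/k),
    and never keep more than k factors."""
    total = sum(s)
    threshold = min(discard, 1.0 / k)
    cumulative = 0.0
    kept = 1  # one more than the count of entries above the threshold
    for value in s:
        cumulative += value
        if abs(1.0 - cumulative / total) > threshold:
            kept += 1
    return min(k, kept)

# the two tiny trailing values are discarded as spurious
factors = clip_spectrum_sketch([10.0, 5.0, 1.0, 0.0001, 0.00001], k=5)  # 3
```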
print_debug(id2token, u, s, topics, num_words=10, num_neg=None)¶
stochastic_svd(corpus, rank, num_terms, chunksize=20000, extra_dims=None, power_iters=0, dtype=numpy.float64, eps=1e-06)¶
Run truncated Singular Value Decomposition (SVD) on a sparse input.
Return (U, S): the left singular vectors and the singular values of the input data stream corpus . The corpus may be larger than RAM (iterator of vectors).
This may return less than the requested number of top rank factors, in case the input itself is of lower rank. The extra_dims (oversampling) and especially power_iters (power iterations) parameters affect accuracy of the decomposition.
This algorithm uses 2 + power_iters passes over the input data. In case you can only afford a single pass, set onepass=True in LsiModel and avoid using this function directly.
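Since corpus is consumed as a plain iterable, the number of passes is easy to verify empirically with a small wrapper. The CountingCorpus class below is an illustration for experimentation, not part of gensim:

```python
class CountingCorpus:
    """Wrap a corpus and count how many full passes are made over it;
    handy for checking the '2 + power_iters passes' behaviour."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.passes = 0

    def __iter__(self):
        self.passes += 1  # each new iteration over the corpus is one pass
        return iter(self.docs)

corpus = CountingCorpus([[(0, 1.0)], [(1, 2.0)]])
for _ in range(3):          # stand-in for an algorithm making 3 passes
    for _doc in corpus:
        pass
```

Passing such a wrapper as the corpus argument lets you confirm how many times a given configuration actually reads the input.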
The decomposition algorithm is based on Halko, Martinsson, Tropp. Finding structure with randomness, 2009.
Note: If corpus is a scipy.sparse matrix instead, it is assumed the whole corpus fits into core memory and a different (more efficient) code path is chosen.