models.lsimodel – Latent Semantic Indexing
Module for Latent Semantic Analysis (aka Latent Semantic Indexing).
Implements fast truncated SVD (Singular Value Decomposition). The decomposition can be updated with new observations at any time, allowing online, incremental, memory-efficient training.
This module actually contains several algorithms for decomposition of large corpora, a combination of which effectively and transparently allows building LSI models for:
corpora much larger than RAM: only constant memory is needed, independent of the corpus size
corpora that are streamed: documents are only accessed sequentially, no random access
corpora that cannot be even temporarily stored: each document can only be seen once and must be processed immediately (one-pass algorithm)
distributed computing for very large corpora, making use of a cluster of machines
Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:
| algorithm | serial | distributed |
|---|---|---|
| one-pass merge algorithm | 5h14m | 1h41m |
| multi-pass stochastic algo (with 2 power iterations) | 5h39m | N/A [1] |
serial = Core 2 Duo MacBook Pro 2.53GHz, 4GB RAM, libVec
distributed = cluster of four logical nodes on three physical machines, each with dual core Xeon 2.0GHz, 4GB RAM, ATLAS
Examples
>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> vectorized_corpus = model[common_corpus] # vectorize input corpus in BoW format
[1] The stochastic algo could be distributed too, but most time is already spent reading/decompressing the input from disk in its 4 passes. The extra network traffic due to data distribution across cluster nodes would likely make it slower.
gensim.models.lsimodel.LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel
Model for Latent Semantic Indexing.
The decomposition algorithm is described in “Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms”.
Notes
gensim.models.lsimodel.LsiModel.projection.u - left singular vectors,
gensim.models.lsimodel.LsiModel.projection.s - singular values,
model[training_corpus] - right singular vectors (can be reconstructed if needed).
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus[:3], id2word=common_dictionary) # train model
>>> vector = model[common_corpus[4]] # apply model to BoW document
>>> model.add_documents(common_corpus[4:]) # update model with new documents
>>> tmp_fname = get_tmpfile("lsi.model")
>>> model.save(tmp_fname) # save model
>>> loaded_model = LsiModel.load(tmp_fname) # load model
Construct an LsiModel object.
Either corpus or id2word must be supplied in order to train the model.
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
num_topics (int, optional) – Number of requested factors (latent dimensions).
id2word (dict of {int: str}, optional) – ID to word mapping.
chunksize (int, optional) – Number of documents to be used in each training chunk.
decay (float, optional) – Weight of existing observations relative to new ones.
distributed (bool, optional) – If True, distributed mode (parallel execution on several machines) is used.
onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but lowers performance.
extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
dtype (type, optional) – Enforces a type for elements of the decomposed matrix.
add_documents(corpus, chunksize=None, decay=None)
Update model with new corpus.
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
chunksize (int, optional) – Number of documents to be used in each training chunk, will use self.chunksize if not specified.
decay (float, optional) – Weight of existing observations relative to new ones, will use self.decay if not specified.
Notes
Training proceeds in chunks of chunksize documents at a time. The size of chunksize is a tradeoff between increased speed (bigger chunksize) vs. lower memory footprint (smaller chunksize). If the distributed mode is on, each chunk is sent to a different worker/computer.
get_topics()
Get the topic vectors.
Notes
The number of topics can actually be smaller than self.num_topics, if there were not enough factors in the matrix (real rank of input matrix smaller than self.num_topics).
np.ndarray – The term-topic matrix with shape (num_topics, vocabulary_size).
load(fname, *args, **kwargs)
Load a previously saved object using save() from file.
Notes
Large arrays can be memmap'ed back as read-only (shared memory) by setting the mmap='r' parameter.
fname (str) – Path to file that contains LsiModel.
*args – Variable length argument list, see gensim.utils.SaveLoad.load().
**kwargs – Arbitrary keyword arguments, see gensim.utils.SaveLoad.load().
Loaded instance.
IOError – When the method is called on an instance (it should be called on the class).
print_debug(num_topics=5, num_words=10)
Print (to log) the most salient words of the first num_topics topics.
Unlike print_topics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
Alias for the module-level print_debug() function.
num_topics (int, optional) – The number of topics to be selected (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
print_topic(topicno, topn=10)
Get a single topic as a formatted string.
topicno (int) – Topic id.
topn (int) – Number of words from topic that will be used.
str – String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + … '.
print_topics(num_topics=20, num_words=10)
Get the most significant topics (alias for show_topics() method).
num_topics (int, optional) – The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
list of (int, list of (str, float)) – Sequence with (topic_id, [(word, value), … ]).
save(fname, *args, **kwargs)
Save the model to a file.
Notes
Large internal arrays may be stored into separate files, with fname as prefix.
Warning
Do not save as a compressed file if you intend to load the file back with mmap.
fname (str) – Path to output file.
*args – Variable length argument list, see gensim.utils.SaveLoad.save().
**kwargs – Arbitrary keyword arguments, see gensim.utils.SaveLoad.save().
show_topic(topicno, topn=10)
Get the words that define a topic along with their contribution.
This is actually the left singular vector of the specified topic.
The most important words in defining the topic (greatest absolute value) are included in the output, along with their contribution to the topic.
topicno (int) – The topic's ID number.
topn (int) – Number of words to be included in the result.
list of (str, float) – Topic representation in BoW format.
show_topics(num_topics=-1, num_words=10, log=False, formatted=True)
Get the most significant topics.
num_topics (int, optional) – The number of topics to be selected, if -1 - all topics will be in result (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
log (bool, optional) – If True - log topics with logger.
formatted (bool, optional) – If True - each topic represented as string, otherwise - in BoW format.
list of (int, str) – If formatted=True, return sequence with (topic_id, string representation of topics) OR
list of (int, list of (str, float)) – Otherwise, return sequence with (topic_id, [(word, value), … ]).
gensim.models.lsimodel.Projection(m, k, docs=None, use_svdlibc=False, power_iters=2, extra_dims=100, dtype=<class 'numpy.float64'>)
Bases: gensim.utils.SaveLoad
Low dimensional projection of a term-document matrix.
This class takes care of the 'core math'; interfacing with corpora, splitting large corpora into chunks, merging them, etc. is done through the higher-level LsiModel class.
Notes
The projection can be later updated by merging it with another Projection via merge(). This is how incremental training actually happens.
Construct the (U, S) projection from a corpus.
m (int) – Number of features (terms) in the corpus.
k (int) – Desired rank of the decomposed matrix.
docs ({iterable of list of (int, float), scipy.sparse.csc}) – Corpus in BoW format or as sparse matrix.
use_svdlibc (bool, optional) – If True - will use sparsesvd library, otherwise - our own version will be used.
power_iters (int, optional) – Number of power iteration steps to be used. Tune to improve accuracy.
extra_dims (int, optional) – Extra samples to be used besides the rank k. Tune to improve accuracy.
dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
empty_like()
Get an empty Projection with the same parameters as the current object.
An empty copy (without corpus) of the current projection.
load(fname, mmap=None)
Load an object previously saved using save() from a file.
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
object – Object loaded from fname.
AttributeError – When called on an object instance instead of class (this is a class method).
merge(other, decay=1.0)
Merge current Projection instance with another.
Warning
The content of other is destroyed in the process, so pass this function a copy of other if you need it further. The other Projection is expected to contain the same number of features.
other (Projection) – The Projection object to be merged into the current one. It will be destroyed after merging.
decay (float, optional) – Weight of existing observations relative to new ones. Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually "forget" old observations (documents) and give more preference to new ones.
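The effect of merging two (U, S) pairs, and of the decay weight, can be sketched with plain NumPy. This is a naive illustration, not the routine gensim uses internally: it stacks the two scaled bases and re-decomposes them, and the function name naive_merge is hypothetical. With decay=1.0 it gives the same singular values as decomposing both document blocks together.

```python
import numpy as np


def naive_merge(u1, s1, u2, s2, decay=1.0, k=None):
    """Merge two (U, S) pairs by re-decomposing [decay * U1 * S1, U2 * S2]."""
    stacked = np.hstack([decay * u1 * s1, u2 * s2])
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    k = len(s) if k is None else k
    return u[:, :k], s[:k]


rng = np.random.default_rng(0)
old_docs = rng.normal(size=(6, 4))  # 6 terms, 4 older documents
new_docs = rng.normal(size=(6, 5))  # 6 terms, 5 newer documents

u1, s1, _ = np.linalg.svd(old_docs, full_matrices=False)
u2, s2, _ = np.linalg.svd(new_docs, full_matrices=False)

# decay=1.0 treats old and new observations equally.
u_merged, s_merged = naive_merge(u1, s1, u2, s2, decay=1.0)

# Reference: decompose all documents at once.
s_direct = np.linalg.svd(np.hstack([old_docs, new_docs]), compute_uv=False)

# decay < 1.0 shrinks the contribution of the older block.
_, s_decayed = naive_merge(u1, s1, u2, s2, decay=0.5)
```

The right singular vectors are not needed for the merge, which is why only U and S have to be kept between updates.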
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
gensim.models.lsimodel.ascarray(a, name='')
Return a contiguous array in memory (C order).
a (numpy.ndarray) – Input array.
name (str, optional) – Array name, used for logging purposes.
np.ndarray – Contiguous array (row-major order) of same shape and content as a.
gensim.models.lsimodel.asfarray(a, name='')
Get an array laid out in Fortran order in memory.
a (numpy.ndarray) – Input array.
name (str, optional) – Array name, used only for logging purposes.
np.ndarray – The input a in Fortran (column-major) order.
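The difference between the two memory layouts can be seen with NumPy's own conversion functions. This sketch uses np.ascontiguousarray and np.asfortranarray directly, on the assumption that ascarray and asfarray produce the same layouts; it does not call the gensim helpers themselves.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # NumPy default: C (row-major) order

a_f = np.asfortranarray(a)       # same values, column-major layout
a_c = np.ascontiguousarray(a_f)  # back to row-major layout

print(a_f.flags["F_CONTIGUOUS"], a_f.flags["C_CONTIGUOUS"])
print(a_c.flags["C_CONTIGUOUS"])
```

Only the layout in memory changes; shape and content stay the same, which matters mainly when arrays are handed to BLAS/LAPACK routines that expect a particular order.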
gensim.models.lsimodel.clip_spectrum(s, k, discard=0.001)
Find how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.
s (list of float) – Eigenvalues of the original matrix.
k (int) – Maximum desired rank (number of factors).
discard (float) – Fraction of the spectrum's energy to be discarded (the default 0.001 corresponds to 0.1%).
int – Rank (number of factors) of the reduced matrix.
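One way to implement this kind of spectrum clipping is sketched below. It is an assumption about the general approach (drop trailing factors whose remaining share of the total energy falls below a threshold), and the function name clip_rank is hypothetical.

```python
import numpy as np


def clip_rank(spectrum, k, discard=0.001):
    """Return how many leading factors to keep, at most k."""
    spectrum = np.asarray(spectrum, dtype=np.float64)
    # Share of the total energy that is still unexplained after each factor.
    remaining = np.abs(1.0 - np.cumsum(spectrum / spectrum.sum()))
    # Keep factors while the unexplained share is above the threshold, plus one.
    keep = 1 + int(np.count_nonzero(remaining > min(discard, 1.0 / k)))
    return min(k, keep)


spectrum = np.array([5.0, 3.0, 2.0, 1e-9])  # the last value is numerical noise
print(clip_rank(spectrum, k=4))  # keeps 3 of the 4 factors
print(clip_rank(spectrum, k=2))  # never exceeds the requested rank
```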
gensim.models.lsimodel.print_debug(id2token, u, s, topics, num_words=10, num_neg=None)
Log the most salient words per topic.
id2token (Dictionary) – Mapping from ID to word in the Dictionary.
u (np.ndarray) – The 2D U decomposition matrix.
s (np.ndarray) – The 1D reduced array of eigenvalues used for decomposition.
topics (list of int) – Sequence of topic IDs to be printed
num_words (int, optional) – Number of words to be included for each topic.
num_neg (int, optional) – Number of words with a negative contribution to a topic that should be included.
gensim.models.lsimodel.stochastic_svd(corpus, rank, num_terms, chunksize=20000, extra_dims=None, power_iters=0, dtype=<class 'numpy.float64'>, eps=1e-06)
Run truncated Singular Value Decomposition (SVD) on a sparse input.
corpus ({iterable of list of (int, float), scipy.sparse}) – Input corpus as a stream (does not have to fit in RAM) or a sparse matrix of shape (num_terms, num_documents).
rank (int) – Desired number of factors to be retained after decomposition.
num_terms (int) – The number of features (terms) in corpus.
chunksize (int, optional) – Number of documents to be used in each training chunk.
extra_dims (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but lowers performance.
dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
eps (float, optional) – Fraction of the spectrum's energy to be discarded.
Notes
The corpus may be larger than RAM (an iterator of vectors). If corpus is a scipy.sparse.csc matrix instead, the whole corpus is assumed to fit into core memory and a different (more efficient) code path is chosen. This may return fewer than the requested number of top-rank factors if the input itself is of lower rank. The extra_dims (oversampling) and especially power_iters (power iterations) parameters affect the accuracy of the decomposition.
This algorithm uses 2 + power_iters passes over the input data. If you can only afford a single pass, set onepass=True in LsiModel and avoid using this function directly.
The decomposition algorithm is based on “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”.
(np.ndarray 2D, np.ndarray 1D) – The left singular vectors and the singular values of the corpus.
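The roles of oversampling and power iterations can be illustrated with a small in-memory randomized SVD in NumPy. This is a simplified sketch of the general technique from the cited paper, not the streamed gensim implementation; the function name randomized_svd and its defaults are hypothetical.

```python
import numpy as np


def randomized_svd(a, rank, extra_dims=10, power_iters=2, seed=0):
    """Approximate the top `rank` left singular vectors and singular values of `a`."""
    rng = np.random.default_rng(seed)
    num_terms, num_docs = a.shape
    samples = min(num_terms, rank + extra_dims)  # oversampling

    # Sample the range of `a` with a random Gaussian test matrix.
    q, _ = np.linalg.qr(a @ rng.normal(size=(num_docs, samples)))

    # Power iterations sharpen the basis; each one costs two more passes over `a`.
    for _ in range(power_iters):
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)

    # Decompose the small projected matrix, then map back to term space.
    u_small, s, _ = np.linalg.svd(q.T @ a, full_matrices=False)
    return (q @ u_small)[:, :rank], s[:rank]


rng = np.random.default_rng(1)
# A 50 x 40 term-document matrix of exact rank 3.
a = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))

u, s = randomized_svd(a, rank=3)
s_exact = np.linalg.svd(a, compute_uv=False)[:3]
```

For inputs whose singular values decay slowly, raising power_iters usually matters more than raising extra_dims, at the cost of additional passes over the data.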