models.lsimodel – Latent Semantic Indexing

Module for Latent Semantic Analysis (aka Latent Semantic Indexing).

Implements fast truncated SVD (Singular Value Decomposition). The decomposition can be updated with new observations at any time, which allows online, incremental, memory-efficient training.

This module contains several algorithms for decomposing large corpora. Combined, they transparently support building LSI models for:

  • corpora much larger than RAM: only constant memory is needed, independent of the corpus size
  • corpora that are streamed: documents are only accessed sequentially, no random access
  • corpora that cannot even be stored temporarily: each document can only be seen once and must be processed immediately (one-pass algorithm)
  • distributed computing for very large corpora, making use of a cluster of machines

Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:

algorithm                                               serial    distributed
one-pass merge algorithm                                5h14m     1h41m
multi-pass stochastic algo (with 2 power iterations)    5h39m     N/A [1]

serial = Core 2 Duo MacBook Pro 2.53GHz, 4GB RAM, vecLib

distributed = cluster of four logical nodes on three physical machines, each with dual core Xeon 2.0GHz, 4GB RAM, ATLAS

[1] The stochastic algo could be distributed too, but most time is already spent reading/decompressing the input from disk in its 4 passes. The extra network traffic due to data distribution across cluster nodes would likely make it slower.

Examples

>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> vectorized_corpus = model[common_corpus]  # vectorize input corpus in BoW format

class gensim.models.lsimodel.LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=numpy.float64)

Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel

Model for Latent Semantic Indexing.

The decomposition algorithm is described in “Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms”.

Notes

  • gensim.models.lsimodel.LsiModel.projection.u - left singular vectors,
  • gensim.models.lsimodel.LsiModel.projection.s - singular values,
  • model[training_corpus] - right singular vectors (can be reconstructed if needed).

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
>>> vector = model[common_corpus[4]]  # apply model to BoW document
>>> model.add_documents(common_corpus[4:])  # update model with new documents
>>> tmp_fname = get_tmpfile("lsi.model")
>>> model.save(tmp_fname)  # save model
>>> loaded_model = LsiModel.load(tmp_fname)  # load model

Construct an LsiModel object.

Either corpus or id2word must be supplied in order to train the model.

Parameters:
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
  • num_topics (int, optional) – Number of requested factors (latent dimensions).
  • id2word (dict of {int: str}, optional) – ID to word mapping.
  • chunksize (int, optional) – Number of documents to be used in each training chunk.
  • decay (float, optional) – Weight of existing observations relative to new ones.
  • distributed (bool, optional) – If True, distributed mode (parallel execution on several machines) is used.
  • onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.
  • power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but slows down training.
  • extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
  • dtype (type, optional) – Enforces a type for elements of the decomposed matrix.
add_documents(corpus, chunksize=None, decay=None)

Update model with new corpus.

Parameters:
  • corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
  • chunksize (int, optional) – Number of documents to be used in each training chunk. Defaults to self.chunksize if not specified.
  • decay (float, optional) – Weight of existing observations relative to new ones. Defaults to self.decay if not specified.

Notes

Training proceeds in chunks of chunksize documents at a time. The choice of chunksize is a tradeoff between speed (larger chunksize) and memory footprint (smaller chunksize). If distributed mode is on, each chunk is sent to a different worker/computer.

get_topics()

Get the topic vectors.

Notes

The number of topics can actually be smaller than self.num_topics, if there were not enough factors (real rank of input matrix smaller than self.num_topics).

Returns: The term-topic matrix with shape (num_topics, vocabulary_size).
Return type: np.ndarray
classmethod load(fname, *args, **kwargs)

Load an object previously saved with save() from a file.

Notes

Large arrays can be memmap'ed back as read-only (shared memory) by setting mmap='r'.

See also

save()

Returns: Loaded instance.
Return type: LsiModel
Raises: IOError – When the method is called on an instance (it should be called on the class).
print_debug(num_topics=5, num_words=10)

Print (to log) the most salient words of the first num_topics topics.

Unlike print_topics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.

Alias for the module-level function gensim.models.lsimodel.print_debug().

Parameters:
  • num_topics (int, optional) – The number of topics to be selected (ordered by significance).
  • num_words (int, optional) – The number of words to be included per topic (ordered by significance).
print_topic(topicno, topn=10)

Get a single topic as a formatted string.

Parameters:
  • topicno (int) – Topic id.
  • topn (int) – Number of words from topic that will be used.
Returns:

String representation of topic, like '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + …'.

Return type:

str

print_topics(num_topics=20, num_words=10)

Get the most significant topics (alias for show_topics() method).

Parameters:
  • num_topics (int, optional) – The number of topics to be selected (ordered by significance). If -1, all topics are returned.
  • num_words (int, optional) – The number of words to be included per topic (ordered by significance).
Returns:

Sequence with (topic_id, [(word, value), … ]).

Return type:

list of (int, list of (str, float))

save(fname, *args, **kwargs)

Save the model to a file.

Notes

Large internal arrays may be stored into separate files, with fname as prefix.

Warning

Do not save as a compressed file if you intend to load the file back with mmap.

See also

load()

show_topic(topicno, topn=10)

Get the words that define a topic along with their contribution.

This is the left singular vector of the specified topic. The words most important in defining the topic (in both the positive and the negative direction) are included in the result, along with their contribution to the topic.

Parameters:
  • topicno (int) – The topic's ID number.
  • topn (int) – Number of words to be included in the result.
Returns:

Topic representation in BoW format.

Return type:

list of (str, float)

show_topics(num_topics=-1, num_words=10, log=False, formatted=True)

Get the most significant topics.

Parameters:
  • num_topics (int, optional) – The number of topics to be selected (ordered by significance). If -1, all topics are returned.
  • num_words (int, optional) – The number of words to be included per topic (ordered by significance).
  • log (bool, optional) – If True, the topics are also written to the log.
  • formatted (bool, optional) – If True, each topic is represented as a string; otherwise, in BoW format.
Returns:

  • list of (int, str) – If formatted=True, a sequence of (topic_id, string representation of the topic), OR
  • list of (int, list of (str, float)) – Otherwise, a sequence of (topic_id, [(word, value), … ]).

class gensim.models.lsimodel.Projection(m, k, docs=None, use_svdlibc=False, power_iters=2, extra_dims=100, dtype=numpy.float64)

Bases: gensim.utils.SaveLoad

Lower dimension projections of a Term-Passage matrix.

This class takes care of the ‘core math’. Interfacing with corpora, splitting large corpora into chunks, merging them and so on is done through the higher-level LsiModel class.

Notes

The projection can be later updated by merging it with another Projection via merge().

Construct the (U, S) projection from a corpus.

Parameters:
  • m (int) – Number of features (terms) in the corpus.
  • k (int) – Desired rank of the decomposed matrix.
  • docs ({iterable of list of (int, float), scipy.sparse.csc}) – Corpus in BoW format or as sparse matrix.
  • use_svdlibc (bool, optional) – If True, the sparsesvd library is used; otherwise, gensim's own implementation is used.
  • power_iters (int, optional) – Number of power iteration steps to be used. Tune to improve accuracy.
  • extra_dims (int, optional) – Extra samples to be used besides the rank k. Tune to improve accuracy.
  • dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
empty_like()

Get an empty Projection with the same parameters as the current object.

Returns: An empty copy (without corpus) of the current projection.
Return type: Projection
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns: Object loaded from fname.
Return type: object
Raises: IOError – When the method is called on an instance (it should be called on the class).
merge(other, decay=1.0)

Merge current Projection instance with another.

Warning

The content of other is destroyed in the process, so pass this function a copy of other if you need it further. The other Projection is expected to contain the same number of features.

Parameters:
  • other (Projection) – The Projection object to be merged into the current one. It will be destroyed after merging.
  • decay (float, optional) – Weight of existing observations relative to new ones. Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually “forget” old observations (documents) and give more preference to new ones.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling is performed and all attributes are saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently. If a list of str, these attributes are stored in separate files and the automatic check is not performed.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that should not be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

gensim.models.lsimodel.ascarray(a, name='')

Return a contiguous array in memory (C order).

Parameters:
  • a (numpy.ndarray) – Input array.
  • name (str, optional) – Array name, used for logging purposes.
Returns:

Contiguous array (row-major order) of same shape and content as a.

Return type:

np.ndarray

gensim.models.lsimodel.asfarray(a, name='')

Get an array laid out in Fortran order in memory.

Parameters:
  • a (numpy.ndarray) – Input array.
  • name (str, optional) – Array name, used for logging purposes.
Returns:

The input a in Fortran (column-major) order.

Return type:

np.ndarray
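The difference between the two memory layouts can be sketched with plain NumPy. This assumes ascarray() and asfarray() behave like NumPy's ascontiguousarray() and asfortranarray(), apart from the logging; the gensim helpers themselves are not called here.

```python
import numpy as np

# NumPy creates arrays in C (row-major) order by default.
a = np.arange(6, dtype=np.float64).reshape(2, 3)

# Column-major copy: same shape and values, different layout in memory.
fortran = np.asfortranarray(a)

# Converting back yields a row-major array again.
c_again = np.ascontiguousarray(fortran)

print(a.flags.c_contiguous, fortran.flags.f_contiguous, c_again.flags.c_contiguous)
```

Only the layout changes; element values and shape are preserved by both conversions.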

gensim.models.lsimodel.clip_spectrum(s, k, discard=0.001)

Find how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.

Parameters:
  • s (list of float) – Eigenvalues of the original matrix.
  • k (int) – Maximum desired rank (number of factors).
  • discard (float) – Fraction of the spectrum’s energy to be discarded (the default 0.001 corresponds to 0.1%).
Returns:

Rank (number of factors) of the reduced matrix.

Return type:

int
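One way such a clipping rule could work is sketched below in plain NumPy. The function clip_spectrum_sketch is hypothetical and treats discard as a fraction of the total energy; it illustrates the idea and is not presented as the library's exact implementation.

```python
import numpy as np


def clip_spectrum_sketch(s, k, discard=0.001):
    """Hypothetical re-implementation: how many of at most `k` factors to keep."""
    s = np.asarray(s, dtype=np.float64)
    # Fraction of the total energy still unaccounted for after 1, 2, ... factors.
    remaining = np.abs(1.0 - np.cumsum(s / s.sum()))
    # Keep adding factors while more than `discard` (or 1/k, if smaller) remains.
    keep = 1 + int(np.count_nonzero(remaining > min(discard, 1.0 / k)))
    return min(k, keep)


eigenvalues = [80.0, 15.0, 4.99, 0.01]

kept_of_4 = clip_spectrum_sketch(eigenvalues, k=4)  # 3: the last factor carries only 0.01%
kept_of_2 = clip_spectrum_sketch(eigenvalues, k=2)  # 2: capped by the requested rank
```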

gensim.models.lsimodel.print_debug(id2token, u, s, topics, num_words=10, num_neg=None)

Log the most salient words per topic.

Parameters:
  • id2token (Dictionary) – Mapping from ID to word in the Dictionary.
  • u (np.ndarray) – The 2D U decomposition matrix.
  • s (np.ndarray) – The 1D reduced array of eigenvalues used for decomposition.
  • topics (list of int) – Sequence of topic IDs to be printed.
  • num_words (int, optional) – Number of words to be included for each topic.
  • num_neg (int, optional) – Number of words with a negative contribution to a topic that should be included.
gensim.models.lsimodel.stochastic_svd(corpus, rank, num_terms, chunksize=20000, extra_dims=None, power_iters=0, dtype=numpy.float64, eps=1e-06)

Run truncated Singular Value Decomposition (SVD) on a sparse input.

Parameters:
  • corpus ({iterable of list of (int, float), scipy.sparse}) – Input corpus as a stream (does not have to fit in RAM) or a sparse matrix of shape (num_terms, num_documents).
  • rank (int) – Desired number of factors to be retained after decomposition.
  • num_terms (int) – The number of features (terms) in corpus.
  • chunksize (int, optional) – Number of documents to be used in each training chunk.
  • extra_dims (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
  • power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but lowers performance.
  • dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
  • eps (float, optional) – Fraction of the spectrum’s energy to be discarded.

Notes

The corpus may be larger than RAM (an iterator of vectors). If corpus is a scipy.sparse.csc instead, it is assumed that the whole corpus fits into core memory, and a different (more efficient) code path is chosen. This may return fewer than the requested number of top rank factors if the input itself is of lower rank. The extra_dims (oversampling) and especially power_iters (power iterations) parameters affect the accuracy of the decomposition.

This algorithm uses 2 + power_iters passes over the input data. If you can only afford a single pass, set onepass=True in LsiModel and avoid using this function directly.

The decomposition algorithm is based on “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”.

Returns: The left singular vectors and the singular values of the corpus.
Return type: (np.ndarray 2D, np.ndarray 1D)