similarities.docsim – Document similarity queries

Computing similarities across a collection of documents in the Vector Space Model.

The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like "how similar is this query document to each document in the index?". The result is a vector of numbers, one float for each document in the index. Alternatively, you can request only the top-N most similar index documents to the query.

How It Works

The Similarity class splits the index into several smaller sub-indexes ("shards"), which are disk-based. If your entire index fits in memory (~hundreds of thousands of documents for 1GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler but do not scale as well: they keep the entire index in RAM, with no sharding.

Once the index has been initialized, you can query for document similarity simply by:

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> query = [(1, 2), (6, 1), (7, 2)]
>>>
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary))  # build the index
>>> similarities = index[query]  # get similarities between the query and all index documents
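
If you only need the top-N most similar documents rather than the full similarity vector, pass num_best when building the index; the query then returns the best matches as (document_index, similarity) pairs. A minimal sketch:

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary), num_best=3)
>>> query = [(1, 2), (6, 1), (7, 2)]
>>> top_matches = index[query]  # list of (document_index, similarity) pairs, best first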

If you have more query documents, you can submit them all at once, in a batch:

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> batch_of_documents = common_corpus[:]  # only as example
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
>>>
>>> for similarities in index[batch_of_documents]: # the batch is simply an iterable of documents, aka gensim corpus.
...     pass

The benefit of this batch (aka “chunked”) querying is much better performance. To see the speed-up on your machine, run python -m gensim.test.simspeed (compare to my results here).

There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries=indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>>
>>> index_tmpfile = get_tmpfile("index")
>>> index = Similarity(index_tmpfile, common_corpus, num_features=len(common_dictionary)) # build the index
>>>
>>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
...     pass

class gensim.similarities.docsim.MatrixSimilarity(corpus, num_best=None, dtype=numpy.float32, num_features=None, chunksize=256, corpus_len=None)

Compute cosine similarity against a corpus of documents by storing the index matrix in memory.

Unless the entire matrix fits into main memory, use Similarity instead.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> query = [(1, 2), (5, 4)]
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> sims = index[query]
Parameters:
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index.
  • dtype (numpy.dtype) – Datatype of the internal matrix.
  • num_features (int, optional) – Size of the dictionary.
  • chunksize (int, optional) – Size of chunk.
  • corpus_len (int, optional) – Size of the corpus; if not specified, the corpus will be scanned to determine its length.
get_similarities(query)

Get similarity between query and current index instance.

Warning

Do not use this function directly; use the self[query] syntax (__getitem__) instead.

Parameters:query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
Returns:Similarity matrix.
Return type:numpy.ndarray
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.
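
As a sketch of the typical round trip (the temporary filename is illustrative), the index can be saved and then loaded back, optionally memory-mapping large arrays that were stored separately:

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.similarities import MatrixSimilarity
>>>
>>> index = MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> output_fname = get_tmpfile("matrix_index")
>>> index.save(output_fname)
>>> loaded_index = MatrixSimilarity.load(output_fname)  # pass mmap='r' to memory-map large arrays stored separately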

See also

load()

class gensim.similarities.docsim.Shard(fname, index)

A proxy that represents a single shard instance within a Similarity index.

Basically just wraps MatrixSimilarity, SparseMatrixSimilarity, etc., so that the shard is mmap'ed from disk on request (query).

Parameters:
  • fname (str) – Path to top-level directory (file) to traverse for corpus documents.
  • index (SimilarityABC) – Index object.
fullname()

Get full path to shard file.

Returns:Path to shard instance.
Return type:str
get_document_id(pos)

Get index vector at position pos.

Parameters:pos (int) – Vector position.
Returns:Index vector. Type depends on underlying index.
Return type:{scipy.sparse.csr_matrix, numpy.ndarray}

Notes

The vector is of the same type as the underlying index (i.e., dense for MatrixSimilarity and scipy.sparse for SparseMatrixSimilarity).

get_index()

Load & get index.

Returns:Index instance.
Return type:SimilarityABC
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.similarities.docsim.Similarity(output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')

Compute cosine similarity of a dynamic query against a static corpus of documents (‘the index’).

Notes

Scalability is achieved by sharding the index into smaller pieces, each of which fits into core memory. The shards themselves are simply stored as files on disk and mmap'ed back as needed.

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.mm'))
>>> index_temp = get_tmpfile("index")
>>> index = Similarity(index_temp, corpus, num_features=400)  # create index
>>>
>>> query = next(iter(corpus))
>>> result = index[query]  # search similar to `query` in index
>>>
>>> for sims in index[corpus]: # if you have more query documents, you can submit them all at once, in a batch
...     pass
>>>
>>> # There is also a special syntax for when you need similarity of documents in the index
>>> # to the index itself (i.e. queries=indexed documents themselves). This special syntax
>>> # uses the faster, batch queries internally and **is ideal for all-vs-all pairwise similarities**:
>>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
...     pass

See also

MatrixSimilarity
Index similarity (dense with cosine distance).
SparseMatrixSimilarity
Index similarity (sparse with cosine distance).
SoftCosineSimilarity
Index similarity (with soft-cosine distance).
WmdSimilarity
Index similarity (with word-mover distance).
Parameters:
  • output_prefix (str) – Prefix for shard filenames. If None, a random filename in temp will be used.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • num_features (int) – Size of the dictionary (number of features).
  • num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index.
  • chunksize (int, optional) – Size of chunk (number of documents processed per query batch).
  • shardsize (int, optional) – Maximum number of documents per shard; choose it so that a shardsize x chunksize matrix of floats fits comfortably into memory.
  • norm ({'l1', 'l2'}, optional) – Normalization to use.

Notes

Documents are split (internally, transparently) into shards of shardsize documents each, and each shard is converted to a matrix for faster BLAS calls. Each shard is stored to disk under output_prefix.shard_number. If you don't specify an output prefix, a random filename in temp will be used. If your entire index fits in memory (~hundreds of thousands of documents for 1GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler but do not scale as well: they keep the entire index in RAM, with no sharding.
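
As a rough back-of-the-envelope sketch of what the shardsize guidance above implies with the default values (assuming 4-byte float32 entries):

>>> chunksize, shardsize = 256, 32768  # the default values
>>> bytes_needed = shardsize * chunksize * 4  # a shardsize x chunksize matrix of float32
>>> bytes_needed / 1024 ** 2  # in megabytes
32.0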

add_documents(corpus)

Extend the index with new documents.

Parameters:corpus (iterable of list of (int, number)) – Corpus in BoW format.

Notes

Internally, documents are buffered and then spilled to disk when there’s self.shardsize of them (or when a query is issued).

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.mm'))
>>> index_temp = get_tmpfile("index")
>>> index = Similarity(index_temp, corpus, num_features=400)  # create index
>>>
>>> one_more_corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index.add_documents(one_more_corpus)  # add more documents in corpus
check_moved()

Update shard locations, in case the directory containing the index has moved on the filesystem.

close_shard()

Force the latest shard to close (be converted to a matrix and stored to disk). Do nothing if no new documents were added since the last call.

Notes

The shard is closed even if it is not full yet (its size is smaller than self.shardsize). If documents are added later via add_documents() this incomplete shard will be loaded again and completed.

destroy()

Delete all files under self.output_prefix. The object is not usable anymore after calling this method.

get_similarities(doc)

Get similarity measures of the indexed documents to the given doc. Must be overridden in a subclass.

Parameters:doc (list of (int, number)) – Document in BoW format.
Raises:NotImplementedError – Always; this abstract method must be implemented in a subclass.
iter_chunks(chunksize=None)

Iteratively yield the index as chunks of documents, each of size <= chunksize.

Parameters:chunksize (int, optional) – Size of chunk; if None, self.chunksize will be used.

Notes

The chunk is returned in its raw form. The size of the chunk may be smaller than requested; it is up to the caller to check the result for real length.

Yields:numpy.ndarray or scipy.sparse.csr_matrix – Chunks of the index, each covering at most chunksize documents; the exact type depends on the underlying shard index.
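
A minimal sketch of chunked iteration over an index (built here from the bundled test corpus):

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> index = Similarity(get_tmpfile("index"), common_corpus, num_features=len(common_dictionary))
>>> for chunk in index.iter_chunks(chunksize=100):  # each chunk covers at most 100 indexed documents
...     pass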
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
query_shards(query)

Apply shard[query] to each shard in self.shards, returning the results as a sequence.

Parameters:query ({iterable of list of (int, number), list of (int, number)}) – Document in BoW format or corpus of documents.
Returns:Result of search.
Return type:(None, list of …)
reopen_shard()

Reopen incomplete shard.

save(fname=None, *args, **kwargs)

Save the object via pickling (also see load()) under the filename specified in the constructor.

Parameters:fname (str, optional) – Path to the output file; if None, the filename specified in the constructor (output_prefix) is used.

Notes

Calls close_shard() internally to spill any unfinished shards to disk first.

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> temp_fname = get_tmpfile("index")
>>> output_fname = get_tmpfile("saved_index")
>>>
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity(temp_fname, corpus, num_features=400)
>>>
>>> index.save(output_fname)
>>> loaded_index = Similarity.load(output_fname)
shardid2filename(shardid)

Get shard file by shardid.

Parameters:shardid (int) – Shard index.
Returns:Path to shard file.
Return type:str
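
For illustration, under the output_prefix.shard_number naming scheme described above, the mapping should look like this (the exact temp path is hypothetical):

>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.similarities import Similarity
>>>
>>> index = Similarity(get_tmpfile("myindex"), common_corpus, num_features=len(common_dictionary))
>>> index.shardid2filename(0)  # e.g. '/tmp/.../myindex.0'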
similarity_by_id(docpos)

Get similarity of a document, specified by its position docpos, against the whole index.

Parameters:docpos (int) – Document position in the index.
Returns:Similarities of the given document against every document in the index, in the same format as a regular query result.

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim.similarities import Similarity
>>>
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity('temp', corpus, num_features=400)
>>> similarities = index.similarity_by_id(1)
vector_by_id(docpos)

Get indexed vector corresponding to the document at position docpos.

Parameters:docpos (int) – Document position.
Returns:Indexed vector, internal type depends on underlying index.
Return type:scipy.sparse.csr_matrix

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim.similarities import Similarity
>>>
>>> # Create index:
>>> corpus = TextCorpus(datapath('testcorpus.txt'))
>>> index = Similarity('temp', corpus, num_features=400)
>>> vector = index.vector_by_id(1)
class gensim.similarities.docsim.SoftCosineSimilarity(corpus, similarity_matrix, num_best=None, chunksize=256)

Compute soft cosine similarity against a corpus of documents by storing the index matrix in memory.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>>
>>> similarity_matrix = model.wv.similarity_matrix(dictionary)  # construct similarity matrix
>>> index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> # Make a query.
>>> query = 'graph trees computer'.split()
>>> # calculate similarity between query and each doc from bow_corpus
>>> sims = index[dictionary.doc2bow(query)]
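
Because num_best is set, sims holds the top matches as (document_index, similarity) pairs, best first, so the results can be consumed directly:

>>> for doc_idx, similarity in sims:
...     pass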

Check out Tutorial Notebook for more examples.

Parameters:
  • corpus (iterable of list of (int, float)) – A list of documents in the BoW format.
  • similarity_matrix (scipy.sparse.csc_matrix) – A term similarity matrix, typically produced by similarity_matrix().
  • num_best (int, optional) – The number of results to retrieve for a query; if None, return similarities against all documents in the corpus.
  • chunksize (int, optional) – Size of one corpus chunk.

See also

gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix()
A term similarity matrix produced from term embeddings.
gensim.matutils.softcossim()
The Soft Cosine Measure.
get_similarities(query)

Get similarity between query and current index instance.

Warning

Do not use this function directly; use the self[query] syntax instead.

Parameters:query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
Returns:Similarity matrix.
Return type:numpy.ndarray
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.similarities.docsim.SparseMatrixSimilarity(corpus, num_features=None, num_terms=None, num_docs=None, num_nnz=None, num_best=None, chunksize=500, dtype=numpy.float32, maintain_sparsity=False)

Compute cosine similarity against a corpus of documents by storing the index matrix in memory.

Notes

Use this if your input corpus contains sparse vectors (such as documents in bag-of-words format) and fits into RAM.

The matrix is internally stored as a scipy.sparse.csr_matrix matrix. Unless the entire matrix fits into main memory, use Similarity instead.

Takes an optional maintain_sparsity argument; setting it to True causes get_similarities() to return a sparse matrix instead of a dense representation, if possible.
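
Examples

A minimal usage sketch, mirroring the MatrixSimilarity example above (using the bundled test corpus):

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.similarities import SparseMatrixSimilarity
>>>
>>> query = [(1, 2), (5, 4)]
>>> index = SparseMatrixSimilarity(common_corpus, num_features=len(common_dictionary))
>>> sims = index[query]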

See also

Similarity
Index similarity (wrapper for other inheritors of SimilarityABC).
MatrixSimilarity
Index similarity (dense with cosine distance).
Parameters:
  • corpus (iterable of list of (int, float)) – A list of documents in the BoW format.
  • num_features (int, optional) – Size of the dictionary.
  • num_terms (int, optional) – Number of terms (alias for num_features); one of the two must be specified.
  • num_docs (int, optional) – Number of documents in corpus.
  • num_nnz (int, optional) – Number of non-zero terms.
  • num_best (int, optional) – If set, return only the num_best most similar documents, always leaving out documents with similarity = 0. Otherwise, return a full vector with one float for every document in the index.
  • chunksize (int, optional) – Size of chunk.
  • dtype (numpy.dtype, optional) – Data type of internal matrix.
  • maintain_sparsity (bool, optional) – If True, get_similarities() will return a sparse matrix when possible.
get_similarities(query)

Get similarity between query and current index instance.

Warning

Do not use this function directly; use the self[query] syntax instead.

Parameters:query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
Returns:
  • numpy.ndarray – Similarity matrix (if maintain_sparsity=False) OR
  • scipy.sparse.csc – otherwise
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

class gensim.similarities.docsim.WmdSimilarity(corpus, w2v_model, num_best=None, normalize_w2v_and_replace=True, chunksize=256)

Compute negative WMD similarity against a corpus of documents by storing the index matrix in memory.

See WordEmbeddingsKeyedVectors for more information. Also, tutorial notebook for more examples.

When using this code, please consider citing the following papers: Ofir Pele and Michael Werman, "A linear time histogram metric for improved SIFT matching"; Ofir Pele and Michael Werman, "Fast and robust earth mover's distances"; Matt Kusner et al., "From Word Embeddings To Document Distances".

Example

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import WmdSimilarity
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
>>>
>>> index = WmdSimilarity(common_texts, model)  # WMD operates on tokenized texts, not BoW vectors
>>> # Make a query (also a list of tokens).
>>> query = 'trees'.split()
>>> sims = index[query]
Parameters:
  • corpus (iterable of list of str) – A list of documents, each of which is a list of tokens.
  • w2v_model (Word2VecTrainables) – A trained word2vec model.
  • num_best (int, optional) – Number of results to retrieve.
  • normalize_w2v_and_replace (bool, optional) – Whether or not to normalize the word2vec vectors to length 1.
  • chunksize (int, optional) – Size of chunk.
get_similarities(query)

Get similarity between query and current index instance.

Warning

Do not use this function directly; use the self[query] syntax instead.

Parameters:query ({list of (int, number), iterable of list of (int, number), scipy.sparse.csr_matrix}) – Document or collection of documents.
Returns:Similarity matrix.
Return type:numpy.ndarray
classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When this method is called on an instance instead of the class.
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently. If list of str - this attributes will be stored in separate files, the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn't be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

gensim.similarities.docsim.query_shard(args)

Helper to request a query from a shard; same as shard[query].

Parameters:args ((list of (int, number), SimilarityABC)) – Query and shard instance.
Returns:Result of shard[query] for the given query.