similarities.nmslib – Approximate Vector Search using NMSLIB

This module integrates NMSLIB fast similarity search with Gensim’s Word2Vec, Doc2Vec, FastText and KeyedVectors vector embeddings.

Important

To use this module, you must have the external nmslib library installed. To install it, run pip install nmslib.

To use the integration, create an NmslibIndexer instance and pass it as the indexer parameter to the most_similar() method of your model’s vectors, for example model.wv.most_similar().

Example usage

>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, epochs=10, seed=2)
>>>
>>> indexer = NmslibIndexer(model)
>>> model.wv.most_similar("cat", topn=2, indexer=indexer)
[('cat', 1.0), ('meow', 0.16398882865905762)]

Load and save example

>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=2, epochs=10)
>>>
>>> indexer = NmslibIndexer(model)
>>> _, temp_fn = mkstemp()
>>> indexer.save(temp_fn)
>>>
>>> new_indexer = NmslibIndexer.load(temp_fn)
>>> model.wv.most_similar("cat", topn=2, indexer=new_indexer)
[('cat', 1.0), ('meow', 0.5595494508743286)]

What is NMSLIB

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluating similarity search methods. The core library has no third-party dependencies. More information is available in the NMSLIB GitHub repository (https://github.com/nmslib/nmslib).

Why use NMSLIB?

Gensim’s native Similarity for finding the k nearest neighbors to a vector uses brute force and has linear complexity, albeit with extremely low constant factors.

The retrieved results are exact, which is overkill in many applications: approximate results retrieved in sub-linear time may be enough.

NMSLIB can find approximate nearest neighbors much faster, similar to Spotify’s Annoy library. Compared to Annoy, NMSLIB exposes more parameters for trading off build time, query time and accuracy. NMSLIB often achieves faster and more accurate nearest-neighbor search than Annoy.
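For comparison, the exact brute-force search described above can be sketched in plain Python. This is a minimal illustration, not Gensim’s implementation: the helper names and the toy vectors are hypothetical. It scores every stored vector against the query, which is why the cost grows linearly with the number of items.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_most_similar(query, vectors, topn=2):
    """Exact top-n search: every stored vector is scored, so the cost is linear."""
    scored = ((label, cosine_similarity(query, vec)) for label, vec in vectors.items())
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Hypothetical toy vectors, not taken from a trained model.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
result = brute_force_most_similar(toy_vectors["cat"], toy_vectors, topn=2)
```

An approximate index avoids scoring every vector, at the price of occasionally missing a true nearest neighbor.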

class gensim.similarities.nmslib.NmslibIndexer(model, index_params=None, query_time_params=None)

This class lets you use NMSLIB as the indexer for the most_similar method of the Word2Vec, Doc2Vec, FastText and Word2VecKeyedVectors classes.
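The constructor’s optional dictionaries can be used to tune the index, roughly as sketched below. The keys shown are assumed here to be NMSLIB HNSW parameter names, which this page does not list, and the values are illustrative only, not recommendations. The helper name build_tuned_indexer is hypothetical.

```python
# Illustrative values only; the keys are assumed to be NMSLIB HNSW parameter names.
INDEX_PARAMS = {
    "M": 50,                # graph connectivity: more links per node, larger index
    "efConstruction": 200,  # build-time candidate list size: slower build, better recall
    "indexThreadQty": 1,    # number of threads used while building the index
    "post": 0,              # post-processing level applied after the build
}
QUERY_TIME_PARAMS = {"efSearch": 200}  # query-time candidate list size


def build_tuned_indexer(model):
    """Build an NmslibIndexer for a trained model using the settings above."""
    # Imported lazily so this sketch can be loaded without gensim or nmslib installed.
    from gensim.similarities.nmslib import NmslibIndexer

    return NmslibIndexer(
        model,
        index_params=INDEX_PARAMS,
        query_time_params=QUERY_TIME_PARAMS,
    )
```

Larger values generally trade speed and memory for accuracy, so it is worth measuring recall on your own data before settling on them.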

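Parameters
  • model – Trained model whose vectors are used to build the index.

  • index_params (dict, optional) – Index construction parameters passed to NMSLIB.

  • query_time_params (dict, optional) – Query-time parameters passed to NMSLIB.

classmethod load(fname)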

Load a NmslibIndexer instance from a file.

Parameters

fname (str) – Path previously used in save().

most_similar(vector, num_neighbors)

Find the approximate num_neighbors most similar items.

Parameters
  • vector (numpy.array) – Vector for a word or document.

  • num_neighbors (int) – Number of most similar items to return.

Returns

List of most similar items in the format [(item, cosine_similarity), … ].

Return type

list of (str, float)
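As the usage example on this page shows, the query word itself can come back as the top hit with similarity 1.0. A small post-processing step over the returned list of (str, float) pairs might look like the sketch below. The helper filter_neighbors is hypothetical and not part of Gensim.

```python
def filter_neighbors(results, query_word, min_similarity=0.0):
    """Drop the query word itself and any neighbor below min_similarity."""
    return [
        (word, similarity)
        for word, similarity in results
        if word != query_word and similarity >= min_similarity
    ]


# Output copied from the first usage example on this page.
raw = [("cat", 1.0), ("meow", 0.16398882865905762)]
neighbors = filter_neighbors(raw, "cat", min_similarity=0.1)
```

When the query word should be excluded, request one extra neighbor so that topn real neighbors remain after filtering.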

save(fname, protocol=4)

Save this NmslibIndexer instance to a file.

Parameters
  • fname (str) – Path to the output file. Two files are written: fname, which holds the parameters, and fname.d, which holds the NMSLIB index.

  • protocol (int, optional) – Protocol for pickle.

Notes

This method saves only the index (the model isn’t preserved).
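Because save() writes two files, both need to be present (and moved together) before calling load(). The check below is a minimal sketch using only the standard library. The helper names are hypothetical, and the demonstration uses empty placeholder files rather than a real saved index.

```python
import os
import tempfile


def saved_index_paths(fname):
    """Return the two paths that save(fname) is documented to write."""
    return fname, fname + ".d"


def index_is_complete(fname):
    """True only if both the parameter file and the index file exist."""
    return all(os.path.isfile(path) for path in saved_index_paths(fname))


# Demonstration with empty placeholder files; a real save() writes actual content.
with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "example.index")
    open(base, "wb").close()
    only_params_present = index_is_complete(base)  # fname.d is still missing
    open(base + ".d", "wb").close()
    both_present = index_is_complete(base)
```

Since the model itself is not stored, it has to be saved and loaded separately in order to run queries against a reloaded index.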