`similarities.nmslib` – Approximate Vector Search using NMSLIB¶

This module integrates NMSLIB’s fast similarity search with Gensim’s Word2Vec, Doc2Vec, FastText and KeyedVectors vector embeddings.

Important

To use this module, you must have the external nmslib library installed. To install it, run pip install nmslib.

To use the integration, instantiate a NmslibIndexer and pass the instance as the indexer parameter to your model’s most_similar() method, for example model.wv.most_similar().

Example usage¶

>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, epochs=10, seed=2)
>>>
>>> indexer = NmslibIndexer(model)
>>> model.wv.most_similar("cat", topn=2, indexer=indexer)
[('cat', 1.0), ('meow', 0.16398882865905762)]

Load and save example¶

>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=2, epochs=10)
>>>
>>> indexer = NmslibIndexer(model)
>>> _, temp_fn = mkstemp()
>>> indexer.save(temp_fn)
>>>
>>> new_indexer = NmslibIndexer.load(temp_fn)
>>> model.wv.most_similar("cat", topn=2, indexer=new_indexer)
[('cat', 1.0), ('meow', 0.5595494508743286)]

What is NMSLIB¶

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluating similarity search methods. The core library has no third-party dependencies. More information is available in the NMSLIB GitHub repository.

Why use NMSLIB?¶

Gensim’s native Similarity for finding the k nearest neighbors to a vector uses brute force and has linear complexity, albeit with extremely low constant factors.

The retrieved results are exact, which is overkill in many applications: approximate results retrieved in sub-linear time may be enough.
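To make the linear cost concrete, here is a minimal pure-Python sketch of exact brute-force search. It is not Gensim's implementation; the function names and the toy 2-dimensional vectors are hypothetical and chosen only for illustration. Every query scores all stored vectors, which is the work an approximate index tries to avoid.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def exact_most_similar(query, vectors, topn):
    """Score every stored vector against the query: O(n) work per query."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    best = heapq.nlargest(topn, scored)
    return [(key, score) for score, key in best]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
result = exact_most_similar([1.0, 0.0], toy_vectors, topn=2)
```

The result has the same (item, similarity) shape that most_similar returns, but it is always exact and always touches every vector.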

NMSLIB can find approximate nearest neighbors much faster, similar to Spotify’s Annoy library. Compared to Annoy, NMSLIB exposes more parameters for trading off build time, query time and accuracy. NMSLIB often achieves faster and more accurate nearest-neighbor search than Annoy.

class gensim.similarities.nmslib.NmslibIndexer(model, index_params=None, query_time_params=None)¶

This class allows NMSLIB to be used as the indexer for the most_similar method of the Word2Vec, Doc2Vec, FastText and KeyedVectors classes.

Parameters

model (BaseWordEmbeddingsModel) – Model that will be used as the source for the index.
index_params (dict, optional) –
Indexing parameters passed through to NMSLIB: https://github.com/nmslib/nmslib/blob/master/manual/methods.md#graph-based-search-methods-sw-graph-and-hnsw

If not specified, defaults to {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100, 'post': 0}.
query_time_params (dict, optional) – Query-time parameters passed through to the NMSLIB index. If not specified, defaults to {'efSearch': 100}.
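When only one or two settings need to change, it can help to start from the documented defaults. The sketch below assumes that a dict passed explicitly is used as given rather than merged with the defaults; the text above does not say either way, so treat that as an assumption. The helper with_defaults and the chosen override values are hypothetical.

```python
# Defaults as documented for NmslibIndexer above.
DEFAULT_INDEX_PARAMS = {"M": 100, "indexThreadQty": 1, "efConstruction": 100, "post": 0}
DEFAULT_QUERY_TIME_PARAMS = {"efSearch": 100}


def with_defaults(defaults, overrides=None):
    """Return a new dict: the documented defaults updated with any overrides."""
    merged = dict(defaults)  # copy, so the defaults themselves are never mutated
    merged.update(overrides or {})
    return merged


# Illustrative override values only; tune them for your own data.
index_params = with_defaults(DEFAULT_INDEX_PARAMS, {"M": 32, "indexThreadQty": 4})
query_time_params = with_defaults(DEFAULT_QUERY_TIME_PARAMS, {"efSearch": 200})

# Usage with a trained model (requires gensim and nmslib to be installed):
# indexer = NmslibIndexer(model, index_params=index_params,
#                         query_time_params=query_time_params)
```

Building complete dicts this way keeps the remaining settings explicit, whatever the indexer does with a partial dict.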

classmethod load(fname)¶

Load a NmslibIndexer instance from a file.

Parameters

fname (str) – Path previously used in save().

most_similar(vector, num_neighbors)¶

Find the approximate num_neighbors most similar items.

Parameters

vector (numpy.array) – Vector for a word or document.
num_neighbors (int) – Number of most similar items to return.

Returns

List of most similar items in the format [(item, cosine_similarity), … ].

Return type

list of (str, float)
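In the usage example on this page, the query word itself comes back as the top hit with similarity 1.0. If only the other neighbors are wanted, one option is to request one extra result and filter the query out. A minimal sketch, using a hypothetical helper and the result list copied from that example:

```python
def drop_query_item(results, query_item):
    """Remove the query's own entry from a list of (item, similarity) pairs."""
    return [(item, score) for item, score in results if item != query_item]


# Result list copied from the first usage example on this page.
results = [("cat", 1.0), ("meow", 0.16398882865905762)]
neighbors = drop_query_item(results, "cat")
```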

save(fname, protocol=4)¶

Save this NmslibIndexer instance to a file.

Parameters

fname (str) – Path to the output file. Two files are written: fname, which holds the parameters, and fname.d, which holds the NMSLIB index.
protocol (int, optional) – Protocol for pickle.

Notes

This method saves only the index (the model isn’t preserved).
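Because save() writes two files, anything that moves or deletes a saved index has to handle both. The sketch below relies only on the fname and fname.d naming described above; the helper names are hypothetical, and the demonstration uses empty stand-in files rather than a real index.

```python
import os
import tempfile


def saved_index_paths(fname):
    """Return the two paths that save(fname) is documented to write."""
    return [fname, fname + ".d"]


def remove_saved_index(fname):
    """Delete whichever of the two files exist and return the paths removed."""
    removed = []
    for path in saved_index_paths(fname):
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demonstration with empty stand-in files instead of a real saved index.
with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "my_index")
    for path in saved_index_paths(base):
        open(path, "w").close()
    removed = remove_saved_index(base)
    leftover = os.listdir(tmp_dir)
```

The same pair of paths applies when copying a saved index to another machine: load() needs both files next to each other.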
