similarities.nmslib – Approximate Vector Search using NMSLIB¶
This module integrates NMSLIB fast similarity search with Gensim’s Word2Vec, Doc2Vec, FastText and KeyedVectors vector embeddings.
Important
To use this module, you must have the external nmslib library installed.
To install it, run pip install nmslib.
To use the integration, instantiate the NmslibIndexer class and pass the instance as the indexer parameter to your model’s model.wv.most_similar() method.
Example usage¶
>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, epochs=10, seed=2)
>>>
>>> indexer = NmslibIndexer(model)
>>> model.wv.most_similar("cat", topn=2, indexer=indexer)
[('cat', 1.0), ('meow', 0.16398882865905762)]
Load and save example¶
>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=2, epochs=10)
>>>
>>> indexer = NmslibIndexer(model)
>>> _, temp_fn = mkstemp()
>>> indexer.save(temp_fn)
>>>
>>> new_indexer = NmslibIndexer.load(temp_fn)
>>> model.wv.most_similar("cat", topn=2, indexer=new_indexer)
[('cat', 1.0), ('meow', 0.5595494508743286)]
What is NMSLIB¶
Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluating similarity search methods. The core library does not have any third-party dependencies. More information about NMSLIB is available in its GitHub repository.
Why use NMSLIB?¶
Gensim’s native Similarity for finding the k nearest neighbors to a vector uses brute force and has linear complexity, albeit with extremely low constant factors.
The retrieved results are exact, which is overkill in many applications: approximate results retrieved in sub-linear time may be enough.
NMSLIB can find approximate nearest neighbors much faster, similar to Spotify’s Annoy library.
Compared to Annoy, NMSLIB has more parameters to control the build time, query time and accuracy. NMSLIB often achieves faster and more accurate nearest neighbor search than Annoy.
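To make the "brute force, linear complexity" point concrete, here is a minimal pure-Python sketch of exact nearest-neighbor search. It is not Gensim's implementation; the function names and the toy vectors are invented for this illustration. Every query scores each stored vector once, which is the cost an approximate index such as NMSLIB tries to avoid.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_most_similar(query, vectors, topn=2):
    """Exact search: compare the query against every stored vector (O(n))."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    best = heapq.nlargest(topn, scored)  # keep only the topn highest scores
    return [(key, score) for score, key in best]


# Toy 2-d vectors, invented for this illustration.
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.0, 1.0],
}
result = exact_most_similar([1.0, 0.0], toy_vectors, topn=2)
print(result)
```

The result lists "cat" first and "kitten" second, in the same (item, similarity) shape that most_similar returns.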
- class gensim.similarities.nmslib.NmslibIndexer(model, index_params=None, query_time_params=None)¶
This class allows using NMSLIB as the indexer for the most_similar method of the Word2Vec, Doc2Vec, FastText and Word2VecKeyedVectors classes.
- Parameters
model (BaseWordEmbeddingsModel) – Model that will be used as the source for the index.
index_params (dict, optional) – Indexing parameters passed through to NMSLIB: https://github.com/nmslib/nmslib/blob/master/manual/methods.md#graph-based-search-methods-sw-graph-and-hnsw
If not specified, defaults to {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100, 'post': 0}.
query_time_params (dict, optional) – Query-time parameters passed to the NMSLIB indexer. If not specified, defaults to {'efSearch': 100}.
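As a hedged sketch of how the two parameter dictionaries might be supplied, the block below passes complete dictionaries rather than relying on any merging with the defaults. The values are illustrative and not tuned recommendations, the comments give the usual HNSW reading of each key, and build_indexer is a helper name made up for this example. The import is done lazily, so the dictionaries can be inspected even where gensim and nmslib are not installed.

```python
# Illustrative values only; not tuned recommendations.
index_params = {
    "M": 32,                # graph connectivity (neighbors kept per node)
    "indexThreadQty": 4,    # number of threads used while building the index
    "efConstruction": 200,  # candidate list size at build time
    "post": 0,              # post-processing level applied after the build
}
query_time_params = {"efSearch": 50}  # candidate list size at query time


def build_indexer(model):
    """Build an NmslibIndexer for `model` using the dictionaries above.

    Hypothetical helper; calling it requires gensim and nmslib.
    """
    from gensim.similarities.nmslib import NmslibIndexer

    return NmslibIndexer(
        model,
        index_params=index_params,
        query_time_params=query_time_params,
    )
```

Larger efConstruction and efSearch values generally trade speed for accuracy, so it is worth measuring recall on your own data before settling on values.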
- classmethod load(fname)¶
Load a NmslibIndexer instance from a file.
- Parameters
fname (str) – Path previously used in save().
- most_similar(vector, num_neighbors)¶
Find the approximate num_neighbors most similar items.
- Parameters
vector (numpy.array) – Vector for a word or document.
num_neighbors (int) – Number of most similar items to return.
- Returns
List of most similar items in the format [(item, cosine_similarity), … ].
- Return type
list of (str, float)
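The indexer can also be queried directly with a raw vector instead of going through the model. The helper below is a small sketch of that call pattern; neighbors_of is a name invented for this example, and it assumes keyed_vectors supports lookup by key (as model.wv does) and that indexer exposes the most_similar(vector, num_neighbors) method documented above.

```python
def neighbors_of(indexer, keyed_vectors, key, num_neighbors=5):
    """Look up the vector stored for `key` and query the indexer with it.

    Hypothetical helper: `indexer` is expected to be an NmslibIndexer and
    `keyed_vectors` something like `model.wv`.
    """
    vector = keyed_vectors[key]  # embedding for the given word or document tag
    return indexer.most_similar(vector, num_neighbors)
```

With the objects from the usage example, neighbors_of(indexer, model.wv, "cat", 2) would be the direct-call counterpart of model.wv.most_similar("cat", topn=2, indexer=indexer).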
- save(fname, protocol=4)¶
Save this NmslibIndexer instance to a file.
- Parameters
fname (str) – Path to the output file. Two files are produced: fname holds the parameters and fname.d holds the NmslibIndex.
protocol (int, optional) – Protocol for pickle.
Notes
This method saves only the index (the model isn’t preserved).
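Because the model is not stored with the index, one option is to persist both side by side. The sketch below shows one possible convention; the helper names and the ".model" / ".index" suffixes are assumptions made for this example, not part of Gensim. The loading helper imports gensim lazily and assumes a Word2Vec model.

```python
def save_model_and_index(model, indexer, prefix):
    """Save a model and its NMSLIB index next to each other.

    `prefix` is a path stem chosen by the caller; the suffixes are an
    arbitrary convention for this sketch.
    """
    model_path = prefix + ".model"
    index_path = prefix + ".index"
    model.save(model_path)    # the model must be saved separately
    indexer.save(index_path)  # per the docs above, also writes index_path + ".d"
    return model_path, index_path


def load_model_and_index(prefix):
    """Reload what save_model_and_index wrote (requires gensim and nmslib)."""
    from gensim.models import Word2Vec
    from gensim.similarities.nmslib import NmslibIndexer

    model = Word2Vec.load(prefix + ".model")
    indexer = NmslibIndexer.load(prefix + ".index")
    return model, indexer
```

After reloading, the pair can be used as in the load and save example: model.wv.most_similar("cat", topn=2, indexer=indexer).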