similarities.index – Fast Approximate Nearest Neighbor Similarity with Annoy package

Intro

This module integrates Annoy with Word2Vec, Doc2Vec, FastText and KeyedVectors.

Important

To use this module, you must have the annoy library installed. To install it, run pip install annoy.

What is Annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
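The memory-mapping idea can be illustrated with a small sketch. This is not Annoy's file format: the fixed-size record layout and the helper names (write_vectors, read_vector) are assumptions made up for illustration. It only shows how a read-only file of vectors can be mapped into memory and read by offset, which is what lets several processes share one copy of the data.

```python
import mmap
import os
import struct
import tempfile

DIM = 3
# Hypothetical record layout: DIM little-endian 32-bit floats per vector.
RECORD = struct.Struct("<%df" % DIM)


def write_vectors(path, vectors):
    """Write each vector as one fixed-size binary record."""
    with open(path, "wb") as f:
        for vector in vectors:
            f.write(RECORD.pack(*vector))


def read_vector(mapped, i):
    """Read the i-th record straight out of the mapped file."""
    return RECORD.unpack_from(mapped, i * RECORD.size)


vectors = [(1.0, 0.0, 0.5), (0.25, 2.0, -1.0)]
fd, path = tempfile.mkstemp()
os.close(fd)
write_vectors(path, vectors)

with open(path, "rb") as f:
    # ACCESS_READ gives a read-only mapping; the OS can share its pages
    # between every process that maps the same file.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    second = read_vector(mapped, 1)
    mapped.close()

os.remove(path)
```

Because records have a fixed size, any vector can be located by arithmetic on its index, without loading the rest of the file.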

How it works

Annoy uses random projections to build up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.
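The splitting rule described above can be sketched in a few lines of pure Python. This is a simplified illustration, not Annoy's implementation: the function names, the dictionary-based tree, and the leaf_size parameter are all hypothetical, and a real index builds many such trees and searches several branches per query.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_by_random_hyperplane(points, rng):
    """Split points by the hyperplane equidistant from two sampled points."""
    p, q = rng.sample(points, 2)
    normal = [qi - pi for pi, qi in zip(p, q)]
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    offset = dot(normal, midpoint)  # hyperplane: dot(normal, x) == offset
    left = [x for x in points if dot(normal, x) < offset]
    right = [x for x in points if dot(normal, x) >= offset]
    return normal, offset, left, right


def build_tree(points, rng, leaf_size=4):
    """Recursively split until each leaf holds at most leaf_size points."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    normal, offset, left, right = split_by_random_hyperplane(points, rng)
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": points}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, rng, leaf_size),
        "right": build_tree(right, rng, leaf_size),
    }


def query_leaf(tree, x):
    """Descend to the leaf on the same side of every hyperplane as x."""
    while "leaf" not in tree:
        side = "left" if dot(tree["normal"], x) < tree["offset"] else "right"
        tree = tree[side]
    return tree["leaf"]


rng = random.Random(0)
points = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(200)]
tree = build_tree(points, rng)
candidates = query_leaf(tree, points[0])  # small set of nearby candidates
```

A single tree can place two close points on opposite sides of a hyperplane, which is one reason an approximate index typically combines the candidates from several independently built trees.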

More information about Annoy: GitHub repository, author on Twitter, and the annoy-user mailing list.

class gensim.similarities.index.AnnoyIndexer(model=None, num_trees=None)

This class lets you use Annoy as the indexer for the most_similar method of the Word2Vec, Doc2Vec, FastText and Word2VecKeyedVectors classes.

Parameters

model (BaseWordEmbeddingsModel, optional) – Model that will be used as the source for the index.
num_trees (int, optional) – Number of trees for Annoy indexer.

Examples

>>> from gensim.similarities.index import AnnoyIndexer
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=1)
>>>
>>> indexer = AnnoyIndexer(model, 2)
>>> model.most_similar("cat", topn=2, indexer=indexer)
[('cat', 1.0), ('dog', 0.32011348009109497)]

build_from_doc2vec()

Build an Annoy index using document vectors from a Doc2Vec model.

build_from_keyedvectors()

Build an Annoy index using word vectors from a KeyedVectors model.

build_from_word2vec()

Build an Annoy index using word vectors from a Word2Vec model.

load(fname)

Load an AnnoyIndexer instance.

Parameters

fname (str) – Path to the saved AnnoyIndexer dump.

Examples

>>> from gensim.similarities.index import AnnoyIndexer
>>> from gensim.models import Word2Vec
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=1, iter=10)
>>>
>>> indexer = AnnoyIndexer(model, 2)
>>> _, temp_fn = mkstemp()
>>> indexer.save(temp_fn)
>>>
>>> new_indexer = AnnoyIndexer()
>>> new_indexer.load(temp_fn)
>>> new_indexer.model = model

most_similar(vector, num_neighbors)

Find the approximate num_neighbors most similar items.

Parameters

vector (numpy.array) – Vector for word/document.
num_neighbors (int) – Number of most similar items.

Returns

List of most similar items in format [(item, cosine_distance), … ]

Return type

list of (str, float)
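The score paired with each item is derived from the distance Annoy reports. As a hedged illustration, assume the index uses Annoy's angular metric, which Annoy documents as the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos)). The sketch below shows how such a distance relates to cosine similarity. The helper names are hypothetical and not part of gensim, and whether a given gensim version applies exactly this conversion is an assumption.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def angular_distance(u, v):
    """Euclidean distance between the unit-normalized vectors."""
    nu, nv = norm(u), norm(v)
    return math.sqrt(sum((a / nu - b / nv) ** 2 for a, b in zip(u, v)))


def similarity_from_angular(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - distance ** 2 / 2.0


u = (1.0, 2.0, 3.0)
v = (3.0, 1.0, 0.5)
recovered = similarity_from_angular(angular_distance(u, v))
```

Under this assumption a distance of 0 maps to a similarity of 1.0, which matches the score a query word gets against itself in the class example.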

save(fname, protocol=2)

Save the AnnoyIndexer instance.

Parameters

fname (str) – Path to the output file. Two files are produced: fname (parameters) and fname.d (the AnnoyIndex).
protocol (int, optional) – Protocol for pickle.

Notes

This method saves only the index; the model is not preserved.
