models.keyedvectors – Store and query word vectors
This module implements word vectors and their similarity lookups.
Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.
The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.
The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, the key can also correspond to a document, a graph node etc. To generalize over different use cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.
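A minimal sketch of the mapping described above, using a plain dict rather than the real class. The names `toy_vectors` and `get_vector` are illustrative assumptions, and Python lists stand in for 1D numpy arrays; this is not gensim's implementation.

```python
# Toy stand-in for a keyed-vector store: string id -> 1D vector.
# The keys and values below are made-up illustration data.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],      # a word
    "doc_42": [0.2, 0.7, 0.1],   # a document tag
    "node:7": [0.0, 0.3, 0.9],   # a graph node
}


def get_vector(entity):
    """Return the vector stored for `entity`; raises KeyError if it is unknown."""
    return toy_vectors[entity]


print(get_vector("cat"))
```

Whatever the entity represents, the lookup is the same: a string id goes in, a fixed-length vector comes out, and an unknown id raises `KeyError`.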
| capability | KeyedVectors | full model | note |
|---|---|---|---|
| continue training vectors | ❌ | ✅ | You need the full model to train or update vectors. |
| smaller objects | ✅ | ❌ | KeyedVectors are smaller and need less RAM, because they don’t need to store the model state that enables training. |
| save/load from native fasttext/word2vec format | ✅ | ❌ | Vectors exported by the Facebook and Google tools do not support further training, but you can still load them into KeyedVectors. |
| append new vectors | ✅ | ✅ | Add new entity-vector entries to the mapping dynamically. |
| concurrency | ✅ | ✅ | Thread-safe, allows concurrent vector queries. |
| shared RAM | ✅ | ✅ | Multiple processes can reuse the same data, keeping only a single copy in RAM using mmap. |
| fast load | ✅ | ✅ | Supports mmap to load data from disk instantaneously. |
TL;DR: the main difference is that KeyedVectors do not support further training. On the other hand, by shedding the internal data structures necessary for training, KeyedVectors offer a smaller RAM footprint and a simpler interface.
Train a full model, then access its model.wv property, which holds the standalone keyed vectors. For example, using the Word2Vec algorithm to train the vectors:
>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
>>> word_vectors = model.wv
Persist the word vectors to disk, then load them back (memory-mapped, read-only) with:
>>> from gensim.test.utils import get_tmpfile
>>> from gensim.models import KeyedVectors
>>>
>>> fname = get_tmpfile("vectors.kv")
>>> word_vectors.save(fname)
>>> word_vectors = KeyedVectors.load(fname, mmap='r')
The vectors can also be loaded as a KeyedVectors instance from an existing file on disk in Google’s original word2vec C format:
>>> from gensim.test.utils import datapath
>>>
>>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False) # C text format
>>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True) # C bin format
You can perform various syntactic/semantic NLP word tasks with the trained vectors. Some of them are already built-in:
>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100") # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699
>>>
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>>
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>>
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>>
>>> distance = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(distance))
3.4893
>>>
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
>>>
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
>>>
>>> vector = word_vectors['computer'] # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.word_vec('office', use_norm=True)
>>> vector.shape
(100,)
You can also evaluate the correlation with human opinion on word similarity:
>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
And on word analogies:
>>> analogy_scores = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
and so on.
gensim.models.keyedvectors.BaseKeyedVectors(vector_size)
Bases: gensim.utils.SaveLoad
Abstract base class / interface for various types of word vectors.
add(entities, weights, replace=False)
Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.
Parameters: 


closer_than(entity1, entity2)
Get all entities that are closer to entity1 than entity2 is to entity1.
distance(entity1, entity2)
Compute the distance between the vectors of two input entities, specified by their string id.
distances(entity1, other_entities=())
Compute distances from a given entity (its string id) to all entities in other_entities. If other_entities is empty, return the distance between entity1 and all entities in vocab.
get_vector(entity)
Get the entity’s representation in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
load(fname_or_handle, **kwargs)
Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
most_similar(**kwargs)
Find the top-N most similar entities. Positive and negative lists of entities may be passed in **kwargs.
most_similar_to_given(entity1, entities_list)
Get the entity from entities_list most similar to entity1.
rank(entity1, entity2)
Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
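How closer_than() and rank() relate can be sketched with a toy dict of vectors. This is a plain-Python illustration under the assumption that closeness means cosine similarity and that the rank is 1-based; the helper names and data are hypothetical and not taken from the library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closer_than(vectors, entity1, entity2):
    """Entities strictly more similar to entity1 than entity2 is (entity1 excluded)."""
    threshold = cosine(vectors[entity1], vectors[entity2])
    return [key for key, vec in vectors.items()
            if key != entity1 and cosine(vectors[entity1], vec) > threshold]


def rank(vectors, entity1, entity2):
    """1-based position of entity2 when all entities are ordered by closeness to entity1."""
    return len(closer_than(vectors, entity1, entity2)) + 1


# Made-up 2D vectors for illustration only.
toy = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "prince": [0.8, 0.4], "apple": [0.0, 1.0]}
print(closer_than(toy, "king", "prince"), rank(toy, "king", "prince"))
```

In this sketch the rank is simply one more than the number of entities that are closer, so the nearest neighbour has rank 1.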
save(fname_or_handle, **kwargs)
Save the object to a file.
Parameters: 


See also
load()
similarity(entity1, entity2)
Compute cosine similarity between two entities, specified by their string id.
gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size, mapfile_path)
Bases: gensim.models.keyedvectors.BaseKeyedVectors
add(entities, weights, replace=False)
Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.
Parameters: 


closer_than(entity1, entity2)
Get all entities that are closer to entity1 than entity2 is to entity1.
distance(d1, d2)
Compute cosine distance between two documents.
distances(d1, other_docs=())
Compute cosine distances from given d1 to all documents in other_docs.
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Parameters: 


Returns:  Array containing distances to all documents in other_docs from input d1. 
Return type:  numpy.array 
doctag_syn0
doctag_syn0norm
doesnt_match(docs)
Which document from the given list doesn’t go with the others from the training set?
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Parameters:  docs (list of {str, int}) – Sequence of doctags/indexes. 

Returns:  Doctag/index of the document farthest away from the mean of all the documents. 
Return type:  {str, int} 
get_vector(entity)
Get the entity’s representation in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
index2entity
index_to_doctag(i_index)
Get the string key for the given i_index, if available. Otherwise return the raw int doctag (same int).
init_sims(replace=False)
Precompute L2-normalized vectors.
Parameters:  replace (bool, optional) – If True, forget the original vectors and only keep the normalized ones (saves lots of memory).

Warning
You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train or infer_vector.
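What precomputing L2-normalized vectors means can be illustrated with a small plain-Python sketch. The function `init_sims_sketch` and its return convention are assumptions made for this example, not the library's code; it assumes every vector is non-zero.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def init_sims_sketch(vectors, replace=False):
    """Return (raw, normalized) stores; with replace=True the raw vectors are dropped."""
    normalized = {key: l2_normalize(vec) for key, vec in vectors.items()}
    raw = normalized if replace else vectors
    return raw, normalized


toy = {"a": [3.0, 4.0], "b": [0.0, 2.0]}  # made-up data
raw, normed = init_sims_sketch(toy)
print(normed["a"])
```

With `replace=True` only one copy of the data is kept, which is where the memory saving comes from; the original magnitudes are then gone, which is why training can no longer continue.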
int_index(index, doctags, max_rawint)
Get the int index for either a string or an int index.
load(fname_or_handle, **kwargs)
Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)
Find the top-N most similar docvecs from the training set. Positive docvecs contribute positively towards the similarity, negative docvecs negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given docs. Docs may be specified as vectors, integer indexes of trained docvecs, or if the documents were originally presented with string tags, by the corresponding tags.
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Parameters: 


Returns:  Sequence of (doctag/index, similarity). 
Return type:  list of ({str, int}, float) 
most_similar_to_given(entity1, entities_list)
Get the entity from entities_list most similar to entity1.
n_similarity(ds1, ds2)
Compute cosine similarity between two sets of docvecs from the trained set.
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Parameters: 


Returns:  The cosine similarity between the means of the documents in each of the two sets. 
Return type:  float 
rank(entity1, entity2)
Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
save(*args, **kwargs)
Save object.
Parameters:  fname (str) – Path to the output file. 

See also
load()
save_word2vec_format(fname, prefix='*dt_', fvocab=None, total_vec=None, binary=False, write_first_line=True)
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
Parameters: 


similarity(d1, d2)
Compute cosine similarity between two docvecs from the training set.
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Parameters: 


Returns:  The cosine similarity between the vectors of the two documents. 
Return type:  float 
similarity_unseen_docs(model, doc_words1, doc_words2, alpha=None, min_alpha=None, steps=None)
Compute cosine similarity between two post-bulk, out-of-training documents.
Parameters: 


Returns:  The cosine similarity between doc_words1 and doc_words2. 
Return type:  float 
gensim.models.keyedvectors.FastTextKeyedVectors(vector_size, min_n, max_n, bucket, compatible_hash)
Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors
Vectors and vocab for FastText.
Implements significant parts of the FastText algorithm. For example, word_vec() calculates vectors for out-of-vocabulary (OOV) entities. FastText achieves this by keeping vectors for ngrams: adding the vectors for the ngrams of an entity yields the vector for the entity.
Similar to a hashmap, this class keeps a fixed number of buckets, and maps all ngrams to buckets using a hash function.
This class also provides an abstraction over the hash functions used by Gensim’s FastText implementation over time. The hash function connects ngrams to buckets. Originally, the hash function was broken and incompatible with Facebook’s implementation. The current hash is fully compatible.
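The ngram-to-bucket idea can be sketched as follows. This is a hedged illustration only: the '<' and '>' boundary markers and the standard 32-bit FNV-1a hash used here are assumptions chosen for the example, and the hash that Gensim or Facebook actually use may differ in details such as byte signedness. The helper names are hypothetical.

```python
def char_ngrams(word, min_n, max_n):
    """Character ngrams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def fnv1a_32(text):
    """Standard 32-bit FNV-1a over the UTF-8 bytes (a stand-in hash for this sketch)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket_of(ngram, bucket):
    """Map an ngram to one of `bucket` rows of the ngram weight matrix."""
    return fnv1a_32(ngram) % bucket


for gram in char_ngrams("cat", 3, 4):
    print(gram, bucket_of(gram, 2000000))
```

Because the number of buckets is fixed, distinct ngrams can collide and share a row; that is the trade-off for being able to produce a vector for any word, including ones never seen in training.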
Parameters: 


vectors_vocab
Each row corresponds to a vector for an entity in the vocabulary. Columns correspond to vector dimensions.
Type:  np.array

vectors_vocab_norm
Same as vectors_vocab, but the vectors are L2 normalized.
Type:  np.array

vectors_ngrams
A vector for each ngram across all entities in the vocabulary. Each row is a vector that corresponds to a bucket. Columns correspond to vector dimensions.
Type:  np.array

vectors_ngrams_norm
Same as vectors_ngrams, but the vectors are L2 normalized. Under some conditions, may actually be the same matrix as vectors_ngrams, e.g. if init_sims() was called with replace=True.
Type:  np.array

buckets_word
Maps vocabulary items (by their index) to the buckets they occur in.
Type:  dict 

accuracy(**kwargs)
Compute accuracy of the model.
The accuracy is reported (printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.
Parameters: 


Returns:  Full lists of correct and incorrect predictions divided by sections. 
Return type:  list of dict of (str, (str, str, str))
add(entities, weights, replace=False)
Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.
Parameters: 


adjust_vectors()
Adjust the vectors for words in the vocabulary.
The adjustment relies on the vectors of the ngrams making up each individual word.
closer_than(entity1, entity2)
Get all entities that are closer to entity1 than entity2 is to entity1.
cosine_similarities(vector_1, vectors_all)
Compute cosine similarities between one vector and a set of other vectors.
Parameters: 


Returns:  Contains the cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).
Return type:  numpy.ndarray
distance(w1, w2)
Compute cosine distance between two words.
Calculated as 1 - similarity().
Parameters: 


Returns:  Distance between w1 and w2. 
Return type:  float 
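The relationship between the two measures, distance = 1 - similarity, can be made concrete with a short plain-Python sketch. The functions below operate directly on vectors rather than on words and are illustrative stand-ins, not the library's methods.

```python
import math


def similarity(u, v):
    """Cosine similarity: dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distance(u, v):
    """Cosine distance, defined here as 1 - similarity."""
    return 1.0 - similarity(u, v)


print(distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(distance([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```

Since cosine similarity lies in [-1, 1], this distance lies in [0, 2]: close to 0 for vectors pointing the same way, 1 for orthogonal vectors, and 2 for opposite ones.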
distances(word_or_vector, other_words=())
Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in vocab.
Parameters: 


Returns:  Array containing distances to all words in other_words from input word_or_vector. 
Return type:  numpy.array 
Raises:  KeyError – If either word_or_vector or any word in other_words is absent from vocab. 
doesnt_match(words)
Which word from the given list doesn’t go with the others?
Parameters:  words (list of str) – List of words.

Returns:  The word farthest from the mean of all words.
Return type:  str 
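The "farthest from the mean" idea can be sketched in plain Python. This is an illustrative approximation under the assumption that each vector is unit-normalized before averaging and that closeness to the mean is measured by dot product; the function and toy vectors are hypothetical, not the library's code.

```python
import math


def unit(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Made-up 2D vectors: three "meal" words point one way, "cereal" another.
toy = {"breakfast": [1.0, 0.1], "lunch": [0.9, 0.2],
       "dinner": [0.95, 0.15], "cereal": [0.1, 1.0]}
print(doesnt_match(toy, ["breakfast", "cereal", "dinner", "lunch"]))
```

The odd one out is simply the input with the lowest similarity to the group average, so the answer always comes from the given list, even when every word fits reasonably well.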
evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute performance of the model on an analogy test set.
This is the modern variant of accuracy(), see discussion on GitHub #1935.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
Parameters: 


Returns: 

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at:
* http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
* https://www.cl.cam.ac.uk/~fh295/simlex.html
Parameters: 


Returns: 

get_vector(word)
Get the entity’s representation in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
index2entity
init_ngrams_weights(seed)
Initialize the vocabulary and ngrams weights prior to training.
Creates the weight matrices and initializes them with uniform random values.
Parameters:  seed (float) – The seed for the PRNG. 

Note
Call this after the vocabulary has been fully initialized.
init_post_load(vectors)
Perform initialization after loading a native Facebook model.
Expects that the vocabulary (self.vocab) has already been initialized.
Parameters: 


init_sims(replace=False)
Precompute L2-normalized vectors.
Parameters:  replace (bool, optional) – If True, forget the original vectors and only keep the normalized ones (saves lots of memory).

Warning
You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.
load(fname_or_handle, **kwargs)
Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
log_accuracy(section)
log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)
Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
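The mean-of-vectors query described above can be sketched in plain Python. This is a simplified illustration, assuming unit-normalized vectors, weights of +1 for positive and -1 for negative words, and exclusion of the input words from the results; the helper names and toy data are hypothetical, not the library's implementation.

```python
import math


def unit(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive=(), negative=(), topn=10):
    """Rank words by cosine similarity to the signed combination of the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for words, weight in ((positive, 1.0), (negative, -1.0)):
        for word in words:
            for i, x in enumerate(unit(vectors[word])):
                query[i] += weight * x
    query = unit(query)  # only the direction matters for cosine similarity
    scored = [(word, sum(a * b for a, b in zip(unit(vec), query)))
              for word, vec in vectors.items()
              if word not in positive and word not in negative]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors with axes (male, female, royal).
toy = {"man": [1.0, 0.0, 0.0], "woman": [0.0, 1.0, 0.0], "king": [1.0, 0.0, 1.0],
       "queen": [0.0, 1.0, 1.0], "prince": [1.0, 0.0, 0.8], "apple": [0.3, 0.3, 0.1]}
print(most_similar(toy, positive=["woman", "king"], negative=["man"], topn=1))
```

Scaling the summed query by the number of inputs (a true mean) would not change the ranking, because cosine similarity ignores vector length.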
most_similar_cosmul(positive=None, negative=None, topn=10)
Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator, respectively, which is a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
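A plain-Python sketch of the multiplicative objective follows. It assumes that each cosine similarity is shifted from [-1, 1] to [0, 1] before multiplying and that a small constant (here 1e-6) guards the denominator; both details, along with the helper names and toy data, are assumptions of this example rather than statements about the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_cosmul(vectors, positive, negative, topn=10, eps=1e-6):
    """Score = product of shifted positive similarities / product of shifted negative ones."""
    scored = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine(vec, vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine(vec, vectors[n])) / 2.0
        scored.append((word, numerator / (denominator + eps)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors with axes (male, female, royal).
toy = {"man": [1.0, 0.0, 0.0], "woman": [0.0, 1.0, 0.0], "king": [1.0, 0.0, 1.0],
       "queen": [0.0, 1.0, 1.0], "prince": [1.0, 0.0, 0.8], "apple": [0.3, 0.3, 0.1]}
print(most_similar_cosmul(toy, positive=["woman", "king"], negative=["man"], topn=1))
```

Because the terms are multiplied rather than added, a candidate must be reasonably similar to every positive word to score well; one very large similarity cannot compensate for a poor one.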
most_similar_to_given(entity1, entities_list)
Get the entity from entities_list most similar to entity1.
n_similarity(ws1, ws2)
Compute cosine similarity between two sets of words.
Parameters: 


Returns:  Similarities between ws1 and ws2. 
Return type:  numpy.ndarray 
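One way to compare two sets of words is to average each set's vectors and take the cosine similarity of the two means. The sketch below illustrates that approach in plain Python; it is an assumption-labelled illustration with hypothetical helper names and toy data, not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors, words):
    """Component-wise mean of the vectors of the given words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def n_similarity(vectors, ws1, ws2):
    """Cosine similarity between the mean vectors of two word sets."""
    return cosine(mean_vector(vectors, ws1), mean_vector(vectors, ws2))


# Made-up 2D vectors for illustration only.
toy = {"sushi": [0.9, 0.2], "shop": [0.4, 0.6],
       "japanese": [0.8, 0.3], "restaurant": [0.5, 0.5], "engine": [-0.7, 0.1]}
print(n_similarity(toy, ["sushi", "shop"], ["japanese", "restaurant"]))
```

Averaging discards word order and treats every word equally, so the result reflects the overall topical direction of each set rather than any single word.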
num_ngram_vectors
rank(entity1, entity2)
Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
relative_cosine_similarity(wa, wb, topn=10)
Compute the relative cosine similarity between two words given the topn most similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pair.
Parameters: 


Returns:  Relative cosine similarity between wa and wb. 
Return type:  numpy.float64 
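Relative cosine similarity divides the similarity of wa and wb by the summed similarity between wa and its topn nearest neighbours. A minimal plain-Python sketch under that reading of equation (1) is given below; the helper names and toy vectors are hypothetical, and the sketch assumes the summed neighbour similarity is non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the sum of cos(wa, w) over wa's topn nearest neighbours."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for word, vec in vectors.items() if word != wa),
        reverse=True,
    )
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims[:topn])


# Made-up 2D vectors for illustration only.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.1, 0.9]}
print(relative_cosine_similarity(toy, "a", "b", topn=3))
```

Normalizing by the neighbourhood makes scores comparable across words: a word surrounded by many near-equal neighbours yields a small share per neighbour, while a clear closest neighbour takes a large share.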
save(*args, **kwargs)
Save object.
Parameters:  fname (str) – Path to the output file. 

See also
load()
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
Parameters: 


similar_by_vector(vector, topn=10, restrict_vocab=None)
Find the top-N most similar words by vector.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similar_by_word(word, topn=10, restrict_vocab=None)
Find the top-N most similar words.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similarity(w1, w2)
Compute cosine similarity between two words.
Parameters: 


Returns:  Cosine similarity between w1 and w2. 
Return type:  float 
similarity_matrix(**kwargs)
Construct a term similarity matrix for computing Soft Cosine Measure.
This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.
Parameters: 


Returns:  Term similarity matrix. 
Return type: 

See also
gensim.matutils.softcossim()
SoftCosineSimilarity
Notes
The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
syn0
syn0_ngrams
syn0_ngrams_norm
syn0_vocab
syn0_vocab_norm
syn0norm
update_ngrams_weights(seed, old_vocab_len)
Update the vocabulary weights for training continuation.
Parameters: 


Note
Call this after the vocabulary has been updated.
wmdistance(document1, document2)
Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
Parameters: 


Returns:  Word Mover’s distance between document1 and document2. 
Return type:  float 
Warning
This method only works if pyemd is installed.
If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.
Raises:  ImportError – If pyemd isn’t installed.

word_vec(word, use_norm=False)
Get word representations in vector space, as a 1D numpy array.
Parameters: 


Returns:  Vector representation of word. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the word and all of its ngrams are not in the vocabulary.
words_closer_than(w1, w2)
Get all words that are closer to w1 than w2 is to w1.
Parameters: 


Returns:  List of words that are closer to w1 than w2 is to w1. 
Return type:  list (str) 
wv
gensim.models.keyedvectors.KeyedVectors
gensim.models.keyedvectors.Vocab(**kwargs)
Bases: object
A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).
gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size)
Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors
Mapping between words and vectors for the Word2Vec model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.
accuracy(**kwargs)
Compute accuracy of the model.
The accuracy is reported (printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.
Parameters: 


Returns:  Full lists of correct and incorrect predictions divided by sections. 
Return type:  list of dict of (str, (str, str, str))
add(entities, weights, replace=False)
Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.
Parameters: 


closer_than(entity1, entity2)
Get all entities that are closer to entity1 than entity2 is to entity1.
cosine_similarities(vector_1, vectors_all)
Compute cosine similarities between one vector and a set of other vectors.
Parameters: 


Returns:  Contains the cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).
Return type:  numpy.ndarray
distance(w1, w2)
Compute cosine distance between two words.
Calculated as 1 - similarity().
Parameters: 


Returns:  Distance between w1 and w2. 
Return type:  float 
distances(word_or_vector, other_words=())
Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in vocab.
Parameters: 


Returns:  Array containing distances to all words in other_words from input word_or_vector. 
Return type:  numpy.array 
Raises:  KeyError – If either word_or_vector or any word in other_words is absent from vocab. 
doesnt_match(words)
Which word from the given list doesn’t go with the others?
Parameters:  words (list of str) – List of words.

Returns:  The word farthest from the mean of all words.
Return type:  str 
evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute performance of the model on an analogy test set.
This is the modern variant of accuracy(), see discussion on GitHub #1935.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
Parameters: 


Returns: 

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at:
* http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
* https://www.cl.cam.ac.uk/~fh295/simlex.html
Parameters: 


Returns: 

get_keras_embedding(train_embeddings=False)
Get a Keras ‘Embedding’ layer with weights set as the Word2Vec model’s learned word embeddings.
Parameters:  train_embeddings (bool) – If False, the weights are frozen and stopped from being updated. If True, the weights can/will be further trained/updated. 

Returns:  Embedding layer. 
Return type:  keras.layers.Embedding 
Raises:  ImportError – If Keras is not installed.
Warning
This method works only if Keras is installed.
get_vector(word)
Get the entity’s representation in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
index2entity
init_sims(replace=False)
Precompute L2-normalized vectors.
Parameters:  replace (bool, optional) – If True, forget the original vectors and only keep the normalized ones (saves lots of memory).

Warning
You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.
load(fname_or_handle, **kwargs)
Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)
Load the input-hidden weight matrix from the original C word2vec-tool format.
Warning
The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
Parameters: 


Returns:  Loaded model. 
Return type: 
log_accuracy(section)
log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)
Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
most_similar_cosmul(positive=None, negative=None, topn=10)
Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator, respectively, which is a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
most_similar_to_given(entity1, entities_list)
Get the entity from entities_list most similar to entity1.
n_similarity(ws1, ws2)
Compute cosine similarity between two sets of words.
Parameters: 


Returns:  Similarities between ws1 and ws2. 
Return type:  numpy.ndarray 
rank(entity1, entity2)
Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
relative_cosine_similarity(wa, wb, topn=10)
Compute the relative cosine similarity between two words given the topn most similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pair.
Parameters: 


Returns:  Relative cosine similarity between wa and wb. 
Return type:  numpy.float64 
save(*args, **kwargs)
Save KeyedVectors.
Parameters:  fname (str) – Path to the output file. 

See also
load()
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
Parameters: 


similar_by_vector(vector, topn=10, restrict_vocab=None)
Find the top-N most similar words by vector.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similar_by_word(word, topn=10, restrict_vocab=None)
Find the top-N most similar words.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similarity(w1, w2)
Compute cosine similarity between two words.
Parameters: 


Returns:  Cosine similarity between w1 and w2. 
Return type:  float 
similarity_matrix(**kwargs)
Construct a term similarity matrix for computing Soft Cosine Measure.
This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.
Parameters: 


Returns:  Term similarity matrix. 
Return type: 

See also
gensim.matutils.softcossim()
SoftCosineSimilarity
Notes
The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
syn0
syn0norm
wmdistance(document1, document2)
Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
Parameters: 


Returns:  Word Mover’s distance between document1 and document2. 
Return type:  float 
Warning
This method only works if pyemd is installed.
If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.
Raises:  ImportError – If pyemd isn’t installed.

word_vec(word, use_norm=False)
Get word representations in vector space, as a 1D numpy array.
Parameters: 


Returns:  Vector representation of word. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the word is not in the vocabulary.
words_closer_than(w1, w2)
Get all words that are closer to w1 than w2 is to w1.
Parameters: 


Returns:  List of words that are closer to w1 than w2 is to w1. 
Return type:  list (str) 
wv
¶
gensim.models.keyedvectors.
WordEmbeddingSimilarityIndex
(keyedvectors, threshold=0.0, exponent=2.0, kwargs=None)¶Bases: gensim.similarities.termsim.TermSimilarityIndex
Computes cosine similarities between word embeddings and retrieves the closest word embeddings by cosine similarity for a given word embedding.
Parameters: 


See also
SparseTermSimilarityMatrix
load
(fname, mmap=None)¶Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
most_similar
(t1, topn=10)¶Get most similar terms for a given term.
Return most similar terms for a given term along with the similarities.
Parameters: 


Returns:  Most similar terms along with their similarities to term. Only terms distinct from term must be returned. 
Return type:  iterable of (str, float) 
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)¶Save the object to a file.
Parameters: 


See also
load()
gensim.models.keyedvectors.
WordEmbeddingsKeyedVectors
(vector_size)¶Bases: gensim.models.keyedvectors.BaseKeyedVectors
Class containing common methods for operations over word vectors.
accuracy
(**kwargs)¶Compute accuracy of the model.
The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.
Parameters: 


Returns:  Full lists of correct and incorrect predictions divided by sections. 
Return type:  list of dict of (str, (str, str, str)) 
add
(entities, weights, replace=False)¶Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.
Parameters: 


closer_than
(entity1, entity2)¶Get all entities that are closer to entity1 than entity2 is to entity1.
cosine_similarities
(vector_1, vectors_all)¶Compute cosine similarities between one vector and a set of other vectors.
Parameters: 


Returns:  Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,). 
Return type:  numpy.ndarray 
distance
(w1, w2)¶Compute cosine distance between two words.
Calculate 1 - similarity().
Parameters: 


Returns:  Distance between w1 and w2. 
Return type:  float 
distances
(word_or_vector, other_words=())¶Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distances between word_or_vector and all words in vocab.
Parameters: 


Returns:  Array containing distances to all words in other_words from input word_or_vector. 
Return type:  numpy.array 
Raises:  KeyError – If either word_or_vector or any word in other_words is absent from vocab. 
doesnt_match
(words)¶Which word from the given list doesn’t go with the others?
Parameters:  words (list of str) – List of words. 

Returns:  The word farthest from the mean of all words. 
Return type:  str 
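The idea of picking the word farthest from the mean can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: vectors are L2-normalized before averaging, and "farthest" means lowest cosine similarity to the normalized mean. The helper names `unit` and `odd_one_out` and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    normed = [unit(vectors[w]) for w in words]
    dim = len(normed[0])
    mean = unit([sum(v[i] for v in normed) / len(normed) for i in range(dim)])
    sims = [sum(a * b for a, b in zip(v, mean)) for v in normed]
    return words[sims.index(min(sims))]

# Hypothetical toy vectors: three "meal" words point one way, "cereal" another.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.0],
    "cereal": [0.1, 1.0],
}
outlier = odd_one_out(["breakfast", "lunch", "dinner", "cereal"], toy)
```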
evaluate_word_analogies
(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶Compute performance of the model on an analogy test set.
This is a modern variant of accuracy(); see the discussion on GitHub #1935.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
Parameters: 


Returns: 

evaluate_word_pairs
(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at:
* http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
* https://www.cl.cam.ac.uk/~fh295/simlex.html
Parameters: 


Returns: 

get_vector
(word)¶Get the entity’s representations in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
index2entity
¶
init_sims
(replace=False)¶Precompute L2-normalized vectors.
Parameters:  replace (bool, optional) – If True, forget the original vectors and keep only the normalized ones, which saves a lot of memory. 

Warning
You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.
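To make the precomputation concrete, here is a minimal plain-Python sketch of L2 normalization. The function name `l2_normalize` and the toy data are assumptions for illustration; it is not the library's code.

```python
import math

def l2_normalize(rows):
    """Return a copy of rows with every vector scaled to unit L2 length."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized

raw_vectors = [[3.0, 4.0], [0.0, 2.0]]     # hypothetical toy data
unit_vectors = l2_normalize(raw_vectors)   # rows now have length 1

# With unit-length rows, a plain dot product is already the cosine similarity.
cosine = sum(a * b for a, b in zip(unit_vectors[0], unit_vectors[1]))
```

Keeping both `raw_vectors` and `unit_vectors` doubles the memory used; discarding the raw copy is the trade-off that the replace flag describes.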
load
(fname_or_handle, **kwargs)¶Load an object previously saved using save() from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
log_accuracy
(section)¶
log_evaluate_word_pairs
(pearson, spearman, oov, pairs)¶
most_similar
(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)¶Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
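The "simple mean, then cosine" procedure can be sketched in plain Python. This is a minimal sketch under stated assumptions: input vectors are L2-normalized, negative words get weight -1, and the query words themselves are excluded from the results. The function name `most_similar_sketch` and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive=(), negative=(), topn=10):
    """Rank words by cosine similarity to the signed mean of the query words."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, weight in signed:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += weight * x / len(signed)
    query = unit(mean)
    exclude = set(positive) | set(negative)
    scored = [
        (w, sum(a * b for a, b in zip(unit(v), query)))
        for w, v in vectors.items()
        if w not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy space with axes (royalty, maleness).
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}
result = most_similar_sketch(toy, positive=["woman", "king"], negative=["man"], topn=1)
```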
most_similar_cosmul
(positive=None, negative=None, topn=10)¶Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg in “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator, respectively. This is a potentially sensible but untested extension of the method.
With a single positive example, rankings will be the same as in the default most_similar().
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
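The multiplicative objective can be sketched as a product of shifted cosines for the positive words, divided by the product for the negative words. This is a minimal plain-Python sketch under stated assumptions: cosines are shifted into [0, 1] via (1 + cos) / 2, and a small epsilon (here 1e-6) guards the denominator. The function name `cosmul_sketch` and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosmul_sketch(vectors, positive, negative, topn=10, eps=1e-6):
    """Rank words by prod(shifted cos to positives) / (prod(shifted cos to negatives) + eps)."""
    normed = {w: unit(v) for w, v in vectors.items()}
    exclude = set(positive) | set(negative)
    scored = []
    for word, vec in normed.items():
        if word in exclude:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + dot(vec, normed[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + dot(vec, normed[n])) / 2.0
        scored.append((word, numerator / (denominator + eps)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy space with axes (royalty, maleness).
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}
ranking = cosmul_sketch(toy, positive=["woman", "king"], negative=["man"])
```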
most_similar_to_given
(entity1, entities_list)¶Get the entity from entities_list most similar to entity1.
n_similarity
(ws1, ws2)¶Compute cosine similarity between two sets of words.
Parameters: 


Returns:  Similarities between ws1 and ws2. 
Return type:  numpy.ndarray 
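One way to read "similarity between two sets of words" is the cosine between the mean vectors of each set. The sketch below assumes exactly that; the function name `n_similarity_sketch` and the toy vectors are hypothetical.

```python
import math

def mean_vector(words, vectors):
    """Component-wise mean of the vectors for the given words."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def n_similarity_sketch(ws1, ws2, vectors):
    """Cosine similarity between the mean vectors of two word sets."""
    m1 = mean_vector(ws1, vectors)
    m2 = mean_vector(ws2, vectors)
    dot = sum(a * b for a, b in zip(m1, m2))
    norm1 = math.sqrt(sum(a * a for a in m1))
    norm2 = math.sqrt(sum(b * b for b in m2))
    return dot / (norm1 * norm2)

# Hypothetical toy vectors.
toy = {
    "sushi": [1.0, 0.0],
    "shop": [0.0, 1.0],
    "japanese": [1.0, 0.2],
    "restaurant": [0.2, 1.0],
}
set_sim = n_similarity_sketch(["sushi", "shop"], ["japanese", "restaurant"], toy)
```

In this toy data the two means point in the same direction, so the similarity is 1 even though no individual pair of words is identical.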
rank
(entity1, entity2)¶Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
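A minimal sketch of how such a rank can be derived from a "closer than" query, assuming cosine distance and a 1-based rank (the nearest entity has rank 1). The function names `closer_than_sketch` and `rank_sketch` and the toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def closer_than_sketch(vectors, entity1, entity2):
    """Entities strictly closer to entity1 than entity2 is (excluding entity1 itself)."""
    reference = cosine_distance(vectors[entity1], vectors[entity2])
    return [
        name
        for name, vec in vectors.items()
        if name != entity1 and cosine_distance(vectors[entity1], vec) < reference
    ]

def rank_sketch(vectors, entity1, entity2):
    """Number of entities closer to entity1 than entity2 is, plus one."""
    return len(closer_than_sketch(vectors, entity1, entity2)) + 1

# Hypothetical toy vectors at increasing angles from "a".
toy = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [1.0, 1.0], "d": [0.0, 1.0]}
closer = closer_than_sketch(toy, "a", "c")
ranks = [rank_sketch(toy, "a", other) for other in ("b", "c", "d")]
```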
relative_cosine_similarity
(wa, wb, topn=10)¶Compute the relative cosine similarity between two words given topn similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.
Parameters: 


Returns:  Relative cosine similarity between wa and wb. 
Return type:  numpy.float64 
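Relative cosine similarity divides the cosine similarity of wa and wb by the summed cosine similarities of the topn words most similar to wa. The plain-Python sketch below assumes that reading of the paper's equation (1); the function name `rcs_sketch` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rcs_sketch(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the sum of cosines of wa's topn nearest words."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for name, vec in vectors.items() if name != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims)

# Hypothetical toy vectors: cosines to "a" are 0.6 ("b"), 0.8 ("c") and 0.0 ("d").
toy = {"a": [1.0, 0.0], "b": [3.0, 4.0], "c": [4.0, 3.0], "d": [0.0, 1.0]}
rcs_ac = rcs_sketch(toy, "a", "c", topn=2)
rcs_ab = rcs_sketch(toy, "a", "b", topn=2)
```

With topn=2, the two scores over the nearest neighbours of "a" sum to 1, since each is a share of the same denominator.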
save
(*args, **kwargs)¶Save KeyedVectors.
Parameters:  fname (str) – Path to the output file. 

See also
load()
similar_by_vector
(vector, topn=10, restrict_vocab=None)¶Find the top-N most similar words by vector.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similar_by_word
(word, topn=10, restrict_vocab=None)¶Find the top-N most similar words.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similarity
(w1, w2)¶Compute cosine similarity between two words.
Parameters: 


Returns:  Cosine similarity between w1 and w2. 
Return type:  float 
similarity_matrix
(**kwargs)¶Construct a term similarity matrix for computing Soft Cosine Measure.
This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.
Parameters: 


Returns:  Term similarity matrix. 
Return type: 

See also
gensim.matutils.softcossim()
SoftCosineSimilarity
Notes
The constructed matrix corresponds to the matrix M_rel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
syn0
¶
syn0norm
¶
wmdistance
(document1, document2)¶Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
Parameters: 


Returns:  Word Mover’s distance between document1 and document2. 
Return type:  float 
Warning
This method only works if pyemd is installed.
If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.
Raises:  ImportError – If pyemd isn’t installed. 

word_vec
(word, use_norm=False)¶Get word representations in vector space, as a 1D numpy array.
Parameters: 


Returns:  Vector representation of word. 
Return type:  numpy.ndarray 
Raises:  KeyError – If word not in vocabulary. 
words_closer_than
(w1, w2)¶Get all words that are closer to w1 than w2 is to w1.
Parameters: 


Returns:  List of words that are closer to w1 than w2 is to w1. 
Return type:  list (str) 
wv
¶