models.keyedvectors – Store and query word vectors

This module implements word vectors and their similarity look-ups.

Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, the key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

Why use KeyedVectors instead of a full model?

capability                                     | KeyedVectors | full model | note
continue training vectors                      | no           | yes        | You need the full model to train or update vectors.
smaller objects                                | yes          | no         | KeyedVectors are smaller and need less RAM, because they don’t need to store the model state that enables training.
save/load from native fasttext/word2vec format | yes          | no         | Vectors exported by the Facebook and Google tools do not support further training, but you can still load them into KeyedVectors.
append new vectors                             | yes          | yes        | Add new entity-vector entries to the mapping dynamically.
concurrency                                    | yes          | yes        | Thread-safe, allows concurrent vector queries.
shared RAM                                     | yes          | yes        | Multiple processes can re-use the same data, keeping only a single copy in RAM using mmap.
fast load                                      | yes          | yes        | Supports mmap to load data from disk instantaneously.

TL;DR: the main difference is that KeyedVectors do not support further training. On the other hand, by shedding the internal data structures necessary for training, KeyedVectors offer a smaller RAM footprint and a simpler interface.

How to obtain word vectors?

Train a full model, then access its model.wv property, which holds the standalone keyed vectors. For example, using the Word2Vec algorithm to train the vectors

>>> from gensim.test.utils import common_texts
>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
>>> word_vectors = model.wv

Persist the word vectors to disk with

>>> from gensim.test.utils import get_tmpfile
>>> from gensim.models import KeyedVectors
>>>
>>> fname = get_tmpfile("vectors.kv")
>>> word_vectors.save(fname)
>>> word_vectors = KeyedVectors.load(fname, mmap='r')

The vectors can also be loaded as a KeyedVectors instance from an existing file on disk in the original Google word2vec C format

>>> from gensim.test.utils import datapath
>>>
>>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
>>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)  # C bin format
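
For orientation, the plain-text variant of the word2vec C format starts with a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with its space-separated components. The sketch below parses that layout with the standard library only. The names `read_word2vec_text`, `sample` and `toy_vectors` are made up for this illustration, and the code is not how `load_word2vec_format` is implemented.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec layout into {word: [float, ...]}."""
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines at the end of the file
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError("unexpected number of fields: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d vectors, found %d" % (vocab_size, len(vectors)))
    return vectors


# Hypothetical two-word file with three dimensions per vector.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
toy_vectors = read_word2vec_text(sample)
```

In practice, `KeyedVectors.load_word2vec_format` is the better choice, since it also covers the binary variant.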

What can I do with word vectors?

You can perform various syntactic/semantic NLP word tasks with the trained vectors. Some of them are already built-in

>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.7699
>>>
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> print("{}: {:.4f}".format(*result[0]))
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>>
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>>
>>> result = word_vectors.similar_by_word("cat")
>>> print("{}: {:.4f}".format(*result[0]))
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>>
>>> similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print("{:.4f}".format(similarity))
3.4893
>>>
>>> distance = word_vectors.distance("media", "media")
>>> print("{:.1f}".format(distance))
0.0
>>>
>>> sim = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print("{:.4f}".format(sim))
0.7067
>>>
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.word_vec('office', use_norm=True)  # L2-normalized vector of a word
>>> vector.shape
(100,)

Correlation with human opinion on word similarity

>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

And on word analogies

>>> analogy_scores = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

and so on.

class gensim.models.keyedvectors.BaseKeyedVectors(vector_size)

Bases: gensim.utils.SaveLoad

Abstract base class / interface for various types of word vectors.

add(entities, weights, replace=False)

Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.

Parameters
  • entities (list of str) – Entities specified by string ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
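
The effect of the replace flag can be sketched with a plain dict standing in for the vector store. `add_entities` is a hypothetical helper written for this illustration and is not part of gensim.

```python
def add_entities(store, entities, weights, replace=False):
    """Add (entity, vector) pairs to store, keeping old vectors unless replace is True."""
    for entity, vector in zip(entities, weights):
        if entity in store and not replace:
            continue  # existing entry wins when replace is False
        store[entity] = list(vector)
    return store


store = {"cat": [1.0, 0.0]}

# "cat" already exists, so only "dog" is added here.
add_entities(store, ["cat", "dog"], [[9.0, 9.0], [0.0, 1.0]])
kept = store["cat"]

# With replace=True the existing vector is overwritten.
add_entities(store, ["cat"], [[9.0, 9.0]], replace=True)
replaced = store["cat"]
```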

closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.
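
A minimal sketch of this query, assuming cosine distance and a strict comparison, so that entity1 and entity2 themselves are not returned. The helper names and the toy vectors are illustrative only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def closer_than(vectors, entity1, entity2):
    """List entities strictly closer to entity1 than entity2 is."""
    threshold = cosine_distance(vectors[entity1], vectors[entity2])
    return [
        key
        for key, vec in vectors.items()
        if key != entity1 and cosine_distance(vectors[entity1], vec) < threshold
    ]


toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.6, 0.4],
    "car": [0.0, 1.0],
}
nearer = closer_than(toy, "cat", "dog")
```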

distance(entity1, entity2)

Compute distance between vectors of two input entities, specified by their string id.

distances(entity1, other_entities=())

Compute distances from a given entity (its string id) to all entities in other_entities. If other_entities is empty, return the distance between entity1 and all entities in vocab.

get_vector(entity)

Get the entity’s representations in vector space, as a 1D numpy array.

Parameters

entity (str) – Identifier of the entity to return the vector for.

Returns

Vector for the specified entity.

Return type

numpy.ndarray

Raises

KeyError – If the given entity identifier doesn’t exist.

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(**kwargs)

Find the top-N most similar entities. Possibly have positive and negative list of entities in **kwargs.

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
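
One way to read this definition: the rank of entity2 is one plus the number of entities that are strictly closer to entity1. The sketch below assumes cosine distance and does not count entity1 itself. `rank` and the toy vectors are illustrative names, not the library implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def rank(vectors, entity1, entity2):
    """Rank 1 means entity2 is the nearest neighbour of entity1."""
    threshold = cosine_distance(vectors[entity1], vectors[entity2])
    closer = sum(
        1
        for key, vec in vectors.items()
        if key != entity1 and cosine_distance(vectors[entity1], vec) < threshold
    )
    return closer + 1


toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.6, 0.4],
    "car": [0.0, 1.0],
}
ranks_from_cat = {other: rank(toy, "cat", other) for other in ("kitten", "dog", "car")}
```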

save(fname_or_handle, **kwargs)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

similarity(entity1, entity2)

Compute cosine similarity between two entities, specified by their string id.

class gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size, mapfile_path)

Bases: gensim.models.keyedvectors.BaseKeyedVectors

add(entities, weights, replace=False)

Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.

Parameters
  • entities (list of str) – Entities specified by string ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.

closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

distance(d1, d2)

Compute cosine distance between two documents.

distances(d1, other_docs=())

Compute cosine distances from given d1 to all documents in other_docs.

TODO: Accept vectors of out-of-training-set docs, as if from inference.

Parameters
  • d1 ({str, numpy.ndarray}) – Doctag/index of document.

  • other_docs (iterable of {str, int}) – Sequence of doctags/indexes. If None or empty, distance of d1 from all doctags in vocab is computed (including itself).

Returns

Array containing distances to all documents in other_docs from input d1.

Return type

numpy.array

property doctag_syn0
property doctag_syn0norm
doesnt_match(docs)

Which document from the given list doesn’t go with the others from the training set?

TODO: Accept vectors of out-of-training-set docs, as if from inference.

Parameters

docs (list of {str, int}) – Sequence of doctags/indexes.

Returns

Doctag/index of the document farthest away from the mean of all the documents.

Return type

{str, int}

get_vector(entity)

Get the entity’s representations in vector space, as a 1D numpy array.

Parameters

entity (str) – Identifier of the entity to return the vector for.

Returns

Vector for the specified entity.

Return type

numpy.ndarray

Raises

KeyError – If the given entity identifier doesn’t exist.

property index2entity
index_to_doctag(i_index)

Get string key for given i_index, if available. Otherwise return raw int doctag (same int).

init_sims(replace=False)

Precompute L2-normalized vectors.

Parameters

replace (bool, optional) – If True, discard the original vectors and keep only the normalized ones, which saves a lot of memory.

Warning

You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train and infer_vector.
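
What the replace flag changes can be sketched with nested lists in place of numpy arrays. `TinyVectors` is a hypothetical stand-in class; the attribute names `vectors` and `vectors_norm` are assumptions chosen for the illustration.

```python
import math


def l2_normalize_rows(rows):
    """Scale every row to unit Euclidean length."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized


class TinyVectors:
    def __init__(self, rows):
        self.vectors = rows
        self.vectors_norm = None

    def init_sims(self, replace=False):
        normalized = l2_normalize_rows(self.vectors)
        if replace:
            # The raw vectors are dropped; only the normalized copy survives.
            self.vectors = normalized
        self.vectors_norm = normalized


kept = TinyVectors([[3.0, 4.0], [0.0, 2.0]])
kept.init_sims()

replaced = TinyVectors([[3.0, 4.0], [0.0, 2.0]])
replaced.init_sims(replace=True)
```

After `replace=True` both attributes refer to the same data, which is where the memory saving comes from and why the original magnitudes needed for further training are gone.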

int_index(index, doctags, max_rawint)

Get int index for either string or int index

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)

Find the top-N most similar docvecs from the training set. Positive docvecs contribute positively towards the similarity, negative docvecs negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given docs. Docs may be specified as vectors, integer indexes of trained docvecs, or if the documents were originally presented with string tags, by the corresponding tags.

TODO: Accept vectors of out-of-training-set docs, as if from inference.

Parameters
  • positive (list of {str, int}, optional) – List of doctags/indexes that contribute positively.

  • negative (list of {str, int}, optional) – List of doctags/indexes that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar docvecs to return, when topn is int. When topn is None, then similarities for all docvecs are returned.

  • clip_start (int) – Start clipping index.

  • clip_end (int) – End clipping index.

Returns

Sequence of (doctag/index, similarity).

Return type

list of ({str, int}, float)

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

n_similarity(ds1, ds2)

Compute cosine similarity between two sets of docvecs from the trained set.

TODO: Accept vectors of out-of-training-set docs, as if from inference.

Parameters
  • ds1 (list of {str, int}) – Set of documents as a sequence of doctags/indexes.

  • ds2 (list of {str, int}) – Set of documents as a sequence of doctags/indexes.

Returns

The cosine similarity between the means of the documents in each of the two sets.

Return type

float
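
The returned value, the cosine similarity between the two set means, can be sketched as follows. The doctags and vectors are invented, and plain lists replace numpy arrays.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def n_similarity(vectors, keys1, keys2):
    """Cosine similarity between the mean vectors of two key sets."""
    mean1 = mean_vector([vectors[key] for key in keys1])
    mean2 = mean_vector([vectors[key] for key in keys2])
    return cosine(mean1, mean2)


docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
    "doc_d": [2.0, 2.0],
}
same_direction = n_similarity(docs, ["doc_a", "doc_b"], ["doc_c", "doc_d"])
orthogonal = n_similarity(docs, ["doc_a"], ["doc_b"])
```

Here the two set means point in the same direction, so `same_direction` is close to 1, while `orthogonal` is 0.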

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

save(*args, **kwargs)

Save object.

Parameters

fname (str) – Path to the output file.

See also

load()

Load object.

save_word2vec_format(fname, prefix='*dt_', fvocab=None, total_vec=None, binary=False, write_first_line=True)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters
  • fname (str) – The file path used to save the vectors in.

  • prefix (str, optional) – Uniquely identifies doctags from word vocab, and avoids collision in case of repeated string in doctag and word vocab.

  • fvocab (str, optional) – UNUSED.

  • total_vec (int, optional) – Explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards)

  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

  • write_first_line (bool, optional) – Whether to print the first line in the file. Useful when saving doc-vectors after word-vectors.

similarity(d1, d2)

Compute cosine similarity between two docvecs from the training set.

TODO: Accept vectors of out-of-training-set docs, as if from inference.

Parameters
  • d1 ({int, str}) – Doctag/index of document.

  • d2 ({int, str}) – Doctag/index of document.

Returns

The cosine similarity between the vectors of the two documents.

Return type

float

similarity_unseen_docs(model, doc_words1, doc_words2, alpha=None, min_alpha=None, steps=None)

Compute cosine similarity between two documents that were not in the training data; their vectors are inferred with the given model after bulk training.

Parameters
  • model (Doc2Vec) – An instance of a trained Doc2Vec model.

  • doc_words1 (list of str) – Input document.

  • doc_words2 (list of str) – Input document.

  • alpha (float, optional) – The initial learning rate.

  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.

  • steps (int, optional) – Number of epochs to train the new document.

Returns

The cosine similarity between doc_words1 and doc_words2.

Return type

float

class gensim.models.keyedvectors.FastTextKeyedVectors(vector_size, min_n, max_n, bucket, compatible_hash)

Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors

Vectors and vocab for FastText.

Implements significant parts of the FastText algorithm. For example, the word_vec() calculates vectors for out-of-vocabulary (OOV) entities. FastText achieves this by keeping vectors for ngrams: adding the vectors for the ngrams of an entity yields the vector for the entity.

Similar to a hashmap, this class keeps a fixed number of buckets, and maps all ngrams to buckets using a hash function.

This class also provides an abstraction over the hash functions used by Gensim’s FastText implementation over time. The hash function connects ngrams to buckets. Originally, the hash function was broken and incompatible with Facebook’s implementation. The current hash is fully compatible.

Parameters
  • vector_size (int) – The dimensionality of all vectors.

  • min_n (int) – The minimum number of characters in an ngram

  • max_n (int) – The maximum number of characters in an ngram

  • bucket (int) – The number of buckets.

  • compatible_hash (boolean) – If True, uses the Facebook-compatible hash function instead of the Gensim backwards-compatible hash function.
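
The ngram-to-bucket idea described above can be sketched as follows. Several details are assumptions made for the illustration: words are wrapped in `<` and `>` before ngrams are extracted, the hash is a plain 32-bit FNV-1a (which may differ from the hash functions this class abstracts over, in particular for non-ASCII input), and the bucket matrix is a toy list of lists. None of the helper names come from gensim.

```python
def char_ngrams(word, min_n, max_n):
    """All character ngrams of the wrapped word with min_n <= length <= max_n."""
    wrapped = "<" + word + ">"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


def fnv1a_32(text):
    """Plain 32-bit FNV-1a over the UTF-8 bytes of text."""
    value = 2166136261
    for byte in text.encode("utf-8"):
        value ^= byte
        value = (value * 16777619) & 0xFFFFFFFF
    return value


def oov_vector(word, bucket_vectors, min_n, max_n):
    """Add up the bucket vectors of all ngrams of word."""
    bucket = len(bucket_vectors)
    total = [0.0] * len(bucket_vectors[0])
    for ngram in char_ngrams(word, min_n, max_n):
        row = bucket_vectors[fnv1a_32(ngram) % bucket]
        total = [t + r for t, r in zip(total, row)]
    return total


# Toy matrix: 8 buckets, 2 dimensions; the second column counts contributing ngrams.
toy_buckets = [[float(i), 1.0] for i in range(8)]
ngrams = char_ngrams("cat", 3, 4)
vector = oov_vector("cat", toy_buckets, 3, 4)
```

Because every ngram lands in some bucket, a vector can be produced even for a word that never appeared in training. A real implementation may additionally rescale the sum, for example by the number of ngrams.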

vectors_vocab

Each row corresponds to a vector for an entity in the vocabulary. Columns correspond to vector dimensions.

Type

np.array

vectors_vocab_norm

Same as vectors_vocab, but the vectors are L2 normalized.

Type

np.array

vectors_ngrams

A vector for each ngram across all entities in the vocabulary. Each row is a vector that corresponds to a bucket. Columns correspond to vector dimensions.

Type

np.array

vectors_ngrams_norm

Same as vectors_ngrams, but the vectors are L2 normalized. Under some conditions, may actually be the same matrix as vectors_ngrams, e.g. if init_sims() was called with replace=True.

Type

np.array

buckets_word

Maps vocabulary items (by their index) to the buckets they occur in.

Type

dict

accuracy(questions, restrict_vocab=30000, most_similar=<function WordEmbeddingsKeyedVectors.most_similar>, case_insensitive=True)

Compute accuracy of the model.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Parameters
  • questions (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • most_similar (function, optional) – Function used for similarity calculation.

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Returns

Full lists of correct and incorrect predictions divided by sections.

Return type

list of dict of (str, (str, str, str))

add(entities, weights, replace=False)

Append entities and their vectors manually. If an entity is already in the vocabulary, its old vector is kept unless the replace flag is True.

Parameters
  • entities (list of str) – Entities specified by string ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.

adjust_vectors()

Adjust the vectors for words in the vocabulary.

The adjustment relies on the vectors of the ngrams making up each individual word.

closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).

  • vectors_all (numpy.ndarray) – For each row in vectors_all, similarity to vector_1 is computed, expected shape (num_vectors, dim).

Returns

Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type

numpy.ndarray
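
The one-against-many computation can be sketched without numpy as below. Each result is the dot product divided by the product of the two vector norms, and subtracting a value from 1 gives the corresponding cosine distance. The function here is a stand-alone illustration, not the library code.

```python
import math


def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity of vector_1 to every row of vectors_all."""
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    similarities = []
    for row in vectors_all:
        norm_row = math.sqrt(sum(x * x for x in row))
        dot = sum(a * b for a, b in zip(vector_1, row))
        similarities.append(dot / (norm_1 * norm_row))
    return similarities


sims = cosine_similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
dists = [1.0 - s for s in sims]
```

The three rows are parallel, orthogonal and opposite to the query vector, which covers the full range of possible values.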

distance(w1, w2)

Compute cosine distance between two words. Calculate 1 - similarity().

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Distance between w1 and w2.

Return type

float

distances(word_or_vector, other_words=())

Compute cosine distances from given word or vector to all words in other_words. If other_words is empty, return distance between word_or_vector and all words in vocab.

Parameters
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.

  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).

Returns

Array containing distances to all words in other_words from input word_or_vector.

Return type

numpy.array

Raises

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters

words (list of str) – List of words.

Returns

The word farthest away from the mean of all words.

Return type

str
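
A minimal sketch of the "farthest from the mean" idea. It assumes the vectors are L2-normalized before averaging and that closeness is measured by the dot product with the normalized mean; the toy vectors are invented.

```python
import math


def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = {word: unit(vectors[word]) for word in words}
    mean = unit([sum(column) / len(words) for column in zip(*units.values())])
    return min(words, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [0.95, 0.15],
    "cereal": [0.1, 1.0],
}
odd_one = doesnt_match(toy, ["breakfast", "cereal", "dinner", "lunch"])
```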

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute performance of the model on an analogy test set.

This is a modern variant of accuracy(), see discussion on GitHub #1935.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.

Returns

  • score (float) – The overall evaluation score on the entire evaluation set

  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
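
The file layout and the shape of the result can be sketched as follows. The prediction step is replaced by a canned lookup table (`canned`, hypothetical), so the code only illustrates parsing and scoring, not the vector arithmetic; the sample lines are invented, and the score is taken to be the fraction of correctly answered 4-tuples.

```python
def parse_analogies(lines):
    """Group 4-tuples of words under the preceding ': SECTION NAME' line."""
    sections = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(": "):
            sections.append({"section": line[2:], "questions": []})
            continue
        words = line.split()
        if len(words) != 4 or not sections:
            raise ValueError("malformed analogy line: %r" % raw)
        sections[-1]["questions"].append(tuple(words))
    return sections


def evaluate(sections, predict):
    """Score predict(a, b, c) against the expected fourth word of each tuple."""
    results, n_correct, n_total = [], 0, 0
    for sec in sections:
        correct, incorrect = [], []
        for a, b, c, expected in sec["questions"]:
            target = correct if predict(a, b, c) == expected else incorrect
            target.append((a, b, c, expected))
        results.append({"section": sec["section"], "correct": correct, "incorrect": incorrect})
        n_correct += len(correct)
        n_total += len(correct) + len(incorrect)
    score = n_correct / n_total if n_total else 0.0
    return score, results


sample = [
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
    "Athens Greece Oslo Norway",
    ": family",
    "boy girl king queen",
    "boy girl father mother",
]
# Canned answers standing in for a real model; one of them is deliberately wrong.
canned = {
    ("Athens", "Greece", "Berlin"): "Germany",
    ("Athens", "Greece", "Oslo"): "Sweden",
    ("boy", "girl", "king"): "queen",
    ("boy", "girl", "father"): "mother",
}
score, sections = evaluate(parse_analogies(sample), lambda a, b, c: canned.get((a, b, c)))
```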

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at:

  • http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html

  • https://www.cl.cam.ac.uk/~fh295/simlex.html

Parameters
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.

  • delimiter (str, optional) – Separator in pairs file.

  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero similarities for word pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.

Returns

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.

  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.

  • oov_ratio (float) – The ratio of pairs with unknown words.
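
The two correlation coefficients can be sketched in pure Python as below. The sketch leaves out the p-values, and the score lists are invented; in the real evaluation one list would hold the human judgments from the pairs file and the other the model's similarities for the same pairs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


human_scores = [9.0, 7.0, 3.0, 1.0]   # hypothetical gold similarities
model_scores = [0.8, 0.9, 0.3, 0.1]   # hypothetical model similarities
rho = spearman(human_scores, model_scores)
r = pearson(human_scores, model_scores)
```

Only the first two pairs are ranked differently by the model, which gives a Spearman coefficient of 0.8 for this toy data.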

get_vector(word)

Get the entity’s representations in vector space, as a 1D numpy array.

Parameters

entity (str) – Identifier of the entity to return the vector for.

Returns

Vector for the specified entity.

Return type

numpy.ndarray

Raises

KeyError – If the given entity identifier doesn’t exist.

property index2entity
init_ngrams_weights(seed)

Initialize the vocabulary and ngrams weights prior to training.

Creates the weight matrices and initializes them with uniform random values.

Parameters

seed (float) – The seed for the PRNG.

Note

Call this after the vocabulary has been fully initialized.

init_post_load(vectors)

Perform initialization after loading a native Facebook model.

Expects that the vocabulary (self.vocab) has already been initialized.

Parameters
  • vectors (np.array) – A matrix containing vectors for all the entities, including words and ngrams. This comes directly from the binary model. The order of the vectors must correspond to the indices in the vocabulary.

  • match_gensim (boolean, optional) – No longer supported.

init_sims(replace=False)

Precompute L2-normalized vectors.

Parameters

replace (bool, optional) – If True, discard the original vectors and keep only the normalized ones, which saves a lot of memory.

Warning

You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

static log_accuracy(section)
static log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
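
The additive scheme described above can be sketched as follows: normalize all vectors, average the positive ones with weight +1 and the negative ones with weight -1, then rank the remaining words by their dot product with the normalized mean. Excluding the query words from the result is an assumption of this sketch, and the three-dimensional toy vectors are invented.

```python
import math


def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def most_similar(vectors, positive=(), negative=(), topn=10):
    """Rank words by cosine similarity to the signed mean of the query words."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    weighted = [(word, 1.0) for word in positive] + [(word, -1.0) for word in negative]
    dim = len(next(iter(units.values())))
    mean = [
        sum(weight * units[word][i] for word, weight in weighted) / len(weighted)
        for i in range(dim)
    ]
    query = unit(mean)
    exclude = set(positive) | set(negative)
    scored = [(word, dot(vec, query)) for word, vec in units.items() if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy axes: (person, royal, female).
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "princess": [1.0, 0.5, 1.0],
    "apple": [0.2, 0.1, 0.1],
}
result = most_similar(toy, positive=["woman", "king"], negative=["man"], topn=2)
```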

most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
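
A sketch of the multiplicative objective, under these assumptions: cosine similarities are shifted from [-1, 1] to [0, 1] via (1 + cos) / 2, the positive terms are multiplied into the numerator and the negative terms into the denominator, and a small constant (`epsilon`, value chosen here) guards against division by zero. The toy vectors are invented.

```python
import math


def unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def most_similar_cosmul(vectors, positive=(), negative=(), topn=10, epsilon=1e-6):
    """Rank words by the product of shifted similarities (positives over negatives)."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    exclude = set(positive) | set(negative)
    scored = []
    for word, vec in units.items():
        if word in exclude:
            continue
        numerator = 1.0
        for pos in positive:
            numerator *= (1.0 + dot(vec, units[pos])) / 2.0
        denominator = 1.0
        for neg in negative:
            denominator *= (1.0 + dot(vec, units[neg])) / 2.0
        scored.append((word, numerator / (denominator + epsilon)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy axes: (person, royal, female).
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "princess": [1.0, 0.5, 1.0],
    "apple": [0.2, 0.1, 0.1],
}
result = most_similar_cosmul(toy, positive=["woman", "king"], negative=["man"], topn=3)
```

Because the terms are multiplied rather than added, a candidate has to be reasonably close to every positive word to score well, which is the reduced sensitivity to one dominating term mentioned above.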

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Parameters
  • ws1 (list of str) – Sequence of words.

  • ws2 (list of str) – Sequence of words.

Returns

Similarities between ws1 and ws2.

Return type

numpy.ndarray

property num_ngram_vectors
rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.

Parameters
  • wa (str) – Word for which the top-n similar words are looked up.

  • wb (str) – Word whose relative cosine similarity with wa is evaluated.

  • topn (int, optional) – Number of top-n similar words to consider with respect to wa.

Returns

Relative cosine similarity between wa and wb.

Return type

numpy.float64
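
Read as a formula, the relative cosine similarity divides the cosine similarity of wa and wb by the sum of the cosine similarities between wa and its top-n most similar words. The sketch below follows that reading on invented vectors and is not the library implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the summed similarities of wa's top-n neighbours."""
    neighbours = [
        (word, cosine(vectors[wa], vec)) for word, vec in vectors.items() if word != wa
    ]
    neighbours.sort(key=lambda pair: pair[1], reverse=True)
    denominator = sum(sim for _, sim in neighbours[:topn])
    return cosine(vectors[wa], vectors[wb]) / denominator


toy = {
    "good": [1.0, 0.2],
    "great": [0.9, 0.3],
    "fine": [0.8, 0.5],
    "bad": [0.2, 1.0],
}
rcs_values = {
    word: relative_cosine_similarity(toy, "good", word, topn=3)
    for word in toy
    if word != "good"
}
```

With topn covering every other word in this toy vocabulary, the values sum to 1, which makes the normalizing role of the denominator easy to see.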

save(*args, **kwargs)

Save object.

Parameters

fname (str) – Path to the output file.

See also

load()

Load object.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters
  • fname (str) – The file path used to save the vectors in

  • fvocab (str, optional) – Optional file path used to save the vocabulary

  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

  • total_vec (int, optional) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards).
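
For orientation, the plain-text word2vec layout starts with a header line holding the number of vectors and the vector size, followed by one line per word with its space-separated components. The writer below is a minimal sketch of that layout; the function name is hypothetical, and the ordering and float formatting used by gensim may differ.

```python
import io

def write_word2vec_text(vectors, out):
    """Write a {word: list of float} mapping in the plain-text word2vec layout."""
    vector_size = len(next(iter(vectors.values())))
    # Header: "<number of vectors> <vector size>"
    out.write(f"{len(vectors)} {vector_size}\n")
    for word, vec in vectors.items():
        out.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

buf = io.StringIO()
write_word2vec_text({"cat": [0.1, 0.2], "dog": [0.3, 0.4]}, buf)
text = buf.getvalue()
```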

similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters
  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters
  • word (str) – Word

  • topn (int or None, optional) – Number of top-N similar words to return. If topn is None, similar_by_word returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similarity(w1, w2)

Compute cosine similarity between two words.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Cosine similarity between w1 and w2.

Return type

float
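
A minimal plain-Python sketch of the underlying computation, using hypothetical 2-dimensional vectors in place of real word vectors. It also shows the relation to the cosine distance, which is the complement of the similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for the two input words.
w1_vec = [1.0, 0.0]
w2_vec = [1.0, 1.0]

sim = cosine_similarity(w1_vec, w2_vec)
dist = 1.0 - sim  # cosine distance is 1 - similarity
```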

similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Construct a term similarity matrix for computing Soft Cosine Measure.

This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.

  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in a decreasing order of importance of terms, or in the order of term identifiers if None.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix.

  • dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.

Returns

Term similarity matrix.

Return type

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()

The Soft Cosine Measure.

SoftCosineSimilarity

A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
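
To illustrate how a term similarity matrix is used, the sketch below evaluates the Soft Cosine Measure in its common form, x·S·y / sqrt((x·S·x)(y·S·y)), for two bag-of-words vectors x and y and a term similarity matrix S. The two-term vocabulary and the similarity value 0.5 are hypothetical, and dense nested lists stand in for the sparse matrix.

```python
import math

def quadratic_form(x, S, y):
    """Compute x^T S y for dense lists."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft Cosine Measure between bag-of-words vectors x and y."""
    return quadratic_form(x, S, y) / math.sqrt(
        quadratic_form(x, S, x) * quadratic_form(y, S, y)
    )

# Hypothetical vocabulary: index 0 = "play", index 1 = "game".
S = [[1.0, 0.5],
     [0.5, 1.0]]
doc1 = [1.0, 0.0]  # contains only "play"
doc2 = [0.0, 1.0]  # contains only "game"

soft = soft_cosine(doc1, doc2, S)
identity = [[1.0, 0.0], [0.0, 1.0]]
plain = soft_cosine(doc1, doc2, identity)  # reduces to ordinary cosine
```

The two documents share no terms, so their ordinary cosine is zero, while the soft variant credits the similarity between the two terms.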

property syn0
property syn0_ngrams
property syn0_ngrams_norm
property syn0_vocab
property syn0_vocab_norm
property syn0norm
update_ngrams_weights(seed, old_vocab_len)

Update the vocabulary weights for training continuation.

Parameters
  • seed (float) – The seed for the PRNG.

  • old_vocab_len (int) – The length of the vocabulary prior to its update.

Note

Call this after the vocabulary has been updated.

wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the following papers:

Parameters
  • document1 (list of str) – Input document.

  • document2 (list of str) – Input document.

Returns

Word Mover’s distance between document1 and document2.

Return type

float

Warning

This method only works if pyemd is installed.

If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.

Raises

ImportError

If pyemd isn’t installed.

word_vec(word, use_norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word

  • use_norm (bool, optional) – If True - resulting vector will be L2-normalized (unit euclidean length).

Returns

Vector representation of word.

Return type

numpy.ndarray

Raises

KeyError – If word and all ngrams not in vocabulary.
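
The effect of use_norm can be pictured with a short plain-Python sketch: the vector is divided by its Euclidean length so that the result has unit length. The helper name and the example vector are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

raw = [3.0, 4.0]          # hypothetical raw word vector, length 5
unit = l2_normalize(raw)  # same direction, length 1
unit_length = math.sqrt(sum(x * x for x in unit))
```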

words_closer_than(w1, w2)

Get all words that are closer to w1 than w2 is to w1.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

List of words that are closer to w1 than w2 is to w1.

Return type

list (str)
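
A minimal sketch of the idea, assuming closeness is measured by cosine similarity: take the similarity between w1 and w2 as a threshold and keep every other word that is strictly more similar to w1. The toy vectors and the function name are hypothetical.

```python
import math

# Hypothetical toy vectors.
vectors = {
    "king": [1.0, 0.0],
    "queen": [0.9, 0.1],
    "prince": [0.8, 0.6],
    "apple": [0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def words_closer_than_sketch(w1, w2):
    """Words strictly more similar to w1 than w2 is, excluding w1 itself."""
    threshold = cosine(vectors[w1], vectors[w2])
    return [
        w for w in vectors
        if w != w1 and cosine(vectors[w1], vectors[w]) > threshold
    ]

closer = words_closer_than_sketch("king", "prince")
# One plausible reading of a rank: the number of closer words plus one.
rank_of_prince = len(closer) + 1
```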

property wv
gensim.models.keyedvectors.KeyedVectors

alias of gensim.models.keyedvectors.Word2VecKeyedVectors

class gensim.models.keyedvectors.Vocab(**kwargs)

Bases: object

A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).

class gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size)

Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors

Mapping between words and vectors for the Word2Vec model. Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

accuracy(questions, restrict_vocab=30000, most_similar=<function WordEmbeddingsKeyedVectors.most_similar>, case_insensitive=True)

Compute accuracy of the model.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Parameters
  • questions (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • most_similar (function, optional) – Function used for similarity calculation.

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Returns

Full lists of correct and incorrect predictions divided by sections.

Return type

list of dict of (str, (str, str, str))

add(entities, weights, replace=False)

Append entities and their vectors in a manual way. If some entity is already in the vocabulary, the old vector is kept unless replace flag is True.

Parameters
  • entities (list of str) – Entities specified by string ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
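
The replace semantics described above can be sketched with an ordinary dict standing in for the entity-to-vector mapping. The function `add_entities` and the stored values are hypothetical; the sketch only mirrors the documented rule that existing vectors are kept unless replace is True.

```python
def add_entities(store, entities, weights, replace=False):
    """Add entity/vector pairs to a dict, keeping old vectors unless replace=True."""
    for entity, vec in zip(entities, weights):
        if entity in store and not replace:
            continue  # entity already known: keep the old vector
        store[entity] = list(vec)

store = {"cat": [1.0, 0.0]}

# "cat" already exists, so only "dog" is added.
add_entities(store, ["cat", "dog"], [[9.0, 9.0], [0.0, 1.0]])
after_keep = dict(store)

# With replace=True the existing vector for "cat" is overwritten.
add_entities(store, ["cat"], [[9.0, 9.0]], replace=True)
```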

closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).

  • vectors_all (numpy.ndarray) – For each row in vectors_all, similarity to vector_1 is computed, expected shape (num_vectors, dim).

Returns

Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type

numpy.ndarray
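
A plain-Python sketch of the one-against-many computation, with nested lists standing in for the (num_vectors, dim) array. The function name and the input values are hypothetical.

```python
import math

def cosine_similarities_sketch(vector_1, vectors_all):
    """Cosine similarity between vector_1 and every row of vectors_all."""
    norm_1 = math.sqrt(sum(a * a for a in vector_1))
    sims = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(b * b for b in row))
        sims.append(dot / (norm_1 * norm_row))
    return sims

sims = cosine_similarities_sketch(
    [1.0, 0.0],
    [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]],
)
```

The result has one entry per row: the first row points in the same direction, the second is orthogonal and the third points the opposite way.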

distance(w1, w2)

Compute cosine distance between two words. Calculate 1 - similarity().

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Distance between w1 and w2.

Return type

float

distances(word_or_vector, other_words=())

Compute cosine distances from given word or vector to all words in other_words. If other_words is empty, return distance between word_or_vector and all words in vocab.

Parameters
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.

  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).

Returns

Array containing distances to all words in other_words from input word_or_vector.

Return type

numpy.array

Raises

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters

words (list of str) – List of words.

Returns

The word furthest away from the mean of all words.

Return type

str
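
One way to picture "furthest from the mean" is sketched below: normalize each word vector, average them, and return the word whose vector has the lowest dot product with the normalized mean. The toy vectors and helper names are hypothetical, and the sketch is an approximation of the idea rather than gensim's code.

```python
import math

# Hypothetical toy vectors: three meals and one food item.
vectors = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.2],
    "cereal": [0.1, 1.0],
}

def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def doesnt_match_sketch(words):
    """Return the word least similar to the mean of all given words."""
    units = [unit(vectors[w]) for w in words]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

odd_one = doesnt_match_sketch(["breakfast", "cereal", "dinner", "lunch"])
```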

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute performance of the model on an analogy test set.

This is a modern variant of accuracy(), see discussion on GitHub #1935.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.

Returns

  • score (float) – The overall evaluation score on the entire evaluation set

  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
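
The analogy file layout described above (lines of four words, grouped by “: SECTION NAME” lines) can be parsed with a few lines of plain Python. The parser below is a hypothetical sketch; it uppercases words to mimic the case_insensitive option and skips lines that do not contain exactly four words.

```python
def parse_analogies(lines, case_insensitive=True):
    """Group 4-word analogy lines under their ': SECTION NAME' headers."""
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            sections.append({"section": line[2:], "quads": []})
            continue
        if not sections:
            raise ValueError("analogy line found before any section header")
        words = line.split()
        if len(words) != 4:
            continue  # malformed line: skip it
        if case_insensitive:
            words = [w.upper() for w in words]
        sections[-1]["quads"].append(tuple(words))
    return sections

sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    ": family",
    "boy girl brother sister",
]
parsed = parse_analogies(sample)
```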

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at:

  • http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html

  • https://www.cl.cam.ac.uk/~fh295/simlex.html

Parameters
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.

  • delimiter (str, optional) – Separator in pairs file.

  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero similarities for pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.

Returns

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.

  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.

  • oov_ratio (float) – The ratio of pairs with unknown words.
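
The evaluation can be sketched as follows: collect the human scores and the model similarities for the pairs the model knows, correlate the two lists, and count the pairs that had to be skipped. All words, scores and similarities below are made up for illustration, only the Pearson coefficient is shown, and the skipped share is reported as a plain fraction.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical gold-standard pairs: (word1, word2, human score).
pairs = [
    ("cat", "tiger", 9.0),
    ("cat", "book", 5.0),
    ("cat", "king", 1.0),
    ("cat", "zyzzyva", 4.0),  # second word unknown to the toy model
]
# Hypothetical model similarities for the pairs the model knows.
model_similarity = {("cat", "tiger"): 0.9, ("cat", "book"): 0.5, ("cat", "king"): 0.1}

gold, predicted, oov = [], [], 0
for w1, w2, score in pairs:
    if (w1, w2) not in model_similarity:
        oov += 1  # skipped, as when dummy4unknown is False
        continue
    gold.append(score)
    predicted.append(model_similarity[(w1, w2)])

r = pearson(gold, predicted)
oov_fraction = oov / len(pairs)
```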

get_keras_embedding(train_embeddings=False)

Get a Keras ‘Embedding’ layer with weights set as the Word2Vec model’s learned word embeddings.

Parameters

train_embeddings (bool) – If False, the weights are frozen and stopped from being updated. If True, the weights can/will be further trained/updated.

Returns

Embedding layer.

Return type

keras.layers.Embedding

Raises

ImportError – If Keras is not installed.

Warning

This method works only if Keras is installed.

get_vector(word)

Get the entity’s representations in vector space, as a 1D numpy array.

Parameters

entity (str) – Identifier of the entity to return the vector for.

Returns

Vector for the specified entity.

Return type

numpy.ndarray

Raises

KeyError – If the given entity identifier doesn’t exist.

property index2entity
init_sims(replace=False)

Precompute L2-normalized vectors.

Parameters

replace (bool, optional) – If True, forget the original vectors and keep only the normalized ones, which saves a lot of memory.

Warning

You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Warning

The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

Parameters
  • fname (str) – The file path to the saved word2vec-format file.

  • fvocab (str, optional) – File path to the vocabulary. Word counts are read from fvocab, if set (this is the file generated by the -save-vocab flag of the original C tool).

  • binary (bool, optional) – If True, the data is read as binary word2vec format; otherwise it is read as plain text.

  • encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.

  • unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.

  • limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.

  • datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.

Returns

Loaded model.

Return type

Word2VecKeyedVectors
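
As a rough picture of what reading the plain-text format involves, the sketch below parses the header line (number of vectors and vector size) and then reads one word per line, stopping early when a limit is given. The function name is hypothetical, and encoding handling, the binary format and vocabulary counts are left out.

```python
import io

def read_word2vec_text(handle, limit=None):
    """Parse plain-text word2vec data into a {word: list of float} dict."""
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    if limit is not None:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        parts = handle.readline().rstrip().split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError("unexpected number of components on a line")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

data = "3 2\ncat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n"
loaded = read_word2vec_text(io.StringIO(data), limit=2)
```

With limit=2 only the first two entries are read, which is useful when the file is sorted by descending frequency.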

static log_accuracy(section)
static log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
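
The computation described above can be sketched in plain Python: average the unit vectors of the positive words (added) and negative words (subtracted), then rank all remaining words by cosine similarity to that mean. The toy vectors are hypothetical, and excluding the input words from the results is an assumption of this sketch.

```python
import math

# Hypothetical toy vectors with axes (royalty, gender).
vectors = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [0.1, 0.0],
}

def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar_sketch(positive, negative=(), topn=10):
    """Rank words by cosine similarity to the signed mean of the input words."""
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x / len(signed)
    query = unit(query)
    inputs = set(positive) | set(negative)
    scored = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in inputs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

result = most_similar_sketch(["king", "woman"], ["man"], topn=1)
```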

most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
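
A plain-Python sketch of the multiplicative objective: each cosine similarity is shifted into the range [0, 1], the shifted similarities to the positive words are multiplied into the numerator, and those to the negative words into the denominator. The toy vectors are hypothetical, and the small epsilon that guards against division by zero is an assumed value.

```python
import math

# Hypothetical toy vectors with axes (royalty, gender).
vectors = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [0.1, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_scores(positive, negative, epsilon=1e-6):
    """Score every non-input word with the multiplicative combination."""
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine(vec, vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine(vec, vectors[n])) / 2.0
        scores[word] = numerator / (denominator + epsilon)
    return scores

scores = cosmul_scores(["king", "woman"], ["man"])
best = max(scores, key=scores.get)
```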

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Parameters
  • ws1 (list of str) – Sequence of words.

  • ws2 (list of str) – Sequence of words.

Returns

Similarities between ws1 and ws2.

Return type

numpy.ndarray

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, as proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.

Parameters
  • wa (str) – Word whose top-n most similar words are looked up.

  • wb (str) – Word whose relative cosine similarity with wa is evaluated.

  • topn (int, optional) – Number of top-n most similar words to consider with respect to wa.

Returns

Relative cosine similarity between wa and wb.

Return type

numpy.float64

save(*args, **kwargs)

Save KeyedVectors.

Parameters

fname (str) – Path to the output file.

See also

load()

Load saved model.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters
  • fname (str) – The file path used to save the vectors in

  • fvocab (str, optional) – Optional file path used to save the vocabulary

  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

  • total_vec (int, optional) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards).

similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters
  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters
  • word (str) – Word

  • topn (int or None, optional) – Number of top-N similar words to return. If topn is None, similar_by_word returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similarity(w1, w2)

Compute cosine similarity between two words.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Cosine similarity between w1 and w2.

Return type

float

similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Construct a term similarity matrix for computing Soft Cosine Measure.

This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.

  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in a decreasing order of importance of terms, or in the order of term identifiers if None.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix.

  • dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.

Returns

Term similarity matrix.

Return type

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()

The Soft Cosine Measure.

SoftCosineSimilarity

A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.

property syn0
property syn0norm
wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the following papers:

Parameters
  • document1 (list of str) – Input document.

  • document2 (list of str) – Input document.

Returns

Word Mover’s distance between document1 and document2.

Return type

float

Warning

This method only works if pyemd is installed.

If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.

Raises

ImportError

If pyemd isn’t installed.

word_vec(word, use_norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word

  • use_norm (bool, optional) – If True - resulting vector will be L2-normalized (unit euclidean length).

Returns

Vector representation of word.

Return type

numpy.ndarray

Raises

KeyError – If word not in vocabulary.

words_closer_than(w1, w2)

Get all words that are closer to w1 than w2 is to w1.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

List of words that are closer to w1 than w2 is to w1.

Return type

list (str)

property wv
class gensim.models.keyedvectors.WordEmbeddingSimilarityIndex(keyedvectors, threshold=0.0, exponent=2.0, kwargs=None)

Bases: gensim.similarities.termsim.TermSimilarityIndex

Computes cosine similarities between word embeddings and retrieves the closest word embeddings by cosine similarity for a given word embedding.

Parameters
  • keyedvectors (WordEmbeddingsKeyedVectors) – The word embeddings.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • kwargs (dict or None) – A dict with keyword arguments that will be passed to the keyedvectors.most_similar method when retrieving the word embeddings closest to a given word embedding.

See also

SparseTermSimilarityMatrix

Build a term similarity matrix and compute the Soft Cosine Measure.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return most similar terms for a given term along with the similarities.

Parameters
  • term (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)
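
How threshold and exponent shape the result can be pictured with the sketch below, which assumes that neighbours at or below the threshold are dropped and that the remaining similarities are raised to the given power. The candidate list and the function name are hypothetical.

```python
def filter_and_scale(candidates, threshold=0.0, exponent=2.0, topn=10):
    """Keep neighbours above threshold and raise their similarity to exponent."""
    kept = [
        (term, similarity ** exponent)
        for term, similarity in candidates
        if similarity > threshold
    ]
    return kept[:topn]

# Hypothetical neighbours of some term, sorted by descending cosine similarity.
candidates = [("great", 0.8), ("fine", 0.5), ("bad", -0.4)]
neighbours = filter_and_scale(candidates)
```

With the default threshold of 0.0, the negatively correlated neighbour is discarded and the remaining similarities shrink towards zero when squared.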

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.models.keyedvectors.WordEmbeddingsKeyedVectors(vector_size)

Bases: gensim.models.keyedvectors.BaseKeyedVectors

Class containing common methods for operations over word vectors.

accuracy(questions, restrict_vocab=30000, most_similar=<function WordEmbeddingsKeyedVectors.most_similar>, case_insensitive=True)

Compute accuracy of the model.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Parameters
  • questions (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • most_similar (function, optional) – Function used for similarity calculation.

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Returns

Full lists of correct and incorrect predictions divided by sections.

Return type

list of dict of (str, (str, str, str))

add(entities, weights, replace=False)

Append entities and their vectors in a manual way. If some entity is already in the vocabulary, the old vector is kept unless replace flag is True.

Parameters
  • entities (list of str) – Entities specified by string ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
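The replace semantics described above can be sketched in plain Python. This is only an illustration under stated assumptions, not gensim's implementation: a dict stands in for the KeyedVectors mapping, and add_entities is a hypothetical helper name.

```python
def add_entities(store, entities, weights, replace=False):
    """Add entity -> vector pairs to `store`, a plain dict standing in for KeyedVectors."""
    if len(entities) != len(weights):
        raise ValueError("entities and weights must have the same length")
    for entity, vector in zip(entities, weights):
        if entity in store and not replace:
            continue  # entity already known: keep the old vector
        store[entity] = list(vector)
    return store


# Toy data (hypothetical): "cat" already exists, "dog" is new.
store = {"cat": [1.0, 0.0]}
add_entities(store, ["cat", "dog"], [[0.0, 1.0], [0.5, 0.5]])
kept = store["cat"]        # unchanged, because replace=False

add_entities(store, ["cat"], [[0.0, 1.0]], replace=True)
replaced = store["cat"]    # overwritten, because replace=True
```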

closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).

  • vectors_all (numpy.ndarray) – For each row in vectors_all, the cosine similarity to vector_1 is computed, expected shape (num_vectors, dim).

Returns

Contains the cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type

numpy.ndarray
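As a rough illustration of the computation, the following pure-Python sketch returns one cosine similarity per row. It uses lists instead of numpy arrays so that it runs without dependencies; the real method operates on numpy arrays.

```python
import math


def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity between vector_1 and every row of vectors_all."""
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    sims = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(x * x for x in row))
        sims.append(dot / (norm_1 * norm_row))
    return sims


# One parallel, one orthogonal and one opposite row (toy values).
sims = cosine_similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
```

The result has one entry per row, matching the documented shape (num_vectors,).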

distance(w1, w2)

Compute cosine distance between two words. Calculate 1 - similarity().

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Distance between w1 and w2.

Return type

float

distances(word_or_vector, other_words=())

Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distances between word_or_vector and all words in vocab.

Parameters
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.

  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).

Returns

Array containing distances to all words in other_words from input word_or_vector.

Return type

numpy.array

Raises

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters

words (list of str) – List of words.

Returns

The word furthest away from the mean of all words.

Return type

str
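The idea of picking the word furthest from the mean can be sketched as follows. This is a minimal sketch, assuming unit-normalized vectors and cosine similarity to the mean; the toy vectors are invented for illustration and the function is not gensim's code.

```python
import math


def unit(v):
    """Scale a vector to unit euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Hypothetical 2D vectors: three "meal" words point one way, "cereal" another.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.0],
    "cereal": [0.1, 1.0],
}
odd = doesnt_match(toy, ["breakfast", "cereal", "dinner", "lunch"])
```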

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute performance of the model on an analogy test set.

This is a modern variant of accuracy(), see discussion on GitHub #1935.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.

Returns

  • score (float) – The overall evaluation score on the entire evaluation set

  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
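The analogy file layout described above (": SECTION NAME" lines followed by 4-tuples of words) can be read with a few lines of Python. The parser below is a hypothetical helper written for illustration, with uppercasing applied as described for case_insensitive; it is not the parsing code used by the library.

```python
def parse_analogies(lines, case_insensitive=True):
    """Group 4-word analogy lines under their ': SECTION NAME' headers."""
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            sections.append({"section": line[2:], "questions": []})
            continue
        if not sections:
            raise ValueError("analogy line found before any ': SECTION NAME' header")
        words = line.split()
        if len(words) != 4:
            continue  # skip malformed lines in this sketch
        if case_insensitive:
            words = [w.upper() for w in words]
        sections[-1]["questions"].append(tuple(words))
    return sections


# A tiny in-memory sample in the same layout as questions-words.txt.
sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "",
    ": family",
    "boy girl brother sister",
]
sections = parse_analogies(sample)
```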

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at:
  • http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html

  • https://www.cl.cam.ac.uk/~fh295/simlex.html

Parameters
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.

  • delimiter (str, optional) – Separator in pairs file.

  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - use a zero similarity for word pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.

Returns

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.

  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.

  • oov_ratio (float) – The ratio of pairs with unknown words.
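To make the returned quantities concrete, here is a dependency-free sketch of the Pearson and Spearman coefficients over (human score, model similarity) pairs, with out-of-vocabulary pairs skipped and counted. All data and helper names are hypothetical, p-values are omitted, and the OOV figure is computed here as a plain fraction, which is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented human judgments and model similarities; "zyzzyva" is out of vocabulary.
pairs = [("tiger", "cat", 9.0), ("car", "bus", 7.5), ("cup", "sky", 3.0),
         ("king", "fork", 1.0), ("zyzzyva", "cat", 5.0)]
model_sims = {("tiger", "cat"): 0.8, ("car", "bus"): 0.9,
              ("cup", "sky"): 0.3, ("king", "fork"): 0.1}

human, model, oov = [], [], 0
for w1, w2, score in pairs:
    if (w1, w2) not in model_sims:
        oov += 1  # skipped, as when dummy4unknown=False
        continue
    human.append(score)
    model.append(model_sims[(w1, w2)])

pearson_r = pearson(human, model)
spearman_r = spearman(human, model)
oov_fraction = oov / len(pairs)
```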

get_vector(word)

Get the entity’s representation in vector space, as a 1D numpy array.

Parameters

word (str) – Identifier of the entity (here, a word) to return the vector for.

Returns

Vector for the specified entity.

Return type

numpy.ndarray

Raises

KeyError – If the given entity identifier doesn’t exist.

property index2entity

init_sims(replace=False)

Precompute L2-normalized vectors.

Parameters

replace (bool, optional) – If True - forget the original vectors and only keep the normalized ones, which saves a lot of memory.

Warning

You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.
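L2 normalization itself is simple: each vector is divided by its euclidean length. The sketch below illustrates this on a plain dict of lists; the data and the helper name l2_normalize are made up for the example, and the real method works on the model's numpy arrays.

```python
import math


def l2_normalize(vectors):
    """Return a new mapping in which every vector has unit euclidean length."""
    normalized = {}
    for key, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        normalized[key] = [x / norm for x in vec]
    return normalized


raw = {"king": [3.0, 4.0], "queen": [0.0, 2.0]}
unit_vectors = l2_normalize(raw)   # `raw` is left untouched, like replace=False
```

Keeping only unit_vectors and discarding raw would correspond to replace=True: the original lengths cannot be recovered afterwards.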

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

static log_accuracy(section)

static log_evaluate_word_pairs(pearson, spearman, oov, pairs)

most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
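The description above (cosine similarity between a mean of the query vectors and every vocabulary vector) can be sketched in plain Python. This is a simplified illustration, assuming unit-normalized vectors, weights of +1 and -1, and exclusion of the query words from the result; the 3D toy vectors are invented.

```python
import math


def unit(v):
    """Scale a vector to unit euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def most_similar(vectors, positive=(), negative=(), topn=10):
    """Rank vocabulary words by cosine similarity to the signed mean of the query words."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    if not weighted:
        raise ValueError("at least one positive or negative word is required")
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += weight * x / len(weighted)
    mean = unit(mean)
    query_words = {w for w, _ in weighted}
    scored = [
        (w, sum(a * b for a, b in zip(unit(v), mean)))
        for w, v in vectors.items()
        if w not in query_words  # do not return the query words themselves
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical axes: (royalty, male, female).
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
top = most_similar(toy, positive=["king", "woman"], negative=["man"])
```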

most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
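A minimal sketch of the multiplicative objective follows: the product of similarities to the positive words is divided by the product of similarities to the negative words. Two details here are assumptions of the sketch rather than statements from the text above: cosines are shifted from [-1, 1] to [0, 1], and a small eps is added to the denominator to avoid division by zero. The toy vectors are invented.

```python
import math


def unit(v):
    """Scale a vector to unit euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosmul_scores(vectors, positive, negative, eps=1e-6):
    """Score every non-query word with a 3CosMul-style ratio of shifted cosines."""
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in positive or candidate in negative:
            continue
        cand_unit = unit(vec)

        def shifted(word):
            cos = sum(a * b for a, b in zip(cand_unit, unit(vectors[word])))
            return (1.0 + cos) / 2.0  # map [-1, 1] to [0, 1]

        numerator = 1.0
        for word in positive:
            numerator *= shifted(word)
        denominator = 1.0
        for word in negative:
            denominator *= shifted(word)
        scores[candidate] = numerator / (denominator + eps)
    return scores


# Hypothetical axes: (royalty, male, female).
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
scores = cosmul_scores(toy, positive=["king", "woman"], negative=["man"])
best = max(scores, key=scores.get)
```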

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Parameters
  • ws1 (list of str) – Sequence of words.

  • ws2 (list of str) – Sequence of words.

Returns

Similarities between ws1 and ws2.

Return type

numpy.ndarray
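One common way to compare two sets of words is to average the vectors in each set and take the cosine similarity of the two means. The sketch below assumes that approach and uses invented toy vectors; it returns a single float and is not meant to reproduce the library's exact return type.

```python
import math


def mean_vector(vectors, words):
    """Component-wise mean of the vectors for the given words."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def n_similarity(vectors, ws1, ws2):
    """Cosine similarity between the mean vectors of two word sets."""
    m1, m2 = mean_vector(vectors, ws1), mean_vector(vectors, ws2)
    dot = sum(a * b for a, b in zip(m1, m2))
    norm1 = math.sqrt(sum(a * a for a in m1))
    norm2 = math.sqrt(sum(b * b for b in m2))
    return dot / (norm1 * norm2)


toy = {
    "sushi": [0.9, 0.1],
    "shop": [0.4, 0.6],
    "japanese": [0.8, 0.2],
    "restaurant": [0.5, 0.5],
}
score = n_similarity(toy, ["sushi", "shop"], ["japanese", "restaurant"])
same = n_similarity(toy, ["sushi", "shop"], ["sushi", "shop"])
```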

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
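The relation between closer_than() and rank() can be sketched as follows: the rank of entity2 is one more than the number of entities that are closer to entity1 than entity2 is. The helpers and toy vectors below are illustrative assumptions, using cosine similarity on plain lists.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closer_than(vectors, entity1, entity2):
    """Entities more similar to entity1 than entity2 is (entity1 itself excluded)."""
    reference = cosine(vectors[entity1], vectors[entity2])
    return [
        e for e, v in vectors.items()
        if e != entity1 and cosine(vectors[entity1], v) > reference
    ]


def rank(vectors, entity1, entity2):
    """1-based rank of entity2 among all entities, ordered by closeness to entity1."""
    return len(closer_than(vectors, entity1, entity2)) + 1


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
closer = closer_than(toy, "a", "c")
rank_b, rank_c, rank_d = rank(toy, "a", "b"), rank(toy, "a", "c"), rank(toy, "a", "d")
```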

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate the relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10, then wa and wb are more similar than an arbitrary word pair.

Parameters
  • wa (str) – Word whose top-n most similar words are looked up.

  • wb (str) – Word whose relative cosine similarity with wa is evaluated.

  • topn (int, optional) – Number of top-n similar words to look up with respect to wa.

Returns

Relative cosine similarity between wa and wb.

Return type

numpy.float64
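As a sketch of the idea, the relative cosine similarity divides the cosine similarity of wa and wb by the sum of the cosine similarities between wa and its top-n most similar words. The code below is an illustration on invented toy vectors, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the summed cosines of wa's top-n nearest words."""
    top_sims = sorted(
        (cosine(vectors[wa], v) for w, v in vectors.items() if w != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(top_sims)


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
rcs_ab = relative_cosine_similarity(toy, "a", "b", topn=2)
rcs_ac = relative_cosine_similarity(toy, "a", "c", topn=2)
```

With topn=2 the two values sum to 1, because "b" and "c" are exactly the two nearest neighbours of "a" in this toy vocabulary.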

save(*args, **kwargs)

Save KeyedVectors.

Parameters

fname (str) – Path to the output file.

See also

load()

Load saved model.

similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters
  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters
  • word (str) – Word

  • topn (int or None, optional) – Number of top-N similar words to return. If topn is None, similar_by_word returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similarity(w1, w2)

Compute cosine similarity between two words.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

Cosine similarity between w1 and w2.

Return type

float

similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Construct a term similarity matrix for computing Soft Cosine Measure.

This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.

  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix.

  • dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.

Returns

Term similarity matrix.

Return type

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()

The Soft Cosine Measure.

SoftCosineSimilarity

A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.

property syn0

property syn0norm

wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the following papers:

Parameters
  • document1 (list of str) – Input document.

  • document2 (list of str) – Input document.

Returns

Word Mover’s distance between document1 and document2.

Return type

float

Warning

This method only works if pyemd is installed.

If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.

Raises

ImportError

If pyemd isn’t installed.

word_vec(word, use_norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word

  • use_norm (bool, optional) – If True - resulting vector will be L2-normalized (unit euclidean length).

Returns

Vector representation of word.

Return type

numpy.ndarray

Raises

KeyError – If word not in vocabulary.

words_closer_than(w1, w2)

Get all words that are closer to w1 than w2 is to w1.

Parameters
  • w1 (str) – Input word.

  • w2 (str) – Input word.

Returns

List of words that are closer to w1 than w2 is to w1.

Return type

list of str

property wv