models.keyedvectors – Store and query word vectors

This module implements word vectors, and more generally sets of vectors keyed by lookup tokens/ints, and various similarity look-ups.

Since trained word vectors are independent from the way they were trained (Word2Vec, FastText etc), they can be represented by a standalone structure, as implemented in this module.

The structure is called “KeyedVectors” and is essentially a mapping between keys and vectors. Each vector is identified by its lookup key, most often a short string token, so this is usually a mapping between {str => 1D numpy array}.

The key is, in the original motivating case, a word (so the mapping maps words to 1D vectors), but for some models, the key can also correspond to a document, a graph node etc.

(Because some applications may maintain their own integral identifiers, compact and contiguous starting at zero, this class also supports use of plain ints as keys – in that case using them as literal pointers to the position of the desired vector in the underlying array, and saving the overhead of a lookup map entry.)
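
The two kinds of key can be pictured with a small plain-Python sketch. The names index_to_key, key_to_index and vectors below are illustrative stand-ins chosen for this example, not a claim about the class internals: a string key goes through the lookup map, while an int key is used directly as a row position.

```python
# Toy stand-in for the key -> vector mapping; names are illustrative only.
index_to_key = ["cat", "dog", "fish"]                      # position -> key
key_to_index = {k: i for i, k in enumerate(index_to_key)}  # key -> position
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]             # row i belongs to index_to_key[i]


def lookup(key):
    """Return the vector for a str key (through the map) or an int key (used as a row position)."""
    if isinstance(key, int):
        return vectors[key]  # no lookup-map entry is needed for int keys
    return vectors[key_to_index[key]]


print(lookup("dog"))  # [0.3, 0.4]
print(lookup(1))      # [0.3, 0.4]
```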

Why use KeyedVectors instead of a full model?

  • continue training vectors – You need the full model to train or update vectors.

  • smaller objects – KeyedVectors are smaller and need less RAM, because they don’t need to store the model state that enables training.

  • save/load from native fasttext/word2vec format – Vectors exported by the Facebook and Google tools do not support further training, but you can still load them into KeyedVectors.

  • append new vectors – Add new-vector entries to the mapping dynamically.

  • concurrency – Thread-safe, allows concurrent vector queries.

  • shared RAM – Multiple processes can re-use the same data, keeping only a single copy in RAM using mmap.

  • fast load – Supports mmap to load data from disk instantaneously.

TL;DR: the main difference is that KeyedVectors do not support further training. On the other hand, by shedding the internal data structures necessary for training, KeyedVectors offer a smaller RAM footprint and a simpler interface.

How to obtain word vectors?

Train a full model, then access its model.wv property, which holds the standalone keyed vectors. For example, use the Word2Vec algorithm to train the vectors:

>>> from gensim.test.utils import lee_corpus_list
>>> from gensim.models import Word2Vec
>>>
>>> model = Word2Vec(lee_corpus_list, vector_size=24, epochs=100)
>>> word_vectors = model.wv

Persist the word vectors to disk with:

>>> from gensim.models import KeyedVectors
>>>
>>> word_vectors.save('vectors.kv')
>>> reloaded_word_vectors = KeyedVectors.load('vectors.kv')

A KeyedVectors instance can also be loaded from an existing file on disk in the original Google word2vec C format:

>>> from gensim.test.utils import datapath
>>>
>>> wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
>>> wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)  # C bin format

What can I do with word vectors?

You can perform various syntactic/semantic NLP word tasks with the trained vectors. Some of them are already built in:

>>> import gensim.downloader as api
>>>
>>> word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data
>>>
>>> # Check the "most similar words", using the default "cosine similarity" measure.
>>> result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
>>> most_similar_key, similarity = result[0]  # look at the first match
>>> print(f"{most_similar_key}: {similarity:.4f}")
queen: 0.7699
>>>
>>> # Use a different similarity measure: "cosmul".
>>> result = word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
>>> most_similar_key, similarity = result[0]  # look at the first match
>>> print(f"{most_similar_key}: {similarity:.4f}")
queen: 0.8965
>>>
>>> print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
cereal
>>>
>>> similarity = word_vectors.similarity('woman', 'man')
>>> similarity > 0.8
True
>>>
>>> result = word_vectors.similar_by_word("cat")
>>> most_similar_key, similarity = result[0]  # look at the first match
>>> print(f"{most_similar_key}: {similarity:.4f}")
dog: 0.8798
>>>
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>>
>>> distance = word_vectors.wmdistance(sentence_obama, sentence_president)
>>> print(f"{distance:.4f}")
3.4893
>>>
>>> distance = word_vectors.distance("media", "media")
>>> print(f"{distance:.1f}")
0.0
>>>
>>> similarity = word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
>>> print(f"{similarity:.4f}")
0.7067
>>>
>>> vector = word_vectors['computer']  # numpy vector of a word
>>> vector.shape
(100,)
>>>
>>> vector = word_vectors.get_vector('office', norm=True)
>>> vector.shape
(100,)

Correlation with human opinion on word similarity

>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

And on word analogies

>>> analogy_scores = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

and so on.

class gensim.models.keyedvectors.CompatVocab(**kwargs)

Bases: object

A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).

Retained for now to ease the loading of older models.

gensim.models.keyedvectors.Doc2VecKeyedVectors

alias of KeyedVectors

gensim.models.keyedvectors.EuclideanKeyedVectors

alias of KeyedVectors

class gensim.models.keyedvectors.KeyedVectors(vector_size, count=0, dtype=<class 'numpy.float32'>, mapfile_path=None)

Bases: SaveLoad

Mapping between keys (such as words) and vectors for Word2Vec and related models.

Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

To support the needs of specific models and other downstream uses, you can also set additional attributes via the set_vecattr() and get_vecattr() methods. Note that all such attributes under the same attr name must have compatible numpy types, as the type and storage array for such attributes are established the first time such an attr is set.

Parameters
  • vector_size (int) – Intended number of dimensions for all contained vectors.

  • count (int, optional) – If provided, vectors will be pre-allocated for at least this many vectors. (Otherwise they can be added later.)

  • dtype (type, optional) – Vector dimensions will default to np.float32 (AKA REAL in some Gensim code) unless another type is provided here.

  • mapfile_path (string, optional) – Currently unused.

__contains__(key)
__getitem__(key_or_keys)

Get vector representation of key_or_keys.

Parameters

key_or_keys ({str, list of str, int, list of int}) – Requested key or list-of-keys.

Returns

Vector representation for key_or_keys (1D if key_or_keys is a single key, otherwise 2D).

Return type

numpy.ndarray

__setitem__(keys, weights)

Add keys and their vectors in a manual way. If some key is already in the vocabulary, the old vector is replaced with the new one.

This method is an alias for add_vectors() with replace=True.

Parameters
  • keys ({str, int, list of (str or int)}) – keys specified by their string or int ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or 2D np.array of vectors.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

add_vector(key, vector)

Add one new vector at the given key, into existing slot if available.

Warning: using this repeatedly is inefficient, requiring a full reallocation & copy, if this instance hasn’t been preallocated to be ready for such incremental additions.

Parameters
  • key (str) – Key identifier of the added vector.

  • vector (numpy.ndarray) – 1D numpy array with the vector values.

Returns

Index of the newly added vector, so that self.vectors[result] == vector and self.index_to_key[result] == key.

Return type

int

add_vectors(keys, weights, extras=None, replace=False)

Append keys and their vectors in a manual way. If some key is already in the vocabulary, the old vector is kept unless replace flag is True.

Parameters
  • keys (list of (str or int)) – Keys specified by string or int ids.

  • weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.

  • replace (bool, optional) – Flag indicating whether to replace vectors for keys which already exist in the map; if True - replace vectors, otherwise - keep old vectors.
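
The effect of the replace flag can be sketched with a plain dict standing in for the key-to-vector mapping. This is a minimal illustration of the semantics described above, not the actual implementation, and the helper name add_vectors is reused here only for readability.

```python
def add_vectors(store, keys, weights, replace=False):
    """Add (key, vector) pairs to a dict; keep existing vectors unless replace is True."""
    for key, vec in zip(keys, weights):
        if key in store and not replace:
            continue  # key already present: keep the old vector
        store[key] = list(vec)


store = {"cat": [1.0, 0.0]}
add_vectors(store, ["cat", "dog"], [[9.0, 9.0], [0.0, 1.0]])
print(store)  # {'cat': [1.0, 0.0], 'dog': [0.0, 1.0]}

add_vectors(store, ["cat"], [[9.0, 9.0]], replace=True)
print(store["cat"])  # [9.0, 9.0]
```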

allocate_vecattrs(attrs=None, types=None)

Ensure arrays for given per-vector extra-attribute names & types exist, at right size.

The length of the index_to_key list is the canonical ‘intended size’ of KeyedVectors, even if other properties (such as the vectors array) haven’t yet been allocated or expanded. So this allocation targets that size.

closer_than(key1, key2)

Get all keys that are closer to key1 than key2 is to key1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).

  • vectors_all (numpy.ndarray) – For each row in vectors_all, similarity to vector_1 is computed, expected shape (num_vectors, dim).

Returns

Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type

numpy.ndarray
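
As a plain-Python illustration of the quantity involved (lists instead of numpy arrays, so this is a sketch of the formula rather than the library code), cosine similarity is the dot product divided by the product of the two L2 norms. The cosine distance used elsewhere in this class is 1 minus that value.

```python
import math


def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity between vector_1 and every row of vectors_all."""
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    sims = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(x * x for x in row))
        sims.append(dot / (norm_1 * norm_row))
    return sims


sims = cosine_similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
print(sims)  # [1.0, 0.0, -1.0]

distances = [1.0 - s for s in sims]  # cosine distance = 1 - cosine similarity
print(distances)  # [0.0, 1.0, 2.0]
```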

distance(w1, w2)

Compute cosine distance between two keys. Calculate 1 - similarity().

Parameters
  • w1 (str) – Input key.

  • w2 (str) – Input key.

Returns

Distance between w1 and w2.

Return type

float

distances(word_or_vector, other_words=())

Compute cosine distances from given word or vector to all words in other_words. If other_words is empty, return distance between word_or_vector and all words in vocab.

Parameters
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.

  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).

Returns

Array containing distances to all words in other_words from input word_or_vector.

Return type

numpy.array

Raises

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which key from the given list doesn’t go with the others?

Parameters

words (list of str) – List of keys.

Returns

The key furthest away from the mean of all keys.

Return type

str

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False, similarity_function='most_similar')

Compute performance of the model on an analogy test set.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.

  • similarity_function (str, optional) – Function name used for similarity calculation.

Returns

  • score (float) – The overall evaluation score on the entire evaluation set

  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
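
The file layout described for the analogies parameter can be read with a few lines of plain Python. The sketch below only parses the format (section lines starting with ": ", followed by 4-tuples of words); the sample lines are made up for illustration, and it assumes every 4-tuple appears after at least one section line.

```python
def parse_analogies(lines):
    """Group 4-tuples of words under the most recent ': SECTION NAME' line."""
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            sections.append({"section": line[2:], "tuples": []})
        else:
            a, b, c, expected = line.split()
            sections[-1]["tuples"].append((a, b, c, expected))
    return sections


# Made-up sample in the described layout.
sample = """: capital-common-countries
Athens Greece Berlin Germany
Athens Greece Paris France
: family
boy girl king queen
"""
sections = parse_analogies(sample.splitlines())
for section in sections:
    print(section["section"], len(section["tuples"]))
```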

evaluate_word_pairs(pairs, delimiter='\t', encoding='utf8', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at:

  • http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html

  • https://www.cl.cam.ac.uk/~fh295/simlex.html

Parameters
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.

  • delimiter (str, optional) – Separator in pairs file.

  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

  • dummy4unknown (bool, optional) – If True - produce zero similarities for pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.

Returns

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.

  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.

  • oov_ratio (float) – The ratio of pairs with unknown words.
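
To make the two returned coefficients concrete, here is a self-contained sketch that computes them for made-up numbers. It leaves out the p-values, and its rank helper assumes there are no tied values. The human and model lists are hypothetical data, not results from any dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions (1 = smallest); assumes no ties, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


human = [9.0, 7.5, 3.0, 1.0]      # made-up gold similarity judgments
model = [0.80, 0.85, 0.40, 0.10]  # made-up model cosine similarities
print(round(pearson(human, model), 4))
print(spearman(human, model))  # 0.8, because only the top two pairs swap ranks
```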

fill_norms(force=False)

Ensure per-vector norms are available.

Any code which modifies vectors should ensure the accompanying norms are either recalculated or set to None, to trigger a full recalculation later on request.

get_index(key, default=None)

Return the integer index (slot/position) where the given key’s vector is stored in the backing vectors array.

get_mean_vector(keys, weights=None, pre_normalize=True, post_normalize=False, ignore_missing=True)

Get the mean vector for a given list of keys.

Parameters
  • keys (list of (str or int or ndarray)) – Keys specified by string or int ids or numpy array.

  • weights (list of float or numpy.ndarray, optional) – 1D array of the same size as keys, specifying the weight for each key.

  • pre_normalize (bool, optional) – Flag indicating whether to normalize each keyvector before taking mean. If False, individual keyvector will not be normalized.

  • post_normalize (bool, optional) – Flag indicating whether to normalize the final mean vector. If True, the normalized mean vector will be returned.

  • ignore_missing (bool, optional) – If False, will raise error if a key doesn’t exist in vocabulary.

Returns

Mean vector for the list of keys.

Return type

numpy.ndarray

Raises
  • ValueError – If the size of the list of keys and weights doesn’t match.

  • KeyError – If any key doesn’t exist in the vocabulary and ignore_missing is False.
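
How pre_normalize and post_normalize interact is easiest to see on two tiny vectors. The following is a plain-Python sketch that takes vectors directly instead of keys. Dividing the weighted sum by the summed absolute weights is an assumption made for this example, and the helper names unit and mean_vector are hypothetical.

```python
import math


def unit(vec):
    """Scale vec to unit L2 length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_vector(vectors, weights=None, pre_normalize=True, post_normalize=False):
    """Weighted mean of vectors, with optional normalization before and after averaging."""
    if weights is None:
        weights = [1.0] * len(vectors)
    if len(weights) != len(vectors):
        raise ValueError("vectors and weights must have the same length")
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec, w in zip(vectors, weights):
        vec = unit(vec) if pre_normalize else vec
        for i in range(dim):
            total[i] += w * vec[i]
    weight_sum = sum(abs(w) for w in weights)  # assumption: divide by summed absolute weights
    mean = [x / weight_sum for x in total] if weight_sum > 0 else total
    return unit(mean) if post_normalize else mean


vecs = [[3.0, 0.0], [0.0, 4.0]]
print(mean_vector(vecs))                       # [0.5, 0.5]
print(mean_vector(vecs, pre_normalize=False))  # [1.5, 2.0]
```

With pre_normalize=True the longer vector no longer dominates the mean, which is why the two results differ.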

get_normed_vectors()

Get all embedding vectors normalized to unit L2 length (euclidean), as a 2D numpy array.

To see which key corresponds to which vector = which array row, refer to the index_to_key attribute.

Returns

2D numpy array of shape (number_of_keys, embedding dimensionality), L2-normalized along the rows (key vectors).

Return type

numpy.ndarray

get_vecattr(key, attr)

Get attribute value associated with given key.

Parameters
  • key (str) – Vector key for which to fetch the attribute value.

  • attr (str) – Name of the additional attribute to fetch for the given key.

Returns

Value of the additional attribute fetched for the given key.

Return type

object

get_vector(key, norm=False)

Get the key’s vector, as a 1D numpy array.

Parameters
  • key (str) – Key for vector to return.

  • norm (bool, optional) – If True, the resulting vector will be L2-normalized (unit Euclidean length).

Returns

Vector for the specified key.

Return type

numpy.ndarray

Raises

KeyError – If the given key doesn’t exist.

has_index_for(key)

Can this model return a single index for this key?

Subclasses that synthesize vectors for out-of-vocabulary words (like FastText) may respond True for a simple word in wv (__contains__()) check but False for this more-specific check.

property index2entity
property index2word
init_sims(replace=False)

Precompute data helpful for bulk similarity calculations.

fill_norms() now preferred for this purpose.

Parameters

replace (bool, optional) – If True - forget the original vectors and only keep the normalized ones.

Warning

You cannot sensibly continue training after doing a replace on a model’s internal KeyedVectors, and a replace is no longer necessary to save RAM. Do not use this method.

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')

Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

Parameters
  • fname (str) – The file path to load the vectors from.

  • lockf (float, optional) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.

  • binary (bool, optional) – If True, fname is in the binary word2vec C format.

  • encoding (str, optional) – Encoding of text for unicode function (python2 only).

  • unicode_errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)

Load KeyedVectors from a file in the original C word2vec-tool format.

Warning

The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

Parameters
  • fname (str) – The file path to the saved word2vec-format file.

  • fvocab (str, optional) – File path to the vocabulary. Word counts are read from the fvocab file, if set (this is the file generated by the -save-vocab flag of the original C tool).

  • binary (bool, optional) – If True, the data is in binary word2vec format.

  • encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.

  • unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.

  • limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.

  • datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.

  • no_header (bool, optional) – Default False means a usual word2vec-format file, with a 1st line declaring the count of following vectors & number of dimensions. If True, the file is assumed to lack a declaratory (vocab_size, vector_size) header and instead start with the 1st vector, and an extra reading-pass will be used to discover the number of vectors. Works only with binary=False.

Returns

Loaded model.

Return type

KeyedVectors
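
For orientation, the text variant of the format (binary=False) is a header line giving the vector count and dimensionality, followed by one line per key holding the key and its values. The sketch below parses that layout from an in-memory string and mimics the limit and no_header options. It is a simplified illustration that reads everything into memory, not the library loader, and the function name read_word2vec_text is hypothetical.

```python
import io


def read_word2vec_text(stream, limit=None, no_header=False):
    """Parse word2vec-style text: optional 'count dims' header, then 'key v1 v2 ...' lines."""
    lines = [line for line in stream if line.strip()]
    if no_header:
        # No declared sizes: discover them from the data itself.
        count, dims = len(lines), len(lines[0].split()) - 1
    else:
        count, dims = (int(x) for x in lines[0].split())
        lines = lines[1:]
    if limit is not None:
        count = min(count, limit)
    vectors = {}
    for line in lines[:count]:
        key, *values = line.split()
        if len(values) != dims:
            raise ValueError(f"expected {dims} values for {key!r}, got {len(values)}")
        vectors[key] = [float(v) for v in values]
    return vectors


text = "3 2\nking 0.5 0.1\nqueen 0.4 0.2\nman 0.9 0.0\n"
print(read_word2vec_text(io.StringIO(text), limit=2))
# {'king': [0.5, 0.1], 'queen': [0.4, 0.2]}
```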

static log_accuracy(section)
static log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, restrict_vocab=None, indexer=None)

Find the top-N most similar keys. Positive keys contribute positively towards the similarity, negative keys negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given keys and the vectors for each key in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters
  • positive (list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional) – List of keys that contribute positively. If tuple, second element specifies the weight (default 1.0)

  • negative (list of (str or int or ndarray) or list of ((str,float) or (int,float) or (ndarray,float)), optional) – List of keys that contribute negatively. If tuple, second element specifies the weight (default -1.0)

  • topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.

  • clip_start (int) – Start clipping index.

  • clip_end (int) – End clipping index.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.) If specified, overrides any values of clip_start or clip_end.

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
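
The computation described above can be sketched in plain Python on a handful of hypothetical toy vectors (hand-picked so the analogy works, not trained embeddings). The sketch averages the unit-normalized positive vectors and the negated negative vectors, then ranks the remaining keys by cosine similarity to that mean. Leaving the query keys out of the results is an assumption of this sketch.

```python
import math

# Hypothetical toy vectors; not trained embeddings.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def unit(vec):
    """Scale vec to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(positive, negative, topn=3):
    """Rank keys by cosine similarity to the signed mean of the unit-normalized query vectors."""
    signed = [(k, 1.0) for k in positive] + [(k, -1.0) for k in negative]
    dim = len(next(iter(VECTORS.values())))
    mean = [0.0] * dim
    for key, weight in signed:
        for i, x in enumerate(unit(VECTORS[key])):
            mean[i] += weight * x / len(signed)
    mean = unit(mean)
    scores = [
        (key, sum(a * b for a, b in zip(unit(vec), mean)))
        for key, vec in VECTORS.items()
        if key not in positive and key not in negative  # skip the query keys themselves
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


result = most_similar(positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```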

most_similar_cosmul(positive=None, negative=None, topn=10, restrict_vocab=None)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Allows calls like most_similar_cosmul('dog', 'cat'), as a shorthand for most_similar_cosmul(['dog'], ['cat']), where 'dog' is positive and 'cat' negative.

Parameters
  • positive (list of str, optional) – List of words that contribute positively.

  • negative (list of str, optional) – List of words that contribute negatively.

  • topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.

  • restrict_vocab (int or None, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 node vectors in the vocabulary order. This may be meaningful if vocabulary is sorted by descending frequency.

Returns

When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array
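
A plain-Python sketch of the multiplicative objective follows, again on hypothetical toy vectors. Two details are assumptions of this sketch: cosines are shifted from [-1, 1] to [0, 1] before multiplying, and a small EPSILON is added to the denominator to avoid division by zero.

```python
import math

# Hypothetical toy vectors; not trained embeddings.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

EPSILON = 1e-6  # assumed small constant guarding against division by zero


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_cosmul(positive, negative, topn=3):
    """Score = product of shifted cosines to positives / product of shifted cosines to negatives."""
    scores = []
    for key, vec in VECTORS.items():
        if key in positive or key in negative:
            continue  # skip the query keys themselves
        numerator = math.prod((1 + cosine(vec, VECTORS[p])) / 2 for p in positive)
        denominator = math.prod((1 + cosine(vec, VECTORS[n])) / 2 for n in negative)
        scores.append((key, numerator / (denominator + EPSILON)))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


result = most_similar_cosmul(positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```

Because the score is a ratio of products, it is not bounded by 1 the way a cosine similarity is.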

most_similar_to_given(key1, keys_list)

Get the key from keys_list most similar to key1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of keys.

Parameters
  • ws1 (list of str) – Sequence of keys.

  • ws2 (list of str) – Sequence of keys.

Returns

Similarities between ws1 and ws2.

Return type

numpy.ndarray

rank(key1, key2)

Rank of the distance of key2 from key1, in relation to distances of all keys from key1.
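
A minimal sketch of how such a rank relates to closer_than(), using hypothetical toy vectors: count the keys that are strictly more similar to key1 than key2 is, then add one. Excluding key1 itself from the count is an assumption of this sketch.

```python
import math

# Hypothetical toy vectors; not trained embeddings.
VECTORS = {"cat": [1.0, 0.1], "dog": [0.9, 0.3], "lion": [0.8, 0.6], "car": [0.1, 1.0]}


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closer_than(key1, key2):
    """Keys whose similarity to key1 exceeds that of key2 (key1 itself is skipped)."""
    threshold = cosine(VECTORS[key1], VECTORS[key2])
    return [k for k, v in VECTORS.items() if k != key1 and cosine(VECTORS[key1], v) > threshold]


def rank(key1, key2):
    """1-based rank of key2 among all keys ordered by closeness to key1."""
    return len(closer_than(key1, key2)) + 1


print(closer_than("cat", "lion"))  # ['dog']
print(rank("cat", "lion"))         # 2
```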

rank_by_centrality(words, use_norm=True)

Rank the given words by similarity to the centroid of all the words.

Parameters
  • words (list of str) – List of keys.

  • use_norm (bool, optional) – Whether to calculate centroid using unit-normed vectors; default True.

Returns

Ranked list of (similarity, key), most-similar to the centroid first.

Return type

list of (float, str)
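
The idea can be illustrated without any library: normalize the vectors, average them into a centroid, and sort by similarity to it. The vectors below are hypothetical and hand-picked so that one word clearly stands apart. The last entry of the ranking is the key furthest from the mean, which is the notion doesnt_match() relies on.

```python
import math

# Hypothetical toy vectors; not trained embeddings.
VECTORS = {
    "breakfast": [0.9, 0.2],
    "lunch":     [0.8, 0.3],
    "dinner":    [0.85, 0.25],
    "cereal":    [0.2, 0.9],
}


def unit(vec):
    """Scale vec to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_centrality(words):
    """Return (similarity, key) pairs, most similar to the centroid first."""
    normed = {w: unit(VECTORS[w]) for w in words}
    dim = len(next(iter(normed.values())))
    centroid = unit([sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)])
    scored = [(sum(a * b for a, b in zip(vec, centroid)), w) for w, vec in normed.items()]
    return sorted(scored, reverse=True)


ranked = rank_by_centrality(["breakfast", "cereal", "dinner", "lunch"])
print(ranked[-1][1])  # cereal
```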

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, as proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.

Parameters
  • wa (str) – Word whose top-n most similar words are looked up.

  • wb (str) – Word whose relative cosine similarity with wa is evaluated.

  • topn (int, optional) – Number of top-n similar words to look up with respect to wa.

Returns

Relative cosine similarity between wa and wb.

Return type

numpy.float64
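
In words, the relative cosine similarity divides the cosine similarity of wa and wb by the sum of the cosine similarities between wa and its topn nearest neighbours. A plain-Python sketch on hypothetical toy vectors:

```python
import math

# Hypothetical toy vectors; not trained embeddings.
VECTORS = {"cat": [1.0, 0.1], "dog": [0.9, 0.3], "lion": [0.8, 0.6], "car": [0.1, 1.0]}


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(wa, wb, topn=2):
    """cos(wa, wb) divided by the summed cosines of wa's topn nearest neighbours."""
    neighbour_sims = sorted(
        (cosine(VECTORS[wa], vec) for key, vec in VECTORS.items() if key != wa),
        reverse=True,
    )
    return cosine(VECTORS[wa], VECTORS[wb]) / sum(neighbour_sims[:topn])


rcs_dog = relative_cosine_similarity("cat", "dog")
rcs_lion = relative_cosine_similarity("cat", "lion")
print(round(rcs_dog, 4), round(rcs_lion, 4))
```

By construction, the values for the topn nearest neighbours of wa sum to 1, so a neighbour scoring well above 1/topn stands out from the rest.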

resize_vectors(seed=0)

Make underlying vectors match index_to_key size; random-initialize any new rows.

save(*args, **kwargs)

Save KeyedVectors to a file.

Parameters

fname (str) – Path to the output file.

See also

load()

Load a previously saved model.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters
  • fname (str) – File path to save the vectors to.

  • fvocab (str, optional) – File path to save additional vocabulary information to. None to not store the vocabulary.

  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.

  • total_vec (int, optional) – Explicitly specify total number of vectors (in case word vectors are appended with document vectors afterwards).

  • write_header (bool, optional) – If False, don’t write the 1st line declaring the count of vectors and dimensions. This is the format used by e.g. GloVe vectors.

  • prefix (str, optional) – String to prepend in front of each stored word. Default = no prefix.

  • append (bool, optional) – If set, open fname in ab mode instead of the default wb mode.

  • sort_attr (str, optional) – Sort the output vectors in descending order of this attribute. Default: most frequent keys first.

set_vecattr(key, attr, val)

Set attribute associated with the given key to value.

Parameters
  • key (str) – Store the attribute for this vector key.

  • attr (str) – Name of the additional attribute to store for the given key.

  • val (object) – Value of the additional attribute to store for the given key.

Returns

Return type

None

similar_by_key(key, topn=10, restrict_vocab=None)

Find the top-N most similar keys.

Parameters
  • key (str) – Key

  • topn (int or None, optional) – Number of top-N similar keys to return. If topn is None, similar_by_key returns the vector of similarity scores.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar keys by vector.

Parameters
  • vector (numpy.array) – Vector from which similarities are to be computed.

  • topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.

  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Returns

When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.

Return type

list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)

Compatibility alias for similar_by_key().

similarity(w1, w2)

Compute cosine similarity between two keys.

Parameters
  • w1 (str) – Input key.

  • w2 (str) – Input key.

Returns

Cosine similarity between w1 and w2.

Return type

float

similarity_unseen_docs(*args, **kwargs)
sort_by_descending_frequency()

Sort the vocabulary so the most frequent words have the lowest indexes.

unit_normalize_all()

Destructively scale all vectors to unit-length.

You cannot sensibly continue training after such a step.

vectors_for_all(keys: Iterable, allow_inference: bool = True, copy_vecattrs: bool = False) KeyedVectors

Produce vectors for all given keys as a new KeyedVectors object.

Notes

The keys will always be deduplicated. For optimal performance, you should not pass entire corpora to the method. Instead, you should construct a dictionary of unique words in your corpus:

>>> from collections import Counter
>>> import itertools
>>>
>>> from gensim.models import FastText
>>> from gensim.test.utils import datapath, common_texts
>>>
>>> model_corpus_file = datapath('lee_background.cor')  # train word vectors on some corpus
>>> model = FastText(corpus_file=model_corpus_file, vector_size=20, min_count=1)
>>> corpus = common_texts  # infer word vectors for words from another corpus
>>> word_counts = Counter(itertools.chain.from_iterable(corpus))  # count words in your corpus
>>> words_by_freq = (k for k, v in word_counts.most_common())
>>> word_vectors = model.wv.vectors_for_all(words_by_freq)  # create word-vectors for words in your corpus

Parameters
  • keys (iterable) – The keys that will be vectorized.

  • allow_inference (bool, optional) – In subclasses such as FastTextKeyedVectors, vectors for out-of-vocabulary keys (words) may be inferred. Default is True.

  • copy_vecattrs (bool, optional) – Additional attributes set via the KeyedVectors.set_vecattr() method will be preserved in the produced KeyedVectors object. Default is False. To ensure that all the produced vectors will have vector attributes assigned, you should set allow_inference=False.

Returns

keyedvectors – Vectors for all the given keys.

Return type

KeyedVectors

property vectors_norm
property vocab
wmdistance(document1, document2, norm=True)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the following papers:

Parameters
  • document1 (list of str) – Input document.

  • document2 (list of str) – Input document.

  • norm (boolean) – Normalize all word vectors to unit length before computing the distance? Defaults to True.

Returns

Word Mover’s distance between document1 and document2.

Return type

float

Warning

This method only works if POT is installed.

If one of the documents has no words that exist in the vocab, float('inf') (i.e. infinity) will be returned.

Raises

ImportError

If POT isn’t installed.

word_vec(*args, **kwargs)

Compatibility alias for get_vector(); must exist so subclass calls reach subclass get_vector().

words_closer_than(word1, word2)
gensim.models.keyedvectors.Vocab

alias of CompatVocab

gensim.models.keyedvectors.Word2VecKeyedVectors

alias of KeyedVectors

gensim.models.keyedvectors.load_word2vec_format(*args, **kwargs)

Alias for KeyedVectors.load_word2vec_format().

gensim.models.keyedvectors.prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=<class 'numpy.float32'>)

Return a numpy array of the given shape. Reuse the prior_vectors object or its values to the extent possible. Initialize new values randomly if requested.

gensim.models.keyedvectors.pseudorandom_weak_vector(size, seed_string=None, hashfxn=<built-in function hash>)

Get a random vector, derived deterministically from seed_string if supplied.

Useful for initializing KeyedVectors that will be the starting projection/input layers of _2Vec models.