models.wrappers.varembed – VarEmbed Word Embeddings
Python wrapper around the Varembed model. Original paper: “Morphological Priors for Probabilistic Neural Word Embeddings”.
Notes
This module makes it possible to obtain word vectors for out-of-vocabulary words with the Varembed model.
The wrapped model cannot be updated with new documents for online training.

class gensim.models.wrappers.varembed.VarEmbed
Bases: gensim.models.keyedvectors.KeyedVectors
Python wrapper using Varembed.
Warning
This is only a Python wrapper for Varembed; it can only load pre-trained models.
Mapping between keys (such as words) and vectors for Word2Vec and related models. Used to perform operations on the vectors such as vector lookup, distance, similarity etc.
To support the needs of specific models and other downstream uses, you can also set additional attributes via the set_vecattr() and get_vecattr() methods. Note that all such attributes under the same attr name must have compatible numpy types, as the type and storage array for such attributes is established the first time such an attr is set.
 Parameters
vector_size (int) – Intended number of dimensions for all contained vectors.
count (int, optional) – If provided, vectors will be preallocated for at least this many vectors. (Otherwise they can be added later.)
dtype (type, optional) – Vector dimensions will default to np.float32 (AKA REAL in some Gensim code) unless another type is provided here.
mapfile_path (string, optional) – FIXME: UNDER CONSTRUCTION / WILL CHANGE PRE4.0.0 PER #2955 / #2975.

add_morphemes_to_embeddings(morfessor_model, morpho_embeddings, morpho_to_ix)
Include morpheme embeddings into vectors.
 Parameters
morfessor_model (morfessor.baseline.BaselineModel) – Morfessor model.
morpho_embeddings (dict) – Pickle file containing morpheme embeddings.
morpho_to_ix (dict) – Mapping morpheme to index.

add_vector(key, vector)
Add one new vector at the given key, into an existing slot if available.
Warning: using this repeatedly is inefficient, requiring a full reallocation and copy, if this instance hasn’t been preallocated to be ready for such incremental additions.
 Parameters
key (str) – Key identifier of the added vector.
vector (numpy.ndarray) – 1D numpy array with the vector values.
 Returns
Index of the newly added vector, so that self.vectors[result] == vector and self.index_to_key[result] == key.
 Return type
int

add_vectors(keys, weights, extras=None, replace=False)
Append keys and their vectors in a manual way. If some key is already in the vocabulary, the old vector is kept unless the replace flag is True.
 Parameters
keys (list of (str or int)) – Keys specified by string or int ids.
weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.
replace (bool, optional) – Flag indicating whether to replace vectors for keys which already exist in the map; if True, replace vectors; otherwise, keep old vectors.

allocate_vecattrs(attrs=None, types=None)
Ensure arrays for the given per-vector extra-attribute names and types exist, at the right size.
The length of the index_to_key list is the canonical ‘intended size’ of KeyedVectors, even if other properties (such as the vectors array) haven’t yet been allocated or expanded. So this allocation targets that size.

closer_than(key1, key2)
Get all keys that are closer to key1 than key2 is to key1.

static cosine_similarities(vector_1, vectors_all)
Compute cosine similarities between one vector and a set of other vectors.
 Parameters
vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).
vectors_all (numpy.ndarray) – For each row in vectors_all, similarity to vector_1 is computed, expected shape (num_vectors, dim).
 Returns
Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).
 Return type
numpy.ndarray
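As a rough illustration of the computation described for cosine_similarities, the pure-Python sketch below compares one vector against each row of a list of vectors. The function name cosine_similarities_sketch and the sample data are assumptions made for this example; the library method itself operates on numpy arrays.

```python
import math

def cosine_similarities_sketch(vector_1, vectors_all):
    """Cosine similarity between vector_1 and every row of vectors_all."""
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    sims = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(x * x for x in row))
        sims.append(dot / (norm_1 * norm_row))
    return sims

# Hypothetical 2-dimensional vectors: parallel, orthogonal and diagonal rows.
sims = cosine_similarities_sketch([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
```

The result has one entry per row, matching the documented shape (num_vectors,).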

distance(w1, w2)
Compute cosine distance between two keys. Calculate 1 - similarity().
 Parameters
w1 (str) – Input key.
w2 (str) – Input key.
 Returns
Distance between w1 and w2.
 Return type
float
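The relationship between cosine distance and cosine similarity can be sketched in a few lines of plain Python. The helper names and the sample vectors are illustrative assumptions, not part of the library.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # Distance is defined as one minus the similarity.
    return 1.0 - cosine_similarity(u, v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Under this definition the distance ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction).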

distances(word_or_vector, other_words=())
Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in vocab.
 Parameters
word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.
other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).
 Returns
Array containing distances to all words in other_words from input word_or_vector.
 Return type
numpy.array
 Raises
KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)
Which key from the given list doesn’t go with the others?
 Parameters
words (list of str) – List of keys.
 Returns
The key farthest from the mean of all keys.
 Return type
str

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute performance of the model on an analogy test set.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
 Parameters
analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.
restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
case_insensitive (bool, optional) – If True, convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
dummy4unknown (bool, optional) – If True, produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
 Returns
score (float) – The overall evaluation score on the entire evaluation set
sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
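To make the expected layout of the analogies file concrete, here is a minimal sketch of a parser for it: lines starting with “: ” open a new section, and every other line holds four words. The function name parse_analogies, the "quads" key and the sample lines are assumptions for illustration; this is not the library's own loader.

```python
def parse_analogies(lines):
    """Group 4-tuples of words under the ': SECTION NAME' line that precedes them."""
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            sections.append({"section": line[2:], "quads": []})
            continue
        words = line.split()
        if len(words) != 4 or not sections:
            continue  # this sketch simply skips malformed lines
        sections[-1]["quads"].append(tuple(words))
    return sections

# Hypothetical excerpt in the documented format.
sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Berlin Germany",
    ": family",
    "boy girl brother sister",
]
sections = parse_analogies(sample)
```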

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at:
* http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
* https://www.cl.cam.ac.uk/~fh295/simlex.html
 Parameters
pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.
delimiter (str, optional) – Separator in pairs file.
restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
case_insensitive (bool, optional) – If True, convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
dummy4unknown (bool, optional) – If True, produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
 Returns
pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.
spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.
oov_ratio (float) – The ratio of pairs with unknown words.
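The two correlation coefficients returned by evaluate_word_pairs can be illustrated with a small pure-Python sketch. It computes only the coefficients (no p-values), and the helper names and the human and model scores are invented for this example.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    # Spearman's coefficient is Pearson's coefficient applied to the ranks.
    return pearson(ranks(xs), ranks(ys))

human = [9.0, 7.5, 3.0, 1.0]      # hypothetical human similarity judgments
model = [0.80, 0.90, 0.30, 0.10]  # hypothetical model similarities for the same pairs
r = pearson(human, model)
rho = spearman(human, model)
```

Here the model swaps the order of the two most similar pairs, so the rank correlation drops below 1 even though the linear correlation stays high.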

fill_norms(force=False)
Ensure per-vector norms are available.
Any code which modifies vectors should ensure the accompanying norms are either recalculated or set to ‘None’, to trigger a full recalculation later on request.

get_index(key, default=None)
Return the integer index (slot/position) where the given key’s vector is stored in the backing vectors array.

get_normed_vectors()
Get all embedding vectors normalized to unit L2 length (euclidean), as a 2D numpy array.
To see which key corresponds to which vector (that is, which array row), refer to the index_to_key attribute.
 Returns
2D numpy array of shape (number_of_keys, embedding dimensionality), L2-normalized along the rows (key vectors).
 Return type
numpy.ndarray
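Row-wise L2 normalization, as described for get_normed_vectors, can be sketched as follows. The plain-list representation, the function name and the two sample keys are assumptions for illustration; the library returns a numpy array.

```python
import math

def l2_normalize_rows(rows):
    """Scale every row to unit Euclidean length."""
    normed = []
    for row in rows:
        length = math.sqrt(sum(x * x for x in row))
        normed.append([x / length for x in row])
    return normed

index_to_key = ["cat", "dog"]          # row i belongs to index_to_key[i]
vectors = [[3.0, 4.0], [0.0, 2.0]]
normed = l2_normalize_rows(vectors)
```

Each row keeps its direction, and row order still matches index_to_key.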

get_vecattr(key, attr)
Get attribute value associated with given key.
 Parameters
key (str) – Vector key for which to fetch the attribute value.
attr (str) – Name of the additional attribute to fetch for the given key.
 Returns
Value of the additional attribute fetched for the given key.
 Return type
object

get_vector(key, norm=False)
Get the key’s vector, as a 1D numpy array.
 Parameters
key (str) – Key for vector to return.
norm (bool, optional) – If True, the resulting vector will be L2-normalized (unit Euclidean length).
 Returns
Vector for the specified key.
 Return type
numpy.ndarray
 Raises
KeyError – If the given key doesn’t exist.

has_index_for(key)
Can this model return a single index for this key?
Subclasses that synthesize vectors for out-of-vocabulary words (like FastText) may respond True for a simple word in wv (__contains__()) check but False for this more-specific check.

property index2entity

property index2word

init_sims(replace=False)
Precompute data helpful for bulk similarity calculations.
fill_norms() is now preferred for this purpose.
 Parameters
replace (bool, optional) – If True, forget the original vectors and only keep the normalized ones.
Warning
You cannot sensibly continue training after doing a replace on a model’s internal KeyedVectors, and a replace is no longer necessary to save RAM. Do not use this method.

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')
Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.
No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.
 Parameters
fname (str) – The file path to load the vectors from.
lockf (float, optional) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.
binary (bool, optional) – If True, fname is in the binary word2vec C format.
encoding (str, optional) – Encoding of text for unicode function (python2 only).
unicode_errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).

classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
 Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
 Returns
Object loaded from fname.
 Return type
object
 Raises
AttributeError – When called on an object instance instead of class (this is a class method).

classmethod load_varembed_format(vectors, morfessor_model=None)
Load the word vectors into matrix from the varembed output vector files.
 Parameters
vectors (dict) – Pickle file containing the word vectors.
morfessor_model (str, optional) – Path to the trained morfessor model.
 Returns
Ready to use instance.
 Return type

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)
Load the input-hidden weight matrix from the original C word2vec-tool format.
Warning
The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
 Parameters
fname (str) – The file path to the saved word2vec-format file.
fvocab (str, optional) – File path to the vocabulary. Word counts are read from the fvocab filename, if set (this is the file generated by the -save-vocab flag of the original C tool).
binary (bool, optional) – If True, the data is in binary word2vec format.
encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.
unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.
limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.
datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.
no_header (bool, optional) – Default False means a usual word2vec-format file, with a 1st line declaring the count of following vectors and number of dimensions. If True, the file is assumed to lack a declaratory (vocab_size, vector_size) header and instead start with the 1st vector, and an extra reading-pass will be used to discover the number of vectors. Works only with binary=False.
 Returns
Loaded model.
 Return type

load_word_embeddings(word_embeddings, word_to_ix)
Loads the word embeddings.
 Parameters
word_embeddings (numpy.ndarray) – Matrix with word embeddings.
word_to_ix (dict of (str, int)) – Mapping word to index.

static log_accuracy(section)

static log_evaluate_word_pairs(pearson, spearman, oov, pairs)

most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, restrict_vocab=None, indexer=None)
Find the top-N most similar keys. Positive keys contribute positively towards the similarity, negative keys negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given keys and the vectors for each key in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
 Parameters
positive (list of (str or int or ndarray), optional) – List of keys that contribute positively.
negative (list of (str or int or ndarray), optional) – List of keys that contribute negatively.
topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.
clip_start (int) – Start clipping index.
clip_end (int) – End clipping index.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.) If specified, overrides any values of clip_start or clip_end.
 Returns
When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.
 Return type
list of (str, float) or numpy.array
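The additive query described for most_similar can be sketched in plain Python: unit-normalize the vectors, add the positive keys, subtract the negative keys, and rank the remaining keys by cosine similarity to the result. The toy three-dimensional embedding, the function names and the choice to exclude the query keys from the result are assumptions of this sketch, which ignores clip_start, clip_end, restrict_vocab and indexer.

```python
import math

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def most_similar_sketch(vectors, positive, negative=(), topn=10):
    normed = {key: unit(vec) for key, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    query = [0.0] * dim
    for key in positive:
        query = [q + x for q, x in zip(query, normed[key])]
    for key in negative:
        query = [q - x for q, x in zip(query, normed[key])]
    # Summing instead of averaging is fine here: the scale cancels on normalization.
    query = unit(query)
    inputs = set(positive) | set(negative)
    scored = [
        (key, sum(a * b for a, b in zip(query, vec)))
        for key, vec in normed.items()
        if key not in inputs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy embedding, chosen so that king - man + woman lands near queen.
toy = {
    "king": [0.9, 0.8, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.5, 0.5, 0.5],
}
result = most_similar_sketch(toy, positive=["king", "woman"], negative=["man"])
```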

most_similar_cosmul(positive=None, negative=None, topn=10)
Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg in “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator, respectively: a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().
 Parameters
positive (list of str, optional) – List of words that contribute positively.
negative (list of str, optional) – List of words that contribute negatively.
topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.
 Returns
When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.
 Return type
list of (str, float) or numpy.array
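A minimal sketch of the multiplicative objective follows. It assumes that each cosine similarity is shifted into the range [0, 1] via (1 + cos) / 2 and that a small epsilon guards the denominator; both details, along with the function names and the toy embedding, are assumptions of this example rather than a statement of the library's exact implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosmul_sketch(vectors, positive, negative, topn=10, eps=1e-6):
    inputs = set(positive) | set(negative)
    scored = []
    for key, vec in vectors.items():
        if key in inputs:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine(vec, vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine(vec, vectors[n])) / 2.0
        scored.append((key, numerator / (denominator + eps)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy embedding.
toy = {
    "king": [0.9, 0.8, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.5, 0.5, 0.5],
}
result = cosmul_sketch(toy, positive=["king", "woman"], negative=["man"])
```

Because the similarities are multiplied and divided rather than added and subtracted, a candidate must be reasonably close to every positive example to score well.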

most_similar_to_given(key1, keys_list)
Get the key from keys_list most similar to key1.

n_similarity(ws1, ws2)
Compute cosine similarity between two sets of keys.
 Parameters
ws1 (list of str) – Sequence of keys.
ws2 (list of str) – Sequence of keys.
 Returns
Similarities between ws1 and ws2.
 Return type
numpy.ndarray
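One common way to compare two sets of keys is to take the cosine similarity between the mean vectors of each set. The sketch below does exactly that, under the assumption that this is the intended computation; the function names and the toy embedding are illustrative only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def n_similarity_sketch(embedding, ws1, ws2):
    mean_1 = mean_vector([embedding[w] for w in ws1])
    mean_2 = mean_vector([embedding[w] for w in ws2])
    return cosine(mean_1, mean_2)

# Hypothetical toy embedding.
toy = {
    "sushi": [1.0, 0.0],
    "shop": [0.0, 1.0],
    "japanese": [2.0, 0.0],
    "restaurant": [0.0, 2.0],
}
mixed = n_similarity_sketch(toy, ["sushi", "shop"], ["japanese", "restaurant"])
split = n_similarity_sketch(toy, ["sushi", "japanese"], ["shop", "restaurant"])
```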

rank(key1, key2)
Rank of the distance of key2 from key1, in relation to distances of all keys from key1.
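The idea of a rank can be sketched as one plus the number of keys that are closer to key1 than key2 is. That counting rule, the helper names and the toy embedding below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closer_than_sketch(embedding, key1, key2):
    """Keys whose similarity to key1 exceeds the similarity of key2 to key1."""
    threshold = cosine(embedding[key1], embedding[key2])
    return [
        key for key, vec in embedding.items()
        if key != key1 and cosine(embedding[key1], vec) > threshold
    ]

def rank_sketch(embedding, key1, key2):
    return len(closer_than_sketch(embedding, key1, key2)) + 1

# Hypothetical toy embedding.
toy = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "dog": [0.7, 0.5], "car": [0.0, 1.0]}
closer = closer_than_sketch(toy, "cat", "dog")
dog_rank = rank_sketch(toy, "cat", "dog")
```

With this toy data, only "kitten" is closer to "cat" than "dog" is, so "dog" has rank 2.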

rank_by_centrality(words, use_norm=True)
Rank the given words by similarity to the centroid of all the words.
 Parameters
words (list of str) – List of keys.
use_norm (bool, optional) – Whether to calculate centroid using unit-normed vectors; default True.
 Returns
Ranked list of (similarity, key), most-similar to the centroid first.
 Return type
list of (float, str)
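A small sketch of ranking by centrality: compute the centroid of the (optionally unit-normed) vectors, normalize it, and sort the words by their dot product with it. Normalizing the centroid, the function names and the toy embedding are assumptions of this example.

```python
import math

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def rank_by_centrality_sketch(embedding, words, use_norm=True):
    vecs = {w: (unit(embedding[w]) if use_norm else embedding[w]) for w in words}
    dim = len(next(iter(vecs.values())))
    centroid = unit([sum(v[i] for v in vecs.values()) / len(vecs) for i in range(dim)])
    scored = [(sum(a * b for a, b in zip(vecs[w], centroid)), w) for w in words]
    return sorted(scored, reverse=True)

# Hypothetical toy embedding in which "cereal" points away from the meal words.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.0],
    "cereal": [0.1, 1.0],
}
ranked = rank_by_centrality_sketch(toy, ["breakfast", "lunch", "dinner", "cereal"])
```

The last entry of the ranked list is the word farthest from the centroid, which is the natural candidate for an odd-one-out query.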

relative_cosine_similarity(wa, wb, topn=10)
Compute the relative cosine similarity between two words given the topn most similar words, as proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.
 Parameters
wa (str) – Word for which the topn most similar words are looked up.
wb (str) – Word whose relative cosine similarity with wa is evaluated.
topn (int, optional) – Number of most similar words to consider with respect to wa.
 Returns
Relative cosine similarity between wa and wb.
 Return type
numpy.float64
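Relative cosine similarity divides the cosine similarity of (wa, wb) by the sum of the cosine similarities between wa and its topn nearest neighbours. The sketch below follows that definition on a toy embedding; the function names and data are assumptions made for this example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_cosine_similarity_sketch(embedding, wa, wb, topn=10):
    # Similarities of wa to every other key, largest first, cut to the topn neighbours.
    neighbour_sims = sorted(
        (cosine(embedding[wa], vec) for key, vec in embedding.items() if key != wa),
        reverse=True,
    )[:topn]
    return cosine(embedding[wa], embedding[wb]) / sum(neighbour_sims)

# Hypothetical toy embedding.
toy = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "dog": [0.7, 0.5], "car": [0.0, 1.0]}
rcs_kitten = relative_cosine_similarity_sketch(toy, "cat", "kitten", topn=2)
rcs_dog = relative_cosine_similarity_sketch(toy, "cat", "dog", topn=2)
```

When wb ranges over exactly the topn neighbours of wa, the relative similarities sum to 1, which is what makes a fixed threshold such as 0.10 for topn=10 meaningful.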

resize_vectors(seed=0)
Make underlying vectors match index_to_key size; random-initialize any new rows.

save(*args, **kwargs)
Save KeyedVectors to a file.
 Parameters
fname (str) – Path to the output file.
See also
load()
Load a previously saved model.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
 Parameters
fname (str) – File path to save the vectors to.
fvocab (str, optional) – File path to save additional vocabulary information to. None to not store the vocabulary.
binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
total_vec (int, optional) – Explicitly specify total number of vectors (in case word vectors are appended with document vectors afterwards).
write_header (bool, optional) – If False, don’t write the 1st line declaring the count of vectors and dimensions. This is the format used by e.g. GloVe vectors.
prefix (str, optional) – String to prepend in front of each stored word. Default = no prefix.
append (bool, optional) – If set, open fname in ab mode instead of the default wb mode.
sort_attr (str, optional) – Sort the output vectors in descending order of this attribute. Default: most frequent keys first.
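The plain-text layout (binary=False) is simple enough to sketch directly: an optional header line with the vector count and dimensionality, followed by one line per key holding the key and its space-separated values. The writer below is an illustrative stand-in with hypothetical names; it does not cover sorting by sort_attr, the vocabulary file, or the binary format.

```python
import io

def write_word2vec_text(vectors, out, write_header=True, prefix=""):
    """Write vectors (dict of key -> list of floats) in word2vec text layout."""
    if write_header:
        dim = len(next(iter(vectors.values())))
        out.write(f"{len(vectors)} {dim}\n")
    for key, vec in vectors.items():
        out.write(prefix + key + " " + " ".join(repr(x) for x in vec) + "\n")

toy = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}  # hypothetical vectors

with_header = io.StringIO()
write_word2vec_text(toy, with_header)

headerless = io.StringIO()
write_word2vec_text(toy, headerless, write_header=False, prefix="w_")
```

The headerless variant starts directly with the first vector, which corresponds to the layout described for write_header=False.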

set_vecattr(key, attr, val)
Set attribute associated with the given key to value.
 Parameters
key (str) – Store the attribute for this vector key.
attr (str) – Name of the additional attribute to store for the given key.
val (object) – Value of the additional attribute to store for the given key.
 Returns
 Return type
None

similar_by_key(key, topn=10, restrict_vocab=None)
Find the top-N most similar keys.
 Parameters
key (str) – Key
topn (int or None, optional) – Number of top-N similar keys to return. If topn is None, similar_by_key returns the vector of similarity scores.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
 Returns
When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.
 Return type
list of (str, float) or numpy.array

similar_by_vector(vector, topn=10, restrict_vocab=None)
Find the top-N most similar keys by vector.
 Parameters
vector (numpy.array) – Vector from which similarities are to be computed.
topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
 Returns
When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.
 Return type
list of (str, float) or numpy.array

similar_by_word(word, topn=10, restrict_vocab=None)
Compatibility alias for similar_by_key().

similarity(w1, w2)
Compute cosine similarity between two keys.
 Parameters
w1 (str) – Input key.
w2 (str) – Input key.
 Returns
Cosine similarity between w1 and w2.
 Return type
float

similarity_unseen_docs(*args, **kwargs)

sort_by_descending_frequency()
Sort the vocabulary so the most frequent words have the lowest indexes.

unit_normalize_all()
Destructively scale all vectors to unit-length.
(You cannot sensibly continue training after such a step.)

property vectors_norm

property vocab

wmdistance(document1, document2)
Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
Ofir Pele and Michael Werman “A linear time histogram metric for improved SIFT matching”
Ofir Pele and Michael Werman “Fast and robust earth mover’s distances”
Matt Kusner et al. “From Word Embeddings To Document Distances”.
 Parameters
document1 (list of str) – Input document.
document2 (list of str) – Input document.
 Returns
Word Mover’s distance between document1 and document2.
 Return type
float
Warning
This method only works if pyemd is installed.
If one of the documents has no words that exist in the vocab, float(‘inf’) (i.e. infinity) will be returned.
 Raises
ImportError – If pyemd isn’t installed.

word_vec(*args, **kwargs)
Compatibility alias for get_vector(); must exist so subclass calls reach subclass get_vector().

words_closer_than(word1, word2)