models.keyedvectors – Store and query word vectors

Word vector storage and similarity look-ups. Common code independent of the way the vectors are trained (Word2Vec, FastText, WordRank, VarEmbed etc.).

The word vectors are considered read-only in this class.

Initialize the vectors by training e.g. Word2Vec:

>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
>>> word_vectors = model.wv

Persist the word vectors to disk with:

>>> word_vectors.save(fname)
>>> word_vectors = KeyedVectors.load(fname)

The vectors can also be instantiated from an existing file on disk, in Google's original word2vec C format, as a KeyedVectors instance:

>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format

You can perform various syntactic/semantic NLP word tasks with the vectors. Some of them are already built-in:

>>> word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

>>> word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[('queen', 0.71382287), ...]

>>> word_vectors.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

>>> word_vectors.similarity('woman', 'man')
0.73723527

Correlation with human opinion on word similarity:

>>> word_vectors.evaluate_word_pairs(os.path.join(module_path, 'test_data','wordsim353.tsv'))
0.51, 0.62, 0.13

And on analogies:

>>> word_vectors.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))

and so on.

class gensim.models.keyedvectors.BaseKeyedVectors(vector_size)

Bases: gensim.utils.SaveLoad

closer_than(entity1, entity2)

Returns all entities that are closer to entity1 than entity2 is to entity1.
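
A minimal usage sketch (the words and result below are illustrative and assume a trained word_vectors instance whose vocabulary contains them):

>>> word_vectors.closer_than('cat', 'dog')  # entities nearer to 'cat' than 'dog' is
['kitten', 'feline']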

distance(entity1, entity2)

Compute distance between vectors of two input entities, specified by string tag.

distances(entity1, other_entities=())

Compute distances from a given entity (string tag) to all entities in other_entities. If other_entities is empty, return the distance between entity1 and all entities in the vocab.
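
For illustration (assuming the trained word_vectors instance from above; the exact values depend on the model):

>>> word_vectors.distances('woman', ['man', 'queen'])  # cosine distances, in the order given
array([0.26, 0.34], dtype=float32)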

get_vector(entity)

Accept a single entity as input, specified by string tag. Returns the entity's representation in vector space, as a 1D numpy array.
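
For example, with the 100-dimensional model trained above (only the shape is shown; the values depend on training):

>>> vec = word_vectors.get_vector('office')
>>> vec.shape
(100,)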

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).
most_similar(**kwargs)

Find the top-N most similar entities. Positive and negative lists of entities may be passed via **kwargs.

most_similar_to_given(entity1, entities_list)

Return the entity from entities_list most similar to entity1.
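
Illustrative sketch (assuming all listed words are in the vocabulary; the result depends on the trained vectors):

>>> word_vectors.most_similar_to_given('king', ['dinner', 'queen', 'cat'])
'queen'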

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
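
Illustrative sketch: a rank of 1 means entity2 is the closest entity to entity1 (actual ranks depend on the trained vectors):

>>> word_vectors.rank('king', 'queen')
1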

save(fname_or_handle, **kwargs)

Save the object to file.

Parameters:
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
  • separately (list of str or None, optional) – If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently. If a list of str, these attributes will be stored in separate files; the automatic check is not performed in this case.
  • sep_limit (int) – Limit for automatic separation.
  • ignore (frozenset of str) – Attributes that shouldn’t be serialized/stored.
  • pickle_protocol (int) – Protocol number for pickle.

See also

load()

similarity(entity1, entity2)

Compute cosine similarity between entities, specified by string tag.

class gensim.models.keyedvectors.Doc2VecKeyedVectors(vector_size, mapfile_path)

Bases: gensim.models.keyedvectors.BaseKeyedVectors

closer_than(entity1, entity2)

Returns all entities that are closer to entity1 than entity2 is to entity1.

distance(d1, d2)

Compute cosine distance between two documents.

distances(d1, other_docs=())

Compute distances from given document (string tag or int index) to all documents in other_docs. If other_docs is empty, return distance between d1 and all documents seen during training.

doctag_syn0
doctag_syn0norm
doesnt_match(docs)

Which doc from the given list doesn’t go with the others?

(TODO: Accept vectors of out-of-training-set docs, as if from inference.)

Parameters:docs – List of seen documents specified by their corresponding string tags or integer indices.
Returns:The document furthest from the mean of all the documents.
Return type:str or int
get_vector(entity)

Accept a single entity as input, specified by string tag. Returns the entity's representation in vector space, as a 1D numpy array.

index2entity
index_to_doctag(i_index)

Return string key for given i_index, if available. Otherwise return raw int doctag (same int).

init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones, which saves lots of memory.

Note that you cannot continue training or inference after doing a replace. The model becomes effectively read-only: you can call most_similar, similarity etc., but not train or infer_vector.

int_index(index, doctags, max_rawint)

Return int index for either string or int index

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).
most_similar(positive=None, negative=None, topn=10, clip_start=0, clip_end=None, indexer=None)

Find the top-N most similar docvecs known from training. Positive docs contribute positively towards the similarity, negative docs negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given docs. Docs may be specified as vectors, integer indexes of trained docvecs, or if the documents were originally presented with string tags, by the corresponding tags.

The ‘clip_start’ and ‘clip_end’ allow limiting results to a particular contiguous range of the underlying vectors_docs_norm vectors. (This may be useful if the ordering there was chosen to be significant, such as more popular tag IDs in lower indexes.)

Parameters:
  • positive – List of docs specified as vectors, integer indexes of trained docvecs, or string tags that contribute positively.
  • negative – List of docs specified as vectors, integer indexes of trained docvecs, or string tags that contribute negatively.
  • topn (int) – Number of top-N similar docvecs to return.
  • clip_start (int) – Start clipping index.
  • clip_end (int) – End clipping index.
Returns:

Returns a list of tuples (doc, similarity)

Return type:

list of tuple
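
A hedged sketch, assuming model is a trained Doc2Vec instance whose documents were tagged 'doc_0', 'doc_1', ... during training (the tags and similarity values shown are illustrative):

>>> model.docvecs.most_similar(positive=['doc_0'], topn=3)
[('doc_42', 0.71), ('doc_7', 0.69), ('doc_13', 0.66)]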

most_similar_to_given(entity1, entities_list)

Return the entity from entities_list most similar to entity1.

n_similarity(ds1, ds2)

Compute cosine similarity between two sets of docvecs from the trained set, specified by int index or string tag. (TODO: Accept vectors of out-of-training-set docs, as if from inference.)

Parameters:
  • ds1 – Specify the first set of documents as a list of their integer indices or string tags.
  • ds2 – Specify the second set of documents as a list of their integer indices or string tags.
Returns:

The cosine similarity between the means of the documents in each of the two sets.

Return type:

float
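
For illustration (the tags and the returned value are hypothetical; model is assumed to be a trained Doc2Vec instance):

>>> model.docvecs.n_similarity(['doc_0', 'doc_1'], ['doc_2', 'doc_3'])
0.61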

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

save(*args, **kwargs)

Saves the keyedvectors. This saved model can be loaded again using load() which supports operations on trained document vectors like most_similar.

Parameters:fname (str) – Path to the file.
save_word2vec_format(fname, prefix='*dt_', fvocab=None, total_vec=None, binary=False, write_first_line=True)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters:
  • fname (str) – The file path used to save the vectors in.
  • prefix (str) – Uniquely identifies doctags from word vocab, and avoids collision in case of repeated string in doctag and word vocab.
  • fvocab (str) – Optional file path used to save the vocabulary.
  • binary (bool) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
  • total_vec (int) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards)
  • write_first_line (bool) – Whether to print the first line in the file. Useful when saving doc-vectors after word-vectors.
similarity(d1, d2)

Compute cosine similarity between two docvecs in the trained set, specified by int index or string tag. (TODO: Accept vectors of out-of-training-set docs, as if from inference.)

Parameters:
  • d1 (int or str) – Indicate the first document by its string tag or integer index.
  • d2 (int or str) – Indicate the second document by its string tag or integer index.
Returns:

The cosine similarity between the vectors of the two documents.

Return type:

float

similarity_unseen_docs(model, doc_words1, doc_words2, alpha=0.1, min_alpha=0.0001, steps=5)

Compute cosine similarity between two out-of-training documents, whose vectors are inferred after bulk training.

Parameters:
  • model – An instance of a trained Doc2Vec model.
  • doc_words1 – The first document. Document should be a list of (word) tokens.
  • doc_words2 – The second document. Document should be a list of (word) tokens.
  • alpha (float) – The initial learning rate.
  • min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
  • steps (int) – Number of times to train the new document.
Returns:

The cosine similarity between the unseen documents.

Return type:

float
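
A hedged sketch, assuming model is a trained Doc2Vec instance (the token lists and the resulting score are illustrative):

>>> doc1 = 'the cat sat on the mat'.split()
>>> doc2 = 'a cat rested on a rug'.split()
>>> model.docvecs.similarity_unseen_docs(model, doc1, doc2)
0.73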

class gensim.models.keyedvectors.FastTextKeyedVectors(vector_size, min_n, max_n)

Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors

Class to contain vectors and vocab for the FastText training class and other methods not directly involved in training such as most_similar()

accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>, case_insensitive=True)

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See questions-words.txt in https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip for an example.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word not in the first restrict_vocab words (default 30,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. In case case_insensitive is True, the first restrict_vocab words are taken first, and then case normalization is performed.

Use case_insensitive to convert all words in questions and vocab to their uppercase form before evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens and question words. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

This method corresponds to the compute-accuracy script of the original C word2vec.
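
Typical usage (the file path is illustrative; each returned entry is assumed to be a dict holding the section name plus the correctly and incorrectly answered analogies, ending with an aggregate 'total' section):

>>> sections = word_vectors.accuracy('questions-words.txt')
>>> sections[-1]['section']
'total'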

closer_than(entity1, entity2)

Returns all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Return cosine similarities between one vector and a set of other vectors.

Parameters:
  • vector_1 (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • vectors_all (numpy.array) – for each row in vectors_all, distance from vector_1 is computed. expected shape (num_vectors, dim)
Returns:

Contains cosine similarities between vector_1 and each row in vectors_all. Shape (num_vectors,).

Return type:

numpy.array
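
A small self-contained sketch (the query vector is passed as the first argument; with these toy vectors the similarities are exactly 1 and 0):

>>> import numpy as np
>>> query = np.array([1.0, 0.0])
>>> others = np.array([[1.0, 0.0], [0.0, 1.0]])
>>> word_vectors.cosine_similarities(query, others)
array([1., 0.])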

distance(w1, w2)

Compute cosine distance between two words.

Examples

>>> trained_model.distance('woman', 'man')
0.34
>>> trained_model.distance('woman', 'woman')
0.0
distances(word_or_vector, other_words=())

Compute cosine distances from a given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in the vocab.

Parameters:
  • word_or_vector (str or numpy.array) – Word or vector from which distances are to be computed.
  • other_words (iterable(str) or None) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).
Returns:

Array containing distances to all words in other_words from input word_or_vector, in the same order as other_words.

Return type:

numpy.array

Notes

Raises KeyError if either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters:words – List of words
Returns:The word furthest from the mean of all words.
Return type:str

Example

>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments. pairs is a filename of a dataset where lines are 3-tuples, each consisting of a word pair and a similarity value, separated by delimiter. An example dataset is included in Gensim (test/test_data/wordsim353.tsv). More datasets can be found at http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html or https://www.cl.cam.ac.uk/~fh295/simlex.html.

The model is evaluated using Pearson correlation coefficient and Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself. The results are printed to log and returned as a triple (pearson, spearman, ratio of pairs with unknown words).

Use restrict_vocab to ignore all word pairs containing a word not in the first restrict_vocab words (default 300,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. If case_insensitive is True, the first restrict_vocab words are taken, and then case normalization is performed.

Use case_insensitive to convert all words in the pairs and vocab to their uppercase form before evaluating the model (default True). Useful when you expect case-mismatch between training tokens and word pairs in the dataset. If there are multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Use dummy4unknown=True to produce zero-valued similarities for pairs with out-of-vocabulary words. Otherwise (default False), these pairs are skipped entirely.

get_vector(word)

Accept a single entity as input, specified by string tag. Returns the entity's representation in vector space, as a 1D numpy array.

index2entity
init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones, which saves lots of memory.

Note that you cannot continue training after doing a replace. The model becomes effectively read-only: you can only call most_similar, similarity etc.

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).
static log_accuracy()
static log_evaluate_word_pairs(spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters:
  • positive – List of words that contribute positively.
  • negative – List of words that contribute negatively.
  • topn (int) – Number of top-N similar words to return.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple

Examples

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.

In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively – a potentially sensible but untested extension of the method. (With a single positive example, rankings will be the same as in the default most_similar.)

Example:

>>> trained_model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
[(u'iraq', 0.8488819003105164), ...]
most_similar_to_given(entity1, entities_list)

Return the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Examples

>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689
>>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
1.0000000000000004
>>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
True
rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

save(*args, **kwargs)

Saves the keyedvectors. This saved model can be loaded again using load() which supports getting vectors for out-of-vocabulary words.

Parameters:fname (str) – Path to the file.
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters:
  • fname (str) – The file path used to save the vectors in.
  • fvocab (str) – Optional file path used to save the vocabulary.
  • binary (bool) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
  • total_vec (int) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards).
similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters:
  • vector (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple
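
Illustrative sketch of querying with an arbitrary vector (the returned words and scores depend on the model):

>>> query = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
>>> word_vectors.similar_by_vector(query, topn=3)
[('king', 0.84), ('queen', 0.71), ('monarch', 0.68)]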

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters:
  • word (str) – Word
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_word returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity).

Return type:list of tuple

Example

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]

similarity(w1, w2)

Compute cosine similarity between two words.

Examples

>>> trained_model.similarity('woman', 'man')
0.73723527
>>> trained_model.similarity('woman', 'woman')
1.0
similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<type 'numpy.float32'>)

Constructs a term similarity matrix for computing Soft Cosine Measure.

Constructs a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters:
  • dictionary (Dictionary) – A dictionary that specifies a mapping between words and the indices of rows and columns of the resulting term similarity matrix.
  • tfidf (gensim.models.tfidfmodel.TfidfModel, optional) – A model that specifies the relative importance of the terms in the dictionary. The rows of the term similarity matrix will be built in increasing order of term importance, or in the order of term identifiers if None.
  • threshold (float, optional) – Only pairs of words whose embeddings are more similar than threshold are considered when building the sparse term similarity matrix.
  • exponent (float, optional) – The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single row or column of the term similarity matrix. Setting nonzero_limit to a constant ensures that the time complexity of computing the Soft Cosine Measure will be linear in the document length rather than quadratic.
  • dtype (numpy.dtype, optional) – Data-type of the term similarity matrix.
Returns:

Term similarity matrix.

Return type:

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()
The Soft Cosine Measure.
gensim.similarities.docsim.SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
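
A hedged end-to-end sketch of using the matrix with gensim.matutils.softcossim (the toy corpus and the resulting score are illustrative):

>>> from gensim.corpora import Dictionary
>>> from gensim.matutils import softcossim
>>> texts = [['cat', 'sat', 'mat'], ['dog', 'sat', 'rug']]
>>> dictionary = Dictionary(texts)
>>> similarity_matrix = word_vectors.similarity_matrix(dictionary)  # term similarities from the embeddings
>>> bow1, bow2 = dictionary.doc2bow(texts[0]), dictionary.doc2bow(texts[1])
>>> softcossim(bow1, bow2, similarity_matrix)
0.42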

syn0
syn0_ngrams
syn0_ngrams_norm
syn0_vocab
syn0_vocab_norm
syn0norm
wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents. When using this code, please consider citing the following papers:

Note that if one of the documents has no words that exist in the Word2Vec vocab, float(‘inf’) (i.e. infinity) will be returned.

This method only works if pyemd is installed (can be installed via pip, but requires a C compiler).

Example

>>> # Train word2vec model (assumes `sentences` is an iterable of token lists).
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences)
>>> # Some sentences to test.
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # Remove their stopwords.
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))
>>> sentence_obama = [w for w in sentence_obama if w not in stop_words]
>>> sentence_president = [w for w in sentence_president if w not in stop_words]
>>> # Compute WMD.
>>> distance = model.wmdistance(sentence_obama, sentence_president)
word_vec(word, use_norm=False)

Accept a single word as input. Returns the word's representation in vector space, as a 1D numpy array.

If use_norm is True, returns the normalized word vector.
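
Because FastText composes vectors from character n-grams, word_vec can often return a vector even for a word missing from the training vocabulary, as long as some of its n-grams were seen; a KeyError is raised only when no n-grams match. A hedged sketch (fasttext_vectors is an assumed trained FastTextKeyedVectors instance):

>>> vec = fasttext_vectors.word_vec('night')        # in-vocabulary word
>>> oov_vec = fasttext_vectors.word_vec('nightts')  # out-of-vocabulary, composed from matching n-grams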

words_closer_than(w1, w2)

Returns all words that are closer to w1 than w2 is to w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

List of words that are closer to w1 than w2 is to w1.

Return type:

list (str)

Examples

>>> model.words_closer_than('carnivore', 'mammal')
['dog', 'canine']
wv
gensim.models.keyedvectors.KeyedVectors

alias of gensim.models.keyedvectors.Word2VecKeyedVectors

class gensim.models.keyedvectors.Vocab(**kwargs)

Bases: object

A single vocabulary item, used internally for collecting per-word frequency/sampling info, and for constructing binary trees (incl. both word leaves and inner nodes).

class gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size)

Bases: gensim.models.keyedvectors.WordEmbeddingsKeyedVectors

Class to contain vectors and vocab for word2vec model. Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>, case_insensitive=True)

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See questions-words.txt in https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip for an example.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word not in the first restrict_vocab words (default 30,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. In case case_insensitive is True, the first restrict_vocab words are taken first, and then case normalization is performed.

Use case_insensitive to convert all words in questions and vocab to their uppercase form before evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens and question words. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

This method corresponds to the compute-accuracy script of the original C word2vec.

closer_than(entity1, entity2)

Returns all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Return cosine similarities between one vector and a set of other vectors.

Parameters:
  • vector_1 (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • vectors_all (numpy.array) – for each row in vectors_all, distance from vector_1 is computed. expected shape (num_vectors, dim)
Returns:

Contains cosine similarities between vector_1 and each row in vectors_all. Shape (num_vectors,).

Return type:

numpy.array

distance(w1, w2)

Compute cosine distance between two words.

Examples

>>> trained_model.distance('woman', 'man')
0.34
>>> trained_model.distance('woman', 'woman')
0.0
distances(word_or_vector, other_words=())

Compute cosine distances from a given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in the vocab.

Parameters:
  • word_or_vector (str or numpy.array) – Word or vector from which distances are to be computed.
  • other_words (iterable(str) or None) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).
Returns:

Array containing distances to all words in other_words from input word_or_vector, in the same order as other_words.

Return type:

numpy.array

Notes

Raises KeyError if either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters:words – List of words
Returns:The word furthest from the mean of all words.
Return type:str

Example

>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments. pairs is a filename of a dataset where lines are 3-tuples, each consisting of a word pair and a similarity value, separated by delimiter. An example dataset is included in Gensim (test/test_data/wordsim353.tsv). More datasets can be found at http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html or https://www.cl.cam.ac.uk/~fh295/simlex.html.

The model is evaluated using Pearson correlation coefficient and Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself. The results are printed to log and returned as a triple (pearson, spearman, ratio of pairs with unknown words).

Use restrict_vocab to ignore all word pairs containing a word not in the first restrict_vocab words (default 300,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. If case_insensitive is True, the first restrict_vocab words are taken, and then case normalization is performed.

Use case_insensitive to convert all words in the pairs and vocab to their uppercase form before evaluating the model (default True). Useful when you expect case-mismatch between training tokens and word pairs in the dataset. If there are multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Use dummy4unknown=True to produce zero-valued similarities for pairs with out-of-vocabulary words. Otherwise (default False), these pairs are skipped entirely.

get_keras_embedding(train_embeddings=False)

Return a Keras ‘Embedding’ layer with weights set as the Word2Vec model’s learned word embeddings

Parameters:train_embeddings (bool) – If False, the weights are frozen and stopped from being updated. If True, the weights can/will be further trained/updated.
Returns:Embedding layer
Return type:keras.layers.Embedding
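
A hedged usage sketch (requires Keras to be installed; the surrounding model is purely illustrative):

>>> from keras.models import Sequential
>>> embedding_layer = word_vectors.get_keras_embedding(train_embeddings=False)  # frozen weights
>>> keras_model = Sequential()
>>> keras_model.add(embedding_layer)
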
get_vector(word)

Accept a single entity as input, specified by string tag. Returns the entity's representation in vector space, as a 1D numpy array.

index2entity
init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones, which saves lots of memory.

Note that you cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar, similarity etc., but not train.

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).
classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

Parameters:
  • fname (str) – The file path to the saved word2vec-format file.
  • fvocab (str) – Optional file path to the vocabulary. Word counts are read from the fvocab file, if set (this is the file generated by the -save-vocab flag of the original C tool).
  • binary (bool) – If True, the data is in binary word2vec format.
  • encoding (str) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.
  • unicode_errors (str) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.
  • limit (int) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.
  • datatype – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. (Such types may result in much slower bulk operations or incompatibility with optimized routines.)
Returns:

Returns the loaded vectors as an instance of gensim.models.keyedvectors.Word2VecKeyedVectors.

Return type:

gensim.models.keyedvectors.Word2VecKeyedVectors
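
For example, to limit memory use when loading a very large pretrained file (the path below is illustrative), only the first 500,000 vectors can be read:

>>> from gensim.models import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)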

static log_accuracy()
static log_evaluate_word_pairs(spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters:
  • positive – List of words that contribute positively.
  • negative – List of words that contribute negatively.
  • topn (int) – Number of top-N similar words to return.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple

Examples

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.

In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively – a potentially sensible but untested extension of the method. (With a single positive example, rankings will be the same as in the default most_similar.)

Example:

>>> trained_model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
[(u'iraq', 0.8488819003105164), ...]
most_similar_to_given(entity1, entities_list)

Return the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Examples

>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689
>>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
1.0000000000000004
>>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
True
rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

save(*args, **kwargs)

Saves the keyedvectors. This saved model can be loaded again using load() which supports operations on trained word vectors like most_similar.

Parameters:fname (str) – Path to the file.
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters:
  • fname (str) – The file path used to save the vectors in.
  • fvocab (str) – Optional file path used to save the vocabulary.
  • binary (bool) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
  • total_vec (int) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards).
similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters:
  • vector (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters:
  • word (str) – Word
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_word returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity).

Return type:list of tuple

Example

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]

similarity(w1, w2)

Compute cosine similarity between two words.

Examples

>>> trained_model.similarity('woman', 'man')
0.73723527
>>> trained_model.similarity('woman', 'woman')
1.0
similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<type 'numpy.float32'>)

Constructs a term similarity matrix for computing Soft Cosine Measure.

Constructs a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters:
  • dictionary (Dictionary) – A dictionary that specifies a mapping between words and the indices of rows and columns of the resulting term similarity matrix.
  • tfidf (gensim.models.tfidfmodel.TfidfModel, optional) – A model that specifies the relative importance of the terms in the dictionary. The rows of the term similarity matrix will be built in increasing order of term importance, or in the order of term identifiers if None.
  • threshold (float, optional) – Only pairs of words whose embeddings are more similar than threshold are considered when building the sparse term similarity matrix.
  • exponent (float, optional) – The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single row or column of the term similarity matrix. Setting nonzero_limit to a constant ensures that the time complexity of computing the Soft Cosine Measure will be linear in the document length rather than quadratic.
  • dtype (numpy.dtype, optional) – Data-type of the term similarity matrix.
Returns:

Term similarity matrix.

Return type:

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()
The Soft Cosine Measure.
gensim.similarities.docsim.SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.

syn0
syn0norm
wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents. When using this code, please consider citing the following papers:

Note that if one of the documents has no words that exist in the Word2Vec vocab, float(‘inf’) (i.e. infinity) will be returned.

This method only works if pyemd is installed (can be installed via pip, but requires a C compiler).

Example

>>> # Train word2vec model (assumes `sentences` is an iterable of token lists).
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences)
>>> # Some sentences to test.
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # Remove their stopwords.
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))
>>> sentence_obama = [w for w in sentence_obama if w not in stop_words]
>>> sentence_president = [w for w in sentence_president if w not in stop_words]
>>> # Compute WMD.
>>> distance = model.wmdistance(sentence_obama, sentence_president)
word_vec(word, use_norm=False)

Accept a single word as input. Returns the word's representation in vector space, as a 1D numpy array.

If use_norm is True, returns the normalized word vector.

Examples

>>> trained_model['office']
array([ -1.40128313e-02, ...])
words_closer_than(w1, w2)

Returns all words that are closer to w1 than w2 is to w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

List of words that are closer to w1 than w2 is to w1.

Return type:

list (str)

Examples

>>> model.words_closer_than('carnivore', 'mammal')
['dog', 'canine']
wv
class gensim.models.keyedvectors.WordEmbeddingsKeyedVectors(vector_size)

Bases: gensim.models.keyedvectors.BaseKeyedVectors

Class containing common methods for operations over word vectors.

accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>, case_insensitive=True)

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See questions-words.txt in https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip for an example.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word not in the first restrict_vocab words (default 30,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. In case case_insensitive is True, the first restrict_vocab words are taken first, and then case normalization is performed.

Use case_insensitive to convert all words in questions and vocab to their uppercase form before evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens and question words. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

This method corresponds to the compute-accuracy script of the original C word2vec.

closer_than(entity1, entity2)

Returns all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Return cosine similarities between one vector and a set of other vectors.

Parameters:
  • vector_1 (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • vectors_all (numpy.array) – for each row in vectors_all, distance from vector_1 is computed. expected shape (num_vectors, dim)
Returns:

Contains cosine similarities between vector_1 and each row in vectors_all. Shape (num_vectors,).

Return type:

numpy.array

distance(w1, w2)

Compute cosine distance between two words.

Examples

>>> trained_model.distance('woman', 'man')
0.34
>>> trained_model.distance('woman', 'woman')
0.0
distances(word_or_vector, other_words=())

Compute cosine distances from a given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in the vocab.

Parameters:
  • word_or_vector (str or numpy.array) – Word or vector from which distances are to be computed.
  • other_words (iterable(str) or None) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).
Returns:

Array containing distances to all words in other_words from input word_or_vector, in the same order as other_words.

Return type:

numpy.array

Notes

Raises KeyError if either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters:words – List of words
Returns:The word furthest from the mean of all words.
Return type:str

Example

>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments. pairs is a filename of a dataset where lines are 3-tuples, each consisting of a word pair and a similarity value, separated by delimiter. An example dataset is included in Gensim (test/test_data/wordsim353.tsv). More datasets can be found at http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html or https://www.cl.cam.ac.uk/~fh295/simlex.html.

The model is evaluated using Pearson correlation coefficient and Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself. The results are printed to log and returned as a triple (pearson, spearman, ratio of pairs with unknown words).

Use restrict_vocab to ignore all word pairs containing a word not in the first restrict_vocab words (default 300,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. If case_insensitive is True, the first restrict_vocab words are taken, and then case normalization is performed.

Use case_insensitive to convert all words in the pairs and vocab to their uppercase form before evaluating the model (default True). Useful when you expect case-mismatch between training tokens and word pairs in the dataset. If there are multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Use dummy4unknown=True to produce zero-valued similarities for pairs with out-of-vocabulary words. Otherwise (default False), these pairs are skipped entirely.

get_vector(word)

Accept a single entity as input, specified by string tag. Returns the entity's representation in vector space, as a 1D numpy array.

index2entity
init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones, which saves lots of memory.

Note that you cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar, similarity etc., but not train.

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When methods are called on instance (should be called from class).
static log_accuracy()
static log_evaluate_word_pairs(spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters:
  • positive – List of words that contribute positively.
  • negative – List of words that contribute negatively.
  • topn (int) – Number of top-N similar words to return.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple

Examples

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.

In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively – a potentially sensible but untested extension of the method. (With a single positive example, rankings will be the same as in the default most_similar.)

Example:

>>> trained_model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
[(u'iraq', 0.8488819003105164), ...]
most_similar_to_given(entity1, entities_list)

Return the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Examples

>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689
>>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
1.0000000000000004
>>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
True
rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.

save(*args, **kwargs)

Saves the keyedvectors. This saved model can be loaded again using load() which supports operations on trained word vectors like most_similar.

Parameters:fname (str) – Path to the file.
similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters:
  • vector (numpy.array) – vector from which similarities are to be computed. expected shape (dim,)
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity)

Return type:

list of tuple

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters:
  • word (str) – Word
  • topn (int) – Number of top-N similar words to return. If topn is False, similar_by_word returns the vector of similarity scores.
  • restrict_vocab (int) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Returns a list of tuples (word, similarity).

Return type:list of tuple

Example

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]

similarity(w1, w2)

Compute cosine similarity between two words.

Examples

>>> trained_model.similarity('woman', 'man')
0.73723527
>>> trained_model.similarity('woman', 'woman')
1.0
similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=<type 'numpy.float32'>)

Constructs a term similarity matrix for computing Soft Cosine Measure.

Constructs a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters:
  • dictionary (Dictionary) – A dictionary that specifies a mapping between words and the indices of rows and columns of the resulting term similarity matrix.
  • tfidf (gensim.models.tfidfmodel.TfidfModel, optional) – A model that specifies the relative importance of the terms in the dictionary. The rows of the term similarity matrix will be built in increasing order of term importance, or in the order of term identifiers if None.
  • threshold (float, optional) – Only pairs of words whose embeddings are more similar than threshold are considered when building the sparse term similarity matrix.
  • exponent (float, optional) – The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single row or column of the term similarity matrix. Setting nonzero_limit to a constant ensures that the time complexity of computing the Soft Cosine Measure will be linear in the document length rather than quadratic.
  • dtype (numpy.dtype, optional) – Data-type of the term similarity matrix.
Returns:

Term similarity matrix.

Return type:

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()
The Soft Cosine Measure.
gensim.similarities.docsim.SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.

syn0
syn0norm
wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents. When using this code, please consider citing the following papers:

Note that if one of the documents has no words that exist in the Word2Vec vocab, float(‘inf’) (i.e. infinity) will be returned.

This method only works if pyemd is installed (can be installed via pip, but requires a C compiler).

Example

>>> # Train word2vec model (assumes `sentences` is an iterable of token lists).
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences)
>>> # Some sentences to test.
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # Remove their stopwords.
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))
>>> sentence_obama = [w for w in sentence_obama if w not in stop_words]
>>> sentence_president = [w for w in sentence_president if w not in stop_words]
>>> # Compute WMD.
>>> distance = model.wmdistance(sentence_obama, sentence_president)
word_vec(word, use_norm=False)

Accept a single word as input. Returns the word's representation in vector space, as a 1D numpy array.

If use_norm is True, returns the normalized word vector.

Examples

>>> trained_model['office']
array([ -1.40128313e-02, ...])
words_closer_than(w1, w2)

Returns all words that are closer to w1 than w2 is to w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

List of words that are closer to w1 than w2 is to w1.

Return type:

list (str)

Examples

>>> model.words_closer_than('carnivore', 'mammal')
['dog', 'canine']
wv