models.wrappers.wordrank – Word Embeddings from WordRank

Python wrapper around Wordrank. Original paper: “WordRank: Learning Word Embeddings via Robust Ranking”.

Installation

Follow the official guide, or use the steps below.

  • On Linux

    sudo yum install boost-devel #(on RedHat/Centos)
    sudo apt-get install libboost-all-dev #(on Ubuntu)
    
    git clone https://bitbucket.org/shihaoji/wordrank
    cd wordrank/
    # replace icc with gcc in install.sh
    ./install.sh
    
  • On MacOS

    brew install cmake
    brew install wget
    brew install boost
    brew install mercurial
    
    git clone https://bitbucket.org/shihaoji/wordrank
    cd wordrank/
    # replace icc with gcc in install.sh
    ./install.sh
    

Examples

>>> from gensim.models.wrappers import Wordrank
>>>
>>> path_to_wordrank_binary = '/path/to/wordrank/binary'
>>> model = Wordrank.train(path_to_wordrank_binary, corpus_file='text8', out_name='wr_model')
>>>
>>> print(model["hello"])  # prints the vector for the given word

Warning

Note that the wrapper might not work in a docker container for large datasets due to memory limits (caused by MPI).

class gensim.models.wrappers.wordrank.Wordrank(vector_size)

Bases: gensim.models.keyedvectors.Word2VecKeyedVectors

Python wrapper using Wordrank implementation

Communication between Wordrank and Python takes place by working with data files on disk and calling the Wordrank binary and GloVe’s helper binaries (for preparing training data) with the subprocess module.

Warning

This is only a Python wrapper for the Wordrank implementation; you need to install the original implementation first and pass the path to the Wordrank directory as wr_path.

accuracy(**kwargs)

Compute accuracy of the model.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Parameters:
  • questions (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.
  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
  • most_similar (function, optional) – Function used for similarity calculation.
  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
Returns:

Full lists of correct and incorrect predictions divided by sections.

Return type:

list of dict of (str, (str, str, str))

add(entities, weights, replace=False)

Append entities and their vectors manually. If an entity is already in the vocabulary, the old vector is kept unless the replace flag is True.

Parameters:
  • entities (list of str) – Entities specified by string ids.
  • weights ({list of numpy.ndarray, numpy.ndarray}) – List of 1D np.array vectors or a 2D np.array of vectors.
  • replace (bool, optional) – Flag indicating whether to replace vectors for entities which already exist in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

static cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

Parameters:
  • vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).
  • vectors_all (numpy.ndarray) – For each row in vectors_all, the similarity to vector_1 is computed, expected shape (num_vectors, dim).
Returns:

Contains the cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).

Return type:

numpy.ndarray
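
The computation described above can be illustrated with a minimal pure-Python sketch. It is not gensim's implementation (which operates on numpy arrays); the function name cosine_similarities_sketch and the toy vectors are assumptions made for illustration only.

```python
import math


def cosine_similarities_sketch(vector_1, vectors_all):
    """Return the cosine similarity between vector_1 and each row of vectors_all."""
    norm_1 = math.sqrt(sum(x * x for x in vector_1))
    similarities = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(x * x for x in row))
        similarities.append(dot / (norm_1 * norm_row))
    return similarities


# Toy data: one query vector of shape (dim,) and three rows of shape (num_vectors, dim).
sims = cosine_similarities_sketch([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
print(sims)
```

The result has one entry per row of vectors_all, matching the documented output shape (num_vectors,).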

distance(w1, w2)

Compute cosine distance between two words. Calculate 1 - similarity().

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

Distance between w1 and w2.

Return type:

float

distances(word_or_vector, other_words=())

Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return the distances between word_or_vector and all words in the vocab.

Parameters:
  • word_or_vector ({str, numpy.ndarray}) – Word or vector from which distances are to be computed.
  • other_words (iterable of str) – For each word in other_words distance from word_or_vector is computed. If None or empty, distance of word_or_vector from all words in vocab is computed (including itself).
Returns:

Array containing distances to all words in other_words from input word_or_vector.

Return type:

numpy.array

Raises:

KeyError – If either word_or_vector or any word in other_words is absent from vocab.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Parameters:words (list of str) – List of words.
Returns:The word furthest away from the mean of all words.
Return type:str
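
As a rough illustration of "furthest from the mean", the sketch below normalizes each vector, averages them, and picks the word whose vector has the lowest cosine similarity to that mean. The helper names and the toy two-dimensional vectors are hypothetical; the actual method works on the model's stored vectors.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def doesnt_match_sketch(vectors):
    """vectors: dict mapping word -> list of floats. Return the odd word out."""
    normalized = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normalized.values())))
    # Mean of the unit vectors, itself re-normalized to unit length.
    mean = unit([sum(vec[i] for vec in normalized.values()) / len(normalized)
                 for i in range(dim)])
    # The word with the smallest dot product with the mean is the furthest away.
    return min(normalized,
               key=lambda word: sum(a * b for a, b in zip(normalized[word], mean)))


# Hypothetical toy vectors, chosen so that "cereal" points in a different direction.
toy_vectors = {
    "breakfast": [1.0, 0.1],
    "lunch": [0.9, 0.2],
    "dinner": [1.0, 0.0],
    "cereal": [0.0, 1.0],
}
odd_one_out = doesnt_match_sketch(toy_vectors)
print(odd_one_out)
```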
ensemble_embedding(word_embedding, context_embedding)

Replace current syn0 with the sum of context and word embeddings.

Parameters:
  • word_embedding (str) – Path to word embeddings in GloVe format.
  • context_embedding (str) – Path to context embeddings in word2vec_format.
Returns:

Matrix with new embeddings.

Return type:

numpy.ndarray
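
The "sum of context and word embeddings" can be sketched as an element-wise addition of the two vectors that belong to the same word. This is only an illustration with in-memory dictionaries; the real method reads both embeddings from files, and the names below are assumptions.

```python
def ensemble_sketch(word_vectors, context_vectors):
    """Add the context vector to the word vector for every word, element-wise."""
    return {
        word: [w + c for w, c in zip(vec, context_vectors[word])]
        for word, vec in word_vectors.items()
    }


# Hypothetical toy embeddings; the two dicts may list words in a different order,
# so vectors are matched by word rather than by position.
word_vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
context_vectors = {"dog": [1.0, 1.0], "cat": [0.5, 0.5]}
combined = ensemble_sketch(word_vectors, context_vectors)
print(combined)
```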

evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute performance of the model on an analogy test set.

This is the modern variant of accuracy(), see the discussion on GitHub #1935.

The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.

This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

Parameters:
  • analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.
  • restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
  • dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
Returns:

  • score (float) – The overall evaluation score on the entire evaluation set
  • sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
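
The analogy file layout described above (4-tuples of words, grouped under “: SECTION NAME” lines) can be read with a short parser such as the sketch below. It only illustrates the file format, not the scoring; the function name parse_analogies and the sample text are assumptions for illustration.

```python
def parse_analogies(lines):
    """Group 4-word analogy questions under their ': SECTION NAME' headers."""
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            # A new section starts; its name follows the ": " prefix.
            sections.append({"section": line[2:], "questions": []})
            continue
        if not sections:
            raise ValueError("question found before any ': SECTION NAME' line")
        words = line.split()
        if len(words) == 4:
            sections[-1]["questions"].append(tuple(words))
    return sections


# Illustrative sample in the questions-words.txt layout.
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl brother sister
"""
sections = parse_analogies(sample.splitlines())
for section in sections:
    print(section["section"], len(section["questions"]))
```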

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments.

Notes

More datasets can be found at:

  • http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
  • https://www.cl.cam.ac.uk/~fh295/simlex.html

Parameters:
  • pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.
  • delimiter (str, optional) – Separator in pairs file.
  • restrict_vocab (int, optional) – Ignore all word pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
  • case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
  • dummy4unknown (bool, optional) – If True - produce zero similarity for pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.
Returns:

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.
  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.
  • oov_ratio (float) – The ratio of pairs with unknown words.
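
To make the evaluation concrete, here is a minimal sketch that skips pairs with out-of-vocabulary words, computes the Pearson correlation between human scores and model similarities, and reports the share of skipped pairs as a fraction. The Spearman coefficient and p-values are omitted; all names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def score_pairs(pairs, model_similarity, vocab):
    """pairs: list of (word1, word2, human_score). Skips pairs with unknown words."""
    gold, predicted, oov = [], [], 0
    for word1, word2, human_score in pairs:
        if word1 not in vocab or word2 not in vocab:
            oov += 1
            continue
        gold.append(human_score)
        predicted.append(model_similarity(word1, word2))
    return pearson(gold, predicted), oov / len(pairs)


# Hypothetical model similarities and human judgments.
toy_similarity = {("cat", "dog"): 0.8, ("car", "truck"): 0.9, ("cat", "car"): 0.2}
toy_pairs = [("cat", "dog", 8.0), ("car", "truck", 9.0),
             ("cat", "car", 2.0), ("cat", "unicorn", 5.0)]
correlation, oov_fraction = score_pairs(
    toy_pairs, lambda a, b: toy_similarity[(a, b)], {"cat", "dog", "car", "truck"})
print(correlation, oov_fraction)
```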

get_keras_embedding(train_embeddings=False)

Get a Keras ‘Embedding’ layer with weights set as the Word2Vec model’s learned word embeddings.

Parameters:train_embeddings (bool) – If False, the weights are frozen and stopped from being updated. If True, the weights can/will be further trained/updated.
Returns:Embedding layer.
Return type:keras.layers.Embedding
Raises:ImportError – If Keras not installed.

Warning

This method works only if Keras is installed.

get_vector(word)

Get the entity’s representations in vector space, as a 1D numpy array.

Parameters:word (str) – Identifier of the entity (word) to return the vector for.
Returns:Vector for the specified entity.
Return type:numpy.ndarray
Raises:KeyError – If the given entity identifier doesn’t exist.
index2entity
init_sims(replace=False)

Precompute L2-normalized vectors.

Parameters:replace (bool, optional) – If True - forget the original vectors and only keep the normalized ones, which saves a lot of memory.

Warning

You cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar(), similarity(), etc., but not train.
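
L2 normalization means dividing each vector by its Euclidean length, so that every row ends up with length 1 and a dot product between two rows equals their cosine similarity. A minimal sketch on plain lists (the function name is an assumption; the real method works on the model's arrays):

```python
import math


def l2_normalize_rows(vectors):
    """Return a copy of vectors in which every row has unit Euclidean length."""
    normalized = []
    for row in vectors:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized


rows = l2_normalize_rows([[3.0, 4.0], [0.0, 2.0]])
print(rows)
```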

classmethod load(fname_or_handle, **kwargs)

Load an object previously saved using save() from a file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()
Save object to file.
Returns:Object loaded from fname.
Return type:object
Raises:AttributeError – When called on an object instance instead of class (this is a class method).
classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=numpy.float32)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Warning

The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

Parameters:
  • fname (str) – The file path to the saved word2vec-format file.
  • fvocab (str, optional) – File path to the vocabulary. Word counts are read from the fvocab file, if set (this is the file generated by the -save-vocab flag of the original C tool).
  • binary (bool, optional) – If True, the data is in binary word2vec format.
  • encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.
  • unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.
  • limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.
  • datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.
Returns:

Loaded model.

Return type:

Word2VecKeyedVectors
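
For orientation, the plain-text variant of the word2vec format (binary=False) starts with a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with space-separated values. The sketch below reads such data from a file-like object; it is a simplified illustration that ignores encodings, fvocab and the binary variant, and the function name is hypothetical.

```python
import io


def read_word2vec_text(handle, limit=None):
    """Read text-format word2vec data into a dict mapping word -> list of floats."""
    vocab_size, dim = (int(value) for value in handle.readline().split())
    if limit is not None:
        # Mirror the idea of the limit parameter: read at most this many vectors.
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        parts = handle.readline().rstrip().split(" ")
        word, values = parts[0], [float(value) for value in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for word %r" % word)
        vectors[word] = values
    return vectors


# Illustrative in-memory sample: 2 words, 3 dimensions.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(vectors)
```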

classmethod load_wordrank_model(model_file, vocab_file=None, context_file=None, sorted_vocab=1, ensemble=1)

Load model from model_file.

Parameters:
  • model_file (str) – Path to model in GloVe format.
  • vocab_file (str, optional) – Path to file with vocabulary.
  • context_file (str, optional) – Path to file with context-embedding in word2vec_format.
  • sorted_vocab ({0, 1}, optional) – If 1 - sort the vocabulary by descending frequency before assigning word indexes, otherwise - do nothing.
  • ensemble ({0, 1}, optional) – If 1 - use ensemble of word and context vectors.
static log_accuracy(section)
static log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Parameters:
  • positive (list of str, optional) – List of words that contribute positively.
  • negative (list of str, optional) – List of words that contribute negatively.
  • topn (int, optional) – Number of top-N similar words to return.
  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Sequence of (word, similarity).

Return type:

list of (str, float)
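
The procedure described above can be sketched in pure Python: normalize the query vectors, average them with weight +1 for positive and -1 for negative words, and rank all remaining words by cosine similarity to that mean. This is a simplified illustration, not gensim's code; the function names and the toy two-dimensional vectors are assumptions.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar_sketch(vectors, positive, negative=(), topn=10):
    """vectors: dict word -> list of floats. Return [(word, similarity), ...]."""
    query_words = set(positive) | set(negative)
    terms = [(unit(vectors[w]), 1.0) for w in positive]
    terms += [(unit(vectors[w]), -1.0) for w in negative]
    dim = len(terms[0][0])
    # Weighted mean of the unit-length query vectors, re-normalized.
    mean = unit([sum(weight * vec[i] for vec, weight in terms) / len(terms)
                 for i in range(dim)])
    scores = [
        (word, sum(a * b for a, b in zip(unit(vec), mean)))
        for word, vec in vectors.items()
        if word not in query_words  # query words are excluded from the result
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy vectors for the classic "king - man + woman" query.
toy_vectors = {
    "king": [1.0, 1.0], "man": [1.0, 0.0], "woman": [0.0, 1.0],
    "queen": [0.1, 1.0], "apple": [1.0, -0.5],
}
ranked = most_similar_sketch(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranked)
```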

most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().

Parameters:
  • positive (list of str, optional) – List of words that contribute positively.
  • negative (list of str, optional) – List of words that contribute negatively.
  • topn (int, optional) – Number of top-N similar words to return.
Returns:

Sequence of (word, similarity).

Return type:

list of (str, float)
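
As a hedged sketch of the multiplicative objective: each cosine similarity is shifted from [-1, 1] to [0, 1], the shifted similarities to the positive words are multiplied in the numerator, and those to the negative words in the denominator. The small epsilon that guards against division by zero is set to 1e-6 here as an assumption of this sketch; function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosmul_sketch(vectors, positive, negative=(), topn=10, epsilon=1e-6):
    """Rank words by the multiplicative combination of shifted cosine similarities."""
    query_words = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in query_words:
            continue
        numerator = 1.0
        for pos_word in positive:
            numerator *= (1.0 + cosine(vec, vectors[pos_word])) / 2.0
        denominator = 1.0
        for neg_word in negative:
            denominator *= (1.0 + cosine(vec, vectors[neg_word])) / 2.0
        scores.append((word, numerator / (denominator + epsilon)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy vectors; two positive and one negative word, as in the analogy case.
toy_vectors = {
    "king": [1.0, 1.0], "man": [1.0, 0.0], "woman": [0.0, 1.0],
    "queen": [0.1, 1.0], "apple": [1.0, -0.5],
}
ranked = cosmul_sketch(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranked)
```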

most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Parameters:
  • ws1 (list of str) – Sequence of words.
  • ws2 (list of str) – Sequence of words.
Returns:

Similarities between ws1 and ws2.

Return type:

numpy.ndarray
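
One common way to compare two sets of words, assumed here for illustration, is to average the vectors of each set and take the cosine similarity of the two means. The sketch below uses plain lists and hypothetical names.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def n_similarity_sketch(vectors_1, vectors_2):
    """Cosine similarity between the mean vectors of two sets."""
    mean_1, mean_2 = mean_vector(vectors_1), mean_vector(vectors_2)
    dot = sum(a * b for a, b in zip(mean_1, mean_2))
    norm_1 = math.sqrt(sum(a * a for a in mean_1))
    norm_2 = math.sqrt(sum(b * b for b in mean_2))
    return dot / (norm_1 * norm_2)


# Toy data: the mean of the first set points in the same direction as the second set.
set_similarity = n_similarity_sketch([[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]])
print(set_similarity)
```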

rank(entity1, entity2)

Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
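
A minimal sketch of this idea, assuming cosine distance and a rank that starts at 1: count the entities that are strictly closer to entity1 than entity2 is, then add one. The helper names and toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closer_than_sketch(vectors, entity1, entity2):
    """Entities that are strictly closer to entity1 than entity2 is."""
    threshold = cosine_distance(vectors[entity1], vectors[entity2])
    return [
        name for name, vec in vectors.items()
        if name != entity1 and cosine_distance(vectors[entity1], vec) < threshold
    ]


def rank_sketch(vectors, entity1, entity2):
    """Rank 1 means entity2 is the nearest neighbour of entity1."""
    return len(closer_than_sketch(vectors, entity1, entity2)) + 1


toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [1.0, 1.0], "d": [0.0, 1.0]}
print(closer_than_sketch(toy_vectors, "a", "d"), rank_sketch(toy_vectors, "a", "d"))
```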

relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words given top-n similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.

To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.

Parameters:
  • wa (str) – Word for which the top-n most similar words are looked up.
  • wb (str) – Word whose relative cosine similarity with wa is evaluated.
  • topn (int, optional) – Number of top-n similar words to consider with respect to wa.
Returns:

Relative cosine similarity between wa and wb.

Return type:

numpy.float64
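
Relative cosine similarity divides the cosine similarity of wa and wb by the sum of the cosine similarities between wa and its top-n most similar words. The sketch below assumes the similarities to wa are already available as a dictionary; the names and the toy values are hypothetical.

```python
def relative_cosine_similarity_sketch(similarities_to_wa, wb, topn=10):
    """similarities_to_wa: dict mapping word -> cosine similarity with wa."""
    # Sum of the similarities of the topn words most similar to wa.
    top_similarities = sorted(similarities_to_wa.values(), reverse=True)[:topn]
    return similarities_to_wa[wb] / sum(top_similarities)


# Hypothetical cosine similarities between "good" and other vocabulary words.
similarities_to_good = {"great": 0.8, "nice": 0.6, "fine": 0.4, "bad": 0.2}
rcs = relative_cosine_similarity_sketch(similarities_to_good, "great", topn=2)
print(rcs)
```

With topn=2 the denominator covers only "great" and "nice", so the score is 0.8 divided by their summed similarities.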

save(*args, **kwargs)

Save KeyedVectors.

Parameters:fname (str) – Path to the output file.

See also

load()
Load saved model.
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

Parameters:
  • fname (str) – File path used to save the vectors.
  • fvocab (str, optional) – File path used to save the vocabulary.
  • binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
  • total_vec (int, optional) – Optional parameter to explicitly specify total no. of vectors (in case word vectors are appended with document vectors afterwards).
similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

Parameters:
  • vector (numpy.array) – Vector from which similarities are to be computed.
  • topn ({int, False}, optional) – Number of top-N similar words to return. If topn is False, similar_by_vector returns the vector of similarity scores.
  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Sequence of (word, similarity).

Return type:

list of (str, float)

similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

Parameters:
  • word (str) – Word
  • topn ({int, False}, optional) – Number of top-N similar words to return. If topn is False, similar_by_word returns the vector of similarity scores.
  • restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
Returns:

Sequence of (word, similarity).

Return type:

list of (str, float)

similarity(w1, w2)

Compute cosine similarity between two words.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

Cosine similarity between w1 and w2.

Return type:

float

similarity_matrix(**kwargs)

Construct a term similarity matrix for computing Soft Cosine Measure.

This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix format for computing Soft Cosine Measure between documents.

Parameters:
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.
  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.
  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.
  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.
  • nonzero_limit (int, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix.
  • dtype (numpy.dtype, optional) – Data-type of the sparse term similarity matrix.
Returns:

Term similarity matrix.

Return type:

scipy.sparse.csc_matrix

See also

gensim.matutils.softcossim()
The Soft Cosine Measure.
SoftCosineSimilarity
A class for performing corpus-based similarity queries with Soft Cosine Measure.

Notes

The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.

sort_embeddings(vocab_file)

Sort embeddings according to word frequency.

Parameters:vocab_file (str) – Path to file with vocabulary.
syn0
syn0norm
classmethod train(wr_path, corpus_file, out_name, size=100, window=15, symmetric=1, min_count=5, max_vocab_size=0, sgd_num=100, lrate=0.001, period=10, iter=90, epsilon=0.75, dump_period=10, reg=0, alpha=100, beta=99, loss='hinge', memory=4.0, np=1, cleanup_files=False, sorted_vocab=1, ensemble=0)

Train model.

Parameters:
  • wr_path (str) – Absolute path to the Wordrank directory.
  • corpus_file (str) – Path to the corpus file; each line is expected to contain space-separated tokens.
  • out_name (str) –
    Name of the directory which will be created (in wordrank folder) to save embeddings and training data:
    • model_word_current_<iter>.txt - Word Embeddings saved after every dump_period.
    • model_context_current_<iter>.txt - Context Embeddings saved after every dump_period.
    • meta/vocab.txt - vocab file.
    • meta/wiki.toy - word-word co-occurrence values.
  • size (int, optional) – Dimensionality of the feature vectors.
  • window (int, optional) – Number of context words to the left (and to the right, if symmetric = 1).
  • symmetric ({0, 1}, optional) – If 1 - using symmetric windows, if 0 - will use only left context words.
  • min_count (int, optional) – Ignore all words with total frequency lower than min_count.
  • max_vocab_size (int, optional) – Upper bound on vocabulary size, i.e. keep the <int> most frequent words. If 0 - no limit.
  • sgd_num (int, optional) – Number of SGD steps taken for each data point.
  • lrate (float, optional) – Learning rate (caution: a value that is too high makes training diverge and produce NaN).
  • period (int, optional) – Period of xi variable updates.
  • iter (int, optional) – Number of iterations (epochs) over the corpus.
  • epsilon (float, optional) – Power scaling value for weighting function.
  • dump_period (int, optional) – Period after which embeddings should be dumped.
  • reg (int, optional) – Value of regularization parameter.
  • alpha (int, optional) – Alpha parameter of gamma distribution.
  • beta (int, optional) – Beta parameter of gamma distribution.
  • loss ({"logistic", "hinge"}, optional) – Name of the loss function.
  • memory (float, optional) – Soft limit for memory consumption, in GB.
  • np (int, optional) – Number of processes to execute (mpirun option).
  • cleanup_files (bool, optional) – If True, delete directory and files used by this wrapper.
  • sorted_vocab ({0, 1}, optional) – If 1 - sort the vocabulary by descending frequency before assigning word indexes, otherwise - do nothing.
  • ensemble ({0, 1}, optional) – If 1 - use ensemble of word and context vectors.
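
The corpus_file parameter expects one line of space-separated tokens per sentence or document. A small helper such as the following sketch can produce a file in that layout from already tokenized text; the helper name, the file name and the toy sentences are assumptions for illustration.

```python
import os
import tempfile


def write_corpus(sentences, path):
    """Write tokenized sentences to path, one line of space-separated tokens each."""
    with open(path, "w", encoding="utf8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")


toy_sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]
corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
write_corpus(toy_sentences, corpus_path)

# The resulting path is what would be passed as corpus_file to train().
with open(corpus_path, encoding="utf8") as handle:
    corpus_lines = handle.read().splitlines()
print(corpus_lines)
```

A real corpus needs far more text than this toy example, since words below min_count are ignored.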
wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents.

When using this code, please consider citing the following papers:

Parameters:
  • document1 (list of str) – Input document.
  • document2 (list of str) – Input document.
Returns:

Word Mover’s distance between document1 and document2.

Return type:

float

Warning

This method only works if pyemd is installed.

If one of the documents has no words that exist in the vocab, float(‘inf’) (i.e. infinity) will be returned.

Raises:ImportError – If pyemd isn’t installed.

word_vec(word, use_norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters:
  • word (str) – Input word
  • use_norm (bool, optional) – If True - resulting vector will be L2-normalized (unit euclidean length).
Returns:

Vector representation of word.

Return type:

numpy.ndarray

Raises:

KeyError – If word not in vocabulary.

words_closer_than(w1, w2)

Get all words that are closer to w1 than w2 is to w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

List of words that are closer to w1 than w2 is to w1.

Return type:

list of str

wv