models.wrappers.wordrank
– Word Embeddings from WordRank¶Python wrapper around WordRank. Original paper: “WordRank: Learning Word Embeddings via Robust Ranking”.
Use the official guide or the instructions below to install WordRank.
On Linux
sudo yum install boost-devel  # (on RedHat/CentOS)
sudo apt-get install libboost-all-dev  # (on Ubuntu)
git clone https://bitbucket.org/shihaoji/wordrank
cd wordrank/
# replace icc with gcc in install.sh
./install.sh
On MacOS
brew install cmake
brew install wget
brew install boost
brew install mercurial
git clone https://bitbucket.org/shihaoji/wordrank
cd wordrank/
# replace icc with gcc in install.sh
./install.sh
Examples
>>> from gensim.models.wrappers import Wordrank
>>>
>>> path_to_wordrank_binary = '/path/to/wordrank/binary'
>>> model = Wordrank.train(path_to_wordrank_binary, corpus_file='text8', out_name='wr_model')
>>>
>>> print(model["hello"])  # prints vector for the given word
Warning
Note that the wrapper might not work in a docker container for large datasets due to memory limits (caused by MPI).
gensim.models.wrappers.wordrank.
Wordrank
(vector_size)¶Bases: gensim.models.keyedvectors.Word2VecKeyedVectors
Python wrapper around the WordRank implementation.
Communication between WordRank and Python takes place by working with data files on disk and calling the WordRank binary and GloVe’s helper binaries (for preparing training data) via the subprocess module.
Warning
This is only a Python wrapper around the WordRank implementation;
you need to install the original implementation first and pass the path to the wordrank directory via wr_path.
accuracy
(**kwargs)¶Compute accuracy of the model.
The accuracy is reported (printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.
Parameters: 


Returns:  Full lists of correct and incorrect predictions divided by sections. 
Return type:  list of dict of (str, (str, str, str))
add
(entities, weights, replace=False)¶Append entities and their vectors manually. If some entity is already in the vocabulary, the old vector is kept unless the replace flag is True.
Parameters: 


closer_than
(entity1, entity2)¶Get all entities that are closer to entity1 than entity2 is to entity1.
cosine_similarities
(vector_1, vectors_all)¶Compute cosine similarities between one vector and a set of other vectors.
Parameters: 


Returns:  Contains cosine distance between vector_1 and each row in vectors_all, shape (num_vectors,). 
Return type:  numpy.ndarray 
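The underlying computation can be sketched with plain numpy (a toy example with made-up 2-d vectors, not the gensim call itself):

```python
import numpy as np

def cosine_similarities(vector_1, vectors_all):
    # Normalize the query vector and every row, then take dot products.
    norm = np.linalg.norm(vector_1)
    all_norms = np.linalg.norm(vectors_all, axis=1)
    return vectors_all.dot(vector_1) / (norm * all_norms)

v = np.array([1.0, 0.0])
m = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
sims = cosine_similarities(v, m)  # one similarity per row of m
```

An identical row yields 1.0, an orthogonal row 0.0, and an opposite row -1.0.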
distance
(w1, w2)¶Compute cosine distance between two words.
Calculated as 1 - similarity().
Parameters: 


Returns:  Distance between w1 and w2. 
Return type:  float 
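As a minimal numpy sketch of the relationship between distance and similarity (toy vectors, not the gensim API):

```python
import numpy as np

def similarity(v1, v2):
    # Cosine similarity between two vectors.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def distance(v1, v2):
    # Cosine distance is defined as 1 - cosine similarity.
    return 1.0 - similarity(v1, v2)

a = np.array([1.0, 2.0])
d_same = distance(a, a)  # identical vectors have distance 0
```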
distances
(word_or_vector, other_words=())¶Compute cosine distances from a given word or vector to all words in other_words. If other_words is empty, return the distance between word_or_vector and all words in the vocab.
Parameters: 


Returns:  Array containing distances to all words in other_words from input word_or_vector. 
Return type:  numpy.array 
Raises:  KeyError – If either word_or_vector or any word in other_words is absent from vocab. 
doesnt_match
(words)¶Which word from the given list doesn’t go with the others?
Parameters:  words (list of str) – List of words. 

Returns:  The word furthest away from the mean of all the words. 
Return type:  str 
ensemble_embedding
(word_embedding, context_embedding)¶Replace current syn0 with the sum of context and word embeddings.
Parameters: 


Returns:  Matrix with new embeddings. 
Return type:  numpy.ndarray 
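A minimal sketch of the ensemble idea: the new embedding matrix is the element-wise sum of the word and context matrices (the actual wrapper also handles vocabulary alignment):

```python
import numpy as np

# Word and context matrices of the same shape; the ensemble is their sum.
word_emb = np.array([[1.0, 2.0], [3.0, 4.0]])
context_emb = np.array([[0.5, 0.5], [1.0, 1.0]])
ensemble = word_emb + context_emb
```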
evaluate_word_analogies
(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶Compute performance of the model on an analogy test set.
This is the modern variant of accuracy()
, see
discussion on GitHub #1935.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
Parameters: 


Returns: 

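The vector-offset evaluation underlying the analogy test ("a is to b as c is to ?") can be sketched in plain numpy; the word list and 2-d vectors below are toy assumptions, not real embeddings:

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    # "a is to b as c is to ?" -- nearest word to (b - a + c),
    # excluding the three input words.
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vectors.items():
        if w in (a, b, c):
            continue
        sim = np.dot(v, target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

toy = {
    'king':  np.array([1.0, 1.0]),
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([0.0, 1.0]),
    'queen': np.array([0.2, 2.0]),
}
answer = solve_analogy('man', 'king', 'woman', toy)  # 'queen'
```

The method scores a section of the test set by the fraction of analogies solved this way.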
evaluate_word_pairs
(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at * http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html * https://www.cl.cam.ac.uk/~fh295/simlex.html.
Parameters: 


Returns: 

get_keras_embedding
(train_embeddings=False)¶Get a Keras ‘Embedding’ layer with weights set as the Word2Vec model’s learned word embeddings.
Parameters:  train_embeddings (bool) – If False, the weights are frozen and stopped from being updated. If True, the weights can/will be further trained/updated. 

Returns:  Embedding layer. 
Return type:  keras.layers.Embedding 
Raises:  ImportError – If Keras not installed. 
Warning
This method works only if Keras is installed.
get_vector
(word)¶Get the entity’s representation in vector space, as a 1D numpy array.
Parameters:  entity (str) – Identifier of the entity to return the vector for. 

Returns:  Vector for the specified entity. 
Return type:  numpy.ndarray 
Raises:  KeyError – If the given entity identifier doesn’t exist. 
index2entity
¶
init_sims
(replace=False)¶Precompute L2-normalized vectors.
Parameters:  replace (bool, optional) – If True, forget the original vectors and only keep the normalized ones (saves lots of memory!) 

Warning
You cannot continue training after doing a replace.
The model becomes effectively read-only: you can call
most_similar()
,
similarity()
, etc., but not train.
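The normalization itself is a simple row-wise L2 rescaling, sketched here with toy data:

```python
import numpy as np

vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
# Precompute L2-normalized rows; with replace=True the originals are dropped.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors_norm = vectors / norms  # every row now has unit length
```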
load
(fname_or_handle, **kwargs)¶Load an object previously saved using save()
from a file.
Parameters: 


See also
save()
Returns:  Object loaded from fname. 

Return type:  object 
Raises:  AttributeError – When called on an object instance instead of class (this is a class method). 
load_word2vec_format
(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)¶Load the input-hidden weight matrix from the original C word2vec-tool format.
Warning
The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
Parameters: 


Returns:  Loaded model. 
Return type: 
load_wordrank_model
(model_file, vocab_file=None, context_file=None, sorted_vocab=1, ensemble=1)¶Load model from model_file.
Parameters: 


log_accuracy
(section)¶
log_evaluate_word_pairs
(pearson, spearman, oov, pairs)¶
most_similar
(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)¶Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
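The mean-then-rank procedure can be sketched with numpy; the words and 2-d vectors are invented for illustration:

```python
import numpy as np

def most_similar(positive, negative, vectors, topn=2):
    # Mean of the positive vectors and the negated negative vectors,
    # then rank every other word by cosine similarity to that mean.
    mean = np.mean([vectors[w] for w in positive] +
                   [-vectors[w] for w in negative], axis=0)
    mean /= np.linalg.norm(mean)
    scored = [(w, float(np.dot(v / np.linalg.norm(v), mean)))
              for w, v in vectors.items() if w not in positive + negative]
    return sorted(scored, key=lambda x: -x[1])[:topn]

toy = {
    'paris':  np.array([1.0, 0.2]),
    'london': np.array([0.9, 0.3]),
    'banana': np.array([0.1, 1.0]),
}
result = most_similar(['paris'], [], toy)  # 'london' ranks first
```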
most_similar_cosmul
(positive=None, negative=None, topn=10)¶Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg in “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator,
respectively - a potentially sensible but untested extension of the method.
With a single positive example, rankings will be the same as in the default
most_similar()
.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
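The 3CosMul score for a single candidate can be sketched as follows (shifting each cosine to [0, 1] via (1 + cos) / 2, as in Levy and Goldberg's formulation; the epsilon guards against division by zero):

```python
import numpy as np

def cosmul_score(word_vec, positive, negative, eps=1e-6):
    # 3CosMul: positives multiply the numerator, negatives the denominator.
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    num = np.prod([(1 + cos(word_vec, p)) / 2 for p in positive])
    den = np.prod([(1 + cos(word_vec, n)) / 2 for n in negative])
    return num / (den + eps)

# A candidate aligned with the positive example and orthogonal to the
# negative one scores 1 / 0.5 = 2 (approximately, due to eps).
score_good = cosmul_score(np.array([1.0, 0.0]),
                          positive=[np.array([1.0, 0.0])],
                          negative=[np.array([0.0, 1.0])])
```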
most_similar_to_given
(entity1, entities_list)¶Get the entity from entities_list most similar to entity1.
n_similarity
(ws1, ws2)¶Compute cosine similarity between two sets of words.
Parameters: 


Returns:  Similarities between ws1 and ws2. 
Return type:  numpy.ndarray 
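A numpy sketch of the set-to-set similarity (cosine between the means of the unit-normalized vectors of each set), using made-up toy vectors:

```python
import numpy as np

def n_similarity(ws1, ws2, vectors):
    # Cosine similarity between the means of two sets of
    # unit-normalized word vectors.
    def unit(v):
        return v / np.linalg.norm(v)
    m1 = np.mean([unit(vectors[w]) for w in ws1], axis=0)
    m2 = np.mean([unit(vectors[w]) for w in ws2], axis=0)
    return float(np.dot(unit(m1), unit(m2)))

toy = {
    'sushi': np.array([1.0, 0.0]),
    'shop':  np.array([0.0, 1.0]),
}
same = n_similarity(['sushi', 'shop'], ['sushi', 'shop'], toy)  # 1.0
```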
rank
(entity1, entity2)¶Rank of the distance of entity2 from entity1, in relation to distances of all entities from entity1.
relative_cosine_similarity
(wa, wb, topn=10)¶Compute the relative cosine similarity between two words given topn similar words, by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari, Josef van Genabith: “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.
Parameters: 


Returns:  Relative cosine similarity between wa and wb. 
Return type:  numpy.float64 
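Equation (1) of the paper divides the pairwise cosine by the summed cosines of wa's topn neighbours; a toy sketch (invented 2-d vectors, topn=2 for brevity):

```python
import numpy as np

def relative_cosine_similarity(wa, wb, vectors, topn=10):
    # cosine(wa, wb) divided by the sum of cosines between wa
    # and its topn most similar words.
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    sims = sorted((cos(vectors[wa], v) for w, v in vectors.items() if w != wa),
                  reverse=True)
    return cos(vectors[wa], vectors[wb]) / sum(sims[:topn])

toy = {
    'big':   np.array([1.0, 0.1]),
    'large': np.array([1.0, 0.2]),
    'cat':   np.array([0.0, 1.0]),
}
val = relative_cosine_similarity('big', 'large', toy, topn=2)
```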
save
(*args, **kwargs)¶Save KeyedVectors.
Parameters:  fname (str) – Path to the output file. 

See also
load()
save_word2vec_format
(fname, fvocab=None, binary=False, total_vec=None)¶Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
Parameters: 


similar_by_vector
(vector, topn=10, restrict_vocab=None)¶Find the top-N most similar words by vector.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similar_by_word
(word, topn=10, restrict_vocab=None)¶Find the top-N most similar words.
Parameters: 


Returns:  Sequence of (word, similarity). 
Return type:  list of (str, float) 
similarity
(w1, w2)¶Compute cosine similarity between two words.
Parameters: 


Returns:  Cosine similarity between w1 and w2. 
Return type:  float 
similarity_matrix
(**kwargs)¶Construct a term similarity matrix for computing Soft Cosine Measure.
This creates a sparse term similarity matrix in the scipy.sparse.csc_matrix
format for computing
Soft Cosine Measure between documents.
Parameters: 


Returns:  Term similarity matrix. 
Return type: 

See also
gensim.matutils.softcossim()
SoftCosineSimilarity
Notes
The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of Delphine Charlet and Geraldine Damnati, “SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering”, 2017.
sort_embeddings
(vocab_file)¶Sort embeddings according to word frequency.
Parameters:  vocab_file (str) – Path to file with vocabulary. 

syn0
¶syn0norm
¶train
(wr_path, corpus_file, out_name, size=100, window=15, symmetric=1, min_count=5, max_vocab_size=0, sgd_num=100, lrate=0.001, period=10, iter=90, epsilon=0.75, dump_period=10, reg=0, alpha=100, beta=99, loss='hinge', memory=4.0, np=1, cleanup_files=False, sorted_vocab=1, ensemble=0)¶Train model.
Parameters: 


wmdistance
(document1, document2)¶Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
Parameters: 


Returns:  Word Mover’s distance between document1 and document2. 
Return type:  float 
Warning
This method only works if pyemd is installed.
If one of the documents has no words that exist in the vocab, float(‘inf’) (i.e. infinity) will be returned.
Raises:  ImportError – If pyemd isn’t installed. 

word_vec
(word, use_norm=False)¶Get word representations in vector space, as a 1D numpy array.
Parameters: 


Returns:  Vector representation of word. 
Return type:  numpy.ndarray 
Raises:  KeyError – If word not in vocabulary. 
words_closer_than
(w1, w2)¶Get all words that are closer to w1 than w2 is to w1.
Parameters: 


Returns:  List of words that are closer to w1 than w2 is to w1. 
Return type:  list (str) 
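The filtering can be sketched with numpy (toy vectors; the real method uses the model's vocabulary):

```python
import numpy as np

def words_closer_than(w1, w2, vectors):
    # All words whose cosine similarity to w1 exceeds that of w2.
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    threshold = cos(vectors[w1], vectors[w2])
    return [w for w, v in vectors.items()
            if w not in (w1, w2) and cos(vectors[w1], v) > threshold]

toy = {
    'a': np.array([1.0, 0.0]),
    'b': np.array([0.8, 0.6]),
    'c': np.array([0.0, 1.0]),
}
closer = words_closer_than('a', 'c', toy)  # only 'b' is closer to 'a' than 'c' is
```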
wv
¶