models.wrappers.wordrank – Word Embeddings from WordRank

Python wrapper around word representation learning from Wordrank. The wrapped model can NOT be updated with new documents for online training – use gensim’s Word2Vec for that.

Example:

>>> model = gensim.models.wrappers.Wordrank.train('/Users/dummy/wordrank', corpus_file='text8', out_name='wr_model')
>>> print(model[word])  # prints the vector for the given word

[1]https://bitbucket.org/shihaoji/wordrank/
[2]https://arxiv.org/pdf/1506.02761v3.pdf

Note that the wrapper might not work in a docker container for large datasets due to memory limits (caused by MPI).

class gensim.models.wrappers.wordrank.Wordrank

Bases: gensim.models.keyedvectors.KeyedVectors

Class for word vector training using Wordrank. Communication between Wordrank and Python takes place through data files on disk: the wrapper calls the Wordrank binary and GloVe's helper binaries (which prepare the training data) with the subprocess module.

accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>, case_insensitive=True)

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines. See questions-words.txt in https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip for an example.
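
The file layout described above can be illustrated with a small parser. This is only a sketch of the format, not the code gensim uses: the function name read_analogy_sections and the sample lines are made up for illustration.

```python
import io


def read_analogy_sections(lines):
    """Group 4-word analogy lines under their ': SECTION NAME' headers."""
    sections = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(": "):
            # A header line opens a new section.
            current = {"section": line[2:], "questions": []}
            sections.append(current)
        else:
            words = line.split()
            if current is None or len(words) != 4:
                raise ValueError("unexpected line: %r" % line)
            current["questions"].append(tuple(words))
    return sections


# Hypothetical sample in the same layout as questions-words.txt.
sample = io.StringIO(
    ": capital-common-countries\n"
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Berlin Germany\n"
    ": family\n"
    "boy girl brother sister\n"
)
sections = read_analogy_sections(sample)
```

Each section keeps its own list of questions, which is what makes per-section accuracy reporting possible.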

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word not in the first restrict_vocab words (default 30,000). This may be meaningful if you've sorted the vocabulary by descending frequency. If case_insensitive is True, the first restrict_vocab words are taken first, and then case normalization is performed.

Use case_insensitive to convert all words in questions and vocab to their uppercase form before evaluating the accuracy (default True). Useful in case of case-mismatch between training tokens and question words. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
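
The interaction between restrict_vocab and case_insensitive can be sketched as follows. This is an assumption-labeled illustration of the order of operations described above (cut first, normalize case second, first occurrence wins); the helper name restricted_vocab and the toy vocabulary are hypothetical.

```python
def restricted_vocab(index2word, restrict_vocab, case_insensitive=True):
    """Map each allowed (optionally upper-cased) word to its vocabulary index."""
    allowed = {}
    # Step 1: keep only the first `restrict_vocab` entries of the vocabulary.
    for index, word in enumerate(index2word[:restrict_vocab]):
        # Step 2: normalize case only after the cut has been made.
        key = word.upper() if case_insensitive else word
        # The first occurrence of a case variant wins.
        allowed.setdefault(key, index)
    return allowed


# Toy vocabulary, assumed to be sorted by descending frequency.
index2word = ["the", "The", "paris", "Paris", "zebra"]
ok_vocab = restricted_vocab(index2word, restrict_vocab=4)
```

Here "zebra" falls outside the first four entries, so any question containing it would be ignored.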

This method corresponds to the compute-accuracy script of the original C word2vec.

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Example:

>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
ensemble_embedding(word_embedding, context_embedding)

Replace syn0 with the sum of context and word embeddings.
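
As a rough illustration of the summation, the sketch below adds word and context vectors element-wise, matching them by word. It uses plain dictionaries and lists rather than the model's arrays, and the toy values are invented.

```python
def ensemble_embedding(word_vectors, context_vectors):
    """Return word -> element-wise sum of its word and context vectors."""
    return {
        word: [w + c for w, c in zip(vec, context_vectors[word])]
        for word, vec in word_vectors.items()
    }


# Toy data; the two files need not list words in the same order.
word_vectors = {"cat": [0.5, 1.0], "dog": [0.25, 2.0]}
context_vectors = {"dog": [0.25, 1.0], "cat": [0.5, 0.5]}
combined = ensemble_embedding(word_vectors, context_vectors)
```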

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Compute correlation of the model with human similarity judgments. pairs is a filename of a dataset where lines are 3-tuples, each consisting of a word pair and a similarity value, separated by delimiter. An example dataset is included in Gensim (test/test_data/wordsim353.tsv). More datasets can be found at http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html or https://www.cl.cam.ac.uk/~fh295/simlex.html.

The model is evaluated using Pearson correlation coefficient and Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself. The results are printed to log and returned as a triple (pearson, spearman, ratio of pairs with unknown words).

Use restrict_vocab to ignore all word pairs containing a word not in the first restrict_vocab words (default 300,000). This may be meaningful if you’ve sorted the vocabulary by descending frequency. If case_insensitive is True, the first restrict_vocab words are taken, and then case normalization is performed.

Use case_insensitive to convert all words in the pairs and vocab to their uppercase form before evaluating the model (default True). Useful when you expect case-mismatch between training tokens and word pairs in the dataset. If there are multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

Use dummy4unknown=True to produce zero-valued similarities for pairs with out-of-vocabulary words. Otherwise (default False), these pairs are skipped entirely.
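
The evaluation described above can be sketched in plain Python. This is a simplified stand-in, not gensim's implementation: the function names, the toy similarity table, and the choice to report the out-of-vocabulary share as a fraction of all pairs are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate_pairs(pairs, similarity, vocab, dummy4unknown=False):
    gold, predicted, oov = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vocab or w2 not in vocab:
            oov += 1
            if dummy4unknown:
                # Unknown pairs count as zero similarity instead of being skipped.
                gold.append(score)
                predicted.append(0.0)
            continue
        gold.append(score)
        predicted.append(similarity(w1, w2))
    return pearson(gold, predicted), spearman(gold, predicted), oov / len(pairs)


# Toy data standing in for a dataset file and a trained model.
pairs = [("cat", "dog", 8.0), ("cat", "car", 2.0), ("dog", "car", 3.0), ("cat", "unicorn", 5.0)]
toy_similarity = {("cat", "dog"): 0.9, ("cat", "car"): 0.1, ("dog", "car"): 0.3}
pearson_r, spearman_r, oov_ratio = evaluate_pairs(
    pairs, lambda a, b: toy_similarity[(a, b)], vocab={"cat", "dog", "car"}
)
```

With dummy4unknown left at False, the pair containing "unicorn" is skipped and only contributes to the unknown-word ratio.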

get_embedding_layer(train_embeddings=False)

Return a Keras 'Embedding' layer with weights set as the Word2Vec model's learned word embeddings.

init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones, which saves a lot of memory.

Note that you cannot continue training after doing a replace. The model becomes effectively read-only: you can call most_similar, similarity, etc., but not train.
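
For readers unfamiliar with the operation, L2 normalization divides each vector by its Euclidean length. The sketch below shows the idea on plain lists; the names syn0 and syn0norm merely mirror the terminology used on this page, and the values are invented.

```python
import math


def l2_normalize(vectors):
    """Return a copy of `vectors` in which every row has unit Euclidean length."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / norm for x in vec])
    return normalized


syn0 = [[3.0, 4.0], [0.0, 2.0]]
syn0norm = l2_normalize(syn0)

# With replace=True, only the normalized rows would be kept:
syn0 = syn0norm
```

After the replacement the original magnitudes are gone, which is why training cannot continue.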

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from the fvocab filename, if set (this is the file generated by the -save-vocab flag of the original C tool).

If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.

unicode_errors, default 'strict', is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.

limit sets a maximum number of word-vectors to read from the file. The default, None, means read all.

datatype (experimental) can coerce dimensions to a non-default float type (such as np.float16) to save memory. (Such types may result in much slower bulk operations or incompatibility with optimized routines.)
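
To make the text (non-binary) layout and the limit parameter concrete, here is a minimal reader. It assumes the usual text layout of a header line with the vocabulary size and vector size followed by one "word v1 v2 ..." line per word. The function name read_word2vec_text is hypothetical, and this is not how gensim itself parses the file.

```python
import io


def read_word2vec_text(stream, limit=None):
    """Read word vectors from a text-format stream, stopping after `limit` words."""
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    if limit is not None:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError("invalid vector line: %r" % parts)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Invented three-word file with two-dimensional vectors.
data = "3 2\nthe 0.1 0.2\nof 0.3 0.4\nand 0.5 0.6\n"
first_two = read_word2vec_text(io.StringIO(data), limit=2)
everything = read_word2vec_text(io.StringIO(data))
```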

classmethod load_wordrank_model(model_file, vocab_file=None, context_file=None, sorted_vocab=1, ensemble=1)
log_accuracy(section)
log_evaluate_word_pairs(pearson, spearman, oov, pairs)
most_similar(positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

If topn is False, most_similar returns the vector of similarity scores.

restrict_vocab is an optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Example:

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
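
The computation described above (cosine similarity against a mean of the query vectors) can be sketched without gensim. The toy three-dimensional vectors and the function names below are invented for illustration; real scores depend on the trained model.

```python
import math


def unit(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, positive=(), negative=(), topn=10):
    # Positive words get weight +1, negative words weight -1.
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            mean[i] += weight * x / len(weighted)
    mean = unit(mean)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # do not return the query words themselves
        cosine = sum(a * b for a, b in zip(mean, unit(vec)))
        scores.append((word, cosine))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Toy vectors: axes loosely stand for "person", "female", "royal".
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.8],
    "apple": [0.0, 1.0, 0.0],
}
result = most_similar(toy, positive=["woman", "king"], negative=["man"], topn=2)
```
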
most_similar_cosmul(positive=[], negative=[], topn=10)

Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg in [4]. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.

In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively – a potentially sensible but untested extension of the method. (With a single positive example, rankings will be the same as in the default most_similar.)

Example:

>>> trained_model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
[(u'iraq', 0.8488819003105164), ...]
[4]Omer Levy and Yoav Goldberg. Linguistic Regularities in Sparse and Explicit Word Representations, 2014.
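
A minimal sketch of the multiplicative objective follows. It assumes cosines are shifted into the range [0, 1] via (1 + cos) / 2 and that a small constant is added to the denominator to avoid division by zero. The toy vectors and function names are invented, and this is not presented as gensim's exact code.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar_cosmul(vectors, positive, negative, topn=10):
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # skip the query words
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine(vec, vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine(vec, vectors[n])) / 2.0
        scores.append((word, numerator / (denominator + 1e-6)))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Toy vectors: axes loosely stand for "person", "female", "royal".
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.8],
    "apple": [0.0, 1.0, 0.0],
}
result = most_similar_cosmul(toy, positive=["woman", "king"], negative=["man"])
```
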
n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Example:

>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689

>>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
1.0000000000000004

>>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
True
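
One way to read the examples above: each word set is reduced to the mean of its vectors, and the two means are compared by cosine similarity. The sketch below follows that reading with invented two-dimensional vectors; the helper names are hypothetical.

```python
import math


def mean_vector(vectors, words):
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def n_similarity(vectors, ws1, ws2):
    """Cosine similarity between the mean vectors of two word sets."""
    return cosine(mean_vector(vectors, ws1), mean_vector(vectors, ws2))


toy = {
    "sushi": [1.0, 0.2],
    "shop": [0.3, 1.0],
    "japanese": [0.9, 0.4],
    "restaurant": [0.4, 0.9],
}
same_sets = n_similarity(toy, ["restaurant", "japanese"], ["japanese", "restaurant"])
single = n_similarity(toy, ["sushi"], ["restaurant"])
```

Because only the mean matters, word order within a set has no effect, and single-word sets reduce to ordinary pairwise similarity.
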
save(*args, **kwargs)
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

fname is the file used to save the vectors in.
fvocab is an optional file used to save the vocabulary.
binary is an optional boolean indicating whether the data is to be saved in binary word2vec format (default: False).
total_vec is an optional parameter to explicitly specify the total number of vectors (in case word vectors are appended with document vectors afterwards).
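
As an illustration of the text (non-binary) layout, the sketch below writes a header line followed by one line per word. The number formatting and the word order are assumptions; the function name write_word2vec_text is hypothetical and the vectors are invented.

```python
import io


def write_word2vec_text(stream, vectors, total_vec=None):
    """Write `vectors` (word -> list of floats) in word2vec text layout."""
    vector_size = len(next(iter(vectors.values())))
    if total_vec is None:
        total_vec = len(vectors)
    # Header: number of vectors, then the dimensionality.
    stream.write("%d %d\n" % (total_vec, vector_size))
    for word, vec in vectors.items():
        stream.write("%s %s\n" % (word, " ".join(repr(x) for x in vec)))


buffer = io.StringIO()
write_word2vec_text(buffer, {"the": [0.1, 0.2], "of": [0.3, 0.4]})
text = buffer.getvalue()
```

Passing total_vec changes only the count in the header, leaving room for vectors appended later.
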
similar_by_vector(vector, topn=10, restrict_vocab=None)

Find the top-N most similar words by vector.

If topn is False, similar_by_vector returns the vector of similarity scores.

restrict_vocab is an optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Example:

>>> trained_model.similar_by_vector([1,2])
[('survey', 0.9942699074745178), ...]
similar_by_word(word, topn=10, restrict_vocab=None)

Find the top-N most similar words.

If topn is False, similar_by_word returns the vector of similarity scores.

restrict_vocab is an optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 word vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)

Example:

>>> trained_model.similar_by_word('graph')
[('user', 0.9999163150787354), ...]
similarity(w1, w2)

Compute cosine similarity between two words.

Example:

>>> trained_model.similarity('woman', 'man')
0.73723527

>>> trained_model.similarity('woman', 'woman')
1.0
sort_embeddings(vocab_file)

Sort embeddings according to word frequency.
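
A simple way to picture the reordering: words and their vectors are permuted together so that the most frequent word comes first. The sketch below assumes frequencies are available as a word-to-count mapping; the names and numbers are illustrative only.

```python
def sort_by_frequency(words, vectors, counts):
    """Reorder `words` and the matching rows of `vectors` by descending count."""
    order = sorted(range(len(words)), key=lambda i: counts[words[i]], reverse=True)
    return [words[i] for i in order], [vectors[i] for i in order]


words = ["zebra", "the", "cat"]
vectors = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
counts = {"the": 1000, "cat": 50, "zebra": 2}
sorted_words, sorted_vectors = sort_by_frequency(words, vectors, counts)
```

A frequency-sorted vocabulary is what makes options such as restrict_vocab meaningful.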

classmethod train(wr_path, corpus_file, out_name, size=100, window=15, symmetric=1, min_count=5, max_vocab_size=0, sgd_num=100, lrate=0.001, period=10, iter=90, epsilon=0.75, dump_period=10, reg=0, alpha=100, beta=99, loss='hinge', memory=4.0, np=1, cleanup_files=False, sorted_vocab=1, ensemble=0)

The word and context embedding files are generated by the wordrank binary and are saved in the "out_name" directory, which is created inside the wordrank directory. The vocab and co-occurrence files are generated using the GloVe code available inside the wordrank directory. These files are used by the wordrank binary for training.

wr_path is the absolute path to the Wordrank directory.
corpus_file is the filename of the text file to be used for training the Wordrank model. The file is expected to contain space-separated tokens in a single line.
out_name is the name of the directory which will be created (in the wordrank folder) to save embeddings and training data. It will contain the following contents:

Word embeddings, saved after every dump_period and stored in a file model_word_currentiter.txt
Context embeddings, saved after every dump_period and stored in a file model_context_currentiter.txt
A meta directory which contains: 'vocab.txt' - vocab words, 'wiki.toy' - word-word co-occurrence values, 'meta' - vocab and co-occurrence lengths

size is the dimensionality of the feature vectors.
window is the number of context words to the left (and to the right, if symmetric = 1).
symmetric: if 0, only use left context words; otherwise use both left and right.
min_count: ignore all words with total frequency lower than this.
max_vocab_size is the upper bound on vocabulary size, i.e. keep the <int> most frequent words. Default is 0 for no limit.
sgd_num is the number of SGD steps taken for each data point.
lrate is the learning rate (too high a value diverges and gives NaN).
period is the period of xi variable updates.
iter is the number of iterations (epochs) over the corpus.
epsilon is the power scaling value for the weighting function.
dump_period is the period after which embeddings should be dumped.
reg is the value of the regularization parameter.
alpha is the alpha parameter of the gamma distribution.
beta is the beta parameter of the gamma distribution.
loss is the name of the loss (logistic, hinge).
memory is the soft limit for memory consumption, in GB.
np is the number of copies to execute (mpirun option).
cleanup_files: if True, delete the directory and files used by this wrapper; setting it to False can be useful for debugging.
sorted_vocab: if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
ensemble: if 1, use the ensemble of word and context vectors; the default is 0.
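
To illustrate what window and symmetric control, the sketch below counts word-context pairs in a token list. It is a deliberately simplified picture: it uses raw counts with no distance weighting, the (word, context) key order is an arbitrary choice, and it is not the GloVe tool that the wrapper actually invokes.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=15, symmetric=1):
    """Count (word, context) pairs within `window` tokens to the left."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            # Left context: tokens[j] appears before `word`.
            counts[(word, tokens[j])] += 1
            if symmetric:
                # Also record `word` as right context of tokens[j].
                counts[(tokens[j], word)] += 1
    return counts


tokens = "a b c".split()
left_only = cooccurrence_counts(tokens, window=1, symmetric=0)
both_sides = cooccurrence_counts(tokens, window=1, symmetric=1)
```

With symmetric=0 each word only sees the tokens before it; with symmetric=1 every pair is recorded in both directions.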

wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents. When using this code, please consider citing the following papers:

Note that if one of the documents has no words that exist in the Word2Vec vocab, float('inf') (i.e. infinity) will be returned.

This method only works if pyemd is installed (can be installed via pip, but requires a C compiler).

Example:

>>> # Train word2vec model.
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences)
>>> # Some sentences to test.
>>> sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
>>> sentence_president = 'The president greets the press in Chicago'.lower().split()
>>> # Remove their stopwords.
>>> from nltk.corpus import stopwords
>>> stop_words = stopwords.words('english')
>>> sentence_obama = [w for w in sentence_obama if w not in stop_words]
>>> sentence_president = [w for w in sentence_president if w not in stop_words]
>>> # Compute WMD.
>>> distance = model.wmdistance(sentence_obama, sentence_president)
word_vec(word, use_norm=False)

Accept a single word as input. Returns the word’s representations in vector space, as a 1D numpy array.

If use_norm is True, returns the normalized word vector.

Example:

>>> trained_model['office']
array([ -1.40128313e-02, ...])
wv