models.poincare – Train and use Poincare embeddings
Python implementation of Poincaré Embeddings.
These embeddings are better at capturing latent hierarchical information than traditional Euclidean embeddings. The method is described in detail in Maximilian Nickel, Douwe Kiela - “Poincaré Embeddings for Learning Hierarchical Representations”.
The main use-case is to automatically learn hierarchical representations of nodes from a tree-like structure, such as a Directed Acyclic Graph (DAG), using a transitive closure of the relations. Representations of nodes in a symmetric graph can also be learned.
This module allows training Poincaré Embeddings from a training file containing the relations of a graph in a csv-like format, or from a Python iterable of relations.
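As a rough illustration of what "a transitive closure of the relations" means, the sketch below expands direct (child, parent) pairs into every implied (descendant, ancestor) pair. The helper name `transitive_closure` is hypothetical and is not part of gensim; the sample relations are made up for the example.

```python
def transitive_closure(relations):
    """Return all (descendant, ancestor) pairs implied by direct (child, parent) pairs."""
    # Map each node to its direct parents.
    parents = {}
    for child, parent in relations:
        parents.setdefault(child, set()).add(parent)

    closure = set()
    for node in parents:
        stack = list(parents[node])
        seen = set()
        # Walk upwards through the hierarchy, collecting every reachable ancestor.
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((node, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure


direct = [('kangaroo', 'marsupial'), ('marsupial', 'mammal')]
closed = transitive_closure(direct)
```

Here `closed` also contains the implied pair `('kangaroo', 'mammal')`, which is the kind of relation list the training examples below pass to the model.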
Examples
Initialize and train a model from a list
>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)
>>> model.train(epochs=50)
Initialize and train a model from a file containing one relation per line
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>> file_path = datapath('poincare_hypernyms.tsv')
>>> model = PoincareModel(PoincareRelations(file_path), negative=2)
>>> model.train(epochs=50)
- class gensim.models.poincare.LexicalEntailmentEvaluation(filepath)¶
Bases:
object
Evaluate lexical entailment on the HyperLex dataset for any embedding.
Initialize evaluation instance with HyperLex text file containing relation pairs.
- Parameters
filepath (str) – Path to HyperLex text file.
- static create_vocab_trie(embedding)¶
Create trie with vocab terms of the given embedding to enable quick prefix searches.
- Parameters
embedding (PoincareKeyedVectors) – Embedding for which trie is to be created.
- Returns
Trie containing vocab terms of the input embedding.
- Return type
pygtrie.Trie
- evaluate_spearman(embedding)¶
Evaluate spearman scores for lexical entailment for given embedding.
- Parameters
embedding (PoincareKeyedVectors) – Embedding for which evaluation is to be done.
- Returns
Spearman correlation score for the task for input embedding.
- Return type
float
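For readers unfamiliar with the metric, a Spearman correlation is a Pearson correlation computed on ranks. The following is a minimal pure-Python sketch of that idea, not gensim's implementation; the function names are hypothetical, and tied values are assumed to receive their average rank.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(gold_scores, predicted_scores):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(gold_scores), average_ranks(predicted_scores))
```

A score near 1 means the predicted scores order the pairs the same way as the gold scores; a score near -1 means the order is reversed.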
- static find_matching_terms(trie, word)¶
Find terms in the trie beginning with the word.
- Parameters
trie (pygtrie.Trie) – Trie to use for finding matching terms.
word (str) – Input word to use for prefix search.
- Returns
List of matching terms.
- Return type
list of str
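The prefix search can be pictured without a trie. The sketch below stands in a sorted list and `bisect` for the `pygtrie.Trie` used by this class; `find_terms_with_prefix` and the vocabulary entries are hypothetical and only illustrate the lookup.

```python
import bisect


def find_terms_with_prefix(sorted_terms, prefix):
    """Return every term in a sorted list that begins with the given prefix."""
    # All terms sharing a prefix form one contiguous run in sorted order,
    # starting at the first position that is not smaller than the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


vocab = sorted(['dog.n.01', 'dog.n.03', 'dogma.n.01', 'cat.n.01'])
matches = find_terms_with_prefix(vocab, 'dog.')
```

Searching for `'dog.'` rather than `'dog'` keeps unrelated words such as `dogma.n.01` out of the result.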
- score_function(embedding, trie, term_1, term_2)¶
Compute predicted score - extent to which term_1 is a type of term_2.
- Parameters
embedding (PoincareKeyedVectors) – Embedding to use for computing predicted score.
trie (pygtrie.Trie) – Trie to use for finding matching vocab terms for input terms.
term_1 (str) – Input term.
term_2 (str) – Input term.
- Returns
Predicted score (the extent to which term_1 is a type of term_2).
- Return type
float
- class gensim.models.poincare.LinkPredictionEvaluation(train_path, test_path, embedding)¶
Bases:
object
Evaluate link prediction on given network for given embedding.
Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated.
- Parameters
train_path (str) – Path to tsv file containing relation pairs used for training.
test_path (str) – Path to tsv file containing relation pairs to evaluate.
embedding (PoincareKeyedVectors) – Embedding to be evaluated.
- evaluate(max_n=None)¶
Evaluate all defined metrics for the link prediction task.
- Parameters
max_n (int, optional) – Maximum number of positive relations to evaluate, all if max_n is None.
- Returns
(metric_name, metric_value) pairs, e.g. {‘mean_rank’: 50.3, ‘MAP’: 0.31}.
- Return type
dict of (str, float)
- evaluate_mean_rank_and_map(max_n=None)¶
Evaluate mean rank and MAP for link prediction.
- Parameters
max_n (int, optional) – Maximum number of positive relations to evaluate, all if max_n is None.
- Returns
(mean_rank, MAP), e.g (50.3, 0.31).
- Return type
tuple (float, float)
- static get_unknown_relation_ranks_and_avg_prec(all_distances, unknown_relations, known_relations)¶
Compute ranks and Average Precision of unknown positive relations.
- Parameters
all_distances (numpy.array of float) – Array of all distances for a specific item.
unknown_relations (list of int) – List of indices of unknown positive relations.
known_relations (list of int) – List of indices of known positive relations.
- Returns
The list contains ranks of the unknown positive relations in the same order as unknown_relations. The float is the Average Precision of the ranking, e.g. ([1, 2, 3, 20], 0.610).
- Return type
tuple (list of int, float)
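One plausible way to compute these quantities is sketched below in pure Python. It assumes that each unknown positive relation is ranked only against negatives (indices that are neither known nor unknown positives) and that Average Precision is taken over the positives in rank order. The function name is hypothetical and the details may differ from gensim's implementation.

```python
def ranks_and_avg_precision(all_distances, unknown_relations, known_relations):
    """Rank unknown positives among negatives and compute Average Precision."""
    excluded = set(unknown_relations) | set(known_relations)
    negatives = [d for i, d in enumerate(all_distances) if i not in excluded]

    # Rank of a positive = 1 + number of negatives that are strictly closer.
    ranks = [
        1 + sum(1 for d in negatives if d < all_distances[i])
        for i in unknown_relations
    ]

    # Position of the k-th best positive in a list holding negatives AND positives:
    # its rank among negatives plus the k positives already placed before it.
    positions = [rank + k for k, rank in enumerate(sorted(ranks))]
    avg_precision = sum((k + 1) / pos for k, pos in enumerate(positions)) / len(positions)
    return ranks, avg_precision


distances = [0.0, 0.5, 0.2, 0.9, 0.3]
ranks, avg_precision = ranks_and_avg_precision(distances, [2, 3], [0])
```

In this toy input, index 2 is closer than both negatives (rank 1) and index 3 is farther than both (rank 3), giving an Average Precision of (1/1 + 2/4) / 2 = 0.75.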
- class gensim.models.poincare.NegativesBuffer(items)¶
Bases:
object
Buffer and return negative samples.
Initialize instance from list or numpy array of samples.
- Parameters
items (list/numpy.array) – List or array containing negative samples.
- get_items(num_items)¶
Get the next num_items from buffer.
- Parameters
num_items (int) – Number of items to fetch.
- Returns
Slice containing num_items items from the original data.
- Return type
numpy.array or list
Notes
No error is raised if fewer than num_items items remain; all the remaining items are simply returned.
- num_items()¶
Get the number of items remaining in the buffer.
- Returns
Number of items in the buffer that haven’t been consumed yet.
- Return type
int
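The documented behaviour of the buffer can be mimicked with a few lines of Python. `SimpleNegativesBuffer` below is a hypothetical stand-in written for illustration, not the gensim class itself.

```python
class SimpleNegativesBuffer:
    """Hand out consecutive slices of a pre-sampled list of negatives."""

    def __init__(self, items):
        self._items = items
        self._current_index = 0

    def num_items(self):
        """Number of items that have not been consumed yet."""
        return len(self._items) - self._current_index

    def get_items(self, num_items):
        """Return the next num_items items, or fewer if the buffer runs out."""
        start = self._current_index
        end = start + num_items
        self._current_index = min(end, len(self._items))
        return self._items[start:end]


buf = SimpleNegativesBuffer([3, 1, 4, 1, 5])
first = buf.get_items(2)             # two items consumed
remaining_before = buf.num_items()   # three items left
rest = buf.get_items(10)             # asks for more than remain; no error
remaining_after = buf.num_items()
```

Callers are expected to check `num_items()` and refill the buffer when it runs low, since a short slice is returned silently.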
- class gensim.models.poincare.PoincareBatch(vectors_u, vectors_v, indices_u, indices_v, regularization_coeff=1.0)¶
Bases:
object
Compute Poincare distances, gradients and loss for a training batch.
Store intermediate state to avoid recomputing multiple times.
Initialize instance with sets of vectors for which distances are to be computed.
- Parameters
vectors_u (numpy.array) – Vectors of all nodes u in the batch. Expected shape (batch_size, dim).
vectors_v (numpy.array) – Vectors of all positively related nodes v and negatively sampled nodes v’, for each node u in the batch. Expected shape (1 + neg_size, dim, batch_size).
indices_u (list of int) – List of node indices for each of the vectors in vectors_u.
indices_v (list of lists of int) – Nested list of lists, each of which is a list of node indices for each of the vectors in vectors_v for a specific node u.
regularization_coeff (float, optional) – Coefficient to use for l2-regularization
- compute_all()¶
Convenience method to perform all computations.
- compute_distance_gradients()¶
Compute and store partial derivatives of poincare distance d(u, v) w.r.t all u and all v.
- compute_distances()¶
Compute and store norms, euclidean distances and poincare distances between input vectors.
- compute_gradients()¶
Compute and store gradients of loss function for all input vectors.
- compute_loss()¶
Compute and store loss value for the given batch of examples.
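To make the quantities concrete, here is a small pure-Python sketch of the Poincaré distance and of a softmax-style loss over one positive and several negative examples, following the formulas in the Nickel and Kiela paper. It handles a single node rather than a batch, ignores `regularization_coeff`, and the function names are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Poincare distance between two points strictly inside the unit ball."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))


def example_loss(u, positive, negatives):
    """Negative log of the softmax weight assigned to the positive example."""
    candidates = [positive] + list(negatives)
    weights = [math.exp(-poincare_distance(u, v)) for v in candidates]
    return -math.log(weights[0] / sum(weights))


u = [0.1, 0.0]
related = [0.2, 0.0]
unrelated = [-0.6, 0.3]
loss_good = example_loss(u, related, [unrelated])   # positive is the nearer point
loss_bad = example_loss(u, unrelated, [related])    # positive is the farther point
```

The loss is smaller when the positively related node is closer to `u` than the negative samples, which is what training tries to achieve.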
- class gensim.models.poincare.PoincareKeyedVectors(vector_size, vector_count, dtype=<class 'numpy.float32'>)¶
Bases:
KeyedVectors
Vectors and vocab for the PoincareModel training class.
Used to perform operations on the vectors such as vector lookup, distance calculations etc.
(May be used to save/load final vectors in the plain word2vec format, via the inherited methods save_word2vec_format() and load_word2vec_format().)
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # Query the trained model.
>>> wv = model.kv.get_vector('kangaroo.n.01')
Mapping between keys (such as words) and vectors for Word2Vec and related models.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.
To support the needs of specific models and other downstream uses, you can also set additional attributes via the set_vecattr() and get_vecattr() methods. Note that all such attributes under the same attr name must have compatible numpy types, as the type and storage array for such attributes is established the first time such an attr is set.
- Parameters
vector_size (int) – Intended number of dimensions for all contained vectors.
count (int, optional) – If provided, vectors will be pre-allocated for at least this many vectors. (Otherwise they can be added later.)
dtype (type, optional) – Vector dimensions will default to np.float32 (AKA REAL in some Gensim code) unless another type is provided here.
mapfile_path (string, optional) – Currently unused.
- __contains__(key)¶
- __getitem__(key_or_keys)¶
Get vector representation of key_or_keys.
- Parameters
key_or_keys ({str, list of str, int, list of int}) – Requested key or list-of-keys.
- Returns
Vector representation for key_or_keys (1D if key_or_keys is single key, otherwise - 2D).
- Return type
numpy.ndarray
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- add_vector(key, vector)¶
Add one new vector at the given key, into existing slot if available.
Warning: using this repeatedly is inefficient, requiring a full reallocation & copy, if this instance hasn’t been preallocated to be ready for such incremental additions.
- Parameters
key (str) – Key identifier of the added vector.
vector (numpy.ndarray) – 1D numpy array with the vector values.
- Returns
Index of the newly added vector, so that self.vectors[result] == vector and self.index_to_key[result] == key.
- Return type
int
- add_vectors(keys, weights, extras=None, replace=False)¶
Append keys and their vectors in a manual way. If some key is already in the vocabulary, the old vector is kept unless replace flag is True.
- Parameters
keys (list of (str or int)) – Keys specified by string or int ids.
weights (list of numpy.ndarray or numpy.ndarray) – List of 1D np.array vectors or a 2D np.array of vectors.
replace (bool, optional) – Flag indicating whether to replace vectors for keys which already exist in the map; if True - replace vectors, otherwise - keep old vectors.
- allocate_vecattrs(attrs=None, types=None)¶
Ensure arrays for given per-vector extra-attribute names & types exist, at right size.
The length of the index_to_key list is the canonical ‘intended size’ of KeyedVectors, even if other properties (such as the vectors array) haven’t yet been allocated or expanded. So this allocation targets that size.
- ancestors(node)¶
Get the list of recursively closest parents from the given node.
- Parameters
node ({str, int}) – Key for node for which ancestors are to be found.
- Returns
Ancestor nodes of the node node.
- Return type
list of str
- closer_than(key1, key2)¶
Get all keys that are closer to key1 than key2 is to key1.
- closest_child(node)¶
Get the node closest to node that is lower in the hierarchy than node.
- Parameters
node ({str, int}) – Key for node for which closest child is to be found.
- Returns
Node closest to node that is lower in the hierarchy than node. If there are no nodes lower in the hierarchy, None is returned.
- Return type
{str, None}
- closest_parent(node)¶
Get the node closest to node that is higher in the hierarchy than node.
- Parameters
node ({str, int}) – Key for node for which closest parent is to be found.
- Returns
Node closest to node that is higher in the hierarchy than node. If there are no nodes higher in the hierarchy, None is returned.
- Return type
{str, None}
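The idea behind closest_child() and closest_parent() can be sketched as follows, assuming (as the norm() documentation states) that a smaller vector norm means a higher position in the hierarchy. The toy vectors and helper names are invented for illustration, and this is not gensim's vectorized implementation.

```python
import math


def poincare_distance(u, v):
    """Poincare distance between two points strictly inside the unit ball."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))


def euclidean_norm(vector):
    return math.sqrt(sum(a * a for a in vector))


def closest_parent(vectors, node):
    """Nearest node with a smaller norm (higher in the hierarchy), or None."""
    node_norm = euclidean_norm(vectors[node])
    candidates = [k for k, v in vectors.items() if euclidean_norm(v) < node_norm]
    if not candidates:
        return None
    return min(candidates, key=lambda k: poincare_distance(vectors[node], vectors[k]))


def closest_child(vectors, node):
    """Nearest node with a larger norm (lower in the hierarchy), or None."""
    node_norm = euclidean_norm(vectors[node])
    candidates = [k for k, v in vectors.items() if euclidean_norm(v) > node_norm]
    if not candidates:
        return None
    return min(candidates, key=lambda k: poincare_distance(vectors[node], vectors[k]))


# Hypothetical 2D embedding: general concepts sit near the origin.
toy = {
    'mammal': [0.1, 0.0],
    'marsupial': [0.4, 0.1],
    'kangaroo': [0.7, 0.2],
    'cat': [-0.5, 0.5],
}
```

Repeatedly applying `closest_parent` from a node yields a chain of ancestors, which is the intuition behind ancestors() and, in the other direction, descendants().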
- static cosine_similarities(vector_1, vectors_all)¶
Compute cosine similarities between one vector and a set of other vectors.
- Parameters
vector_1 (numpy.ndarray) – Vector from which similarities are to be computed, expected shape (dim,).
vectors_all (numpy.ndarray) – For each row in vectors_all, cosine similarity to vector_1 is computed, expected shape (num_vectors, dim).
- Returns
Contains cosine similarity between vector_1 and each row in vectors_all, shape (num_vectors,).
- Return type
numpy.ndarray
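For reference, the computation amounts to a dot product divided by the product of the two Euclidean norms. The sketch below uses plain Python lists instead of numpy arrays and is only meant to show the arithmetic.

```python
import math


def cosine_similarities(vector_1, vectors_all):
    """Cosine similarity between vector_1 and each row of vectors_all."""
    norm_1 = math.sqrt(sum(a * a for a in vector_1))
    similarities = []
    for row in vectors_all:
        dot = sum(a * b for a, b in zip(vector_1, row))
        norm_row = math.sqrt(sum(b * b for b in row))
        similarities.append(dot / (norm_1 * norm_row))
    return similarities


sims = cosine_similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
```

The three rows are parallel, orthogonal and opposite to `vector_1`, so the similarities are 1, 0 and -1 respectively.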
- descendants(node, max_depth=5)¶
Get the list of recursively closest children from the given node, up to a max depth of max_depth.
- Parameters
node ({str, int}) – Key for node for which descendants are to be found.
max_depth (int) – Maximum number of descendants to return.
- Returns
Descendant nodes from the node node.
- Return type
list of str
- difference_in_hierarchy(node_or_vector_1, node_or_vector_2)¶
Compute relative position in hierarchy of node_or_vector_1 relative to node_or_vector_2. A positive value indicates node_or_vector_1 is higher in the hierarchy than node_or_vector_2.
- Parameters
node_or_vector_1 ({str, int, numpy.array}) – Input node key or vector.
node_or_vector_2 ({str, int, numpy.array}) – Input node key or vector.
- Returns
Relative position in hierarchy of node_or_vector_1 relative to node_or_vector_2.
- Return type
float
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> model.kv.difference_in_hierarchy('mammal.n.01', 'dog.n.01')
0.05382517902410999
>>> model.kv.difference_in_hierarchy('dog.n.01', 'mammal.n.01')
-0.05382517902410999
Notes
The returned value can be positive or negative, depending on whether node_or_vector_1 is higher or lower in the hierarchy than node_or_vector_2.
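Since position in the hierarchy is based on the vector norm, with lower norms higher up, the sign convention above is consistent with subtracting the first norm from the second. The sketch below assumes exactly that; the helper name and the toy vectors are hypothetical.

```python
import math


def euclidean_norm(vector):
    return math.sqrt(sum(a * a for a in vector))


def hierarchy_difference(vector_1, vector_2):
    """Positive when vector_1 has the smaller norm, i.e. sits higher in the hierarchy."""
    return euclidean_norm(vector_2) - euclidean_norm(vector_1)


mammal = [0.1, 0.0]   # general concept, close to the origin
dog = [0.6, 0.0]      # more specific concept, closer to the boundary
up = hierarchy_difference(mammal, dog)
down = hierarchy_difference(dog, mammal)
```

Swapping the arguments flips the sign, matching the paired outputs in the example above.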
- distance(w1, w2)¶
Calculate Poincare distance between vectors for nodes w1 and w2.
- Parameters
w1 ({str, int}) – Key for first node.
w2 ({str, int}) – Key for second node.
- Returns
Poincare distance between the vectors for nodes w1 and w2.
- Return type
float
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # What is the distance between the words 'mammal' and 'carnivore'?
>>> model.kv.distance('mammal.n.01', 'carnivore.n.01')
2.9742298803339304
- Raises
KeyError – If either of w1 and w2 is absent from vocab.
- distances(node_or_vector, other_nodes=())¶
Compute Poincare distances from given node_or_vector to all nodes in other_nodes. If other_nodes is empty, return distance between node_or_vector and all nodes in vocab.
- Parameters
node_or_vector ({str, int, numpy.array}) – Node key or vector from which distances are to be computed.
other_nodes ({iterable of str, iterable of int, None}, optional) – For each node in other_nodes distance from node_or_vector is computed. If None or empty, distance of node_or_vector from all nodes in vocab is computed (including itself).
- Returns
Array containing distances to all nodes in other_nodes from input node_or_vector, in the same order as other_nodes.
- Return type
numpy.array
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # Check the distances between a word and a list of other words.
>>> model.kv.distances('mammal.n.01', ['carnivore.n.01', 'dog.n.01'])
array([2.97422988, 2.83007402])
>>> # Check the distances between a word and every other word in the vocab.
>>> all_distances = model.kv.distances('mammal.n.01')
- Raises
KeyError – If either node_or_vector or any node in other_nodes is absent from vocab.
- doesnt_match(words)¶
Which key from the given list doesn’t go with the others?
- Parameters
words (list of str) – List of keys.
- Returns
The key farthest from the mean of all keys.
- Return type
str
- evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False, similarity_function='most_similar')¶
Compute performance of the model on an analogy test set.
The accuracy is reported (printed to log and returned as a score) for each section separately, plus there’s one aggregate summary at the end.
This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).
- Parameters
analogies (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.
restrict_vocab (int, optional) – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
dummy4unknown (bool, optional) – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
similarity_function (str, optional) – Function name used for similarity calculation.
- Returns
score (float) – The overall evaluation score on the entire evaluation set
sections (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.
- evaluate_word_pairs(pairs, delimiter='\t', encoding='utf8', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶
Compute correlation of the model with human similarity judgments.
Notes
More datasets can be found at:
* http://technion.ac.il/~ira.leviant/MultilingualVSMdata.html
* https://www.cl.cam.ac.uk/~fh295/simlex.html
- Parameters
pairs (str) – Path to file, where lines are 3-tuples, each consisting of a word pair and a similarity value. See test/test_data/wordsim353.tsv as example.
delimiter (str, optional) – Separator in pairs file.
restrict_vocab (int, optional) – Ignore all pairs containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).
case_insensitive (bool, optional) – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.
dummy4unknown (bool, optional) – If True - produce zero scores for pairs with out-of-vocabulary words. Otherwise, these pairs are skipped entirely and not used in the evaluation.
- Returns
pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.
spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.
oov_ratio (float) – The ratio of pairs with unknown words.
- fill_norms(force=False)¶
Ensure per-vector norms are available.
Any code which modifies vectors should ensure the accompanying norms are either recalculated or ‘None’, to trigger a full recalculation later on-request.
- get_index(key, default=None)¶
Return the integer index (slot/position) where the given key’s vector is stored in the backing vectors array.
- get_mean_vector(keys, weights=None, pre_normalize=True, post_normalize=False, ignore_missing=True)¶
Get the mean vector for a given list of keys.
- Parameters
keys (list of (str or int or ndarray)) – Keys specified by string or int ids or numpy array.
weights (list of float or numpy.ndarray, optional) – 1D array of same size of keys specifying the weight for each key.
pre_normalize (bool, optional) – Flag indicating whether to normalize each keyvector before taking mean. If False, individual keyvector will not be normalized.
post_normalize (bool, optional) – Flag indicating whether to normalize the final mean vector. If True, the normalized mean vector will be returned.
ignore_missing (bool, optional) – If False, will raise error if a key doesn’t exist in vocabulary.
- Returns
Mean vector for the list of keys.
- Return type
numpy.ndarray
- Raises
ValueError – If the size of the list of keys and weights doesn’t match.
KeyError – If any of the keys doesn’t exist in the vocabulary and ignore_missing is False.
- get_normed_vectors()¶
Get all embedding vectors normalized to unit L2 length (euclidean), as a 2D numpy array.
To see which key corresponds to which vector = which array row, refer to the index_to_key attribute.
- Returns
2D numpy array of shape (number_of_keys, embedding dimensionality), L2-normalized along the rows (key vectors).
- Return type
numpy.ndarray
- get_vecattr(key, attr)¶
Get attribute value associated with given key.
- Parameters
key (str) – Vector key for which to fetch the attribute value.
attr (str) – Name of the additional attribute to fetch for the given key.
- Returns
Value of the additional attribute fetched for the given key.
- Return type
object
- get_vector(key, norm=False)¶
Get the key’s vector, as a 1D numpy array.
- Parameters
key (str) – Key for vector to return.
norm (bool, optional) – If True, the resulting vector will be L2-normalized (unit Euclidean length).
- Returns
Vector for the specified key.
- Return type
numpy.ndarray
- Raises
KeyError – If the given key doesn’t exist.
- has_index_for(key)¶
Can this model return a single index for this key?
Subclasses that synthesize vectors for out-of-vocabulary words (like FastText) may respond True for a simple word in wv (__contains__()) check but False for this more-specific check.
- property index2entity¶
- property index2word¶
- init_sims(replace=False)¶
Precompute data helpful for bulk similarity calculations.
fill_norms() is now preferred for this purpose.
- Parameters
replace (bool, optional) – If True - forget the original vectors and only keep the normalized ones.
Warning
You cannot sensibly continue training after doing a replace on a model’s internal KeyedVectors, and a replace is no longer necessary to save RAM. Do not use this method.
- intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')¶
Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.
No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.
- Parameters
fname (str) – The file path to load the vectors from.
lockf (float, optional) – Lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.
binary (bool, optional) – If True, fname is in the binary word2vec C format.
encoding (str, optional) – Encoding of text for unicode function (python2 only).
unicode_errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
- classmethod load(fname, mmap=None)¶
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)¶
Load KeyedVectors from a file produced by the original C word2vec-tool format.
Warning
The information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
- Parameters
fname (str) – The file path to the saved word2vec-format file.
fvocab (str, optional) – File path to the vocabulary. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).
binary (bool, optional) – If True, the data is in binary word2vec format.
encoding (str, optional) – If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.
unicode_errors (str, optional) – default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.
limit (int, optional) – Sets a maximum number of word-vectors to read from the file. The default, None, means read all.
datatype (type, optional) – (Experimental) Can coerce dimensions to a non-default float type (such as np.float16) to save memory. Such types may result in much slower bulk operations or incompatibility with optimized routines.
no_header (bool, optional) – Default False means a usual word2vec-format file, with a 1st line declaring the count of following vectors & number of dimensions. If True, the file is assumed to lack a declaratory (vocab_size, vector_size) header and instead start with the 1st vector, and an extra reading-pass will be used to discover the number of vectors. Works only with binary=False.
- Returns
Loaded model.
- Return type
KeyedVectors
- static log_accuracy(section)¶
- static log_evaluate_word_pairs(pearson, spearman, oov, pairs)¶
- most_similar(node_or_vector, topn=10, restrict_vocab=None)¶
Find the top-N most similar nodes to the given node or vector, sorted in increasing order of distance.
- Parameters
node_or_vector ({str, int, numpy.array}) – node key or vector for which similar nodes are to be found.
topn (int or None, optional) – Number of top-N similar nodes to return, when topn is int. When topn is None, distances for all nodes are returned.
restrict_vocab (int or None, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 node vectors in the vocabulary order. This may be meaningful if vocabulary is sorted by descending frequency.
- Returns
When topn is int, a sequence of (node, distance) is returned in increasing order of distance. When topn is None, distances for all nodes are returned as a one-dimensional numpy array with the size of the vocabulary.
- Return type
list of (str, float) or numpy.array
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # Which words are most similar to 'kangaroo'?
>>> model.kv.most_similar('kangaroo.n.01', topn=2)
[(u'kangaroo.n.01', 0.0), (u'marsupial.n.01', 0.26524229460827725)]
- most_similar_cosmul(positive=None, negative=None, topn=10, restrict_vocab=None)¶
Find the top-N most similar words, using the multiplicative combination objective, proposed by Omer Levy and Yoav Goldberg “Linguistic Regularities in Sparse and Explicit Word Representations”. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation. In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.
Additional positive or negative examples contribute to the numerator or denominator, respectively - a potentially sensible but untested extension of the method. With a single positive example, rankings will be the same as in the default most_similar().
Allows calls like most_similar_cosmul(‘dog’, ‘cat’), as a shorthand for most_similar_cosmul([‘dog’], [‘cat’]) where ‘dog’ is positive and ‘cat’ negative.
- Parameters
positive (list of str, optional) – List of words that contribute positively.
negative (list of str, optional) – List of words that contribute negatively.
topn (int or None, optional) – Number of top-N similar words to return, when topn is int. When topn is None, then similarities for all words are returned.
restrict_vocab (int or None, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 node vectors in the vocabulary order. This may be meaningful if vocabulary is sorted by descending frequency.
- Returns
When topn is int, a sequence of (word, similarity) is returned. When topn is None, then similarities for all words are returned as a one-dimensional numpy array with the size of the vocabulary.
- Return type
list of (str, float) or numpy.array
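The multiplicative objective can be written out for a single candidate vector as below. This is a sketch under stated assumptions: cosine similarities are shifted from [-1, 1] to [0, 1] before multiplying, and a small `epsilon` guards against division by zero. The function names, the `epsilon` default and the toy vectors are illustrative rather than taken from gensim.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosmul_score(candidate, positive, negative, epsilon=1e-6):
    """Product of shifted similarities to positives over that of negatives."""
    numerator = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    denominator = math.prod((1 + cosine(candidate, n)) / 2 for n in negative)
    return numerator / (denominator + epsilon)


positive = [[1.0, 0.0], [0.8, 0.6]]
negative = [[0.0, 1.0]]
good_score = cosmul_score([1.0, -0.2], positive, negative)  # near positives, away from negative
bad_score = cosmul_score([0.0, 1.0], positive, negative)    # identical to the negative
```

Ranking all vocabulary vectors by this score and keeping the top `topn` gives the kind of result list described above.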
- most_similar_to_given(key1, keys_list)¶
Get the key from keys_list most similar to key1.
- n_similarity(ws1, ws2)¶
Compute cosine similarity between two sets of keys.
- Parameters
ws1 (list of str) – Sequence of keys.
ws2 (list of str) – Sequence of keys.
- Returns
Similarities between ws1 and ws2.
- Return type
numpy.ndarray
- norm(node_or_vector)¶
Compute absolute position in hierarchy of input node or vector. Values range between 0 and 1. A lower value indicates the input node or vector is higher in the hierarchy.
- Parameters
node_or_vector ({str, int, numpy.array}) – Input node key or vector for which position in hierarchy is to be returned.
- Returns
Absolute position in the hierarchy of the input vector or node.
- Return type
float
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # Get the norm of the embedding of the word `mammal`.
>>> model.kv.norm('mammal.n.01')
0.6423008703542398
Notes
The position in hierarchy is based on the norm of the vector for the node.
- rank(key1, key2)¶
Rank of the distance of key2 from key1, in relation to distances of all keys from key1.
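The relationship between rank() and closer_than() can be sketched from a precomputed mapping of distances from key1. The helper names and the distances are hypothetical; the sketch assumes the rank is one more than the number of keys strictly closer to key1 than key2 is, with key1 itself excluded.

```python
def closer_than(distances_from_key1, key1, key2):
    """Keys strictly closer to key1 than key2 is, excluding key1 itself."""
    threshold = distances_from_key1[key2]
    return [
        key for key, dist in distances_from_key1.items()
        if dist < threshold and key != key1
    ]


def rank(distances_from_key1, key1, key2):
    """1-based rank of key2 among all other keys, ordered by distance from key1."""
    return len(closer_than(distances_from_key1, key1, key2)) + 1


# Made-up distances from 'mammal' to every key, including itself.
distances = {'mammal': 0.0, 'carnivore': 1.2, 'dog': 2.0, 'kangaroo': 1.5}
closer = closer_than(distances, 'mammal', 'dog')
dog_rank = rank(distances, 'mammal', 'dog')
```

With these numbers, two keys are closer to `mammal` than `dog` is, so `dog` has rank 3.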
- rank_by_centrality(words, use_norm=True)¶
Rank the given words by similarity to the centroid of all the words.
- Parameters
words (list of str) – List of keys.
use_norm (bool, optional) – Whether to calculate centroid using unit-normed vectors; default True.
- Returns
Ranked list of (similarity, key), most-similar to the centroid first.
- Return type
list of (float, str)
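One plausible reading of this ranking is sketched below in plain Python: unit-normalize the vectors when use_norm is True, average them into a centroid, and sort the keys by their dot product with the normalized centroid. The function name `rank_by_centroid` and the toy vectors are assumptions, and the library's exact computation may differ.

```python
import math


def unit(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_by_centroid(vectors_by_key, use_norm=True):
    """Return (similarity, key) pairs, most similar to the centroid first."""
    vecs = {k: (unit(v) if use_norm else list(v)) for k, v in vectors_by_key.items()}
    dim = len(next(iter(vecs.values())))
    centroid = unit([sum(v[i] for v in vecs.values()) / len(vecs) for i in range(dim)])
    scored = [(sum(a * b for a, b in zip(v, centroid)), k) for k, v in vecs.items()]
    return sorted(scored, reverse=True)


# Toy vectors, invented for illustration.
toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [-0.2, 1.0]}
ranked = rank_by_centroid(toy)
```

The key that points in a very different direction from the others ends up last in the ranking.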
- relative_cosine_similarity(wa, wb, topn=10)¶
Compute the relative cosine similarity between two words given top-n similar words, as proposed by Artuur Leeuwenberg, Mihaela Vela, Jon Dehdari and Josef van Genabith in “A Minimally Supervised Approach for Synonym Extraction with Word Embeddings”.
To calculate relative cosine similarity between two words, equation (1) of the paper is used. For WordNet synonyms, if rcs(topn=10) is greater than 0.10 then wa and wb are more similar than any arbitrary word pairs.
- Parameters
wa (str) – Word for which the top-n most similar words are looked up.
wb (str) – Word whose relative cosine similarity with wa is evaluated.
topn (int, optional) – Number of most similar words to consider with respect to wa.
- Returns
Relative cosine similarity between wa and wb.
- Return type
numpy.float64
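The quantity can be sketched as the cosine similarity of wa and wb divided by the sum of the cosine similarities between wa and its top-n neighbours. The snippet below is a minimal pure-Python version over a toy dictionary of vectors; the function names and the data are assumptions for illustration, not the library's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) normalized by the summed similarity of wa's top-n neighbours."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for key, vec in vectors.items() if key != wa),
        reverse=True,
    )
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims[:topn])


# Toy vectors, invented for illustration.
toy = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [1.0, 0.5]}
rcs_ab = relative_cosine_similarity(toy, "a", "b", topn=2)
rcs_ad = relative_cosine_similarity(toy, "a", "d", topn=2)
```

By construction, the values for the top-n neighbours of wa sum to 1, which is why a fixed threshold such as 0.10 for topn=10 is meaningful.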
- resize_vectors(seed=0)¶
Make underlying vectors match index_to_key size; random-initialize any new rows.
- save(*args, **kwargs)¶
Save KeyedVectors to a file.
- Parameters
fname_or_handle (str) – Path to the output file.
See also
load()
Load a previously saved model.
- save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')¶
Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
- Parameters
fname (str) – File path to save the vectors to.
fvocab (str, optional) – File path to save additional vocabulary information to. None to not store the vocabulary.
binary (bool, optional) – If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
total_vec (int, optional) – Explicitly specify total number of vectors (in case word vectors are appended with document vectors afterwards).
write_header (bool, optional) – If False, don’t write the 1st line declaring the count of vectors and dimensions. This is the format used by e.g. GloVe vectors.
prefix (str, optional) – String to prepend in front of each stored word. Default = no prefix.
append (bool, optional) – If set, open fname in ab mode instead of the default wb mode.
sort_attr (str, optional) – Sort the output vectors in descending order of this attribute. Default: most frequent keys first.
- set_vecattr(key, attr, val)¶
Set attribute associated with the given key to value.
- Parameters
key (str) – Store the attribute for this vector key.
attr (str) – Name of the additional attribute to store for the given key.
val (object) – Value of the additional attribute to store for the given key.
- Return type
None
- similar_by_key(key, topn=10, restrict_vocab=None)¶
Find the top-N most similar keys.
- Parameters
key (str) – Key
topn (int or None, optional) – Number of top-N similar keys to return. If topn is None, similar_by_key returns the vector of similarity scores.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
- Returns
When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.
- Return type
list of (str, float) or numpy.array
- similar_by_vector(vector, topn=10, restrict_vocab=None)¶
Find the top-N most similar keys by vector.
- Parameters
vector (numpy.array) – Vector from which similarities are to be computed.
topn (int or None, optional) – Number of top-N similar keys to return, when topn is int. When topn is None, then similarities for all keys are returned.
restrict_vocab (int, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 key vectors in the vocabulary order. (This may be meaningful if you’ve sorted the vocabulary by descending frequency.)
- Returns
When topn is int, a sequence of (key, similarity) is returned. When topn is None, then similarities for all keys are returned as a one-dimensional numpy array with the size of the vocabulary.
- Return type
list of (str, float) or numpy.array
- similar_by_word(word, topn=10, restrict_vocab=None)¶
Compatibility alias for similar_by_key().
- similarity(w1, w2)¶
Compute similarity based on Poincare distance between vectors for nodes w1 and w2.
- Parameters
w1 ({str, int}) – Key for first node.
w2 ({str, int}) – Key for second node.
- Returns
Similarity between the vectors for nodes w1 and w2 (between 0 and 1).
- Return type
float
Examples
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>>
>>> # Read the sample relations file and train the model
>>> relations = PoincareRelations(file_path=datapath('poincare_hypernyms_large.tsv'))
>>> model = PoincareModel(train_data=relations)
>>> model.train(epochs=50)
>>>
>>> # What is the similarity between the words 'mammal' and 'carnivore'?
>>> model.kv.similarity('mammal.n.01', 'carnivore.n.01')
0.25162107631176484
- Raises
KeyError – If either of w1 and w2 is absent from vocab.
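One simple way to turn a non-negative distance into a similarity in the range described above is 1 / (1 + distance). The sketch below assumes that mapping; the function name is hypothetical and the distances are arbitrary sample values.

```python
def similarity_from_distance(distance):
    """Map a non-negative distance to a similarity in (0, 1]; identical points give 1."""
    return 1.0 / (1.0 + distance)


# Arbitrary sample distances, for illustration only.
similarities = [similarity_from_distance(d) for d in (0.0, 1.0, 3.0)]
```

Under this mapping, larger Poincare distances give similarities closer to 0, and a distance of 0 gives exactly 1.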
- similarity_unseen_docs(*args, **kwargs)¶
- sort_by_descending_frequency()¶
Sort the vocabulary so the most frequent words have the lowest indexes.
- unit_normalize_all()¶
Destructively scale all vectors to unit-length.
You cannot sensibly continue training after such a step.
- static vector_distance(vector_1, vector_2)¶
Compute poincare distance between two input vectors. Convenience method over vector_distance_batch.
- Parameters
vector_1 (numpy.array) – Input vector.
vector_2 (numpy.array) – Input vector.
- Returns
Poincare distance between vector_1 and vector_2.
- Return type
numpy.float
- static vector_distance_batch(vector_1, vectors_all)¶
Compute poincare distances between one vector and a set of other vectors.
- Parameters
vector_1 (numpy.array) – vector from which Poincare distances are to be computed, expected shape (dim,).
vectors_all (numpy.array) – for each row in vectors_all, distance from vector_1 is computed, expected shape (num_vectors, dim).
- Returns
Poincare distance between vector_1 and each row in vectors_all, shape (num_vectors,).
- Return type
numpy.array
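For reference, the Poincare distance between points u and v in the open unit ball is arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))). The following is a minimal pure-Python sketch of that formula and of a batch variant; the function names are assumptions, and the library itself operates on numpy arrays.

```python
import math


def poincare_distance(u, v):
    """Poincare distance between two points strictly inside the unit ball."""
    squared_euclidean = sum((a - b) ** 2 for a, b in zip(u, v))
    u_sq = sum(a * a for a in u)
    v_sq = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * squared_euclidean / ((1.0 - u_sq) * (1.0 - v_sq)))


def poincare_distance_batch(u, vectors_all):
    """Distance from u to each row of vectors_all, in row order."""
    return [poincare_distance(u, row) for row in vectors_all]


origin = [0.0, 0.0]
distances = poincare_distance_batch(origin, [[0.0, 0.0], [0.5, 0.0], [0.9, 0.0]])
```

Distances grow quickly as points approach the boundary of the ball: from the origin, the point at radius 0.5 is at distance ln 3, while the point at radius 0.9 is considerably farther than its Euclidean offset suggests.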
- vectors_for_all(keys: Iterable, allow_inference: bool = True, copy_vecattrs: bool = False) KeyedVectors ¶
Produce vectors for all given keys as a new KeyedVectors object.
Notes
The keys will always be deduplicated. For optimal performance, you should not pass entire corpora to the method. Instead, you should construct a dictionary of unique words in your corpus:
>>> from collections import Counter
>>> import itertools
>>>
>>> from gensim.models import FastText
>>> from gensim.test.utils import datapath, common_texts
>>>
>>> model_corpus_file = datapath('lee_background.cor')  # train word vectors on some corpus
>>> model = FastText(corpus_file=model_corpus_file, vector_size=20, min_count=1)
>>> corpus = common_texts  # infer word vectors for words from another corpus
>>> word_counts = Counter(itertools.chain.from_iterable(corpus))  # count words in your corpus
>>> words_by_freq = (k for k, v in word_counts.most_common())
>>> word_vectors = model.wv.vectors_for_all(words_by_freq)  # create word-vectors for words in your corpus
- Parameters
keys (iterable) – The keys that will be vectorized.
allow_inference (bool, optional) – In subclasses such as FastTextKeyedVectors, vectors for out-of-vocabulary keys (words) may be inferred. Default is True.
copy_vecattrs (bool, optional) – Additional attributes set via the KeyedVectors.set_vecattr() method will be preserved in the produced KeyedVectors object. Default is False. To ensure that all the produced vectors will have vector attributes assigned, you should set allow_inference=False.
- Returns
keyedvectors – Vectors for all the given keys.
- Return type
KeyedVectors
- property vectors_norm¶
- property vocab¶
- wmdistance(document1, document2, norm=True)¶
Compute the Word Mover’s Distance between two documents.
When using this code, please consider citing the following papers:
- Parameters
document1 (list of str) – Input document.
document2 (list of str) – Input document.
norm (boolean) – Normalize all word vectors to unit length before computing the distance? Defaults to True.
- Returns
Word Mover’s distance between document1 and document2.
- Return type
float
Warning
This method only works if POT is installed.
If one of the documents has no words that exist in the vocab, float(‘inf’) (i.e. infinity) will be returned.
- Raises
ImportError –
If POT isn’t installed.
- word_vec(*args, **kwargs)¶
Compatibility alias for get_vector(); must exist so subclass calls reach subclass get_vector().
- words_closer_than(word1, word2)¶
- class gensim.models.poincare.PoincareModel(train_data, size=50, alpha=0.1, negative=10, workers=1, epsilon=1e-05, regularization_coeff=1.0, burn_in=10, burn_in_alpha=0.01, init_range=(-0.001, 0.001), dtype=<class 'numpy.float64'>, seed=0)¶
Bases:
SaveLoad
Train, use and evaluate Poincare Embeddings.
The model can be stored/loaded via its save() and load() methods, or stored/loaded in the word2vec format via model.kv.save_word2vec_format and load_word2vec_format().
Notes
Training cannot be resumed from a model loaded via load_word2vec_format. If you wish to train further, use the save() and load() methods instead.
An important attribute (that provides a lot of additional functionality when directly accessed) is the keyed vectors:
- self.kv
PoincareKeyedVectors
This object essentially contains the mapping between nodes and embeddings, as well as the vocabulary of the model (set of unique nodes seen by the model). After training, it can be used to perform operations on the vectors such as vector lookup, distance and similarity calculations etc. See the documentation of its class for usage examples.
Initialize and train a Poincare embedding model from an iterable of relations.
- Parameters
train_data ({iterable of (str, str), gensim.models.poincare.PoincareRelations}) – Iterable of relations, e.g. a list of tuples, or a gensim.models.poincare.PoincareRelations instance streaming from a file. Note that the relations are treated as ordered pairs, i.e. a relation (a, b) does not imply the opposite relation (b, a). In case the relations are symmetric, the data should contain both relations (a, b) and (b, a).
size (int, optional) – Number of dimensions of the trained model.
alpha (float, optional) – Learning rate for training.
negative (int, optional) – Number of negative samples to use.
workers (int, optional) – Number of threads to use for training the model.
epsilon (float, optional) – Constant used for clipping embeddings below a norm of one.
regularization_coeff (float, optional) – Coefficient used for l2-regularization while training (0 effectively disables regularization).
burn_in (int, optional) – Number of epochs to use for burn-in initialization (0 means no burn-in).
burn_in_alpha (float, optional) – Learning rate for burn-in initialization, ignored if burn_in is 0.
init_range (2-tuple (float, float)) – Range within which the vectors are randomly initialized.
dtype (numpy.dtype) – The numpy dtype to use for the vectors in the model (numpy.float64, numpy.float32 etc). Using lower precision floats may be useful in increasing training speed and reducing memory usage.
seed (int, optional) – Seed for random to ensure reproducibility.
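Because relations are treated as ordered pairs, symmetric data has to list both directions explicitly. The helper below is a minimal sketch of one way to prepare such a list before passing it as train_data; the name `symmetrize` and the sample pairs are assumptions for illustration.

```python
def symmetrize(relations):
    """Return the relations plus their reversed pairs, without duplicates, in first-seen order."""
    seen = set()
    result = []
    for u, v in relations:
        for pair in ((u, v), (v, u)):
            if pair not in seen:
                seen.add(pair)
                result.append(pair)
    return result


# Sample undirected edges, invented for illustration.
edges = [("a", "b"), ("b", "a"), ("c", "d")]
symmetric_relations = symmetrize(edges)
```

The resulting list is an ordinary iterable of (str, str) tuples, which is the form the train_data parameter describes.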
Examples
Initialize a model from a list:
>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)
Initialize a model from a file containing one relation per line:
>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>> file_path = datapath('poincare_hypernyms.tsv')
>>> model = PoincareModel(PoincareRelations(file_path), negative=2)
See PoincareRelations for more options.
- add_lifecycle_event(event_name, log_level=20, **event)¶
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
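To make the shape of such an event concrete, the sketch below builds a JSON-serializable record with the automatically added keys listed above. The helper `make_event` is hypothetical, and the library version is passed in as a plain string rather than read from an installed package.

```python
import json
import platform
import sys
from datetime import datetime


def make_event(event_name, library_version, **event):
    """Build an event record: caller-supplied fields plus the automatically added keys."""
    record = dict(event)
    record.update({
        "datetime": datetime.now().isoformat(),
        "gensim": library_version,  # assumption: supplied by the caller in this sketch
        "python": sys.version,
        "platform": platform.platform(),
        "event": event_name,
    })
    return record


lifecycle_events = []
lifecycle_events.append(make_event("created", "unknown", params="negative=2"))
serialized = json.dumps(lifecycle_events)  # works because every value is a plain string
```

Keeping the values to simple types such as strings and numbers is what allows the list of events to survive serialization unchanged.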
- build_vocab(relations, update=False)¶
Build the model’s vocabulary from known relations.
- Parameters
relations ({iterable of (str, str), gensim.models.poincare.PoincareRelations}) – Iterable of relations, e.g. a list of tuples, or a gensim.models.poincare.PoincareRelations instance streaming from a file. Note that the relations are treated as ordered pairs, i.e. a relation (a, b) does not imply the opposite relation (b, a). In case the relations are symmetric, the data should contain both relations (a, b) and (b, a).
update (bool, optional) – If true, only the embeddings of new nodes are initialized. Use this when the model already has an existing vocabulary and you want to update it. If false, the embeddings of all nodes are initialized. Use this when you’re creating a new vocabulary from scratch.
Examples
Train a model and update vocab for online training:
>>> from gensim.models.poincare import PoincareModel
>>>
>>> # train a new model from initial data
>>> initial_relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal')]
>>> model = PoincareModel(initial_relations, negative=1)
>>> model.train(epochs=50)
>>>
>>> # online training: update the vocabulary and continue training
>>> online_relations = [('striped_skunk', 'mammal')]
>>> model.build_vocab(online_relations, update=True)
>>> model.train(epochs=50)
- classmethod load(*args, **kwargs)¶
Load model from disk, inherited from SaveLoad.
See also
- Parameters
- Returns
The loaded model.
- Return type
- train(epochs, batch_size=10, print_every=1000, check_gradients_every=None)¶
Train Poincare embeddings using loaded data and model parameters.
- Parameters
epochs (int) – Number of iterations (epochs) over the corpus.
batch_size (int, optional) – Number of examples to train on in a single batch.
print_every (int, optional) – Prints progress and average loss after every print_every batches.
check_gradients_every (int or None, optional) – Compares computed gradients and autograd gradients after every check_gradients_every batches. Useful for debugging, doesn’t compare by default.
Examples
>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)
>>> model.train(epochs=50)
- class gensim.models.poincare.PoincareRelations(file_path, encoding='utf8', delimiter='\t')¶
Bases:
object
Stream relations for PoincareModel from a tsv-like file.
Initialize instance from file containing a pair of nodes (a relation) per line.
- Parameters
file_path (str) –
Path to file containing a pair of nodes (a relation) per line, separated by delimiter. Since the relations are asymmetric, the order of u and v nodes in each pair matters. To express a “u is v” relation, each line should take the form u delimiter v, e.g. kangaroo mammal is a tab-delimited line expressing a “kangaroo is a mammal” relation.
For a full input file example, see gensim/test/test_data/poincare_hypernyms.tsv.
encoding (str, optional) – Character encoding of the input file.
delimiter (str, optional) – Delimiter character for each relation.
- __iter__()¶
Stream relations from self.file_path decoded into unicode strings.
- Yields
(unicode, unicode) – Relation from input file.
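The streaming behaviour described here can be approximated with the standard csv module. The generator below is a minimal sketch, not the class's actual code: the name `stream_relations` is hypothetical, and skipping rows that do not contain exactly two fields is an assumption made for this example.

```python
import csv
import os
import tempfile


def stream_relations(file_path, encoding="utf8", delimiter="\t"):
    """Yield one (u, v) tuple per well-formed line of a delimiter-separated file."""
    with open(file_path, encoding=encoding, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) == 2:  # assumption: silently skip malformed lines
                yield tuple(row)


# Write a tiny tab-separated file and stream it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "relations.tsv")
    with open(path, "w", encoding="utf8", newline="") as handle:
        handle.write("kangaroo\tmammal\nkangaroo\tmarsupial\n")
    relations = list(stream_relations(path))
```

Because the generator reads lazily, only one line needs to be held in memory at a time, which matters for large relation files.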
- class gensim.models.poincare.ReconstructionEvaluation(file_path, embedding)¶
Bases:
object
Evaluate reconstruction on given network for given embedding.
Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated.
- Parameters
file_path (str) – Path to tsv file containing relation pairs.
embedding (PoincareKeyedVectors) – Embedding to be evaluated.
- evaluate(max_n=None)¶
Evaluate all defined metrics for the reconstruction task.
- Parameters
max_n (int, optional) – Maximum number of positive relations to evaluate, all if max_n is None.
- Returns
(metric_name, metric_value) pairs, e.g. {‘mean_rank’: 50.3, ‘MAP’: 0.31}.
- Return type
dict of (str, float)
- evaluate_mean_rank_and_map(max_n=None)¶
Evaluate mean rank and MAP for reconstruction.
- Parameters
max_n (int, optional) – Maximum number of positive relations to evaluate, all if max_n is None.
- Returns
(mean_rank, MAP), e.g. (50.3, 0.31).
- Return type
(float, float)
- static get_positive_relation_ranks_and_avg_prec(all_distances, positive_relations)¶
Compute ranks and Average Precision of positive relations.
- Parameters
all_distances (numpy.array of float) – Array of all distances (floats) for a specific item.
positive_relations (list) – List of indices of positive relations for the item.
- Returns
The list contains ranks of positive relations in the same order as positive_relations. The float is the Average Precision of the ranking, e.g. ([1, 2, 3, 20], 0.610).
- Return type
(list of int, float)
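The computation can be sketched in plain Python under one assumption: the rank of a positive relation is 1 plus the number of negative relations that are closer to the item, so other positives do not count against it. This reading reproduces the 0.610 figure in the example above; the function names are hypothetical.

```python
def average_precision(ranks):
    """Average Precision from ranks that were computed against negatives only."""
    # Re-insert the positives that precede each one to get its rank among all items.
    full_ranks = [rank + offset for offset, rank in enumerate(sorted(ranks))]
    return sum((k + 1) / rank for k, rank in enumerate(full_ranks)) / len(full_ranks)


def ranks_and_avg_precision(all_distances, positive_relations):
    """Return (ranks of the positives in input order, Average Precision)."""
    positives = set(positive_relations)
    negatives = [d for i, d in enumerate(all_distances) if i not in positives]
    ranks = [1 + sum(1 for d in negatives if d < all_distances[i]) for i in positive_relations]
    return ranks, average_precision(ranks)


# Toy distances, invented for illustration: indices 0 and 3 are the positive relations.
ranks, avg_prec = ranks_and_avg_precision([0.1, 0.5, 0.3, 0.9, 0.2], [0, 3])
doc_example = average_precision([1, 2, 3, 20])
```

In the toy data, the first positive is closer than every negative (rank 1) and the second is farther than all three negatives (rank 4), giving an Average Precision of (1/1 + 2/5) / 2 = 0.7.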