models.poincare – Train and use Poincare embeddings

Python implementation of Poincaré Embeddings [1], which capture latent hierarchical information better than traditional Euclidean embeddings. The method is described in more detail in [1].

The main use-case is to automatically learn hierarchical representations of nodes from a tree-like structure, such as a Directed Acyclic Graph, using a transitive closure of the relations. Representations of nodes in a symmetric graph can also be learned, using an iterable of the direct relations in the graph.
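The transitive closure mentioned above can be illustrated with a small sketch. It is not part of gensim: the helper name `transitive_closure` is hypothetical, and it assumes relations are given as (child, parent) pairs forming an acyclic graph.

```python
def transitive_closure(direct_relations):
    """Return every (node, ancestor) pair implied by the direct relations."""
    # Map each node to the set of its direct parents.
    parents = {}
    for child, parent in direct_relations:
        parents.setdefault(child, set()).add(parent)

    closure = set()
    for node in parents:
        stack = list(parents[node])
        seen = set()
        # Walk upwards from the node, collecting every reachable ancestor.
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((node, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure


direct = [('kangaroo', 'marsupial'), ('marsupial', 'mammal')]
closure = transitive_closure(direct)
```

Here `closure` contains the two direct relations plus the implied pair `('kangaroo', 'mammal')`; a set like this could then be passed to the model as training relations.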

This module allows training a Poincaré Embedding from a training file containing the relations of a graph in a CSV-like format, or from a Python iterable of relations.

[1] Maximilian Nickel, Douwe Kiela - “Poincaré Embeddings for Learning Hierarchical Representations” https://arxiv.org/abs/1705.08039

Examples

Initialize and train a model from a list:

>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)
>>> model.train(epochs=50)

Initialize and train a model from a file containing one relation per line:

>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>> file_path = datapath('poincare_hypernyms.tsv')
>>> model = PoincareModel(PoincareRelations(file_path), negative=2)
>>> model.train(epochs=50)
class gensim.models.poincare.LexicalEntailmentEvaluation(filepath)

Bases: object

Evaluate lexical entailment on the HyperLex dataset for any embedding.

Initialize evaluation instance with HyperLex text file containing relation pairs.

Parameters:filepath (str) – Path to HyperLex text file.
static create_vocab_trie(embedding)

Create trie with vocab terms of the given embedding to enable quick prefix searches.

Parameters:embedding (PoincareKeyedVectors instance) – Embedding for which trie is to be created.
Returns:Trie containing vocab terms of the input embedding.
Return type:pygtrie.Trie instance
evaluate_spearman(embedding)

Evaluate spearman scores for lexical entailment for given embedding.

Parameters:embedding (PoincareKeyedVectors instance) – Embedding for which evaluation is to be done.
Returns:Spearman correlation score for the task for input embedding.
Return type:float
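For readers unfamiliar with the metric, the following is a minimal standard-library sketch of a Spearman rank correlation (Pearson correlation computed on ranks, with ties given their average rank). It is not gensim's implementation; the function names and the sample scores are made up for illustration, and the sketch assumes neither input is constant.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


gold_scores = [9.5, 2.0, 7.1, 4.4]        # hypothetical human ratings
predicted_scores = [0.8, -0.3, 0.5, 0.1]  # hypothetical model scores
score = spearman(gold_scores, predicted_scores)
```

Because both lists order the four pairs identically, the correlation in this toy case is 1.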
static find_matching_terms(trie, word)

Given a trie and a word, find terms in the trie beginning with the word.

Parameters:
  • trie (pygtrie.Trie instance) – Trie to use for finding matching terms.
  • word (str) – Input word to use for prefix search.
Returns:

List of matching terms.

Return type:

list (str)
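The prefix search described above can be approximated without a trie. The sketch below swaps in a sorted list and binary search from the standard library instead of pygtrie; `find_prefix_matches` and the sample vocabulary are hypothetical names and data used only for illustration.

```python
import bisect


def find_prefix_matches(sorted_terms, prefix):
    """Return all terms in a sorted list that begin with prefix."""
    # Every term sharing the prefix sits in one contiguous run that
    # starts at the insertion point of the prefix itself.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


vocab = sorted(['dog.n.01', 'dog.n.03', 'dogma.n.01', 'cat.n.01'])
matches = find_prefix_matches(vocab, 'dog.n.')
```

With this vocabulary, `matches` holds the two `dog.n.*` entries and skips `dogma.n.01`.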

score_function(embedding, trie, term_1, term_2)

Given an embedding and two terms, return the predicted score for them - extent to which term_1 is a type of term_2.

Parameters:
  • embedding (PoincareKeyedVectors instance) – Embedding to use for computing predicted score.
  • trie (pygtrie.Trie instance) – Trie to use for finding matching vocab terms for input terms.
  • term_1 (str) – Input term.
  • term_2 (str) – Input term.
Returns:

Predicted score (the extent to which term_1 is a type of term_2).

Return type:

float

class gensim.models.poincare.LinkPredictionEvaluation(train_path, test_path, embedding)

Bases: object

Evaluate link prediction on a given network for a given embedding.

Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated.

Parameters:
  • train_path (str) – Path to tsv file containing relation pairs used for training.
  • test_path (str) – Path to tsv file containing relation pairs to evaluate.
  • embedding (PoincareKeyedVectors instance) – Embedding to be evaluated.
evaluate(max_n=None)

Evaluate all defined metrics for the link prediction task.

Parameters:max_n (int or None) – Maximum number of positive relations to evaluate, all if max_n is None.
Returns:Contains (metric_name, metric_value) pairs, e.g. {'mean_rank': 50.3, 'MAP': 0.31}.
Return type:dict
evaluate_mean_rank_and_map(max_n=None)

Evaluate mean rank and MAP for link prediction.

Parameters:max_n (int or None) – Maximum number of positive relations to evaluate, all if max_n is None.
Returns:Contains (mean_rank, MAP), e.g. (50.3, 0.31).
Return type:tuple (float, float)
static get_unknown_relation_ranks_and_avg_prec(all_distances, unknown_relations, known_relations)

Given a numpy array of distances and indices of known and unknown positive relations, compute ranks and Average Precision of unknown positive relations.

Parameters:
  • all_distances (numpy.array (float)) – Array of all distances for a specific item.
  • unknown_relations (list) – List of indices of unknown positive relations.
  • known_relations (list) – List of indices of known positive relations.
Returns:

The list contains ranks (int) of the unknown positive relations in the same order as unknown_relations. The float is the Average Precision of the ranking, e.g. ([1, 2, 3, 20], 0.610).

Return type:

tuple (list, float)

class gensim.models.poincare.NegativesBuffer(items)

Bases: object

Class to buffer and return negative samples.

Initialize instance from list or numpy array of samples.

Parameters:items (list/numpy.array) – List or array containing negative samples.
get_items(num_items)

Returns next num_items from buffer.

Parameters:num_items (int) – Number of items to fetch.
Returns:Slice containing num_items items from the original data.
Return type:numpy.array or list

Notes

No error is raised if fewer than num_items items remain; all remaining items are returned.

num_items()

Returns number of items remaining in the buffer.

Returns:Number of items in the buffer that haven’t been consumed yet.
Return type:int
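The buffer behaviour described above (consume a slice, report what is left, never raise when the buffer runs short) can be sketched as follows. This is an illustrative stand-in under those stated assumptions, not gensim's class; `SimpleNegativesBuffer` is a hypothetical name.

```python
class SimpleNegativesBuffer:
    """Hand out consecutive slices of a pre-sampled sequence of negatives."""

    def __init__(self, items):
        self._items = items
        self._current_index = 0  # position of the first unconsumed item

    def num_items(self):
        """Number of items that have not been consumed yet."""
        return len(self._items) - self._current_index

    def get_items(self, num_items):
        """Return the next num_items items, or fewer if the buffer runs out."""
        start = self._current_index
        end = start + num_items
        self._current_index = min(end, len(self._items))
        return self._items[start:end]


buffer = SimpleNegativesBuffer([3, 1, 4, 1, 5])
first = buffer.get_items(2)
remaining_after_first = buffer.num_items()
rest = buffer.get_items(10)  # asks for more than is left; no error is raised
```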
class gensim.models.poincare.PoincareBatch(vectors_u, vectors_v, indices_u, indices_v, regularization_coeff=1.0)

Bases: object

Compute Poincare distances, gradients and loss for a training batch.

Class for computing Poincare distances, gradients and loss for a training batch, and storing intermediate state to avoid recomputing multiple times.

Initialize instance with sets of vectors for which distances are to be computed.

Parameters:
  • vectors_u (numpy.array) – Vectors of all nodes u in the batch. Expected shape (batch_size, dim).
  • vectors_v (numpy.array) – Vectors of all positively related nodes v and negatively sampled nodes v’, for each node u in the batch. Expected shape (1 + neg_size, dim, batch_size).
  • indices_u (list) – List of node indices for each of the vectors in vectors_u.
  • indices_v (list) – Nested list of lists, each of which is a list of node indices for each of the vectors in vectors_v for a specific node u.
  • regularization_coeff (float) – Coefficient to use for l2-regularization.
compute_all()

Convenience method to perform all computations.

compute_distance_gradients()

Compute and store partial derivatives of poincare distance d(u, v) w.r.t all u and all v.

compute_distances()

Compute and store norms, euclidean distances and poincare distances between input vectors.

compute_gradients()

Compute and store gradients of loss function for all input vectors.

compute_loss()

Compute and store loss value for the given batch of examples.
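As a rough illustration of the loss for a single training example, the sketch below follows the softmax-style objective from [1]: the negative log of the share of exp(-distance) assigned to the positive node among the positive and sampled negative nodes. The regularization term is omitted, the function name is hypothetical, and this is not a description of how PoincareBatch stores or vectorizes the computation.

```python
import math


def example_loss(distance_to_positive, distances_to_negatives):
    """Negative log-softmax of the positive node over negated distances."""
    all_distances = [distance_to_positive] + list(distances_to_negatives)
    denominator = sum(math.exp(-d) for d in all_distances)
    return -math.log(math.exp(-distance_to_positive) / denominator)


# The loss shrinks as the positive node gets closer than the negatives.
loss_equal = example_loss(1.0, [1.0])
loss_good = example_loss(0.1, [3.0, 4.0])
```

When the positive and the single negative are equally distant, the loss is log 2; pulling the positive closer lowers it.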

class gensim.models.poincare.PoincareKeyedVectors

Bases: gensim.models.keyedvectors.KeyedVectorsBase

Class to contain vectors and vocab for the PoincareModel training class.

Used to perform operations on the vectors such as vector lookup, distance etc.

__contains__(word)
__getitem__(words)

Accept a single word or a list of words as input.

If a single word: returns the word’s representations in vector space, as a 1D numpy array.

Multiple words: return the words’ representations in vector space, as a 2d numpy array: #words x #vector_size. Matrix rows are in the same order as in input.

Example:

>>> trained_model['office']
array([ -1.40128313e-02, ...])

>>> trained_model[['office', 'products']]
array([[ -1.40128313e-02, ...],
       [ -1.70425311e-03, ...],
       ...])
ancestors(node)

Returns the list of recursively closest parents from the given node.

Parameters:node (str or int) – Key for node for which ancestors are to be found.
Returns:Ancestor nodes of the node node.
Return type:list (str)
closest_child(node)

Returns the node closest to node that is lower in the hierarchy than node.

Parameters:node (str or int) – Key for node for which closest child is to be found.
Returns:Node closest to node that is lower in the hierarchy than node. If there are no nodes lower in the hierarchy, None is returned.
Return type:str or None
closest_parent(node)

Returns the node closest to node that is higher in the hierarchy than node.

Parameters:node (str or int) – Key for node for which closest parent is to be found.
Returns:Node closest to node that is higher in the hierarchy than node. If there are no nodes higher in the hierarchy, None is returned.
Return type:str or None
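The ideas behind closest_parent and ancestors can be sketched on a toy embedding. The sketch assumes that "higher in the hierarchy" means a smaller Euclidean norm, which matches the description of norm() in this module, and that closeness is measured by Poincaré distance. The helper names and the 2-d vectors are made up for illustration; this is not gensim's code.

```python
import math


def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))


def poincare_distance(u, v):
    squared_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denominator = (1 - sum(a * a for a in u)) * (1 - sum(b * b for b in v))
    return math.acosh(1 + 2 * squared_diff / denominator)


def closest_parent(embedding, node):
    """Nearest node with a smaller norm than node, or None if there is none."""
    node_norm = euclidean_norm(embedding[node])
    candidates = [n for n in embedding if euclidean_norm(embedding[n]) < node_norm]
    if not candidates:
        return None
    return min(candidates, key=lambda n: poincare_distance(embedding[node], embedding[n]))


def ancestors(embedding, node):
    """Follow closest_parent repeatedly until no higher node exists."""
    result = []
    parent = closest_parent(embedding, node)
    while parent is not None:  # norms strictly decrease, so this terminates
        result.append(parent)
        parent = closest_parent(embedding, parent)
    return result


toy_embedding = {
    'mammal': [0.1, 0.0],
    'marsupial': [0.5, 0.1],
    'kangaroo': [0.8, 0.2],
}
chain = ancestors(toy_embedding, 'kangaroo')
```

In this toy case `chain` walks from `kangaroo` through `marsupial` up to `mammal`.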
descendants(node, max_depth=5)

Returns the list of recursively closest children from the given node, up to a max depth of max_depth.

Parameters:
  • node (str or int) – Key for node for which descendants are to be found.
  • max_depth (int) – Maximum number of descendants to return.
Returns:

Descendant nodes from the node node.

Return type:

list (str)

difference_in_hierarchy(node_or_vector_1, node_or_vector_2)

Relative position in hierarchy of node_or_vector_1 relative to node_or_vector_2. A positive value indicates node_or_vector_1 is higher in the hierarchy than node_or_vector_2.

Parameters:
  • node_or_vector_1 (str/int or numpy.array) – Input node key or vector.
  • node_or_vector_2 (str/int or numpy.array) – Input node key or vector.
Returns:

Relative position in hierarchy of node_or_vector_1 relative to node_or_vector_2.

Return type:

float

Examples

>>> model.difference_in_hierarchy('mammal.n.01', 'dog.n.01')
0.51
>>> model.difference_in_hierarchy('dog.n.01', 'mammal.n.01')
-0.51

Notes

The returned value can be positive or negative, depending on whether node_or_vector_1 is higher or lower in the hierarchy than node_or_vector_2.
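One way to picture the sign convention is the sketch below. It assumes the relative position is the difference of the two Euclidean norms, with a smaller norm meaning a higher position; that assumption is consistent with the description above and with norm(), but the exact formula used by the library is not stated here. The function name and vectors are hypothetical.

```python
import math


def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))


def difference_in_hierarchy(vector_1, vector_2):
    """Positive when vector_1 has the smaller norm, i.e. sits higher."""
    return euclidean_norm(vector_2) - euclidean_norm(vector_1)


mammal = [0.1, 0.2]  # made-up vector close to the origin (general concept)
dog = [0.6, 0.7]     # made-up vector close to the boundary (specific concept)

mammal_vs_dog = difference_in_hierarchy(mammal, dog)
dog_vs_mammal = difference_in_hierarchy(dog, mammal)
```

Swapping the arguments flips the sign, as in the doctest example above.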

distance(w1, w2)

Return Poincare distance between vectors for nodes w1 and w2.

Parameters:
  • w1 (str or int) – Key for first node.
  • w2 (str or int) – Key for second node.
Returns:

Poincare distance between the vectors for nodes w1 and w2.

Return type:

float

Examples

>>> model.distance('mammal.n.01', 'carnivore.n.01')
2.13

Notes

Raises KeyError if either of w1 and w2 is absent from vocab.
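For reference, the Poincaré distance defined in [1] is d(u, v) = arcosh(1 + 2 ||u - v||² / ((1 - ||u||²)(1 - ||v||²))). A minimal standard-library sketch of that formula follows; it operates on plain lists rather than vocabulary keys, assumes both vectors lie strictly inside the unit ball, and is not gensim's vectorized implementation.

```python
import math


def poincare_distance(u, v):
    """Poincaré distance between two points inside the unit ball."""
    squared_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    squared_norm_u = sum(a * a for a in u)
    squared_norm_v = sum(b * b for b in v)
    gamma = 1 + 2 * squared_diff / ((1 - squared_norm_u) * (1 - squared_norm_v))
    return math.acosh(gamma)


origin = [0.0, 0.0]
point = [0.5, 0.0]
d = poincare_distance(origin, point)
```

For these two points the argument of arcosh is 5/3, so `d` equals ln 3. Distances grow quickly as points approach the boundary of the ball, which is what lets the space accommodate tree-like structures.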

distances(node_or_vector, other_nodes=())

Compute Poincare distances from given node or vector to all nodes in other_nodes. If other_nodes is empty, return distance between node_or_vector and all nodes in vocab.

Parameters:
  • node_or_vector (str/int or numpy.array) – Node key or vector from which distances are to be computed.
  • other_nodes (iterable of str/int or None) – For each node in other_nodes distance from node_or_vector is computed. If None or empty, distance of node_or_vector from all nodes in vocab is computed (including itself).
Returns:

Array containing distances to all nodes in other_nodes from input node_or_vector, in the same order as other_nodes.

Return type:

numpy.array

Examples

>>> model.distances('mammal.n.01', ['carnivore.n.01', 'dog.n.01'])
np.array([2.1199, 2.0710])
>>> model.distances('mammal.n.01')
np.array([0.43753847, 3.67973852, ..., 6.66172886])

Notes

Raises KeyError if either node_or_vector or any node in other_nodes is absent from vocab.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=numpy.float32)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. norm_only is a boolean indicating whether to only store normalised word2vec vectors in memory. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

If you trained the C model using non-utf8 encoding for words, specify that encoding in encoding.

unicode_errors, default ‘strict’, is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), ‘ignore’ or ‘replace’ may help.

limit sets a maximum number of word-vectors to read from the file. The default, None, means read all.

datatype (experimental) can coerce dimensions to a non-default float type (such as np.float16) to save memory. (Such types may result in much slower bulk operations or incompatibility with optimized routines.)

most_similar(node_or_vector, topn=10, restrict_vocab=None)

Find the top-N most similar nodes to the given node or vector, sorted in increasing order of distance.

Parameters:
  • node_or_vector (str/int or numpy.array) – node key or vector for which similar nodes are to be found.
  • topn (int or None, optional) – number of similar nodes to return, if None, returns all.
  • restrict_vocab (int or None, optional) – Optional integer which limits the range of vectors which are searched for most-similar values. For example, restrict_vocab=10000 would only check the first 10000 node vectors in the vocabulary order. This may be meaningful if vocabulary is sorted by descending frequency.
Returns:

List of tuples containing (node, distance) pairs in increasing order of distance.

Return type:

list of tuples (str, float)

Examples

>>> vectors.most_similar('lion.n.01')
[('lion_cub.n.01', 0.4484), ('lionet.n.01', 0.6552), ...]
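The selection step can be sketched independently of the embedding: given precomputed distances from the query node, keep the topn smallest. The sketch assumes the query node itself is excluded from the results; the function name and distance values are hypothetical, and the restrict_vocab option is not modelled.

```python
import heapq


def top_n_closest(distances_by_node, query_node, topn=10):
    """Return (node, distance) pairs sorted by increasing distance."""
    candidates = [
        (node, distance)
        for node, distance in distances_by_node.items()
        if node != query_node
    ]
    if topn is None:
        return sorted(candidates, key=lambda pair: pair[1])
    return heapq.nsmallest(topn, candidates, key=lambda pair: pair[1])


# Made-up distances from 'lion.n.01' to every node in a tiny vocabulary.
distances = {
    'lion.n.01': 0.0,
    'lionet.n.01': 0.6552,
    'lion_cub.n.01': 0.4484,
    'plant.n.02': 5.2,
}
closest_two = top_n_closest(distances, 'lion.n.01', topn=2)
```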
most_similar_to_given(w1, word_list)

Return the word from word_list most similar to w1.

Parameters:
  • w1 (str) – a word
  • word_list (list) – list of words containing a word most similar to w1
Returns:

the word in word_list with the highest similarity to w1

Raises:

KeyError – If w1 or any word in word_list is not in the vocabulary

Example:

>>> trained_model.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])
'sound'

>>> trained_model.most_similar_to_given('snake', ['food', 'pencil', 'animal', 'phone'])
'animal'
norm(node_or_vector)

Return absolute position in hierarchy of input node or vector. Values range between 0 and 1. A lower value indicates the input node or vector is higher in the hierarchy.

Parameters:node_or_vector (str/int or numpy.array) – Input node key or vector for which position in hierarchy is to be returned.
Returns:Absolute position in the hierarchy of the input vector or node.
Return type:float

Examples

>>> model.norm('mammal.n.01')
0.9

Notes

The position in hierarchy is based on the norm of the vector for the node.

rank(w1, w2)

Rank of the distance of w2 from w1, in relation to distances of all words from w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

Rank of w2 from w1 in relation to all other nodes.

Return type:

int

Examples

>>> model.rank('mammal.n.01', 'carnivore.n.01')
3
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.

fname is the file used to save the vectors in. fvocab is an optional file used to save the vocabulary. binary is an optional boolean indicating whether the data is to be saved in binary word2vec format (default: False). total_vec is an optional parameter to explicitly specify the total number of vectors (in case word vectors are appended with document vectors afterwards).
similarity(w1, w2)

Return similarity based on Poincare distance between vectors for nodes w1 and w2.

Parameters:
  • w1 (str or int) – Key for first node.
  • w2 (str or int) – Key for second node.
Returns:

Similarity between the vectors for nodes w1 and w2 (between 0 and 1).

Return type:

float

Examples

>>> model.similarity('mammal.n.01', 'carnivore.n.01')
0.73

Notes

Raises KeyError if either of w1 and w2 is absent from vocab. Similarity lies between 0 and 1.

static vector_distance(vector_1, vector_2)

Return poincare distance between two input vectors. Convenience method over vector_distance_batch.

Parameters:
  • vector_1 (numpy.array) – input vector
  • vector_2 (numpy.array) – input vector
Returns:

Poincare distance between vector_1 and vector_2.

Return type:

numpy.float

static vector_distance_batch(vector_1, vectors_all)

Return poincare distances between one vector and a set of other vectors.

Parameters:
  • vector_1 (numpy.array) – vector from which Poincare distances are to be computed. expected shape (dim,)
  • vectors_all (numpy.array) – for each row in vectors_all, distance from vector_1 is computed. expected shape (num_vectors, dim)
Returns:

Contains Poincare distance between vector_1 and each row in vectors_all. shape (num_vectors,)

Return type:

numpy.array

word_vec(word)

Accept a single word as input. Returns the word’s representations in vector space, as a 1D numpy array.

Example:

>>> trained_model.word_vec('office')
array([ -1.40128313e-02, ...])
words_closer_than(w1, w2)

Returns all words that are closer to w1 than w2 is to w1.

Parameters:
  • w1 (str) – Input word.
  • w2 (str) – Input word.
Returns:

List of words that are closer to w1 than w2 is to w1.

Return type:

list (str)

Examples

>>> model.words_closer_than('carnivore.n.01', 'mammal.n.01')
['dog.n.01', 'canine.n.02']
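The relationship between words_closer_than and rank can be sketched from precomputed distances. The sketch assumes that w1 itself is not counted and that the rank is one more than the number of closer nodes; the helper names and distance values are made up for illustration.

```python
def nodes_closer_than(distances_from_w1, w1, w2):
    """Nodes strictly closer to w1 than w2 is, excluding w1 itself."""
    threshold = distances_from_w1[w2]
    return [
        node
        for node, distance in distances_from_w1.items()
        if node != w1 and distance < threshold
    ]


def rank_of(distances_from_w1, w1, w2):
    """1-based rank of w2 among all nodes ordered by distance from w1."""
    return len(nodes_closer_than(distances_from_w1, w1, w2)) + 1


# Made-up distances from 'carnivore.n.01' to each node.
distances = {
    'carnivore.n.01': 0.0,
    'dog.n.01': 0.9,
    'canine.n.02': 1.1,
    'mammal.n.01': 2.1,
    'plant.n.02': 5.0,
}
closer = nodes_closer_than(distances, 'carnivore.n.01', 'mammal.n.01')
mammal_rank = rank_of(distances, 'carnivore.n.01', 'mammal.n.01')
```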
class gensim.models.poincare.PoincareModel(train_data, size=50, alpha=0.1, negative=10, workers=1, epsilon=1e-05, regularization_coeff=1.0, burn_in=10, burn_in_alpha=0.01, init_range=(-0.001, 0.001), dtype=numpy.float64, seed=0)

Bases: gensim.utils.SaveLoad

Class for training, using and evaluating Poincare Embeddings.

The model can be stored/loaded via its save() and load() methods, or stored/loaded in the word2vec format via model.kv.save_word2vec_format and load_word2vec_format().

Note that training cannot be resumed from a model loaded via load_word2vec_format. To train further, use the save() and load() methods instead.

Initialize and train a Poincare embedding model from an iterable of relations.

Parameters:
  • train_data (iterable of (str, str)) – Iterable of relations, e.g. a list of tuples, or a PoincareRelations instance streaming from a file. Note that the relations are treated as ordered pairs, i.e. a relation (a, b) does not imply the opposite relation (b, a). In case the relations are symmetric, the data should contain both relations (a, b) and (b, a).
  • size (int, optional) – Number of dimensions of the trained model.
  • alpha (float, optional) – Learning rate for training.
  • negative (int, optional) – Number of negative samples to use.
  • workers (int, optional) – Number of threads to use for training the model.
  • epsilon (float, optional) – Constant used for clipping embeddings below a norm of one.
  • regularization_coeff (float, optional) – Coefficient used for l2-regularization while training (0 effectively disables regularization).
  • burn_in (int, optional) – Number of epochs to use for burn-in initialization (0 means no burn-in).
  • burn_in_alpha (float, optional) – Learning rate for burn-in initialization, ignored if burn_in is 0.
  • init_range (2-tuple (float, float)) – Range within which the vectors are randomly initialized.
  • dtype (numpy.dtype) – The numpy dtype to use for the vectors in the model (numpy.float64, numpy.float32 etc). Using lower precision floats may be useful in increasing training speed and reducing memory usage.
  • seed (int, optional) – Seed for random to ensure reproducibility.
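To illustrate what the epsilon parameter is for, the sketch below shows one possible way of keeping a vector strictly inside the unit ball: any vector whose norm exceeds 1 - epsilon is rescaled to have exactly that norm. This particular projection is an assumption made for illustration and may differ from the clipping rule gensim applies; `clip_to_ball` is a hypothetical helper.

```python
import math


def clip_to_ball(vector, epsilon=1e-5):
    """Rescale vector so its Euclidean norm is at most 1 - epsilon."""
    norm = math.sqrt(sum(x * x for x in vector))
    max_norm = 1.0 - epsilon
    if norm <= max_norm:
        return list(vector)  # already inside the ball, leave unchanged
    scale = max_norm / norm
    return [x * scale for x in vector]


outside = clip_to_ball([3.0, 4.0])  # norm 5.0, pulled back inside the ball
inside = clip_to_ball([0.3, 0.4])   # norm 0.5, returned unchanged
outside_norm = math.sqrt(sum(x * x for x in outside))
```

Keeping the norm below one matters because the Poincaré distance is undefined on and outside the boundary of the ball.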

Examples

Initialize a model from a list:

>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)

Initialize a model from a file containing one relation per line:

>>> from gensim.models.poincare import PoincareModel, PoincareRelations
>>> from gensim.test.utils import datapath
>>> file_path = datapath('poincare_hypernyms.tsv')
>>> model = PoincareModel(PoincareRelations(file_path), negative=2)

See PoincareRelations for more options.

classmethod load(*args, **kwargs)

Load model from disk, inherited from SaveLoad.

save(*args, **kwargs)

Save complete model to disk, inherited from gensim.utils.SaveLoad.

train(epochs, batch_size=10, print_every=1000, check_gradients_every=None)

Trains Poincare embeddings using loaded data and model parameters.

Parameters:
  • batch_size (int, optional) – Number of examples to train on in a single batch.
  • epochs (int) – Number of iterations (epochs) over the corpus.
  • print_every (int, optional) – Prints progress and average loss after every print_every batches.
  • check_gradients_every (int or None, optional) – Compares computed gradients and autograd gradients after every check_gradients_every batches. Useful for debugging, doesn’t compare by default.

Examples

>>> from gensim.models.poincare import PoincareModel
>>> relations = [('kangaroo', 'marsupial'), ('kangaroo', 'mammal'), ('gib', 'cat')]
>>> model = PoincareModel(relations, negative=2)
>>> model.train(epochs=50)
class gensim.models.poincare.PoincareRelations(file_path, encoding='utf8', delimiter='\t')

Bases: object

Class to stream relations for PoincareModel from a tsv-like file.

Initialize instance from file containing a pair of nodes (a relation) per line.

Parameters:
  • file_path (str) – Path to file containing a pair of nodes (a relation) per line, separated by delimiter.
  • encoding (str, optional) – Character encoding of the input file.
  • delimiter (str, optional) – Delimiter character for each relation.
__iter__()

Streams relations from self.file_path decoded into unicode strings.

Yields:2-tuple (unicode, unicode) – Relation from input file.
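The streaming behaviour can be approximated with the standard csv module. The sketch below is a stand-in under stated assumptions (two fields per line, blank lines skipped), not gensim's class; `stream_relations` is a hypothetical name and the temporary file exists only to make the example self-contained.

```python
import csv
import os
import tempfile


def stream_relations(file_path, encoding='utf8', delimiter='\t'):
    """Yield one (node, node) tuple per line without loading the whole file."""
    with open(file_path, encoding=encoding, newline='') as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError('expected 2 fields per line, got %r' % (row,))
            yield tuple(row)


# Write a tiny tab-separated file so the example can run on its own.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False, encoding='utf8') as tmp:
    tmp.write('kangaroo\tmarsupial\nkangaroo\tmammal\n')
    tmp_path = tmp.name

relations = list(stream_relations(tmp_path))
os.remove(tmp_path)
```

Because the function is a generator, relations are read lazily, which is the property that makes a streamed file usable as training data for large graphs.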
class gensim.models.poincare.ReconstructionEvaluation(file_path, embedding)

Bases: object

Evaluating reconstruction on given network for given embedding.

Initialize evaluation instance with tsv file containing relation pairs and embedding to be evaluated.

Parameters:
  • file_path (str) – Path to tsv file containing relation pairs.
  • embedding (PoincareKeyedVectors instance) – Embedding to be evaluated.
evaluate(max_n=None)

Evaluate all defined metrics for the reconstruction task.

Parameters:max_n (int or None) – Maximum number of positive relations to evaluate, all if max_n is None.
Returns:Contains (metric_name, metric_value) pairs, e.g. {'mean_rank': 50.3, 'MAP': 0.31}.
Return type:dict
evaluate_mean_rank_and_map(max_n=None)

Evaluate mean rank and MAP for reconstruction.

Parameters:max_n (int or None) – Maximum number of positive relations to evaluate, all if max_n is None.
Returns:Contains (mean_rank, MAP), e.g. (50.3, 0.31).
Return type:tuple (float, float)
static get_positive_relation_ranks_and_avg_prec(all_distances, positive_relations)

Given a numpy array of all distances from an item and indices of its positive relations, compute ranks and Average Precision of positive relations.

Parameters:
  • all_distances (numpy.array (float)) – Array of all distances (floats) for a specific item.
  • positive_relations (list) – List of indices of positive relations for the item.
Returns:

The list contains ranks (int) of positive relations in the same order as positive_relations. The float is the Average Precision of the ranking. e.g. ([1, 2, 3, 20], 0.610).

Return type:

tuple (list, float)
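The ranking metrics can be illustrated with a small pure-Python sketch. It assumes that the rank of a positive relation is one plus the number of non-positive items that are closer, and that Average Precision is the mean of the precision values at each positive's position in the full ordering. The function name and the distances are hypothetical, and ties are not treated specially.

```python
def ranks_and_average_precision(all_distances, positive_relations):
    """Return (ranks of positives among negatives, average precision)."""
    positives = set(positive_relations)
    negative_distances = [
        d for i, d in enumerate(all_distances) if i not in positives
    ]
    # Rank of each positive: 1 + number of negatives that are closer.
    ranks = [
        1 + sum(1 for d in negative_distances if d < all_distances[i])
        for i in positive_relations
    ]
    precisions = []
    for k, rank in enumerate(sorted(ranks)):
        # Position in the full ordering: k other positives come before it.
        position = rank + k
        precisions.append((k + 1) / position)
    return ranks, sum(precisions) / len(precisions)


# Made-up distances from one item to five nodes; nodes 0 and 4 are positives.
distances = [0.1, 0.5, 0.2, 0.9, 0.3]
ranks, avg_precision = ranks_and_average_precision(distances, [0, 4])
```

Here the positives land at positions 1 and 3 of the full ordering, so the precision values are 1 and 2/3, and their mean is 5/6.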