`models.translation_matrix` – Translation Matrix model

Produce a translation matrix to translate words from one language to another, using either a standard nearest-neighbour method or a globally corrected neighbour retrieval method [1].

This method can be used to augment existing phrase tables with more candidate translations, or to filter out errors from translation tables and known dictionaries [2]. It also works for any two sets of named vectors for which some paired guideposts are available to learn the transformation.
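As a rough illustration of the difference between the two retrieval methods, the sketch below uses NumPy on made-up 2-D vectors. It is not Gensim's implementation: the function names, the toy vectors and the tie-breaking weight are assumptions chosen for illustration. Standard retrieval gives each mapped query its most similar target, so a "hub" target can be returned for many queries. The globally corrected variant, in the spirit of [1], first ranks the queries from each target's point of view and then picks the target for which the query ranks best.

```python
import numpy as np


def cosine_matrix(queries, targets):
    """Cosine similarity between every query row and every target row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return q @ t.T  # shape: (n_queries, n_targets)


def nearest_neighbour(queries, targets):
    """Standard retrieval: index of the most similar target for each query."""
    return cosine_matrix(queries, targets).argmax(axis=1)


def globally_corrected(queries, targets):
    """Rank-based retrieval (illustrative sketch, assumed tie-breaking)."""
    sims = cosine_matrix(queries, targets)
    # ranks[q, t] = position of query q among all queries, ordered by
    # decreasing similarity to target t (0 = the query closest to t).
    ranks = (-sims).argsort(axis=0).argsort(axis=0)
    # Lowest rank wins; the scaled cosine only breaks ties, because
    # sims / 4 can never differ by a full rank step.
    return (ranks - sims / 4.0).argmin(axis=1)


# Hypothetical toy data: target 0 lies "between" the other two and acts as a hub.
targets = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
queries = np.array([[1.0, 0.6], [0.5, 1.0]])

nn_result = nearest_neighbour(queries, targets)
gc_result = globally_corrected(queries, targets)
```

With these toy vectors, plain nearest-neighbour retrieval returns target 0 for both queries, while the rank-based version keeps target 0 for the first query and assigns target 2 to the second.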

Examples

How to translate between two sets of word vectors

Initialize two word-vector models

>>> from gensim.models import KeyedVectors, TranslationMatrix
>>> from gensim.test.utils import datapath, temporary_file
>>>
>>> model_en = KeyedVectors.load_word2vec_format(datapath("EN.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt"))
>>> model_it = KeyedVectors.load_word2vec_format(datapath("IT.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt"))

Define word pairs (that will be used for construction of translation matrix)

>>> word_pairs = [
...     ("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque"),
...     ("seven", "sette"), ("eight", "otto"),
...     ("dog", "cane"), ("pig", "maiale"), ("fish", "cavallo"), ("birds", "uccelli"),
...     ("apple", "mela"), ("orange", "arancione"), ("grape", "acino"), ("banana", "banana")
... ]

Fit TranslationMatrix

>>> trans_model = TranslationMatrix(model_en, model_it, word_pairs=word_pairs)

Apply model (translate words “dog” and “one”)

>>> trans_model.translate(["dog", "one"], topn=3)
OrderedDict([('dog', ['cane', 'gatto', 'cavallo']), ('one', ['uno', 'due', 'tre'])])

Save / load model

>>> with temporary_file("model_file") as fname:
...     trans_model.save(fname)  # save model to file
...     loaded_trans_model = TranslationMatrix.load(fname)  # load model

How to translate between two `Doc2Vec` models

Prepare data and models

>>> from gensim.test.utils import datapath
>>> from gensim.test.test_translation_matrix import read_sentiment_docs
>>> from gensim.models import Doc2Vec, BackMappingTranslationMatrix
>>>
>>> data = read_sentiment_docs(datapath("alldata-id-10.txt"))[:5]
>>> src_model = Doc2Vec.load(datapath("small_tag_doc_5_iter50"))
>>> dst_model = Doc2Vec.load(datapath("large_tag_doc_10_iter50"))

Train backward translation

>>> model_trans = BackMappingTranslationMatrix(src_model, dst_model)
>>> trans_matrix = model_trans.train(data)

Apply model

>>> result = model_trans.infer_vector(dst_model.dv[data[3].tags])

References

[1] Dinu, Georgiana, Angeliki Lazaridou, and Marco Baroni. “Improving zero-shot learning by mitigating the hubness problem”, https://arxiv.org/abs/1412.6568
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. “Distributed Representations of Words and Phrases and their Compositionality”, https://arxiv.org/abs/1310.4546

class gensim.models.translation_matrix.BackMappingTranslationMatrix(source_lang_vec, target_lang_vec, tagged_docs=None, random_state=None)

Bases: SaveLoad

Realize the BackMapping translation matrix, which maps the source model’s document vectors to the target model’s document vectors (the old model).

The BackMapping translation matrix learns a mapping between two document vector spaces, referred to as the source document vectors and the target document vectors. The target document vectors are trained on a superset of the corpus used for the source document vectors, so the BackMapping translation matrix makes it possible to add vectors to the old model incrementally.
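The idea of learning a linear map between two document vector spaces can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the class's actual code: it assumes the map is fit by ordinary least squares, it follows the direction described for infer_vector() (a target-model vector goes in, a source-space vector comes out), and the random matrices stand in for real Doc2Vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for document vectors of the same 20 documents
# in the two models (10-dimensional target space, 5-dimensional source space).
true_map = rng.normal(size=(10, 5))
target_vecs = rng.normal(size=(20, 10))
source_vecs = target_vecs @ true_map

# Fit W so that target_vecs @ W approximates source_vecs in the least-squares sense.
W, *_ = np.linalg.lstsq(target_vecs, source_vecs, rcond=None)

# Map the vector of a document seen only by the target model into the source space.
new_target_vec = rng.normal(size=10)
inferred_source_vec = new_target_vec @ W
```

Because the toy data is exactly linear, the fitted map reproduces the hidden one; with real models the fit is only approximate.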

For details on use, see the tutorial notebook [3].

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.test.test_translation_matrix import read_sentiment_docs
>>> from gensim.models import Doc2Vec, BackMappingTranslationMatrix
>>>
>>> data = read_sentiment_docs(datapath("alldata-id-10.txt"))[:5]
>>> src_model = Doc2Vec.load(datapath("small_tag_doc_5_iter50"))
>>> dst_model = Doc2Vec.load(datapath("large_tag_doc_10_iter50"))
>>>
>>> model_trans = BackMappingTranslationMatrix(src_model, dst_model)
>>> trans_matrix = model_trans.train(data)
>>>
>>> result = model_trans.infer_vector(dst_model.dv[data[3].tags])

Parameters

source_lang_vec (Doc2Vec) – Source Doc2Vec model.
target_lang_vec (Doc2Vec) – Target Doc2Vec model.
tagged_docs (list of TaggedDocument, optional) – Documents used for training. Both the source and the target document vectors must have been trained on these tagged documents.
random_state ({None, int, array_like}, optional) – Seed for random state.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters

event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

This method will automatically add the following key-values to event, so you don’t have to specify them:
- datetime: the current date & time
- gensim: the current Gensim version
- python: the current Python version
- platform: the current platform
- event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
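To make the shape of a lifecycle event concrete, here is a standard-library-only sketch. The helper make_event() is hypothetical and is not Gensim code; it only mimics the kind of record described above, and it omits the gensim key so that it runs without Gensim installed.

```python
import json
import platform
import sys
from datetime import datetime


def make_event(event_name, **event):
    """Hypothetical helper: build a lifecycle-style event record."""
    record = dict(event)  # user-supplied key-value pairs
    record["datetime"] = datetime.now().isoformat()
    record["python"] = sys.version
    record["platform"] = platform.platform()
    record["event"] = event_name
    return record


lifecycle_events = []
lifecycle_events.append(make_event("created", params="word_pairs=15"))

# Records are kept JSON-serializable so they can be persisted with the object.
serialized = json.dumps(lifecycle_events)
```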

infer_vector(target_doc_vec)

Translate a document vector from the target model into the source model’s vector space.

Parameters: target_doc_vec (numpy.ndarray) – Document vector from the target model, for a document that is not in the source model.
Returns: The vector target_doc_vec mapped into the source model’s space.
Return type: numpy.ndarray

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using `mmap='r'`. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None` must be set.

See also

save(): Save object to file.
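The effect of a memory-map option can be shown with plain NumPy. This sketch illustrates memory mapping in general and makes no claim about how load() handles files internally; the file name is made up.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.npy")  # hypothetical file name
    np.save(path, np.arange(12, dtype=np.float32).reshape(3, 4))

    # mmap_mode="r" maps the file read-only instead of reading it all into memory.
    arr = np.load(path, mmap_mode="r")
    is_memmap = isinstance(arr, np.memmap)
    second_row = arr[1].tolist()  # only the touched pages are read from disk

    del arr  # release the mapping before the temporary directory is removed
```

A memory-mapped array needs direct access to the raw bytes on disk, which is consistent with the requirement above that compressed files be loaded with `mmap=None`.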

train(tagged_docs)

Build the translation matrix to map from the source model’s vectors to the target model’s vectors.

Parameters: tagged_docs (list of TaggedDocument) – Documents used for training. Both the source and the target document vectors must have been trained on these tagged documents.
Returns: Translation matrix that maps from the source model’s vectors to the target model’s vectors.
Return type: numpy.ndarray

class gensim.models.translation_matrix.Space(matrix, index2word)

Bases: object

An auxiliary class for storing the word space.

Parameters

matrix (iterable of numpy.ndarray) – Matrix that contains word-vectors.
index2word (list of str) – Words which correspond to the matrix.

classmethod build(lang_vec, lexicon=None)

Construct a Space object for the lexicon, if one is provided.

Parameters

lang_vec (KeyedVectors) – Model from which the vectors will be extracted.
lexicon (list of str, optional) – Words contained in lang_vec. If lexicon is None, all words in lang_vec are used.

Returns

Object that stores the word vectors.

Return type

Space

normalize()

Normalize the word-vector matrix.
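A minimal NumPy sketch of what normalizing a word-vector matrix usually means, assuming each row (one word vector) is scaled to unit Euclidean length. The matrix values are made up.

```python
import numpy as np

# Hypothetical matrix with one word vector per row.
mat = np.array([[3.0, 4.0],
                [0.0, 2.0]])

# Divide every row by its own L2 norm so that each vector has length 1.
unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
```

After this step, the dot product of two rows equals their cosine similarity, which is convenient for nearest-neighbour lookups.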

class gensim.models.translation_matrix.TranslationMatrix(source_lang_vec, target_lang_vec, word_pairs=None, random_state=None)

Bases: SaveLoad

Objects of this class realize the translation matrix which maps the source language to the target language.

Given a source word with vector representation x, we map it to the other language space by computing z = Wx, where W is the translation matrix, then return the words whose representations are closest to z.
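The mapping z = Wx followed by a nearest-neighbour lookup can be sketched with NumPy on toy data. This is an illustration under stated assumptions, not the class's actual code: W is assumed to be fit by ordinary least squares on the word pairs, vectors are treated as rows (so the product is written x @ W), and the 2-D vectors are invented.

```python
import numpy as np

# Hypothetical 2-D "embeddings"; real models use hundreds of dimensions.
source = {"one": [1.0, 0.0], "two": [0.9, 0.3], "dog": [0.0, 1.0], "pig": [0.2, 0.9]}
target = {"uno": [0.0, 1.0], "due": [-0.3, 0.9], "cane": [-1.0, 0.0], "maiale": [-0.9, 0.2]}
word_pairs = [("one", "uno"), ("two", "due"), ("dog", "cane")]

# Stack the paired vectors and solve X @ W ~ Z for W.
X = np.array([source[s] for s, _ in word_pairs])
Z = np.array([target[t] for _, t in word_pairs])
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

target_words = list(target)
target_mat = np.array([target[w] for w in target_words])
target_mat = target_mat / np.linalg.norm(target_mat, axis=1, keepdims=True)


def translate(word):
    """Map a source word into the target space and return the closest target word."""
    z = np.array(source[word]) @ W
    z = z / np.linalg.norm(z)
    return target_words[int(np.argmax(target_mat @ z))]


held_out = translate("pig")  # "pig" was not among the training pairs
```

In this toy setup the target space is an exact rotation of the source space, so the held-out word "pig" maps onto "maiale". Real embeddings are only approximately linearly related, which is why several candidates (topn) are usually inspected.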

For details on use, see the tutorial notebook [3].

Examples

>>> from gensim.models import KeyedVectors, TranslationMatrix
>>> from gensim.test.utils import datapath
>>> en = datapath("EN.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")
>>> it = datapath("IT.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")
>>> model_en = KeyedVectors.load_word2vec_format(en)
>>> model_it = KeyedVectors.load_word2vec_format(it)
>>>
>>> word_pairs = [
...     ("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque"),
...     ("seven", "sette"), ("eight", "otto"),
...     ("dog", "cane"), ("pig", "maiale"), ("fish", "cavallo"), ("birds", "uccelli"),
...     ("apple", "mela"), ("orange", "arancione"), ("grape", "acino"), ("banana", "banana")
... ]
>>>
>>> trans_model = TranslationMatrix(model_en, model_it)
>>> trans_model.train(word_pairs)
>>> trans_model.translate(["dog", "one"], topn=3)
OrderedDict([('dog', ['cane', 'gatto', 'cavallo']), ('one', ['uno', 'due', 'tre'])])

References

[3] https://github.com/RaRe-Technologies/gensim/blob/3.2.0/docs/notebooks/translation_matrix.ipynb

Parameters

source_lang_vec (KeyedVectors) – Word vectors for source language.
target_lang_vec (KeyedVectors) – Word vectors for target language.
word_pairs (list of (str, str), optional) – Pairs of words that will be used for training.
random_state ({None, int, array_like}, optional) – Seed for random state.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters

event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

This method will automatically add the following key-values to event, so you don’t have to specify them:
- datetime: the current date & time
- gensim: the current Gensim version
- python: the current Python version
- platform: the current platform
- event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

apply_transmat(words_space)

Map the source word vectors to the target space using the translation matrix.

Parameters: words_space (Space) – Space object constructed for the words to be translated.
Returns: Space object constructed for the mapped words.
Return type: Space

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using `mmap='r'`. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then `mmap=None` must be set.
