
models.translation_matrix – Translation Matrix model


class gensim.models.translation_matrix.BackMappingTranslationMatrix(tagged_docs, source_lang_vec, target_lang_vec, random_state=None)

Bases: gensim.utils.SaveLoad

Objects of this class realize the back-mapping translation matrix, which maps a document vector of the source model into the vector space of the target (old) model. The main methods are:

  1. the constructor, which initializes the model,
  2. the train method, which builds the translation matrix,
  3. the infer_vector method, which, given a document vector from the target model, infers the corresponding vector in the source model.

A vector x is mapped into the other space by computing z = Wx; the representation closest to z is then returned.

For details, see the notebook (translation matrix revist.ipynb).

>>> transmat = BackMappingTranslationMatrix(tagged, source_lang_vec, target_lang_vec)
>>> transmat.train(word_pair)
>>> infered_vec = transmat.infer_vector(tagged_doc)
Initialize the model from a list of tagged_docs. Each word_pair is a tuple
of a source-language word and a target-language word.

Examples: [("one", "uno"), ("two", "due")]

Parameters:
  • tagged_docs (list) – a list of tagged documents
  • source_lang_vec (Doc2vec) – the model providing the source document vectors
  • target_lang_vec (Doc2vec) – the model providing the target document vectors
infer_vector(target_doc_vec)

Translate the target model's document vector into the source model's document vector.

Returns: infered_vec, the document vector of tagged_doc in the source model's space
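The back-mapping amounts to an ordinary least-squares fit over paired document vectors from the two models, after which any new vector is mapped with z = Wx. The following is a minimal numpy sketch with made-up 2-d toy vectors, not the gensim implementation:

```python
import numpy as np

# Hypothetical paired document vectors: rows are vectors for the same
# documents in the target (new) model and in the source (old) model.
target_docvecs = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
source_docvecs = 2.0 * target_docvecs  # toy "old model" vectors

# Fit W by least squares, minimizing ||target_docvecs @ W - source_docvecs||^2.
W, _, _, _ = np.linalg.lstsq(target_docvecs, source_docvecs, rcond=None)

# infer_vector-style mapping of a freshly inferred document vector:
x = np.array([0.5, 0.5])
z = x @ W  # maps x into the source model's space; here z = [1.0, 1.0]
```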
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

train(tagged_docs)

Build the translation matrix that maps the source model's vectors to the target model's vectors.

Returns: the translation matrix that maps the source model's vectors to the target model's vectors
class gensim.models.translation_matrix.Space(matrix, index2word)

Bases: object

An auxiliary class for storing a word space.

Attributes:
  • mat (ndarray) – a matrix of shape N * length_of_word_vec, in which each row is the word vector of a lexicon entry
  • index2word (list) – the list of words in the Space object
  • word2index (dict) – maps each word to its row index in mat

classmethod build(lang_vec, lexicon=None)

Construct a Space instance for the lexicon, if one is provided.

Parameters:
  • lang_vec – a word2vec model from which word vectors are extracted for the lexicon
  • lexicon – defaults to None; if not provided, the lexicon is all of lang_vec's words, i.e. lang_vec.vocab.keys()

Returns: a Space object for the lexicon
normalize()

Normalize the rows of the word-vector matrix.
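In numpy terms, normalization divides each row of mat by its Euclidean norm, so that dot products between rows become cosine similarities. A minimal sketch with toy values (not the gensim internals):

```python
import numpy as np

# Toy word space: one row per lexicon word, mirroring Space.mat.
index2word = ["one", "two", "three"]
mat = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.0, 2.0]])
word2index = {w: i for i, w in enumerate(index2word)}  # like Space.word2index

# Normalize each row to unit length.
norms = np.sqrt(np.sum(mat ** 2, axis=1, keepdims=True))
mat = mat / norms  # the row for "one" becomes [0.6, 0.8]
```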

class gensim.models.translation_matrix.TranslationMatrix(source_lang_vec, target_lang_vec, word_pairs=None, random_state=None)

Bases: gensim.utils.SaveLoad

Objects of this class realize the translation matrix, which maps words from the source language to the target language. The main methods are:

  1. the constructor,
  2. the train method, which initializes everything needed to build the translation matrix,
  3. the translate method, which, given a new word and its vector representation, returns its translations.

We map a word into the other language space by computing z = Wx, then return the words whose representations are closest to z.
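The whole pipeline (fit W on the training pairs, map with z = Wx, retrieve the nearest target word) can be sketched in plain numpy. The toy 2-d vectors and words below are made up for illustration; this is not the gensim implementation:

```python
import numpy as np

# Toy embeddings for the training word pairs ("one"/"uno", "two"/"due", ...).
source = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
target_words = ["uno", "due", "tre"]

# Fit the translation matrix by least squares: source @ W ~= target,
# i.e. z = Wx for each training pair.
W, _, _, _ = np.linalg.lstsq(source, target, rcond=None)

# Translate a new source vector: map it, then take the most similar
# target word by cosine similarity.
x = np.array([1.0, 0.0])  # toy vector for the source word "one"
z = x @ W
sims = target @ z / (np.linalg.norm(target, axis=1) * np.linalg.norm(z))
translated = target_words[int(np.argmax(sims))]  # "uno"
```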

For details, see the notebook (translation_matrix.ipynb).

>>> transmat = TranslationMatrix(source_lang_vec, target_lang_vec, word_pair)
>>> transmat.train(word_pair)
>>> translated_word = transmat.translate(words, topn=3)
Initialize the model from a list of word pairs. Each word_pair is a tuple
of a source-language word and a target-language word.

Examples: [("one", "uno"), ("two", "due")]

Parameters:
  • word_pairs (list) – a list of word pairs
  • source_lang_vec (KeyedVectors) – the word vectors of the source language
  • target_lang_vec (KeyedVectors) – the word vectors of the target language
apply_transmat(words_space)

Map the source word vectors to the target space using the translation matrix.

Parameters: words_space – a Space object built from the words to be translated

Returns: a Space object containing the mapped word vectors
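In effect, apply_transmat multiplies every row of the Space's matrix by the learned translation matrix in a single step; a hypothetical numpy sketch (the 2x2 matrix W below is made up):

```python
import numpy as np

# A made-up learned translation matrix.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Rows of the source Space's matrix: one word vector per row.
source_mat = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 3.0]])

# Map all source vectors to the target space with one matrix product.
mapped_mat = source_mat @ W  # each row is the translated vector
```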
classmethod load(*args, **kwargs)

Load a previously trained translation matrix model.

save(*args, **kwargs)

Save the model to a file, ignoring the source_space and target_space attributes.

train(word_pairs)

Build the translation matrix that maps the source space to the target space.

Parameters: word_pairs (list) – a list of word pairs
Returns: the translation matrix that maps the source language to the target language
translate(source_words, topn=5, gc=0, sample_num=None, source_lang_vec=None, target_lang_vec=None)

Translate words from the source language into the target language and return the topn most similar target words.

Parameters:
  • source_words (str/list) – a single word or a list of words to be translated
  • topn – return the top N most similar words; defaults to 5
  • gc – selects the retrieval method; by default (gc=0), standard nearest-neighbour retrieval is used, otherwise the globally corrected neighbour retrieval method described in [1]
  • sample_num (int) – the number of words to sample from the source lexicon; must be provided when gc=1
  • source_lang_vec – the source-language vectors to use for translation; defaults to the model's source-language vectors
  • target_lang_vec – the target-language vectors to use when retrieving the most similar words; defaults to the model's target-language vectors

Returns: an OrderedDict in which each item is (word, topn translated words)

[1] Dinu, Georgiana, Angeliki Lazaridou, and Marco Baroni. “Improving zero-shot learning by mitigating the hubness problem.” arXiv preprint arXiv:1412.6568 (2014).