models.utils_any2vec – Utils for any2vec models

General functions used for any2vec models.

One of the goals of this module is to provide an abstraction over the Cython extensions for FastText. If they are not available, then the module substitutes slower Python versions in their place.

Another related set of FastText functionality is computing ngrams for a word. The compute_ngrams() and compute_ngrams_bytes() functions achieve that.
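As a rough illustration of what computing ngrams for a word involves, here is a minimal pure-Python sketch. It assumes the FastText convention of wrapping the word in `<` and `>` boundary markers before extracting every substring of length minn through maxn. The name `sketch_char_ngrams` is hypothetical and not part of this module; the real functions may differ in ordering and edge cases.

```python
def sketch_char_ngrams(word, minn, maxn):
    """Return character ngrams of the word wrapped in '<' and '>' markers.

    Illustrative only: assumes FastText-style boundary markers.
    """
    extended = "<" + word + ">"
    ngrams = []
    # Never ask for ngrams longer than the wrapped word itself.
    for n in range(minn, min(len(extended), maxn) + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


print(sketch_char_ngrams("cat", 3, 4))
# ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

The boundary markers let the model distinguish a prefix such as `<ca` from the same letters appearing in the middle of a longer word.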

Closely related is the functionality for hashing ngrams, implemented by the ft_hash_bytes() and ft_hash_broken() functions. The module exposes “working” and “broken” hash functions in order to maintain backwards compatibility with older versions of Gensim.

For compatibility with older Gensim, use compute_ngrams() and ft_hash_broken() to hash each ngram. For compatibility with the current Facebook implementation, use compute_ngrams_bytes() and ft_hash_bytes().
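The difference between a byte-based and a character-based hash can be sketched in pure Python. This sketch assumes the "working" hash is the 32-bit FNV-1a variant used by Facebook's FastText, applied to UTF-8 bytes with each byte sign-extended as a C `int8_t`, and that the "broken" variant differs by hashing Unicode code points instead. Neither assumption is stated on this page, and the names `sketch_hash_bytes` and `sketch_hash_codepoints` are hypothetical.

```python
FNV_OFFSET = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime
MASK32 = 0xFFFFFFFF      # keep intermediate values in uint32 range


def sketch_hash_bytes(data):
    """FNV-1a over bytes, sign-extending each byte (assumed FastText behaviour)."""
    h = FNV_OFFSET
    for b in data:
        signed = b - 256 if b >= 128 else b  # emulate a cast to int8
        h = h ^ (signed & MASK32)            # emulate a cast back to uint32
        h = (h * FNV_PRIME) & MASK32
    return h


def sketch_hash_codepoints(text):
    """FNV-1a over Unicode code points (assumed legacy behaviour)."""
    h = FNV_OFFSET
    for ch in text:
        h = h ^ ord(ch)
        h = (h * FNV_PRIME) & MASK32
    return h


# ASCII input: both variants agree.
print(sketch_hash_bytes(b"a") == sketch_hash_codepoints("a"))
# Non-ASCII input: the variants diverge.
print(sketch_hash_bytes("é".encode("utf-8")) == sketch_hash_codepoints("é"))
```

Under these assumptions the two variants only disagree on non-ASCII text, which is why the choice matters mainly when loading models trained on such vocabularies.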

gensim.models.utils_any2vec.ft_ngram_hashes(word, minn, maxn, num_buckets, fb_compatible=True)

Calculate the ngrams of the word and hash them.

Parameters:
  • word (str) – The word to calculate ngram hashes for.
  • minn (int) – Minimum ngram length.
  • maxn (int) – Maximum ngram length.
  • num_buckets (int) – The number of hash buckets.
  • fb_compatible (boolean, optional) – True for compatibility with the Facebook implementation. False for compatibility with the old Gensim implementation.
Returns:
  A list of hashes (integers), one per detected ngram.
Return type:
  list of int
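To show how the pieces might fit together, the following pure-Python sketch extracts ngrams, hashes each one, and maps the hash into a bucket. It is not the library implementation: the boundary markers, the FNV-1a style hash over UTF-8 bytes, and the modulo reduction by num_buckets are all assumptions, and `sketch_ngram_hashes` is a hypothetical name. The real ft_ngram_hashes() may return values in a different order.

```python
def sketch_ngram_hashes(word, minn, maxn, num_buckets):
    """Return one bucket index per character ngram of the word (illustrative)."""
    extended = "<" + word + ">"  # assumed FastText-style boundary markers
    hashes = []
    for n in range(minn, min(len(extended), maxn) + 1):
        for start in range(len(extended) - n + 1):
            ngram = extended[start:start + n].encode("utf-8")
            h = 2166136261  # 32-bit FNV offset basis
            for b in ngram:
                signed = b - 256 if b >= 128 else b  # emulate int8 cast
                h = ((h ^ (signed & 0xFFFFFFFF)) * 16777619) & 0xFFFFFFFF
            # Assumed: the raw hash is reduced to a bucket index.
            hashes.append(h % num_buckets)
    return hashes


buckets = sketch_ngram_hashes("cat", 3, 4, 2000000)
print(len(buckets))  # 5
```

Each returned integer can then serve as a row index into a table of ngram vectors with num_buckets rows.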