models._utils_any2vec – Cython utils for any2vec models

General functions used for any2vec models.

gensim.models._utils_any2vec.compute_ngrams(word, unsigned int min_n, unsigned int max_n)

Get the list of all character ngrams of a given word with lengths from min_n to max_n.

Parameters:
  • word (str) – The word whose ngrams need to be computed.
  • min_n (unsigned int) – Minimum character length of the ngrams.
  • max_n (unsigned int) – Maximum character length of the ngrams.
Returns:

Sequence of character ngrams.

Return type:

list of str
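
To illustrate what such a function returns, here is a minimal pure-Python sketch. It is not gensim's Cython code: the name `char_ngrams` is hypothetical, and the sketch assumes the word is first wrapped in the `<` and `>` boundary markers used by fastText-style subword models.

```python
def char_ngrams(word, min_n, max_n):
    """Hypothetical pure-Python sketch of character ngram extraction."""
    # Assumption: the word is wrapped in '<' and '>' boundary markers first.
    extended = "<" + word + ">"
    ngrams = []
    # Lengths run from min_n to max_n, capped at the extended word's length.
    for n in range(min_n, min(len(extended), max_n) + 1):
        # Slide a window of n characters across the extended word.
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


print(char_ngrams("cat", 3, 4))
```

Under these assumptions the ngrams come out grouped by length, shortest first.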

gensim.models._utils_any2vec.compute_ngrams_bytes(word, unsigned int min_n, unsigned int max_n)

Compute the ngrams of a word, with each ngram represented as bytes.

Ported from the original Facebook (FB) implementation.

Parameters:
  • word (str) – A unicode string.
  • min_n (unsigned int) – The minimum ngram length.
  • max_n (unsigned int) – The maximum ngram length.
Returns:

A list of ngrams, where each ngram is a sequence of bytes.

Return type:

list of bytes

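
The byte-oriented variant can be sketched in pure Python as follows. This is an illustration rather than gensim's actual code: the name `byte_ngrams` is hypothetical, and the sketch assumes the word is wrapped in `<` and `>`, encoded as UTF-8, and that ngram length is counted in characters, so multi-byte characters are never split.

```python
def byte_ngrams(word, min_n, max_n):
    """Hypothetical pure-Python sketch of UTF-8 aware ngram extraction."""
    # Assumption: wrap the word in boundary markers, then encode as UTF-8.
    data = ("<" + word + ">").encode("utf-8")
    size = len(data)

    def is_continuation(byte):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return (byte & 0xC0) == 0x80

    ngrams = []
    for i in range(size):
        if is_continuation(data[i]):
            continue  # an ngram must start at the first byte of a character
        j, n = i, 1
        while j < size and n <= max_n:
            j += 1
            # Extend the slice to the end of the current character.
            while j < size and is_continuation(data[j]):
                j += 1
            # Skip single-character ngrams that are only '<' or '>'.
            if n >= min_n and not (n == 1 and (i == 0 or j == size)):
                ngrams.append(data[i:j])
            n += 1
    return ngrams


print(byte_ngrams("cat", 3, 4))
```

In this sketch the ngrams are grouped by start position rather than by length, and each one is a `bytes` object that can be hashed directly.
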
gensim.models._utils_any2vec.ft_hash_broken(unicode string)

Calculate hash based on string.

This implementation is broken; see https://github.com/RaRe-Technologies/gensim/issues/2059. It is kept only for backwards compatibility with older models.

Parameters: string (unicode) – The string whose hash needs to be calculated.
Returns: The hash of the string.
Return type: unsigned int

gensim.models._utils_any2vec.ft_hash_bytes(bytes bytez)

Calculate a hash of bytez, reproducing the hash method from the Facebook fastText implementation.

Parameters: bytez (bytes) – The string whose hash needs to be calculated, encoded as UTF-8.
Returns: The hash of the string.
Return type: unsigned int
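
As a rough illustration of this kind of hash, the sketch below implements a 32-bit FNV-1a style hash over bytes in pure Python. It is an assumption-laden sketch, not gensim's Cython code: the name `fnv1a_32_signed` is hypothetical, and it assumes each byte is treated as a signed 8-bit value before being mixed in, as the fastText C++ code is understood to do.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime
MASK_32 = 0xFFFFFFFF


def fnv1a_32_signed(data):
    """Hypothetical sketch: FNV-1a over bytes with signed-byte handling."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        # Assumption: reinterpret the byte as a signed 8-bit integer, then
        # sign-extend it to 32 bits before XOR-ing it into the state.
        signed = byte - 256 if byte >= 128 else byte
        h ^= signed & MASK_32
        # Multiply by the FNV prime, keeping only the low 32 bits.
        h = (h * FNV_PRIME) & MASK_32
    return h


print(fnv1a_32_signed("<ca".encode("utf-8")))
```

For pure ASCII input the sign handling has no effect, so the result matches plain 32-bit FNV-1a. It differs only for bytes of 128 and above, which occur in UTF-8 encodings of non-ASCII characters.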