models._utils_any2vec – Cython utils for any2vec models

General functions used for any2vec models.

gensim.models._utils_any2vec.compute_ngrams(word, unsigned int min_n, unsigned int max_n)

Get the list of all possible ngrams for a given word.

Parameters
  • word (str) – The word whose ngrams need to be computed.

  • min_n (unsigned int) – Minimum character length of the ngrams.

  • max_n (unsigned int) – Maximum character length of the ngrams.

Returns

Sequence of character ngrams.

Return type

list of str
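As a rough illustration of what "all possible ngrams" means here, the following pure-Python sketch enumerates character ngrams of every length from min_n to max_n. It is not the Cython implementation. The function name `compute_ngrams_sketch` is hypothetical, and wrapping the word in `<` and `>` boundary markers is an assumption borrowed from the fastText convention, which the description above does not state.

```python
def compute_ngrams_sketch(word, min_n, max_n):
    """Hypothetical pure-Python sketch of character ngram extraction."""
    # Assumption: the word is wrapped in fastText-style boundary markers.
    extended = "<" + word + ">"
    ngrams = []
    # Never ask for ngrams longer than the extended word itself.
    longest = min(len(extended), max_n)
    for length in range(min_n, longest + 1):
        # Slide a window of the current length across the extended word.
        for start in range(len(extended) - length + 1):
            ngrams.append(extended[start:start + length])
    return ngrams


if __name__ == "__main__":
    print(compute_ngrams_sketch("ab", 2, 3))
```

Under these assumptions the ngrams come out grouped by length, shortest first, and the boundary markers let a model tell a prefix such as `<a` apart from the same characters in the middle of a word.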

gensim.models._utils_any2vec.compute_ngrams_bytes(word, unsigned int min_n, unsigned int max_n)

Compute ngrams for a word.

Ported from the original Facebook fastText implementation.

Parameters
  • word (str) – A unicode string.

  • min_n (unsigned int) – The minimum ngram length.

  • max_n (unsigned int) – The maximum ngram length.

Returns

A list of ngrams, where each ngram is a list of bytes.

Return type

list of str
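One way to picture the byte-level variant is the sketch below, which walks the UTF-8 encoding of the word and counts ngram length in characters rather than bytes. It is a reading of the approach used by fastText, not the code in this module. The function name `compute_ngrams_bytes_sketch`, the `<` and `>` boundary markers, and the rule that skips single-character ngrams at the word boundaries are all assumptions.

```python
def compute_ngrams_bytes_sketch(word, min_n, max_n):
    """Hypothetical sketch: character ngrams returned as UTF-8 bytes."""
    # Assumption: fastText-style boundary markers around the word.
    data = ("<" + word + ">").encode("utf-8")
    size = len(data)

    def is_continuation(byte):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return (byte & 0xC0) == 0x80

    ngrams = []
    for i in range(size):
        # An ngram may only start on the first byte of a character.
        if is_continuation(data[i]):
            continue
        j, n = i, 1
        while j < size and n <= max_n:
            j += 1
            # Swallow the remaining bytes of a multi-byte character.
            while j < size and is_continuation(data[j]):
                j += 1
            # Assumption: lone boundary markers are not emitted as ngrams.
            if n >= min_n and not (n == 1 and (i == 0 or j == size)):
                ngrams.append(data[i:j])
            n += 1
    return ngrams


if __name__ == "__main__":
    print(compute_ngrams_bytes_sketch("ab", 2, 3))
```

In this sketch the ngrams are grouped by start position, and each one is a `bytes` object that can be passed directly to a byte-oriented hash function.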

gensim.models._utils_any2vec.ft_hash_broken(unicode string)

Calculate hash based on string.

This implementation is broken, see https://github.com/RaRe-Technologies/gensim/issues/2059. It is here only for maintaining backwards compatibility with older models.

Parameters

string (unicode) – The string whose hash needs to be calculated.

Returns

The hash of the string.

Return type

unsigned int

gensim.models._utils_any2vec.ft_hash_bytes(bytes bytez)

Calculate the hash of bytez, reproducing the hash method of the Facebook fastText implementation.

Parameters

bytez (bytes) – The string whose hash needs to be calculated, encoded as UTF-8.

Returns

The hash of the string.

Return type

unsigned int
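For orientation, the fastText hash is commonly described as a 32-bit FNV-1a variant in which each byte is interpreted as a signed 8-bit value before it is mixed in. The sketch below follows that description in pure Python. The function name `ft_hash_bytes_sketch` is hypothetical, and the constants and the sign-extension step are assumptions about the reference behaviour rather than something this page states.

```python
def ft_hash_bytes_sketch(bytez):
    """Hypothetical sketch of an FNV-1a style 32-bit hash over bytes."""
    mask = 0xFFFFFFFF           # keep every intermediate value in 32 bits
    h = 2166136261              # FNV-1a 32-bit offset basis
    for b in bytez:
        # Assumption: bytes >= 0x80 are sign-extended, as a C int8_t cast would do.
        if b >= 0x80:
            b -= 0x100
        h ^= b & mask
        h = (h * 16777619) & mask   # FNV 32-bit prime
    return h


if __name__ == "__main__":
    word_bytes = "<a".encode("utf-8")
    print(ft_hash_bytes_sketch(word_bytes))
```

For pure ASCII input the sign extension has no effect, so the result matches ordinary FNV-1a; the two only diverge for bytes at or above 0x80, which is where non-ASCII text ends up after UTF-8 encoding.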