models.base_any2vec – Base classes for any2vec models

This module contains base classes required for implementing *2vec algorithms.

The class hierarchy is designed to facilitate adding more concrete implementations for creating embeddings. In the most general case, the purpose of these classes is to transform an arbitrary representation into a numerical vector (embedding). This is represented by the base class BaseAny2VecModel. The input space in most cases (in the NLP field at least) is plain text. For this reason, we enrich the class hierarchy with the abstract BaseWordEmbeddingsModel to be used as a base for models where the input space is text.

Notes

Even though text input is the usual case, not all embeddings transform text; for example, PoincareModel embeds graphs.

See also

Word2Vec

Word2Vec model - embeddings for words.

FastText

FastText model - embeddings for words (ngram-based).

Doc2Vec

Doc2Vec model - embeddings for documents.

PoincareModel

Poincare model - embeddings for graphs.
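Examples

In practice you instantiate one of the concrete subclasses rather than these base classes. A minimal sketch using Word2Vec (gensim 3.x API, where this module lives; the toy corpus and parameter values are illustrative):

    from gensim.models import Word2Vec  # concrete subclass of BaseWordEmbeddingsModel

    # A corpus is an iterable of tokenized sentences (list of list of str).
    sentences = [["human", "interface", "computer"],
                 ["survey", "user", "computer", "system"]]

    # `size` is the gensim 3.x name for the embedding dimensionality
    # (`vector_size` in the base-class signatures below).
    model = Word2Vec(sentences, size=50, min_count=1, workers=2, iter=5)
    vector = model.wv["computer"]  # the learned 50-dimensional embedding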

class gensim.models.base_any2vec.BaseAny2VecModel(workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000)

Bases: gensim.utils.SaveLoad

Base class for training, using and evaluating *2vec models.

Contains the implementation for multi-threaded training. The purpose of this class is to provide a reference interface for concrete embedding implementations, whether the input space is a corpus of words, documents or anything else. At the same time, functionality that we expect to be common across those implementations is provided here to avoid code duplication.

In the special but common case where the input space consists of words, a more specialized layer is provided; consider inheriting from BaseWordEmbeddingsModel.

Notes

A subclass should initialize the following attributes:

Parameters
  • workers (int, optional) – Number of working threads, used for multithreading.

  • vector_size (int, optional) – Dimensionality of the feature vectors.

  • epochs (int, optional) – Number of iterations (epochs) of training through the corpus.

  • callbacks (list of CallbackAny2Vec, optional) – List of callbacks that need to be executed/run at specific stages during training.

  • batch_words (int, optional) – Number of words to be processed by a single job.
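A subclass sets the attributes above in its constructor and overrides the internal training hooks. The following is a minimal, non-functional sketch: the hook names (_get_job_params, _do_train_job, _raw_word_count) follow the gensim 3.x source, but treat the exact signatures and semantics as assumptions and consult the source before relying on them:

    from gensim.models.base_any2vec import BaseAny2VecModel

    class ToyAny2Vec(BaseAny2VecModel):
        """Hypothetical subclass sketch, not a complete trainable model."""

        def _get_job_params(self, cur_epoch):
            # Parameters handed to each training job, e.g. the current learning rate.
            return None

        def _do_train_job(self, data_iterable, job_params, thread_private_mem):
            # Train on one chunk of input; report (examples_processed, words_processed).
            examples = list(data_iterable)
            return len(examples), sum(len(e) for e in examples)

        def _raw_word_count(self, job):
            # Raw item count in a job, used for progress logging.
            return sum(len(e) for e in job)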

classmethod load(fname_or_handle, **kwargs)

Load a previously saved object (using gensim.models.base_any2vec.BaseAny2VecModel.save()) from a file.

Parameters
  • fname_or_handle ({str, file-like object}) – Path to file that contains needed object or handle to an open file.

  • **kwargs (object) – Keyword arguments propagated to load().

See also

save()

Method for saving a model.

Returns

Object loaded from fname_or_handle.

Return type

object

Raises

IOError – When the method is called on an instance rather than on the class (it is a classmethod).

save(fname_or_handle, **kwargs)

Save the object to file.

Parameters
  • fname_or_handle ({str, file-like object}) – Path to file where the model will be persisted.

  • **kwargs (object) – Keyword arguments propagated to save().

See also

load()

Method for loading a saved model.
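
Examples

A typical save/load round trip on a concrete subclass (a sketch; Word2Vec stands in for any model inheriting from this class):

    from gensim.models import Word2Vec
    from gensim.test.utils import common_texts, get_tmpfile

    path = get_tmpfile("toy_model")  # temporary file, just for the example
    model = Word2Vec(common_texts, size=50, min_count=1)
    model.save(path)                 # persist the model to disk
    loaded = Word2Vec.load(path)     # classmethod: call it on the class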

train(data_iterable=None, corpus_file=None, epochs=None, total_examples=None, total_words=None, queue_factor=2, report_delay=1.0, callbacks=(), **kwargs)

Train the model for multiple epochs using multiple workers.

Parameters
  • data_iterable (iterable of list of object) – The input corpus. This will be split into chunks, and these chunks will be pushed to the queue.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. If you use this argument instead of data_iterable, you must also provide the total_words argument.

  • epochs (int, optional) – Number of epochs (training iterations over the whole input) of training.

  • total_examples (int, optional) – Count of objects in the data_iterable. In the usual case this would correspond to the number of sentences in a corpus, used to log progress.

  • total_words (int, optional) – Count of total objects in the data_iterable. In the usual case this would correspond to the number of raw words in a corpus, used to log progress.

  • queue_factor (int, optional) – Multiplier for size of queue -> size = number of workers * queue_factor.

  • report_delay (float, optional) – Number of seconds between two consecutive progress report messages in the logger.

  • callbacks (list of CallbackAny2Vec, optional) – List of callbacks to execute at specific stages during training.

  • **kwargs (object) – Additional keyword parameters for the specific model inheriting from this class.

Returns

The total training report, consisting of two elements:
  • Size of total data processed, for example the number of sentences in the whole corpus.

  • Effective word count used in training (after ignoring unknown words and trimming the sentence length).

Return type

(int, int)
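
Examples

A sketch of the corpus_file code path on a concrete subclass (gensim 3.x; "corpus.txt" is a hypothetical file in LineSentence format, and corpus_total_words is assumed to be populated by build_vocab()):

    from gensim.models import Word2Vec

    model = Word2Vec(size=50, min_count=1)       # model left uninitialized
    model.build_vocab(corpus_file="corpus.txt")  # one scan over the corpus
    model.train(corpus_file="corpus.txt",
                total_examples=model.corpus_count,
                total_words=model.corpus_total_words,  # required with corpus_file
                epochs=model.epochs)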

class gensim.models.base_any2vec.BaseWordEmbeddingsModel(sentences=None, corpus_file=None, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000, trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, min_alpha=0.0001, compute_loss=False, **kwargs)

Bases: gensim.models.base_any2vec.BaseAny2VecModel

Base class containing common methods for training, using & evaluating word embeddings learning models.

See also

Word2Vec

Word2Vec model - embeddings for words.

FastText

FastText model - embeddings for words (ngram-based).

Doc2Vec

Doc2Vec model - embeddings for documents.

PoincareModel

Poincare model - embeddings for graphs.

Parameters
  • sentences (iterable of list of str, optional) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence for such examples.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of sentences or corpus_file needs to be passed (or neither, in which case the model is left uninitialized).

  • workers (int, optional) – Number of working threads, used for multithreading.

  • vector_size (int, optional) – Dimensionality of the feature vectors.

  • epochs (int, optional) – Number of iterations (epochs) of training through the corpus.

  • callbacks (list of CallbackAny2Vec, optional) – List of callbacks that need to be executed/run at specific stages during training.

  • batch_words (int, optional) – Number of words to be processed by a single job.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during the current method call and is not stored as part of the model. A sketch of such a callable appears after this parameter list.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • sg ({1, 0}, optional) – Defines the training algorithm. If 1, skip-gram is used, otherwise, CBOW is employed.

  • alpha (float, optional) – The initial learning rate. It will decrease linearly with iterations until it reaches min_alpha.

  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.

  • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.

  • hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.

  • negative (int, optional) – If > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.

  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words.

  • cbow_mean ({1,0}, optional) – If 0, use the sum of the context word vectors; if 1, use the mean. Only applies when CBOW is used.

  • min_alpha (float, optional) – Final learning rate. Drops linearly with the number of iterations from alpha.

  • compute_loss (bool, optional) – If True, loss will be computed while training the Word2Vec model and stored in running_training_loss attribute.

  • **kwargs (object) – Keyword arguments needed to allow subclasses to accept additional arguments.
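The trim_rule parameter accepts a callable with the signature documented above; a minimal sketch (the stopword choice is purely illustrative):

    from gensim.utils import RULE_DEFAULT, RULE_DISCARD

    def my_trim_rule(word, count, min_count):
        # Hypothetical rule: always discard "the", defer to min_count otherwise.
        if word == "the":
            return RULE_DISCARD
        return RULE_DEFAULT

Such a callable can then be passed to the constructor or to build_vocab() via trim_rule=my_trim_rule.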

build_vocab(sentences=None, corpus_file=None, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)

Build vocabulary from a sequence of sentences (can be a once-only generator stream).

Parameters
  • sentences (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence module for such examples.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of sentences or corpus_file needs to be passed (not both).

  • update (bool) – If True, the new words in sentences will be added to the model’s vocab.

  • progress_per (int, optional) – Indicates how many words to process before showing/updating the progress.

  • keep_raw_vocab (bool, optional) – If False, the raw vocabulary will be deleted after the scaling is done to free up RAM.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during the current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • **kwargs (object) – Keyword arguments propagated to self.vocabulary.prepare_vocab.
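
Examples

A sketch of an initial vocabulary scan followed by an incremental update (toy data, gensim 3.x API):

    from gensim.models import Word2Vec
    from gensim.test.utils import common_texts

    model = Word2Vec(size=50, min_count=1)  # no corpus: model left uninitialized
    model.build_vocab(common_texts)         # initial vocabulary scan
    model.build_vocab([["brand", "new", "words"]], update=True)  # add new words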

build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)

Build vocabulary from a dictionary of word frequencies.

Parameters
  • word_freq (dict of (str, int)) – A mapping from a word in the vocabulary to its frequency count.

  • keep_raw_vocab (bool, optional) – If False, delete the raw vocabulary after the scaling is done to free up RAM.

  • corpus_count (int, optional) – Even if no corpus is provided, this argument can set corpus_count explicitly.

  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during the current method call and is not stored as part of the model.

    The input parameters are of the following types:
    • word (str) - the word we are examining

    • count (int) - the word’s frequency count in the corpus

    • min_count (int) - the minimum count threshold.

  • update (bool, optional) – If True, the newly provided words in the word_freq dict will be added to the model’s vocab.
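
Examples

A sketch using precomputed word frequencies instead of scanning a corpus (the counts are made up):

    from gensim.models import Word2Vec

    model = Word2Vec(size=50, min_count=1)
    model.build_vocab_from_freq({"hello": 10, "world": 7, "rare": 1})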

property cum_table

doesnt_match(words)

Deprecated, use self.wv.doesnt_match() instead.

Refer to the documentation for doesnt_match().

estimate_memory(vocab_size=None, report=None)

Estimate required memory for a model using current settings and provided vocabulary size.

Parameters
  • vocab_size (int, optional) – Number of unique tokens in the vocabulary.

  • report (dict of (str, int), optional) – A dictionary from string representations of the model’s memory consuming members to their size in bytes.

Returns

A dictionary from string representations of the model’s memory consuming members to their size in bytes.

Return type

dict of (str, int)
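
Examples

A sketch (the exact report keys depend on the concrete model):

    from gensim.models import Word2Vec

    model = Word2Vec(size=100)  # an uninitialized model is enough for an estimate
    report = model.estimate_memory(vocab_size=100000)
    # `report` maps memory-consuming members to sizes in bytes,
    # including an aggregate 'total' entry.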

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)

Deprecated, use self.wv.evaluate_word_pairs() instead.

Refer to the documentation for evaluate_word_pairs().

property hashfxn

property iter

property layer1_size

classmethod load(*args, **kwargs)

Load a previously saved object (using save()) from a file.

Also initializes extra instance attributes in case the loaded model does not include them. *args or **kwargs MUST include the fname argument (path to saved file). See load().

Parameters
  • *args (object) – Positional arguments passed to load().

  • **kwargs (object) – Keyword arguments passed to load().

See also

save()

Method for saving a model.

Returns

Model loaded from disk.

Return type

BaseWordEmbeddingsModel

Raises

IOError – When the method is called on an instance rather than on the class (it is a classmethod).

property min_count

most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)

Deprecated, use self.wv.most_similar() instead.

Refer to the documentation for most_similar().

most_similar_cosmul(positive=None, negative=None, topn=10)

Deprecated, use self.wv.most_similar_cosmul() instead.

Refer to the documentation for most_similar_cosmul().

n_similarity(ws1, ws2)

Deprecated, use self.wv.n_similarity() instead.

Refer to the documentation for n_similarity().

property sample

save(fname_or_handle, **kwargs)

Save the object to file.

Parameters
  • fname_or_handle ({str, file-like object}) – Path to file where the model will be persisted.

  • **kwargs (object) – Keyword arguments propagated to save().

See also

load()

Method for loading a saved model.

similar_by_vector(vector, topn=10, restrict_vocab=None)

Deprecated, use self.wv.similar_by_vector() instead.

Refer to the documentation for similar_by_vector().

similar_by_word(word, topn=10, restrict_vocab=None)

Deprecated, use self.wv.similar_by_word() instead.

Refer to the documentation for similar_by_word().

similarity(w1, w2)

Deprecated, use self.wv.similarity() instead.

Refer to the documentation for similarity().

property syn0_lockf

property syn1

property syn1neg

train(sentences=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=(), **kwargs)

Train the model. If the hyper-parameters are passed, they override the ones set in the constructor.

Parameters
  • sentences (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence module for such examples.

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of sentences or corpus_file needs to be passed (not both).

  • total_examples (int, optional) – Count of sentences.

  • total_words (int, optional) – Count of raw words in sentences.

  • epochs (int, optional) – Number of iterations (epochs) over the corpus.

  • start_alpha (float, optional) – Initial learning rate.

  • end_alpha (float, optional) – Final learning rate. Drops linearly with the number of iterations from start_alpha.

  • word_count (int, optional) – Count of words already trained. Leave this to 0 for the usual case of training on all words in sentences.

  • queue_factor (int, optional) – Multiplier for size of queue -> size = number of workers * queue_factor.

  • report_delay (float, optional) – Seconds to wait before reporting progress.

  • compute_loss (bool, optional) – If True, loss will be computed while training the Word2Vec model and stored in running_training_loss.

  • callbacks (list of CallbackAny2Vec, optional) – List of callbacks that need to be executed/run at specific stages during training.

  • **kwargs (object) – Additional keyword parameters for the specific model inheriting from this class.

Returns

Tuple of (effective word count after ignoring unknown words and sentence length trimming, total word count).

Return type

(int, int)
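
Examples

The canonical pattern on a concrete subclass (a sketch using the gensim 3.x Word2Vec API and a bundled toy corpus):

    from gensim.models import Word2Vec
    from gensim.test.utils import common_texts

    model = Word2Vec(size=50, min_count=1)  # hyper-parameters set in constructor
    model.build_vocab(common_texts)         # vocabulary must exist before train()
    model.train(common_texts,
                total_examples=model.corpus_count,
                epochs=model.epochs)        # returns (effective words, total words)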

wmdistance(document1, document2)

Deprecated, use self.wv.wmdistance() instead.

Refer to the documentation for wmdistance().