models.base_any2vec – Base classes for any2vec models
This module contains base classes required for implementing *2vec algorithms.
The class hierarchy is designed to facilitate adding more concrete implementations for creating embeddings.
In the most general case, the purpose of these classes is to transform an arbitrary representation into a numerical vector
(embedding). This role is captured by the base class BaseAny2VecModel. The input space in
most cases (in the NLP field at least) is plain text. For this reason, the class hierarchy also provides the abstract
BaseWordEmbeddingsModel, to be used as a base for models whose input space is text.
Notes
Even though text is the usual case, not all embeddings transform text; for example, PoincareModel embeds graphs.
See also
Word2Vec – Word2Vec model: embeddings for words.
FastText – FastText model: embeddings for words (ngram-based).
Doc2Vec – Doc2Vec model: embeddings for documents.
PoincareModel – Poincare model: embeddings for graphs.
class gensim.models.base_any2vec.BaseAny2VecModel(workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000)
Bases: gensim.utils.SaveLoad
Base class for training, using and evaluating *2vec models.
Contains implementation for multi-threaded training. The purpose of this class is to provide a reference interface for concrete embedding implementations, whether the input space is a corpus of words, documents or anything else. At the same time, functionality that we expect to be common for those implementations is provided here to avoid code duplication.
In the special but usual case where the input space consists of words, a more specialized layer
is provided; consider inheriting from BaseWordEmbeddingsModel.
Notes
A subclass should initialize the following attributes:
self.kv - keyed vectors in the model (see Word2VecKeyedVectors as an example)
self.vocabulary - vocabulary (see Word2VecVocab as an example)
self.trainables - internal matrices (see Word2VecTrainables as an example)
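For orientation, here is a minimal sketch of how these three attributes show up on a concrete subclass (Word2Vec, in the gensim release this page documents); the tiny corpus is purely illustrative:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]
model = Word2Vec(sentences, min_count=1)

print(type(model.wv))          # keyed vectors (the self.kv role; named wv on Word2Vec)
print(type(model.vocabulary))  # vocabulary statistics (Word2VecVocab)
print(type(model.trainables))  # internal weight matrices (Word2VecTrainables)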
Parameters
workers (int, optional) – Number of worker threads, used for multithreading.
vector_size (int, optional) – Dimensionality of the feature vectors.
epochs (int, optional) – Number of iterations (epochs) of training through the corpus.
callbacks (list of CallbackAny2Vec, optional) – List of callbacks to execute at specific stages during training.
batch_words (int, optional) – Number of words to be processed by a single job.
classmethod load(fname_or_handle, **kwargs)
Load a previously saved object (using gensim.models.base_any2vec.BaseAny2VecModel.save()) from a file.
Parameters
fname_or_handle ({str, file-like object}) – Path to a file that contains the needed object, or a handle to an open file.
**kwargs (object) – Keyword arguments propagated to load().
See also
save() – Method for saving a model.
Returns
Object loaded from fname_or_handle.
Return type
object
Raises
IOError – When the method is called on an instance (it should be called on the class; this is a class method).
save(fname_or_handle, **kwargs)
Save the object to a file.
Parameters
fname_or_handle ({str, file-like object}) – Path to the file where the model will be persisted.
**kwargs (object) – Keyword arguments propagated to save().
See also
load() – Method for loading a model saved by this method.
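A minimal sketch of the save/load round trip using Word2Vec as the concrete subclass; the output path "any2vec.model" is only illustrative:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]
model = Word2Vec(sentences, min_count=1)

model.save("any2vec.model")                # instance method: persist the model to disk
restored = Word2Vec.load("any2vec.model")  # class method: restore it later
print(restored.wv.most_similar("hello", topn=1))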
train(data_iterable=None, corpus_file=None, epochs=None, total_examples=None, total_words=None, queue_factor=2, report_delay=1.0, callbacks=(), **kwargs)
Train the model for multiple epochs using multiple workers.
Parameters
data_iterable (iterable of list of object) – The input corpus. It will be split into chunks, and the chunks will be pushed to the queue.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. If you use this argument instead of data_iterable, you must also provide the total_words argument.
epochs (int, optional) – Number of epochs (training iterations over the whole input) of training.
total_examples (int, optional) – Count of objects in the data_iterable. In the usual case this would correspond to the number of sentences in a corpus, used to log progress.
total_words (int, optional) – Count of total objects in the data_iterable. In the usual case this would correspond to the number of raw words in a corpus, used to log progress.
queue_factor (int, optional) – Multiplier for the size of the queue: size = number of workers * queue_factor.
report_delay (float, optional) – Number of seconds between two consecutive progress report messages in the logger.
callbacks (list of CallbackAny2Vec, optional) – List of callbacks to execute at specific stages during training.
**kwargs (object) – Additional keyword parameters for the specific model inheriting from this class.
Returns
The total training report, consisting of two elements:
the size of the total data processed, for example the number of sentences in the whole corpus.
the effective word count used in training (after ignoring unknown words and trimming the sentence length).
Return type
(int, int)
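A sketch of calling train() directly on a concrete subclass (Word2Vec), after building the vocabulary in a separate step; total_examples and epochs are passed explicitly, and the returned pair is the training report described above. The corpus is purely illustrative:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]
model = Word2Vec(min_count=1)   # no corpus given: the model is left uninitialized
model.build_vocab(sentences)    # scan the corpus and build the vocabulary
trained_words, raw_words = model.train(
    sentences, total_examples=model.corpus_count, epochs=5,
)
print(trained_words, raw_words)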
class gensim.models.base_any2vec.BaseWordEmbeddingsModel(sentences=None, corpus_file=None, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000, trim_rule=None, sg=0, alpha=0.025, window=5, seed=1, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, min_alpha=0.0001, compute_loss=False, **kwargs)
Bases: gensim.models.base_any2vec.BaseAny2VecModel
Base class containing common methods for training, using and evaluating word embedding models.
See also
Word2Vec – Word2Vec model: embeddings for words.
FastText – FastText model: embeddings for words (ngram-based).
Doc2Vec – Doc2Vec model: embeddings for documents.
PoincareModel – Poincare model: embeddings for graphs.
Parameters
sentences (iterable of list of str, optional) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence for such examples.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (or neither of them, in which case the model is left uninitialized).
workers (int, optional) – Number of worker threads, used for multithreading.
vector_size (int, optional) – Dimensionality of the feature vectors.
epochs (int, optional) – Number of iterations (epochs) of training through the corpus.
callbacks (list of CallbackAny2Vec, optional) – List of callbacks to execute at specific stages during training.
batch_words (int, optional) – Number of words to be processed by a single job.
trim_rule (function, optional) – Vocabulary trimming rule: specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model.
The input parameters are of the following types:
word (str) - the word we are examining
count (int) - the word's frequency count in the corpus
min_count (int) - the minimum count threshold.
sg ({1, 0}, optional) – Defines the training algorithm. If 1, skip-gram is used, otherwise, CBOW is employed.
alpha (float, optional) – The beginning learning rate. This will linearly reduce with iterations until it reaches min_alpha.
window (int, optional) – The maximum distance between the current and predicted word within a sentence.
seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.
hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
cbow_mean ({1,0}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
min_alpha (float, optional) – Final learning rate. Drops linearly with the number of iterations from alpha.
compute_loss (bool, optional) – If True, loss will be computed while training the Word2Vec model and stored in the running_training_loss attribute.
**kwargs (object) – Keyword arguments needed to allow child classes to accept more arguments.
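These shared parameters are normally passed through a concrete subclass such as Word2Vec. A minimal sketch; the toy corpus and parameter values are only illustrative, and keyword spellings can differ slightly between gensim releases:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]
model = Word2Vec(
    sentences,
    sg=1,          # skip-gram instead of CBOW
    window=5,      # maximum distance between current and predicted word
    negative=5,    # negative sampling with 5 "noise words"
    min_count=1,   # keep every word in this toy corpus
    workers=1,     # single worker thread for a reproducible run
    seed=42,
)
print(model.wv.most_similar("hello", topn=1))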
build_vocab(sentences=None, corpus_file=None, update=False, progress_per=10000, keep_raw_vocab=False, trim_rule=None, **kwargs)
Build vocabulary from a sequence of sentences (can be a once-only generator stream).
Parameters
sentences (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See the BrownCorpus, Text8Corpus or LineSentence module for such examples.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (not both of them).
update (bool) – If true, the new words in sentences will be added to model’s vocab.
progress_per (int, optional) – Indicates how many words to process before showing/updating the progress.
keep_raw_vocab (bool, optional) – If False, the raw vocabulary will be deleted after the scaling is done to free up RAM.
trim_rule (function, optional) – Vocabulary trimming rule: specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model (see the sketch after this parameter list).
The input parameters are of the following types:
word (str) - the word we are examining
count (int) - the word's frequency count in the corpus
min_count (int) - the minimum count threshold.
**kwargs (object) – Keyword arguments propagated to self.vocabulary.prepare_vocab.
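A sketch of build_vocab() with a custom trim_rule. The rule below keeps the word "rare" even though its count falls below min_count, while every other word is handled by the default min_count behaviour; the corpus and rule are only illustrative:

from gensim.models import Word2Vec
from gensim.utils import RULE_KEEP, RULE_DEFAULT

def keep_rare(word, count, min_count):
    # Keep "rare" unconditionally; defer to the min_count default otherwise.
    return RULE_KEEP if word == "rare" else RULE_DEFAULT

sentences = [["common", "common", "rare"], ["common", "words", "words"]]
model = Word2Vec(min_count=2)
model.build_vocab(sentences, trim_rule=keep_rare)
print("rare" in model.wv.vocab)   # True: preserved by the trim rule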
build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)
Build vocabulary from a dictionary of word frequencies.
Parameters
word_freq (dict of (str, int)) – A mapping from a word in the vocabulary to its frequency count.
keep_raw_vocab (bool, optional) – If False, delete the raw vocabulary after the scaling is done to free up RAM.
corpus_count (int, optional) – Even if no corpus is provided, this argument can set corpus_count explicitly.
trim_rule (function, optional) – Vocabulary trimming rule: specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model.
The input parameters are of the following types:
word (str) - the word we are examining
count (int) - the word's frequency count in the corpus
min_count (int) - the minimum count threshold.
update (bool, optional) – If True, the new words provided in the word_freq dict will be added to the model's vocab.
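A sketch of build_vocab_from_freq(), where the vocabulary comes from a precomputed word-to-frequency mapping rather than from scanning a corpus; the counts are only illustrative:

from gensim.models import Word2Vec

word_freq = {"hello": 10, "world": 7, "machine": 3, "learning": 3}
model = Word2Vec(min_count=1)
model.build_vocab_from_freq(word_freq, corpus_count=100)   # corpus_count set explicitly
print(len(model.wv.vocab))   # 4 entries in the vocabulary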
cum_table
doesnt_match(words)
Deprecated, use self.wv.doesnt_match() instead.
Refer to the documentation for doesnt_match().
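The deprecated wrappers in this class simply delegate to the model's keyed vectors; the preferred call goes through model.wv, as in this sketch with an illustrative corpus:

from gensim.models import Word2Vec

sentences = [["cat", "dog", "bird"], ["car", "truck", "bike"], ["cat", "car"]]
model = Word2Vec(sentences, min_count=1)

# Preferred: call through the keyed vectors instead of the deprecated wrapper.
print(model.wv.doesnt_match(["cat", "dog", "car"]))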
estimate_memory(vocab_size=None, report=None)
Estimate the required memory for a model using the current settings and the provided vocabulary size.
Parameters
vocab_size (int, optional) – Number of unique tokens in the vocabulary.
report (dict of (str, int), optional) – A dictionary from string representations of the model's memory-consuming members to their size in bytes.
Returns
A dictionary from string representations of the model's memory-consuming members to their size in bytes.
Return type
dict of (str, int)
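A sketch of estimate_memory() for a hypothetical one-million-word vocabulary under default settings:

from gensim.models import Word2Vec

model = Word2Vec(min_count=1)                      # uninitialized model, default settings
report = model.estimate_memory(vocab_size=1000000)
print(report)   # maps member names (e.g. 'vocab', 'vectors', 'total') to sizes in bytes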
evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)
Deprecated, use self.wv.evaluate_word_pairs() instead.
Refer to the documentation for evaluate_word_pairs().
hashfxn
iter
layer1_size
classmethod load(*args, **kwargs)
Load a previously saved object (using save()) from a file.
Also initializes extra instance attributes in case the loaded model does not include them. *args or **kwargs MUST include the fname argument (the path to the saved file). See load().
See also
save() – Method for saving a model.
Returns
Model loaded from disk.
Raises
IOError – When the method is called on an instance (it should be called on the class).
min_count
most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)
Deprecated, use self.wv.most_similar() instead.
Refer to the documentation for most_similar().
most_similar_cosmul(positive=None, negative=None, topn=10)
Deprecated, use self.wv.most_similar_cosmul() instead.
Refer to the documentation for most_similar_cosmul().
n_similarity(ws1, ws2)
Deprecated, use self.wv.n_similarity() instead.
Refer to the documentation for n_similarity().
sample
save(fname_or_handle, **kwargs)
Save the object to a file.
Parameters
fname_or_handle ({str, file-like object}) – Path to the file where the model will be persisted.
**kwargs (object) – Keyword arguments propagated to save().
See also
load() – Method for loading a model saved by this method.
similar_by_vector(vector, topn=10, restrict_vocab=None)
Deprecated, use self.wv.similar_by_vector() instead.
Refer to the documentation for similar_by_vector().
similar_by_word(word, topn=10, restrict_vocab=None)
Deprecated, use self.wv.similar_by_word() instead.
Refer to the documentation for similar_by_word().
similarity(w1, w2)
Deprecated, use self.wv.similarity() instead.
Refer to the documentation for similarity().
syn0_lockf
syn1
syn1neg
train(sentences=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=(), **kwargs)
Train the model. If the hyper-parameters are passed, they override the ones set in the constructor.
Parameters
sentences (iterable of list of str) – Can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See the BrownCorpus, Text8Corpus or LineSentence module for such examples.
corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (not both of them).
total_examples (int, optional) – Count of sentences.
total_words (int, optional) – Count of raw words in sentences.
epochs (int, optional) – Number of iterations (epochs) over the corpus.
start_alpha (float, optional) – Initial learning rate.
end_alpha (float, optional) – Final learning rate. Drops linearly with the number of iterations from start_alpha.
word_count (int, optional) – Count of words already trained. Leave this to 0 for the usual case of training on all words in sentences.
queue_factor (int, optional) – Multiplier for size of queue -> size = number of workers * queue_factor.
report_delay (float, optional) – Seconds to wait before reporting progress.
compute_loss (bool, optional) – If True, loss will be computed while training the Word2Vec model and stored in running_training_loss.
callbacks (list of CallbackAny2Vec, optional) – List of callbacks to execute at specific stages during training.
**kwargs (object) – Additional keyword parameters for the specific model inheriting from this class.
Returns
Tuple of (effective word count after ignoring unknown words and sentence length trimming, total word count).
Return type
(int, int)
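A sketch of a train() call whose hyper-parameters override the constructor's, with an explicit learning-rate schedule and loss tracking; the corpus and values are only illustrative:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]
model = Word2Vec(min_count=1)
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=model.corpus_count,
    epochs=10,            # overrides the constructor's epoch count for this call
    start_alpha=0.025,    # initial learning rate for this call
    end_alpha=0.0001,     # final learning rate, approached linearly
    compute_loss=True,    # accumulate loss in running_training_loss
)
print(model.get_latest_training_loss())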
wmdistance(document1, document2)
Deprecated, use self.wv.wmdistance() instead.
Refer to the documentation for wmdistance().