models.deprecated.word2vec – Deep learning with word2vec

Warning

Deprecated since version 3.3.0: Use gensim.models.word2vec instead.

Produce word vectors with deep learning via word2vec’s skip-gram and CBOW models, using either hierarchical softmax or negative sampling [1] [2].

NOTE: There are more ways to get word vectors in Gensim than just Word2Vec. See wrappers for FastText, VarEmbed and WordRank.

The training algorithms were originally ported from the C package https://code.google.com/p/word2vec/ and extended with additional functionality.

For a blog tutorial on gensim word2vec, with an interactive web app trained on GoogleNews, visit http://radimrehurek.com/2014/02/word2vec-tutorial/

Make sure you have a C compiler before installing gensim, so that you can use the optimized (compiled) word2vec training (a 70x speedup compared to the plain NumPy implementation [3]).

Initialize a model with e.g.:

>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

Persist a model to disk with:

>>> model.save(fname)
>>> model = Word2Vec.load(fname)  # you can continue training with the loaded model!

The word vectors are stored in a KeyedVectors instance in model.wv. This separates the read-only word vector lookup operations in KeyedVectors from the training code in Word2Vec:

>>> model.wv['computer']  # numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance:

NOTE: It is impossible to continue training the vectors loaded from the C format because the hidden weights, vocabulary frequencies and the binary tree are missing:

>>> from gensim.models.keyedvectors import KeyedVectors
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format
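
To make the C text format more concrete, here is a minimal standard-library sketch of a parser. It assumes the usual word2vec text layout (a header line with the vocabulary size and vector dimensionality, then one word followed by its components per line). `parse_word2vec_text` is a hypothetical helper written for illustration and is not part of gensim; `KeyedVectors.load_word2vec_format` remains the real loader.

```python
import io


def parse_word2vec_text(stream):
    """Parse word2vec C text format from a text stream into {word: [floats]}.

    Hypothetical illustration only, not gensim's loader.
    """
    # Header line: "<vocab_size> <vector_size>"
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError("malformed line: %r" % line)
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    if len(vectors) != vocab_size:
        raise ValueError("header declares %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Tiny in-memory file standing in for /tmp/vectors.txt
sample = io.StringIO("2 3\ncomputer 0.1 0.2 0.3\nhuman -0.1 0.0 0.5\n")
word_vectors = parse_word2vec_text(sample)
```

Only the input vectors appear in such a file, which is why training cannot be resumed from it.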

You can perform various NLP word tasks with the model. Some of them are already built-in:

>>> model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]

>>> model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[('queen', 0.71382287), ...]

>>> model.wv.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

>>> model.wv.similarity('woman', 'man')
0.73723527
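
For intuition, the similarity and analogy queries above can be sketched in plain Python. This is a simplified illustration, assuming that similarity is the cosine between two vectors and that the analogy query ranks words by cosine against the sum of unit-normalized positive vectors minus the negative ones. The toy vectors and the helper names (`unit`, `cosine`, `most_similar`) are made up for this example.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    # Query words themselves are excluded from the results
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Hand-made 3-dimensional toy vectors, not trained embeddings
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}
best = most_similar(toy, positive=["woman", "king"], negative=["man"])
```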

Probability of a text under the model:

>>> model.score(["The fox jumped over a lazy dog".split()])
0.2158356

Correlation with human opinion on word similarity:

>>> model.wv.evaluate_word_pairs(os.path.join(module_path, 'test_data', 'wordsim353.tsv'))
0.51, 0.62, 0.13

And on analogies:

>>> model.wv.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))

and so on.

If you’re finished training a model (i.e. no more updates, only querying), switch to the gensim.models.KeyedVectors instance in wv to trim unneeded model state and use much less RAM:

>>> word_vectors = model.wv
>>> del model

Note that there is a gensim.models.phrases module which lets you automatically detect phrases longer than one word. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:

>>> bigram_transformer = gensim.models.Phrases(sentences)
>>> model = Word2Vec(bigram_transformer[sentences], size=100, ...)

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/

class gensim.models.deprecated.word2vec.BrownCorpus(dirname)¶

Bases: object

Iterate over sentences from the Brown corpus (part of NLTK data).

class gensim.models.deprecated.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)¶

Bases: object

Simple format: one sentence = one line; words already preprocessed and separated by whitespace.

source can be either a string (a path to a file) or a file object. The file is clipped to the first limit lines (or not clipped at all if limit is None, the default).

Example:

sentences = LineSentence('myfile.txt')

Or for compressed files:

sentences = LineSentence('compressed_text.txt.bz2')
sentences = LineSentence('compressed_text.txt.gz')
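
The streaming idea behind LineSentence can be sketched with the standard library alone. `SimpleLineSentence` below is a hypothetical stand-in, not the gensim class: it handles plain text files only, and it assumes that lines longer than max_sentence_length are split into several consecutive chunks.

```python
import itertools
import os
import tempfile


class SimpleLineSentence(object):
    """Stream whitespace-tokenized sentences from a text file, one line at a time.

    Hypothetical illustration; no .bz2/.gz support.
    """

    def __init__(self, path, max_sentence_length=10000, limit=None):
        self.path = path
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Reopening the file on every call makes the object re-iterable,
        # which multi-epoch training relies on.
        with open(self.path, encoding="utf8") as handle:
            for line in itertools.islice(handle, self.limit):
                words = line.split()
                # Assumption: overly long lines are yielded as several chunks
                for start in range(0, len(words), self.max_sentence_length):
                    yield words[start:start + self.max_sentence_length]


# Demo on a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("the quick brown fox\njumped over\nthe lazy dog\n")

corpus = SimpleLineSentence(tmp.name, limit=2)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted by now
os.remove(tmp.name)
```

Because the corpus is never held in memory as a whole, the same pattern works for files much larger than RAM.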

class gensim.models.deprecated.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)¶

Bases: object

Works like word2vec.LineSentence, but will process all files in a directory in alphabetical order by filename. The directory can only contain files that can be read by LineSentence: .bz2, .gz, and text files. Any file not ending with .bz2 or .gz is assumed to be a text file. Does not work with subdirectories.

The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace.

source should be a path to a directory (as a string) where all files can be opened by the LineSentence class. Each file will be read up to limit lines (or not clipped if limit is None, the default).

Example:

sentences = PathLineSentences(os.path.join(os.getcwd(), 'corpus'))

The files in the directory should be either text files, .bz2 files, or .gz files.

class gensim.models.deprecated.word2vec.Text8Corpus(fname, max_sentence_length=10000)¶

Bases: object

Iterate over sentences from the “text8” corpus, unzipped from http://mattmahoney.net/dc/text8.zip .

class gensim.models.deprecated.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False)¶

Bases: gensim.models.deprecated.old_saveload.SaveLoad

Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/

If you’re finished training a model (i.e. no more updates, only querying), switch to the gensim.models.KeyedVectors instance in wv.

The model can be stored/loaded via its save() and load() methods, or stored/loaded in a format compatible with the original word2vec implementation via wv.save_word2vec_format() and KeyedVectors.load_word2vec_format().

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

The sentences iterable can be simply a list, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in this module for such examples.

If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.

sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.

size is the dimensionality of the feature vectors.

window is the maximum distance between the current and predicted word within a sentence.

alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).
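
The linear drop from alpha to min_alpha can be written out as a small function. This is a sketch of the schedule as described, where `progress` is an assumed name for the fraction of training completed (0.0 at the start, 1.0 at the end); it is not the module's own code.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a given fraction of training has completed."""
    # Clamp progress to [0, 1] so the rate never leaves [min_alpha, alpha]
    progress = min(max(progress, 0.0), 1.0)
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)


# Learning rate at 0%, 25%, 50%, 75% and 100% of training
schedule = [linear_alpha(step / 4.0) for step in range(5)]
```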

seed = seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)
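
The per-word seeding can be illustrated as follows. This is a sketch under assumptions: `stable_hash` (CRC32) is a hypothetical replacement for hashfxn that does not depend on PYTHONHASHSEED, and the scaling of the initial values to a small range around zero is chosen for illustration rather than taken from this page.

```python
import random
import zlib


def stable_hash(text):
    """Deterministic string hash, unaffected by Python's hash randomization."""
    return zlib.crc32(text.encode("utf8")) & 0xFFFFFFFF


def seeded_vector(word, seed=1, size=100, hashfxn=stable_hash):
    """Initial vector for a word, seeded with hashfxn(word + str(seed))."""
    rng = random.Random(hashfxn(word + str(seed)))
    # Assumed scaling: small values centered on zero
    return [(rng.random() - 0.5) / size for _ in range(size)]


first = seeded_vector("computer")
again = seeded_vector("computer")
other_seed = seeded_vector("computer", seed=2)
```

With the built-in hash, string hashes change between interpreter launches in Python 3 unless PYTHONHASHSEED is fixed, so the same word would get different starting vectors on each run.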

min_count = ignore all words with total frequency lower than this.

max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).

sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
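
As a rough illustration of how the threshold acts, the sketch below computes the probability of keeping one occurrence of a word. It assumes the subsampling formula of the original C word2vec tool; `keep_probability` and its argument names are hypothetical.

```python
import math


def keep_probability(word_count, total_words, sample=0.001):
    """Probability that one occurrence of a word survives downsampling."""
    if sample <= 0:
        return 1.0  # downsampling disabled
    threshold = sample * total_words
    # Assumed formula, following the original C word2vec tool
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)


rare = keep_probability(10, 1000000)        # far below the threshold
frequent = keep_probability(50000, 1000000)  # 5% of all tokens
```

Words whose count is well below sample * total_words are always kept, while very frequent words are dropped most of the time.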

workers = use this many worker threads to train the model (=faster training with multicore machines).

hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.

negative = if > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
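
Drawing the “noise words” can be sketched like this. The example assumes the noise distribution proposed in [2], where each word is drawn with probability proportional to its count raised to the power 0.75; `build_noise_table` and `draw_noise_words` are hypothetical helpers, and details such as skipping the target word are left out.

```python
import bisect
import itertools
import random


def build_noise_table(word_counts, power=0.75):
    """Return (words, cumulative probabilities) for count**power sampling."""
    words = sorted(word_counts)
    weights = [word_counts[w] ** power for w in words]
    total = sum(weights)
    cumulative = list(itertools.accumulate(w / total for w in weights))
    return words, cumulative


def draw_noise_words(words, cumulative, negative=5, rng=random):
    """Draw `negative` noise words from the smoothed unigram distribution."""
    picks = []
    for _ in range(negative):
        idx = bisect.bisect_left(cumulative, rng.random())
        picks.append(words[min(idx, len(words) - 1)])
    return picks


words, cumulative = build_noise_table({"the": 1000, "model": 100, "rare": 1})
draws = draw_noise_words(words, cumulative, negative=5, rng=random.Random(0))
```

The exponent flattens the distribution, so rare words are drawn somewhat more often than their raw frequency would suggest.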

cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used.
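
The two settings differ only in how the context vectors are combined, as this small sketch shows (`combine_context` is a hypothetical helper operating on plain lists):

```python
def combine_context(context_vectors, cbow_mean=1):
    """Combine context word vectors by mean (cbow_mean=1) or sum (cbow_mean=0)."""
    summed = [sum(column) for column in zip(*context_vectors)]
    if cbow_mean:
        return [value / len(context_vectors) for value in summed]
    return summed


context = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
as_mean = combine_context(context, cbow_mean=1)
as_sum = combine_context(context, cbow_mean=0)
```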

hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. Default is Python’s rudimentary built-in hash function.

iter = number of iterations (epochs) over the corpus. Default is 5.

trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
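
A trim rule might look like the sketch below. To keep the example self-contained, the three constants are local placeholders standing in for utils.RULE_DEFAULT, utils.RULE_DISCARD and utils.RULE_KEEP; real code would import them from gensim's utils module. The word sets and the `keep_word` helper are hypothetical and only demonstrate how a rule's decision interacts with min_count.

```python
# Placeholders for utils.RULE_DEFAULT / utils.RULE_DISCARD / utils.RULE_KEEP
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = "default", "discard", "keep"

STOPWORDS = {"the", "of", "and"}  # always dropped, however frequent
PROTECTED = {"gensim"}            # always kept, however rare


def my_trim_rule(word, count, min_count):
    """Callable with the (word, count, min_count) signature described above."""
    if word in STOPWORDS:
        return RULE_DISCARD
    if word in PROTECTED:
        return RULE_KEEP
    return RULE_DEFAULT


def keep_word(word, count, min_count, trim_rule=None):
    """Resolve a rule's decision, falling back to the min_count check."""
    decision = trim_rule(word, count, min_count) if trim_rule else RULE_DEFAULT
    if decision == RULE_KEEP:
        return True
    if decision == RULE_DISCARD:
        return False
    return count >= min_count


raw_vocab = {"the": 500, "gensim": 1, "vector": 7, "typo": 2}
kept = sorted(w for w, c in raw_vocab.items() if keep_word(w, c, 5, my_trim_rule))
```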

sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.

batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

accuracy(questions, restrict_vocab=30000, most_similar=None, case_insensitive=True)¶

build_vocab(sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False)¶: Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.

build_vocab_from_freq(word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False)¶

Build the model vocabulary from a passed dictionary that maps each word to its count. Words must be unicode strings.

Parameters

word_freq (dict) – Word,Word_Count dictionary.
keep_raw_vocab (bool) – If not true, delete the raw vocabulary after the scaling is done and free up RAM.
corpus_count (int) – Even if no corpus is provided, this argument can set corpus_count explicitly.
trim_rule – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT.
update (bool) – If true, the new provided words in word_freq dict will be added to model’s vocab.

Returns

Return type

None

Examples

>>> from gensim.models.word2vec import Word2Vec
>>> model = Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})

clear_sims()¶: Removes all L2-normalized vectors for words from the model. You will have to recompute them using the init_sims() method.

create_binary_tree()¶: Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().
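
The effect of the Huffman construction on code lengths can be seen in a small standard-library sketch. `huffman_codes` is a hypothetical helper that returns bit strings per word; it illustrates the general technique and does not reproduce the data structures the model stores.

```python
import heapq
import itertools


def huffman_codes(word_counts):
    """Build Huffman codes (as bit strings) from a {word: count} mapping."""
    tiebreak = itertools.count()  # keeps heap entries comparable on equal counts
    heap = [(count, next(tiebreak), word) for word, count in word_counts.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}

    # Repeatedly merge the two least frequent nodes into one inner node
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        heapq.heappush(heap, (left[0] + right[0], next(tiebreak), (left, right)))

    # Walk the tree; inner nodes carry a (left, right) tuple, leaves a word
    codes = {}
    stack = [(heap[0], "")]
    while stack:
        (_, _, payload), prefix = stack.pop()
        if isinstance(payload, tuple):
            stack.append((payload[0], prefix + "0"))
            stack.append((payload[1], prefix + "1"))
        else:
            codes[payload] = prefix
    return codes


codes = huffman_codes({"the": 100, "of": 60, "model": 10, "huffman": 2})
```

Shorter codes for frequent words mean fewer tree nodes to update per training example when hierarchical softmax is used.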

delete_temporary_training_data(replace_word_vectors_with_normalized=False)¶: Discard parameters that are used in training and scoring. Use if you’re sure you’re done training a model. If replace_word_vectors_with_normalized is set, forget the original vectors and only keep the normalized ones, which saves a lot of memory.

doesnt_match(words)¶: Deprecated. Use self.wv.doesnt_match() instead. Refer to the documentation for gensim.models.KeyedVectors.doesnt_match

estimate_memory(vocab_size=None, report=None)¶: Estimate required memory for a model using current settings and provided vocabulary size.

evaluate_word_pairs(pairs, delimiter='\t', restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)¶: Deprecated. Use self.wv.evaluate_word_pairs() instead. Refer to the documentation for gensim.models.KeyedVectors.evaluate_word_pairs

finalize_vocab(update=False)¶: Build tables and model weights based on final vocabulary settings.

get_latest_training_loss()¶

init_sims(replace=False)¶: init_sims() resides in KeyedVectors because it deals mainly with syn0. Because syn1 is not an attribute of KeyedVectors, it has to be deleted in this class, while the normalizing of syn0 happens inside KeyedVectors.

initialize_word_vectors()¶

intersect_word2vec_format(fname, lockf=0.0, binary=False, encoding='utf8', unicode_errors='strict')¶

Merge the input-hidden weight matrix from the original C word2vec-tool format given, where it intersects with the current vocabulary. (No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)

binary is a boolean indicating whether the data is in binary word2vec format.

lockf is a lock-factor value to be set for any imported word-vectors; the default value of 0.0 prevents further updating of the vector during subsequent training. Use 1.0 to allow further training updates of merged vectors.

classmethod load(*args, **kwargs)¶

Load a previously saved object (using save()) from file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
