scripts.word2vec_standalone – Train word2vec on text file CORPUS

USAGE: %(program)s -train CORPUS -output VECTORS -size SIZE -window WINDOW -cbow CBOW -sample SAMPLE -hs HS -negative NEGATIVE -threads THREADS -iter ITER -min_count MIN-COUNT -alpha ALPHA -binary BINARY -accuracy FILE
Trains a neural embedding model on text file CORPUS. Parameters essentially reproduce those used by the original C tool (see https://code.google.com/archive/p/word2vec/).
Parameters for training:

    -train <file>
        Use text data from <file> to train the model
    -output <file>
        Use <file> to save the resulting word vectors / word clusters
    -size <int>
        Set the size of word vectors; default is 100
    -window <int>
        Set the max skip length between words; default is 5
    -sample <float>
        Set the threshold for occurrence of words. Those that appear with higher
        frequency in the training data will be randomly down-sampled; default is
        1e-3, useful range is (0, 1e-5). A short sketch of this rule follows the list.
    -hs <int>
        Use Hierarchical Softmax; default is 0 (not used)
    -negative <int>
        Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
    -threads <int>
        Use <int> threads (default 3)
    -iter <int>
        Run more training iterations (default 5)
    -min_count <int>
        Discard words that appear fewer than <int> times; default is 5
    -alpha <float>
        Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
    -binary <int>
        Save the resulting vectors in binary mode; default is 0 (off)
    -cbow <int>
        Use the continuous bag of words model; default is 1 (use 0 for the skip-gram model)
    -accuracy <file>
        Compute the accuracy of the resulting model's analogical inference power on the
        questions file <file>. See an example questions file at
        https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
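The down-sampling rule behind -sample mirrors the original C tool. A minimal plain-Python sketch (function names are illustrative, not part of the script) of the keep-probability it computes for a word seen `count` times in a corpus of `total_words` tokens:

    import math
    import random

    def keep_probability(count, total_words, sample=1e-3):
        # Ratio of the word's frequency to the sub-sampling threshold.
        ratio = count / (sample * total_words)
        # Formula from the original word2vec.c: very frequent words
        # (ratio >> 1) get a keep-probability well below 1.
        return min(1.0, (math.sqrt(ratio) + 1) / ratio)

    def keep_occurrence(count, total_words, sample=1e-3):
        # Randomly discard individual occurrences of over-frequent words.
        return random.random() < keep_probability(count, total_words, sample)

For example, a word making up 1% of the corpus at the default sample=1e-3 has ratio 10 and is kept with probability (sqrt(10) + 1) / 10 ≈ 0.42, while rare words (ratio <= 1) are always kept.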
Example: python -m gensim.scripts.word2vec_standalone -train data.txt -output vec.txt -size 200 -sample 1e-4 -binary 0 -iter 3
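The standalone script wraps gensim's Word2Vec class, so the example above can be reproduced directly from Python. A minimal sketch, assuming gensim 4.x (where the size and iter parameters were renamed vector_size and epochs) and a data.txt with one whitespace-tokenized sentence per line:

    from gensim.models.word2vec import Word2Vec, LineSentence

    # Stream the corpus one line at a time instead of loading it into RAM.
    sentences = LineSentence('data.txt')

    # Mirrors: -size 200 -sample 1e-4 -iter 3, with CBOW (the default)
    # and the remaining flags at their documented defaults.
    model = Word2Vec(
        sentences,
        vector_size=200,   # -size
        window=5,          # -window
        sample=1e-4,       # -sample
        min_count=5,       # -min_count
        workers=3,         # -threads
        sg=0,              # -cbow 1 (sg=1 would select skip-gram)
        hs=0,              # -hs
        negative=5,        # -negative
        epochs=3,          # -iter
    )

    # Mirrors: -output vec.txt -binary 0
    model.wv.save_word2vec_format('vec.txt', binary=False)

    # Mirrors: -accuracy <file> (optional analogy evaluation)
    # score, sections = model.wv.evaluate_word_analogies('questions-words.txt')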