I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.
Preparing the Input
Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input, where each sentence is a list of words (utf8 strings):
>>> # import modules & set up logging
>>> import gensim, logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>>
>>> sentences = [['first', 'sentence'], ['second', 'sentence']]
>>> # train word2vec on the two sentences
>>> model = gensim.models.Word2Vec(sentences, min_count=1)
Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.
Gensim only requires that the input provide sentences sequentially when iterated over. There’s no need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…
For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:
>>> import os
>>>
>>> class MySentences(object):
...     def __init__(self, dirname):
...         self.dirname = dirname
...
...     def __iter__(self):
...         for fname in os.listdir(self.dirname):
...             for line in open(os.path.join(self.dirname, fname)):
...                 yield line.split()
>>>
>>> sentences = MySentences('/some/directory')  # a memory-friendly iterator
>>> model = gensim.models.Word2Vec(sentences)
Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
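For illustration, here is a hypothetical variant of the iterator that lowercases tokens and drops number-only ones. The specific preprocessing chosen here is made up; plug in whatever normalization your application needs:

```python
import os

def preprocess(line):
    # one hypothetical normalization: lowercase tokens, drop pure numbers
    return [w.lower() for w in line.split() if not w.isdigit()]

class PreprocessedSentences(object):
    """Same shape as MySentences above, but normalizes each line first."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield preprocess(line)
```

Word2vec never sees the raw lines, only the already-normalized sentences the iterator yields.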
Note to advanced users: calling Word2Vec(sentences) will run two passes over the sentences iterator. The first pass collects words and their frequencies to build an internal dictionary tree structure; the second pass trains the neural model.
These two passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:
>>> model = gensim.models.Word2Vec()  # an empty model, no training
>>> model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
>>> model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
Training
Word2vec accepts several parameters that affect both training speed and quality.
One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to do any meaningful training on those words, so it’s best to ignore them:
>>> model = Word2Vec(sentences, min_count=10) # default value is 5
A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
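Conceptually, the min_count cutoff is just frequency pruning over the collected vocabulary; a rough pure-Python sketch of the idea (not gensim’s actual implementation):

```python
from collections import Counter

def prune_vocab(sentences, min_count=5):
    # count token frequencies, then keep only tokens seen min_count+ times
    counts = Counter(w for sentence in sentences for w in sentence)
    return {w: c for w, c in counts.items() if c >= min_count}

print(prune_vocab([['a', 'b', 'a'], ['a', 'c']], min_count=2))  # {'a': 3}
```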
Another parameter is the size of the NN layers, which corresponds to the “degrees” of freedom the training algorithm has:
>>> model = Word2Vec(sentences, size=200) # default value is 100
Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
The last of the major parameters (full list here) is for training parallelization, to speed up training:
>>> model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization
The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).
Memory
At its core, a word2vec model’s parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) floats (single precision, aka 4 bytes).
Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.
There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
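That back-of-the-envelope estimate is easy to write down; a sketch, assuming the three float32 matrices described above:

```python
def estimate_model_ram(vocab_size, layer_size, num_matrices=3):
    # vocab x layer float32 entries (4 bytes each), times the matrices held in RAM
    return vocab_size * layer_size * 4 * num_matrices

print(estimate_model_ram(100000, 200))  # 240000000 bytes, i.e. ~229MB
```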
Evaluating
Word2vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.
Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: http://word2vec.googlecode.com/svn/trunk/questions-words.txt.
Gensim supports the same evaluation set, in exactly the same format:
>>> model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
The accuracy method takes an optional parameter, restrict_vocab, which limits which test examples are considered.
Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.
Storing and loading models
You can store/load models using the standard gensim methods:
>>> model.save('/tmp/mymodel')
>>> new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
which uses pickle internally, optionally mmap-ing the model’s large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.
In addition, you can load models created by the original C tool, in both its text and binary formats:
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
>>> # using gzipped/bz2 input works too, no need to unzip:
>>> model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
Online training / Resuming training
Advanced users can load a model and continue training it with more sentences:
>>> model = gensim.models.Word2Vec.load('/tmp/mymodel')
>>> model.train(more_sentences)
You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
Note that it’s not possible to resume training with models generated by the C tool and loaded via load_word2vec_format(). You can still use them for querying/similarity, but the information vital for training (the vocab tree) is missing there.
Using the model
Word2vec supports several word similarity tasks out of the box:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
If you need the raw output vectors in your application, you can access these either on a word-by-word basis
>>> model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
…or en masse as a 2D NumPy matrix from model.syn0.
Bonus app
As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:
If you don’t get “queen” back, something went wrong and baby SkyNet cries.
Try more examples too: “he” is to “his” as “she” is to ?; “Berlin” is to “Germany” as “Paris” is to ?
Try: U.S.A.; Monty_Python; PHP; Madiba.
Also try: “monkey ape baboon human chimp gorilla”; “blue red green crimson transparent”.
The model contains 3,000,000 unique phrases, built with a layer size of 300.
Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.
On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.
To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.
The “suggested” phrases are simply the ten phrases starting at the position returned by bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far), from Python’s built-in bisect module.
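That mechanism is only a few lines; a sketch with a toy vocabulary:

```python
from bisect import bisect_left

def suggest(sorted_vocab, prefix, n=10):
    # find the first phrase >= the typed prefix, return the next n phrases
    start = bisect_left(sorted_vocab, prefix)
    return sorted_vocab[start:start + n]

vocab = sorted(['Paris', 'PHP', 'Monty_Python', 'Madiba', 'Berlin'])
print(suggest(vocab, 'M', n=2))  # ['Madiba', 'Monty_Python']
```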
Outro
Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.
And here’s me talking about the optimizations behind word2vec at PyData Berlin 2014.
Hi radim,
Impressive tutorial. I have a query: the output of the Word2Vec model is an array. How can we use that as input to a recursive neural network?
Thanks
model = gensim.models.Word2Vec(sentences) will not work as shown in the tutorial, because you will receive the error message: “RuntimeError: you must first build vocabulary before training the model”. You also have to set min_count manually, like model = gensim.models.Word2Vec(sentences, min_count=1).
Default `min_count=5` if you don’t set it explicitly. Vocabulary is built automatically from the sentences.
What version of gensim are you using? It should really work simply with `Word2Vec(sentences)`, there are even unit tests for that.
if you don’t set ‘min_count=1’, it will remove all the words in sentences in the example given –
logging:
‘INFO : total 0 word types after removing those with count<5'
Ah ok, thanks Claire. I’ve added the `min_count=1` parameter.
Hi Radim,
Is there any way to obtain the similarity of phrases out of the word2vec? I’m trying to get 2-word phrases to compare, but don’t know how to do it.
Thanks!
Pavel
Hello Pavel, yes, there is a way.
First, you must detect phrases in the text (such as 2-word phrases).
Then you build the word2vec model like you normally would, except some “tokens” will be strings of multiple words instead of one (example sentence: [“New York”, “was”, “founded”, “16th century”]).
Then, to get similarity of phrases, you do `model.similarity(“New York”, “16th century”)`.
It may be a good idea to replace spaces with underscores in the phrase-tokens, to avoid potential parsing problems (“New_York”, “16th_century”).
As for detecting the phrases, it’s a task unrelated to word2vec. You can use existing NLP tools like the NLTK/Freebase, or help finish a gensim pull request that does exactly this: https://github.com/piskvorky/gensim/pull/135 .
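A minimal sketch of the underscore trick, assuming a set of known phrases has already been produced by some phrase detector:

```python
def join_phrases(tokens, phrases):
    # greedily merge adjacent pairs that form a known phrase, joined with '_'
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

phrases = {('New', 'York'), ('16th', 'century')}
print(join_phrases(['New', 'York', 'was', 'founded'], phrases))  # ['New_York', 'was', 'founded']
```

The merged tokens then go through word2vec training like any other word, so `model.similarity('New_York', '16th_century')` works out of the box.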
Hi Radim,
The Word2Vec function split my words as:
u’\u4e00\u822c’ ==>> u’\u4e00′ and u’\u822c’
How could I fix it?
Thanks, luopuya
Sorry, I did not read the blog carefully.
Every time I read a line of the file, I should split it like “MySentences” does.
Hi Radim,
you are awesome, thank you so much for gensim and this tutorial!!
I have a question. I read in the docs that by default you utilize Skip-Gram, which can be switched to CBOW. From what I gathered in the NIPS slides, CBOW is faster, more effective and gets better results. So why use Skip-Gram in the first place? I’m sure I’m missing something obvious here :)
Thanks,
Max
Whoops, I just realized the parameter “sg” is not supported anymore in the Word2Vec constructor. Is that true? So what is used by default?
Hello Max,
thanks :)
Skip-gram is used because it gives better (more accurate) results.
CBOW is faster though. If you have lots of data, it can be advantageous to run the simpler but faster model.
There’s a pull request under way, to enable all the word2vec options and parameters. You can try out the various models and their performance for yourself :)
Thanks for your answer Radim! I only saw that one experiment in the slides that said that CBOW was faster AND more accurate, but that might have been an outlier or my misinterpretation. I’m excited for that pull request! :)
Anyway, I have another question (or bug report?). I changed a bunch of training parameters and added input data and suddenly got segfaults in Python when asking for similarities for certain words… So I tried to find which of the changes caused this, and it turned out that the cause was that I set the output size to 200! Setting it to (apparently) any other number doesn’t cause any trouble, but 200 does… If you hear this from anyone else or are able to reproduce it yourself, consider it a bug :)
Hey Max — are you on OS X? If so, it may be https://github.com/numpy/numpy/issues/4007
If not, please file a bug report at https://github.com/piskvorky/gensim/issues with system/sw info. Cheers!
Hi Radim,
Indeed, a great tutorial! Thank you for that!
Played a bit with Word2Vec and it’s quite impressive. Couldn’t figure out how the first part of the demo app works. Can you provide some insights please?
Thanks! :D
Hello Bogdan, you’re welcome :-)
This demo app just calls the “model.most_similar(positive, negative)” method in the background. Check out the API docs: http://radimrehurek.com/gensim/models/word2vec.html
Hello,
I’d like to ask you whether this can all be done with other languages too, like Korean, Russian, Arabic and so on, or whether this toolkit is fixed to English only.
Thank You in advance for the answer
Hi, it can be done directly for languages where you can split text into sentences and tokens.
The only concern would be that the `window` parameter has to be large enough to capture syntactic/semantic relationships. For English, the default `window=5`, and I wouldn’t expect it to be dramatically different for other languages.
Hey, I wanted to know if the version you have in gensim is the same that you got after “optimizing word2vec in Python”. I am using the pre-trained Google News model (found on the word2vec page) and then I run model.accuracy(‘file_questions’), but it runs really slow… Just wanted to know if this is normal or if I have to do something to speed up the gensim version. Thanks in advance and great work!
It is — gensim always contains the latest, most optimized version (=newer than this blog post).
However, the accuracy computations (unlike training) are NOT optimized :-) I never got to optimizing that part. If you want to help, let me know, I don’t think I’ll ever get to it myself. (Massive optimizations can be done directly in Python, no need to go C/Cython).
Hi,
could you please explain how the CBOW and skip-gram models actually do the learning. I’ve read ‘Efficient estimation…’ but it doesn’t really explain how the actual training happens.
I’ve taken a look at the original source code and your implementation, and while I can understand the code I cannot understand the logic behind it.
I don’t understand these lines in your implementation (word2vec.py, gensim 0.9) CBOW:
————————————
l2a = model.syn1[word.point] # 2d matrix, codelen x layer1_size
fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T))) # propagate hidden -> output
ga = (1 - word.code - fa) * alpha # vector of error gradients multiplied by the learning rate
model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
—————————————-
I see that it has something to do with the Huffman-tree word representation, but despite the comments, I don’t understand what is actually happening, what does syn1 represent, why do we multiply l1 with l2a… why are we multiplying ga and l1 etc…
Could you please explain in a sentence or two what is actually happening in there.
I would be very grateful.
Hi, this is great, but I have a question about unknown words.
After loading a model, when training on more sentences, new words will be ignored, not added to the vocab automatically.
I am not quite sure about this. Is that true?
Thank you very much!
Yes, it is true. The word2vec algorithm doesn’t support adding new words dynamically.
Hi,
Unfortunately I’m not sufficiently versed in programming (yet) to solve this by myself. I would like to be able to add a new vector to the model after it has been trained.
I realize I could export the model to a text file, add the vector there and load the modified file. Is there a way to add the vector within Python, though? As in: create a new entry for the model such that model[‘new_entry’] is assigned the new_vector.
Thanks in advance!
The weights are a NumPy matrix — have a look at `model.syn0` and `model.syn1` matrices. You can append new vectors easily.
Off the top of my head I’m not sure whether there are any other variables in `model` that need to be modified when you do this. There probably are. I’d suggest checking the code to make sure everything’s consistent.
Pingback: Motorblog » [Review] PyData Berlin 2014 – Satellitenevent zur EuroPython
I am new to word2vec. Can I ask you two questions?
1. When I apply the pre-trained model to my own dataset, do you have any suggestion about how to deal with the unknown words?
2. Do you have any suggestion about aggregating the word embeddings of words in a sentence into one vector to represent that sentence?
Thanks a lot!
Good questions, but I don’t have any insights beyond the standard advice:
1. unknown words are ignored; or you can build a model with one special “word” to represent all OOV words.
2. for short sentences/phrases, you can average the individual vectors; for longer texts, look into something like paragraph2vec: https://github.com/piskvorky/gensim/issues/204#issuecomment-52093328
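The averaging idea from point 2 can be sketched in plain Python (real code would use NumPy and the model’s actual vectors; the toy vectors below are made up):

```python
def average_vectors(words, vectors):
    # average the vectors of in-vocabulary words; OOV words are ignored
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

toy = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}
print(average_vectors(['good', 'movie', 'unseenword'], toy))  # [0.5, 0.5]
```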
Thanks for your advice :D
Hi,
Could you tell me how to find the most similar word, as in web app 3? Calculating the cosine similarity against every word seems like a brute-force way to do it. Is there any API in gensim to do that?
Another question: I want to represent a sentence using word vectors; right now I just add up all the words in the sentence to get a new vector. I know this method doesn’t make sense, since each word has a coordinate in the semantic space and adding up coordinates is not an ideal way to represent a sentence. I have read some papers talking about this problem. Could you tell me what would be an ideal way to represent a sentence for sentence clustering?
Thank you very much!
“Find most similar word”: look at the API docs.
“Ideal way to represent a sentence”: I don’t know about ideal, but another way to represent sentences is using “paragraph2vec”: https://github.com/piskvorky/gensim/issues/204
Radim, this is great stuff. I have posted a link to your software on our AI community blog in the USA at http://www.smesh.net/pages/980191274#54262e6cdb3c2facb5a41579 , so that other people can benefit from it too.
Does workers=x multithread the iteration over the sentences, or just the training of the model?
It parallelizes training.
How you iterate over sentences is your business — word2vec only expects an iterator on input. What the iterator does internally to iterate over sentences is up to you and not part of word2vec.
I have a collection of 1,500,000 text files (with 10 lines each on average) and a machine with 12 cores/16GB of RAM (not sure if it is relevant for reading files).
How would you suggest me to build the iterator to utilize all the computing resources I have?
No, not relevant.
I’d suggest you loop over your files inside __iter__() and yield out your sentences (lines?), one after another.
ok thanks!
If I am using the model pre-trained with Google News data set, is there any way to control the size of the output vector corresponding to a word?
for “model.build_vocab(sentences)” command to work, we need to add “import os”. without that, i was getting error for ‘os’ not defined.
Not sure what you are talking about suvir, you don’t need any “import os”. If you run into problems, send us the full traceback (preferably to the gensim mailing list, not this blog). See http://radimrehurek.com/gensim/support.html. Cheers.
hello,
Where can I find the code (in python) of the Bonus App?
Pingback: How to grow a list of related words based on initial keywords? | CL-UAT
I am using the train function as described in the API doc. I notice that the training might have terminated “prematurely”, according to the logging output below. Not sure if I understand the output properly. When it says “PROGRESS: at 4.10% words”, does it mean 4.1% of the corpus or 4.1% of the vocab? I suspect the former, so it would suggest it only processed 4.1% of the words. Please enlighten me. Thanks!
2015-02-11 19:34:40,894 : INFO : Got records: 20143
2015-02-11 19:34:40,894 : INFO : training model with 4 workers on 67186 vocabulary and 200 features, using ‘skipgram’=1 ‘hierarchical softmax’=0 ‘subsample’=0 and ‘negative sampling’=15
2015-02-11 19:34:41,903 : INFO : PROGRESS: at 0.45% words, alpha 0.02491, 93073 words/s
2015-02-11 19:34:42,925 : INFO : PROGRESS: at 0.96% words, alpha 0.02477, 97772 words/s
2015-02-11 19:34:43,930 : INFO : PROGRESS: at 1.48% words, alpha 0.02465, 100986 words/s
2015-02-11 19:34:44,941 : INFO : PROGRESS: at 2.00% words, alpha 0.02452, 102187 words/s
2015-02-11 19:34:45,960 : INFO : PROGRESS: at 2.51% words, alpha 0.02439, 102371 words/s
2015-02-11 19:34:46,966 : INFO : PROGRESS: at 3.05% words, alpha 0.02426, 104070 words/s
2015-02-11 19:34:48,006 : INFO : PROGRESS: at 3.55% words, alpha 0.02413, 103439 words/s
2015-02-11 19:34:48,625 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
2015-02-11 19:34:49,026 : INFO : PROGRESS: at 4.10% words, alpha 0.02400, 104259 words/s
Hi Radim,
Is there a whole example that I can use to understand the whole concept and walk through the code?
Thanks much,
Hello Sasha,
not sure what concept / code you need, but there is one example right there in the word2vec.py source file:
https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L997
(you can download the text8 corpus used there from http://mattmahoney.net/dc/text8.zip )
Hi Radim,
I’m wondering about the difference between a model trained in C (the original way) and one trained in gensim.
When I try to use the model.most_similar function, loading the model I’ve trained in C, I get a totally different result than when I try to do the same thing with word-analogy.sh. So I just want to know if the model.most_similar function uses the same calculation for ‘man’-‘king’+‘woman’ ≈ ‘queen’ as Mikolov achieved in his C code (word-analogy), thanks!!!
Yes, exactly the same (cosine similarity).
The training is almost the same too, up to different randomized initialization of weights IIRC.
Maybe you’re using different data (preprocessing)?
Sorry to bother you again,here are two kinds of way when I try to do:
The way when I use gensim:
model = Word2Vec.load_word2vec_format('vectors_200.bin', binary=True)
# Chinese
word1 = u'石家庄'
word2 = u'河北'
word3 = u'河南'
le = model.most_similar(positive=[word2, word3], negative=[word1])
The way use C code:
./word-analogy vectors_200.bin
the input: ‘石家庄’ ‘河北’ ‘河南’
totally different results…
the same model loaded, how could that happened?
Oh, non-ASCII characters.
IIRC, the C code doesn’t handle unicode in any way, all text is treated as binary. Python code (gensim) uses Unicode for strings.
So, perhaps some encoding mismatch?
How was your model trained — with C code? Is so, what was the encoding?
The training corpus is encoded in utf-8; is that the reason?
Hello Radim,
Is there a way to extract the output feature vector (or, sort of, predicted probabilities) from the model, just like while it’s training?
Thanks
Hey Radim
Thanks for the wonderful tutorial.
I am new to word2vec and I am trying to generate n-grams of words for an Indian script. I have 2 queries:
Q1. Should the input be in plain text:
ସୁଯୋଗ ଅସଟାର or unicodes 2860 2825 2858 2853 2821
Q2. Is there any code available to do clustering of the generated vectors to form word classes?
Please help
Hi Radim,
For this example: “woman king man”:
I run with bonus web app, and got the results:
521.9ms [[“kings”,0.6490576267242432],[“clown_prince”,0.5009066462516785],[“prince”,0.4854174852371216],[“crown_prince”,0.48162946105003357],[“King”,0.47213971614837646]]
The above result is the same as word2vec by Tomas Mikolov.
However, when I run example above in gensim, the output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
So why is this the case?
Your web app’s result is different to gensim ???
Thanks!
Hi Cong
no, both are the same.
In fact, the web app just calls gensim under the hood. There’s no extra magic happening regarding word2vec queries, it’s just gensim wrapped in cherrypy web server.
Thank you for your reply.
I loaded the pre-trained model: GoogleNews-vectors-negative300.bin by Tomas Mikolov.
Then, I used word2vec in gensim to find the output.
This is my code when using gensim:
from gensim.models import word2vec
model_path = "…/GoogleNews-vectors-negative300.bin"
model = word2vec.Word2Vec.load_word2vec_format(model_path, binary=True)
stringA = 'woman'
stringB = 'king'
stringC = 'man'
print model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10)
–> Output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
You can see that the output above is different from the web app’s.
So can you check it for me?
Thanks so much.
I found that in gensim, the order should be:
…positive=[stringB, stringC], negative=[stringA]..
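For what it’s worth, the ordering matters because most_similar ranks words by cosine similarity to (sum of positive vectors minus sum of negative vectors). A toy sketch of that scheme, with made-up 2-d vectors rather than real embeddings:

```python
import math

def analogy_query(positive, negative, vectors, topn=1):
    # query vector = sum of positive word vectors minus sum of negative ones
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + v for q, v in zip(query, vectors[w])]
    for w in negative:
        query = [q - v for q, v in zip(query, vectors[w])]

    def cosine(a, b):
        norm = lambda x: math.sqrt(sum(t * t for t in x))
        return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))

    # rank all other words by similarity to the query vector
    skip = set(positive) | set(negative)
    ranked = sorted(((cosine(query, v), w) for w, v in vectors.items()
                     if w not in skip), reverse=True)
    return [w for _, w in ranked[:topn]]

# made-up toy vectors, not real embeddings
toy = {'king': [1.0, 1.0], 'man': [1.0, 0.0],
       'woman': [1.0, 0.2], 'queen': [1.0, 1.2]}
print(analogy_query(positive=['king', 'woman'], negative=['man'], vectors=toy))  # ['queen']
```

Swapping which words go into positive and negative changes the query vector, which is why the argument order changes the results.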
Hi Radim,
Thank you for the great tool and tutorial.
I have one question regarding the learning rate in online training. You mentioned adjusting total_words in train(), but could you give a more detailed explanation of how this parameter affects the learning rate?
Thank you in advance.
Fantastic tool and tutorial. Thanks for sharing.
I’m wondering about a compounding use of LSI. Take a large corpus and perform LSI to map words into some space. Now, having a document, when you hit a word, look up the point in the space and use that rather than just the word. Words of similar meaning then start out closer together and more sensibly influence the document classification. Would the model just reverse out those initial weights? Thanks for any ideas.
Hi Radim,
First of all, thanks for your great job developing this tool. I am new to word2vec and unfortunately the literature does not explain the details clearly. I would be grateful if you could answer my simple questions.
1- for CBOW (sg=0), does the method use negative sampling as well? Or is this something just related to the skip-gram model?
2- what about the window size? Is the window size also applicable when one uses CBOW, or are all the words in a sentence considered as a bag-of-words?
3- what happens if the window size is larger than the size of a sentence? Is the sentence ignored, or is a smaller window size simply chosen to fit the sentence?
4- what happens if the word sits at the end of the sentence? There is no word after it for the skip-gram model!
Hi Radim,
Thanks for such a nice package! It may be bold to suggest, but I ran across what I think might be a bug. It’s likely a feature :), but I thought I’d point it out since I needed to fix it in an unintuitive way.
If I train a word2vec model using a list of sentences:
sentences = MySentences(fname) # generator that yields sentences
mysentences = list(sentences)
model = gensim.models.Word2Vec(sentences=mysentences, **kwargs)
then the model finishes training. Eg., the end of the logging shows
…snip…
2015-05-13 22:12:07,329 : INFO : PROGRESS: at 97.17% words, alpha 0.00075, 47620 words/s
2015-05-13 22:12:08,359 : INFO : PROGRESS: at 98.25% words, alpha 0.00049, 47605 words/s
2015-05-13 22:12:09,362 : INFO : PROGRESS: at 99.32% words, alpha 0.00019, 47603 words/s
2015-05-13 22:12:09,519 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:12:09,901 : INFO : training on 4427131 words took 92.9s, 47648 words/s
I’m training on many GB of data, so I need to pass in a generator that yields sentences line by line (like your MySentences class above). But when I try it as suggested with, say, iter=5:
sentences = MySentences(fname) # generator that yields sentences
model = gensim.models.Word2Vec(sentences=None, **kwargs) # iter=10 defined in kwargs
model.build_vocab(sentences_vocab)
model.train(sentences_train)
the model stops training 1/20 of the way through. If iter=10, it stops 1/10 of the way, etc. Eg., the end of the logging looks like,
…snip…
2015-05-13 22:31:37,265 : INFO : PROGRESS: at 18.21% words, alpha 0.02049, 49695 words/s
2015-05-13 22:31:38,266 : INFO : PROGRESS: at 19.29% words, alpha 0.02022, 49585 words/s
2015-05-13 22:31:38,452 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:31:38,857 : INFO : training on 885538 words took 17.8s, 49703 words/s
Looking in word2vec.py, around line 316 I noticed
sentences = gensim.utils.RepeatCorpusNTimes(sentences, iter)
so I added
sentences_train = gensim.utils.RepeatCorpusNTimes(Sentences(fname), model.iter)
before calling model.train() in the above code snippet. Does this seem like the correct course of action, or am I missing something fundamental about the way one should stream sentences to build the vocab and train the model?
Thanks for your help,
Jesse
Hello Jesse,
for your sentences, are you using a generator (=can be iterated over only once), or an iterable (can be iterated over many times)?
It is true that for multiple passes, a generator is not enough. Anyway, better to ask at the gensim mailing list / github; that’s a better medium for this:
http://radimrehurek.com/gensim/support.html
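To expand on the distinction: a generator is exhausted after a single pass, while an iterable object hands out a fresh generator on every __iter__() call, which is what multiple training passes need. A minimal sketch:

```python
class RestartableSentences(object):
    """Wraps a zero-argument factory returning a fresh generator, so the
    corpus can be iterated over any number of times."""
    def __init__(self, make_generator):
        self.make_generator = make_generator

    def __iter__(self):
        return self.make_generator()

corpus = RestartableSentences(
    lambda: (line.split() for line in ['first sentence', 'second sentence']))
# both passes see all sentences, unlike a bare generator
print(sum(1 for _ in corpus), sum(1 for _ in corpus))  # 2 2
```

In practice the factory would re-open your file(s) each time, the same way the MySentences class in the tutorial does.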
Hello,
It is a great tutorial, thank you very much….
but i have a problem,
I used the accuracy function to print the evaluation of the model, but nothing is printed for me.
How do I solve this problem?
thanks a lot
Hi Radim,
Great package and tutorial – thanks much! I need a linear walkthrough of what i need to install for speedup. I am using a (slow) Windows 7 64-bit Intel Core 2 Duo 1.4GHz CPU 4GB RAM laptop.
I assume this is the dependency list:
(1) gensim in python requires numpy and scipy
(2) scipy requires one or more of BLAS or LAPACK
(3) BLAS/LAPACK can be installed from openBLAS or Atlas or Intel MKL
(4) each of openBLAS, Atlas, and Intel MKL require a GCC (?) compiler, but each seems to ask for a different compiler (e.g. MinGW-w64, Visual Studio, Intel compiler, etc).
I tried installing Visual Studio Ultimate 2013 – it took like 4 hours to install, and another 4 hours to uninstall.
Among openBLAS, Atlas and Intel MKL, what do you suggest for my scenario, and which C compiler should I install?
thanks much,
Sumeet
Hi, I would like to know, is it possible to do Parts-of-speech tagging using word2vec? Is there any reference for that?
Pingback: Parallelizing word2vec in Python » RaRe Technologies
Hello,
Thanks for the nice tutorial. Do you know which pre-processing steps have been done by Google on the Google News Corpus data-set to generate the vectors? Have they done a stopword removal or punctuation symbols removal? Any pointers about where can I find this info will be very useful.
Best,
Madhumita
Hello Madhumita!
The blog has moved to our new site; can you ask your question there?
Cheers,
Radim