Doc2vec tutorial

Radim Řehůřek · gensim, programming · 89 comments

The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.

Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to learn, in an unsupervised fashion, continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.

IMPORTANT NOTE: the doc2vec functionality received a major facelift in gensim 0.12.0. The API is now cleaner, training is faster, and more tuning parameters are exposed. While the basic ideas explained below still apply, see this IPython notebook for a more up-to-date tutorial on using doc2vec. For a commercial document similarity engine, see our scaletext.com.

Continuing in Tim’s own words:

Input

Since the Doc2Vec class extends gensim’s original Word2Vec class, many of the usage patterns are similar. You can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, or almost any other parameter that you can change with the Word2Vec model.

The one exception to this rule is the set of parameters relating to the training method used by the model. In the word2vec architecture, the two algorithm names are “continuous bag of words” (cbow) and “skip-gram” (sg); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (dm) and “distributed bag of words” (dbow). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the dbow model if you wish, by passing the dm=0 flag to the constructor.
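For example, a minimal sketch (all other parameters left at their defaults):

model = Doc2Vec(sentences, dm=0)  # dm=0 forces the distributed bag of words (dbow) algorithm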

The input to Doc2Vec is an iterator of LabeledSentence objects. Each such object represents a single sentence, and consists of two simple lists: a list of words and a list of labels:

sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])

The algorithm then runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset.

Although this architecture permits more than one label per sentence (and I myself have used it this way), I suspect the most popular use case would be to have a single label per sentence which is the unique identifier for the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # one sentence per line; label each line with its (unique) line number
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

A more robust version of the LabeledLineSentence class above is also included in the doc2vec module, so you can use that instead. Read the doc2vec API docs for all constructor parameters.
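Putting the pieces together, a minimal sketch of training on such a file (the path is hypothetical):

from gensim.models.doc2vec import Doc2Vec

sentences = LabeledLineSentence('/tmp/corpus.txt')  # the class defined above; hypothetical file
model = Doc2Vec(sentences)  # builds the vocabulary and trains, all in one call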

Training

Doc2Vec learns representations for words and labels simultaneously. If you wish to only learn representations for words, you can pass the flag train_lbls=False to the Doc2Vec constructor. Similarly, if you only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False.
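For example, a minimal sketch that freezes the word vectors and trains only the label vectors:

model = Doc2Vec(sentences, train_words=False)  # learn label vectors only; word vectors stay fixed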

One caveat of the way this algorithm runs is that, since the learning rate decreases over the course of iterating over the data, a label which is seen in only a single LabeledSentence during training is only ever trained with whatever the learning rate happened to be at that moment. This frequently produces less than optimal results. I have obtained better results by iterating over the data several times and either

  1. randomizing the order of input sentences, or
  2. manually controlling the learning rate over the course of several iterations.
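A minimal sketch of option 1, assuming the corpus is small enough to materialize as a plain list (the file path is hypothetical):

import random

sentences = list(LabeledLineSentence('/tmp/corpus.txt'))  # hypothetical file, loaded into RAM
model = Doc2Vec()  # default parameters
model.build_vocab(sentences)
for epoch in range(10):
    random.shuffle(sentences)  # present the sentences in a fresh order on every pass
    model.train(sentences)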

For option 2, if one wanted to manually control the learning rate over the course of 10 epochs, one could use the following:

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(sentences)
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

The code runs on optimized C (via Cython), just like the original word2vec, so it’s fairly fast.

Note from Radim: I wanted to include the obligatory run-on-English-Wikipedia-example at this point, with some timings and code. But I couldn’t get reasonable results out of Doc2Vec, and didn’t want to delay publishing Tim’s write up any longer while I experiment. Despair not; the caravan goes on, and we’re working on a more scalable version of doc2vec, one which doesn’t require a vector in RAM for each document, and with a simpler API for inference on new documents. Ping me if you want to help.

Memory Usage

With the current implementation, all label vectors are stored separately in RAM. In the case above with a unique label per sentence, this causes memory usage to grow linearly with the size of the corpus, which may or may not be a problem depending on the size of your corpus and the amount of RAM available on your box. For example, I’ve successfully run this over a collection of over 2 million sentences with no problems whatsoever; however, when I tried to run it on 20x that much data my box ran out of RAM since it needed to create a new vector for each sentence.
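As a rough back-of-the-envelope estimate, assuming the default vector size of 100 and 4-byte float32 weights:

num_labels = 2000000  # one unique label per sentence
vector_size = 100     # the default dimensionality
print num_labels * vector_size * 4 / (1024.0 ** 3)  # ~0.75 GB for the label vectors alone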

I/O

The usage for Doc2Vec is the same as for gensim’s Word2Vec. One can save and load gensim Doc2Vec instances in the usual ways: directly with Python’s pickle, or using the optimized Doc2Vec.save() and Doc2Vec.load() methods:

model = Doc2Vec(sentences)
...
# store the model to mmap-able files
model.save('/tmp/my_model.doc2vec')
# load the model back
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')

Helper functions like model.most_similar(), model.doesnt_match() and model.similarity() also exist. The raw word and label vectors are accessible either individually via model['word'], or all at once via model.syn0.
See the docs.

The main point is, labels act in the same way as words in Doc2Vec. So, to get the most similar words/sentences to the first sentence (label SENT_0, for example), you’d do:

print model.most_similar("SENT_0")
[('SENT_48859', 0.2516525387763977),
 (u'paradox', 0.24025458097457886),
 (u'methodically', 0.2379375547170639),
 (u'tongued', 0.22196565568447113),
 (u'cosmetics', 0.21332012116909027),
 (u'Loos', 0.2114654779434204),
 (u'backstory', 0.2113303393125534),
 ('SENT_60862', 0.21070502698421478),
 (u'gobble', 0.20925869047641754),
 ('SENT_73365', 0.20847654342651367)]

or to get the raw embedding for that sentence as a NumPy vector:

print model["SENT_0"]

etc. More functionality coming soon!


If you liked this article, you may also enjoy the Optimizing word2vec series and the Word2vec tutorial.

Comments (89)

      1. gj

Hi, thanks for your tutorial. I want to get sentence vectors; my training text is “F:\\jj\\g.txt”, but I cannot work out the code. Could you help me write code to get sentence vectors? Thank you very much.

      2. GJ

Hi, I have a question and need your help. I trained a model named “gj.bin”, but when I write model = Doc2Vec.load_word2vec_format('F:\\JJ\\gj.bin', binary=True), it shows UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 2: invalid start byte. What should I do?

  1. Wen

Hi, thanks for your awesome tutorial. What should I do if I want to input new test sentences to find similar sentences after training?

    Thank you

    1. Samuel Rönnqvist

You should be able to add new sentences to an existing model using train(), and then run most_similar(['SENT_XX']), where SENT_XX is the label of one of your new sentences.

      1. Yikang

Thanks for your awesome work. But I find that train() doesn’t add new sentences to an existing model; I get a KeyError when I try. Maybe train() just updates the weights?

        1. Ólavur Mortensen

          I would very much like to know this too. I can’t figure out how to train on more data in word2vec either.

  2. dhruv shah

    Hi Radim,

    I just had a quick question.
    The doc2vec is an unsupervised algorithm. So why do we have to provide labels when we are training the model?

    1. Bach

      Hi, I think that the labels here are not the “y”, they are like tags attached to each sentence, so you can access the vector that represents the sentence.

1. Radim Rehurek (post author)

          That’s up to you. Each label CAN be unique. Or you can use only a few labels across all documents (such as 10 target classes = labels for 1 million documents, reusing a single label for many documents).

          You would then learn 10 vectors via doc2vec, one vector to represent each class=label.

          Also note you can provide multiple labels per document (each document can be tagged with multiple “classes”).
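A hypothetical sketch of a sentence carrying both its own unique ID and a shared class label:

sentence = LabeledSentence(
    words=[u'some', u'words', u'here'],
    labels=[u'SENT_42', u'CLASS_7'])  # both labels receive their own trained vector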

          1. Nataly Maslova

            Hello Radim.
I try to attach 2 labels to some sentences in my sample and get this error:

File "C:\doc2v\mylabelseparate.py", line 122, in <module>
  model_dm.build_vocab(np.concatenate((x_train, x_test, unsupforest)))  # build vocab over all reviews
File "C:\Python27\lib\site-packages\gensim\models\word2vec.py", line 396, in build_vocab
  vocab = self._vocab_from(sentences)
File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 200, in _vocab_from
  sentence_length = len(sentence.words)
TypeError: object of type 'LabeledSentence' has no len()

            I would be extremely grateful to you if you look at my post:
            http://stackoverflow.com/questions/39504130/error-object-of-type-labeledsentence-has-no-len-while-building-vocabulary-in

  3. Bach

    Hi,

How can I get the vector representing a paragraph? My understanding is that I can access the vectors representing sentences via their labels; so how do I label a paragraph containing a few sentences?

    Thanks.

      1. Bach

        Hi,
So will each sentence in a paragraph have one label of its own and one label for the paragraph? E.g., for paragraph 1:
sentence.labels = [u'SENT_i', u'PARA_1']
for sentence i.

        1. Samuel Rönnqvist

          You could also feed entire paragraphs (or arbitrarily long sequences of tokens) together with a PARA_X label, to reduce memory usage (i.e., number of unique labels), unless you’re interested in separating the semantics of paragraphs and individual sentences.
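A sketch of that idea (the token list and label are hypothetical):

paragraph = LabeledSentence(
    words=paragraph_tokens,  # hypothetical: all tokens of the paragraph, in order
    labels=[u'PARA_1'])      # one shared label for the entire paragraph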

  4. Jeff

    Hi Radim,

First, thank you for this tutorial. I did what you said in this tutorial, but it didn’t work. The issue is that after training, I get a KeyError with
print model['SENT_0'].

I have checked all of the keys and I can’t find any labels. Then I checked the source code and found:

self.train_words = train_words
self.train_lbls = train_lbls
if sentences is not None:
    self.build_vocab(sentences)
    self.train(sentences)

In the Doc2Vec class, this just calls the build_vocab() method inherited from the Word2Vec class; I wonder how this can generate keys like 'SENT_0', 'SENT_1', …

    Hoping for your reply.
    Thanks

    1. Samuel Rönnqvist

If training was successful, you should be able to access the sentence vector by label like that. Here’s a minimal example:

sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
model = Doc2Vec([sentence], min_count=0)
model['SENT_1']

      1. Mark

That example code doesn’t work for me:

>>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __new__() got an unexpected keyword argument 'labels'

Step one doesn’t work. But it appears that LabeledSentence now wants 'tags' instead of 'labels', so I’ll update it.

>>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
>>> model = gensim.models.Doc2Vec([sentence], min_count=0)
>>> model['SENT_1']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\u772700\AppData\Local\Continuum\Anaconda (x86)\lib\site-packages\gensim\models\word2vec.py", line 1204, in __getitem__
    return self.syn0[self.vocab[words].index]
KeyError: 'SENT_1'

I can build the model, BUT the sentence tag doesn’t exist in the vocabulary.

        2. wolfgang

I believe you need to use model.docvecs['SENT_1']; the API has changed, as Radim mentioned.
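For example, a sketch against the newer (post-0.12) API:

vector = model.docvecs['SENT_1']                # document vector, looked up by tag
similar = model.docvecs.most_similar('SENT_1')  # most similar document tags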


    3. Aaron

      Something like this will work with the updated API:

      document_embedding_matrix = np.array([doc2vec_model.infer_vector(sents[i].words) for i in range(len(sents))])

  5. Zach

    I have a model trained with Doc2Vec. (It’s very cool!)

    Is there an easy way to extract JUST the vectors for all of the sentences?

    Basically, I want to save a matrix of all the original documents and their corresponding vectors.

    1. Claudio

I think you only have to read the labels which start with “SENT_”, because those are the labels which contain the vector representations. However, I’m experimenting with Doc2Vec, and after a successful training process my model has some missing sentence labels. Did you solve it?

      1. Nate

        Let me start off by saying I really like Doc2Vec and I appreciate the efforts that have gone into making it!

        After successfully training a Doc2Vec model, many of my sentence labels are missing in the trained model and many new labels have appeared representing individual words (presumably extracted from some of the sentences, themselves). Maybe I’m not correctly understanding the purpose of the labels, but I thought that perhaps they represented the “hooks” for extracting the vector representations for each sentence I trained on?

        1. Claudio

Nate, how are you training your model? I fixed the missing sentence labels by setting the parameter min_count=1. According to the documentation, “min_count = ignore all words with total frequency lower than this.” However, the problem with this solution is that we keep the noise in the dataset (very specific words).

  6. Paul F

Searching for golden needles in unlabeled email haystacks… I was astounded to see in Table 2 of the Le and Mikolov article that LDA demonstrates a 32.58% error rate compared to the authors’ Paragraph Vector model with only a 7.42% error rate. I am searching an unlabeled corpus of 10+ million emails to determine by topic what exists. We ran our initial test of LDA (single core) a few days ago on the Enron email test file but have not had time to thoroughly analyze the results. This difference in error rate means I likely must add Doc2Vec to the process.

Under LDA our goal was a very homogeneous group of clusters, with the aim of quickly eliminating the NOT RELEVANT clusters and the documents they include from further processing, based upon a manual review of the 5-10 top ranked topics in the 10-20 top ranked documents in each cluster. (Recognizing that there could be MANY thousands of homogeneous clusters, we are thinking of ways the machine can determine 80%+ of the not-relevant clusters based on rules yet to be determined.) The NOT RELEVANT clusters and their documents become a training set, as do specific paragraphs and sentences of the POSSIBLY RELEVANT clusters, when we move to supervised learning using LSA. LDA gets me the top ranked topics of the top ranked documents in each cluster but apparently has limitations under distributed computing. LSA seems more effective for distributed computing and is a comfortable, known process for training with manually determined LSA training sets.

Doc2Vec apparently brings sentence, paragraph and possibly even document labeling, still with the benefit of sentiment analysis, at an apparently vastly increased processing time. Since my 12-core machine won’t cut it on Doc2Vec, and I need to acquire a considerable number of cores and build a LAN: what suggestions can anyone supply concerning the least cost per core (I am at about US$45 each), and how do I determine the amount of RAM required per worker core for a good trade-off between processing speed and core cost? Will each worker need identical processors and RAM? Since they are plentiful, I am contemplating Dell Precision T7500s with dual 6-core CPUs, and RAM based upon your comments. Please confirm that under all three models only actual cores, and not hyperthreaded fake cores, are usable?

    Thanks folks, Your efforts are appreciated.
    Paul F.

  7. Alex

I am interested in such an application:
Can I use an existing word2vec C-format binary file (e.g. GoogleNews-vectors-negative300.bin) to help train the paragraph vectors? I assume I should follow these steps:

step 1:
model = Doc2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
step 2:
model.train(sentences)

But when I did it this way, an error popped up:

File "<stdin>", line 1, in <module>
File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in train
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in <genexpr>
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
AttributeError: 'Vocab' object has no attribute 'sample_probability'

So is it caused by the binary word2vec format, in which some information is lost?

  8. yuka

Hello Radim, thank you for your tutorials; they are really interesting and enlightening. I am a rookie in Python and therefore have some issues manipulating doc2vec. Here it is: I have a corpus (folder) with five classes (folders) inside, and each document is simply a txt file. In this case, how should I start training on all these documents? Thanks.

  9. Huo

Hi,
In the LabeledLineSentence class, every sentence line gets a unique label, 'SENT_%s' % item_no. But what if two lines have duplicate contents? In your code, they will get two different labels. Is that reasonable? Thanks.

  10. JohnDannl

Hi Radim,
I really appreciate your good work! I have successfully completed the first step, training a model, but I find the same problem Yikang posted: the train() method doesn’t add new sentences to an existing model, which I hit when I want to predict on some new sentences. According to Mikolov’s paper, in “the inference stage” we can get paragraph vectors D for new paragraphs. Is there something I missed? Looking forward to your reply.

  11. Huanliang Wang

Hi, Radim:
I want to get the vector of an input sentence after training has finished, but I can’t find any method to get it. Could you tell me how to do this in gensim?

  12. Huanliang Wang

Hi, Radim:
After training finished, I find some SENT_?? labels do not exist. For example:
model['SENT_22']
throws KeyError: 'SENT_22'.
But both model['SENT_21'] and model['SENT_23'] work.

Is it an error? How do I avoid such errors?

    1. Cathy1272015

I came across the same problem when I ran the program as described above, and I don’t know why. If you find the answer, please tell me. Thanks!

        1. xinchun

You can use model.docvecs[''] to get it. But I also have a question: how can I predict on new data using the model I trained before?

  13. Andrew Beam

Thanks for implementing this; I’ve had fun with the new module. Is there any way to assess convergence? Can we access the log-likelihood value somehow? I have no idea how many epochs I should run this for, or whether my alpha is reasonable.

  14. Debasis Ganguly

I’m using doc2vec on one of the SEMEVAL ’14 datasets. I’ve got 755 lines (sentences) in a text file.
However, after executing the following code:

sentences = gensim.models.doc2vec.LabeledLineSentence('test.txt')
model = gensim.models.doc2vec.Doc2Vec(sentences, size=10, window=5, min_count=5, workers=4)
model.save_word2vec_format('svectors.txt')

when I checked whether all sentences had been stored in the vector file, to my surprise I found that some sentences are missing! More precisely,
grep -c SENT_ svectors.txt
gave me an output of 735.

Wondering what might be the cause of the 20 missing sentences?

    1. Debasis Ganguly

Setting min_count to 2 fixes this problem. It was ignoring sentences where one of the constituent words had a frequency lower than min_count.

  15. Erick

I’m not sure I understood the point of having two labels on a sentence. Does it mean that the sentence can be used to train the vectors of two paragraphs?

  16. hj

Hello,

Thank you so much for this helpful tutorial.
I have a question about speed.
I have a 366 MB text file and wanted to create a doc2vec model. However, it seems to have stopped: the log has been stuck at 30.23% for the last ten hours. So I was wondering, what is the maximum size the doc2vec code can handle? Also, any suggestions on how I can increase the speed and size limit?

Thanks again, have a nice day :)

  17. Silvia

I am using the Doc2vec class with a corpus containing very short sentences, some with as little as one word. I observed that for many sentences, especially the short ones, Doc2vec does not provide any representation. Could you explain why? And how could I solve this?

    Thanks in advance!

  18. R

    Hi Radim

Great port. Do you have an example of how to use Doc2vec for information retrieval, or any advice on the subject?

  19. PA

    Hi Radim,

Great software. I have a question. Suppose I have created a model and wish to find similarities with new sentences. I create new labels for each new sentence. I can run model.train(), but it does not update the vocabulary of the loaded model to include the words in the new sentences. If I do

model = Doc2Vec.load("orig.model")  # this has 100000 labels

sentences = []
currentlines = 100000
for line in lines:
    currentlines += 1
    label = 'SENT_' + str(currentlines)
    sentence = models.doc2vec.LabeledSentence(unicodewords, labels=[unicode(label)])
    sentences.append(sentence)

model.train(sentences)

print model.most_similar("SENT_109900")

Since this label is not in the original model, it complains with

KeyError: "word 'SENT_109900' not in vocabulary"

The mechanism lets me update the model with train() but does not allow me to update the vocabulary. How is this possible?

    1. Radim

Exactly right: the model doesn’t allow adding new vocabulary (only updating the existing one).

But I have good news for you! One of gensim’s users is creating a new pull request on github which will allow adding new words too. This way, the model should become fully online.

      1. Shima

        Hi Radim,

I would like to use doc2vec for classification. However, it seems we need to have the test documents as well as the training documents at training time, in order to build the vocabulary. Is that really the case? If yes, how can such a model be used in practice to classify UNSEEN documents? I get a KeyError when I try to get the document representation of unseen docs.

  20. Rob

    Does pull 356 mean that you no longer need each document vector in RAM?

    Also, which pull request was going to make doc2vec online? Is it the same one?

    Thanks!! These implementations are amazing.

  21. Pingback: How to train p(category|title) model with word2vec - codeengine

  22. Sagar Arora

Any updates on the scalable version of doc2vec, the one which does not require all label vectors to be stored separately in RAM?

    Looking forward to it

  23. arandomuser

Sorry, but why have you published a “””tutorial””” that (a) is out of date, (b) is hard to follow, (c) is full of errors, AND (d) doesn’t help people AT ALL? Someone posted a link to github with a “new” tutorial. It looks like your company is funny and cannot be taken seriously. I am looking forward to your response, if you have something to say.

1. Radim (post author)

      Yeah, we should probably update this tutorial with the new one, instead of just putting a disclaimer on top.

      What errors did you find here?

  24. Pingback: Python:How to calculate the sentence similarity using word2vec model of gensim with python – IT Sprite

  25. Ola Gustafsson

I’ve done some experimenting with doc2vec for recommendation purposes, looking for documents that share similar meaning, and the results seem very promising.

However, the vectors I extract after training on a corpus are not the same vectors I get from the infer method used on the same texts. I was expecting them to be identical; was I right to do so? There are alpha parameters both when training a model and at the infer stage. Are these, or something else, the reason behind the different vectors?

  26. parisa

Hello,
I’m new to gensim and I want to extract the semantics of a document’s words, but I don’t know how to do it: how to load my dataset, or the other steps. Your tutorial is not complete for beginners; I studied your tutorials but I don’t understand them. Please help me, I need it.
Thanks

      1. parisa

Thanks, but I don’t have any time for learning this. Can you recommend a person who could write this project for me?

  27. Pingback: Python:Doc2vec : How to get document vectors – IT Sprite

  28. Alex

Big thanks, Radim! Is it possible to train paragraph representations using word representations already trained on a really big corpus?
When I just load the pretrained model and then start to learn sentence vectors, I get an error about non-existent keys in the vocabulary.

Currently, I’ve made my own version to cope with that, but I suspect I’m reinventing the wheel.

So, if I’m right, I’m ready to share a PR; and if I’m wrong, please give me a small (but working) example.
Thank you!

    Best regards,
    Alex.

    1. Alex

Thank you for the prompt reply, but I mean a slightly different thing. I have embeddings of pictures (obtained from a ConvNet), and wanted to compute paragraph2vec over them, just to see whether word similarity will cook up something usable for me.

That means I have no “words” in the usual sense (my “words” are already embeddings).

But in theory there is no problem training par2vec on pictures, so my question is just how to do this easily.

Thank you again in advance!

  30. Karan Singla

I want to cluster chat messages/sentences to group similar messages together. When a new message comes in, I want to assign a cluster to it. How would I get a vector for the new message? As per one of the questions above, I would need to retrain the model every time I get a new sentence to obtain its vector. Is there a better approach? If I want to add 50-60k messages every day, how should I proceed? Will the model size grow linearly with the number of new messages?

  31. PS

What happened to your good old doc2vec tutorial? :-O

It explained the gensim API so nicely: how to load documents, train, test, compute similarity, etc. Now you have made it look and feel so complicated. I guess a small writeup on how to work with your own data is needed.

1. Radim Rehurek (post author)

      Hello PS — do you mean the ipython notebook linked in the preamble here? Or what part concretely feels complicated?

      I agree the doc2vec docs could (should!) be improved… always a struggle with open source projects. Everyone just scratching their own itch, and it’s rarely the documentation.

      1. PS

Thanks for the quick reply, Radim. I was referring to the original tutorial that you had (a year and a half back). I worked with doc2vec extensively when it came out, and I remember your description was good enough for me to start my *first* model. I thought of using doc2vec again for a project yesterday, and now it all looks so super complicated 😀

I suppose we could do a small write-up for the most common use case: load your own data (both streaming and in-memory), train it, get vectors, and predict on new data.

  32. Ken Yeung

    Hi Radim

Thanks for the great work on doc2vec in Python.

I have a question: how do I get similarity using both a word and a tag? For example, something like this:

model.most_similar('MOVIE_123', 'good')

Also, how could I get the sentence as a string based on the tag? If I only know the tag MOVIE_123, is it possible to get the original sentence back as a string?

    Many thanks!

      1. Ken

        Hi Radim,

It seems the question in the mailing list is about multiple tags on a sentence. However, my question is more about how to query similarity with both a word and a tag together. Do you know how this might be done with gensim? Thanks.

  33. Koorm

Hi Radim. I am having a hard time getting your tutorial to run. A basic-case solution would be nice: read 2 documents from a file where each document has 1 sentence per line, and build the doc2vec model. What you have fails here and there. For example,

import gensim
import numpy
import random
import os

sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
model = gensim.models.doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentence)

fails:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      6 sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
      7 model = gensim.models.doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
----> 8 model.build_vocab(sentence)

/home/ehsan/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/word2vec.pyc in build_vocab(self, sentences, keep_raw_vocab, trim_rule)
    506
    507         """
--> 508         self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
    509         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule)  # trim by min_count & precalculate downsampling
    510         self.finalize_vocab()  # build tables & arrays

/home/ehsan/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/doc2vec.pyc in scan_vocab(self, documents, progress_per, trim_rule)
    637             interval_start = default_timer()
    638             interval_count = total_words
--> 639             document_length = len(document.words)
    640
    641         for tag in document.tags:

AttributeError: 'list' object has no attribute 'words'

  34. Ying

If my document labels are not unique, the same document may have different labels and a few documents may share the same label. The result of model.docvecs[label1] then has the following form:

[[ 0.1111,  0.1111,  0.0222,  1.555, ...]
 [ 0.444,   0.555,   1.788,  -0.222, ...]
 ...
 [ 0.555,  -0.7777, ...]]

Is that all the vectors of documents having the label label1? And how can I get each of them?

  35. Tim

Hi, can you please let me know what exactly the output of the build_vocab method is, given a corpus of tagged documents?

1. Radim Řehůřek (post author)

      Hello Tim, the “build_vocab” method has no output (“None” in Python).

If you have any additional questions, the gensim mailing list is the best place for them.

      Best,
      Radim

  36. Wayne

    Hi Radim.
Thank you for your great work. I am trying doc2vec for document classification (using infer_vector to get vectors for new documents).
I want to ask:
1. Should I split one document into multiple lines?
2. If I split each document, should I set the same tag for all the lines from one document?
3. If multiple lines have the same tag, how does the algorithm handle the tag when training the model?
Thank you very much!

  37. ahmed mohammed

    Hi Radim
Nice job, I have learnt a lot from your tutorials and posts. I am facing a big challenge using gensim to check for semantic similarity in 20newsgroups: assuming I have a word or sentence to check for its similarity against all categories found in 20newsgroups, how am I supposed to do this with gensim? Thank you.

  38. Haroon

Hello, just discovered your blog… bookmarked!

Quick thing: I think the label attribute is now called tags in Doc2Vec.

Thanks much

  39. Pingback: Visual Question Answer – badripatro

  40. Pingback: Gensim Vektörel Doküman Eğitimi - gurmezin sci-tech-art

  41. Pingback: Googling doc2vec - Luminis Amsterdam : Luminis Amsterdam

  42. Pingback: Implementing doc2vec - Luminis Amsterdam : Luminis Amsterdam

  43. Pingback: Twitterverse's opinion of NASA Research - Padmashri Suresh
