Doc2vec tutorial

Radim Řehůřek · gensim, programming · 89 comments

The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.

Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to learn, in an unsupervised fashion, continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.

IMPORTANT NOTE: the doc2vec functionality received a major facelift in gensim 0.12.0. The API is now cleaner, training is faster, and more tuning parameters are exposed. While the basic ideas explained below still apply, see this IPython notebook for a more up-to-date tutorial on using doc2vec. For a commercial document similarity engine, see our scaletext.com.

Continuing in Tim’s own words:

Input

Since the Doc2Vec class extends gensim’s original Word2Vec class, many of the usage patterns are similar. You can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, or almost any other parameter that you can change with the Word2Vec model.

The one exception to this rule is the set of parameters relating to the training method used by the model. In the word2vec architecture, the two algorithm names are “continuous bag of words” (cbow) and “skip-gram” (sg); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (dm) and “distributed bag of words” (dbow). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the dbow model if you wish, by passing the dm=0 flag to the constructor.
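For example, a minimal sketch (all other parameters left at their defaults):

model = Doc2Vec(sentences, dm=0)  # dm=0 forces the distributed bag of words (dbow) algorithm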

The input to Doc2Vec is an iterator of LabeledSentence objects. Each such object represents a single sentence, and consists of two simple lists: a list of words and a list of labels:

sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])

The algorithm then runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset.

Although this architecture permits more than one label per sentence (and I myself have used it this way), I suspect the most popular use case would be to have a single label per sentence which is the unique identifier for the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # one sentence per line; label each line with its (unique) line number
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

A more robust version of the LabeledLineSentence class above is also included in the doc2vec module, so you can use that instead. Read the doc2vec API docs for all constructor parameters.
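Putting the pieces together, a minimal sketch of training on such a file (the path is hypothetical):

from gensim.models.doc2vec import Doc2Vec

sentences = LabeledLineSentence('/tmp/corpus.txt')  # the class defined above; hypothetical file
model = Doc2Vec(sentences)  # builds the vocabulary and trains, all in one call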

Training

Doc2Vec learns representations for words and labels simultaneously. If you wish to only learn representations for words, you can pass the flag train_lbls=False to the Doc2Vec constructor. Similarly, if you only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False.
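For example, a minimal sketch that freezes the word vectors and trains only the label vectors:

model = Doc2Vec(sentences, train_words=False)  # learn label vectors only; word vectors stay fixed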

One caveat of the way this algorithm runs is that, since the learning rate decreases over the course of iterating over the data, a label which is seen in only a single LabeledSentence during training is only ever trained with whatever the learning rate happened to be at that moment. This frequently produces less than optimal results. I have obtained better results by iterating over the data several times and either

  1. randomizing the order of input sentences, or
  2. manually controlling the learning rate over the course of several iterations.
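A minimal sketch of option 1, assuming the corpus is small enough to materialize as a plain list (the file path is hypothetical):

import random

sentences = list(LabeledLineSentence('/tmp/corpus.txt'))  # hypothetical file, loaded into RAM
model = Doc2Vec()  # default parameters
model.build_vocab(sentences)
for epoch in range(10):
    random.shuffle(sentences)  # present the sentences in a fresh order on every pass
    model.train(sentences)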

For option 2, if one wanted to manually control the learning rate over the course of 10 epochs, one could use the following:

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(sentences)
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

The code runs on optimized C (via Cython), just like the original word2vec, so it’s fairly fast.

Note from Radim: I wanted to include the obligatory run-on-English-Wikipedia-example at this point, with some timings and code. But I couldn’t get reasonable results out of Doc2Vec, and didn’t want to delay publishing Tim’s write up any longer while I experiment. Despair not; the caravan goes on, and we’re working on a more scalable version of doc2vec, one which doesn’t require a vector in RAM for each document, and with a simpler API for inference on new documents. Ping me if you want to help.

Memory Usage

With the current implementation, all label vectors are stored separately in RAM. In the case above with a unique label per sentence, this causes memory usage to grow linearly with the size of the corpus, which may or may not be a problem depending on the size of your corpus and the amount of RAM available on your box. For example, I’ve successfully run this over a collection of over 2 million sentences with no problems whatsoever; however, when I tried to run it on 20x that much data my box ran out of RAM since it needed to create a new vector for each sentence.
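As a rough back-of-the-envelope estimate, assuming the default vector size of 100 and 4-byte float32 weights:

num_labels = 2000000  # one unique label per sentence
vector_size = 100     # the default dimensionality
print num_labels * vector_size * 4 / (1024.0 ** 3)  # ~0.75 GB for the label vectors alone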

I/O

The usage for Doc2Vec is the same as for gensim’s Word2Vec. One can save and load gensim Doc2Vec instances in the usual ways: directly with Python’s pickle, or using the optimized Doc2Vec.save() and Doc2Vec.load() methods:

model = Doc2Vec(sentences)
...
# store the model to mmap-able files
model.save('/tmp/my_model.doc2vec')
# load the model back
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')

Helper functions like model.most_similar(), model.doesnt_match() and model.similarity() also exist. The raw word and label vectors are accessible either individually via model['word'], or all at once via model.syn0.
See the docs.

The main point is, labels act in the same way as words in Doc2Vec. So, to get the most similar words/sentences to the first sentence (label SENT_0, for example), you’d do:

print model.most_similar("SENT_0")
[('SENT_48859', 0.2516525387763977),
 (u'paradox', 0.24025458097457886),
 (u'methodically', 0.2379375547170639),
 (u'tongued', 0.22196565568447113),
 (u'cosmetics', 0.21332012116909027),
 (u'Loos', 0.2114654779434204),
 (u'backstory', 0.2113303393125534),
 ('SENT_60862', 0.21070502698421478),
 (u'gobble', 0.20925869047641754),
 ('SENT_73365', 0.20847654342651367)]

or to get the raw embedding for that sentence as a NumPy vector:

print model["SENT_0"]

etc. More functionality coming soon!


If you liked this article, you may also enjoy the Optimizing word2vec series and the Word2vec tutorial.

Comments (89)

      1. gj

Hi, thanks for your tutorial. I want to get sentence vectors; my training text is “F:\\jj\\g.txt”, but I cannot work out the code. Could you help me write code to get sentence vectors? Thank you very much.

      2. GJ

Hi, I have a question and need your help. I trained a model named “gj.bin”, but when I write model = Doc2Vec.load_word2vec_format('F:\\JJ\\gj.bin', binary=True), it shows UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 2: invalid start byte. What should I do?

  1. Wen

Hi, thanks for your awesome tutorial. What should I do if I want to input new test sentences to find similar sentences after training?

    Thank you

    1. Samuel Rönnqvist

You should be able to add new sentences to an existing model using train(), and then run most_similar(['SENT_XX']), where SENT_XX is the label of one of your new sentences.

      1. Yikang

Thanks for your awesome work. But I find that train() doesn’t add new sentences to an existing model; I get a KeyError when I try. Maybe train() just updates the weights?

        1. Ólavur Mortensen

          I would very much like to know this too. I can’t figure out how to train on more data in word2vec either.

  2. dhruv shah

    Hi Radim,

    I just had a quick question.
    The doc2vec is an unsupervised algorithm. So why do we have to provide labels when we are training the model?

    1. Bach

      Hi, I think that the labels here are not the “y”, they are like tags attached to each sentence, so you can access the vector that represents the sentence.

1. Radim Rehurek (post author)

          That’s up to you. Each label CAN be unique. Or you can use only a few labels across all documents (such as 10 target classes = labels for 1 million documents, reusing a single label for many documents).

          You would then learn 10 vectors via doc2vec, one vector to represent each class=label.

          Also note you can provide multiple labels per document (each document can be tagged with multiple “classes”).
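A hypothetical sketch of a sentence carrying both its own unique ID and a shared class label:

sentence = LabeledSentence(
    words=[u'some', u'words', u'here'],
    labels=[u'SENT_42', u'CLASS_7'])  # both labels receive their own trained vector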

          1. Nataly Maslova

            Hello Radim.
I try to attach 2 labels to some sentences in my sample and get this error:

File "C:\doc2v\mylabelseparate.py", line 122, in <module>
  model_dm.build_vocab(np.concatenate((x_train, x_test, unsupforest)))  # build vocab over all reviews
File "C:\Python27\lib\site-packages\gensim\models\word2vec.py", line 396, in build_vocab
  vocab = self._vocab_from(sentences)
File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 200, in _vocab_from
  sentence_length = len(sentence.words)
TypeError: object of type 'LabeledSentence' has no len()

            I would be extremely grateful to you if you look at my post:
            http://stackoverflow.com/questions/39504130/error-object-of-type-labeledsentence-has-no-len-while-building-vocabulary-in

  3. Bach

    Hi,

How can I get the vector representing a paragraph? My understanding is that I can access the vectors representing sentences via their labels; so how do I label a paragraph containing a few sentences?

    Thanks.

      1. Bach

        Hi,
So will each sentence in a paragraph have one label of its own and one label for the paragraph? E.g., for paragraph 1:
sentence.labels = [u'SENT_i', u'PARA_1']
for sentence i.

        1. Samuel Rönnqvist

          You could also feed entire paragraphs (or arbitrarily long sequences of tokens) together with a PARA_X label, to reduce memory usage (i.e., number of unique labels), unless you’re interested in separating the semantics of paragraphs and individual sentences.
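A sketch of that idea (the token list and label are hypothetical):

paragraph = LabeledSentence(
    words=paragraph_tokens,  # hypothetical: all tokens of the paragraph, in order
    labels=[u'PARA_1'])      # one shared label for the entire paragraph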

  4. Jeff

    Hi Radim,

First, thank you for this tutorial. I did what you said in this tutorial, but it didn’t work. The issue is that after training, I get a KeyError with
print model['SENT_0'].

I have checked all of the keys and I can’t find any labels. Then I checked the source code and found:

self.train_words = train_words
self.train_lbls = train_lbls
if sentences is not None:
    self.build_vocab(sentences)
    self.train(sentences)

In the Doc2Vec class, this just calls the build_vocab() method inherited from the Word2Vec class; I wonder how this can generate keys like 'SENT_0', 'SENT_1', …

    Hoping for your reply.
    Thanks

    1. Samuel Rönnqvist

If training was successful, you should be able to access the sentence vector by label like that. Here’s a minimal example:

sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
model = Doc2Vec([sentence], min_count=0)
model['SENT_1']

      1. Mark

That example code doesn’t work for me:

>>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __new__() got an unexpected keyword argument 'labels'

Step one doesn’t work. But it appears that LabeledSentence now wants 'tags' instead of 'labels', so I’ll update it.

>>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
>>> model = gensim.models.Doc2Vec([sentence], min_count=0)
>>> model['SENT_1']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\u772700\AppData\Local\Continuum\Anaconda (x86)\lib\site-packages\gensim\models\word2vec.py", line 1204, in __getitem__
    return self.syn0[self.vocab[words].index]
KeyError: 'SENT_1'

I can build the model, BUT the sentence tag doesn’t exist in the vocabulary.

        2. wolfgang

I believe you need to use model.docvecs['SENT_1']; the API has changed, as Radim mentioned.
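For example, a sketch against the newer (post-0.12) API:

vector = model.docvecs['SENT_1']                # document vector, looked up by tag
similar = model.docvecs.most_similar('SENT_1')  # most similar document tags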


    3. Aaron

      Something like this will work with the updated API:

      document_embedding_matrix = np.array([doc2vec_model.infer_vector(sents[i].words) for i in range(len(sents))])

  5. Zach

    I have a model trained with Doc2Vec. (It’s very cool!)

    Is there an easy way to extract JUST the vectors for all of the sentences?

    Basically, I want to save a matrix of all the original documents and their corresponding vectors.

    1. Claudio

I think you only have to read the labels which start with “SENT_”, because those are the labels which contain the vector representations. However, I’m experimenting with Doc2Vec, and after a successful training process my model has some missing sentence labels. Did you solve it?

      1. Nate

        Let me start off by saying I really like Doc2Vec and I appreciate the efforts that have gone into making it!

        After successfully training a Doc2Vec model, many of my sentence labels are missing in the trained model and many new labels have appeared representing individual words (presumably extracted from some of the sentences, themselves). Maybe I’m not correctly understanding the purpose of the labels, but I thought that perhaps they represented the “hooks” for extracting the vector representations for each sentence I trained on?

        1. Claudio

Nate, how are you training your model? I fixed the missing sentence labels by setting the parameter min_count=1. According to the documentation, “min_count = ignore all words with total frequency lower than this.” However, the problem with this solution is that we keep the noise in the dataset (very specific words).

  6. Paul F

Searching for golden needles in unlabeled email haystacks… I was astounded to see in Table 2 of the Le and Mikolov article that LDA demonstrates a 32.58% error rate compared to the authors’ Paragraph Vector model with only a 7.42% error rate. I am searching an unlabeled corpus of 10+ million emails to determine by topic what exists. We ran our initial test of LDA (single core) a few days ago on the Enron email test file but have not had time to thoroughly analyze the results. This difference in error rate means I likely must add Doc2Vec to the process.

Under LDA our goal was a very homogeneous group of clusters, with the aim of quickly eliminating the NOT RELEVANT clusters and the documents they include from further processing, based upon a manual review of the 5-10 top ranked topics in the 10-20 top ranked documents in each cluster. (Recognizing that there could be MANY thousands of homogeneous clusters, we are thinking of ways the machine can determine 80%+ of the not-relevant clusters based on rules yet to be determined.) The NOT RELEVANT clusters and their documents become a training set, as do specific paragraphs and sentences of the POSSIBLY RELEVANT clusters, when we move to supervised learning using LSA. LDA gets me the top ranked topics of the top ranked documents in each cluster but apparently has limitations under distributed computing. LSA seems more effective for distributed computing and is a comfortable, known process for training with manually determined LSA training sets.

Doc2Vec apparently brings sentence, paragraph and possibly even document labeling, still with the benefit of sentiment analysis, at an apparently vastly increased processing time. Since my 12-core machine won’t cut it on Doc2Vec, and I need to acquire a considerable number of cores and build a LAN: what suggestions can anyone supply concerning the least cost per core (I am at about US$45 each), and how do I determine the amount of RAM required per worker core for a good trade-off between processing speed and core cost? Will each worker need identical processors and RAM? Since they are plentiful, I am contemplating Dell Precision T7500s with dual 6-core CPUs, and RAM based upon your comments. Please confirm that under all three models only actual cores, and not hyperthreaded fake cores, are usable?

    Thanks folks, Your efforts are appreciated.
    Paul F.

  7. Alex

I am interested in such an application:
Can I use an existing word2vec C-format binary file (e.g. GoogleNews-vectors-negative300.bin) to help train the paragraph vectors? I assume I should follow these steps:

step 1:
model = Doc2Vec.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
step 2:
model.train(sentences)

But when I did it this way, an error popped up:

File "<stdin>", line 1, in <module>
File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in train
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in <genexpr>
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
AttributeError: 'Vocab' object has no attribute 'sample_probability'

So is it caused by the binary word2vec format, in which some information is lost?

  8. yuka

Hello Radim, thank you for your tutorials; they are really interesting and enlightening. I am a rookie in Python and therefore have some issues manipulating doc2vec. Here it is: I have a corpus (folder) with five classes (folders) inside, and each document is simply a txt file. In this case, how should I start training on all these documents? Thanks.

  9. Huo

Hi,
In the LabeledLineSentence class, every sentence line gets a unique label, 'SENT_%s' % item_no. But what if two lines have duplicate contents? In your code, they will get two different labels. Is that reasonable? Thanks.

  10. JohnDannl

Hi Radim,
I really appreciate your good work! I have successfully completed the first step, training a model, but I find the same problem Yikang posted: the train() method doesn’t add new sentences to an existing model, which I hit when I want to predict on some new sentences. According to Mikolov’s paper, in “the inference stage” we can get paragraph vectors D for new paragraphs. Is there something I missed? Looking forward to your reply.

  11. Huanliang Wang

Hi, Radim:
I want to get the vector of an input sentence after training has finished, but I can’t find any method to get it. Could you tell me how to do this in gensim?

  12. Huanliang Wang

Hi, Radim:
After training finished, I find some SENT_?? labels do not exist. For example:
model['SENT_22']
throws KeyError: 'SENT_22'.
But both model['SENT_21'] and model['SENT_23'] work.

Is it an error? How do I avoid such errors?

    1. Cathy1272015

I came across the same problem when I ran the program as described above, and I don’t know why. If you find the answer, please tell me. Thanks!

        1. xinchun

You can use model.docvecs[''] to get it. But I also have a question: how can I predict on new data using the model I trained before?

  13. Andrew Beam

Thanks for implementing this; I’ve had fun with the new module. Is there any way to assess convergence? Can we access the log-likelihood value somehow? I have no idea how many epochs I should run this for, or whether my alpha is reasonable.

  14. Debasis Ganguly

I’m using doc2vec on one of the SEMEVAL ’14 datasets. I’ve got 755 lines (sentences) in a text file.
However, after executing the following code:

sentences = gensim.models.doc2vec.LabeledLineSentence('test.txt')
model = gensim.models.doc2vec.Doc2Vec(sentences, size=10, window=5, min_count=5, workers=4)
model.save_word2vec_format('svectors.txt')

when I checked whether all sentences had been stored in the vector file, to my surprise I found that some sentences are missing! More precisely,
grep -c SENT_ svectors.txt
gave me an output of 735.

Wondering what might be the cause of the 20 missing sentences?

    1. Debasis Ganguly

Setting min_count to 2 fixes this problem. It was ignoring sentences where one of the constituent words had a frequency lower than min_count.

  15. Erick

I’m not sure I understood the point of having two labels on a sentence. Does it mean that the sentence can be used to train the vectors of two paragraphs?

  16. hj

Hello,

Thank you so much for this helpful tutorial.
I have a question about speed.
I have a 366 MB text file and wanted to create a doc2vec model. However, it seems to have stopped: the log has been stuck at 30.23% for the last ten hours. So I was wondering, what is the maximum size the doc2vec code can handle? Also, any suggestions on how I can increase the speed and size limit?

Thanks again, have a nice day :)

  17. Silvia

I am using the Doc2vec class with a corpus containing very short sentences, some with as little as one word. I observed that for many sentences, especially the short ones, Doc2vec does not provide any representation. Could you explain why? And how could I solve this?

    Thanks in advance!

  18. R

    Hi Radim

Great port. Do you have an example of how to use Doc2vec for information retrieval, or any advice on the subject?

  19. PA

    Hi Radim,

Great software. I have a question. Suppose I have created a model and wish to find similarities with new sentences. I create new labels for each new sentence. I can run model.train(), but it does not update the vocabulary of the loaded model to include the words in the new sentences. If I do

model = Doc2Vec.load("orig.model")  # this has 100000 labels

sentences = []
currentlines = 100000
for line in lines:
    currentlines += 1
    label = 'SENT_' + str(currentlines)
    sentence = models.doc2vec.LabeledSentence(unicodewords, labels=[unicode(label)])
    sentences.append(sentence)

model.train(sentences)

print model.most_similar("SENT_109900")

Since this label is not in the original model, it complains with

KeyError: "word 'SENT_109900' not in vocabulary"

The mechanism lets me update the model with train() but does not allow me to update the vocabulary. How is this possible?

    1. Radim

Exactly right: the model doesn’t allow adding new vocabulary (only updating the existing one).

But I have good news for you! One of gensim’s users is creating a new pull request on github which will allow adding new words too. This way, the model should become fully online.

      1. Shima

        Hi Radim,

I would like to use doc2vec for classification. However, it seems we need to have the test documents as well as the training documents at training time, in order to build the vocabulary. Is that really the case? If yes, how can such a model be used in practice to classify UNSEEN documents? I get a KeyError when I try to get the document representation of unseen docs.

  20. Rob

    Does pull 356 mean that you no longer need each document vector in RAM?

    Also, which pull request was going to make doc2vec online? Is it the same one?

    Thanks!! These implementations are amazing.

  21. Pingback: How to train p(category|title) model with word2vec - codeengine

  22. Sagar Arora

Any updates on the scalable version of doc2vec, the one which does not require all label vectors to be stored separately in RAM?

    Looking forward to it

  23. arandomuser

Sorry, but why have you published a “””tutorial””” that (a) is out of date, (b) is hard to follow, (c) is full of errors, AND (d) doesn’t help people AT ALL? Someone posted a link to github with a “new” tutorial. It looks like your company is funny and cannot be taken seriously. I am looking forward to your response, if you have something to say.

1. Radim (post author)

      Yeah, we should probably update this tutorial with the new one, instead of just putting a disclaimer on top.

      What errors did you find here?

  24. Pingback: Python:How to calculate the sentence similarity using word2vec model of gensim with python – IT Sprite

  25. Ola Gustafsson

I’ve done some experimenting with doc2vec for recommendation purposes, looking for documents that share similar meaning, and the results seem very promising.

However, the vectors I extract after training on a corpus are not the same vectors I get from the infer method used on the same texts. I was expecting them to be identical; was I right to do so? There are alpha parameters both when training a model and at the infer stage. Are these, or something else, the reason behind the different vectors?

  26. parisa

Hello,
I’m new to gensim and I want to extract the semantics of a document’s words, but I don’t know how to do it: how to load my dataset, or the other steps. Your tutorial is not complete for beginners; I studied your tutorials but I don’t understand them. Please help me, I need it.
Thanks

      1. parisa

Thanks, but I don’t have any time for learning this. Can you recommend a person who could write this project for me?

  27. Pingback: Python:Doc2vec : How to get document vectors – IT Sprite

  28. Alex

Big thanks, Radim! Is it possible to train paragraph representations using word representations already trained on a really big corpus?
When I just load the pretrained model and then start to learn sentence vectors, I get an error about non-existent keys in the vocabulary.

Currently, I’ve made my own version to cope with that, but I suspect I’m reinventing the wheel.

So, if I’m right, I’m ready to share a PR; and if I’m wrong, please give me a small (but working) example.
Thank you!

    Best regards,
    Alex.

    1. Alex

Thank you for the prompt reply, but I mean a slightly different thing. I have embeddings of pictures (obtained from a ConvNet), and wanted to compute paragraph2vec over them, just to see whether word similarity will cook up something usable for me.

That means I have no “words” in the usual sense (my “words” are already embeddings).

But in theory there is no problem training par2vec on pictures, so my question is just how to do this easily.

Thank you again in advance!

  30. Karan Singla

I want to cluster chat messages/sentences to group similar messages together. When a new message comes in, I want to assign a cluster to it. How would I get a vector for the new message? As per one of the questions above, I would need to retrain the model every time I get a new sentence to obtain its vector. Is there a better approach? If I want to add 50-60k messages every day, how should I proceed? Will the model size grow linearly with the number of new messages?

  31. PS

What happened to your good old doc2vec tutorial? :-O

It explained the gensim API so nicely: how to load documents, train, test, compute similarity, etc. Now you have made it look and feel so complicated. I guess a small writeup on how to work with your own data is needed.

1. Radim Rehurek (post author)

      Hello PS — do you mean the ipython notebook linked in the preamble here? Or what part concretely feels complicated?

      I agree the doc2vec docs could (should!) be improved… always a struggle with open source projects. Everyone just scratching their own itch, and it’s rarely the documentation.

      1. PS

Thanks for the quick reply, Radim. I was referring to the original tutorial that you had (a year and a half back). I worked with doc2vec extensively when it came out, and I remember your description was good enough for me to start my *first* model. I thought of using doc2vec again for a project yesterday, and now it all looks so super complicated 😀

I suppose we could do a small write-up for the most common use case: load your own data (both streaming and in-memory), train it, get vectors, and predict on new data.

  32. Ken Yeung

    Hi Radim

Thanks for the great work on doc2vec in Python.

I have a question: how do I get similarity using both a word and a tag? For example, something like this:

model.most_similar('MOVIE_123', 'good')

Also, how could I get the sentence as a string based on the tag? If I only know the tag MOVIE_123, is it possible to get the original sentence back as a string?

    Many thanks!

      1. Ken

        Hi Radim,

It seems the question in the mailing list is about multiple tags on a sentence. However, my question is more about how to query similarity with both a word and a tag together. Do you know how this might be done with gensim? Thanks.

  33. Koorm

Hi Radim. I am having a hard time getting your tutorial to run. A basic-case solution would be nice: read 2 documents from a file where each document has 1 sentence per line, and build the doc2vec model. What you have fails here and there. For example,

import gensim
import numpy
import random
import os

sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
model = gensim.models.doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentence)

fails:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      6 sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
      7 model = gensim.models.doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)
----> 8 model.build_vocab(sentence)

/home/ehsan/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/word2vec.pyc in build_vocab(self, sentences, keep_raw_vocab, trim_rule)
    506
    507         """
--> 508         self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
    509         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule)  # trim by min_count & precalculate downsampling
    510         self.finalize_vocab()  # build tables & arrays

/home/ehsan/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/models/doc2vec.pyc in scan_vocab(self, documents, progress_per, trim_rule)
    637             interval_start = default_timer()
    638             interval_count = total_words
--> 639             document_length = len(document.words)
    640
    641         for tag in document.tags:

AttributeError: 'list' object has no attribute 'words'

  34. Ying

If my document labels are not unique, the same document may have different labels and a few documents may share the same label. The result of model.docvecs[label1] then has the following form:

[[ 0.1111,  0.1111,  0.0222,  1.555, ...]
 [ 0.444,   0.555,   1.788,  -0.222, ...]
 ...
 [ 0.555,  -0.7777, ...]]

Is that all the vectors of documents having the label label1? And how can I get each of them?

  35. Tim

Hi, can you please let me know what exactly the output of the build_vocab method is, given a corpus of tagged documents?

1. Radim Řehůřek (post author)

      Hello Tim, the “build_vocab” method has no output (“None” in Python).

If you have any additional questions, the gensim mailing list is the best place for them.

      Best,
      Radim

  36. Wayne

    Hi Radim.
Thank you for your great work. I am trying doc2vec for document classification (using infer_vector to get vectors for new documents).
I want to ask:
1. Should I split one document into multiple lines?
2. If I split each document, should I set the same tag for all the lines from one document?
3. If multiple lines have the same tag, how does the algorithm handle the tag when training the model?
Thank you very much!

  37. ahmed mohammed

    Hi Radim
Nice job, I have learnt a lot from your tutorials and posts. I am facing a big challenge using gensim to check for semantic similarity in 20newsgroups: assuming I have a word or sentence to check for its similarity against all categories found in 20newsgroups, how am I supposed to do this with gensim? Thank you.

  38. Haroon

Hello, just discovered your blog… bookmarked!

Quick thing: I think the label attribute is now called tags in Doc2Vec.

Thanks much

  39. Pingback: Visual Question Answer – badripatro

  40. Pingback: Gensim Vektörel Doküman Eğitimi - gurmezin sci-tech-art

  41. Pingback: Googling doc2vec - Luminis Amsterdam : Luminis Amsterdam

  42. Pingback: Implementing doc2vec - Luminis Amsterdam : Luminis Amsterdam

  43. Pingback: Twitterverse's opinion of NASA Research - Padmashri Suresh
