Doc2Vec Model

Introduces Gensim’s Doc2Vec model and demonstrates its use on the Lee Corpus.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Doc2Vec is a Model that represents each Document as a Vector. This tutorial introduces the model and demonstrates how to train and assess it.

Here’s a list of what we’ll be doing:

  1. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec

  2. Load and preprocess the training and test corpora (see Corpus)

  3. Train a Doc2Vec model using the training corpus

  4. Demonstrate how the trained model can be used to infer a Vector

  5. Assess the model

  6. Test the model on the test corpus

Review: Bag-of-words

Note

Feel free to skip these review sections if you’re already familiar with the models.

You may be familiar with the bag-of-words model from the Vector section. This model transforms each document to a fixed-length vector of integers. For example, given the sentences:

  • John likes to watch movies. Mary likes movies too.

  • John also likes to watch football games. Mary hates football.

The model outputs the vectors:

  • [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]

  • [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

Each vector has 11 elements, where each element counts the number of times a particular word occurred in the document. The order of elements is arbitrary. In the example above, the order of the elements corresponds to the words: ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"].
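In Gensim, the same idea is implemented with a Dictionary and its doc2bow method. Below is a minimal sketch using the two sentences above, tokenized with simple_preprocess (which also lowercases them); note that doc2bow returns a sparse list of (word_id, count) pairs rather than the dense vectors shown above.

from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

texts = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games. Mary hates football.",
]
tokenized = [simple_preprocess(t) for t in texts]  # lowercased word tokens

dictionary = Dictionary(tokenized)  # maps each unique word to an integer id
bows = [dictionary.doc2bow(doc) for doc in tokenized]
print(bows[0])  # sparse (word_id, count) pairs for the first sentence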

Bag-of-words models are surprisingly effective, but have several weaknesses.

First, they lose all information about word order: “John likes Mary” and “Mary likes John” correspond to identical vectors. There is a partial solution: bag-of-n-grams models represent documents as fixed-length vectors of word phrases of length n, which captures local word order, but they suffer from data sparsity and high dimensionality.

Second, the model does not attempt to learn the meaning of the underlying words, and as a consequence, the distance between vectors doesn’t always reflect the difference in meaning. The Word2Vec model addresses this second problem.

Review: Word2Vec Model

Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings. For example, strong and powerful would be close together and strong and Paris would be relatively far.

Gensim’s Word2Vec class implements this model.

With the Word2Vec model, we can calculate the vectors for each word in a document. But what if we want to calculate a vector for the entire document? We could average the vectors for each word in the document - while this is quick and crude, it can often be useful. However, there is a better way…
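Before we get to that, here is a minimal sketch of the quick-and-crude averaging baseline just described; the toy corpus and helper function are made up purely for illustration.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus, purely for illustration
sentences = [
    ['john', 'likes', 'to', 'watch', 'movies'],
    ['mary', 'likes', 'movies', 'too'],
]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=40)

def average_vector(tokens, model):
    # Crude document vector: the mean of the vectors of in-vocabulary words
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

doc_vector = average_vector(['john', 'likes', 'movies'], w2v)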

Introducing: Paragraph Vector

Important

In Gensim, we refer to the Paragraph Vector model as Doc2Vec.

In 2014, Le and Mikolov introduced the Doc2Vec algorithm, which usually outperforms such simple averaging of Word2Vec vectors.

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim’s Doc2Vec class implements this algorithm.

There are two implementations:

  1. Paragraph Vector - Distributed Memory (PV-DM)

  2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

Important

Don’t let the implementation details below scare you. They’re advanced material: if it’s too much, then move on to the next section.

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document’s doc-vector.

PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document’s doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)
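In Gensim, both variants are available through the same Doc2Vec class: the dm parameter selects the algorithm (dm=1 for PV-DM, the default; dm=0 for PV-DBOW), and dbow_words=1 optionally adds skip-gram training of word-vectors. Here is a minimal sketch of constructing each variant, reusing the hyperparameters we’ll use below:

from gensim.models.doc2vec import Doc2Vec

# PV-DM (the default): context word-vectors and the doc-vector predict the center word
pv_dm_model = Doc2Vec(dm=1, vector_size=50, min_count=2, epochs=40)

# PV-DBOW: the doc-vector alone predicts target words;
# dbow_words=1 additionally trains word-vectors via interleaved skip-gram
pv_dbow_model = Doc2Vec(dm=0, dbow_words=1, vector_size=50, min_count=2, epochs=40)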

Prepare the Training and Test Data

For this tutorial, we’ll be training our model using the Lee Background Corpus included in gensim. This corpus contains 300 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

And we’ll test our model by eye using the much shorter Lee Corpus which contains 50 documents.

import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

Define a Function to Read and Preprocess Text

Below, we define a function to:

  • open the train/test file (with latin encoding)

  • read the file line-by-line

  • pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc)

The file we’re reading is a corpus. Each line of the file is a document.

Important

To train the model, we’ll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let’s take a look at the training corpus

print(train_corpus[:2])
[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]), TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 
'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]

And the testing corpus looks like this:

print(test_corpus[:2])
[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]

Notice that the testing corpus is just a list of lists and does not contain any tags.

Training the Model

Now, we’ll instantiate a Doc2Vec model with a vector size of 50 dimensions, iterating over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in the published Paragraph Vector paper, which used tens of thousands to millions of documents, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.

However, this is a very small dataset (300 documents) with shortish documents (a few hundred words). Adding training passes can sometimes help with such small datasets.

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
2022-12-07 10:59:00,578 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>', 'datetime': '2022-12-07T10:59:00.540082', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'created'}

Build a vocabulary

model.build_vocab(train_corpus)
2022-12-07 10:59:00,806 : INFO : collecting all words and their counts
2022-12-07 10:59:00,808 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-12-07 10:59:00,850 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2022-12-07 10:59:00,850 : INFO : Creating a fresh vocabulary
2022-12-07 10:59:00,887 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 3955 unique words (56.65% of original 6981, drops 3026)', 'datetime': '2022-12-07T10:59:00.886953', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'}
2022-12-07 10:59:00,887 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 55126 word corpus (94.80% of original 58152, drops 3026)', 'datetime': '2022-12-07T10:59:00.887466', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'}
2022-12-07 10:59:00,917 : INFO : deleting the raw counts dictionary of 6981 items
2022-12-07 10:59:00,918 : INFO : sample=0.001 downsamples 46 most-common words
2022-12-07 10:59:00,918 : INFO : Doc2Vec lifecycle event {'msg': 'downsampling leaves estimated 42390.98914085061 word corpus (76.9%% of prior 55126)', 'datetime': '2022-12-07T10:59:00.918276', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'}
2022-12-07 10:59:00,965 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes
2022-12-07 10:59:00,965 : INFO : resetting layer weights

Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the model.wv.get_vecattr() method. For example, to see how many times penalty appeared in the training corpus:

print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.")
Word 'penalty' appeared 4 times in the training corpus.

Next, train the model on the corpus. In the usual case, where the Gensim installation found a BLAS library for optimized bulk vector operations, training on this tiny 300-document, ~60k-word corpus should take just a few seconds. (More realistic datasets of tens of millions of words or more take proportionately longer.) If for some reason a BLAS library isn’t available, training uses a fallback approach that takes 60x-120x longer, so even this tiny training will take minutes rather than seconds. (And, in that case, you should also notice a warning in the logging letting you know there’s something worth fixing.) So, be sure your installation uses the BLAS-optimized Gensim if you value your time.

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
2022-12-07 10:59:01,272 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2022-12-07T10:59:01.271863', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'train'}
2022-12-07 10:59:01,408 : INFO : EPOCH 0: training on 58152 raw words (42665 effective words) took 0.1s, 335294 effective words/s
2022-12-07 10:59:01,462 : INFO : EPOCH 1: training on 58152 raw words (42755 effective words) took 0.1s, 816420 effective words/s
2022-12-07 10:59:01,521 : INFO : EPOCH 2: training on 58152 raw words (42692 effective words) took 0.1s, 745004 effective words/s
2022-12-07 10:59:01,573 : INFO : EPOCH 3: training on 58152 raw words (42670 effective words) took 0.1s, 841368 effective words/s
2022-12-07 10:59:01,627 : INFO : EPOCH 4: training on 58152 raw words (42685 effective words) took 0.1s, 815442 effective words/s
2022-12-07 10:59:01,703 : INFO : EPOCH 5: training on 58152 raw words (42709 effective words) took 0.1s, 578402 effective words/s
2022-12-07 10:59:01,753 : INFO : EPOCH 6: training on 58152 raw words (42594 effective words) took 0.0s, 864899 effective words/s
2022-12-07 10:59:01,804 : INFO : EPOCH 7: training on 58152 raw words (42721 effective words) took 0.0s, 864073 effective words/s
2022-12-07 10:59:01,881 : INFO : EPOCH 8: training on 58152 raw words (42622 effective words) took 0.1s, 566867 effective words/s
2022-12-07 10:59:01,932 : INFO : EPOCH 9: training on 58152 raw words (42770 effective words) took 0.0s, 862066 effective words/s
2022-12-07 10:59:02,006 : INFO : EPOCH 10: training on 58152 raw words (42739 effective words) took 0.1s, 587035 effective words/s
2022-12-07 10:59:02,058 : INFO : EPOCH 11: training on 58152 raw words (42612 effective words) took 0.1s, 850879 effective words/s
2022-12-07 10:59:02,135 : INFO : EPOCH 12: training on 58152 raw words (42655 effective words) took 0.1s, 566216 effective words/s
2022-12-07 10:59:02,187 : INFO : EPOCH 13: training on 58152 raw words (42749 effective words) took 0.1s, 844125 effective words/s
2022-12-07 10:59:02,265 : INFO : EPOCH 14: training on 58152 raw words (42748 effective words) took 0.1s, 556136 effective words/s
2022-12-07 10:59:02,347 : INFO : EPOCH 15: training on 58152 raw words (42748 effective words) took 0.1s, 530528 effective words/s
2022-12-07 10:59:02,398 : INFO : EPOCH 16: training on 58152 raw words (42737 effective words) took 0.0s, 871200 effective words/s
2022-12-07 10:59:02,485 : INFO : EPOCH 17: training on 58152 raw words (42697 effective words) took 0.1s, 499981 effective words/s
2022-12-07 10:59:02,584 : INFO : EPOCH 18: training on 58152 raw words (42747 effective words) took 0.1s, 440730 effective words/s
2022-12-07 10:59:02,672 : INFO : EPOCH 19: training on 58152 raw words (42739 effective words) took 0.1s, 497651 effective words/s
2022-12-07 10:59:02,761 : INFO : EPOCH 20: training on 58152 raw words (42782 effective words) took 0.1s, 499103 effective words/s
2022-12-07 10:59:02,851 : INFO : EPOCH 21: training on 58152 raw words (42580 effective words) took 0.1s, 489515 effective words/s
2022-12-07 10:59:02,939 : INFO : EPOCH 22: training on 58152 raw words (42687 effective words) took 0.1s, 496560 effective words/s
2022-12-07 10:59:03,023 : INFO : EPOCH 23: training on 58152 raw words (42667 effective words) took 0.1s, 517527 effective words/s
2022-12-07 10:59:03,156 : INFO : EPOCH 24: training on 58152 raw words (42678 effective words) took 0.1s, 328575 effective words/s
2022-12-07 10:59:03,322 : INFO : EPOCH 25: training on 58152 raw words (42743 effective words) took 0.2s, 261440 effective words/s
2022-12-07 10:59:03,486 : INFO : EPOCH 26: training on 58152 raw words (42692 effective words) took 0.2s, 266564 effective words/s
2022-12-07 10:59:03,627 : INFO : EPOCH 27: training on 58152 raw words (42774 effective words) took 0.1s, 310530 effective words/s
2022-12-07 10:59:03,770 : INFO : EPOCH 28: training on 58152 raw words (42706 effective words) took 0.1s, 305665 effective words/s
2022-12-07 10:59:03,901 : INFO : EPOCH 29: training on 58152 raw words (42658 effective words) took 0.1s, 334228 effective words/s
2022-12-07 10:59:04,028 : INFO : EPOCH 30: training on 58152 raw words (42746 effective words) took 0.1s, 344379 effective words/s
2022-12-07 10:59:04,159 : INFO : EPOCH 31: training on 58152 raw words (42676 effective words) took 0.1s, 334291 effective words/s
2022-12-07 10:59:04,295 : INFO : EPOCH 32: training on 58152 raw words (42763 effective words) took 0.1s, 322886 effective words/s
2022-12-07 10:59:04,488 : INFO : EPOCH 33: training on 58152 raw words (42647 effective words) took 0.2s, 224522 effective words/s
2022-12-07 10:59:04,629 : INFO : EPOCH 34: training on 58152 raw words (42720 effective words) took 0.1s, 310616 effective words/s
2022-12-07 10:59:04,764 : INFO : EPOCH 35: training on 58152 raw words (42775 effective words) took 0.1s, 323299 effective words/s
2022-12-07 10:59:04,899 : INFO : EPOCH 36: training on 58152 raw words (42662 effective words) took 0.1s, 322458 effective words/s
2022-12-07 10:59:05,032 : INFO : EPOCH 37: training on 58152 raw words (42656 effective words) took 0.1s, 329126 effective words/s
2022-12-07 10:59:05,162 : INFO : EPOCH 38: training on 58152 raw words (42720 effective words) took 0.1s, 337238 effective words/s
2022-12-07 10:59:05,308 : INFO : EPOCH 39: training on 58152 raw words (42688 effective words) took 0.1s, 299620 effective words/s
2022-12-07 10:59:05,308 : INFO : Doc2Vec lifecycle event {'msg': 'training on 2326080 raw words (1708074 effective words) took 4.0s, 423332 effective words/s', 'datetime': '2022-12-07T10:59:05.308684', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'train'}

Now, we can use the trained model to infer a vector for any piece of text by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)
[-0.10196274 -0.36020595 -0.10973375  0.28432116 -0.00792601  0.01950991
  0.01309869  0.1045896  -0.2011485  -0.12135196  0.15298457  0.05421316
 -0.06486023 -0.00131951 -0.2237759  -0.08489189  0.05889525  0.27961093
  0.08121023 -0.06200862 -0.00651888 -0.06831821  0.13001564  0.04539844
 -0.01659351 -0.02359444 -0.22276032  0.06692155 -0.11293832 -0.08056813
  0.38737044  0.05470002  0.19902836  0.19122775  0.17020799  0.10668964
  0.01216549 -0.3049222  -0.05198798  0.00130251  0.04994885 -0.0069596
 -0.06367141 -0.11740001  0.14623125  0.10109582 -0.06466878 -0.06512908
  0.17817481 -0.00934212]

Note that infer_vector() does not take a string, but rather a list of string tokens, which should have already been tokenized the same way as the words property of the original training document objects.

Also note that because the underlying training/inference algorithm is an iterative approximation process that makes use of internal randomization, repeated inferences of the same text will return slightly different vectors.
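To make the cosine-similarity comparison concrete, here is a minimal sketch comparing two inferred vectors; the second token list is made up for illustration.

import numpy as np

vector_a = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
vector_b = model.infer_vector(['fire', 'crews', 'battle', 'huge', 'bushfire'])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
print(similarity)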

Assessing the Model

To assess our new model, we’ll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity. Basically, we’re pretending that the training corpus is some new, unseen data and then seeing how the documents compare with the trained model. The expectation is that we’ve likely overfit our model (i.e., all of the ranks will be less than 2) and so we should be able to find similar documents very easily. Additionally, we’ll keep track of the second ranks for a comparison of less similar documents.

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

Let’s count how each document ranks with respect to the training corpus

NB. Results vary between runs due to random seeding and very small corpus

import collections

counter = collections.Counter(ranks)
print(counter)
Counter({0: 292, 1: 8})

Basically, more than 95% of the inferred documents are found to be most similar to themselves, while about 5% of the time a document is mistakenly found most similar to another document. Checking the inferred vector against a training vector is a sort of ‘sanity check’ as to whether the model is behaving in a usefully consistent manner, though not a real ‘accuracy’ value.

This is great and not entirely surprising. We can take a look at an example:

print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>:

MOST (299, 0.9564058780670166): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SECOND-MOST (104, 0.7868924140930176): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»

MEDIAN (119, 0.24808582663536072): «australia is continuing to negotiate with the united states government in an effort to interview the australian david hicks who was captured fighting alongside taliban forces in afghanistan mr hicks is being held by the united states on board ship in the afghanistan region where the australian federal police and australian security intelligence organisation asio officials are trying to gain access foreign affairs minister alexander downer has also confirmed that the australian government is investigating reports that another australian has been fighting for taliban forces in afghanistan we often get reports of people going to different parts of the world and asking us to investigate them he said we always investigate sometimes it is impossible to find out we just don know in this case but it is not to say that we think there are lot of australians in afghanistan the only case we know is hicks mr downer says it is unclear when mr hicks will be back on australian soil but he is hopeful the americans will facilitate australian authorities interviewing him»

LEAST (216, -0.11085141450166702): «senior taliban official confirmed the islamic militia would begin handing over its last bastion of kandahar to pashtun tribal leaders on friday this agreement was that taliban should surrender kandahar peacefully to the elders of these areas and we should guarantee the lives and the safety of taliban authorities and all the taliban from tomorrow should start this program former taliban ambassador to pakistan abdul salam zaeef told cnn in telephone interview he insisted that the taliban would not surrender to hamid karzai the new afghan interim leader and pashtun elder who has been cooperating with the united states to calm unrest among the southern tribes the taliban will surrender to elders not to karzai karzai and other persons which they want to enter kandahar by the support of america they don allow to enter kandahar city he said the taliban will surrender the weapons the ammunition to elders»

Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0. However, the similarity scores for the second-ranked documents should be significantly lower (assuming the documents are in fact different), and the reasoning becomes obvious when we examine the text itself.

We can run the next cell repeatedly to see a sampling of other target-document comparisons.

# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))
Train Document (198): «authorities are trying to track down the crew of vessel that landed undetected at cocos islands carrying asylum seekers the group of sri lankan men was found aboard their boat moored to the south of the islands yesterday afternoon shire president ron grant says investigations are underway as to the whereabouts of the crew after the asylum seekers told authorities they had left in another boat after dropping them off unfortunately for them there two aircraft the royal australian air force here at the moment and one getting prepared to fly off and obviously they will be looking to see if there is another boat he said mr grant says the sri lankans have not yet been brought ashore»

Similar Document (89, 0.7137947082519531): «after the torching of more than buildings over the past three days the situation at the woomera detention centre overnight appeared relatively calm there was however tension inside the south australian facility with up to detainees breaking into prohibited zone the group became problem for staff after breaching fence within the centre at one point staff considered using water cannon to control the detainees it is not known if they actually resorted to any tough action but group of men wearing riot gear possibly star force police officers brought in on standby could be seen in one of the compounds late yesterday government authorities confirmed that two detainees had committed acts of self harm one of them needed stitches and is believed to have been taken away in an ambulance no other details have been released»

Testing the Model

Using the same approach above, we’ll infer the vector for a randomly chosen test document, and compare the document to our model by eye.

# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
Test Document (17): «the united nations world food program estimates that up to million people in seven countries malawi mozambique zambia angola swaziland lesotho and zimbabwe face death by starvation unless there is massive international response in malawi as many as people may have already died the signs of malnutrition swollen stomachs stick thin arms light coloured hair are everywhere»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d50,n5,w5,mc2,s0.001,t3>:

MOST (86, 0.8239533305168152): «argentina economy minister domingo cavallo is reported to have resigned in the face of mounting unrest over the country crumbling economy the reports in number of local media outlets could not be officially confirmed the news comes as police used teargas to disperse tens of thousands of people who had massed near the presidential palace in buenos aires and in other parts of the city to protest against the declaration of state of emergency it was declared after mounting popular discontent and widespread looting in the past few days with people over the state of the economy which has been in recession for four years»

MEDIAN (221, 0.40627941489219666): «reserve bank governor ian macfarlane says he is confident australia will ride through the current world economic slump largely brought on by the united states mr macfarlane told gathering in sydney last night australia growth is remarkably good by world standards and inflation should come down in the next months he predicts the united states economy will show signs of recovery from mid year and that as result it is highly unlikely that the reserve bank will raise interest rates in the next six months calendar year has been difficult one for the world economy and the first half of looks like remaining weak before recovery gets underway therefore this period will be classified as world recession like those of the mid the early and the early mr macfarlane said the australian economy has got through the first half of it in reasonably good shape»

LEAST (37, -0.06813289225101471): «australia quicks and opening batsmen have put the side in dominant position going into day three of the boxing day test match against south africa at the mcg australia is no wicket for only runs shy of south africa after andy bichel earlier starred as the tourists fell for when play was abandoned due to rain few overs short of scheduled stumps yesterday justin langer was not out and matthew hayden the openers went on the attack from the start with langer innings including six fours and hayden eight earlier shaun pollock and nantie haywood launched vital rearguard action to help south africa to respectable first innings total the pair put on runs for the final wicket to help the tourists to the south africans had slumped to for through combination of australia good bowling good fielding and good luck after resuming at for yesterday morning the tourists looked to be cruising as jacques kallis and neil mckenzie added without loss but then bichel suddenly had them reeling after snatching two wickets in two balls first he had jacques kallis caught behind for although kallis could consider himself very unlucky as replays showed his bat was long way from the ball on the next ball bichel snatched sharp return catch to dismiss lance klusener first ball and have shot at hat trick bichel missed out on the hat trick and mark boucher and neil mckenzie again steadied the south african innings adding before the introduction of part timer mark waugh to the attack paid off for australia waugh removed boucher for caught by bichel brett lee then chipped in trapping mckenzie leg before for with perfect inswinger bichel continued his good day in the field running out claude henderson for with direct hit from the in field lee roared in to allan donald bouncing him and then catching the edge with rising delivery which ricky ponting happily swallowed at third slip to remove the returning paceman for duck bichel did not get his hat trick but ended with the best figures of the australian bowlers after also picking up the final wicket of nantie haywood for lee took for and glenn mcgrath for»

Conclusion

Let’s review what we’ve seen in this tutorial:

  1. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec

  2. Load and preprocess the training and test corpora (see Corpus)

  3. Train a Doc2Vec model using the training corpus

  4. Demonstrate how the trained model can be used to infer a Vector

  5. Assess the model

  6. Test the model on the test corpus

That’s it! Doc2Vec is a great way to explore relationships between documents.

