Topic Modeling for Fun and Profit

In this notebook:

  • index the documents in their semantic representation (topics)
  • retrieve "most similar documents" efficiently
  • write a small server for serving similarities over HTTP
In [2]:
# import & logging preliminaries
import logging
import itertools

import gensim
from gensim.parsing.preprocessing import STOPWORDS

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

def tokenize(text):
    return [token for token in gensim.utils.simple_preprocess(text) if token not in STOPWORDS]
In [3]:
# load the corpora created in the previous notebook
tfidf_corpus = gensim.corpora.MmCorpus('./data/wiki_tfidf.mm')
lsi_corpus = gensim.corpora.MmCorpus('./data/wiki_lsa.mm')
print(tfidf_corpus)
print(lsi_corpus)
INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_tfidf.mm.index
INFO:gensim.matutils:initializing corpus reader from ./data/wiki_tfidf.mm
INFO:gensim.matutils:accepted corpus with 48356 documents, 26645 features, 4743136 non-zero entries
INFO:gensim.corpora.indexedcorpus:loaded corpus index from ./data/wiki_lsa.mm.index
INFO:gensim.matutils:initializing corpus reader from ./data/wiki_lsa.mm
INFO:gensim.matutils:accepted corpus with 48356 documents, 200 features, 9671200 non-zero entries

MmCorpus(48356 documents, 26645 features, 4743136 non-zero entries)
MmCorpus(48356 documents, 200 features, 9671200 non-zero entries)

In [4]:
# load the models too
id2word_wiki = gensim.corpora.Dictionary.load('./data/wiki.dictionary')
lda_model = gensim.models.LdaModel.load('./data/lda_wiki.model')
tfidf_model = gensim.models.TfidfModel.load('./data/tfidf_wiki.model')
lsi_model = gensim.models.LsiModel.load('./data/lsi_wiki.model')
INFO:gensim.utils:loading Dictionary object from ./data/wiki.dictionary
INFO:gensim.utils:loading LdaModel object from ./data/lda_wiki.model
INFO:gensim.utils:setting ignored attribute state to None
INFO:gensim.utils:setting ignored attribute dispatcher to None
INFO:gensim.utils:loading LdaModel object from ./data/lda_wiki.model.state
INFO:gensim.utils:loading TfidfModel object from ./data/tfidf_wiki.model
INFO:gensim.utils:loading LsiModel object from ./data/lsi_wiki.model
INFO:gensim.utils:setting ignored attribute projection to None
INFO:gensim.utils:setting ignored attribute dispatcher to None
INFO:gensim.utils:loading LsiModel object from ./data/lsi_wiki.model.projection

Document indexing

Gensim contains three classes for indexing:

  • gensim.similarities.MatrixSimilarity: for an efficient, memory-mapped index -- dense NumPy implementation
  • gensim.similarities.SparseMatrixSimilarity: for an efficient, memory-mapped index -- sparse SciPy implementation
  • gensim.similarities.Similarity: for an efficient out-of-core sharded index (auto-selects MatrixSimilarity or SparseMatrixSimilarity for each shard internally, based on the shard density); this is the most flexible class and should be your first choice.

Let's see each in action:

In [5]:
from gensim.similarities import MatrixSimilarity, SparseMatrixSimilarity, Similarity

# initialize the index
%time index_dense = MatrixSimilarity(lsi_corpus, num_features=lsi_corpus.num_terms)
INFO:gensim.similarities.docsim:creating matrix for 48356 documents and 200 features

CPU times: user 1min 11s, sys: 119 ms, total: 1min 11s
Wall time: 1min 11s

In [6]:
%time index_sparse = SparseMatrixSimilarity(tfidf_corpus, num_features=tfidf_corpus.num_terms)
INFO:gensim.similarities.docsim:creating sparse index
INFO:gensim.matutils:creating sparse matrix from corpus
INFO:gensim.matutils:PROGRESS: at document #0/48356
INFO:gensim.matutils:PROGRESS: at document #10000/48356
INFO:gensim.matutils:PROGRESS: at document #20000/48356
INFO:gensim.matutils:PROGRESS: at document #30000/48356
INFO:gensim.matutils:PROGRESS: at document #40000/48356
INFO:gensim.similarities.docsim:created <48356x26645 sparse matrix of type '<type 'numpy.float32'>'
	with 4743136 stored elements in Compressed Sparse Row format>

CPU times: user 36.7 s, sys: 102 ms, total: 36.8 s
Wall time: 36.8 s

In [7]:
%time index = Similarity('./data/wiki_index', lsi_corpus, num_features=lsi_corpus.num_terms)
INFO:gensim.similarities.docsim:starting similarity index under ./data/wiki_index
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=10000
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=20000
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=30000
INFO:gensim.similarities.docsim:creating matrix for 32768 documents and 200 features
INFO:gensim.similarities.docsim:creating dense shard #0
INFO:gensim.similarities.docsim:saving index shard to ./data/wiki_index.0
INFO:gensim.utils:saving MatrixSimilarity object under ./data/wiki_index.0, separately None
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.0
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=0
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=10000

CPU times: user 1min 12s, sys: 267 ms, total: 1min 12s
Wall time: 1min 12s

Adding new documents to the index

The Similarity class supports adding new documents to the index dynamically:

In [8]:
print(index)
# add the same documents into the index again, effectively doubling its size
index.add_documents(lsi_corpus)
print(index)
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=20000
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=30000
INFO:gensim.similarities.docsim:creating matrix for 32768 documents and 200 features
INFO:gensim.similarities.docsim:creating dense shard #1
INFO:gensim.similarities.docsim:saving index shard to ./data/wiki_index.1
INFO:gensim.utils:saving MatrixSimilarity object under ./data/wiki_index.1, separately None
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.1
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=0
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=10000
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=20000
INFO:gensim.similarities.docsim:PROGRESS: fresh_shard size=30000

Similarity index with 48356 documents in 1 shards (stored under ./data/wiki_index)
Similarity index with 96712 documents in 2 shards (stored under ./data/wiki_index)

Exercise (5min): convert the document text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood." into LSI space and index it into index.

The documents have to come from the same semantic space (=the same model), of course. You can't mix bag-of-words with LDA or LSI documents in the same index.

The other two classes, MatrixSimilarity and SparseMatrixSimilarity, don't support dynamic additions: the documents you supply during their construction are all they'll ever contain.

All indexes can be persisted to disk using the now-familiar save/load syntax:

In [13]:
# store to disk
index.save('./data/wiki_index.index')

# load back = same index
index = Similarity.load('./data/wiki_index.index')
print(index)
INFO:gensim.utils:saving Similarity object under ./data/wiki_index.index, separately None
INFO:gensim.utils:loading Similarity object from ./data/wiki_index.index

Similarity index with 96713 documents in 3 shards (stored under ./data/wiki_index)

Semantic queries

An index can be used to find "most similar documents":

In [14]:
query = "April is the fourth month of the year, and comes between March \
and May. It has 30 days. April begins on the same day of week as July in \
all years and also January in leap years."

# vectorize the text into bag-of-words and tfidf
query_bow = id2word_wiki.doc2bow(tokenize(query))
query_tfidf = tfidf_model[query_bow]
query_lsi = lsi_model[query_tfidf]
In [15]:
# query the TFIDF index
index_sparse.num_best = None
print(index_sparse[query_tfidf])
[ 0.45899147  0.08899394  0.         ...,  0.00997756  0.          0.03254421]

This output is an array with one float per indexed document. Each float measures how similar the query is to the document at that position: the higher the score, the more similar they are. The particular similarity measure used is cosine similarity.
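For illustration, here is cosine similarity computed directly with NumPy — a minimal sketch of the measure, not gensim's optimized implementation:

```python
import numpy as np

def cosine(a, b):
    # dot product of the two vectors, normalized by their Euclidean lengths
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [1, 0]))  # identical direction: 1.0
print(cosine([1, 0], [0, 1]))  # orthogonal vectors: 0.0
```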

We can also request only the top-N most similar documents:

In [16]:
index_sparse.num_best = 3
print(index_sparse[query_tfidf])
[(194, 0.54339718818664551), (0, 0.45899146795272827), (828, 0.38626673817634583)]
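Conceptually, num_best just sorts the full similarity array and keeps the top entries, returned as (document position, cosine similarity) pairs. A sketch with made-up scores:

```python
# toy similarity scores, one per indexed document (made-up values)
sims = [0.459, 0.089, 0.0, 0.010, 0.543]
num_best = 3

# sort document positions by descending similarity, keep the top num_best
top = sorted(enumerate(sims), key=lambda item: -item[1])[:num_best]
print(top)  # [(4, 0.543), (0, 0.459), (1, 0.089)]
```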

Let's find the five most similar articles according to the MatrixSimilarity LSI index:

In [17]:
# query the LSI index
index_dense.num_best = 5
print(index_dense[query_lsi])
[(4028, 0.82495784759521484), (13582, 0.8166358470916748), (0, 0.80658835172653198), (85, 0.8048851490020752), (115, 0.79446637630462646)]

The Similarity index works exactly the same way:

In [18]:
index.num_best = 10
print(index[query_lsi])
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.0
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.1
INFO:gensim.utils:loading MatrixSimilarity object from ./data/wiki_index.2

[(4028, 0.82495784759521484), (52384, 0.82495784759521484), (13582, 0.8166358470916748), (61938, 0.8166358470916748), (0, 0.80658835172653198), (48356, 0.80658835172653198), (85, 0.8048851490020752), (48441, 0.8048851490020752), (115, 0.79446637630462646), (48471, 0.79446637630462646)]

(You should see an identical result, except that each top-N hit appears twice here, because we indexed the same LSI corpus into index twice a few cells above.)

The size of the index structures scales linearly with the number of non-zero features in the corpus. For example, a dense LSI corpus of 1 million documents and 200 topics will consume ~800MB. Querying is fairly fast if you have a fast BLAS library installed:

In [19]:
print(index)
%timeit index[query_lsi]
Similarity index with 96713 documents in 3 shards (stored under ./data/wiki_index)
10 loops, best of 3: 23.8 ms per loop
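The ~800MB estimate above is just the float32 storage for a dense matrix of 1 million documents by 200 topics:

```python
num_docs, num_topics, float32_bytes = 1000000, 200, 4

# dense index storage: one float32 per (document, topic) entry
size_mb = num_docs * num_topics * float32_bytes / 10**6
print(size_mb)  # 800.0
```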

LSI demo on the full Wikipedia

I've recently benchmarked available Python libraries for similarity search in high-dimensional spaces. For the benchmark I used the full English Wikipedia (~3.5 million documents), using code very similar to what we've done in these tutorials.

There's also a frontend web app for this index, which lets you find similar Wikipedia articles from your browser: http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/#wikisim

You can see the exact pipeline, including the web server part, on GitHub.

Summary

In this notebook, we learned how to:

  • index arbitrary corpora
  • ask for top-N most similar documents, using the index
  • add new documents to a Similarity index
  • load and save indexes

Next

That's the end of this tutorial! Congratulations :)

We skipped some more advanced topics, such as parallelized and distributed computation, as well as several other models, but you should now have a clear idea of how streamed input works in gensim.

If you have questions, use the gensim mailing list.

Gensim lives on GitHub and has a homepage with an API reference, tutorials, testimonials and more. Happy coding!