In this notebook: building similarity indexes over the transformed Wikipedia corpora, adding new documents to an index, persisting indexes to disk, and querying for the most similar documents.
# import & logging preliminaries
import logging
import itertools
import gensim
from gensim.parsing.preprocessing import STOPWORDS
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO
def tokenize(text):
    return [token for token in gensim.utils.simple_preprocess(text) if token not in STOPWORDS]
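As a quick sanity check of the tokenizer (the exact tokens depend on gensim's simple_preprocess rules and its STOPWORDS list):
print(tokenize("A blood cell, also called a hematocyte"))  # lowercased tokens, stop words removed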
# load the corpora created in the previous notebook
tfidf_corpus = gensim.corpora.MmCorpus('./data/wiki_tfidf.mm')
lsi_corpus = gensim.corpora.MmCorpus('./data/wiki_lsa.mm')
print(tfidf_corpus)
print(lsi_corpus)
# load the models too
id2word_wiki = gensim.corpora.Dictionary.load('./data/wiki.dictionary')
lda_model = gensim.models.LdaModel.load('./data/lda_wiki.model')
tfidf_model = gensim.models.TfidfModel.load('./data/tfidf_wiki.model')
lsi_model = gensim.models.LsiModel.load('./data/lsi_wiki.model')
Gensim contains three classes for indexing:

gensim.similarities.MatrixSimilarity: for an efficient, memory-mapped index -- dense NumPy implementation
gensim.similarities.SparseMatrixSimilarity: for an efficient, memory-mapped index -- sparse SciPy implementation
gensim.similarities.Similarity: for an efficient out-of-core sharded index (auto-selects MatrixSimilarity or SparseMatrixSimilarity for each shard internally, based on the shard density); this is the most flexible class and should be your first choice.

Let's see each in action:
from gensim.similarities import MatrixSimilarity, SparseMatrixSimilarity, Similarity
# initialize the index
%time index_dense = MatrixSimilarity(lsi_corpus, num_features=lsi_corpus.num_terms)
%time index_sparse = SparseMatrixSimilarity(tfidf_corpus, num_features=tfidf_corpus.num_terms)
%time index = Similarity('./data/wiki_index', lsi_corpus, num_features=lsi_corpus.num_terms)
The Similarity class supports adding new documents to the index dynamically:
print(index)
# add the same documents into the index again, effectively doubling its size
index.add_documents(lsi_corpus)
print(index)
Exercise (5min): convert the document text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood." into LSI space and index it into index.
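One possible solution, as a sketch (it reuses the tokenize helper and the models loaded above):
# run the exercise document through the same bow => tfidf => lsi pipeline
text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."
text_bow = id2word_wiki.doc2bow(tokenize(text))
text_lsi = lsi_model[tfidf_model[text_bow]]
# add_documents expects an iterable of documents, hence the single-element list
index.add_documents([text_lsi])
print(index)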
The documents have to come from the same semantic space (=the same model), of course. You can't mix bag-of-words with LDA or LSI documents in the same index.
The other two classes, MatrixSimilarity and SparseMatrixSimilarity, don't support dynamic additions. The documents you supply during their construction are all they'll ever contain.
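If you do need to "grow" one of these two, the only way is to rebuild the index over the enlarged corpus. A minimal sketch (the corpora are materialized as plain lists here for simplicity; for anything large, prefer the sharded Similarity class anyway):
# rebuild a dense index over the original corpus concatenated with itself
doubled_corpus = list(lsi_corpus) + list(lsi_corpus)
rebuilt_dense = MatrixSimilarity(doubled_corpus, num_features=lsi_corpus.num_terms)
print(rebuilt_dense)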
All indexes can be persisted to disk using the now-familiar save/load syntax:
# store to disk
index.save('./data/wiki_index.index')
# load back = same index
index = Similarity.load('./data/wiki_index.index')
print(index)
An index can be used to find "most similar documents":
query = "April is the fourth month of the year, and comes between March \
and May. It has 30 days. April begins on the same day of week as July in \
all years and also January in leap years."
# vectorize the text into bag-of-words, then tf-idf, then LSI
query_bow = id2word_wiki.doc2bow(tokenize(query))
query_tfidf = tfidf_model[query_bow]
query_lsi = lsi_model[query_tfidf]
# query the TFIDF index
index_sparse.num_best = None
print(index_sparse[query_tfidf])
The output is an array with one float per indexed document, in the same order as the documents in the index. Each float tells how similar the query is to that document; the higher the score, the more similar. The particular similarity measure used is cosine similarity.
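For example, to pull out the single best match from the raw score array (a sketch using NumPy, which gensim depends on anyway):
import numpy as np
sims = index_sparse[query_tfidf]
best_position = int(np.argmax(sims))
# position of the most similar indexed document, and its cosine similarity to the query
print(best_position, sims[best_position])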
We can also request only the top-N most similar documents; the result is then a list of (document position, similarity score) 2-tuples:
index_sparse.num_best = 3
print(index_sparse[query_tfidf])
Let's find the five most similar articles according to the MatrixSimilarity LSI index:
# query the LSI index
index_dense.num_best = 5
print(index_dense[query_lsi])
The Similarity index works exactly the same:
index.num_best = 10
print(index[query_lsi])
(you should see an identical result, except each top-N result is duplicated here, because we indexed the same LSI corpus twice into index a few lines above)
The size of the index structures scales linearly with the number of non-zero features in the corpus. For example, a dense LSI corpus of 1 million documents and 200 topics will consume ~800MB. Querying is fairly fast if you have a fast BLAS library installed:
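To sanity-check that figure: 1,000,000 documents × 200 topics × 4 bytes per float32 entry = 800,000,000 bytes, roughly 800 MB.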
print(index)
%timeit index[query_lsi]
I've recently benchmarked available Python libraries for similarity search in high-dimensional spaces. For the benchmark I used the full English Wikipedia (~3.5 million documents), using code very similar to what we've done in these tutorials.
There's also a frontend web app for this index, which lets you find similar Wikipedia articles from your browser: http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/#wikisim
You can see the exact pipeline, including the web server part, on GitHub.
In this notebook, we learned how to:

- build a similarity index over a transformed corpus with MatrixSimilarity, SparseMatrixSimilarity and Similarity
- dynamically add new documents to a Similarity index
- persist an index to disk and load it back
- query an index for the top-N documents most similar to a query

That's the end of this tutorial! Congratulations :)
We skipped some more advanced topics like parallelization / distributed computations, and some advanced models, but you should have a clear idea of how streamed input works in gensim.
If you have questions, use the gensim mailing list.
Gensim resides on GitHub and has a homepage with API reference, tutorials, testimonials, etc. Happy coding!