Continuing the benchmark of libraries for nearest-neighbour similarity search, part 2. What is the best software out there for similarity search in high dimensional vector spaces?
Document Similarity @ English Wikipedia
I’m not very fond of benchmarks on artificial datasets, and similarity search in particular is sensitive to actual data densities and distance profiles. Using fake “random gaussian datasets” seemed unfair to the more advanced indexing algorithms (brute force linear scans don’t care, of course). So I decided to fix the domain as similar document retrieval, with the English Wikipedia as our text corpus and benchmark dataset.
The data was prepared as follows: first download the latest Wikipedia dump (9.7GB) & extract plain text from each article:
def process_article((title, text)): """ Parse a wikipedia article, returning its content as `(title, list of tokens)`, all utf8. """ text = gensim.corpora.wikicorpus.filter_wiki(text) # remove markup, get plain text # tokenize plain text, throwing away sentence structure, short words etc return title.encode('utf8'), gensim.utils.simple_preprocess(text) def convert_wiki(infile, processes=multiprocessing.cpu_count()): """ Yield articles from a bz2 Wikipedia dump `infile` as (title, tokens) 2-tuples. Only articles of sufficient length are returned (short articles & redirects etc are ignored). Uses multiple processes to speed up the parsing in parallel. """ pool = multiprocessing.Pool(processes) texts = gensim.corpora.wikicorpus._extract_pages(bz2.BZ2File(infile)) # generator # process the corpus in smaller chunks of docs, because multiprocessing.Pool # is dumb and would try to load the entire dump into RAM... for group in gensim.utils.chunkize(texts, chunksize=10 * processes): for title, tokens in pool.imap(process_article, group): if len(tokens) >= 50 and not title.startswith('Wikipedia:'): yield title.replace('\t', ' '), tokens pool.terminate() for title, tokens in covert_wiki('enwiki-latest-pages-articles.xml.bz2'): print "%s\t%s" % (title, ' '.join(tokens))
This outputs each article on one line, with its title followed by tab and a space-separated list of article tokens (wiki markup, stubs and redirects removed):
|#1||Anarchism||anarchism is political philosophy that advocates stateless societies based on non hierarchical free associations …|
|#2||Autism||autism is disorder of neural development characterized by impaired social interaction and verbal …|
|#3||A||named plural aes is the first letter and vowel in the iso basic latin alphabet it is similar to the ancient greek …|
|#4||Alabama||alabama is state located in the southeastern region of the united states it is bordered by tennessee …|
|#5||Abraham Lincoln||abraham lincoln february april was the th president of the united states serving from march until his assassination …|
|#3,651,419||Doug McCarthy (ice hockey)||doug mccarthy born november is canadian former professional ice hockey player and former …|
|#3,651,420||Csák I Hahót||csák from the kindred hahót died after was hungarian noble who held several secular positions during the reign …|
|#3,651,421||Fair and Warmer (1919 film)||fair and warmer is american silent film directed by henry otto starring may allison and eugene pallette …|
|#3,651,422||Russell Jones (orientalist)||dr russell jones is content to be called an orientalist of the old mould he has devoted his professional life to …|
Using gensim’s memory-friendly streaming API I then converted these plain text tokens to TF-IDF vectors, ran Singular Value Decomposition (SVD) on this TF-IDF matrix to build a latent semantic analysis (LSA) model and finally stored each Wikipedia document as a 500-dimensional LSA vector to disk. These are the vectors that will be used for indexing and querying in the benchmarks:
class ShootoutCorpus(gensim.corpora.TextCorpus): """Turn the plain text from above into a streamed corpus of sparse vectors.""" def get_texts(self): for line in open(self.input): yield line.split('\t').split() # return tokens (ignore the title) self.length = lineno + 1 # lazy-load (stream) the corpus of 3.7M documents generated by covert_wiki above corpus = ShootoutCorpus('title_tokens.txt') # remove too rare/too common words, then keep 50k most common words corpus.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=50000) # build tf-idf and lsi models on the wiki corpus tfidf = gensim.models.TfidfModel(corpus) lsi = gensim.models.LsiModel(tfidf[corpus], id2word=corpus.dictionary, num_topics=500) # convert wiki to LSA space & store vectors to disk in MatrixMarket format gensim.corpora.MmCorpus.serialize('wiki_vectors.mm', lsi[tfidf[corpus]])
I also published this preprocessing code on GitHub, including logging and gzipping the output, to save disk space. The output is a 3,651,422 x 500 dense matrix serialized to disk in MatrixMarket format (48GB; gzipped 12GB). There was no special NLP preprocessing, such as named entity extraction or model tuning — we only care about the final vectors here, from a benchmarking perspective.
When picking the competitor libraries for similarity search, I placed two constraints: 1. the software has to be callable from Python (no time to write custom bindings) and 2. up and running in half an hour or less (no time/patience for unruly or unmaintained software). Note that I’ve been creating&hacking&porting sundry software in sundry languages for sundry platforms for years and years, so half an hour of mine is not as cut-throat as it may sound.
Examples of libraries that didn’t make it include VLFeat (I managed to whip the Python wrappers into obedience, just shy of 30mins, only to realize they don’t expose the similarity routines); Inria’s Yael (atrocious documentation, couldn’t figure out how it works), scikits-ann (wraps ANN; discontinued?) and Jubatus (very promising, but missed the 30min up&running limit; mental note to check back later).
|library||NN algo||install ease||API & docs||indexing speed||index size||shared mem & loading||incremental indexing|
FLANN is a “library for performing fast approximate nearest neighbor (ANN) searches in high dimensional spaces”. Used version 1.8.4 (=the latest), smooth installation on both OS X and Debian Linux. Pros: Possibly the most widely used ANN library (e.g. internally by OpenCV). Indexing takes a while, but that’s due to an automatic selection of an optimal algorithm for the input data. This turns developer-tunes-indexing time into computer-tunes-indexing time, a happy tradeoff and a great feature. User-tunable tradeoff between retrieval accuracy, build speed and query speed. FLANN contains fast algos for brute force (linear) scan, hierarchical k-means or a forest of kd-trees (+their parameters); auto-tuning on wiki chooses the hierarchical k-means. Supports batch queries (=submitting several queries at once, for faster processing). Cons: Requires the original dataset to be stored&loaded separately. Lack of index memory sharing between processes (query parallelization).
Yahoo/LSH: ANN via “efficient implementation of locality-sensitive hashing (LSH)”. Constructs several random projections to hash input vectors into bins; vectors falling into the same (or nearby) bins are considered similar. Uses NumPy for vector manipulations internally. Pros: unlike all the other libs, only stores a few fingerprints (=hashes=dict entries) per input vector => low memory footprint. A single .py file, there’s no installation nor versioning. Lots of potential for runtime optimization. Cons: Academic implementation: unsightly coding style, weird library design (prepares data for matlab code as well…), one-letter variable names, no sane default values. Biggest problem is that it returns an arbitrary number of nearest neighbours — whatever falls into the same bins as the query. Which can be anything between zero and the entire dataset. Taming the “k” in k-NN requires insight and some non-obvious parameter tuning.
Scikit-learn: “machine learning in Python”. Used the latest 0.14.1 version, smooth installation. Contains optimized Ball Tree + KD-Tree + brute-force algos for exact NN search. Pros: “Everything-for-everybody” approach, covering various domains and algorithms in one coherent, well documented package. Auto-tune API param for choosing the best NN algo, similar to FLANN (chooses kd-tree on wiki subset). Cons: Blows up with memory errors much sooner than other libs (can only process small datasets). Stored index consumes 3x as much memory as the next most memory hungry lib, FLANN.
Annoy: “approximate nearest neighbors something something” :-). C++ wrapped with Boost, version 1.0, one minor installation hiccup. Annoy builds a forest of trees for ANN search. Each internal tree node separates the dataset recursively, down to a fixed leaf size (like in a kd-tree), except the hyperplanes are chosen randomly. Pros: Optimized for memory usage. Stores index in a format that is memory-mapped back on load, for fast index sharing between processes. A single parameter controlling precision vs. speed. Clean, no bullshit, clearly built for practical use. Cons: not much… the dependency on Boost, maybe.
Gensim: “topic modelling for humans”, version 0.8.8, by your truly. Pros: Indexes input into shards, which are memory-mapped for multiprocessing (like Annoy). Can index datasets larger than RAM. Each index shard can be independently processed/distributed. Optimized batch queries (like FLANN). Can update shards with new documents incrementally. Cons: Only supports exact NN via a linear brute force search with cosine similarity.
Neither yahoo/lsh nor scikit-learn support memory sharing, as far as I could tell. But both are built on top of NumPy, so I expect hacking them to use mmap would be straightforward. Annoy has the same potential to use very little memory like yahoo/lsh: the original input vectors are only ever used for sorting the (few) neighbour candidates, which may be unnecessary in some applications. Annoy code is fairly clean and to the point, so hacking it to not sort candidate neighbours (or sort them by how many trees they appear in, as an approximation) should be straightforward too. For packages that don’t support cosine similarity directly, I normalize the input vectors to unit euclidean length & use euclidean distance, which achieves the same thing. For motivation & academic links & theoretic background into sim search in vector spaces, see Part 1.
I know it’s a bit of a cliff-hanger, but I’ll postpone graphs of accuracy & query performance to the next episode. This one got longer than I anticipated (again!), but at least we got the preliminaries out of the way. Stay tuned.
EDIT: Part III with hard numbers here! Who won?