Continuing the benchmark of libraries for nearest-neighbour similarity search, part 2. What is the best software out there for similarity search in high dimensional vector spaces?
Violent as the title sounds, I’ll be actually benchmarking software packages that realize the nearest-neighbour search in high dimensional vector spaces. Which approach is the fastest, easiest to use, the best? No neighbours got harmed writing this post.
The final installment on optimizing word2vec in Python: how to make use of multicore machines.
The original C toolkit allows setting a “-threads N” parameter, which effectively splits the training corpus into N parts, each to be processed by a separate thread in parallel. The result is a nice speed-up: 1.9x for N=2 threads, 3.2x for N=4.
Last weekend, I ported Google’s word2vec into Python. The result was a clean, concise and readable code that plays well with other Python NLP packages. One problem remained: the performance was 20x slower than the original C code, even after all the obvious NumPy optimizations.
Neural networks have been a bit of a punching bag historically: neither particularly fast, nor robust or accurate, nor open to introspection by humans curious to gain insights from them. But things have been changing lately, with deep learning becoming a hot topic in academia with spectacular results. I decided to check out one deep learning algorithm via gensim.
EDIT: went live on 8th September, comments and suggestions welcome :-)