Word2vec in Python, Part Two: Optimizing

Last weekend, I ported Google’s word2vec into Python. The result was a clean, concise and readable code that plays well with other Python NLP packages. One problem remained: the performance was 20x slower than the original C code, even after all the obvious NumPy optimizations.

Selecting the hotspots

There are two major optimization directions: re-obfuscate (parts of) the Python code by converting it back into C, and parallelizing the computation (the original C tool uses threads).

Selecting which part to optimize was an easy task — even without profiling, it’s clear the bulk of the work is done in the nested loop that goes through each sentence, and for each sentence position (word) tries to predict all the other words within its window. It’s a tiny part of the overall code, but accounts for most of the time spent — mere three lines (namely 480, 487 and 489 in r26) eat up 90% of the entire runtime!

The rest of the code is mostly there to prepare the input words and sentences. In the original C word2vec, the words are assumed to reside in a single file on disk, one sentence per line, with words delimited by whitespace. In Python, the sentences can come from anywhere, and be pre-processed in any way you like. The methods accept an iterable of sentences, so we can plug in the Brown Corpus from NLTK like this:

>>> class BrownCorpus(object):
...     """Iterate over sentences from the Brown corpus (part of NLTK data)."""
...     def __init__(self, dirname):
...         self.dirname = dirname
... 
...     def __iter__(self):
...         for fname in os.listdir(self.dirname):
...             fname = os.path.join(self.dirname, fname)
...             if not os.path.isfile(fname):
...                 continue
...             for line in open(fname):
...                 # each file line is a single sentence in the Brown corpus
...                 # each token is WORD/POS_TAG
...                 token_tags = [t.split('/') for t in line.split() if len(t.split('/')) == 2]
...                 # ignore words with non-alphabetic tags like ",", "!" etc (punctuation, weird stuff)
...                 words = ["%s/%s" % (token.lower(), tag[:2]) for token, tag in token_tags if tag[:2].isalpha()]
...                 if not words:  # don't bother sending out empty sentences
...                     continue
...                 yield words

>>> for sentence in BrownCorpus('/Users/kofola/nltk_data/corpora/brown'): 
...     print sentence
['the/at', 'fulton/np', 'county/nn', 'grand/jj', 'jury/nn', 'said/vb', 'friday/nr', 'an/at', 'investigation/nn', 'of/in', "atlanta's/np", 'recent/jj', 'primary/nn', 'election/nn', 'produced/vb', 'no/at', 'evidence/nn', 'that/cs', 'any/dt', 'irregularities/nn', 'took/vb', 'place/nn']
['the/at', 'jury/nn', 'further/rb', 'said/vb', 'in/in', 'term-end/nn', 'presentments/nn', 'that/cs', 'the/at', 'city/nn', 'executive/jj', 'committee/nn', 'which/wd', 'had/hv', 'over-all/jj', 'charge/nn', 'of/in', 'the/at', 'election/nn', 'deserves/vb', 'the/at', 'praise/nn', 'and/cc', 'thanks/nn', 'of/in', 'the/at', 'city/nn', 'of/in', 'atlanta/np', 'for/in', 'the/at', 'manner/nn', 'in/in', 'which/wd', 'the/at', 'election/nn', 'was/be', 'conducted/vb']
['the/at', 'september-october/np', 'term/nn', 'jury/nn', 'had/hv', 'been/be', 'charged/vb', 'by/in', 'fulton/np', 'superior/jj', 'court/nn', 'judge/nn', 'durwood/np', 'pye/np', 'to/to', 'investigate/vb', 'reports/nn', 'of/in', 'possible/jj', 'irregularities/nn', 'in/in', 'the/at', 'hard-fought/jj', 'primary/nn', 'which/wd', 'was/be', 'won/vb', 'by/in', 'mayor-nominate/nn', 'ivan/np', 'allen/np', 'jr./np']
...

The sentences are constructed on the fly, from various files in the Brown Corpus directory, without any extra preprocessing steps or loading everything into RAM. Plugging in different training data is pretty straightforward: adjust the iterator to fit the data format. Gensim’s word2vec module already contains this BrownCorpus iterator, plus a few others, as blueprint examples.

Is it faster?

For testing the training speed, I’ll be using the Brown corpus above. With ~1 million words in 57k sentences, it’s too small for any quality training, but already big enough so we can evaluate improvements/regressions in performance. The speed is tested with a hidden layer size of 200 and with ignoring vocabulary which appears less than 5 times (size=200, min_count=5 constructor parameters). The final vocabulary contains 15,079 words, so the projection weights form a 15,079×200 matrix.

All results below are a best-of-ten on my MacbookPro 2.3GHz laptop, meaning each test was run 10 times and only the best result is reported, to mitigate random noise/OS multitasking influences.

Cython

The baseline performance of the NumPy code under these conditions is 1.4k words per second. Rewriting the training loop in Cython improves this to 33.3k words/sec, which is a 24x speedup.

EDIT: I originally used memoryviews to pass arrays around in Cython, and converting to a memoryview is a lot slower than plain casting arrays to pointers. I’ve since updated the code, re-ran all tests and updated the timings in this post.

BLAS

While rewriting the loops in Cython, I noticed the logic contained therein could be expressed in terms of BLAS. BLAS (Basic Linear Algebra Subprograms) are routines for stuff like vector_y += scalar_alpha * vector_x, with funky names like saxpy or dgemm. The reason why people use BLAS is that these basic operations can be heavily optimized, so that calling saxpy is way faster than what a generic compiler would produce from an equivalent, naive C loop.

NumPy itself links against BLAS internally, but unfortunately offers no way to plug directly into the C routines (or I couldn’t find a way). Calling BLAS via NumPy would be crazy slow, because it would have to go through Python calls, acquiring GIL and all.

Fortunately, SciPy contains a little known gem, hidden inside scipy.linalg.blas, which allows us to call the C BLAS routines directly. After some poking in SciPy’s bowels, scratching my head and googling around, I came up with:

from cpython cimport PyCObject_AsVoidPtr
from scipy.linalg.blas import cblas

ctypedef void (*saxpy_ptr) (const int *N, const float *alpha, const float *X, const int *incX, float *Y, const int *incY) nogil

cdef saxpy_ptr saxpy=<saxpy_ptr>PyCObject_AsVoidPtr(cblas.saxpy._cpointer)

Which gives us saxpy, aka vectorized y = alpha*x in single precision. After plugging in saxpy, sdot & co. in place of lines 480, 487 and 489, I got a ~3x speedup in the crunch code, which translated into 89.8k words/s in the overall training speed. This is a 64x improvement over NumPy, and a 2.7x improvement over plain Cython.

Important note on BLAS: This improvement hinges on the quality of the BLAS library installed. On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s AMCL, Sun’s SunPerf or the automatically tuned ATLAS are all good choices. Installing a generic, pre-packaged BLAS which is not optimized for your machine (no SSE, unknown CPU cache sizes…) is not a good choice.

Note from Radim: Get my latest machine learning tips & articles delivered straight to your inbox (it's free).

 Unsubscribe anytime, no spamming. Max 2 posts per month, if lucky.

Precomputed sigmoid table

The original C code contains an extra optimization, where instead of computing y = 1 / (1 + e-x), it looks up the value of y directly, in a precomputed table in RAM. This is a classic optimization from the times when float operations were expensive: a trade-off between using more memory & less precision vs. freeing up CPU. With dedicated FPUs and complex cache architectures in modern computers, it’s not such a clear hit anymore.

I actually expected no performance gain at all, but profiling proved me wrong. Using the precomputed EXP_TABLE with Cython, I got 34.7k words/s, a 4% improvement. Using both the BLAS optimization above and the precomputed sigmoid table, it’s 101.8k words/s. This is the best result so far, almost 73x faster than the NumPy code I started with. It’s also 3.5 times faster than the original C code, which runs at 29k words/s on this corpus (with a single thread).

Summary

All the timings above in one table:

optimizationwords per secondspeed-up
NumPy baseline1.4k1.0x
original C word2vec29.0k20.7x
Cython33.3k23.8x
Cython + BLAS89.8k64.1x
Cython + sigmoid table34.7k24.8x
Cython + BLAS + sigmoid table101.8k72.7x


Where next?

With the hotspot optimized, gensim’s word2vec is now both fast and easy to use. The cost is an extra dependency on Cython. I realize this may be an issue for Windows users, so I added fallback code where if the fast Cython fails to compile (because there’s no compiler or no Cython…), it will use the slower, NumPy code. This way, noob users can still use word2vec without going into compilation setup and technicalities.

The next step is making the code run in parallel, to make use of multicore machines. This will require some thought, because very fine parallelization may not work well because of thread collisions (there is no locking in the C word2vec, so it’s best if the threads work on very different words and sentences and avoid stepping on each other’s toes). On the other hand, having coarse-grained threading blocks like in the original C word2vec will not work in Python, because GIL can only be released in the low-level methods. I’ll have to give this some thought :-)

Part III: Parallelizing word2vec in Python

39 Comments

  1. Michal Illich September 27, 2013 4:32 pm  Reply

    Congratulations – seems like a miracle that your Python single threaded code is actually faster than paralel C code from Google :)

    • Radim September 27, 2013 4:42 pm  Reply

      Hah, only than the single-threaded C code. Parallelization is coming next weekend — this weekend I’m going home to Břeclav for Hody :)

  2. Stefan September 28, 2013 6:21 pm  Reply

    Hi Radim,

    thanks for publishing this! Interesting read and full of useful information! Id like to check your code, did you release it already somewhere?

    greets
    Stefan

    • Radim September 28, 2013 6:28 pm  Reply

      Hello Stefan, sure — search for “word2vec module” on this page.

  3. Rami September 29, 2013 8:09 pm  Reply

    Hi,

    Nice work :).

    I do not think you need to implement any locking mechanism. Locking does not buy you anything when it comes to stochastic optimization. Also if you use adaptaive learning rates (adagrad), most of your concents will be mitigated.

    Check hogwild for the benfits of lock free implementation. (http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf)

  4. Gordon Mohr September 30, 2013 10:42 pm  Reply

    Google’s pre-trained vectors (eg “freebase-vectors-skipgram1000-en.bin.gz”) gzip-compress so well (2.4GB vs. 5.4 GB)… I wonder if perhaps a compressed in-memory format might be another opportunity for optimization, lowering RAM requirements and maybe even speeding lookups/calculations (via better cache-efficiency and/or core utilization). Just a vague idea at this point, though.

  5. jeffery September 30, 2013 11:22 pm  Reply

    Just test your code on my machine.
    Numpy baseline: 1.5k
    original C code: 159k (1 thread)
    Cython + BLAS + sigmoid table: 67.8k
    Did I get anything wrong?

    • Radim October 1, 2013 8:20 am  Reply

      Hard to say. Can you describe your setup/data on the gensim mailing list? Let’s look into this, I’m curious.

  6. jeffery September 30, 2013 11:27 pm  Reply

    When you specify -threads 8, the per-thread speed is Words/thread/sec: 76.96k. However the total running time becomes real 0m31.470s from real 1m48.107s.

  7. Patrick Snape October 1, 2013 5:41 pm  Reply

    Hi,

    When you say that ditching memoryviews ended up with a performance gain – how much are we talking? The pyarray casting is really ugly and makes the code a lot less readable. However, if it results in a massive speed increase then I could learn to live with it.

    Cheers

    • Radim October 1, 2013 6:03 pm  Reply

      17k words/sec with memoryviews, where it now says 33.3k.

      But note that this was in a tight loop (hundreds of thousands of calls per second), where even a little overhead makes a big difference. If you do more work inside the called fnc, I expect the difference will be much less pronounced. I decided uglifying ~10 lines of crunch C code was worth the gain.

      • Patrick Snape October 1, 2013 6:35 pm  Reply

        Thanks for the quick response. Interesting to note. I don’t usually deal with data quite that massive, but I do frequently iterate in tight loops. I’ll have to see how much of a win I get!

        Thanks again!

      • jeffery October 5, 2013 6:19 pm  Reply

        Hi. Radim
        Since I have a problem on the speed, I downloaded the dev version of gensim and did the following to install.

        sudo python setup.py install

        A problem happened to me is word2vec_inner.pyx did not go into the models fold. So I have to copy as follows: sudo cp gensim/models/word2vec_inner.pyx /usr/local/lib/python2.7/dist-packages/gensim-0.8.7-py2.7.egg/gensim/models/

        After that, I can run the program, but I have even serious speed problem. The single thread speed drops to 21k (previously version 67.8k). When I set 8 workers, it goes up to 140K. The corpus for all the test are text8.

        Another trivial thing to speed up I would like to mention is, comiple word2vec.py to c-so lib, I got around 50% speed-up. Say 8 workers, from 140K to 210K.

        • Radim October 5, 2013 6:37 pm  Reply

          Hi jeffery, the easiest way is to just add the gensim develop dir to your PYTHONPATH, or install it with pip install -e .. The code will change a lot until the next release (unit tests, packaging), so there’s no point doing a “hard” install.

          Why your speed varies I don’t know. Like I said, for support come to the gensim mailing list.

          • jeffery October 5, 2013 7:13 pm 

            Hi. Radim
            Thanks for your quick response. I filed that on gensim google group.

  8. Benjamin October 2, 2013 12:38 am  Reply

    Heads up for anyone trying the code – matutils.zeros_aligned is only on the develop branch, so if you update gensim, it still won’t run. I installed gensim from the zip file of the develop branch, and it ran. 103k words/sec, whohoo!

  9. Leon November 19, 2013 10:45 am  Reply

    Hi Radim, nice works. Thanks for all these stuffs.
    Recently, I still have a problem about the speed. I first installed the cython (“easy_install cython”) to use optimized word2vec training, but when i run the word2vec in gensim, the value of “FAST_VERSION” is -1, which indicates that the optimized one is not used. And I can’t figure out how to solve this.

    Any suggestions?? Thanks..

    • Radim November 19, 2013 12:41 pm  Reply

      Hi Leon, hard to say, not enough info. Come on the gensim mailing list or open a github issue. Those are better mediums for debugging.

      • Leon November 21, 2013 4:49 am  Reply

        Hi, Radim, thanks for your response. Sorry to disturb you again, i am a fresher in python and gensim.
        I run the program on the win7 (64bit) operating systems and the cython is installed before the code running. And the speed is similar with the NumPy baseline case (1.4k/s).
        And When I debug the code “from word2vec_inner import train_sentence, FAST_VERSION”, an exception is throwed: “ImportError: No module named word2vec_inner”. (The word2vec_inner.pyx file is also in the models directory).

        Thanks..

        • Radim November 21, 2013 8:45 am  Reply

          Hi again Leon, for the mailing list, google “gensim mailing list”.

        • luopuya December 10, 2013 11:08 am  Reply

          Hello Leon, I come across the same problem as you, did you finally solve it?

  10. Florian November 20, 2013 4:56 pm  Reply

    Hi Radim, nice work, porting this over to SciPy!

    I can [not - pls read on...] reproduce your claim of your port being faster on a 4-processor (Nehalem Xeon X5550; cuad-cores @ 2.67GHz) machine with “unlimited” RAM (64 GB).

    First, whether I run your PYX code (using anaconda with their provided atlas package statically linked) or the original C code, the training time with 16 threads (“workers” in your code) is more or less the same for both word2vec and word2vec.py on text8 with otherwise all (word2vec) default parameters (~1 min). To be precise, the python version is actually a tiny bit slower (approx. 12 sec, for a time of ~1.2 min).

    Next, if I train single-threaded/with one worker only, the outcome is that the C version requires ~5.5 min, but your PYX code only takes 4.2 min to train on text8, so indeed your version it is a bit faster in this scenario.

    Third, I checked training on a single CPU with four threads/workers. In this scenario, both implementations take virtually the same time (~2.25 min).

    Last but not least, I tried to use your blas.patch (on the word2vec mailing list) with the latest OpenBLAS from GitHub, using NEHALEM as architecture target an compiling with the “-msse4.2″ instruction set. I added the openblas_set_num_threads call to the main routine, and ran word2vec using 4 threads. Still I could not reproduce your speed improvements – as a matter of fact, word2vec performance is quite a bit *slower* when using the BLAS routines on 4 cores (~3 vs. 2.25 min).

    Finally, I checked the runtime with a single threaded run using OpenBLAS via your patch, and voilá, I finally could reproduce your 2x speed improvement: ~2.83 vs 5.5 min!

    Care to elaborate? It seems, with OpenBLAS you do get a speed improvement, but it does not improve speed when using multiple threads (actually, it got worse!), and neither does the anaconda/atlas-based version of Python “beat” the native version in multi-threaded mode…

    • Radim November 20, 2013 5:16 pm  Reply

      Hi Florian, thanks for the performance report. The port doesn’t really use SciPy (except for the BLAS bindings).

      The exact performance depends on a million factors: your HW, OS, compiler… It’s also quite possible you’re hitting your RAM throughput limits. All threads update the same matrix concurrently (both the C and py version), so there’s a lot of cache contention.

      In any case, ±similar performance is what I was going for. Actually, even 2x slower than C wouldn’t bother me — check out the motivation section in blog post Part 1. If it’s a little bit faster, so much the better :)

      • Florian November 20, 2013 10:50 pm  Reply

        True; I secretly was hoping there is something I did wrong, but probably not… By now I’ve even tried to recompile numpy/scipy with the latest OpenBLAS on because I did not trust anaconda, but it did not make your code any faster, either.

        However, as you state, the real point is that it is always impressive when it turns out that Python can keep up with a pure C library. Great work!

        BTW, I’ve “ported” your word2vec.py and _inner.pyx code to make it work “standalone” (without all of gensim) in Py3k; I’ll ping you on GitHub once I put it online. Maybe it’s worth it distributing “word2vec for Py3″ via PyPi on its own or are you close to having all of gensim ported to Py3?

        • Florian November 20, 2013 10:55 pm  Reply

          * I should add to my reply above that using OpenBLAS (compiled with SSE4.2 and NEHALEM optimizations) + numpy with your code in single threaded mode did finally generate the speedup you report (about twice as fast as word2vec), but this advantage is lost as soon as I use multiple threads.

        • Radim November 21, 2013 3:15 am  Reply

          I’m not thrilled about a py3k fork of a (part of) gensim, because it just means more support and headaches =) But of course, as long as you respect the license, it’s your call.

          • Radim May 25, 2014 12:44 pm 

            For the record, gensim is now (as of 0.10.0) fully python 3 compatible. Including word2vec.

  11. Jin March 15, 2014 8:01 am  Reply

    Great work!
    when I trained the data, I got 101k words/s speed on Macbook Pro 2.4GHZ Dual Core, But when I ran the model.most_similar(), I sometimes got the python crashed with the following report:
    Thread 0 Crashed:: Dispatch queue: com.apple.root.default-priority
    0 libBLAS.dylib
    Do u know why? I didn’t use openblas.

    • Radim March 15, 2014 2:29 pm  Reply

      No idea; maybe faulty RAM? Problems with your SciPy installation? Post your specs/log at the gensim mailing list.

  12. Jie Fu May 1, 2014 10:44 am  Reply

    Hi Radim, have you tried NumbaPro (support GPU computing) to parallelize the code?

    • Radim May 1, 2014 11:41 am  Reply

      No. But I’d be curious to hear the results — if you try, please let me know.

      • Jie Fu May 1, 2014 2:05 pm  Reply

        Sure, I’m modifying a SGD algorithm using NumbaPro (GPU). After this, I will try to modify word2vec and send you the code.

  13. VIshal VIshal May 24, 2014 12:56 pm  Reply

    Hello ,

    I am a huge follower of you work.
    I am trying to develop UI for this work.But I am not an expert in python so I have a littl e issue.
    I tried saving the model to a file but encountered the following error :

    Traceback (most recent call last):
    File “word2Vec_impl.py”, line 39, in
    model.save(fo)
    File “C:\Python27\lib\site-packages\gensim-0.9.0-py2.7.egg\gensim\models\word2
    vec.py”, line 667, in save
    super(Word2Vec, self).save(*args, **kwargs)
    File “C:\Python27\lib\site-packages\gensim-0.9.0-py2.7.egg\gensim\utils.py”, l
    ine 269, in save
    pickle(self, fname)
    File “C:\Python27\lib\site-packages\gensim-0.9.0-py2.7.egg\gensim\utils.py”, l
    ine 608, in pickle
    with smart_open(fname, ‘wb’) as fout: # ‘b’ for binary, needed on Windows
    File “C:\Python27\lib\site-packages\gensim-0.9.0-py2.7.egg\gensim\utils.py”, l
    ine 596, in smart_open
    _, ext = path.splitext(fname)
    File “C:\Python27\lib\ntpath.py”, line 190, in splitext
    return genericpath._splitext(p, sep, altsep, extsep)
    File “C:\Python27\lib\genericpath.py”, line 91, in _splitext
    sepIndex = p.rfind(sep)
    AttributeError: ‘file’ object has no attribute ‘rfind’

    Any idea what could be the solution?

    • Radim May 24, 2014 1:09 pm  Reply

      Hi, you need to pass a filename (=string) to `save()`, not a file object.

  14. Leonid Boytsov May 29, 2014 3:49 am  Reply

    Exponents are still very expensive unless you use GPU or the Intel compiler. May take dozens of CPU cycles.

Leave a Reply

Current day month ye@r *