Radim Řehůřek : Articles

Christmas in Vietnam: Mui Ne and Saigon

posted on February 15, 2016 by Radim | 8 Comments

For our vacation this year, we picked Vietnam: Christmas in the touristy region of Mui Ne, then New Year in Saigon (aka Ho Chi Minh City). This post is a short travel log with pictures, highlights, travel costs and tips from our stay there. My hope is this may help someone planning a similar trip.

Making sense of word2vec

posted on December 23, 2014 by Radim | 21 Comments

One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the year.

Doc2vec tutorial

posted on December 15, 2014 by Radim | 36 Comments

The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.

Doc2vec (aka paragraph2vec, aka sentence embeddings) extends the word2vec algorithm to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.

Multicore LDA in Python: from over-night to over-lunch

posted on September 21, 2014 by Radim | 4 Comments

Latent Dirichlet Allocation (LDA), one of the most used modules in gensim, has received a major performance revamp recently. Now that it uses all your machine's cores at once, chances are the new LdaMulticore class is limited by the speed at which you can feed it input data. Make sure your CPU fans are in working order!

Gran Canaria in winter

posted on April 4, 2014 by Radim | 7 Comments

Now that I have a blog, I figured I could start posting more info about our travels. So here’s a little digest from one of our recent trips. I’m hoping it will be useful to other tourists looking to visit Gran Canaria, especially in the same season we went (February).

Data streaming in Python: generators, iterators, iterables

posted on March 31, 2014 by Radim | 6 Comments

There are tools and concepts in computing that are very powerful but potentially confusing to novices. One such concept is data streaming (aka lazy evaluation), which can be realized neatly and natively in Python. Do you know when and how to use generators, iterators and iterables?
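
To make the distinction concrete, here is a minimal sketch in plain Python. The names `squares_generator` and `SquaresIterable` are made up for illustration and do not come from the post itself. A generator can be consumed only once, while an iterable hands out a fresh iterator on every pass:

```python
def squares_generator(n):
    """Generator function: yields values lazily, one at a time."""
    for i in range(n):
        yield i * i


class SquaresIterable:
    """Iterable: every call to __iter__ starts a brand new stream."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Delegating to a generator gives us a fresh iterator per pass.
        return squares_generator(self.n)


# A generator is exhausted after a single pass.
gen = squares_generator(4)
first_pass = list(gen)
second_pass = list(gen)  # nothing left to yield

# An iterable can be streamed over as many times as needed.
stream = SquaresIterable(4)
pass_one = list(stream)
pass_two = list(stream)
```

The difference matters for any algorithm that needs several passes over its input: hand it a bare generator and the second pass silently sees no data.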

Tutorial on Mallet in Python

posted on March 20, 2014 by Radim | 9 Comments

MALLET, “MAchine Learning for LanguagE Toolkit”, is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.

Word2vec Tutorial

posted on February 2, 2014 by Radim | 67 Comments

I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

Performance Shootout of Nearest Neighbours: Querying

posted on January 12, 2014 by Radim | 27 Comments

Previous posts explained the whys & whats of nearest-neighbour search, the available OSS libraries and Python wrappers. We converted the English Wikipedia to vector space, to be used as our testing dataset for retrieving “similar articles”. In this post, I finally get to some hard performance numbers.
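
For readers new to the task, the baseline that such libraries compete against can be sketched in a few lines of plain Python. This is only an illustrative brute-force search over a toy index, not the benchmark code from the series; the function names, document ids and vectors are all invented for the example:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat all-zero vectors as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(query, index, topn=2):
    """Exhaustive scan: score every document, return the topn best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy "vector space": three hypothetical documents in three dimensions.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
result = most_similar([1.0, 0.0, 0.0], index, topn=2)
```

The scan is linear in the number of documents, which is exactly why approximate methods become interesting at Wikipedia scale.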

Asymmetric LDA Priors, Christmas Edition

posted on December 21, 2013 by Radim | 1 Comment

The end of the year is proving crazy busy as usual, but gensim acquired a cool new feature that I just had to blog about. Ben Trahan sent a patch that allows automatic tuning of Latent Dirichlet Allocation (LDA) hyperparameters in gensim. This means that an optimal, asymmetric alpha can now be trained directly from your data.