Gensim is a popular machine learning library for text clustering. It targets large-scale automated thematic analysis of unstructured (aka “natural language”) text.
Gensim’s tagline: “Topic Modelling for Humans”
Who, where, when
I created this library while living in Thailand and finishing my Ph.D. thesis, in 2010-2011. It is closely tied to the algorithms I developed in that thesis: while the algorithms are fairly useful and practical (see the “Impact” section below), I thought they would never see much daylight if they stayed only as mathematical pseudo-code in some obscure manuscript.
So I decided to implement them efficiently, and release the result as open source.
Gensim wraps some low-level mathematical Fortran and C libraries in Python, using them as building blocks for higher level algorithms that analyse free-style text.
The analysis happens on a topical level: What are the prevalent themes in this corpus of texts? To what degree is each theme represented in a given document?
Such a thematic representation of documents can then be used to find texts on similar themes: Which other documents deal with these topics? What are the most similar documents, with similarity measured by abstract themes rather than by direct, menial overlap of keywords?
Impact
The field of topic modelling has seen something of a spike in both academic and commercial interest recently. As a result, gensim has been used by multiple companies, universities and individual students and researchers. The library itself is free (released under the LGPL), so people use it as a tool to create concrete applications, serving concrete goals. Gensim has collected over 250 academic citations, plus some positive testimonials, which I’ll take the proud liberty of quoting here:
“We are using gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works.”
Andrius Butkus, Data Scientist, Issuu
“Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing.”
Waylon Flinn, Lead Data Scientist, Tailwind
“We used gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets.”
Josh Hemann, Group Manager – Marketing Analytics, Sports Authority
“Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality.”
Bruno Champion, CEO at DynAdmic
“I used gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It’s a reliable library that can be used beyond prototyping too.”
Dieter Plaetinck, researcher, Ghent University
“Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness.”
Alan J. Salmoni, Director, Roistr.com
“We have been using gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA.”
Michael Kai Petersen, professor, Technical University of Denmark
“Gensim provides powerful and efficient tools for semantic similarity and topic modelling behind a consistent, idiomatic, well-thought-out, and well-documented API. It’s an excellent example of what a programming library should be, in Python or in any language.”
William Bert, happy user
Used tools
Git, Python, C, Tox, Sphinx… and Google Scholar, for lots and lots of academic research.