What is Gensim?
Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms.
The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel), etc., automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.
Once these statistical patterns are found, any plain text document (sentence, phrase, word…) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases…).
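To make "express as a vector, then query for similarity" concrete, here is a minimal sketch in plain Python. It deliberately does not use Gensim's API: the helper names (`to_vector`, `cosine_similarity`, `rank_by_similarity`) are made up for this illustration, and the vectors are raw word counts rather than the learned semantic representations that models such as LSI or Word2Vec produce. It only shows the general shape of a vector-space similarity query.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    query_vec = to_vector(query)
    scored = [(cosine_similarity(query_vec, to_vector(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy corpus, invented for this example.
documents = [
    "human machine interface for computer applications",
    "graph of trees and minors",
    "user interface response time of the computer system",
]

ranking = rank_by_similarity("computer interface", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

Word counts only match documents that share exact words with the query. The point of the algorithms listed above is to go further and relate documents that are topically similar even when their vocabulary differs.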
Note
If the previous paragraphs left you confused, you can read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
Design principles
We built Gensim from scratch for:
Practicality – as industry experts, we focus on proven, battle-hardened algorithms to solve real industry problems. More focus on engineering, less on academia.
Memory independence – there is no need for the whole training corpus to reside fully in RAM at any one time. Gensim can process large, web-scale corpora using data streaming.
Performance – highly optimized implementations of popular vector space algorithms using C, BLAS and memory-mapping.
Who are “we”? Check the People behind Gensim.
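The memory-independence principle above can be sketched with nothing but the standard library. The class below is a hypothetical illustration, not part of Gensim: it reads a corpus file one line at a time, so only a single document is held in memory at any moment, and it reopens the file on every pass so the corpus can be iterated more than once. The one-document-per-line file layout is an assumption made for this example.

```python
import os
import tempfile


class StreamedCorpus:
    """Restartable iterable yielding one tokenized document per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on each pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny throwaway corpus file for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\n\nGraph of trees\n")
    corpus_path = tmp.name

corpus = StreamedCorpus(corpus_path)

# Two independent passes; neither loads the whole file into memory.
doc_count = sum(1 for _ in corpus)
total_tokens = sum(len(doc) for doc in corpus)
print(doc_count, total_tokens)

os.remove(corpus_path)
```

Training algorithms that need several passes over the data can work with an object like this because each `for` loop over it starts again from the beginning of the file, which a one-shot generator would not do.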
Installation
Gensim is a Python library, so you need Python. Gensim supports all Python versions that haven’t reached their end-of-life.
If you need to work with an older Python (such as Python 2.7), you must install an older version of Gensim (such as Gensim 3.8.3).
To install Gensim, simply run:
pip install --upgrade gensim
Alternatively, you can download the source code from GitHub or the Python Package Index.
After installation, learn how to use Gensim from its Core Concepts tutorials.
Licensing
Gensim is licensed under the OSI-approved GNU LGPLv2.1 license. This means that it’s free for both personal and commercial use, but if you make any modification to Gensim that you distribute to other people, you have to disclose the source code of these modifications.
Apart from that, you are free to redistribute Gensim in any way you like, though you’re not allowed to modify its license (doh!).
If LGPL doesn’t fit your bill, you can ask for Commercial support.
Academic citing
Gensim has been used in over two thousand research papers and student theses.
When citing Gensim, please use this BibTeX entry:
@inproceedings{rehurek_lrec,
title = {{Software Framework for Topic Modelling with Large Corpora}},
author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
booktitle = {{Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks}},
pages = {45--50},
year = 2010,
month = May,
day = 22,
publisher = {ELRA},
address = {Valletta, Malta},
note={\url{http://is.muni.cz/publication/884893/en}},
language={English}
}
Gensim = “Generate Similar”
Historically, Gensim started off as a collection of Python scripts for the Czech Digital Mathematics Library dml.cz project, back in 2008. The scripts served to generate a short list of the most similar math articles to a given article.
I (Radim) also wanted to try these fancy “Latent Semantic Methods”, but the libraries that realized the necessary computation were not much fun to work with.
Naturally, I set out to reinvent the wheel. Our 2010 LREC publication describes the initial design decisions behind Gensim: clarity, efficiency and scalability. It is fairly representative of how Gensim works even today.
Later versions of Gensim improved this efficiency and scalability tremendously. In fact, I made algorithmic scalability of distributional semantics the topic of my PhD thesis.
By now, Gensim is, to my knowledge, the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text. It stands in contrast to brittle homework-assignment implementations that do not scale on the one hand, and robust Java-esque projects that take forever just to run “hello world” on the other.
In 2011, I moved Gensim’s source code to GitHub and created the Gensim website. In 2013 Gensim got its current logo, and in 2020 a website redesign.