Topic modelling for humans
Gensim is a FREE Python library

✔   Train large-scale semantic NLP models

✔   Represent text as semantic vectors

✔   Find semantically related documents

from gensim import corpora, models, similarities

# Stream a training corpus directly from S3.
corpus = corpora.MmCorpus("s3://path/to/corpus")

# Train Latent Semantic Indexing with 200D vectors.
lsi = models.LsiModel(corpus, num_topics=200)

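# Convert another corpus to the LSI space and index it.
# (`another_corpus` is a placeholder for a corpus in the same bag-of-words format as `corpus`.)
index = similarities.MatrixSimilarity(lsi[another_corpus])

# Compute similarity of a query vs indexed documents.
# (`query` is a placeholder; it must be a vector in the same LSI space as the index.)
sims = index[query]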

Why Gensim?

Super fast


The fastest library for training of vector embeddings – Python or otherwise. The core algorithms in Gensim use battle-hardened, highly optimized & parallelized C routines.

Data Streaming


Gensim can process arbitrarily large corpora, using data-streamed algorithms. There are no "dataset must fit in RAM" limitations.
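
In practice, "data-streamed" means a corpus can be any Python iterable that yields one document at a time as a sparse bag-of-words vector (a list of `(token_id, count)` pairs). The sketch below illustrates the idea with the standard library only. The `StreamedCorpus` class, the one-document-per-line file layout, and the fixed `token2id` mapping are assumptions made for this example, not part of Gensim's API.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words vector per line of a text file.

    Hypothetical helper: only the current line is held in memory.
    """

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                # Sparse vector: (token_id, count) pairs sorted by token id.
                yield sorted(counts.items())


# Write a two-document demo file (one document per line).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("graph trees graph\nsurvey of trees\n")
    corpus_path = tmp.name

token2id = {"graph": 0, "trees": 1, "survey": 2}
corpus = StreamedCorpus(corpus_path, token2id)

first_pass = list(corpus)
second_pass = list(corpus)  # a second traversal yields the same vectors
os.remove(corpus_path)

print(first_pass)  # → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Because `__iter__` reopens the file each time, the object can be traversed more than once, which matters for algorithms that make several passes over the data. An iterable like this should be usable wherever Gensim expects a corpus, for example as the first argument to `models.LsiModel`.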

Platform independent


Gensim runs on Linux, Windows and macOS, as well as any other platform that supports Python and NumPy.

Proven


With thousands of companies using Gensim every day, over 2600 academic citations and 1M downloads per week, Gensim is one of the most mature ML libraries.

Open source


All Gensim source code is hosted on GitHub under the GNU LGPL license and is maintained by its open source community. For commercial arrangements, see Business Support.

Ready-to-use models and corpora


The Gensim community also publishes pretrained models for specific domains like legal or health, via the Gensim-data project.

Installation


Quick install

Run in your terminal (recommended):

pip install --upgrade gensim

or, alternatively for conda environments:

conda install -c conda-forge gensim

That's it! Congratulations, you can proceed to the tutorials.

Code dependencies

Gensim runs on Linux, Windows and macOS, and should run on any other platform that supports Python 3.8+ and NumPy. Gensim depends on the following software:

  • Python, tested with versions 3.8, 3.9, 3.10 and 3.11.
  • NumPy for number crunching.
  • smart_open for transparently opening files on remote storage as well as compressed files.
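
One way to confirm that an environment meets the requirements listed above is a short check script. This is a minimal sketch using only the standard library; the `check_environment` helper and its report format are hypothetical, and it only tests whether each package can be located, not which version is installed.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 8)
REQUIRED_PACKAGES = ("numpy", "smart_open", "gensim")


def check_environment():
    """Report whether the interpreter and the listed packages are available."""
    report = {"python_ok": sys.version_info >= REQUIRED_PYTHON}
    for package in REQUIRED_PACKAGES:
        # find_spec locates a top-level package without importing it.
        report[package] = importlib.util.find_spec(package) is not None
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```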

Testing Gensim

Gensim uses continuous integration, automatically running a full test suite on each pull request:

CI service      Task
GitHub Actions  Run tests on Linux and Mac, plus check code style
AppVeyor        Run tests on Windows
CircleCI        Build documentation

Or, to install and test Gensim locally:


pip install -e .  # compile and install Gensim from the current directory
pytest gensim     # run the tests

Who is using Gensim?

Doing something interesting with Gensim? Sponsor Gensim and ask to be featured among adopters.

  • “Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing.”

    Waylon Flinn
    Tailwind
  • “We are using Gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works.”

    Andrius Butkus
    Issuu
  • “Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness.”

    Alan J. Salmoni
    Roistr.com
  • “I used Gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It's a reliable library that can be used beyond prototyping too.”

  • “We used Gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having Gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets.”

    Josh Hemann
    Sports Authority
  • “Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality.”

    Bruno Champion
    DynAdmic
  • “Based on our experience with Gensim on DML-CZ, we naturally opted to use it on a much bigger scale for similarity of fulltexts of scientific papers in the European Digital Mathematics Library. In evaluation with other approaches, Gensim became a clear winner, especially because of speed, scalability and ease of use.”

    Petr Sojka
    EuDML
  • “We have been using Gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA.”
