Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim is designed to process raw, unstructured digital texts (“plain text”). Its algorithms, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.
Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.
Creation of gensim was motivated by a perceived lack of available, scalable software frameworks that realize topic modelling, and/or their overwhelming internal complexity (hail java!). You can read more about the motivation in our LREC 2010 workshop paper. If you want to cite gensim in your own work, please refer to that article (BibTeX).
You’re welcome to share your results and experiments on the mailing list.
The principal design objectives behind gensim are:
If you’re interested in document indexing/similarity retrieval, I also maintain a higher-level package, a document similarity server, which uses gensim internally.
See the install page for more info on gensim deployment.
In the Vector Space Model (VSM), each document is represented by an array of features. For example, a single feature may be thought of as a question-answer pair:
The question is usually represented only by its integer id (such as 1, 2 and 3 here), so that the representation of this document becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we may leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be thought of as a high-dimensional (in this case 3-dimensional) vector. For practical purposes, only questions to which the answer is (or can be converted to) a single real number are allowed.
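As a minimal sketch of moving from explicit (id, answer) pairs to the implicit form, assuming the question ids are agreed upon in advance (the helper name `to_implicit` is hypothetical and not part of gensim):

```python
# One document as explicit (question_id, answer) pairs.
pairs = [(1, 0.0), (2, 2.0), (3, 5.0)]


def to_implicit(pairs, question_ids):
    """Return only the answers, ordered by the agreed-upon question ids."""
    answers = dict(pairs)
    return tuple(answers[qid] for qid in question_ids)


# With the questions known in advance, the ids can be left implicit.
vector = to_implicit(pairs, question_ids=(1, 2, 3))
print(vector)  # (0.0, 2.0, 5.0)
```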
The questions are the same for each document, so that looking at two vectors (representing two documents), we will hopefully be able to make conclusions such as “The numbers in these two vectors are very similar, and therefore the original documents must be similar, too”. Of course, whether such conclusions correspond to reality depends on how well we picked our questions.
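One common way to quantify how similar the numbers in two vectors are is cosine similarity. The text above does not name a particular measure, so treat the choice as an assumption; the example vectors `doc_b` and `doc_c` are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = (0.0, 2.0, 5.0)
doc_b = (0.0, 4.0, 10.0)  # hypothetical: same proportions as doc_a
doc_c = (3.0, 0.0, 0.0)   # hypothetical: shares no non-zero feature with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # close to 1: very similar answers
sim_ac = cosine_similarity(doc_a, doc_c)  # 0: nothing in common
```

A score near 1 only says the answers agree; whether the documents are really similar still depends on how well the questions were chosen.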
Typically, the answer to most questions will be 0.0. To save space, we omit them from the document’s representation, and write only (2, 2.0), (3, 5.0) (note the missing (1, 0.0)). Since the set of all questions is known in advance, all the missing features in a sparse representation of a document can be unambiguously resolved to zero, 0.0.
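The sparse form can be sketched in a few lines. The helpers `to_sparse` and `feature` are hypothetical names used only for illustration, and the ids are assumed to start at 1 to match the example above.

```python
def to_sparse(dense, first_id=1):
    """Keep only the non-zero answers, paired with their question ids."""
    return [(i, x) for i, x in enumerate(dense, start=first_id) if x != 0.0]


def feature(sparse, feature_id):
    """Look up one answer; ids missing from the sparse form resolve to 0.0."""
    return dict(sparse).get(feature_id, 0.0)


sparse = to_sparse((0.0, 2.0, 5.0))
print(sparse)              # [(2, 2.0), (3, 5.0)]
print(feature(sparse, 1))  # 0.0, recovered although (1, 0.0) was omitted
```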
Gensim is unusual in that it doesn’t prescribe any particular corpus format; a corpus is anything that, when iterated over, successively yields these sparse vectors. For example, [[(2, 2.0), (3, 5.0)], [(0, -1.0), (3, -1.0)]] is a trivial corpus of two documents, each with two non-zero feature-answer pairs.
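Because only iteration is required, a corpus does not have to be held in memory as a list. The sketch below assumes a made-up text format with one document per line and `id:value` tokens; the class name `LineCorpus` is hypothetical and not part of gensim.

```python
class LineCorpus:
    """Hypothetical corpus: each line is one document written as 'id:value' tokens."""

    def __init__(self, lines):
        # Any re-iterable source of lines works here (a list, for this sketch).
        self.lines = lines

    def __iter__(self):
        # Yield one sparse vector at a time instead of building them all up front.
        for line in self.lines:
            doc = []
            for token in line.split():
                feature_id, value = token.split(":")
                doc.append((int(feature_id), float(value)))
            yield doc


corpus = LineCorpus(["2:2.0 3:5.0", "0:-1.0 3:-1.0"])
for document in corpus:
    print(document)
```

Defining `__iter__` on a class, rather than passing around a one-shot generator, lets the same corpus be iterated more than once.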
For examples of how this works out in code, see the tutorials.