{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Topics and Transformations\n\nIntroduces transformations and demonstrates their use on a toy corpus.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, I will show how to transform documents from one vector representation\ninto another. This process serves two goals:\n\n1. To bring out hidden structure in the corpus, discover relationships between\n words and use them to describe the documents in a new and\n (hopefully) more semantic way.\n2. To make the document representation more compact. This both improves efficiency\n (new representation consumes less resources) and efficacy (marginal data\n trends are ignored, noise-reduction).\n\n## Creating the Corpus\n\nFirst, we need to create a corpus to work with.\nThis step is the same as in the previous tutorial;\nif you completed it, feel free to skip to the next section.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import defaultdict\nfrom gensim import corpora\n\ndocuments = [\n \"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A 
survey\",\n]\n\n# remove common words and tokenize\nstoplist = set('for a of the and to in'.split())\ntexts = [\n [word for word in document.lower().split() if word not in stoplist]\n for document in documents\n]\n\n# remove words that appear only once\nfrequency = defaultdict(int)\nfor text in texts:\n for token in text:\n frequency[token] += 1\n\ntexts = [\n [token for token in text if frequency[token] > 1]\n for text in texts\n]\n\ndictionary = corpora.Dictionary(texts)\ncorpus = [dictionary.doc2bow(text) for text in texts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a transformation\n\nThe transformations are standard Python objects, typically initialized by means of\na :dfn:`training corpus`:\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim import models\n\ntfidf = models.TfidfModel(corpus) # step 1 -- initialize a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different\ntransformations may require different initialization parameters; in case of TfIdf, the\n\"training\" consists simply of going through the supplied corpus once and computing document frequencies\nof all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet\nAllocation, is much more involved and, consequently, takes much more time.\n\n
Transformations always convert between two specific vector\n spaces. The same vector space (= the same set of feature ids) must be used for training\n as well as for subsequent vector transformations. Failure to use the same input\n feature space -- such as applying different string preprocessing, using different\n feature ids, or using bag-of-words input vectors where TfIdf vectors are expected -- will\n result in feature mismatch during the transformation calls, and consequently in\n garbage output and/or runtime exceptions.
Calling ``model[corpus]`` only creates a wrapper around the old ``corpus``\n document stream -- actual conversions are done on-the-fly, during document iteration.\n We cannot convert the entire corpus at the time of calling ``corpus_transformed = model[corpus]``,\n because that would mean storing the result in main memory, and that contradicts gensim's objective of memory-independence.\n If you will be iterating over the transformed ``corpus_transformed`` multiple times, and the\n transformation is costly, `serialize the resulting corpus to disk first