Topic Modeling for Fun and Profit

In this notebook:

  • you learn efficient patterns for processing large corpora (streamed processing)
  • I try to convince you iterators and generators are a useful (and joyful!) tool, not black magic
  • you write your own streamed processing pipeline for large corpora, incl. basic NLP: collocation detection, lemmatization, stopwords
In [1]:
# import and setup modules we'll be using in this notebook
import logging
import os
import sys
import re
import tarfile
import itertools

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

import gensim
from gensim.parsing.preprocessing import STOPWORDS

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

Data preprocessing

Previously, we downloaded the 20newsgroups dataset and left it under ./data/ as a gzipped tarball:

In [2]:
!ls -l ./data/20news-bydate.tar.gz
-rw-r--r--  1 kofola  staff  14464277 Jul 22 21:19 ./data/20news-bydate.tar.gz

Let's use this file for some topic modeling. Instead of decompressing the archive to disk, let's access the files inside it directly from Python:

In [3]:
with tarfile.open('./data/20news-bydate.tar.gz', 'r:gz') as tf:
    # get information (metadata) about all files in the tarball
    file_infos = [file_info for file_info in tf if file_info.isfile()]
    
    # print one of them; for example, the first one
    message = tf.extractfile(file_infos[0]).read()
    print(message)
From: Nanci Ann Miller <[email protected]>
Subject: Re: Amusing atheists and agnostics
Organization: Sponsored account, School of Computer Science, Carnegie Mellon, Pittsburgh, PA
Lines: 33
NNTP-Posting-Host: po4.andrew.cmu.edu
In-Reply-To: <timmbake.735196735@mcl>

[email protected] (Bake Timmons) writes:
> There lies the hypocrisy, dude.  Atheism takes as much faith as theism.  
> Admit it!

Some people might think it takes faith to be an atheist... but faith in
what?  Does it take some kind of faith to say that the Great Invisible Pink
Unicorn does not exist?  Does it take some kind of faith to say that Santa
Claus does not exist?  If it does (and it may for some people I suppose) it
certainly isn't as big a leap of faith to say that these things (and god)
DO exist.  (I suppose it depends on your notion and definition of "faith".)

Besides... not believing in a god means one doesn't have to deal with all
of the extra baggage that comes with it!  This leaves a person feeling
wonderfully free, especially after beaten over the head with it for years!
I agree that religion and belief is often an important psychological healer
for many people and for that reason I think it's important.  However,
trying to force a psychological fantasy (I don't mean that in a bad way,
but that's what it really is) on someone else who isn't interested is
extremely rude.  What if I still believed in Santa Claus and said that my
belief in Santa did wonderful things for my life (making me a better
person, allowing me to live without guilt, etc...) and then tried to get
you to believe in Santa too just 'cuz he did so much for me?  You'd call
the men in white coats as soon as you could get to a phone.

> --
> Bake Timmons, III

Nanci  (just babbling... :-))
.........................................................................
If you know (and are SURE of) the author of this quote, please send me
email ([email protected]):
Spring is nature's way of saying, 'Let's party!'



This text is typical of real-world data. It contains a mix of relevant text, metadata (email headers), and downright noise. Even its relevant content is unstructured, with email addresses, people's names, quotations etc.

Most machine learning methods, topic modeling included, are only as good as the data you give them. At this point, we generally want to clean the data as much as possible. While the subsequent steps in the machine learning pipeline are more or less automated, handling the raw data has to reflect the intended purpose of the application: its business logic, its idiosyncrasies, and basic sanity checks (aren't we accidentally receiving and parsing image data instead of plain text?). As always with automated processing, it's garbage in, garbage out.
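
For instance, a cheap up-front sanity check could verify that each raw message decodes into something that actually looks like plain text before it enters the pipeline. A minimal sketch (the helper name and the 0.9 threshold are just illustrative):

In [ ]:
import string

def looks_like_text(raw_message, min_ok_ratio=0.9):
    """
    Crude sanity check: decode a raw message and verify that most of its
    characters are ordinary text (letters, digits, punctuation, whitespace).
    """
    text = gensim.utils.to_unicode(raw_message, 'latin1')
    if not text.strip():
        return False  # empty messages are suspicious, too
    ok_chars = sum(1 for ch in text if ch.isalnum() or ch.isspace() or ch in string.punctuation)
    return ok_chars / float(len(text)) >= min_ok_ratio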

As an example, let's write a function that aims to extract only the chunk of relevant text from each message, ignoring email headers:

In [4]:
def process_message(message):
    """
    Preprocess a single 20newsgroups message, returning the result as
    a unicode string.
    
    """
    message = gensim.utils.to_unicode(message, 'latin1').strip()
    blocks = message.split(u'\n\n')
    # skip email headers (the first block)
    content = u'\n\n'.join(blocks[1:])
    return content

print(process_message(message))
[email protected] (Bake Timmons) writes:
> There lies the hypocrisy, dude.  Atheism takes as much faith as theism.  
> Admit it!

Some people might think it takes faith to be an atheist... but faith in
what?  Does it take some kind of faith to say that the Great Invisible Pink
Unicorn does not exist?  Does it take some kind of faith to say that Santa
Claus does not exist?  If it does (and it may for some people I suppose) it
certainly isn't as big a leap of faith to say that these things (and god)
DO exist.  (I suppose it depends on your notion and definition of "faith".)

Besides... not believing in a god means one doesn't have to deal with all
of the extra baggage that comes with it!  This leaves a person feeling
wonderfully free, especially after beaten over the head with it for years!
I agree that religion and belief is often an important psychological healer
for many people and for that reason I think it's important.  However,
trying to force a psychological fantasy (I don't mean that in a bad way,
but that's what it really is) on someone else who isn't interested is
extremely rude.  What if I still believed in Santa Claus and said that my
belief in Santa did wonderful things for my life (making me a better
person, allowing me to live without guilt, etc...) and then tried to get
you to believe in Santa too just 'cuz he did so much for me?  You'd call
the men in white coats as soon as you could get to a phone.

> --
> Bake Timmons, III

Feel free to modify this function and test out other ideas for clean up. The flexibility Python gives you in processing text is superb -- it'd be a crime to hide the processing behind opaque APIs, exposing only one or two tweakable parameters.
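
For example, one variation could also drop quoted reply lines, which in newsgroup messages start with ">". A small sketch building on process_message() above (the function name here is made up, purely for illustration):

In [ ]:
def process_message_noquotes(message):
    """Like process_message(), but also drop quoted reply lines (those starting with '>')."""
    content = process_message(message)
    lines = [line for line in content.split(u'\n') if not line.lstrip().startswith(u'>')]
    return u'\n'.join(lines)

print(process_message_noquotes(message))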

There are a handful of handy Python libraries for text cleanup: jusText removes HTML boilerplate and extracts "main text" of a web page. NLTK, Pattern and TextBlob are good for tokenization, POS tagging, sentence splitting and generic NLP, with a nice Pythonic interface. None of them scales very well though, so keep the inputs small.
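
For instance, TextBlob (which we'll also use later in this notebook for noun phrase extraction) makes sentence splitting and POS tagging one-liners. A quick sketch on a made-up sentence:

In [ ]:
from textblob import TextBlob

blob = TextBlob(u"Spring is nature's way of saying 'Let's party!'. Text processing in Python is fun.")
print(blob.sentences)   # sentence splitting
print(blob.tags)        # (word, POS tag) pairs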

Exercise (5 min): Modify the process_message function to ignore message footers, too.

It's good practice to inspect your data visually at each point as it passes through your data processing pipeline. Simply printing (logging) a few arbitrary entries, à la UNIX head, does wonders for spotting unexpected bugs: oh, bad encoding! What is Chinese doing there, weren't all texts supposed to be English only? Do these rubbish tokens come from embedded images? How come everything's empty? Grabbing a text and thinking "hey, let's blindly tokenize it into a bag of words like they do in the tutorials, push it through this magical unicorn machine learning library and hope for the best" is ill-advised.

Another good practice is to keep internal strings as Unicode, and only encode/decode on IO (preferably using UTF8). As of Python 3.3, there is practically no memory penalty for using Unicode over UTF8 byte strings (PEP 393).
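
A minimal sketch of that convention, with made-up file names: decode bytes into Unicode at the input boundary, work with Unicode internally, and encode back to bytes only on output:

In [ ]:
import io

# decode on the way in: io.open with an explicit encoding hands us unicode strings
with io.open('./data/some_input.txt', 'r', encoding='utf8') as fin:
    text = fin.read()  # unicode from here on

processed = text.strip().lower()  # all internal processing stays in unicode

# encode on the way out
with io.open('./data/some_output.txt', 'w', encoding='utf8') as fout:
    fout.write(processed)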

Data streaming

Let's write a function to go over all messages in the 20newsgroups archive:

In [5]:
def iter_20newsgroups(fname, log_every=None):
    """
    Yield plain text of each 20 newsgroups message, as a unicode string.

    The messages are read from raw tar.gz file `fname` on disk (e.g. `./data/20news-bydate.tar.gz`)

    """
    extracted = 0
    with tarfile.open(fname, 'r:gz') as tf:
        for file_number, file_info in enumerate(tf):
            if file_info.isfile():
                if log_every and extracted % log_every == 0:
                    logging.info("extracting 20newsgroups file #%i: %s" % (extracted, file_info.name))
                content = tf.extractfile(file_info).read()
                yield process_message(content)
                extracted += 1

This uses the process_message() function we wrote above to process each message in turn. The messages are extracted on the fly, one after another, using a generator.

Such data streaming is a very important pattern: real data is typically too large to fit into RAM, and we don't need all of it in RAM at the same time anyway -- that's just wasteful. With streamed data, we can process arbitrarily large input, reading the data from a file on disk, SQL database, shared network disk, or even more exotic remote network protocols.
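
The same pattern works for any source we can read lazily. For example, here is a sketch of streaming documents straight out of an SQLite database (the table and column names are made up; any DB-API cursor can be iterated the same way):

In [ ]:
import sqlite3

def iter_documents_from_db(db_path):
    """Yield documents one at a time from an SQLite database, without loading them all into RAM."""
    conn = sqlite3.connect(db_path)
    try:
        # the cursor is lazy: rows are fetched as we iterate, not all at once
        for (text,) in conn.execute("SELECT text FROM documents"):
            yield text
    finally:
        conn.close()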

In [6]:
# itertools (a built-in Python library) is an inseparable companion of data streaming
import itertools

# let's only parse and print the first three messages, lazily
# `list(stream)` materializes the stream elements into a plain Python list
message_stream = iter_20newsgroups('./data/20news-bydate.tar.gz', log_every=2)
print(list(itertools.islice(message_stream, 3)))
INFO:root:extracting 20newsgroups file #0: 20news-bydate-test/alt.atheism/53265
INFO:root:extracting 20newsgroups file #2: 20news-bydate-test/alt.atheism/53260

[u'[email protected] (Bake Timmons) writes:\n> There lies the hypocrisy, dude.  Atheism takes as much faith as theism.  \n> Admit it!\n\nSome people might think it takes faith to be an atheist... but faith in\nwhat?  Does it take some kind of faith to say that the Great Invisible Pink\nUnicorn does not exist?  Does it take some kind of faith to say that Santa\nClaus does not exist?  If it does (and it may for some people I suppose) it\ncertainly isn\'t as big a leap of faith to say that these things (and god)\nDO exist.  (I suppose it depends on your notion and definition of "faith".)\n\nBesides... not believing in a god means one doesn\'t have to deal with all\nof the extra baggage that comes with it!  This leaves a person feeling\nwonderfully free, especially after beaten over the head with it for years!\nI agree that religion and belief is often an important psychological healer\nfor many people and for that reason I think it\'s important.  However,\ntrying to force a psychological fantasy (I don\'t mean that in a bad way,\nbut that\'s what it really is) on someone else who isn\'t interested is\nextremely rude.  What if I still believed in Santa Claus and said that my\nbelief in Santa did wonderful things for my life (making me a better\nperson, allowing me to live without guilt, etc...) and then tried to get\nyou to believe in Santa too just \'cuz he did so much for me?  You\'d call\nthe men in white coats as soon as you could get to a phone.\n\n> --\n> Bake Timmons, III', u'Did that FAQ ever got modified to re-define strong atheists as not those who\nassert the nonexistence of God, but as those who assert that they BELIEVE in \nthe nonexistence of God?  There was a thread on this earlier, but I didn\'t get\nthe outcome...\n\n-- Adam "No Nickname" Cooper\n\n', u'[email protected] (Gregg Jaeger) writes:\n>In article <[email protected]> [email protected] (Robert\n>Beauchaine) writes:\n>>Bennett, Neil.  "How BCCI adapted the Koran rules of banking".  The \n>>Times.  August 13, 1991.\n> \n> So, let\'s see. If some guy writes a piece with a title that implies\n> something is the case then it must be so, is that it?\n\nGregg, you haven\'t provided even a title of an article to support *your*\ncontention.\n\n>>  This is how you support a position if you intend to have anyone\n>>  respect it, Gregg.  Any questions?  And I even managed to include\n>>  the above reference with my head firmly engaged in my ass.  What\'s\n>>  your excuse?\n> \n> This supports nothing. I have no reason to believe that this is \n> piece is anything other than another anti-Islamic slander job.\n\nYou also have no reason to believe it *is* an anti-Islamic slander job, apart\nfrom your own prejudices.\n\n> I have no respect for titles, only for real content. I can look\n> up this article if I want, true. But I can tell you BCCI was _not_\n> an Islamic bank.\n\nWhy, yes.  What\'s a mere report in The Times stating that BCCI followed\nIslamic banking rules?  Gregg *knows* Islam is good, and he *knows* BCCI were\nbad, therefore BCCI *cannot* have been Islamic.  Anyone who says otherwise is\nobviously spreading slanderous propaganda.\n\n>                                      If someone wants to discuss\n> the issue more seriously then I\'d be glad to have a real discussion,\n> providing references, etc.\n\nI see.  If someone wants to provide references to articles you agree with,\nyou will also respond with references to articles you agree with?  Mmm, yes,\nthat would be a very intellectually stimulating debate.  
Doubtless that\'s how\nyou spend your time in soc.culture.islam.\n\nI\'ve got a special place for you in my...\n\x0c\n...kill file.  Right next to Bobby.  Want to join him?\n\nThe more you post, the more I become convinced that it is simply a waste of\ntime to try and reason with Moslems.  Is that what you are hoping to achieve?']

Note: Data streaming saves memory, but does nothing to save time. A common pattern for speeding up data processing is parallelization, via Python's multi-processing and multi-threading support. We don't have time to cover processing parallelization or cluster distribution in this tutorial. For an example, see this Wikipedia parsing code from my Similarity Shootout benchmark.
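
Just to sketch the idea (this is not the benchmark code linked above; the helper names are made up): a multiprocessing.Pool can map a per-message function over the stream in parallel, while the input itself stays lazy:

In [ ]:
import multiprocessing

def to_tokens(text):
    """Lowercase & tokenize one message (a top-level function, so worker processes can pickle it)."""
    return list(gensim.utils.tokenize(text, lower=True))

def iter_tokenized_parallel(fname, processes=4):
    """Sketch: stream messages from the tarball and tokenize them in parallel worker processes."""
    pool = multiprocessing.Pool(processes)
    try:
        # imap consumes the input generator lazily; chunksize trades latency for less IPC overhead
        for tokens in pool.imap(to_tokens, iter_20newsgroups(fname), chunksize=100):
            yield tokens
    finally:
        pool.terminate()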

A generator only gives us a single pass through the data:

In [7]:
# print the next two messages; the three messages printed above are already gone
print(list(itertools.islice(message_stream, 2)))
INFO:root:extracting 20newsgroups file #4: 20news-bydate-test/alt.atheism/53333

[u'In article <[email protected]>, [email protected] (Todd Kelley) writes:\n> In light of what happened in Waco, I need to get something of my\n> chest.\n\nSadly understandable...\n\n> \n> Faith and dogma are dangerous.  \n\nYes.\n\n> \n> Religion inherently encourages the implementation of faith and dogma, and\n> for that reason, I scorn religion.\n> \nTo be fair, you should really qualify this as semitic-western religions, but\nyou basically go ahead and do this later on anyway.\n\n> I have expressed this notion in the past.  Some Christians debated\n> with me whether Christianity leaves any room for reasoning.  I claimed\n> rationality is quelled out of Christianity by faith and dogma.\n\nAgain, this should really be evaluated at a personal level.  For example, there\nwas only one Jesus (presumably), and he probably didn\'t say all that many\nthings, and yet (seemingly) billions and billions of Christian sects have\narisen.  Perhaps there is one that is totally dedicated to rationalism and\nbelieves in Christ as in pantheism.  It would seem to go against the Bible, but\nit is amazing what people come up with under the guise of "personal\ninterpretation".\n\n> A philosopher cannot be a Christian because a philosopher can change his mind,\n> whereas a Christian cannot, due to the nature of faith and dogma present\n> in any religion.\n\nThis is a good point.  We have here the quintessential Christian: he sets up a\nsystem of values/beliefs for himself, which work very well, and every\nevent/experience is understandable and deablable within the framework of this\nsystem.  However, we also have an individual who has the inability (at least\nnot without some difficulty) to change, which is important, because the problem\nwith such a system is the same as with any system: one cannot be open minded to\nthe point of "testing hypotheses" against the basic premise of the system\nwithout destroying whatever faith is invested therein, unless of course, all\nthe tests fail.  In other words, the *fairer* way would be to test and evaluate\nmoralities without the bias/responsibility of losing/retaining a system.\n\n> \n> I claimed that a ``Christian philosopher\'\' is not a Christian,\n> but is a person whose beliefs at the moment correspond with those\n> of Christianity. Consider that a person visiting or guarding a prison\n> is not a prisoner, unless you define a prisoner simply to be someone\n> in a prison.\n> Can we define a prisoner to be someone who at the moment is in a prison?\n> Can we define a Christian to be someone who at the moment has Christian\n> beliefs?  No, because if a person is free to go, he is not a prisoner.\n> Similarly, if a person is not constrained by faith and dogma, he is not\n> a Christian.\n\nInteresting, but again, when it seems to basically boil down to individual\nnuances (although not always, I will admit, and probably it is the\nmass-oriented divisions which are the most appalling), it becomes irrelevant,\nunfortunately.\n\n> \n> I admit it\'s a word game.\n> I\'m going by the dictionary definition of religion:\n>    ``religion n. 1. concern over what exists beyond the visible world,\n>      differentiated from philosophy in that it operates through faith\n>      or intuition rather than reason, ...\'\'\n>                                    --Webster\'s\n> \n> Now let\'s go beyond the word game.  I don\'t claim that religion\n> causes genocide.  I think that if all humans were atheist, there\n> would still be genocide.  
There will always be humans who don\'t think.\n> There will always be humans who don\'t ask themselves what is\n> the REAL difference between themselves and people with different\n> colored skin, or a different language, or different beliefs.\n> \n\nGranted\n\n> Religion is like the gun that doesn\'t kill anybody.  Religion encourages\n> faith and dogma and although it doesn\'t directly condemn people,\n> it encourages the use of ``just because\'\' thinking.  It is\n> ``just because\'\' thinking that kills people.\n> \n\nIn which case the people become the bullets, and the religion, as the gun,\nmerely offers them a way to more adequately do some harm with themselves, if I\nmay be so bold as to extend your similie?\n\n> Sure, religion has many good qualities.  It encourages benevolence\n> and philanthropy.  OK, so take out only the bad things: like faith,\n> dogma, and tradition.  Put in the good things, like careful reasoning,\n> and science.  The result is secular humanism.  Wouldn\'t it\n> be nice if everyone were a secular humanist?   To please the\n> supernaturalists, you might even leave God in there, but the secular\n> emphasis would cause the supernaturalists to start thinking, and\n> they too would realize that a belief in a god really doesn\'t put\n> anyone further ahead in understanding the universe (OK, I\'m just\n> poking fun at the supernaturalists :-).\n\nAlso understandable... ;)\n\n> \n> Of course, not all humans are capable of thought, and we\'d still\n> have genocide and maybe even some mass suicide...but not as much.\n> I\'m willing to bet on that.\n> \n> Todd\n> -- \n> Todd Kelley                       [email protected]\n> Department of Computer Science\n> University of Toronto\n-- \n\nbest regards,', u'In article <[email protected]>, mathew <[email protected]> writes:\n> [email protected] (Mark McCullough) writes:\n\nFrom a parallel thread.  Much about definitions of bombs, etc. deleted.\n[...]\n\n> \n>> Aaaahhh.  Tell me, how many innocents were killed in concentration camps?\n>> mm-hmm.  Now, how many more were scheduled to enter concentration camps\n>> had they not been shut down because they were captured by the allies?\n>> mm-hmm.  Now, civilians died in that war.  So no matter what you do,\n>> civilians die.  What is the proper course?\n> \n> Don\'t sell the bastard arms and information in the first place.  Ruthlessly\n> hunt down those who do.  Especially if they\'re in positions of power.\n> \n\nMathew, I agree.  This, it seems, is the crux of your whole position,\nisn\'t it?  That the US shouldn\'t have supported Hussein and sold him arms\nto fight Iran?  I agree.  And I agree in ruthlessly hunting down those\nwho did or do.  But we *did* sell arms to Hussein, and it\'s a done deal.\nNow he invades Kuwait.  So do we just sit back and say, "Well, we sold\nhim all those arms, I suppose he just wants to use them now.  Too bad\nfor Kuwait."  No, unfortunately, sitting back and "letting things be"\nis not the way to correct a former mistake.  Destroying Hussein\'s\nmilitary potential as we did was the right move.  But I agree with\nyour statement, Reagan and Bush made a grave error in judgment to\nsell arms to Hussein.  So it\'s really not the Gulf War you abhor\nso much, it was the U.S.\'s and the West\'s shortsightedness in selling\narms to Hussein which ultimately made the war inevitable, right?\n\nIf so, then I agree.\n\n[more deleted.]\n> \n> mathew\n\nRegards,']

Let's wrap the generator inside an object's __iter__ method, so we can iterate over the stream multiple times:

In [8]:
class Corpus20News(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        for text in iter_20newsgroups(self.fname):
            # tokenize each message; simply lowercase & match alphabetic chars, for now
            yield list(gensim.utils.tokenize(text, lower=True))

tokenized_corpus = Corpus20News('./data/20news-bydate.tar.gz')

# print the first two tokenized messages
print(list(itertools.islice(tokenized_corpus, 2)))
[[u'timmbake', u'mcl', u'ucsb', u'edu', u'bake', u'timmons', u'writes', u'there', u'lies', u'the', u'hypocrisy', u'dude', u'atheism', u'takes', u'as', u'much', u'faith', u'as', u'theism', u'admit', u'it', u'some', u'people', u'might', u'think', u'it', u'takes', u'faith', u'to', u'be', u'an', u'atheist', u'but', u'faith', u'in', u'what', u'does', u'it', u'take', u'some', u'kind', u'of', u'faith', u'to', u'say', u'that', u'the', u'great', u'invisible', u'pink', u'unicorn', u'does', u'not', u'exist', u'does', u'it', u'take', u'some', u'kind', u'of', u'faith', u'to', u'say', u'that', u'santa', u'claus', u'does', u'not', u'exist', u'if', u'it', u'does', u'and', u'it', u'may', u'for', u'some', u'people', u'i', u'suppose', u'it', u'certainly', u'isn', u't', u'as', u'big', u'a', u'leap', u'of', u'faith', u'to', u'say', u'that', u'these', u'things', u'and', u'god', u'do', u'exist', u'i', u'suppose', u'it', u'depends', u'on', u'your', u'notion', u'and', u'definition', u'of', u'faith', u'besides', u'not', u'believing', u'in', u'a', u'god', u'means', u'one', u'doesn', u't', u'have', u'to', u'deal', u'with', u'all', u'of', u'the', u'extra', u'baggage', u'that', u'comes', u'with', u'it', u'this', u'leaves', u'a', u'person', u'feeling', u'wonderfully', u'free', u'especially', u'after', u'beaten', u'over', u'the', u'head', u'with', u'it', u'for', u'years', u'i', u'agree', u'that', u'religion', u'and', u'belief', u'is', u'often', u'an', u'important', u'psychological', u'healer', u'for', u'many', u'people', u'and', u'for', u'that', u'reason', u'i', u'think', u'it', u's', u'important', u'however', u'trying', u'to', u'force', u'a', u'psychological', u'fantasy', u'i', u'don', u't', u'mean', u'that', u'in', u'a', u'bad', u'way', u'but', u'that', u's', u'what', u'it', u'really', u'is', u'on', u'someone', u'else', u'who', u'isn', u't', u'interested', u'is', u'extremely', u'rude', u'what', u'if', u'i', u'still', u'believed', u'in', u'santa', u'claus', u'and', u'said', u'that', u'my', u'belief', u'in', u'santa', u'did', u'wonderful', u'things', u'for', u'my', u'life', u'making', u'me', u'a', u'better', u'person', u'allowing', u'me', u'to', u'live', u'without', u'guilt', u'etc', u'and', u'then', u'tried', u'to', u'get', u'you', u'to', u'believe', u'in', u'santa', u'too', u'just', u'cuz', u'he', u'did', u'so', u'much', u'for', u'me', u'you', u'd', u'call', u'the', u'men', u'in', u'white', u'coats', u'as', u'soon', u'as', u'you', u'could', u'get', u'to', u'a', u'phone', u'bake', u'timmons', u'iii'], [u'did', u'that', u'faq', u'ever', u'got', u'modified', u'to', u're', u'define', u'strong', u'atheists', u'as', u'not', u'those', u'who', u'assert', u'the', u'nonexistence', u'of', u'god', u'but', u'as', u'those', u'who', u'assert', u'that', u'they', u'believe', u'in', u'the', u'nonexistence', u'of', u'god', u'there', u'was', u'a', u'thread', u'on', u'this', u'earlier', u'but', u'i', u'didn', u't', u'get', u'the', u'outcome', u'adam', u'no', u'nickname', u'cooper']]

In [9]:
# the same two tokenized messages (not the next two!)
# each call to __iter__ "resets" the stream, by creating a new generator object internally
print(list(itertools.islice(tokenized_corpus, 2)))
[[u'timmbake', u'mcl', u'ucsb', u'edu', u'bake', u'timmons', u'writes', u'there', u'lies', u'the', u'hypocrisy', u'dude', u'atheism', u'takes', u'as', u'much', u'faith', u'as', u'theism', u'admit', u'it', u'some', u'people', u'might', u'think', u'it', u'takes', u'faith', u'to', u'be', u'an', u'atheist', u'but', u'faith', u'in', u'what', u'does', u'it', u'take', u'some', u'kind', u'of', u'faith', u'to', u'say', u'that', u'the', u'great', u'invisible', u'pink', u'unicorn', u'does', u'not', u'exist', u'does', u'it', u'take', u'some', u'kind', u'of', u'faith', u'to', u'say', u'that', u'santa', u'claus', u'does', u'not', u'exist', u'if', u'it', u'does', u'and', u'it', u'may', u'for', u'some', u'people', u'i', u'suppose', u'it', u'certainly', u'isn', u't', u'as', u'big', u'a', u'leap', u'of', u'faith', u'to', u'say', u'that', u'these', u'things', u'and', u'god', u'do', u'exist', u'i', u'suppose', u'it', u'depends', u'on', u'your', u'notion', u'and', u'definition', u'of', u'faith', u'besides', u'not', u'believing', u'in', u'a', u'god', u'means', u'one', u'doesn', u't', u'have', u'to', u'deal', u'with', u'all', u'of', u'the', u'extra', u'baggage', u'that', u'comes', u'with', u'it', u'this', u'leaves', u'a', u'person', u'feeling', u'wonderfully', u'free', u'especially', u'after', u'beaten', u'over', u'the', u'head', u'with', u'it', u'for', u'years', u'i', u'agree', u'that', u'religion', u'and', u'belief', u'is', u'often', u'an', u'important', u'psychological', u'healer', u'for', u'many', u'people', u'and', u'for', u'that', u'reason', u'i', u'think', u'it', u's', u'important', u'however', u'trying', u'to', u'force', u'a', u'psychological', u'fantasy', u'i', u'don', u't', u'mean', u'that', u'in', u'a', u'bad', u'way', u'but', u'that', u's', u'what', u'it', u'really', u'is', u'on', u'someone', u'else', u'who', u'isn', u't', u'interested', u'is', u'extremely', u'rude', u'what', u'if', u'i', u'still', u'believed', u'in', u'santa', u'claus', u'and', u'said', u'that', u'my', u'belief', u'in', u'santa', u'did', u'wonderful', u'things', u'for', u'my', u'life', u'making', u'me', u'a', u'better', u'person', u'allowing', u'me', u'to', u'live', u'without', u'guilt', u'etc', u'and', u'then', u'tried', u'to', u'get', u'you', u'to', u'believe', u'in', u'santa', u'too', u'just', u'cuz', u'he', u'did', u'so', u'much', u'for', u'me', u'you', u'd', u'call', u'the', u'men', u'in', u'white', u'coats', u'as', u'soon', u'as', u'you', u'could', u'get', u'to', u'a', u'phone', u'bake', u'timmons', u'iii'], [u'did', u'that', u'faq', u'ever', u'got', u'modified', u'to', u're', u'define', u'strong', u'atheists', u'as', u'not', u'those', u'who', u'assert', u'the', u'nonexistence', u'of', u'god', u'but', u'as', u'those', u'who', u'assert', u'that', u'they', u'believe', u'in', u'the', u'nonexistence', u'of', u'god', u'there', u'was', u'a', u'thread', u'on', u'this', u'earlier', u'but', u'i', u'didn', u't', u'get', u'the', u'outcome', u'adam', u'no', u'nickname', u'cooper']]

Text processing

Lemmatization, stemming

Lemmatization is a type of normalization that treats different inflected forms of a word as a single unit ("work", "working", "works", "worked" => same lemma: "work"):

In [10]:
import gensim

print(gensim.utils.lemmatize("worked"))
print(gensim.utils.lemmatize("working"))
print(gensim.utils.lemmatize("I was working with a working class hero."))
['work/VB']
['work/VB']
['be/VB', 'work/VB', 'working/JJ', 'class/NN', 'hero/NN']

Each token includes a part-of-speech (POS) tag: lemma/POS. Note how pronouns, prepositions and articles, such as "I", "with", or "a", have been filtered out of the result. Only word categories that traditionally carry the most meaning are left: nouns, adjectives, verbs and adverbs. Gensim uses the pattern library internally, because its lemmatization performs (much) better than alternatives such as NLTK.
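
Note that gensim.utils.lemmatize() needs the pattern library installed. If pattern is unavailable, a much cruder (but fast and dependency-light) alternative is stemming, which simply chops off suffixes by rule, with no POS tagging or filtering. A quick sketch using NLTK's Porter stemmer:

In [ ]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(word) for word in u"I was working with a working class hero".lower().split()])
# roughly: ['i', 'wa', 'work', 'with', 'a', 'work', 'class', 'hero']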

In [11]:
class Corpus20News_Lemmatize(object):
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        for message in iter_20newsgroups(self.fname):
            yield self.tokenize(message)

    def tokenize(self, text):
        """Break text into a list of lemmatized words."""
        return gensim.utils.lemmatize(text)
    
lemmatized_corpus = Corpus20News_Lemmatize('./data/20news-bydate.tar.gz')
print(list(itertools.islice(lemmatized_corpus, 2)))
[['timmbake/NN', 'mcl/NN', 'ucsb/NN', 'edu/NN', 'bake/JJ', 'timmon/NN', 'write/VB', 'there/RB', 'lie/VB', 'hypocrisy/NN', 'dude/NN', 'atheism/NN', 'take/VB', 'as/RB', 'much/JJ', 'faith/NN', 'theism/NN', 'admit/VB', 'person/NN', 'think/VB', 'take/VB', 'faith/NN', 'be/VB', 'atheist/JJ', 'faith/NN', 'do/VB', 'take/VB', 'kind/NN', 'faith/NN', 'say/VB', 'great/JJ', 'invisible/JJ', 'pink/JJ', 'unicorn/NN', 'do/VB', 'not/RB', 'exist/NN', 'do/VB', 'take/VB', 'kind/NN', 'faith/NN', 'say/VB', 'santa/NN', 'claus/NN', 'do/VB', 'not/RB', 'exist/VB', 'do/VB', 'person/NN', 'suppose/VB', 'certainly/RB', 'isn/JJ', 'big/JJ', 'leap/NN', 'faith/NN', 'say/VB', 'thing/NN', 'god/NN', 'do/VB', 'exist/VB', 'suppose/VB', 'depend/VB', 'notion/NN', 'definition/NN', 'faith/NN', 'not/RB', 'believe/VB', 'god/NN', 'mean/VB', 'doesn/JJ', 'have/VB', 'deal/VB', 'extra/JJ', 'baggage/NN', 'come/VB', 'leave/VB', 'person/NN', 'feel/VB', 'wonderfully/RB', 'free/JJ', 'especially/RB', 'beaten/JJ', 'head/NN', 'year/NN', 'agree/VB', 'religion/NN', 'belief/NN', 'be/VB', 'often/RB', 'important/JJ', 'psychological/JJ', 'healer/NN', 'many/JJ', 'person/NN', 'reason/NN', 'think/VB', 'important/JJ', 'however/RB', 'try/VB', 'force/VB', 'psychological/JJ', 'fantasy/NN', 'don/VB', 'mean/VB', 'bad/JJ', 'way/NN', 'really/RB', 'be/VB', 'someone/NN', 'else/RB', 'isn/VB', 'interested/JJ', 'be/VB', 'extremely/RB', 'rude/JJ', 'still/RB', 'believe/VB', 'santa/NN', 'claus/NN', 'say/VB', 'belief/NN', 'santa/NN', 'do/VB', 'wonderful/JJ', 'thing/NN', 'life/NN', 'make/VB', 'better/JJ', 'person/NN', 'allow/VB', 'live/VB', 'guilt/NN', 'etc/NN', 'then/RB', 'try/VB', 'get/VB', 'believe/VB', 'santa/NN', 'too/RB', 'just/RB', 'do/VB', 'so/RB', 'much/JJ', 'call/VB', 'man/NN', 'white/JJ', 'coat/NN', 'soon/RB', 'get/VB', 'phone/NN', 'bake/JJ', 'timmon/NN', 'iii/NN'], ['do/VB', 'faq/NN', 'ever/RB', 'get/VB', 'modify/VB', 're/NN', 'define/VB', 'strong/JJ', 'atheist/NN', 'not/RB', 'assert/VB', 'nonexistence/NN', 'god/NN', 'assert/VB', 'believe/VB', 'nonexistence/NN', 'god/NN', 'be/VB', 'thread/NN', 'earlier/RB', 'didn/VB', 'get/VB', 'outcome/NN', 'nickname/NN', 'cooper/NN']]

Exercise (10 min): Modify tokenize() to ignore (= not return) generic words such as "do", "then", "be", "as"... These are called stopwords, and we may want to remove them because some topic modeling algorithms are sensitive to their presence. A common stopword set for English is available via from gensim.parsing.preprocessing import STOPWORDS.

Collocations and Named Entity Recognition

A collocation is a "sequence of words or terms that co-occur more often than would be expected by chance."

Named entity recognition (NER) is the task of locating chunks of text that refer to people, locations, organizations etc.

Detecting collocations and named entities often has significant business value: "General Electric" stays a single entity (token), rather than the two words "general" and "electric". The same goes for "Marathon Petroleum", "George Bush" etc. -- this way, a topic model doesn't mix up its topics via words shared by unrelated entities, such as "North Korea" and "North Carolina" both contributing the word "north".

In [12]:
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

def best_ngrams(words, top_n=1000, min_freq=100):
    """
    Extract `top_n` most salient collocations (bigrams and trigrams),
    from a stream of words. Ignore collocations with frequency
    lower than `min_freq`.

    This fnc uses NLTK for the collocation detection itself -- not very scalable!

    Return the detected ngrams as compiled regular expressions, for their faster
    detection later on.

    """
    tcf = TrigramCollocationFinder.from_words(words)
    tcf.apply_freq_filter(min_freq)
    trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]
    logging.info("%i trigrams found: %s..." % (len(trigrams), trigrams[:20]))

    bcf = tcf.bigram_finder()
    bcf.apply_freq_filter(min_freq)
    bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]
    logging.info("%i bigrams found: %s..." % (len(bigrams), bigrams[:20]))

    pat_gram2 = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)
    pat_gram3 = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)

    return pat_gram2, pat_gram3
In [13]:
from gensim.parsing.preprocessing import STOPWORDS

class Corpus20News_Collocations(object):
    def __init__(self, fname):
        self.fname = fname
        logging.info("collecting ngrams from %s" % self.fname)
        # generator of documents; one element = list of words
        documents = (self.split_words(text) for text in iter_20newsgroups(self.fname, log_every=1000))
        # generator: concatenate (chain) all words into a single sequence, lazily
        words = itertools.chain.from_iterable(documents)
        self.bigrams, self.trigrams = best_ngrams(words)

    def split_words(self, text, stopwords=STOPWORDS):
        """
        Break text into a list of single words. Ignore any token that falls into
        the `stopwords` set, as well as tokens shorter than 4 characters.

        """
        return [word
                for word in gensim.utils.tokenize(text, lower=True)
                if word not in stopwords and len(word) > 3]

    def tokenize(self, message):
        """
        Break text (string) into a list of Unicode tokens.
        
        The resulting tokens can be longer phrases (collocations) too,
        e.g. `new_york`, `real_estate` etc.

        """
        text = u' '.join(self.split_words(message))
        text = re.sub(self.trigrams, lambda match: match.group(0).replace(u' ', u'_'), text)
        text = re.sub(self.bigrams, lambda match: match.group(0).replace(u' ', u'_'), text)
        return text.split()

    def __iter__(self):
        for message in iter_20newsgroups(self.fname):
            yield self.tokenize(message)

%time collocations_corpus = Corpus20News_Collocations('./data/20news-bydate.tar.gz')
print(list(itertools.islice(collocations_corpus, 2)))
INFO:root:collecting ngrams from ./data/20news-bydate.tar.gz
INFO:root:extracting 20newsgroups file #0: 20news-bydate-test/alt.atheism/53265
INFO:root:extracting 20newsgroups file #1000: 20news-bydate-test/comp.os.ms-windows.misc/10790
INFO:root:extracting 20newsgroups file #2000: 20news-bydate-test/comp.windows.x/67509
INFO:root:extracting 20newsgroups file #3000: 20news-bydate-test/rec.autos/103718
INFO:root:extracting 20newsgroups file #4000: 20news-bydate-test/rec.sport.hockey/54204
INFO:root:extracting 20newsgroups file #5000: 20news-bydate-test/sci.electronics/54349
INFO:root:extracting 20newsgroups file #6000: 20news-bydate-test/soc.religion.christian/21580
INFO:root:extracting 20newsgroups file #7000: 20news-bydate-test/talk.politics.misc/178645
INFO:root:extracting 20newsgroups file #8000: 20news-bydate-train/alt.atheism/53440
INFO:root:extracting 20newsgroups file #9000: 20news-bydate-train/comp.os.ms-windows.misc/9812
INFO:root:extracting 20newsgroups file #10000: 20news-bydate-train/comp.sys.mac.hardware/51599
INFO:root:extracting 20newsgroups file #11000: 20news-bydate-train/misc.forsale/74758
INFO:root:extracting 20newsgroups file #12000: 20news-bydate-train/rec.autos/102769
INFO:root:extracting 20newsgroups file #13000: 20news-bydate-train/rec.sport.baseball/104515
INFO:root:extracting 20newsgroups file #14000: 20news-bydate-train/sci.crypt/15228
INFO:root:extracting 20newsgroups file #15000: 20news-bydate-train/sci.electronics/53869
INFO:root:extracting 20newsgroups file #16000: 20news-bydate-train/sci.space/60951
INFO:root:extracting 20newsgroups file #17000: 20news-bydate-train/talk.politics.guns/54119
INFO:root:extracting 20newsgroups file #18000: 20news-bydate-train/talk.politics.mideast/76353
INFO:root:12 trigrams found: [u'serdar argic article', u'newsletter page volume', u'medical newsletter page', u'hicnet medical newsletter', u'page volume number', u'volume number april', u'center policy research', u'magnus ohio state', u'article athos rutgers', u'article geneva rutgers', u'article news uiuc', u'writes article news']...
INFO:root:104 bigrams found: [u'midway uchicago', u'serdar argic', u'cleveland freenet', u'magnus ohio', u'clayton cramer', u'athos rutgers', u'geneva rutgers', u'hicnet medical', u'export contrib', u'newsletter page', u'medical newsletter', u'greatly appreciated', u'access digex', u'holy spirit', u'bear arms', u'united states', u'usenet cwru', u'david sternlight', u'attorney general', u'public domain']...

CPU times: user 1min 3s, sys: 1.1 s, total: 1min 4s
Wall time: 1min 3s
[[u'timmbake', u'ucsb', u'bake', u'timmons', u'writes', u'lies', u'hypocrisy', u'dude', u'atheism', u'takes', u'faith', u'theism', u'admit', u'people_think', u'takes', u'faith', u'atheist', u'faith', u'kind', u'faith', u'great', u'invisible', u'pink', u'unicorn', u'exist', u'kind', u'faith', u'santa', u'claus', u'exist', u'people', u'suppose', u'certainly', u'leap', u'faith', u'things', u'exist', u'suppose', u'depends', u'notion', u'definition', u'faith', u'believing', u'means', u'deal', u'extra', u'baggage', u'comes', u'leaves', u'person', u'feeling', u'wonderfully', u'free', u'especially', u'beaten', u'head', u'years', u'agree', u'religion', u'belief', u'important', u'psychological', u'healer', u'people', u'reason', u'think', u'important', u'trying', u'force', u'psychological', u'fantasy', u'mean', u'interested', u'extremely', u'rude', u'believed', u'santa', u'claus', u'said', u'belief', u'santa', u'wonderful', u'things', u'life', u'making', u'better', u'person', u'allowing', u'live', u'guilt', u'tried', u'believe', u'santa', u'white', u'coats', u'soon', u'phone', u'bake', u'timmons'], [u'modified', u'define', u'strong', u'atheists', u'assert', u'nonexistence', u'assert', u'believe', u'nonexistence', u'thread', u'earlier', u'didn', u'outcome', u'adam', u'nickname', u'cooper']]

Instead of detecting collocations with co-occurrence statistics, we can run (shallow) syntactic parsing: tag each word with its part-of-speech (POS) category and suggest phrases based on "noun phrase" chunks:

In [14]:
from textblob import TextBlob

def head(stream, n=10):
    """Convenience fnc: return the first `n` elements of the stream, as plain list."""
    return list(itertools.islice(stream, n))

def best_phrases(document_stream, top_n=1000, prune_at=50000):
    """Return a set of `top_n` most common noun phrases."""
    np_counts = {}
    for docno, doc in enumerate(document_stream):
        # prune out infrequent phrases from time to time, to save RAM.
        # the result may not be completely accurate because of this step
        if docno % 1000 == 0:
            sorted_phrases = sorted(np_counts.iteritems(), key=lambda item: -item[1])
            np_counts = dict(sorted_phrases[:prune_at])
            logging.info("at document #%i, considering %i phrases: %s..." %
                         (docno, len(np_counts), head(sorted_phrases)))
        
        # how many times have we seen each noun phrase?
        for np in TextBlob(doc).noun_phrases:
            # only consider multi-word noun phrases as candidate entities
            if u' ' not in np:
                continue
            # ignore phrases that contain too short/non-alphabetic words
            if all(word.isalpha() and len(word) > 2 for word in np.split()):
                np_counts[np] = np_counts.get(np, 0) + 1

    sorted_phrases = sorted(np_counts, key=lambda np: -np_counts[np])
    return set(head(sorted_phrases, top_n))
In [15]:
class Corpus20News_NE(object):
    def __init__(self, fname):
        self.fname = fname
        logging.info("collecting entities from %s" % self.fname)
        doc_stream = itertools.islice(iter_20newsgroups(self.fname), 10000)
        self.entities = best_phrases(doc_stream)
        logging.info("selected %i entities: %s..." %
                     (len(self.entities), list(self.entities)[:10]))

    def __iter__(self):
        for message in iter_20newsgroups(self.fname):
            yield self.tokenize(message)

    def tokenize(self, message, stopwords=STOPWORDS):
        """
        Break text (string) into a list of Unicode tokens.
        
        The resulting tokens can be longer phrases (named entities) too,
        e.g. `new_york`, `real_estate` etc.

        """
        result = []
        for np in TextBlob(message).noun_phrases:
            if u' ' in np and np not in self.entities:
                # for multi-word phrases, only keep those we detected in the constructor
                continue
            token = u'_'.join(part for part in gensim.utils.tokenize(np) if len(part) > 2)
            if len(token) < 4 or token in stopwords:
                # ignore very short phrases and stop words
                continue
            result.append(token)
        return result

%time ne_corpus = Corpus20News_NE('./data/20news-bydate.tar.gz')
print(head(ne_corpus, 5))
INFO:root:collecting entities from ./data/20news-bydate.tar.gz
INFO:root:at document #0, considering 0 phrases: []...
INFO:root:at document #1000, considering 6928 phrases: [(u'jon livesey', 29), (u'bill conner', 28), (u'mirror sites', 26), (u'current version', 25), (u'source code', 24), (u'anonymous ftp', 24), (u'image processing', 22), (u'gamma correction', 20), (u'image quality', 18), (u'mike cobb', 17)]...
INFO:root:at document #2000, considering 10190 phrases: [(u'hard drive', 34), (u'anonymous ftp', 33), (u'jon livesey', 29), (u'bill conner', 28), (u'mirror sites', 28), (u'hard disk', 27), (u'source code', 26), (u'current version', 25), (u'image processing', 22), (u'image quality', 22)]...
INFO:root:at document #3000, considering 14454 phrases: [(u'hard drive', 52), (u'anonymous ftp', 44), (u'ethernet card', 33), (u'source code', 32), (u'hard disk', 31), (u'network software', 30), (u'latest version', 30), (u'jon livesey', 29), (u'current version', 29), (u'bill conner', 28)]...
INFO:root:at document #4000, considering 18289 phrases: [(u'hard drive', 52), (u'anonymous ftp', 44), (u'previous article', 35), (u'ethernet card', 33), (u'source code', 32), (u'hard disk', 31), (u'network software', 30), (u'latest version', 30), (u'jon livesey', 29), (u'current version', 29)]...
INFO:root:at document #5000, considering 22672 phrases: [(u'david sternlight', 56), (u'hard drive', 55), (u'anonymous ftp', 47), (u'source code', 46), (u'previous article', 41), (u'clipper chip', 40), (u'hard disk', 38), (u'ethernet card', 33), (u'network software', 30), (u'latest version', 30)]...
INFO:root:at document #6000, considering 30154 phrases: [(u'david sternlight', 56), (u'newsletter page', 55), (u'hard drive', 55), (u'anonymous ftp', 53), (u'source code', 46), (u'previous article', 45), (u'solar system', 43), (u'long time', 40), (u'clipper chip', 40), (u'hard disk', 38)]...
INFO:root:at document #7000, considering 37407 phrases: [(u'previous article', 66), (u'david sternlight', 61), (u'serdar argic', 57), (u'newsletter page', 55), (u'hard drive', 55), (u'anonymous ftp', 53), (u'long time', 52), (u'los angeles', 47), (u'source code', 46), (u'bosnian muslims', 44)]...
INFO:root:at document #8000, considering 42841 phrases: [(u'jon livesey', 89), (u'david koresh', 86), (u'previous article', 77), (u'long time', 62), (u'david sternlight', 61), (u'clayton cramer', 59), (u'jesus christ', 58), (u'serdar argic', 57), (u'newsletter page', 55), (u'hard drive', 55)]...
INFO:root:at document #9000, considering 45451 phrases: [(u'jon livesey', 89), (u'previous article', 89), (u'david koresh', 87), (u'anonymous ftp', 71), (u'source code', 70), (u'long time', 68), (u'hard drive', 62), (u'david sternlight', 61), (u'clayton cramer', 59), (u'jesus christ', 58)]...
INFO:root:selected 1000 entities: [u'red book', u'john franjione', u'image enhancement', u'mark pundurs', u'jody levine', u'serious problem', u'harry mamaysky', u'homosexual behavior', u'dorothy denning', u'bosnian muslims']...

CPU times: user 3min 6s, sys: 1.01 s, total: 3min 7s
Wall time: 3min 7s
[[u'timmons', u'atheism', u'admit', u'santa', u'santa', u'timmons'], [u'strong_atheists', u'believe', u'adam', u'nickname', u'cooper'], [u'gregg_jaeger', u'robert', u'beauchaine', u'bennett', u'neil', u'bcci', u'koran', u'gregg', u'gregg', u'bcci', u'islamic', u'bcci', u'islamic', u'gregg', u'islam', u'bcci', u'bcci', u'islamic', u'doubtless', u'right', u'moslems'], [u'waco', u'sadly', u'faith', u'religion', u'christians', u'christianity', u'christianity', u'jesus', u'christ', u'bible', u'similarly', u'webster', u'real', u'different_language', u'granted', u'religion', u'religion', u'sure', u'bad_things', u'good_things', u'todd', u'toronto'], [u'mark_mccullough', u'aaaahhh', u'ruthlessly', u'especially', u'mathew', u'hussein', u'iran', u'hussein', u'kuwait', u'kuwait', u'reagan', u'bush', u'hussein', u'gulf_war', u'hussein', u'regards']]

Too much work?

What we've just done, cleaning up raw input as the first step to more advanced processing, is actually the most "varied" and the most challenging part of building up machine learning pipelines (along with the last step on the other end of the pipeline: evaluation).

You typically have to know what overall goal you're trying to achieve to choose the correct preprocessing+evaluation approach. There is no "one best way" to preprocess text -- different applications require different steps, all the way down to custom tokenizers and lemmatizers. This is especially true for other (non-English) languages. Always log liberally and check the output coming out of your pipeline at various steps, to spot potential unforeseen problems.
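
As a tiny illustration of a custom tokenizer (purely a sketch): for data like these newsgroup messages, we might want to keep e-mail addresses together as single tokens, instead of shredding them into meaningless fragments:

In [ ]:
import re

# one token = an e-mail address, or a plain alphabetic word
TOKEN_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+|[a-zA-Z]+", re.UNICODE)

def custom_tokenize(text):
    """Lowercase and tokenize `text`, keeping e-mail addresses as single tokens."""
    return TOKEN_RE.findall(text.lower())

print(custom_tokenize(u"John Doe <[email protected]> writes about faith and reason"))
# roughly: ['john', 'doe', '[email protected]', 'writes', 'about', 'faith', 'and', 'reason']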

Now that we have the data in a common format & ready to be vectorized, subsequent notebooks will be more straightforward & run-of-the-mill.

Summary

Flow of data preparation:

  • extract a text stream from (raw) data
  • clean up texts, depending on business logic
  • break sanitized text into features of interest: words, collocations, detect named entities...
  • keep the data flow streamed and flexible
  • sprinkle code with sanity prints & checks (DEBUG/INFO logs)

Next

In the next notebook, we'll learn how to plug such preprocessed data streams into gensim, a library for topic modeling and information retrieval.

Continue by opening the next IPython notebook, 2 - Topic Modeling.