In this notebook: cleaning up raw 20newsgroups messages, streaming data with generators, tokenization, lemmatization, and detecting collocations and named entities.
# import and setup modules we'll be using in this notebook
import logging
import os
import sys
import re
import tarfile
import itertools
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
import gensim
from gensim.parsing.preprocessing import STOPWORDS
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO # ipython sometimes messes up the logging setup; restore
Previously, we downloaded the 20newsgroups dataset and left it under ./data/ as a gzipped tarball:
!ls -l ./data/20news-bydate.tar.gz
Let's use this file for some topic modeling. Instead of decompressing the archive to disk, let's access its files directly from Python:
with tarfile.open('./data/20news-bydate.tar.gz', 'r:gz') as tf:
# get information (metadata) about all files in the tarball
file_infos = [file_info for file_info in tf if file_info.isfile()]
# print one of them; for example, the first one
message = tf.extractfile(file_infos[0]).read()
print(message)
This text is typical of real-world data. It contains a mix of relevant text, metadata (email headers), and downright noise. Even its relevant content is unstructured, with email addresses, people's names, quotations etc.
Most machine learning methods, topic modeling included, are only as good as the data you give them. At this point, we generally want to clean the data as much as possible. While the subsequent steps in the machine learning pipeline are more or less automated, handling the raw data should reflect the intended purpose of the application: its business logic, idiosyncrasies, sanity checks (aren't we accidentally receiving and parsing image data instead of plain text?). As always with automated processing, it's garbage in, garbage out.
As an example, let's write a function that aims to extract only the chunk of relevant text from each message, ignoring email headers:
def process_message(message):
"""
Preprocess a single 20newsgroups message, returning the result as
a unicode string.
"""
message = gensim.utils.to_unicode(message, 'latin1').strip()
blocks = message.split(u'\n\n')
    # skip email headers (the first block); footers are handled in the exercise below
content = u'\n\n'.join(blocks[1:])
return content
print(process_message(message))
Feel free to modify this function and test out other ideas for clean up. The flexibility Python gives you in processing text is superb -- it'd be a crime to hide the processing behind opaque APIs, exposing only one or two tweakable parameters.
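For instance, here is a rough variant (just an illustration, not used in the rest of this notebook) that additionally drops quoted reply lines -- lines starting with ">", which are very common in newsgroup messages:
def process_message_noquotes(message):
    """Like process_message(), but also drop quoted reply lines (those starting with '>')."""
    message = gensim.utils.to_unicode(message, 'latin1').strip()
    blocks = message.split(u'\n\n')
    # skip email headers (the first block), as in process_message()
    content = u'\n\n'.join(blocks[1:])
    # drop quoted reply lines
    lines = [line for line in content.split(u'\n') if not line.lstrip().startswith(u'>')]
    return u'\n'.join(lines)
print(process_message_noquotes(message))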
There are a handful of handy Python libraries for text cleanup: jusText removes HTML boilerplate and extracts "main text" of a web page. NLTK, Pattern and TextBlob are good for tokenization, POS tagging, sentence splitting and generic NLP, with a nice Pythonic interface. None of them scales very well though, so keep the inputs small.
Exercise (5 min): Modify the process_message
function to ignore message footers, too.
It's good practice to inspect your data visually at each point as it passes through your data processing pipeline. Simply printing (logging) a few arbitrary entries, à la UNIX head, does wonders for spotting unexpected bugs. Oh, bad encoding! What is Chinese doing there, we were told all texts are English only? Do these rubbish tokens come from embedded images? How come everything's empty? The attitude of "hey, let's just tokenize the text into a bag of words blindly, like they do in the tutorials, push it through this magical unicorn machine learning library and hope for the best" is ill advised.
Another good practice is to keep internal strings as Unicode, and only encode/decode on IO (preferably using UTF8). As of Python 3.3, there is practically no memory penalty for using Unicode over UTF8 byte strings (PEP 393).
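A minimal sketch of that discipline (the byte string and output path below are made up purely for illustration):
raw_bytes = b'caf\xc3\xa9 latte'                 # hypothetical UTF8-encoded input bytes
text = raw_bytes.decode('utf8')                  # decode to unicode as early as possible
processed = text.upper()                         # all internal processing works on unicode
with open('./data/example_output.txt', 'wb') as fout:
    fout.write(processed.encode('utf8'))         # encode back to UTF8 only at the IO boundary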
Let's write a function to go over all messages in the 20newsgroups archive:
def iter_20newsgroups(fname, log_every=None):
"""
Yield plain text of each 20 newsgroups message, as a unicode string.
The messages are read from raw tar.gz file `fname` on disk (e.g. `./data/20news-bydate.tar.gz`)
"""
extracted = 0
with tarfile.open(fname, 'r:gz') as tf:
for file_number, file_info in enumerate(tf):
if file_info.isfile():
if log_every and extracted % log_every == 0:
logging.info("extracting 20newsgroups file #%i: %s" % (extracted, file_info.name))
content = tf.extractfile(file_info).read()
yield process_message(content)
extracted += 1
This uses the process_message() function we wrote above to process each message in turn. The messages are extracted on-the-fly, one after another, using a generator.
Such data streaming is a very important pattern: real data is typically too large to fit into RAM, and we don't need all of it in RAM at the same time anyway -- that's just wasteful. With streamed data, we can process arbitrarily large input, reading the data from a file on disk, SQL database, shared network disk, or even more exotic remote network protocols.
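The same pattern works for any source. As a tiny illustration (the file path here is hypothetical), streaming one document per line from a plain text file looks like this:
def iter_lines(fname):
    """Yield one stripped line (= one document) at a time; the file is never loaded into RAM whole."""
    with open(fname) as fin:
        for line in fin:
            yield line.strip()
The 20newsgroups stream above follows exactly this shape, just reading from a tarball instead of a plain text file.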
# itertools is an inseparable friend with data streaming (Python built-in library)
import itertools
# let's only parse and print the first three messages, lazily
# `list(stream)` materializes the stream elements into a plain Python list
message_stream = iter_20newsgroups('./data/20news-bydate.tar.gz', log_every=2)
print(list(itertools.islice(message_stream, 3)))
Note: Data streaming saves memory, but does nothing to save time. A common pattern for speeding up data processing is parallelization, via Python's multi-processing and multi-threading support. We don't have time to cover processing parallelization or cluster distribution in this tutorial. For an example, see this Wikipedia parsing code from my Similarity Shootout benchmark.
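Just to give a flavour, here is a rough sketch (not part of this tutorial's pipeline; it reuses iter_20newsgroups() and gensim.utils.tokenize, which we also use below) of farming the per-message tokenization out to worker processes:
import multiprocessing

def tokenize_message(text):
    # a plain module-level helper, so it can be shipped to worker processes
    return list(gensim.utils.tokenize(text, lower=True))

def iter_tokenized_parallel(fname, processes=4, chunksize=10):
    """Yield tokenized 20newsgroups messages, tokenizing in `processes` worker processes.
    Note: works best on Unix, where child processes are forked."""
    pool = multiprocessing.Pool(processes)
    try:
        # imap keeps the streaming nature: results come back lazily, in input order
        for tokens in pool.imap(tokenize_message, iter_20newsgroups(fname), chunksize):
            yield tokens
    finally:
        pool.terminate()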
A generator only gives us a single pass through the data:
# print the next two messages; the three messages printed above are already gone
print(list(itertools.islice(message_stream, 2)))
Let's wrap the generator inside an object's __iter__
method, so we can iterate over the stream multiple times:
class Corpus20News(object):
def __init__(self, fname):
self.fname = fname
def __iter__(self):
for text in iter_20newsgroups(self.fname):
# tokenize each message; simply lowercase & match alphabetic chars, for now
yield list(gensim.utils.tokenize(text, lower=True))
tokenized_corpus = Corpus20News('./data/20news-bydate.tar.gz')
# print the first two tokenized messages
print(list(itertools.islice(tokenized_corpus, 2)))
# the same two tokenized messages (not the next two!)
# each call to __iter__ "resets" the stream, by creating a new generator object internally
print(list(itertools.islice(tokenized_corpus, 2)))
Lemmatization is a type of normalization that treats different inflected forms of a word as a single unit ("work", "working", "works", "worked" => same lemma: "work"):
import gensim
print(gensim.utils.lemmatize("worked"))
print(gensim.utils.lemmatize("working"))
print(gensim.utils.lemmatize("I was working with a working class hero."))
There's a part of speech (POS) tag included in each token: lemma/POS. Note how articles and prepositions, such as "The", "a", or "over", have been filtered out from the result. Only word categories that traditionally carry the most meaning, such as nouns, adjectives and verbs, are left. Gensim uses the pattern library internally, because its lemmatization performs (much) better than alternatives such as NLTK.
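As a rough sketch of working with those tags (not something the tutorial itself needs), we can filter the lemma/POS output down to nouns only:
tagged = gensim.utils.lemmatize("I was working with a working class hero.")
# keep only tokens whose POS part (after the '/') marks a noun (NN, NNS, ...)
nouns = [gensim.utils.to_unicode(token) for token in tagged
         if gensim.utils.to_unicode(token).split(u'/')[-1].startswith(u'NN')]
print(nouns)
With that aside done, let's wrap lemmatization into a restartable corpus class, just like before: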
class Corpus20News_Lemmatize(object):
def __init__(self, fname):
self.fname = fname
def __iter__(self):
for message in iter_20newsgroups(self.fname):
yield self.tokenize(message)
def tokenize(self, text):
"""Break text into a list of lemmatized words."""
return gensim.utils.lemmatize(text)
lemmatized_corpus = Corpus20News_Lemmatize('./data/20news-bydate.tar.gz')
print(list(itertools.islice(lemmatized_corpus, 2)))
Exercise (10 min): Modify tokenize() to ignore (= not return) generic words, such as "do", "then", "be", "as"... These are called stopwords and we may want to remove them because some topic modeling algorithms are sensitive to their presence. An example set of common English stopwords is available via from gensim.parsing.preprocessing import STOPWORDS.
A collocation is a "sequence of words or terms that co-occur more often than would be expected by chance."
Named entity recognition (NER) is the task of locating chunks of text that refer to people, locations, organizations etc.
Detecting collocations and named entities often has significant business value: "General Electric" stays a single entity (token), rather than two unrelated words "general" and "electric". The same goes for "Marathon Petroleum", "George Bush" etc. -- this way a topic model doesn't confuse its topics via words coming from unrelated entities, such as mixing "Korea" with "Carolina" just because both co-occur with "North".
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
def best_ngrams(words, top_n=1000, min_freq=100):
"""
Extract `top_n` most salient collocations (bigrams and trigrams),
from a stream of words. Ignore collocations with frequency
lower than `min_freq`.
    This function uses NLTK for the collocation detection itself -- not very scalable!
Return the detected ngrams as compiled regular expressions, for their faster
detection later on.
"""
tcf = TrigramCollocationFinder.from_words(words)
tcf.apply_freq_filter(min_freq)
trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]
logging.info("%i trigrams found: %s..." % (len(trigrams), trigrams[:20]))
bcf = tcf.bigram_finder()
bcf.apply_freq_filter(min_freq)
bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]
logging.info("%i bigrams found: %s..." % (len(bigrams), bigrams[:20]))
pat_gram2 = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)
pat_gram3 = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)
return pat_gram2, pat_gram3
from gensim.parsing.preprocessing import STOPWORDS
class Corpus20News_Collocations(object):
def __init__(self, fname):
self.fname = fname
logging.info("collecting ngrams from %s" % self.fname)
# generator of documents; one element = list of words
documents = (self.split_words(text) for text in iter_20newsgroups(self.fname, log_every=1000))
# generator: concatenate (chain) all words into a single sequence, lazily
words = itertools.chain.from_iterable(documents)
self.bigrams, self.trigrams = best_ngrams(words)
def split_words(self, text, stopwords=STOPWORDS):
"""
Break text into a list of single words. Ignore any token that falls into
the `stopwords` set.
"""
return [word
for word in gensim.utils.tokenize(text, lower=True)
                if word not in stopwords and len(word) > 3]
def tokenize(self, message):
"""
Break text (string) into a list of Unicode tokens.
The resulting tokens can be longer phrases (collocations) too,
e.g. `new_york`, `real_estate` etc.
"""
text = u' '.join(self.split_words(message))
text = re.sub(self.trigrams, lambda match: match.group(0).replace(u' ', u'_'), text)
text = re.sub(self.bigrams, lambda match: match.group(0).replace(u' ', u'_'), text)
return text.split()
def __iter__(self):
for message in iter_20newsgroups(self.fname):
yield self.tokenize(message)
%time collocations_corpus = Corpus20News_Collocations('./data/20news-bydate.tar.gz')
print(list(itertools.islice(collocations_corpus, 2)))
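As an aside (not part of the original NLTK-based pipeline, and depending on your gensim version), gensim itself ships a streamed collocation detector, gensim.models.Phrases, which scales to much larger corpora than the in-memory NLTK finder above:
from gensim.models import Phrases

# train on the plain tokenized stream defined earlier; a single pass over the data
bigram_model = Phrases(Corpus20News('./data/20news-bydate.tar.gz'), min_count=5, threshold=10.0)
# transforming a tokenized message joins detected collocations with '_'
print(bigram_model[[u'new', u'york', u'is', u'a', u'city']])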
Instead of detecting collocations by frequency, we can run (shallow) syntactic parsing. This tags each word with its part-of-speech (POS) category, and suggests phrases based on chunks of "noun phrases":
from textblob import TextBlob
def head(stream, n=10):
"""Convenience fnc: return the first `n` elements of the stream, as plain list."""
return list(itertools.islice(stream, n))
def best_phrases(document_stream, top_n=1000, prune_at=50000):
"""Return a set of `top_n` most common noun phrases."""
np_counts = {}
for docno, doc in enumerate(document_stream):
# prune out infrequent phrases from time to time, to save RAM.
# the result may not be completely accurate because of this step
if docno % 1000 == 0:
            sorted_phrases = sorted(np_counts.items(), key=lambda item: -item[1])
np_counts = dict(sorted_phrases[:prune_at])
logging.info("at document #%i, considering %i phrases: %s..." %
(docno, len(np_counts), head(sorted_phrases)))
# how many times have we seen each noun phrase?
for np in TextBlob(doc).noun_phrases:
            # only consider multi-word noun phrases (those containing a space)
if u' ' not in np:
continue
# ignore phrases that contain too short/non-alphabetic words
if all(word.isalpha() and len(word) > 2 for word in np.split()):
np_counts[np] = np_counts.get(np, 0) + 1
sorted_phrases = sorted(np_counts, key=lambda np: -np_counts[np])
return set(head(sorted_phrases, top_n))
class Corpus20News_NE(object):
def __init__(self, fname):
self.fname = fname
logging.info("collecting entities from %s" % self.fname)
doc_stream = itertools.islice(iter_20newsgroups(self.fname), 10000)
self.entities = best_phrases(doc_stream)
logging.info("selected %i entities: %s..." %
(len(self.entities), list(self.entities)[:10]))
def __iter__(self):
for message in iter_20newsgroups(self.fname):
yield self.tokenize(message)
def tokenize(self, message, stopwords=STOPWORDS):
"""
Break text (string) into a list of Unicode tokens.
The resulting tokens can be longer phrases (named entities) too,
e.g. `new_york`, `real_estate` etc.
"""
result = []
for np in TextBlob(message).noun_phrases:
if u' ' in np and np not in self.entities:
# only consider multi-word phrases we detected in the constructor
continue
token = u'_'.join(part for part in gensim.utils.tokenize(np) if len(part) > 2)
if len(token) < 4 or token in stopwords:
# ignore very short phrases and stop words
continue
result.append(token)
return result
%time ne_corpus = Corpus20News_NE('./data/20news-bydate.tar.gz')
print(head(ne_corpus, 5))
What we've just done, cleaning up raw input as the first step to more advanced processing, is actually the most "varied" and the most challenging part of building up machine learning pipelines (along with the last step on the other end of the pipeline: evaluation).
You typically have to know what overall goal you're trying to achieve to choose the correct preprocessing+evaluation approach. There is no "one best way" to preprocess text -- different applications require different steps, all the way down to custom tokenizers and lemmatizers. This is especially true for other (non-English) languages. Always log liberally and check the output coming out of your pipeline at various steps, to spot potential unforeseen problems.
Now that we have the data in a common format & ready to be vectorized, the subsequent notebooks will be more straightforward & run-of-the-mill.
Flow of data preparation:
In the next notebook, we'll learn how to plug such preprocessed data streams into gensim, a library for topic modeling and information retrieval.
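As a tiny preview (covered properly there), such a restartable stream of token lists can be fed to gensim directly, for example to build the word <=> id mapping used by its models:
id2word = gensim.corpora.Dictionary(lemmatized_corpus)
print(id2word)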
Continue by opening the next IPython notebook, 2 - Topic Modeling.