In the previous notebook, 1 - Streamed Corpora, we used the 20newsgroups corpus to demonstrate data preprocessing and streaming. In this notebook, we'll switch to the English Wikipedia and do some topic modeling.
# import and setup modules we'll be using in this notebook
import logging
import itertools
import numpy as np
import gensim
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO # ipython sometimes messes up the logging setup; restore
def head(stream, n=10):
    """Convenience function: return the first `n` elements of the stream, as a plain list."""
    return list(itertools.islice(stream, n))
Let's use the now-familiar pattern of streaming over an entire Wikipedia dump, without unzipping the raw file:
from gensim.utils import smart_open, simple_preprocess
from gensim.corpora.wikicorpus import _extract_pages, filter_wiki
from gensim.parsing.preprocessing import STOPWORDS
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
def iter_wiki(dump_file):
    """Yield each article from the Wikipedia dump, as a `(title, tokens)` 2-tuple."""
    ignore_namespaces = 'Wikipedia Category File Portal Template MediaWiki User Help Book Draft'.split()
    for title, text, pageid in _extract_pages(smart_open(dump_file)):
        text = filter_wiki(text)
        tokens = tokenize(text)
        if len(tokens) < 50 or any(title.startswith(ns + ':') for ns in ignore_namespaces):
            continue  # ignore short articles and various meta-articles
        yield title, tokens
# only use simplewiki in this tutorial (fewer documents)
# the full wiki dump is exactly the same format, but larger
stream = iter_wiki('./data/simplewiki-20140623-pages-articles.xml.bz2')
for title, tokens in itertools.islice(iter_wiki('./data/simplewiki-20140623-pages-articles.xml.bz2'), 8):
    print(title, tokens[:10])  # print the article title and its first ten tokens
Dictionaries are objects that map between raw text tokens (strings) and their numerical ids (integers). For example:
id2word = {0: u'word', 2: u'profit', 300: u'another_word'}
This mapping step is technically (not conceptually) necessary because most algorithms rely on numerical libraries that work with vectors indexed by integers, rather than by strings, and have to know the vector/matrix dimensionality in advance.
The mapping can be constructed automatically by giving the Dictionary class a stream of tokenized documents:
doc_stream = (tokens for _, tokens in iter_wiki('./data/simplewiki-20140623-pages-articles.xml.bz2'))
%time id2word_wiki = gensim.corpora.Dictionary(doc_stream)
print(id2word_wiki)
The dictionary object now contains all words that appeared in the corpus, along with how many times they appeared. Let's filter out both very infrequent words and very frequent words (stopwords), to save resources and reduce noise:
# ignore words that appear in less than 20 documents or more than 10% documents
id2word_wiki.filter_extremes(no_below=20, no_above=0.1)
print(id2word_wiki)
Exercise (5 min): Print all words and their ids from id2word_wiki
where the word starts with "human".
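One possible solution is sketched below; it simply scans the dictionary's token2id mapping (the attribute that maps each token string to its integer id):
# a possible solution: scan the token -> id mapping for words starting with "human"
for word, word_id in id2word_wiki.token2id.items():
    if word.startswith('human'):
        print(word_id, word)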
Note for advanced users: In fully online scenarios, where the documents can only be streamed once (no repeating the stream), we can't exhaust the document stream just to build a dictionary. In this case we can map strings directly into their integer hash, using a hashing function such as MurmurHash or MD5. This is called the "hashing trick". A dictionary built this way is more difficult to debug, because there may be hash collisions: multiple words represented by a single id. See the documentation of HashDictionary for more details.
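For illustration, here is a minimal sketch of the hashing trick using gensim's HashDictionary (the id_range value below is an arbitrary choice; it bounds the number of distinct ids, i.e. hash buckets):
# map tokens straight to integer ids via hashing -- no full pass over the corpus needed
# note: different words may collide into the same id
hash_dictionary = gensim.corpora.HashDictionary(id_range=30000)
print(hash_dictionary.doc2bow(tokenize("A blood cell, also called a hematocyte, is a cell produced by hematopoiesis.")))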
A streamed corpus and a dictionary are all we need to create bag-of-words vectors:
doc = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."
bow = id2word_wiki.doc2bow(tokenize(doc))
print(bow)
print(id2word_wiki[10882])
Let's wrap the entire dump as a stream of bag-of-words vectors:
class WikiCorpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        """
        Parse the first `clip_docs` Wikipedia documents from file `dump_file`.
        Yield each document in turn, as a bag-of-words vector built with `dictionary`.
        """
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs

    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_wiki(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        return self.clip_docs
# create a stream of bag-of-words vectors
wiki_corpus = WikiCorpus('./data/simplewiki-20140623-pages-articles.xml.bz2', id2word_wiki)
vector = next(iter(wiki_corpus))
print(vector) # print the first vector in the stream
# what is the most common word in that first article?
most_index, most_count = max(vector, key=lambda entry: entry[1])
print(id2word_wiki[most_index], most_count)
Let's store all those bag-of-words vectors into a file, so we don't have to parse the bzipped Wikipedia XML over and over:
%time gensim.corpora.MmCorpus.serialize('./data/wiki_bow.mm', wiki_corpus)
mm_corpus = gensim.corpora.MmCorpus('./data/wiki_bow.mm')
print(mm_corpus)
mm_corpus now contains exactly the same bag-of-words vectors as wiki_corpus before, but they are backed by the .mm file, rather than extracted on the fly from the xml.bz2 file:
print(next(iter(mm_corpus)))
Topic modeling in gensim is realized via transformations. A transformation is something that takes a corpus and spits out another corpus on output, using corpus_out = transformation_object[corpus_in]
syntax. What exactly happens in between is determined by what kind of transformation we're using -- options are Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP) etc.
Some transformations need to be initialized (=trained) before they can be used. For example, let's train an LDA transformation model, using our bag-of-words WikiCorpus as training data:
clipped_corpus = gensim.utils.ClippedCorpus(mm_corpus, 4000) # use fewer documents during training, LDA is slow
# ClippedCorpus new in gensim 0.10.1
# copy&paste it from https://github.com/piskvorky/gensim/blob/0.10.1/gensim/utils.py#L467 if necessary (or upgrade your gensim)
%time lda_model = gensim.models.LdaModel(clipped_corpus, num_topics=10, id2word=id2word_wiki, passes=4)
_ = lda_model.print_topics(-1) # print a few most important words for each LDA topic
More info on model parameters in gensim docs.
Transformations can be stacked. For example, here we'll train a TFIDF model, and then train Latent Semantic Indexing (LSI) on top of TFIDF:
%time tfidf_model = gensim.models.TfidfModel(mm_corpus, id2word=id2word_wiki)
The TFIDF transformation only modifies feature weights of each word. Its input and output dimensionality are identical (=the dictionary size).
%time lsi_model = gensim.models.LsiModel(tfidf_model[mm_corpus], id2word=id2word_wiki, num_topics=200)
The LSI transformation goes from a space of high dimensionality (the TFIDF space, tens of thousands of dimensions) into a space of low dimensionality (a few hundred; here 200). For this reason it can also be seen as dimensionality reduction.
As always, the transformations are applied "lazily", so the resulting output corpus is streamed as well:
print(next(iter(lsi_model[tfidf_model[mm_corpus]])))
We can store this "LSA via TFIDF via bag-of-words" corpus the same way again:
# cache the transformed corpora to disk, for use in later notebooks
%time gensim.corpora.MmCorpus.serialize('./data/wiki_tfidf.mm', tfidf_model[mm_corpus])
%time gensim.corpora.MmCorpus.serialize('./data/wiki_lsa.mm', lsi_model[tfidf_model[mm_corpus]])
# gensim.corpora.MmCorpus.serialize('./data/wiki_lda.mm', lda_model[mm_corpus])
(You can also gzip/bzip2 these .mm files to save space, as gensim can work with zipped input transparently.)
Persisting a transformed corpus to disk makes sense if we want to iterate over it multiple times and the transformation is costly. As before, the saved result is indistinguishable from when it's computed on the fly, so this is effectively a form of "corpus caching":
tfidf_corpus = gensim.corpora.MmCorpus('./data/wiki_tfidf.mm')
# `tfidf_corpus` is now exactly the same as `tfidf_model[wiki_corpus]`
print(tfidf_corpus)
lsi_corpus = gensim.corpora.MmCorpus('./data/wiki_lsa.mm')
# and `lsi_corpus` now equals `lsi_model[tfidf_model[wiki_corpus]]` = `lsi_model[tfidf_corpus]`
print(lsi_corpus)
We can use the trained models to transform new, unseen documents into the semantic space:
text = "A blood cell, also called a hematocyte, is a cell produced by hematopoiesis and normally found in blood."
# transform text into the bag-of-words space
bow_vector = id2word_wiki.doc2bow(tokenize(text))
print([(id2word_wiki[id], count) for id, count in bow_vector])
# transform into LDA space
lda_vector = lda_model[bow_vector]
print(lda_vector)
# print the document's single most prominent LDA topic
print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))
Exercise (5 min): print text
transformed into TFIDF space.
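One possible solution, sketched below, simply re-uses the bow_vector computed above and applies the trained TFIDF model to it:
# a possible solution: apply the trained TFIDF model to the bag-of-words vector
tfidf_vector = tfidf_model[bow_vector]
print([(id2word_wiki[word_id], weight) for word_id, weight in tfidf_vector])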
For stacked transformations, apply the same stack during transformation as was applied during training:
# transform into LSI space
lsi_vector = lsi_model[tfidf_model[bow_vector]]
print(lsi_vector)
# print the document's single most prominent LSI topic (not interpretable like LDA!)
print(lsi_model.print_topic(max(lsi_vector, key=lambda item: abs(item[1]))[0]))
Gensim objects have save/load methods for persisting a model to disk, so it can be re-used later (or sent over the network to a different computer, or whatever):
# store all trained models to disk
lda_model.save('./data/lda_wiki.model')
lsi_model.save('./data/lsi_wiki.model')
tfidf_model.save('./data/tfidf_wiki.model')
id2word_wiki.save('./data/wiki.dictionary')
# load the same model back; the result is equal to `lda_model`
same_lda_model = gensim.models.LdaModel.load('./data/lda_wiki.model')
These methods are optimized for storing large models; internal matrices that consume a lot of RAM are mmap'ed in read-only mode. This allows "sharing" a single model between several processes, through the OS's virtual memory management.
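For example, here is a minimal sketch of loading the stored LSI model with its large arrays memory-mapped; this assumes the big arrays were stored as separate files, which save() does by default for large models:
# load the LSI model with its internal matrices mmap'ed read-only,
# so several processes can share a single copy in RAM
lsi_shared = gensim.models.LsiModel.load('./data/lsi_wiki.model', mmap='r')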
Topic modeling is an unsupervised task; we do not know in advance what the topics ought to look like. This makes evaluation tricky: whereas in supervised learning (classification, regression) we simply compare predicted labels to expected labels, there are no "expected labels" in topic modeling.
Each topic modeling method (LSI, LDA...) has its own way of measuring internal quality (perplexity, reconstruction error...). But these are an artifact of the particular approach taken (Bayesian training, matrix factorization...), and mostly of academic interest. There's no way to compare such scores across different types of topic models, either. The best way to really evaluate quality of unsupervised tasks is to evaluate how they improve the superordinate task, the one we're actually training them for.
For example, when the ultimate goal is to retrieve semantically similar documents, we manually tag a set of similar documents and then see how well a given semantic model maps those similar documents together.
Such manual tagging can be resource intensive, so people have been looking for clever ways to automate it. In Reading tea leaves: How humans interpret topic models, Wallach et al. suggest a "word intrusion" method that works well for models where the topics are meant to be "human interpretable", such as LDA. For each trained topic, they take its first ten words, then substitute one of them with another, randomly chosen word (the intruder!) and see whether a human can reliably tell which one it was. If so, the trained topic is topically coherent (good); if not, the topic has no discernible theme (bad):
# select the top 50 words for each of the LDA topics
top_words = [[word for _, word in lda_model.show_topic(topicno, topn=50)] for topicno in range(lda_model.num_topics)]
print(top_words)
# get all top 50 words across all topics, as one large set
all_words = set(itertools.chain.from_iterable(top_words))
print("Can you spot the misplaced word in each topic?")
# for each topic, replace a word at a different index, to make it more interesting
replace_index = np.random.randint(0, 10, lda_model.num_topics)
replacements = []
for topicno, words in enumerate(top_words):
    other_words = all_words.difference(words)
    replacement = np.random.choice(list(other_words))
    replacements.append((words[replace_index[topicno]], replacement))
    words[replace_index[topicno]] = replacement
    print("%i: %s" % (topicno, ' '.join(words[:10])))
print("Actual replacements were:")
print(list(enumerate(replacements)))
We can also use a different trick, one which doesn't require manual tagging or "eyeballing" (resource intensive) and doesn't limit the evaluation to interpretable models only. We'll split each document into two parts, and check that 1) topics of the first half are similar to topics of the second half, and 2) halves of different documents are mostly dissimilar:
# evaluate on 1k documents **not** used in LDA training
doc_stream = (tokens for _, tokens in iter_wiki('./data/simplewiki-20140623-pages-articles.xml.bz2')) # generator
test_docs = list(itertools.islice(doc_stream, 8000, 9000))
def intra_inter(model, test_docs, num_pairs=10000):
    # split each test document into two halves and compute topics for each half
    part1 = [model[id2word_wiki.doc2bow(tokens[: len(tokens) // 2])] for tokens in test_docs]
    part2 = [model[id2word_wiki.doc2bow(tokens[len(tokens) // 2 :])] for tokens in test_docs]

    # print computed similarities (uses cossim)
    print("average cosine similarity between corresponding parts (higher is better):")
    print(np.mean([gensim.matutils.cossim(p1, p2) for p1, p2 in zip(part1, part2)]))

    random_pairs = np.random.randint(0, len(test_docs), size=(num_pairs, 2))
    print("average cosine similarity between 10,000 random parts (lower is better):")
    print(np.mean([gensim.matutils.cossim(part1[i[0]], part2[i[1]]) for i in random_pairs]))
print("LDA results:")
intra_inter(lda_model, test_docs)
print("LSI results:")
intra_inter(lsi_model, test_docs)
In this notebook, we saw how to:
- stream over a Wikipedia dump without unzipping it
- build a Dictionary and filter out too-rare and too-frequent words
- convert the streamed articles into bag-of-words vectors and serialize them with MmCorpus
- train, stack, save and load LDA, TFIDF and LSI transformations
- transform new, unseen documents into the trained semantic spaces
- evaluate topic models via word intrusion and split-document similarity
In this notebook, we've used the smallish simplewiki-20140623-pages-articles.xml.bz2 file, for time reasons. You can run exactly the same code on the full Wikipedia dump too [BZ2, 10.2GB] -- the format is exactly the same. Our streamed approach ensures that the RAM footprint of the processing stays constant. There's actually a script in gensim that does all these steps for you, and uses parallelization (multiprocessing) for faster execution; see Experiments on the English Wikipedia.
In the next notebook, we'll see how to index the semantically transformed corpora and run queries against the index.
Continue by opening the next ipython notebook, 3 - Indexing and Retrieval.