Topic Modeling for Fun and Profit

In this notebook, you'll

  • check you have all dependencies installed correctly
  • check you have downloaded all necessary data
  • get up to speed with efficient Python data access patterns

Tutorial setup

Check that all dependencies are installed correctly (see the README). To run a cell, highlight it with your mouse and press SHIFT+ENTER:

In [1]:
import numpy
In [2]:
import scipy
In [3]:
import gensim
gensim.utils.lemmatize("The quick brown fox jumps over the lazy dog!")
Out[3]:
['quick/JJ', 'brown/JJ', 'fox/NN', 'jump/NN', 'lazy/JJ', 'dog/NN']
In [4]:
import textblob
textblob.TextBlob("The quick brown fox jumps over the lazy dog!").noun_phrases
Out[4]:
WordList([u'quick brown fox jumps', u'lazy dog'])

If the above executes without errors, you'll see a number appear to the left of each of these cell prompts, and you're good to go!

In case you're using virtual environments (recommended), check that the right package/location was picked up by Python:

In [5]:
print(scipy.__version__, scipy.__file__)
print(gensim.__version__, gensim.__file__)
scipy.show_config()
('0.14.0', '/Users/kofola/miniconda/envs/europy14/lib/python2.7/site-packages/scipy/__init__.pyc')
('0.10.1', '/Users/kofola/miniconda/envs/europy14/lib/python2.7/site-packages/gensim/__init__.pyc')
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Check training data

Make sure you have downloaded all necessary data files (again, see the README):

In [6]:
!ls -lh ./data/
total 223640
-rw-r--r--  1 kofola  staff    14M Jul 22 21:19 20news-bydate.tar.gz
-rw-r--r--  1 kofola  staff   163B Jul 22 21:19 README.md
-rw-r--r--  1 kofola  staff    95M Jul 22 21:19 simplewiki-20140623-pages-articles.xml.bz2

You should see at least two entries there: simplewiki-20140623-pages-articles.xml.bz2 and 20news-bydate.tar.gz.

Quick Python recap

Data streaming, generators, iterators

Generators are a built-in way to iterate over a sequence once, without materializing all its elements at the same time:

In [7]:
def odd_numbers():
    """
    Yield one odd number after another.
    
    Don't try to materialize its result in a plain list, with `list(odd_numbers())`,
    because the sequence is infinite and you'll run out of RAM!
    """
    result = 1
    while True:
        yield result  # `yield` instead of `return`!
        result += 2

odd_numbers_generator = odd_numbers()

for odd_number in odd_numbers_generator:
    print(odd_number)
    if odd_number > 10:
        break
1
3
5
7
9
11

We'll be using this pattern of "generate a data point, process it, forget it" often, because it allows us to bypass RAM limitations. With generators we can process huge text corpora in constant memory, using clever algorithms that don't mind operating one-data-point-at-a-time.

This is in contrast to plain Python lists, Pandas DataFrames, or even NumPy and SciPy arrays, where the entire sequence must be known beforehand and mapped into (virtual) memory fully.
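
For instance (a small sketch; the exact byte counts vary by Python version and platform), a list must hold all of its elements in memory at once, while a generator object stays tiny no matter how long the sequence it represents:

```python
import sys

# one million squares: fully materialized vs. produced lazily
squares_list = [n * n for n in range(1000000)]
squares_gen = (n * n for n in range(1000000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes, regardless of length
```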

Generators and iterators come at a cost: since we're only allowed to go one item after another, it's not possible to skip to the middle of the sequence. Unless we take care of it manually, there's no equivalent of randomly accessing an arbitrary element à la lists: some_list[100] will work, but some_generator[100] won't.
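
To illustrate (a minimal sketch): the built-in next() is the only way to advance a generator, so "skipping ahead" really means consuming and discarding everything in between:

```python
numbers = (n for n in range(100))  # a finite generator

print(next(numbers))  # 0
print(next(numbers))  # 1

# numbers[50] would raise TypeError; to reach the item with value 50,
# we must consume items 2..49 one by one
for _ in range(48):
    next(numbers)
print(next(numbers))  # 50
```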

An iterable is like a generator (memory efficient), except it can be iterated over multiple times. To achieve that, we override the object's special __iter__ method (which Python calls every time we loop over the object) to return a generator:

In [8]:
class OddNumbers(object):
    def __iter__(self):
        result = 1
        while True:
            yield result
            result += 2

odd_numbers_iterator = OddNumbers()

for odd_number in odd_numbers_iterator:
    print(odd_number)
    if odd_number > 10:
        break
1
3
5
7
9
11
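
The practical difference is easy to demonstrate (a small sketch; the Countdown class below is our own illustration): a generator is exhausted after a single pass, while an iterable object hands out a fresh generator on every loop:

```python
def countdown():
    yield 3
    yield 2
    yield 1

gen = countdown()
print(list(gen))  # [3, 2, 1]
print(list(gen))  # [] -- already exhausted

class Countdown(object):
    def __iter__(self):
        yield 3
        yield 2
        yield 1

iterable = Countdown()
print(list(iterable))  # [3, 2, 1]
print(list(iterable))  # [3, 2, 1] -- __iter__ called again, fresh generator
```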

That's all we need to know for our purposes. For more info, read Data streaming in Python: generators, iterators, iterables, or Python's documentation for "iterator types".

NumPy & SciPy arrays

NumPy is a 3rd party package (not built-in). NumPy arrays are a concise and efficient way to represent a fixed-length list of numbers (or, actually and uninterestingly for this tutorial, of any objects). Their power comes from pithy array slicing, even in multiple dimensions:

In [9]:
# create a 2D table of random numbers, with 10 rows and 5 columns
x = numpy.random.rand(10, 5)

print(x)
[[ 0.94101733  0.81959211  0.17912685  0.66843553  0.8875169 ]
 [ 0.90260043  0.40696554  0.72392572  0.72318521  0.53059643]
 [ 0.4796837   0.60057962  0.06515785  0.74031769  0.17075889]
 [ 0.57708454  0.68949203  0.77487153  0.94096938  0.12406982]
 [ 0.85767963  0.12268029  0.30514173  0.26364486  0.27548273]
 [ 0.06972059  0.73688682  0.28564218  0.06346427  0.63332022]
 [ 0.83166623  0.30425193  0.59588164  0.87722348  0.19238508]
 [ 0.86394735  0.89410927  0.96060885  0.26987568  0.20651538]
 [ 0.2797839   0.76736262  0.98665244  0.3771055   0.7964668 ]
 [ 0.67799309  0.72377897  0.96231661  0.68319975  0.54770925]]

In [10]:
# print element in 3rd row and 2nd column
print(x[2, 1])  
0.600579619968

In [11]:
# print the entire 3rd row
print(x[2])
[ 0.4796837   0.60057962  0.06515785  0.74031769  0.17075889]

In [12]:
# print the entire 2nd column
print(x[:, 1])
[ 0.81959211  0.40696554  0.60057962  0.68949203  0.12268029  0.73688682
  0.30425193  0.89410927  0.76736262  0.72377897]

In [13]:
# print a sub-table (rectangular region), starting at [0, 0] and ending at [4, 2] (exclusive)
print(x[:4, :2])
[[ 0.94101733  0.81959211]
 [ 0.90260043  0.40696554]
 [ 0.4796837   0.60057962]
 [ 0.57708454  0.68949203]]

NumPy's power also comes from the fact that the underlying implementation is written to be fast (in C, even plugging into fast BLAS libraries where available).
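
As a quick illustration of why this matters (a sketch; the actual speedup depends on array size and your BLAS build), a single vectorized call replaces an explicit Python loop:

```python
import numpy

x = numpy.random.rand(1000, 5)

# vectorized: one call into optimized C code
col_means = x.mean(axis=0)

# equivalent pure-Python loop; much slower on large arrays
slow_means = [sum(row[j] for row in x) / len(x) for j in range(5)]

print(numpy.allclose(col_means, slow_means))  # True
```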

Similarly, the 3rd party SciPy package contains scipy.sparse arrays, which are a way to represent vectors and matrices with assumed (implicit) zeros.

scipy.sparse arrays are not as efficient as NumPy arrays, because they don't plug into BLAS and because their memory access patterns are more involved (cache misses). But not materializing the zeros explicitly can make a huge difference for very sparse arrays (lots of zeros). However, all non-zero values must still reside in memory, so ultimately, for large data, we still resort to generators and data streaming.
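
For a sense of the savings (a sketch using the CSR format; other sparse formats make different trade-offs):

```python
import numpy
import scipy.sparse

# a 1000x1000 matrix with only three non-zero entries
dense = numpy.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[10, 20] = 2.0
dense[999, 999] = 3.0

sparse = scipy.sparse.csr_matrix(dense)

print(dense.nbytes)        # 8000000 -- every zero stored explicitly
print(sparse.data.nbytes)  # 24 -- only the three non-zero float64 values
```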

A common pattern that we'll be using is combining the efficiency of in-memory arrays (numpy, scipy.sparse) with the scalability of data streaming. Instead of processing one document at a time (slow), or all documents at once (non-scalable), we'll be reading a chunk of documents into RAM (= as many documents as RAM allows), processing this chunk, then throwing it away and streaming a new chunk into RAM.
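
A minimal sketch of that chunking pattern (the helper name stream_chunks is our own; gensim ships a similar utility in gensim.utils):

```python
import itertools

def stream_chunks(stream, chunk_size):
    """Yield successive lists of up to `chunk_size` items from `stream`."""
    stream = iter(stream)
    while True:
        chunk = list(itertools.islice(stream, chunk_size))
        if not chunk:
            break
        yield chunk  # process this chunk in RAM, then forget it

# process an (arbitrarily long) stream of documents, 4 at a time
documents = ('document #%d' % i for i in range(10))
for chunk in stream_chunks(documents, 4):
    print(len(chunk))  # 4, then 4, then 2
```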

Itertools

A built-in Python library for efficient work with data streams (iterables, iterators, generators):

In [14]:
import itertools

infinite_stream = OddNumbers()

# compute the first 10 items (and no more) & print them
print(list(itertools.islice(infinite_stream, 10)))

# lazily concatenate streams; the result is also infinite
concat_stream = itertools.chain('abcde', infinite_stream)
print(list(itertools.islice(concat_stream, 10)))

numbered_stream = enumerate(infinite_stream)  # also infinite
print(list(itertools.islice(numbered_stream, 10)))

# etc; see the itertools docs for more examples
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
['a', 'b', 'c', 'd', 'e', 1, 3, 5, 7, 9]
[(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 13), (7, 15), (8, 17), (9, 19)]

The examples above show another useful pattern: take a small sample of the stream (e.g. the first ten elements) and convert it into a plain Python list, with list(islice(stream, 10)). To convert an entire stream into a list, simply call list(stream) (watch out for RAM here though, especially with infinite streams!). Nothing beats the simplicity of list(stream) for debugging purposes.

Notebooks

At any point, you can save the notebook (any notebook) to disk by pressing CTRL+s (or CMD+s). This will save all changes you've made to the notebook, including cell outputs, locally to your disk.

To discard your notebook changes, simply checkout the notebook file again from git (or extract it again from the repository ZIP archive). This will reset the notebook to its original state, losing all your changes.