LDA Model

Introduces Gensim’s LDA model and demonstrates its use on the NIPS corpus.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The purpose of this tutorial is to demonstrate how to train and tune an LDA model.

In this tutorial we will:

  • Load input data.

  • Pre-process that data.

  • Transform documents into bag-of-words vectors.

  • Train an LDA model.

This tutorial will not:

  • Explain how Latent Dirichlet Allocation works

  • Explain how the LDA model performs inference

  • Teach you all the parameters and options for Gensim’s LDA implementation

If you are not familiar with the LDA model or how to use it in Gensim, I (Ólavur Mortensen) suggest you read up on that before continuing with this tutorial. A basic understanding of the LDA model should suffice.

I would also encourage you to consider each step when applying the model to your data, instead of just blindly applying my solution. The different steps will depend on your data and possibly your goal with the model.

Data

I have used a corpus of NIPS papers in this tutorial, but if you’re following this tutorial just to learn about LDA I encourage you to consider picking a corpus on a subject that you are familiar with. Qualitatively evaluating the output of an LDA model is challenging and can require you to understand the subject matter of your corpus (depending on your goal with the model).

NIPS (Neural Information Processing Systems) is a machine learning conference, so the subject matter should be well suited for most of the target audience of this tutorial. You can download the original data from Sam Roweis’ website. The code below will also do that for you.

Important

The corpus contains 1740 documents, and not particularly long ones. So keep in mind that this tutorial is not geared towards efficiency, and be careful before applying the code to a large dataset.

import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            for member in tar.getmembers():
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())

So we have a list of 1740 documents, where each document is a Unicode string. If you’re thinking about using your own corpus, then you need to make sure that it’s in the same format (list of Unicode strings) before proceeding with the rest of this tutorial.

print(len(docs))
print(docs[0][:500])

Out:

1740
387
Neural Net and Traditional Classifiers 
William Y. Huang and Richard P. Lippmann
MIT Lincoln Laboratory
Lexington, MA 02173, USA
Abstract
Previous work on nets with continuous-valued inputs led to generative
procedures to construct convex decision regions with two-layer percepttons (one hidden
layer) and arbitrary decision regions with three-layer percepttons (two hidden layers).
Here we demonstrate that two-layer perceptton classifiers trained with back propagation
can form both c
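
If you are bringing your own data, the only requirement at this point is that it ends up in the same shape: a list of Unicode strings, one per document. Below is a minimal sketch of one way to get there, assuming your documents are plain-text files in a single directory (the path is a placeholder):

# Load a directory of plain-text files into a list of Unicode strings,
# one string per document.
import os

def load_plaintext_corpus(directory):
    documents = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.txt'):
            path = os.path.join(directory, filename)
            with open(path, encoding='utf-8', errors='replace') as fin:
                documents.append(fin.read())
    return documents

# docs = load_plaintext_corpus('/path/to/your/corpus')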

Pre-process and vectorize the documents

As part of preprocessing, we will:

  • Tokenize (split the documents into tokens).

  • Lemmatize the tokens.

  • Compute bigrams.

  • Compute a bag-of-words representation of the data.

First we tokenize the text using a regular expression tokenizer from NLTK. We remove numeric tokens and tokens that are only a single character, as they don’t tend to be useful, and the dataset contains a lot of them.

Important

This tutorial uses the nltk library for preprocessing, although you can replace it with something else if you want.

# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words. Output that is easy to read is very desirable in topic modelling.

# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

We find bigrams in the documents. Bigrams are pairs of adjacent words. Using bigrams we can get phrases like “machine_learning” in our output (spaces are replaced with underscores); without bigrams we would only get “machine” and “learning”.

Note that in the code below, we find bigrams and then add them to the original data, because we would like to keep the words “machine” and “learning” as well as the bigram “machine_learning”.

Important

Computing n-grams of a large dataset can be very computationally and memory intensive.

# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

Out:

2022-04-22 17:42:29,962 : INFO : collecting all words and their counts
2022-04-22 17:42:29,963 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-04-22 17:42:37,368 : INFO : collected 1120198 token types (unigram + bigrams) from a corpus of 4629808 words and 1740 sentences
2022-04-22 17:42:37,368 : INFO : merged Phrases<1120198 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
2022-04-22 17:42:37,426 : INFO : Phrases lifecycle event {'msg': 'built Phrases<1120198 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> in 7.41s', 'datetime': '2022-04-22T17:42:37.369061', 'gensim': '4.1.3.dev0', 'python': '3.9.7 (default, Sep  3 2021, 12:37:55) \n[Clang 12.0.5 (clang-1205.0.22.9)]', 'platform': 'macOS-11.6.5-x86_64-i386-64bit', 'event': 'created'}

We remove rare words and common words based on their document frequency. Below we remove words that appear in fewer than 20 documents or in more than 50% of the documents. Consider trying to remove words only based on their frequency, or maybe combining that with this approach (a sketch of the frequency-based alternative follows the dictionary output below).

# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

Out:

2022-04-22 17:42:50,414 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-04-22 17:42:54,959 : INFO : built Dictionary<79429 unique tokens: ['1ooooo', '1st', '25oo', '2o00', '4ooo']...> from 1740 documents (total 4953968 corpus positions)
2022-04-22 17:42:54,960 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<79429 unique tokens: ['1ooooo', '1st', '25oo', '2o00', '4ooo']...> from 1740 documents (total 4953968 corpus positions)", 'datetime': '2022-04-22T17:42:54.960496', 'gensim': '4.1.3.dev0', 'python': '3.9.7 (default, Sep  3 2021, 12:37:55) \n[Clang 12.0.5 (clang-1205.0.22.9)]', 'platform': 'macOS-11.6.5-x86_64-i386-64bit', 'event': 'created'}
2022-04-22 17:42:55,733 : INFO : discarding 70785 tokens: [('1ooooo', 1), ('25oo', 2), ('2o00', 6), ('4ooo', 2), ('64k', 6), ('a', 1740), ('aaditional', 1), ('above', 1114), ('abstract', 1740), ('acase', 1)]...
2022-04-22 17:42:55,734 : INFO : keeping 8644 tokens which were in no less than 20 and no more than 870 (=50.0%) documents
2022-04-22 17:42:55,779 : INFO : resulting dictionary: Dictionary<8644 unique tokens: ['1st', '5oo', '7th', 'a2', 'a_well']...>
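
As mentioned above, you could also filter on raw collection frequency instead of (or in addition to) document frequency. A rough sketch of that alternative, assuming a recent Gensim where the dictionary exposes collection frequencies as cfs (it is not applied in this run, and the threshold of 20 occurrences is just an example value):

# Drop tokens that occur fewer than 20 times in the whole collection,
# using the dictionary's collection frequencies (cfs).
rare_ids = [token_id for token_id, freq in dictionary.cfs.items() if freq < 20]
dictionary.filter_tokens(bad_ids=rare_ids)
dictionary.compactify()  # reassign ids to close the gaps left by filtering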

Finally, we transform the documents to a vectorized form. We simply compute the frequency of each word, including the bigrams.

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

Let’s see how many tokens and documents we have to train on.

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Out:

Number of unique tokens: 8644
Number of documents: 1740
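
Each entry of corpus is now a list of (token_id, count) pairs. As an illustrative sanity check, you can map a few of those ids back to words using the dictionary:

# Inspect the first few (token_id, count) pairs of the first document and
# translate the ids back into words.
bow_doc = corpus[0]
print(bow_doc[:5])
print([(dictionary[token_id], count) for token_id, count in bow_doc[:5]])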

Training

We are ready to train the LDA model. We will first discuss how to set some of the training parameters.

First of all, the elephant in the room: how many topics do I need? There is really no easy answer for this; it will depend on both your data and your application. I have used 10 topics here because I wanted to have a few topics that I could interpret and “label”, and because that turned out to give me reasonably good results. You might not need to interpret all your topics, so you could use a large number of topics, for example 100.

chunksize controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. I’ve set chunksize = 2000, which is more than the number of documents, so I process all the data in one go. Chunksize can, however, influence the quality of the model, as discussed in Hoffman and co-authors [2], but the difference was not substantial in this case.

passes controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of “passes” and “iterations” high enough.

I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set eval_every = 1 in LdaModel. When training the model, look for a line in the log that looks something like this:

2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations

If you set passes = 20 you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.
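
As a concrete example, a quick diagnostic run along those lines might look like the sketch below. The low parameter values are placeholders; the point is only to produce and read the convergence lines in the log.

import logging
from gensim.models import LdaModel

# The convergence line quoted above is logged at DEBUG level by the LDA module.
logging.getLogger('gensim.models.ldamodel').setLevel(logging.DEBUG)

diagnostic_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=2,       # just enough passes to see the convergence line repeat
    iterations=50,  # deliberately low, so unconverged documents are visible
    eval_every=1,
)
# Increase iterations (and passes) until most documents converge.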

We set alpha = 'auto' and eta = 'auto'. Again this is somewhat technical, but essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.

# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

Out:

2022-04-22 17:43:05,111 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2022-04-22 17:43:05,115 : INFO : using serial LDA version on this node
2022-04-22 17:43:05,137 : INFO : running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-04-22 17:43:05,148 : INFO : PROGRESS: pass 0, at document #1740/1740
2022-04-22 17:43:21,190 : INFO : optimized alpha [0.0578294, 0.07125457, 0.07889137, 0.09016259, 0.077791244, 0.0792375, 0.097086295, 0.061600033, 0.095310934, 0.060617708]
2022-04-22 17:43:21,202 : INFO : topic #0 (0.058): 0.007*"hidden" + 0.006*"word" + 0.005*"recognition" + 0.004*"gaussian" + 0.003*"hidden_unit" + 0.003*"rule" + 0.003*"component" + 0.003*"layer" + 0.003*"image" + 0.002*"connection"
2022-04-22 17:43:21,202 : INFO : topic #9 (0.061): 0.015*"neuron" + 0.007*"cell" + 0.005*"signal" + 0.005*"spike" + 0.004*"layer" + 0.004*"response" + 0.004*"firing" + 0.004*"noise" + 0.003*"density" + 0.003*"hidden"
2022-04-22 17:43:21,202 : INFO : topic #3 (0.090): 0.006*"image" + 0.005*"class" + 0.003*"classifier" + 0.003*"classification" + 0.003*"recognition" + 0.003*"component" + 0.003*"kernel" + 0.003*"noise" + 0.003*"sequence" + 0.002*"rule"
2022-04-22 17:43:21,203 : INFO : topic #8 (0.095): 0.004*"hidden" + 0.003*"signal" + 0.003*"rule" + 0.003*"dynamic" + 0.002*"control" + 0.002*"prediction" + 0.002*"net" + 0.002*"sequence" + 0.002*"speech" + 0.002*"matrix"
2022-04-22 17:43:21,203 : INFO : topic #6 (0.097): 0.006*"image" + 0.005*"cell" + 0.004*"neuron" + 0.004*"layer" + 0.004*"field" + 0.004*"object" + 0.003*"recognition" + 0.003*"signal" + 0.003*"noise" + 0.003*"class"
2022-04-22 17:43:21,203 : INFO : topic diff=1.159133, rho=1.000000
2022-04-22 17:43:21,212 : INFO : PROGRESS: pass 1, at document #1740/1740
2022-04-22 17:43:30,981 : INFO : optimized alpha [0.05010912, 0.057179544, 0.06367695, 0.07760008, 0.061386272, 0.06139503, 0.06987214, 0.050920427, 0.08028384, 0.05094144]
2022-04-22 17:43:30,987 : INFO : topic #0 (0.050): 0.009*"word" + 0.009*"hidden" + 0.008*"recognition" + 0.005*"gaussian" + 0.005*"speech" + 0.004*"hidden_unit" + 0.004*"mixture" + 0.003*"layer" + 0.003*"component" + 0.003*"likelihood"
2022-04-22 17:43:30,987 : INFO : topic #9 (0.051): 0.019*"neuron" + 0.009*"cell" + 0.009*"spike" + 0.007*"signal" + 0.006*"response" + 0.005*"firing" + 0.005*"stimulus" + 0.005*"noise" + 0.004*"layer" + 0.004*"visual"
2022-04-22 17:43:30,987 : INFO : topic #6 (0.070): 0.007*"image" + 0.006*"cell" + 0.005*"object" + 0.005*"field" + 0.004*"motion" + 0.004*"visual" + 0.004*"signal" + 0.004*"direction" + 0.004*"layer" + 0.004*"filter"
2022-04-22 17:43:30,988 : INFO : topic #3 (0.078): 0.008*"image" + 0.006*"class" + 0.005*"classifier" + 0.004*"classification" + 0.003*"kernel" + 0.003*"recognition" + 0.003*"component" + 0.003*"noise" + 0.003*"estimate" + 0.003*"gaussian"
2022-04-22 17:43:30,988 : INFO : topic #8 (0.080): 0.004*"hidden" + 0.004*"rule" + 0.003*"sequence" + 0.003*"prediction" + 0.003*"net" + 0.003*"bound" + 0.003*"optimal" + 0.003*"signal" + 0.003*"dynamic" + 0.002*"hidden_unit"
2022-04-22 17:43:30,988 : INFO : topic diff=0.292768, rho=0.577350
2022-04-22 17:43:30,996 : INFO : PROGRESS: pass 2, at document #1740/1740
2022-04-22 17:43:38,324 : INFO : optimized alpha [0.046267115, 0.049782153, 0.055386752, 0.070311576, 0.054385237, 0.052613482, 0.0592381, 0.044921257, 0.07121881, 0.045337107]
2022-04-22 17:43:38,330 : INFO : topic #7 (0.045): 0.009*"chip" + 0.006*"analog" + 0.006*"neuron" + 0.006*"noise" + 0.006*"memory" + 0.005*"layer" + 0.004*"connection" + 0.004*"signal" + 0.004*"circuit" + 0.004*"image"
2022-04-22 17:43:38,331 : INFO : topic #9 (0.045): 0.021*"neuron" + 0.011*"spike" + 0.011*"cell" + 0.007*"signal" + 0.007*"response" + 0.007*"stimulus" + 0.006*"firing" + 0.005*"noise" + 0.004*"visual" + 0.004*"layer"
2022-04-22 17:43:38,331 : INFO : topic #6 (0.059): 0.009*"image" + 0.007*"object" + 0.006*"cell" + 0.006*"visual" + 0.006*"motion" + 0.005*"field" + 0.005*"direction" + 0.004*"filter" + 0.004*"signal" + 0.004*"response"
2022-04-22 17:43:38,331 : INFO : topic #3 (0.070): 0.007*"image" + 0.007*"class" + 0.005*"classifier" + 0.004*"classification" + 0.003*"kernel" + 0.003*"sample" + 0.003*"estimate" + 0.003*"gaussian" + 0.003*"component" + 0.003*"noise"
2022-04-22 17:43:38,331 : INFO : topic #8 (0.071): 0.005*"hidden" + 0.005*"rule" + 0.003*"sequence" + 0.003*"net" + 0.003*"bound" + 0.003*"prediction" + 0.003*"optimal" + 0.003*"generalization" + 0.003*"hidden_unit" + 0.002*"tree"
2022-04-22 17:43:38,331 : INFO : topic diff=0.259048, rho=0.500000
2022-04-22 17:43:38,339 : INFO : PROGRESS: pass 3, at document #1740/1740
2022-04-22 17:43:44,815 : INFO : optimized alpha [0.04398281, 0.045212083, 0.050260257, 0.066244416, 0.050919566, 0.047668763, 0.053777307, 0.041211806, 0.06501518, 0.041524593]
2022-04-22 17:43:44,821 : INFO : topic #7 (0.041): 0.010*"chip" + 0.007*"analog" + 0.007*"neuron" + 0.006*"memory" + 0.006*"noise" + 0.005*"circuit" + 0.005*"signal" + 0.005*"layer" + 0.004*"voltage" + 0.004*"connection"
2022-04-22 17:43:44,821 : INFO : topic #9 (0.042): 0.021*"neuron" + 0.012*"spike" + 0.012*"cell" + 0.008*"signal" + 0.008*"stimulus" + 0.008*"response" + 0.007*"firing" + 0.005*"noise" + 0.004*"visual" + 0.004*"activity"
2022-04-22 17:43:44,821 : INFO : topic #6 (0.054): 0.011*"image" + 0.008*"object" + 0.007*"visual" + 0.007*"motion" + 0.006*"field" + 0.006*"cell" + 0.005*"direction" + 0.005*"filter" + 0.004*"signal" + 0.004*"response"
2022-04-22 17:43:44,822 : INFO : topic #8 (0.065): 0.005*"rule" + 0.005*"hidden" + 0.003*"sequence" + 0.003*"generalization" + 0.003*"net" + 0.003*"bound" + 0.003*"prediction" + 0.003*"hidden_unit" + 0.003*"optimal" + 0.003*"machine"
2022-04-22 17:43:44,822 : INFO : topic #3 (0.066): 0.007*"image" + 0.007*"class" + 0.005*"classifier" + 0.005*"classification" + 0.004*"gaussian" + 0.004*"sample" + 0.003*"estimate" + 0.003*"kernel" + 0.003*"noise" + 0.003*"component"
2022-04-22 17:43:44,822 : INFO : topic diff=0.235399, rho=0.447214
2022-04-22 17:43:44,830 : INFO : PROGRESS: pass 4, at document #1740/1740
2022-04-22 17:43:50,907 : INFO : optimized alpha [0.042409703, 0.0423433, 0.04680129, 0.06358971, 0.049375836, 0.044652227, 0.0507185, 0.038540646, 0.06110631, 0.038821314]
2022-04-22 17:43:50,913 : INFO : topic #7 (0.039): 0.011*"chip" + 0.008*"analog" + 0.008*"neuron" + 0.007*"circuit" + 0.007*"memory" + 0.006*"noise" + 0.006*"signal" + 0.005*"voltage" + 0.005*"layer" + 0.004*"vlsi"
2022-04-22 17:43:50,914 : INFO : topic #9 (0.039): 0.021*"neuron" + 0.013*"spike" + 0.013*"cell" + 0.009*"stimulus" + 0.009*"signal" + 0.009*"response" + 0.007*"firing" + 0.006*"noise" + 0.004*"activity" + 0.004*"visual"
2022-04-22 17:43:50,914 : INFO : topic #6 (0.051): 0.013*"image" + 0.009*"object" + 0.008*"visual" + 0.007*"motion" + 0.007*"field" + 0.006*"cell" + 0.006*"direction" + 0.005*"filter" + 0.005*"response" + 0.004*"map"
2022-04-22 17:43:50,914 : INFO : topic #8 (0.061): 0.006*"rule" + 0.005*"hidden" + 0.004*"generalization" + 0.004*"sequence" + 0.003*"net" + 0.003*"prediction" + 0.003*"hidden_unit" + 0.003*"bound" + 0.003*"machine" + 0.003*"tree"
2022-04-22 17:43:50,914 : INFO : topic #3 (0.064): 0.007*"class" + 0.006*"image" + 0.005*"classifier" + 0.005*"classification" + 0.004*"gaussian" + 0.004*"sample" + 0.004*"estimate" + 0.003*"kernel" + 0.003*"density" + 0.003*"prior"
2022-04-22 17:43:50,915 : INFO : topic diff=0.220905, rho=0.408248
2022-04-22 17:43:50,922 : INFO : PROGRESS: pass 5, at document #1740/1740
2022-04-22 17:43:57,459 : INFO : optimized alpha [0.04136415, 0.040443134, 0.04439863, 0.062082667, 0.048723623, 0.042787064, 0.048876576, 0.036657482, 0.058343116, 0.03701785]
2022-04-22 17:43:57,465 : INFO : topic #7 (0.037): 0.012*"chip" + 0.009*"analog" + 0.008*"neuron" + 0.008*"circuit" + 0.007*"memory" + 0.006*"signal" + 0.006*"noise" + 0.006*"voltage" + 0.005*"vlsi" + 0.004*"layer"
2022-04-22 17:43:57,465 : INFO : topic #9 (0.037): 0.022*"neuron" + 0.013*"spike" + 0.013*"cell" + 0.009*"stimulus" + 0.009*"signal" + 0.009*"response" + 0.008*"firing" + 0.006*"noise" + 0.004*"activity" + 0.004*"channel"
2022-04-22 17:43:57,466 : INFO : topic #6 (0.049): 0.015*"image" + 0.010*"object" + 0.009*"visual" + 0.007*"motion" + 0.007*"field" + 0.006*"direction" + 0.006*"cell" + 0.005*"filter" + 0.005*"map" + 0.005*"response"
2022-04-22 17:43:57,466 : INFO : topic #8 (0.058): 0.006*"rule" + 0.005*"hidden" + 0.004*"generalization" + 0.004*"sequence" + 0.003*"net" + 0.003*"hidden_unit" + 0.003*"prediction" + 0.003*"bound" + 0.003*"machine" + 0.003*"tree"
2022-04-22 17:43:57,466 : INFO : topic #3 (0.062): 0.007*"class" + 0.006*"image" + 0.005*"classifier" + 0.005*"classification" + 0.004*"gaussian" + 0.004*"sample" + 0.004*"estimate" + 0.004*"density" + 0.004*"prior" + 0.003*"bayesian"
2022-04-22 17:43:57,467 : INFO : topic diff=0.210451, rho=0.377964
2022-04-22 17:43:57,477 : INFO : PROGRESS: pass 6, at document #1740/1740
2022-04-22 17:44:02,657 : INFO : optimized alpha [0.040722344, 0.039083496, 0.04264549, 0.061218463, 0.048731733, 0.041630186, 0.047772773, 0.03532755, 0.0563227, 0.03579225]
2022-04-22 17:44:02,663 : INFO : topic #7 (0.035): 0.012*"chip" + 0.010*"analog" + 0.009*"circuit" + 0.009*"neuron" + 0.007*"memory" + 0.007*"signal" + 0.006*"voltage" + 0.006*"noise" + 0.005*"vlsi" + 0.004*"implementation"
2022-04-22 17:44:02,663 : INFO : topic #9 (0.036): 0.022*"neuron" + 0.014*"spike" + 0.013*"cell" + 0.010*"stimulus" + 0.009*"signal" + 0.009*"response" + 0.008*"firing" + 0.006*"noise" + 0.005*"channel" + 0.005*"activity"
2022-04-22 17:44:02,664 : INFO : topic #4 (0.049): 0.008*"matrix" + 0.006*"gradient" + 0.005*"solution" + 0.004*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.003*"optimization" + 0.003*"neuron" + 0.003*"eq"
2022-04-22 17:44:02,664 : INFO : topic #8 (0.056): 0.007*"rule" + 0.005*"hidden" + 0.004*"generalization" + 0.004*"sequence" + 0.004*"net" + 0.003*"hidden_unit" + 0.003*"prediction" + 0.003*"tree" + 0.003*"machine" + 0.003*"bound"
2022-04-22 17:44:02,664 : INFO : topic #3 (0.061): 0.007*"class" + 0.005*"classifier" + 0.005*"image" + 0.005*"gaussian" + 0.005*"classification" + 0.004*"sample" + 0.004*"estimate" + 0.004*"density" + 0.004*"prior" + 0.004*"bayesian"
2022-04-22 17:44:02,664 : INFO : topic diff=0.201353, rho=0.353553
2022-04-22 17:44:02,673 : INFO : PROGRESS: pass 7, at document #1740/1740
2022-04-22 17:44:08,716 : INFO : optimized alpha [0.040365368, 0.038083963, 0.041339714, 0.06076524, 0.04909782, 0.040898465, 0.047129765, 0.034341704, 0.054831598, 0.034885667]
2022-04-22 17:44:08,722 : INFO : topic #7 (0.034): 0.013*"chip" + 0.010*"circuit" + 0.010*"analog" + 0.009*"neuron" + 0.007*"memory" + 0.007*"signal" + 0.007*"voltage" + 0.006*"noise" + 0.005*"vlsi" + 0.005*"implementation"
2022-04-22 17:44:08,723 : INFO : topic #9 (0.035): 0.022*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.010*"stimulus" + 0.010*"signal" + 0.010*"response" + 0.008*"firing" + 0.006*"noise" + 0.005*"channel" + 0.005*"activity"
2022-04-22 17:44:08,723 : INFO : topic #4 (0.049): 0.009*"matrix" + 0.006*"gradient" + 0.005*"solution" + 0.004*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.003*"optimization" + 0.003*"eq" + 0.003*"neuron"
2022-04-22 17:44:08,723 : INFO : topic #8 (0.055): 0.007*"rule" + 0.005*"hidden" + 0.005*"generalization" + 0.004*"sequence" + 0.004*"hidden_unit" + 0.004*"net" + 0.003*"prediction" + 0.003*"tree" + 0.003*"machine" + 0.003*"bound"
2022-04-22 17:44:08,723 : INFO : topic #3 (0.061): 0.007*"class" + 0.005*"classifier" + 0.005*"gaussian" + 0.005*"classification" + 0.005*"sample" + 0.005*"image" + 0.004*"estimate" + 0.004*"density" + 0.004*"prior" + 0.004*"bayesian"
2022-04-22 17:44:08,724 : INFO : topic diff=0.192330, rho=0.333333
2022-04-22 17:44:08,732 : INFO : PROGRESS: pass 8, at document #1740/1740
2022-04-22 17:44:13,585 : INFO : optimized alpha [0.040182494, 0.037441313, 0.04036209, 0.060601927, 0.049758103, 0.04055522, 0.046829112, 0.03359148, 0.053864058, 0.03418947]
2022-04-22 17:44:13,591 : INFO : topic #7 (0.034): 0.013*"chip" + 0.011*"circuit" + 0.011*"analog" + 0.009*"neuron" + 0.007*"signal" + 0.007*"memory" + 0.007*"voltage" + 0.006*"vlsi" + 0.006*"noise" + 0.005*"implementation"
2022-04-22 17:44:13,592 : INFO : topic #9 (0.034): 0.022*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.010*"stimulus" + 0.010*"signal" + 0.010*"response" + 0.008*"firing" + 0.006*"noise" + 0.005*"channel" + 0.005*"frequency"
2022-04-22 17:44:13,592 : INFO : topic #4 (0.050): 0.009*"matrix" + 0.006*"gradient" + 0.005*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.003*"optimization" + 0.003*"eq" + 0.003*"descent"
2022-04-22 17:44:13,592 : INFO : topic #8 (0.054): 0.007*"rule" + 0.006*"hidden" + 0.005*"generalization" + 0.004*"sequence" + 0.004*"hidden_unit" + 0.004*"net" + 0.003*"prediction" + 0.003*"tree" + 0.003*"machine" + 0.003*"trained"
2022-04-22 17:44:13,592 : INFO : topic #3 (0.061): 0.007*"class" + 0.005*"classifier" + 0.005*"gaussian" + 0.005*"sample" + 0.005*"classification" + 0.004*"estimate" + 0.004*"density" + 0.004*"image" + 0.004*"prior" + 0.004*"bayesian"
2022-04-22 17:44:13,593 : INFO : topic diff=0.182985, rho=0.316228
2022-04-22 17:44:13,601 : INFO : PROGRESS: pass 9, at document #1740/1740
2022-04-22 17:44:19,306 : INFO : optimized alpha [0.040097952, 0.036957335, 0.039702885, 0.060680483, 0.050588053, 0.040437363, 0.046769954, 0.033025023, 0.053330485, 0.033663847]
2022-04-22 17:44:19,312 : INFO : topic #7 (0.033): 0.013*"chip" + 0.012*"circuit" + 0.011*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.007*"memory" + 0.007*"voltage" + 0.006*"vlsi" + 0.005*"noise" + 0.005*"implementation"
2022-04-22 17:44:19,312 : INFO : topic #9 (0.034): 0.022*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.010*"stimulus" + 0.010*"signal" + 0.010*"response" + 0.008*"firing" + 0.007*"noise" + 0.006*"channel" + 0.005*"frequency"
2022-04-22 17:44:19,313 : INFO : topic #4 (0.051): 0.009*"matrix" + 0.006*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.003*"optimization" + 0.003*"descent"
2022-04-22 17:44:19,313 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.005*"generalization" + 0.004*"sequence" + 0.004*"hidden_unit" + 0.004*"net" + 0.004*"prediction" + 0.003*"tree" + 0.003*"machine" + 0.003*"trained"
2022-04-22 17:44:19,313 : INFO : topic #3 (0.061): 0.007*"class" + 0.005*"gaussian" + 0.005*"classifier" + 0.005*"sample" + 0.005*"classification" + 0.005*"estimate" + 0.004*"density" + 0.004*"prior" + 0.004*"bayesian" + 0.004*"mixture"
2022-04-22 17:44:19,313 : INFO : topic diff=0.173278, rho=0.301511
2022-04-22 17:44:19,321 : INFO : PROGRESS: pass 10, at document #1740/1740
2022-04-22 17:44:23,819 : INFO : optimized alpha [0.040098477, 0.036638554, 0.03923829, 0.060877353, 0.051485594, 0.04045682, 0.04686068, 0.032584008, 0.05302629, 0.03327818]
2022-04-22 17:44:23,825 : INFO : topic #7 (0.033): 0.013*"chip" + 0.012*"circuit" + 0.011*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.007*"memory" + 0.007*"voltage" + 0.006*"vlsi" + 0.005*"noise" + 0.005*"implementation"
2022-04-22 17:44:23,825 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"stimulus" + 0.010*"signal" + 0.010*"response" + 0.008*"firing" + 0.007*"noise" + 0.006*"channel" + 0.006*"frequency"
2022-04-22 17:44:23,826 : INFO : topic #4 (0.051): 0.009*"matrix" + 0.006*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimization" + 0.003*"descent"
2022-04-22 17:44:23,826 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.005*"generalization" + 0.004*"hidden_unit" + 0.004*"sequence" + 0.004*"net" + 0.004*"prediction" + 0.004*"tree" + 0.003*"machine" + 0.003*"trained"
2022-04-22 17:44:23,826 : INFO : topic #3 (0.061): 0.007*"class" + 0.006*"gaussian" + 0.005*"classifier" + 0.005*"sample" + 0.005*"estimate" + 0.005*"classification" + 0.004*"density" + 0.004*"prior" + 0.004*"mixture" + 0.004*"bayesian"
2022-04-22 17:44:23,827 : INFO : topic diff=0.163348, rho=0.288675
2022-04-22 17:44:23,834 : INFO : PROGRESS: pass 11, at document #1740/1740
2022-04-22 17:44:29,135 : INFO : optimized alpha [0.040188633, 0.03646946, 0.038880475, 0.06112813, 0.05245481, 0.04061286, 0.047049697, 0.03229136, 0.05290524, 0.03296597]
2022-04-22 17:44:29,141 : INFO : topic #7 (0.032): 0.013*"chip" + 0.013*"circuit" + 0.011*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.007*"memory" + 0.007*"voltage" + 0.006*"vlsi" + 0.005*"noise" + 0.005*"implementation"
2022-04-22 17:44:29,141 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"signal" + 0.011*"stimulus" + 0.010*"response" + 0.008*"firing" + 0.007*"noise" + 0.006*"frequency" + 0.006*"channel"
2022-04-22 17:44:29,142 : INFO : topic #4 (0.052): 0.009*"matrix" + 0.006*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimization" + 0.003*"optimal"
2022-04-22 17:44:29,142 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.005*"generalization" + 0.004*"hidden_unit" + 0.004*"sequence" + 0.004*"prediction" + 0.004*"net" + 0.004*"tree" + 0.003*"machine" + 0.003*"trained"
2022-04-22 17:44:29,142 : INFO : topic #3 (0.061): 0.007*"class" + 0.006*"gaussian" + 0.005*"classifier" + 0.005*"sample" + 0.005*"estimate" + 0.005*"classification" + 0.005*"density" + 0.004*"prior" + 0.004*"mixture" + 0.004*"bayesian"
2022-04-22 17:44:29,142 : INFO : topic diff=0.153485, rho=0.277350
2022-04-22 17:44:29,150 : INFO : PROGRESS: pass 12, at document #1740/1740
2022-04-22 17:44:33,545 : INFO : optimized alpha [0.04036388, 0.03635188, 0.038611963, 0.061483774, 0.05345723, 0.040894084, 0.04736741, 0.03211178, 0.05297828, 0.03274891]
2022-04-22 17:44:33,551 : INFO : topic #7 (0.032): 0.013*"circuit" + 0.013*"chip" + 0.011*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.007*"voltage" + 0.007*"memory" + 0.006*"vlsi" + 0.005*"implementation" + 0.005*"noise"
2022-04-22 17:44:33,552 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"signal" + 0.011*"stimulus" + 0.011*"response" + 0.009*"firing" + 0.007*"noise" + 0.006*"frequency" + 0.006*"channel"
2022-04-22 17:44:33,552 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"sequence" + 0.004*"prediction" + 0.004*"net" + 0.004*"tree" + 0.003*"machine" + 0.003*"trained"
2022-04-22 17:44:33,552 : INFO : topic #4 (0.053): 0.009*"matrix" + 0.006*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.003*"optimization" + 0.003*"optimal"
2022-04-22 17:44:33,552 : INFO : topic #3 (0.061): 0.007*"class" + 0.006*"gaussian" + 0.005*"sample" + 0.005*"classifier" + 0.005*"estimate" + 0.005*"density" + 0.005*"classification" + 0.004*"prior" + 0.004*"mixture" + 0.004*"bayesian"
2022-04-22 17:44:33,553 : INFO : topic diff=0.143831, rho=0.267261
2022-04-22 17:44:33,562 : INFO : PROGRESS: pass 13, at document #1740/1740
2022-04-22 17:44:39,235 : INFO : optimized alpha [0.040587135, 0.03631959, 0.03839379, 0.061911535, 0.05453887, 0.041285977, 0.047773384, 0.032027513, 0.05315258, 0.03261802]
2022-04-22 17:44:39,246 : INFO : topic #7 (0.032): 0.014*"circuit" + 0.013*"chip" + 0.011*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.007*"memory" + 0.006*"vlsi" + 0.005*"implementation" + 0.005*"noise"
2022-04-22 17:44:39,246 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"signal" + 0.011*"stimulus" + 0.011*"response" + 0.009*"firing" + 0.007*"noise" + 0.007*"frequency" + 0.006*"channel"
2022-04-22 17:44:39,247 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"sequence" + 0.004*"prediction" + 0.004*"net" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:44:39,247 : INFO : topic #4 (0.055): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.003*"optimization" + 0.003*"optimal"
2022-04-22 17:44:39,247 : INFO : topic #3 (0.062): 0.007*"class" + 0.006*"gaussian" + 0.005*"sample" + 0.005*"classifier" + 0.005*"estimate" + 0.005*"density" + 0.005*"classification" + 0.004*"prior" + 0.004*"mixture" + 0.004*"bayesian"
2022-04-22 17:44:39,248 : INFO : topic diff=0.134602, rho=0.258199
2022-04-22 17:44:39,258 : INFO : PROGRESS: pass 14, at document #1740/1740
2022-04-22 17:44:46,319 : INFO : optimized alpha [0.040821876, 0.036360793, 0.03824259, 0.062456302, 0.055688635, 0.041737743, 0.048259463, 0.032020763, 0.05343126, 0.03254091]
2022-04-22 17:44:46,325 : INFO : topic #7 (0.032): 0.014*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.007*"memory" + 0.006*"vlsi" + 0.005*"implementation" + 0.005*"noise"
2022-04-22 17:44:46,326 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.007*"noise" + 0.007*"frequency" + 0.006*"channel"
2022-04-22 17:44:46,327 : INFO : topic #8 (0.053): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"sequence" + 0.004*"prediction" + 0.004*"net" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:44:46,327 : INFO : topic #4 (0.056): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimal" + 0.003*"optimization"
2022-04-22 17:44:46,327 : INFO : topic #3 (0.062): 0.007*"class" + 0.006*"gaussian" + 0.005*"sample" + 0.005*"classifier" + 0.005*"estimate" + 0.005*"density" + 0.004*"mixture" + 0.004*"classification" + 0.004*"prior" + 0.004*"bayesian"
2022-04-22 17:44:46,328 : INFO : topic diff=0.125871, rho=0.250000
2022-04-22 17:44:46,338 : INFO : PROGRESS: pass 15, at document #1740/1740
2022-04-22 17:44:53,655 : INFO : optimized alpha [0.04109236, 0.036467522, 0.0381424, 0.06306473, 0.056903645, 0.04227092, 0.04874864, 0.032058466, 0.053792715, 0.03251973]
2022-04-22 17:44:53,666 : INFO : topic #7 (0.032): 0.014*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.010*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.007*"memory" + 0.006*"vlsi" + 0.005*"implementation" + 0.005*"bit"
2022-04-22 17:44:53,666 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"spike" + 0.014*"cell" + 0.011*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.007*"frequency" + 0.007*"noise" + 0.006*"channel"
2022-04-22 17:44:53,667 : INFO : topic #8 (0.054): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"prediction" + 0.004*"sequence" + 0.004*"net" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:44:53,667 : INFO : topic #4 (0.057): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimal" + 0.003*"optimization"
2022-04-22 17:44:53,667 : INFO : topic #3 (0.063): 0.007*"class" + 0.006*"gaussian" + 0.005*"sample" + 0.005*"estimate" + 0.005*"classifier" + 0.005*"density" + 0.005*"mixture" + 0.005*"prior" + 0.004*"classification" + 0.004*"bayesian"
2022-04-22 17:44:53,667 : INFO : topic diff=0.117670, rho=0.242536
2022-04-22 17:44:53,679 : INFO : PROGRESS: pass 16, at document #1740/1740
2022-04-22 17:45:00,393 : INFO : optimized alpha [0.041376065, 0.03660367, 0.0380804, 0.06374838, 0.058118302, 0.0428449, 0.049285352, 0.03212048, 0.054208644, 0.032528903]
2022-04-22 17:45:00,403 : INFO : topic #7 (0.032): 0.014*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.011*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.008*"memory" + 0.006*"vlsi" + 0.006*"implementation" + 0.005*"bit"
2022-04-22 17:45:00,403 : INFO : topic #9 (0.033): 0.021*"neuron" + 0.014*"cell" + 0.014*"spike" + 0.012*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.007*"frequency" + 0.007*"noise" + 0.007*"channel"
2022-04-22 17:45:00,404 : INFO : topic #8 (0.054): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"prediction" + 0.004*"sequence" + 0.004*"net" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:45:00,404 : INFO : topic #4 (0.058): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimal" + 0.003*"optimization"
2022-04-22 17:45:00,404 : INFO : topic #3 (0.064): 0.007*"class" + 0.006*"gaussian" + 0.006*"sample" + 0.005*"estimate" + 0.005*"classifier" + 0.005*"density" + 0.005*"mixture" + 0.005*"prior" + 0.004*"classification" + 0.004*"bayesian"
2022-04-22 17:45:00,405 : INFO : topic diff=0.109988, rho=0.235702
2022-04-22 17:45:00,416 : INFO : PROGRESS: pass 17, at document #1740/1740
2022-04-22 17:45:09,386 : INFO : optimized alpha [0.041690826, 0.036777373, 0.038074017, 0.06447209, 0.059317604, 0.043464534, 0.04985148, 0.032209247, 0.05470903, 0.032565065]
2022-04-22 17:45:09,400 : INFO : topic #7 (0.032): 0.014*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.011*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.008*"memory" + 0.006*"vlsi" + 0.006*"implementation" + 0.005*"bit"
2022-04-22 17:45:09,400 : INFO : topic #9 (0.033): 0.020*"neuron" + 0.014*"cell" + 0.014*"spike" + 0.012*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.007*"frequency" + 0.007*"noise" + 0.007*"channel"
2022-04-22 17:45:09,401 : INFO : topic #8 (0.055): 0.008*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.004*"hidden_unit" + 0.004*"prediction" + 0.004*"sequence" + 0.004*"net" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:45:09,401 : INFO : topic #4 (0.059): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"let" + 0.004*"minimum" + 0.004*"eq" + 0.004*"optimal" + 0.003*"optimization"
2022-04-22 17:45:09,402 : INFO : topic #3 (0.064): 0.007*"class" + 0.006*"gaussian" + 0.006*"sample" + 0.005*"estimate" + 0.005*"classifier" + 0.005*"density" + 0.005*"mixture" + 0.005*"prior" + 0.004*"classification" + 0.004*"bayesian"
2022-04-22 17:45:09,402 : INFO : topic diff=0.102916, rho=0.229416
2022-04-22 17:45:09,423 : INFO : PROGRESS: pass 18, at document #1740/1740
2022-04-22 17:45:19,067 : INFO : optimized alpha [0.042022552, 0.037017036, 0.038090236, 0.06523256, 0.06052085, 0.044076443, 0.050475497, 0.03232651, 0.055261094, 0.032642473]
2022-04-22 17:45:19,077 : INFO : topic #7 (0.032): 0.015*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.011*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.008*"memory" + 0.006*"vlsi" + 0.006*"implementation" + 0.005*"bit"
2022-04-22 17:45:19,077 : INFO : topic #9 (0.033): 0.020*"neuron" + 0.014*"cell" + 0.014*"spike" + 0.012*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.008*"frequency" + 0.007*"noise" + 0.007*"channel"
2022-04-22 17:45:19,078 : INFO : topic #8 (0.055): 0.009*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.005*"hidden_unit" + 0.004*"prediction" + 0.004*"net" + 0.004*"sequence" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:45:19,078 : INFO : topic #4 (0.061): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"minimum" + 0.004*"let" + 0.004*"eq" + 0.004*"optimal" + 0.003*"optimization"
2022-04-22 17:45:19,078 : INFO : topic #3 (0.065): 0.007*"class" + 0.007*"gaussian" + 0.006*"sample" + 0.005*"estimate" + 0.005*"density" + 0.005*"classifier" + 0.005*"mixture" + 0.005*"prior" + 0.004*"classification" + 0.004*"likelihood"
2022-04-22 17:45:19,079 : INFO : topic diff=0.096362, rho=0.223607
2022-04-22 17:45:19,090 : INFO : PROGRESS: pass 19, at document #1740/1740
2022-04-22 17:45:26,202 : INFO : optimized alpha [0.042380035, 0.037280142, 0.03813037, 0.06597655, 0.0617652, 0.044686105, 0.051100377, 0.032451425, 0.05581024, 0.03274816]
2022-04-22 17:45:26,210 : INFO : topic #7 (0.032): 0.015*"circuit" + 0.013*"chip" + 0.012*"analog" + 0.011*"neuron" + 0.008*"signal" + 0.008*"voltage" + 0.008*"memory" + 0.006*"vlsi" + 0.006*"implementation" + 0.005*"bit"
2022-04-22 17:45:26,210 : INFO : topic #9 (0.033): 0.020*"neuron" + 0.015*"cell" + 0.014*"spike" + 0.012*"signal" + 0.011*"response" + 0.011*"stimulus" + 0.009*"firing" + 0.008*"frequency" + 0.007*"noise" + 0.007*"channel"
2022-04-22 17:45:26,210 : INFO : topic #8 (0.056): 0.009*"rule" + 0.006*"hidden" + 0.006*"generalization" + 0.005*"hidden_unit" + 0.004*"prediction" + 0.004*"net" + 0.004*"sequence" + 0.004*"tree" + 0.004*"machine" + 0.003*"trained"
2022-04-22 17:45:26,211 : INFO : topic #4 (0.062): 0.009*"matrix" + 0.007*"gradient" + 0.006*"solution" + 0.005*"convergence" + 0.004*"distance" + 0.004*"minimum" + 0.004*"let" + 0.004*"eq" + 0.004*"optimal" + 0.003*"energy"
2022-04-22 17:45:26,211 : INFO : topic #3 (0.066): 0.007*"class" + 0.007*"gaussian" + 0.006*"sample" + 0.005*"estimate" + 0.005*"density" + 0.005*"mixture" + 0.005*"classifier" + 0.005*"prior" + 0.004*"likelihood" + 0.004*"bayesian"
2022-04-22 17:45:26,211 : INFO : topic diff=0.090311, rho=0.218218
2022-04-22 17:45:26,222 : INFO : LdaModel lifecycle event {'msg': 'trained LdaModel<num_terms=8644, num_topics=10, decay=0.5, chunksize=2000> in 141.08s', 'datetime': '2022-04-22T17:45:26.222157', 'gensim': '4.1.3.dev0', 'python': '3.9.7 (default, Sep  3 2021, 12:37:55) \n[Clang 12.0.5 (clang-1205.0.22.9)]', 'platform': 'macOS-11.6.5-x86_64-i386-64bit', 'event': 'created'}

We can compute the coherence of each topic. Below we display the average topic coherence and print the topics in order of topic coherence.

Note that we use the “Umass” topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()). Gensim also has an implementation of the “AKSW” topic coherence measure (see the accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/); a sketch of how to compute it follows the topic listing below.

If you are familiar with the subject of the articles in this dataset, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods on the blog at http://rare-technologies.com/lda-training-tips/!

top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Out:

2022-04-22 17:45:28,224 : INFO : CorpusAccumulator accumulated stats from 1000 documents
Average topic coherence: -1.2010.
[([(0.009335279, 'matrix'),
   (0.006810243, 'gradient'),
   (0.0058767716, 'solution'),
   (0.0050566536, 'convergence'),
   (0.0043554083, 'distance'),
   (0.004101262, 'minimum'),
   (0.0040506367, 'let'),
   (0.0039807004, 'eq'),
   (0.0038555989, 'optimal'),
   (0.0034886731, 'energy'),
   (0.0034828722, 'optimization'),
   (0.0034504435, 'condition'),
   (0.0033918922, 'approximation'),
   (0.0033640305, 'descent'),
   (0.0032366295, 'constraint'),
   (0.0032220806, 'xi'),
   (0.003061566, 'stochastic'),
   (0.0029803582, 'component'),
   (0.0028803074, 'dynamic'),
   (0.00280652, 'graph')],
  -1.0314809310847135),
 ([(0.006758064, 'class'),
   (0.006583767, 'gaussian'),
   (0.005633773, 'sample'),
   (0.0053001167, 'estimate'),
   (0.0049426625, 'density'),
   (0.0048573534, 'mixture'),
   (0.004835742, 'classifier'),
   (0.0046612574, 'prior'),
   (0.004377199, 'likelihood'),
   (0.004344127, 'bayesian'),
   (0.0043293545, 'classification'),
   (0.0037983125, 'regression'),
   (0.0037747815, 'noise'),
   (0.003772593, 'log'),
   (0.0037171794, 'kernel'),
   (0.003717116, 'approximation'),
   (0.0037102823, 'variance'),
   (0.0034671598, 'component'),
   (0.0032801689, 'posterior'),
   (0.003173915, 'em')],
  -1.0736087121706135),
 ([(0.02519838, 'image'),
   (0.013268676, 'object'),
   (0.011446378, 'visual'),
   (0.009458303, 'field'),
   (0.008084482, 'motion'),
   (0.006914001, 'direction'),
   (0.0060067754, 'map'),
   (0.0055346545, 'position'),
   (0.004941865, 'pixel'),
   (0.004847295, 'spatial'),
   (0.0047093197, 'face'),
   (0.0046589067, 'eye'),
   (0.0046168645, 'location'),
   (0.0043804147, 'filter'),
   (0.0042905244, 'response'),
   (0.0041273055, 'view'),
   (0.0040860246, 'orientation'),
   (0.0038862277, 'receptive'),
   (0.0038229467, 'human'),
   (0.0038166828, 'recognition')],
  -1.101159857337566),
 ([(0.015339, 'layer'),
   (0.014894987, 'node'),
   (0.010977563, 'net'),
   (0.0097472165, 'hidden'),
   (0.0075573265, 'threshold'),
   (0.006544599, 'class'),
   (0.006098466, 'bound'),
   (0.005063979, 'activation'),
   (0.0047261445, 'dimension'),
   (0.0046081766, 'hidden_unit'),
   (0.004463069, 'theorem'),
   (0.0043413443, 'region'),
   (0.0040992484, 'polynomial'),
   (0.003927951, 'propagation'),
   (0.003906715, 'hidden_layer'),
   (0.003902104, 'back'),
   (0.0034719643, 'let'),
   (0.0034161368, 'bit'),
   (0.0033824549, 'connection'),
   (0.003204875, 'back_propagation')],
  -1.1578264561349325),
 ([(0.020037105, 'neuron'),
   (0.01450755, 'cell'),
   (0.014472483, 'spike'),
   (0.011981914, 'signal'),
   (0.011293252, 'response'),
   (0.010934215, 'stimulus'),
   (0.008777942, 'firing'),
   (0.0077151447, 'frequency'),
   (0.007196151, 'noise'),
   (0.006772501, 'channel'),
   (0.004612463, 'temporal'),
   (0.0043820725, 'auditory'),
   (0.0043365704, 'activity'),
   (0.0040383274, 'sound'),
   (0.004009629, 'potential'),
   (0.0039981017, 'correlation'),
   (0.0038944164, 'fig'),
   (0.0036725644, 'train'),
   (0.0034477867, 'firing_rate'),
   (0.0033127973, 'source')],
  -1.175461993278655),
 ([(0.015848655, 'neuron'),
   (0.015059427, 'cell'),
   (0.009022958, 'activity'),
   (0.008109199, 'connection'),
   (0.008041161, 'synaptic'),
   (0.0057249856, 'memory'),
   (0.0053059673, 'cortex'),
   (0.0050525647, 'dynamic'),
   (0.0047387453, 'cortical'),
   (0.004596282, 'simulation'),
   (0.004441938, 'inhibitory'),
   (0.004316362, 'phase'),
   (0.004202166, 'response'),
   (0.004129471, 'excitatory'),
   (0.0041026585, 'attractor'),
   (0.0036624784, 'synapsis'),
   (0.003452054, 'fig'),
   (0.003326298, 'interaction'),
   (0.003292976, 'layer'),
   (0.003188004, 'oscillator')],
  -1.224961800422038),
 ([(0.014448352, 'control'),
   (0.011206106, 'action'),
   (0.008610181, 'policy'),
   (0.0073960284, 'reinforcement'),
   (0.0071460134, 'dynamic'),
   (0.006695718, 'trajectory'),
   (0.006001844, 'optimal'),
   (0.005919467, 'controller'),
   (0.005142686, 'robot'),
   (0.0049040187, 'reinforcement_learning'),
   (0.004231131, 'environment'),
   (0.0038927419, 'reward'),
   (0.0036765926, 'goal'),
   (0.0032516345, 'forward'),
   (0.0029738136, 'arm'),
   (0.0029553284, 'adaptive'),
   (0.0029314642, 'sutton'),
   (0.0029179594, 'position'),
   (0.0028270711, 'path'),
   (0.002815493, 'motor')],
  -1.280662748184417),
 ([(0.01465422, 'circuit'),
   (0.0134508265, 'chip'),
   (0.012013224, 'analog'),
   (0.010762642, 'neuron'),
   (0.008197728, 'signal'),
   (0.007833759, 'voltage'),
   (0.0075949323, 'memory'),
   (0.0062134205, 'vlsi'),
   (0.005665418, 'implementation'),
   (0.00510467, 'bit'),
   (0.004741555, 'noise'),
   (0.004108878, 'processor'),
   (0.004068751, 'pulse'),
   (0.00402028, 'digital'),
   (0.003979967, 'design'),
   (0.0037854807, 'hardware'),
   (0.0036803125, 'transistor'),
   (0.0036066298, 'block'),
   (0.0035669305, 'device'),
   (0.0035628842, 'synapse')],
  -1.2836262379148498),
 ([(0.016415589, 'recognition'),
   (0.0136875985, 'speech'),
   (0.01258169, 'word'),
   (0.0104766805, 'hidden'),
   (0.0063662766, 'layer'),
   (0.0061339615, 'character'),
   (0.0056002084, 'trained'),
   (0.005490037, 'context'),
   (0.0051139165, 'sequence'),
   (0.004984547, 'architecture'),
   (0.004967922, 'hmm'),
   (0.004862166, 'speaker'),
   (0.004366162, 'net'),
   (0.0042531807, 'digit'),
   (0.0039046167, 'classification'),
   (0.0037942464, 'class'),
   (0.0037750585, 'frame'),
   (0.00358875, 'mixture'),
   (0.003476494, 'phoneme'),
   (0.0034512014, 'letter')],
  -1.323380921633785),
 ([(0.008542947, 'rule'),
   (0.00631226, 'hidden'),
   (0.00597873, 'generalization'),
   (0.0045754625, 'hidden_unit'),
   (0.0043068537, 'prediction'),
   (0.0040594153, 'net'),
   (0.003990005, 'sequence'),
   (0.0038032297, 'tree'),
   (0.0035338537, 'machine'),
   (0.0034035398, 'trained'),
   (0.003242104, 'recurrent'),
   (0.0031919426, 'training_set'),
   (0.0029770972, 'table'),
   (0.0028571628, 'learn'),
   (0.0028489903, 'language'),
   (0.0028364619, 'target'),
   (0.0026097689, 'architecture'),
   (0.0025739158, 'string'),
   (0.0025172615, 'symbol'),
   (0.0024356844, 'teacher')],
  -1.3578438548773115)]
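
To also try the newer coherence measure mentioned above, Gensim’s CoherenceModel can score the trained model against the tokenized documents; the measure from the AKSW paper is available as coherence='c_v'. A rough sketch (this can be slow on larger corpora):

from gensim.models import CoherenceModel

# Score the trained model with the sliding-window "c_v" coherence measure,
# using the tokenized documents and the dictionary built earlier.
cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')
print('c_v coherence: %.4f' % cm.get_coherence())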

Things to experiment with

  • The no_above and no_below parameters in the filter_extremes method.

  • Adding trigrams or even higher-order n-grams (see the sketch after this list).

  • Consider whether using a hold-out set or cross-validation is the way to go for you.

  • Try other datasets.
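
As a starting point for the trigram item above, one possible sketch is to run Phrases a second time over the bigram-joined documents (ideally on the documents before the bigram tokens were appended; the parameter values are just examples):

from gensim.models import Phrases

# Train a second Phrases model on documents that already have bigrams joined,
# so that frequent bigram+word combinations become trigram tokens.
bigram = Phrases(docs, min_count=20)
trigram = Phrases(bigram[docs], min_count=20)
for idx in range(len(docs)):
    for token in trigram[bigram[docs[idx]]]:
        if token.count('_') >= 2:
            # Token is a trigram (or longer); add it to the document.
            docs[idx].append(token)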

Where to go from here

References

  1. “Latent Dirichlet Allocation”, Blei et al. 2003.

  2. “Online Learning for Latent Dirichlet Allocation”, Hoffman et al. 2010.

Total running time of the script: ( 4 minutes 13.971 seconds)

Estimated memory usage: 664 MB
