Tutorial on Mallet in Python

Radim Řehůřek · gensim, programming · 32 Comments

MALLET, “MAchine Learning for LanguagE Toolkit”, is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.

MALLET’s LDA

MALLET’s implementation of Latent Dirichlet Allocation has lots of things going for it.

It’s based on sampling, which is a more accurate fitting method than variational Bayes. Variational methods, such as the online VB inference implemented in gensim, are easier to parallelize and guaranteed to converge… but they essentially solve an approximate, aka more inaccurate, problem.

MALLET is not “yet another midterm assignment implementation of Gibbs sampling”. It contains cleverly optimized code, is threaded to support multicore computers and, importantly, has been battle-scarred by legions of humanities majors applying MALLET to literary studies.

Plus, it was written by David Mimno, a top expert in the field.

Gensim wrapper

I’ve wanted to include a similarly efficient sampling implementation of LDA in gensim for a long time, but never found the time/motivation. Ben Trahan, the author of the recent LDA hyperparameter optimization patch for gensim, is on the job.

In the meantime, I’ve added a simple wrapper around MALLET so it can be used directly from Python, following gensim’s API:

model = gensim.models.LdaMallet(path_to_mallet, corpus, num_topics=10, id2word=dictionary)
print(model[corpus])  # calculate & print topics of all documents in the corpus

And that’s it. The API is identical to the LdaModel class already in gensim, except that you must specify the path to the MALLET executable as its first parameter.

Check the LdaMallet API docs for setting other parameters, such as threading (faster training, but consumes more memory), the number of sampling iterations, etc.
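For example, a run with more worker threads and more sampling iterations might look like this (the `workers` and `iterations` keyword names are from the LdaMallet docs; the values here are just illustrative):

model = gensim.models.LdaMallet(
    path_to_mallet, corpus, num_topics=10, id2word=dictionary,
    workers=4,        # parallel training threads: faster, but more memory
    iterations=2000,  # number of Gibbs sampling iterations
)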

MALLET on Reuters

Let’s run a full end-to-end example.

NLTK includes several datasets we can use as our training corpus. In particular, the following assumes that the NLTK dataset “Reuters” can be found under /Users/kofola/nltk_data/corpora/reuters/training/:

# set up logging so we see what's going on
import logging
import os
from gensim import corpora, models, utils
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big string
        with open(os.path.join(reuters_dir, fname)) as document_file:
            document = document_file.read()
        # parse document into a list of utf8 tokens
        yield utils.simple_preprocess(document)

class ReutersCorpus(object):
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        self.dictionary.filter_extremes()  # remove stopwords etc

    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)

# set up the streamed corpus
corpus = ReutersCorpus('/Users/kofola/nltk_data/corpora/reuters/training/')
# INFO : adding document #0 to Dictionary(0 unique tokens: [])
# INFO : built Dictionary(24622 unique tokens: ['mdbl', 'fawc', 'degussa', 'woods', 'hanging']...) from 7769 documents (total 938238 corpus positions)
# INFO : keeping 7203 tokens which were in no less than 5 and no more than 3884 (=50.0%) documents
# INFO : resulting dictionary: Dictionary(7203 unique tokens: ['yellow', 'four', 'resisted', 'cyprus', 'increase']...)

# train 10 LDA topics using MALLET
mallet_path = '/Users/kofola/Downloads/mallet-2.0.7/bin/mallet'
model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
# ...
# 0	5	spokesman ec government tax told european today companies president plan added made commission time statement chairman state national union
# 1	5	oil prices price production gas coffee crude market brazil international energy opec world petroleum bpd barrels producers day industry
# 2	5	trade japan japanese foreign economic officials united countries states official dollar agreement major told world yen bill house international
# 3	5	bank market rate stg rates exchange banks money interest dollar central week today fed term foreign dealers currency trading
# 4	5	tonnes wheat sugar mln export department grain corn agriculture week program year usda china soviet exports south sources crop
# 5	5	april march corp record cts dividend stock pay prior div board industries split qtly sets cash general share announced
# 6	5	pct billion year february january rose rise december fell growth compared earlier increase quarter current months month figures deficit
# 7	5	dlrs company mln year earnings sale quarter unit share gold sales expects reported results business canadian canada dlr operating
# 8	5	shares company group offer corp share stock stake acquisition pct common buy merger investment tender management bid outstanding purchase
# 9	5	mln cts net loss dlrs shr profit qtr year revs note oper sales avg shrs includes gain share tax
# 
# <1000> LL/token: -7.5002
# 
# Total time: 34 seconds

# now use the trained model to infer topics on a new document
doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
print(model[bow])  # print list of (topic id, topic weight) pairs
# [[(0, 0.0903954802259887),
#   (1, 0.13559322033898305),
#   (2, 0.11299435028248588),
#   (3, 0.0847457627118644),
#   (4, 0.11864406779661017),
#   (5, 0.0847457627118644),
#   (6, 0.0847457627118644),
#   (7, 0.10357815442561205),
#   (8, 0.09981167608286252),
#   (9, 0.0847457627118644)]]

Apparently topics #1 (oil&co) and #4 (wheat&co) got the highest weights, so it passes the sniff test.
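If you only want the single most likely topic for a document, take the max over the (topic id, topic weight) pairs. A minimal sketch, reusing `bow` from above (note the wrapper may wrap the result in an outer list, one entry per query document, as in the printout):

doc_topics = model[bow]
if doc_topics and isinstance(doc_topics[0], list):
    doc_topics = doc_topics[0]  # unwrap the outer list, if present
print(max(doc_topics, key=lambda pair: pair[1]))  # (1, 0.1355...) in the run above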

Note that this MALLET wrapper is new in gensim version 0.9.0, and is extremely rudimentary for the time being. It serializes the input (training corpus) into a file, spawns a Java process to run MALLET, then parses the output files that MALLET produces. Not very efficient, not very robust.
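Conceptually, the round trip looks something like the sketch below. This is illustrative only, not the wrapper’s actual code; the file names are made up, although `import-file`, `train-topics` and the flags shown are real MALLET subcommands and options:

import subprocess

def train_mallet(mallet_path, corpus_file, prefix, num_topics=10):
    # step 1: convert the plain-text corpus into MALLET's binary format
    subprocess.check_call([
        mallet_path, 'import-file',
        '--input', corpus_file,
        '--output', prefix + 'corpus.mallet',
        '--keep-sequence',  # LDA needs token sequences, not just a bag of counts
    ])
    # step 2: run Gibbs sampling; write per-document topic proportions to a text file
    subprocess.check_call([
        mallet_path, 'train-topics',
        '--input', prefix + 'corpus.mallet',
        '--num-topics', str(num_topics),
        '--output-doc-topics', prefix + 'doctopics.txt',
    ])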

Depending on how this wrapper is used/received, I may extend it in the future.

Or even better, try your hand at improving it yourself.


Comments 32

  1. Joris

    Another nice update! Keep ’em coming! Suggestion: Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012. 😉

      1. Artyom

        Ya, decided to clean it up a bit first and put my local version into a forked gensim. Will be ready in the next couple of days.

        I am also thinking about attempting a direct port of Blei’s DTM implementation, but I’m not sure about it yet.

  2. Alex Simes

    Great! Thanks for putting this together 🙂

    Is there a way to save the model to allow documents to be tested on it without retraining the whole thing?

    Cheers!

    1. Radim (Post Author)

      You’re welcome 🙂

      The best way to “save the model” is to specify the `prefix` parameter to the LdaMallet constructor:
      http://radimrehurek.com/gensim/models/wrappers/ldamallet.html#gensim.models.wrappers.ldamallet.LdaMallet

      By default, the data files for Mallet are stored in temp under a randomized name, so you’ll lose them after a restart. But if you set `prefix="/my/directory/mallet/"`, all Mallet files are stored there instead. Then you can continue using the model even after a restart.

      The Python model itself is saved/loaded using the standard `load()`/`save()` methods, like all models in gensim.
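      A minimal sketch (the directory path is just an example):

      model = models.LdaMallet(mallet_path, corpus, num_topics=10,
                               id2word=corpus.dictionary, prefix='/my/directory/mallet/')
      model.save('/my/directory/reuters_mallet.model')  # persist the Python object
      # ... later, even after a restart:
      model = models.LdaMallet.load('/my/directory/reuters_mallet.model')
      print(model[bow])  # works, because the MALLET data files live under `prefix`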

      Hope that helps,
      Radim

      1. Aashish Khadka

        I am facing a strange issue when loading a trained mallet model in Python. I am working in a Jupyter notebook. If I load the saved model within the same notebook where the model was trained and pass a new corpus, everything works fine and gives the correct output for the new text. However, if I load the saved model in a different notebook and pass a new corpus, regardless of the size of the new corpus, I get output for the training text.

        I wanted to try if setting prefix would solve this issue. I was able to train the model without any issue. Below is the code:
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=all_corpus, num_topics=num_topics, id2word=dictionary,
            prefix='C:\\Users\\axk0er8\\Sentiment_Analysis_Working\\NewsSentimentAnalysis\\mallet\\', random_seed=42)

        However, when I load the trained model I get the following error:
        RuntimeError: invalid doc topics format at line 2 in C:\\Users\\axk0er8\\Sentiment_Analysis_Working\\NewsSentimentAnalysis\\mallet\\doctopics.txt.infer

        Can you please help me understand this issue?

        1. Aashish Khadka

          Also, I tried the same code replacing ldamallet with gensim’s lda and it worked perfectly fine, regardless of whether I loaded the saved model in the same notebook or a different one.

  3. Kevin

    Do you know why I am getting the output this way?

    [[(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],
    [(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],

    1. Radim Rehurek (Post Author)

      Are you using the same input as in the tutorial?

      Maybe you passed in two queries, so you got two outputs?

      Send more info (versions of gensim, mallet, input, gist your logs, etc).

      1. Kevin

        texts = ["Human machine interface enterprise resource planning quality processing management. , ",
            " management processing quality enterprise resource planning systems is user interface management.",
            "human engineering testing of enterprise resource planning interface processing quality management",
            "nasty food dry desert poor staff good service cheap price bad location restaurant recommended",
            "amazing service good food excellent desert kind staff bad service high price good location highly recommended",
            "restaurant poor service bad food desert not recommended kind staff bad service high price good location"
        ]

        #adapted from Gensim tutorial

        id2word = corpora.Dictionary(texts)
        corpus = [id2word.doc2bow(text) for text in texts]

        path_to_mallet = "/Mallet/bin/mallet"

        model = gensim.models.wrappers.LdaMallet(path_to_mallet, corpus, num_topics=2, id2word=id2word)
        print model[corpus]

        #output
        [[(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)]]

        I don’t think this output is accurate. Can you identify the issue here? Thanks.

        1. Kevin

          Before creating the dictionary, I did tokenization (of course).
          # tokenize
          texts = [[word for word in document.lower().split() ] for document in texts]

  4. Stefan

    Is this supposed to work with Python 3? After making your sample compatible with Python 2/3, it runs under Python 2, but it throws an exception under Python 3.

    Traceback (most recent call last):
      File "demo.py", line 56, in
        print(model[bow])  # print list of (topic id, topic weight) pairs
      File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 173, in __getitem__
        result = list(self.read_doctopics(self.fdoctopics() + '.infer'))
      File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 254, in read_doctopics
        if lineno == 0 and line.startswith("#doc "):
    TypeError: startswith first arg must be bytes or a tuple of bytes, not str

  5. Sandy

    Hi Radim, thanks for the article.

    I am new to topic modelling and mallet.

    May I ask: do the “Gensim wrapper” and “MALLET on Reuters” sections go together, or are they two different things in this tutorial?

    2. Sandy

      Sorry, I meant: do I need to run them as 2 different files, or should I put the two things together and run them as a whole?

      1. Sandy

        I ran this Python file, which I took from your post.


      2. Sandy

        And I got this as the error. So I’m not sure: do I include the gensim wrapper in the same Python file, or what should I do next?

        C:\Python27\lib\site-packages\gensim\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
        warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
        2018-02-28 23:08:15,959 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
        2018-02-28 23:08:15,984 : INFO : built Dictionary(1131 unique tokens: [u'stock', u'all', u'concept', u'managed', u'forget']...) from 20 documents (total 4006 corpus positions)
        2018-02-28 23:08:15,986 : INFO : discarding 1050 tokens: [(u'ad', 2), (u'add', 3), (u'agains', 1), (u'always', 4), (u'and', 14), (u'annual', 1), (u'ask', 3), (u'bad', 2), (u'bar', 1), (u'before', 3)]...
        2018-02-28 23:08:15,987 : INFO : keeping 81 tokens which were in no less than 5 and no more than 10 (=50.0%) documents
        2018-02-28 23:08:15,989 : INFO : resulting dictionary: Dictionary(81 unique tokens: [u'all', u'since', u'help', u'just', u'then']...)
        Traceback (most recent call last):
          File "Topic.py", line 37, in
            model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
        AttributeError: 'module' object has no attribute 'LdaMallet'

        1. Joshua

          Sandy,
          First to answer your question:
          I had the same error (AttributeError: 'module' object has no attribute 'LdaMallet'). I looked in gensim/models and found that ldamallet.py is in the wrappers directory (https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/models/wrappers). So instead, use the following:
          from gensim.models import wrappers
          (I used: from gensim.models.wrappers import LdaMallet)

          Next, I noticed that your number of kept tokens is very small (81), since you’re using a small corpus. This may be appropriate, since those would be the most confidently distinctive words, but I’d use a lower no_below (to keep infrequent tokens) and possibly a higher no_above ratio, e.g. .filter_extremes(no_below=1, no_above=0.7).

          Finally, use self.model.save(model_filename) to save the model (you can then use load()), and self.model.show_topics(num_topics=-1) to get a list of all topics, so that you can see what each number corresponds to and which words represent each topic.
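          Putting those pieces together, a minimal sketch (assuming `mallet_path` and the `corpus` object from the tutorial above):

          from gensim.models.wrappers import LdaMallet

          model = LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
          model.save('reuters_mallet.model')              # persist the wrapper object
          model = LdaMallet.load('reuters_mallet.model')  # restore it later
          for topic in model.show_topics(num_topics=-1):  # every topic with its top words
              print(topic)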

  6. Raniem

    Hello.

    I would like to thank you for your great efforts.

    I have a question, if you don’t mind: is it normal that I get completely different topic models when using Mallet LDA and gensim LDA?

    I expect differences, but they seem to be very different when I tried them on my corpus. I have also compared them on the Reuters corpus; below are my model definitions and the top 10 topics for each model.

    model = models.wrappers.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
    gensim_model= gensim.models.ldamodel.LdaModel(corpus,num_topics=10,id2word=corpus.dictionary)

    There are some different parameters, like alpha I guess, but I am not sure if there is any other parameter that I have missed that made the results so different.

    ======================Mallet Topics====================

    [(0, '0.176*"dlr" + 0.041*"sale" + 0.041*"mln" + 0.032*"april" + 0.030*"march" + 0.027*"record" + 0.027*"quarter" + 0.026*"year" + 0.024*"earn" + 0.023*"dividend"'),
     (1, '0.016*"spokesman" + 0.014*"sai" + 0.013*"franc" + 0.012*"report" + 0.012*"state" + 0.012*"govern" + 0.011*"plan" + 0.011*"union" + 0.010*"offici" + 0.010*"todai"'),
     (2, '0.125*"pct" + 0.078*"billion" + 0.062*"year" + 0.030*"februari" + 0.030*"januari" + 0.024*"rise" + 0.021*"rose" + 0.019*"month" + 0.016*"increas" + 0.015*"compar"'),
     (3, '0.045*"trade" + 0.020*"japan" + 0.017*"offici" + 0.014*"countri" + 0.013*"meet" + 0.011*"japanes" + 0.011*"agreement" + 0.011*"import" + 0.011*"industri" + 0.010*"world"'),
     (4, '0.047*"compani" + 0.036*"corp" + 0.029*"unit" + 0.018*"sell" + 0.016*"approv" + 0.016*"acquisit" + 0.015*"complet" + 0.015*"busi" + 0.014*"merger" + 0.013*"agreement"'),
     (5, '0.076*"share" + 0.040*"stock" + 0.037*"offer" + 0.028*"group" + 0.027*"compani" + 0.016*"board" + 0.016*"sharehold" + 0.016*"common" + 0.016*"invest" + 0.015*"pct"'),
     (6, '0.056*"oil" + 0.043*"price" + 0.028*"product" + 0.014*"ga" + 0.013*"barrel" + 0.012*"crude" + 0.012*"gold" + 0.011*"year" + 0.011*"cost" + 0.010*"increas"'),
     (7, '0.041*"tonn" + 0.032*"export" + 0.023*"price" + 0.017*"produc" + 0.016*"wheat" + 0.013*"agricultur" + 0.013*"sugar" + 0.012*"grain" + 0.011*"week" + 0.011*"coffe"'),
     (8, '0.221*"mln" + 0.117*"ct" + 0.092*"net" + 0.087*"loss" + 0.067*"shr" + 0.056*"profit" + 0.044*"oper" + 0.038*"dlr" + 0.033*"qtr" + 0.033*"rev"'),
     (9, '0.067*"bank" + 0.039*"rate" + 0.030*"market" + 0.023*"dollar" + 0.017*"stg" + 0.016*"exchang" + 0.014*"currenc" + 0.013*"monei" + 0.011*"yen" + 0.011*"reserv"')]


    =======================Gensim Topics====================
    [(0, '0.028*"oil" + 0.015*"price" + 0.011*"meet" + 0.010*"dlr" + 0.008*"mln" + 0.008*"opec" + 0.008*"stock" + 0.007*"tax" + 0.007*"bpd" + 0.007*"product"'),
     (1, '0.062*"ct" + 0.031*"april" + 0.031*"record" + 0.023*"div" + 0.022*"pai" + 0.021*"qtly" + 0.021*"dividend" + 0.019*"prior" + 0.015*"march" + 0.014*"set"'),
     (2, '0.066*"mln" + 0.061*"dlr" + 0.060*"loss" + 0.051*"ct" + 0.049*"net" + 0.038*"shr" + 0.030*"year" + 0.028*"profit" + 0.026*"pct" + 0.020*"rev"'),
     (3, '0.032*"mln" + 0.031*"dlr" + 0.022*"compani" + 0.012*"bank" + 0.012*"stg" + 0.011*"year" + 0.010*"sale" + 0.010*"unit" + 0.009*"corp" + 0.008*"market"'),
     (4, '0.049*"bank" + 0.025*"rate" + 0.022*"pct" + 0.011*"billion" + 0.010*"reserv" + 0.009*"market" + 0.008*"central" + 0.008*"gold" + 0.008*"monei" + 0.007*"februari"'),
     (5, '0.023*"share" + 0.022*"dlr" + 0.015*"compani" + 0.015*"stock" + 0.011*"offer" + 0.011*"trade" + 0.009*"billion" + 0.008*"pct" + 0.006*"agreement" + 0.006*"debt"'),
     (6, '0.016*"trade" + 0.015*"pct" + 0.011*"year" + 0.009*"price" + 0.009*"export" + 0.008*"market" + 0.007*"japan" + 0.007*"industri" + 0.007*"govern" + 0.006*"import"'),
     (7, '0.109*"mln" + 0.048*"billion" + 0.028*"net" + 0.025*"year" + 0.025*"dlr" + 0.020*"ct" + 0.017*"shr" + 0.013*"profit" + 0.011*"sale" + 0.009*"pct"'),
     (8, '0.030*"mln" + 0.029*"pct" + 0.024*"share" + 0.024*"tonn" + 0.011*"dlr" + 0.010*"year" + 0.010*"stock" + 0.010*"offer" + 0.009*"tender" + 0.009*"corp"'),
     (9, '0.010*"grain" + 0.010*"tonn" + 0.010*"corn" + 0.009*"year" + 0.009*"ton" + 0.008*"strike" + 0.008*"union" + 0.008*"report" + 0.008*"compani" + 0.008*"wheat"')]

  7. Shiks

    Hey, I am getting an error:

    "Error: Could not find or load main class cc.mallet.classify.tui.Csv2Vectors.java"

    How do I correct this error? Please help me out with it. Thank you.

  8. Bhavana Malepaty

    mallet_path = '/home/hp/Downloads/mallet-2.0.8/bin/mallet'  # update this path
    #ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=dictionary)
    ldamallet = models.wrappers.LdaMallet(mallet_path, corpus, num_topics=5, id2word=dictionary)

    On doing this, I get an error:
    CalledProcessError: Command '/home/hp/Downloads/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/95d303_corpus.txt --output /tmp/95d303_corpus.mallet' returned non-zero exit status 127.

    Can you please help me debug this?

