
scripts.make_wiki_online – Convert articles from a Wikipedia dump

Convert articles from a Wikipedia dump to (sparse) vectors. The input is a bz2-compressed dump of Wikipedia articles, in XML format.

This actually creates four files:

  • OUTPUT_PREFIX_wordids.txt: mapping between words and their integer ids
  • OUTPUT_PREFIX_bow.mm: bag-of-words (word counts) representation, in Matrix Market format
  • OUTPUT_PREFIX_tfidf.mm: TF-IDF representation, in Matrix Market format
  • OUTPUT_PREFIX.tfidf_model: TF-IDF model dump
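To make the relationship between these outputs concrete, here is a minimal pure-Python sketch (stdlib only, not gensim's actual implementation) of the three representations: the word-to-id mapping, the bag-of-words vectors, and the TF-IDF weighting. The toy documents and the log2 IDF formula are illustrative assumptions.

```python
from collections import Counter
from math import log

# Toy stand-in for Wikipedia articles (illustrative only).
docs = [
    ["human", "interface", "computer"],
    ["survey", "computer", "system", "interface"],
    ["graph", "trees", "graph"],
]

# 1) word -> integer id mapping (the role of OUTPUT_PREFIX_wordids.txt)
vocab = sorted({w for d in docs for w in d})
word2id = {w: i for i, w in enumerate(vocab)}
id2word = {i: w for w, i in word2id.items()}

# 2) bag-of-words: each document as sorted (word_id, count) pairs
bow = [sorted((word2id[w], c) for w, c in Counter(d).items()) for d in docs]

# 3) TF-IDF: weight each count by log2(num_docs / document_frequency)
n_docs = len(docs)
df = Counter(w for d in docs for w in set(d))
tfidf = [[(wid, c * log(n_docs / df[id2word[wid]], 2)) for wid, c in vec]
         for vec in bow]
```

Each sparse vector lists only the non-zero entries, which is what the Matrix Market files store on disk.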

The output Matrix Market files can then be compressed (e.g., by bzip2) to save disk space; gensim’s corpus iterators can work with compressed input, too.
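The compression round-trip can be sketched with the standard library alone: write a tiny Matrix Market file, bzip2-compress it, and stream it back line by line. The file name and matrix contents are illustrative, not produced by the script.

```python
import bz2
import os
import tempfile

# Illustrative path; the real files are named OUTPUT_PREFIX_bow.mm etc.
path = os.path.join(tempfile.gettempdir(), "toy_bow.mm.bz2")

mm_text = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 4\n"        # 2 documents, 3 terms, 4 non-zero entries
    "1 1 1.0\n"
    "1 2 2.0\n"
    "2 1 1.0\n"
    "2 3 1.0\n"
)

# Compress on write...
with bz2.open(path, "wt") as fout:
    fout.write(mm_text)

# ...and decompress transparently on read, one line at a time.
with bz2.open(path, "rt") as fin:
    lines = fin.read().splitlines()
```

Streaming the compressed file line by line is the same access pattern a corpus iterator needs, so nothing has to be decompressed to disk first.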

VOCABULARY_SIZE controls how many of the most frequent words to keep (after removing tokens that appear in more than 10% of all documents). Defaults to 100,000.
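The pruning rule above can be sketched as follows. This is an illustrative stdlib-only sketch, not gensim's internal code; the function name and parameters (`trim_vocabulary`, `max_doc_ratio`) are hypothetical.

```python
from collections import Counter

def trim_vocabulary(docs, vocabulary_size=100_000, max_doc_ratio=0.10):
    """Drop tokens appearing in more than max_doc_ratio of documents,
    then keep only the vocabulary_size most frequent of the rest."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))   # document frequency
    total_freq = Counter(w for d in docs for w in d)      # collection frequency
    kept = [w for w in total_freq if doc_freq[w] <= max_doc_ratio * n_docs]
    kept.sort(key=lambda w: (-total_freq[w], w))          # most frequent first
    return kept[:vocabulary_size]
```

The two-step order matters: very common words ("the", "of", ...) are removed by the document-frequency cutoff first, so they cannot crowd out rarer content words in the top-N selection.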

If you have the pattern package installed, this script will use its lemmatizer to obtain a lemma for each token, instead of the plain alphabetic tokenizer.

Example:

python -m gensim.scripts.make_wikicorpus ~/gensim/results/enwiki-latest-pages-articles.xml.bz2 ~/gensim/results/wiki