corpora.wikicorpus – Corpus from a Wikipedia dump

Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the pattern package installed, this module will use its lemmatization to obtain a lemma for each token, instead of the plain alphabetic tokenizer.

See scripts/ for a canned (example) script based on this module.

class gensim.corpora.wikicorpus.WikiCorpus(fname, processes=None, lemmatize=False, dictionary=None, filter_namespaces=('0', ), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True)

Bases: gensim.corpora.textcorpus.TextCorpus

Treat a Wikipedia articles dump (<LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2) as a read-only corpus.

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

Note: “multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.

>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
>>> MmCorpus.serialize('', wiki) # another 8h, creates a file in MatrixMarket format and mapping

Initialize the corpus. Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

If the pattern package is installed, its shallow parsing is used to obtain token lemmas; otherwise, simple regexp tokenization is used. You can override this automatic choice by setting the lemmatize parameter explicitly. If self.metadata is set to True, serialize will also write out article titles to a pickle file.

Set article_min_tokens as the minimum threshold for article token count (default 50). Any article below this length is ignored.

Set tokenizer_func (defaults to tokenize) to a custom function reference to control tokenization; otherwise the default regexp tokenization is used. Set this parameter for languages like Japanese or Thai to get better tokenization. The tokenizer_func must accept 4 parameters: (text, token_min_len, token_max_len, lower). By default, the parameter values are those configured on the class instance.

Set lower to control whether all text is converted to lowercase (default True).

Set token_min_len, token_max_len as thresholds for token lengths that are returned (default to 2 and 15).
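The 4-parameter tokenizer_func contract can be illustrated with a minimal sketch. The function name and regexp below are illustrative assumptions, not part of gensim:

```python
import re

def word_tokenizer(text, token_min_len, token_max_len, lower):
    """Hypothetical tokenizer_func obeying the documented 4-parameter
    contract: (text, token_min_len, token_max_len, lower)."""
    if lower:
        text = text.lower()
    # Keep only tokens within the configured character-length bounds.
    return [
        token for token in re.findall(r'\w+', text, re.UNICODE)
        if token_min_len <= len(token) <= token_max_len
    ]

# It could then be wired in (filename illustrative):
# wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
#                   tokenizer_func=word_tokenizer)
```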


get_texts()

Iterate over the dump, returning the text version of each article as a list of tokens.

Only articles of sufficient length are returned (short articles, redirects, etc. are ignored). This is controlled by article_min_tokens on the class instance.

Note that this iterates over the texts; if you want vectors, just use the standard corpus interface instead of this function:

>>> for vec in wiki_corpus:
...     print(vec)

getstream()

Yield documents from the underlying plain text collection (of one or more files). Each item yielded from this method will be considered a document by subsequent preprocessing methods.


init_dictionary(dictionary)

If dictionary is None, initialize to an empty Dictionary, and then if there is an input for the corpus, add all documents from that input. If the dictionary is already initialized, simply set it as the corpus's dictionary.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.


preprocess_text(text)

Apply preprocessing to a single text document. This should perform tokenization in addition to any other desired preprocessing steps.

Parameters: text (str) – document text read from a plain-text file.
Returns: tokens produced from text as a result of preprocessing.
Return type: iterable of str
sample_texts(n, seed=None, length=None)

Yield n random documents from the corpus without replacement.

Given the number of remaining documents in the corpus, we need to choose n elements. The probability of the current element being chosen is n/remaining. If it is chosen, we decrement n and move on to the next element. Computing the corpus length may be a costly operation, so you can pass it via the optional length parameter instead.

Parameters:

  • n (int) – number of documents we want to sample.
  • seed (int|None) – if specified, use it as a seed for the local random generator.
  • length (int|None) – if specified, use it as a guess of the corpus length. It must be positive and not greater than the actual corpus length.

Yields: list[str] – document represented as a list of tokens. See the get_texts method.


Raises: ValueError – when n is invalid or length was set incorrectly.
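The n/remaining selection scheme described above can be sketched in a few lines. This is a hypothetical stand-alone helper mirroring the documented algorithm, not gensim's actual implementation:

```python
import random

def sample_without_replacement(texts, n, length, seed=None):
    """Selection sampling: keep each document with probability
    n_remaining / docs_remaining, yielding exactly n documents
    when length matches the true corpus size."""
    rng = random.Random(seed)
    remaining = length
    picked = []
    for text in texts:
        if n <= 0:
            break
        if rng.random() < n / remaining:
            picked.append(text)
            n -= 1
        remaining -= 1
    return picked

docs = ['doc%d' % i for i in range(10)]
sample = sample_without_replacement(docs, 3, len(docs), seed=42)
```

Because the acceptance probability rises to 1 as remaining approaches n, the scheme always returns exactly n documents in a single pass, which is why a (correct) length guess lets it avoid computing the corpus length up front.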

save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, save_corpus is automatically called internally by serialize, which does save_corpus and saves the index at the same time, so you will usually want to store the corpus with:

>>> MmCorpus.serialize('', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().


step_through_preprocess(text)

Yield tuples of functions and their output for each stage of preprocessing. This is useful for debugging issues with the corpus preprocessing pipeline.

gensim.corpora.wikicorpus.extract_pages(f, filter_namespaces=False)

Extract pages from a MediaWiki database dump, provided as an open file-like object f.

Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
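The shape of those triplets can be illustrated with a toy parser over a schema-simplified, in-memory dump. This is only a sketch of the idea: the real extract_pages streams the XML incrementally and handles the versioned MediaWiki namespace, neither of which this toy version does:

```python
import xml.etree.ElementTree as ET

# A tiny, schema-simplified stand-in for a MediaWiki dump (illustrative only;
# real dumps use a versioned XML namespace and are bz2-compressed).
DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <ns>0</ns>
    <id>123</id>
    <revision><text>Some article text.</text></revision>
  </page>
</mediawiki>"""

def toy_extract_pages(xml_text, filter_namespaces=('0',)):
    """Yield (title, content, pageid) triplets, keeping only pages
    whose namespace is in filter_namespaces."""
    root = ET.fromstring(xml_text)
    for page in root.iter('page'):
        if filter_namespaces and page.findtext('ns') not in filter_namespaces:
            continue
        yield (page.findtext('title'),
               page.findtext('revision/text'),
               page.findtext('id'))

pages = list(toy_extract_pages(DUMP))
```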


gensim.corpora.wikicorpus.filter_wiki(raw)

Filter out wiki markup from raw, leaving only text. raw is either a unicode or utf-8 encoded string.


gensim.corpora.wikicorpus.get_namespace(tag)

Return the namespace of tag.


gensim.corpora.wikicorpus.init_to_ignore_interrupt()

Make worker processes ignore SIGINT. Should only be used when the master is prepared to handle termination of child processes.

gensim.corpora.wikicorpus.process_article(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)

Parse a Wikipedia article, returning its content as a list of tokens (utf8-encoded strings).

Set the tokenizer_func (defaults to tokenize) parameter for languages like Japanese or Thai to get better tokenization. The tokenizer_func must accept 4 parameters: (text, token_min_len, token_max_len, lower).


gensim.corpora.wikicorpus.remove_file(s)

Remove the 'File:' and 'Image:' markup, keeping the file caption.

Return a copy of s with all the 'File:' and 'Image:' markup replaced by their corresponding captions.


gensim.corpora.wikicorpus.remove_template(s)

Remove template wikimedia markup.

Return a copy of s with all the wikimedia markup templates removed.

Note: Since templates can be nested, it is difficult to remove them using regular expressions.
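One way around the nesting problem is a linear scan with an explicit depth counter instead of a regular expression. This is a hypothetical sketch of the idea, not gensim's actual implementation:

```python
def strip_templates(s):
    """Remove possibly-nested {{...}} template spans by tracking
    brace depth, copying through only text at depth 0."""
    out = []
    depth = 0
    i = 0
    while i < len(s):
        if s.startswith('{{', i):
            depth += 1
            i += 2
        elif s.startswith('}}', i) and depth:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(s[i])
            i += 1
    return ''.join(out)
```

A regexp like `{{[^{}]*}}` would have to be applied repeatedly from the innermost level outwards, whereas the depth counter handles arbitrary nesting in one pass.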

gensim.corpora.wikicorpus.tokenize(content, token_min_len=2, token_max_len=15, lower=True)

Tokenize a piece of text from Wikipedia. The input string content is assumed to be markup-free (see filter_wiki()).

Set token_min_len, token_max_len as character length (not bytes!) thresholds for individual tokens.

Return list of tokens as utf8 bytestrings.
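The character (not byte) semantics of the length thresholds can be demonstrated with a hypothetical stand-alone filter (an illustrative helper, not a gensim function):

```python
def char_len_filter(tokens, token_min_len=2, token_max_len=15):
    """Thresholds count characters, not bytes: a short non-ASCII token
    passes even though its utf8 encoding is longer than its length."""
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]

# 'straße' is 6 characters but 7 utf8 bytes; it is kept.
# 'a' is too short and a 16-character token is too long; both are dropped.
kept = char_len_filter(['a', 'ab', 'straße', 'x' * 16])
```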