Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.
If you have the pattern package installed, this module will use lemmatization to find the lemma of each token (instead of a plain alphabetic tokenizer). The package is available at https://github.com/clips/pattern .
See scripts/process_wiki.py for a canned (example) script based on this module.
Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
>>> wiki.saveAsText('wiki_en_vocab200k') # another 8h, creates a file in MatrixMarket format plus file with id->word
Initialize the corpus. Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.
If the pattern package is installed, use its fancier shallow parsing to get token lemmas. Otherwise, use simple regexp tokenization. You can override this automatic logic by setting the lemmatize parameter explicitly.
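The automatic selection described above can be sketched with a small stdlib-only helper (detect_lemmatizer is a hypothetical name for illustration, not part of this module):

```python
import importlib.util

def detect_lemmatizer(force=None):
    """Decide whether to lemmatize: an explicit `force` value wins;
    otherwise lemmatize only if the `pattern` package is importable."""
    if force is not None:
        return force
    # find_spec returns None when the package is not installed
    return importlib.util.find_spec("pattern") is not None
```

The real constructor performs the equivalent check at initialization time; passing the parameter explicitly simply bypasses the availability test.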
Iterate over the dump, returning text version of each article as a list of tokens.
Only articles of sufficient length are returned (short articles, redirects etc. are ignored).
Note that this iterates over the texts; if you want vectors, just use the standard corpus interface instead of this function:
>>> for vec in wiki_corpus:
...     print vec
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save an existing corpus to disk.
Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.
>>> MmCorpus.save_corpus('file.mm', corpus)
Some corpora also support an index of where each document begins, so that documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, serialize calls save_corpus internally and saves the index at the same time, so you will usually want to store the corpus with:
>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents
Calling serialize() is preferred to calling save_corpus().
Filter out wiki markup from raw, leaving only text. raw is either unicode or a utf-8 encoded string.
Parse a Wikipedia article, returning its content as a list of tokens (utf8-encoded strings).
Remove the 'File:' and 'Image:' markup, keeping the file caption.
Return a copy of s with all 'File:' and 'Image:' markup replaced by the corresponding captions. See http://www.mediawiki.org/wiki/Help:Images for the markup details.
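A minimal sketch of this substitution, assuming simple non-nested links where the caption is the text after the last '|' (remove_file here is an illustrative stand-in, not the actual implementation, which handles more edge cases):

```python
import re

# Matches [[File:...|...|caption]] or [[Image:...]]; group 1, if present,
# is the text after the last '|' (the caption).
FILE_RE = re.compile(r"\[\[(?:File|Image):[^\[\]]*?(?:\|([^|\[\]]*))?\]\]")

def remove_file(s):
    """Replace each File:/Image: link with its caption (or nothing)."""
    return FILE_RE.sub(lambda m: m.group(1) or "", s)
```

Captions that themselves contain nested links are not covered by this simple regexp, which is one reason the real markup filtering applies several passes.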
Remove template wikimedia markup.
Return a copy of s with all the wikimedia markup templates removed. See http://meta.wikimedia.org/wiki/Help:Template for details on wikimedia templates.
Note: since templates can be nested, it is difficult to remove them using regular expressions.
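Because regular expressions cannot match nested braces, one workable alternative is an iterative scan that tracks brace depth. A sketch (illustrative only; real template markup has more edge cases, e.g. braces inside <nowiki> sections):

```python
def remove_template(s):
    """Strip possibly nested {{...}} templates by counting brace depth;
    characters are kept only when we are outside every template."""
    out, depth, i = [], 0, 0
    while i < len(s):
        if s.startswith("{{", i):
            depth += 1
            i += 2
        elif s.startswith("}}", i) and depth:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(s[i])
            i += 1
    return "".join(out)
```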
Tokenize a piece of text from Wikipedia. The input string content is assumed to be mark-up free (see filter_wiki()).
Return a list of tokens as utf8 bytestrings. Ignore words shorter than 2 or longer than 15 characters (not bytes!).
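A minimal sketch of such a tokenizer, assuming a plain alphabetic regexp (the actual implementation may lowercase, handle accented characters and decode errors differently):

```python
import re

# Hedged assumption: tokens are maximal runs of ASCII letters.
TOKEN_RE = re.compile(r"[a-zA-Z]+")

def tokenize(content):
    """Return utf8 bytestrings for tokens of 2..15 characters."""
    return [t.encode("utf-8") for t in TOKEN_RE.findall(content)
            if 2 <= len(t) <= 15]
```

Note the length limits are applied to the character count before encoding, matching the "characters, not bytes" caveat above.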