corpora.wikicorpus – Corpus from a Wikipedia dump

Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes

See gensim.scripts.make_wiki for a canned (example) command-line script based on this module.

gensim.corpora.wikicorpus.ARTICLE_MIN_WORDS = 50

Ignore shorter articles (after full preprocessing).

gensim.corpora.wikicorpus.IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk']

MediaWiki namespaces that ought to be ignored.

gensim.corpora.wikicorpus.RE_P0 = re.compile('<!--.*?-->', re.DOTALL)

Comments.

gensim.corpora.wikicorpus.RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL)

Footnotes.
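As a small illustration, the comment and footnote patterns above can be applied with re.sub to strip that markup from a snippet. This is a standalone sketch: the patterns are re-declared with the standard re module rather than imported from gensim, and the sample string is made up.

```python
import re

# Patterns copied from the constants above, re-declared so the sketch is self-contained.
RE_P0 = re.compile(r'<!--.*?-->', re.DOTALL)                 # comments
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL)   # footnotes

# Made-up sample markup containing one comment and one footnote.
sample = 'Paris<!-- check this --> is the capital<ref name="a">Some source</ref> of France.'

without_comments = RE_P0.sub('', sample)
without_footnotes = RE_P1.sub('', without_comments)
print(without_footnotes)  # Paris is the capital of France.
```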

gensim.corpora.wikicorpus.RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL)

Math content.

gensim.corpora.wikicorpus.RE_P11 = re.compile('<(.*?)>', re.DOTALL)

All other tags.

gensim.corpora.wikicorpus.RE_P12 = re.compile('(({\\|)|(\\|-(?!\\d))|(\\|}))(.*?)(?=\\n)')

Table formatting.

gensim.corpora.wikicorpus.RE_P13 = re.compile('(?<=(\\n[ ])|(\\n\\n)|([ ]{2})|(.\\n)|(.\\t))(\\||\\!)([^[\\]\\n]*?\\|)*')

Table cell formatting.

gensim.corpora.wikicorpus.RE_P14 = re.compile('\\[\\[Category:[^][]*\\]\\]')

Categories.

gensim.corpora.wikicorpus.RE_P15 = re.compile('\\[\\[([fF]ile:|[iI]mage)[^]]*(\\]\\])')

Remove File and Image templates.

gensim.corpora.wikicorpus.RE_P16 = re.compile('\\[{2}(.*?)\\]{2}')

Capture the text of interlinks and the linked article.

gensim.corpora.wikicorpus.RE_P17 = re.compile('(\\n.{0,4}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))')

Table markup.

gensim.corpora.wikicorpus.RE_P2 = re.compile('(\\n\\[\\[[a-z][a-z][\\w-]*:[^:\\]]+\\]\\])+$')

Links to languages.

gensim.corpora.wikicorpus.RE_P3 = re.compile('{{([^}{]*)}}', re.DOTALL)

Template.

gensim.corpora.wikicorpus.RE_P4 = re.compile('{{([^}]*)}}', re.DOTALL)

Template.

gensim.corpora.wikicorpus.RE_P5 = re.compile('\\[(\\w+):\\/\\/(.*?)(( (.*?))|())\\]')

Remove URL, keep description.

gensim.corpora.wikicorpus.RE_P6 = re.compile('\\[([^][]*)\\|([^][]*)\\]', re.DOTALL)

Simplify links, keep description.
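A rough sketch of how this pattern might be used to keep only the description of a piped link. The pattern is copied from the constant above. The final bracket-stripping step is an assumption about what a later clean-up pass could do, not a statement about the library's internals.

```python
import re

# Pattern copied from RE_P6 above, re-declared so the sketch is self-contained.
RE_P6 = re.compile(r'\[([^][]*)\|([^][]*)\]', re.DOTALL)

sample = 'Written in [[Python (programming language)|Python]].'

# Replace "[target|description]" with the description (second group) only.
simplified = RE_P6.sub(r'\2', sample)
print(simplified)  # Written in [Python].

# Hypothetical follow-up step: drop the brackets that are left over.
plain = simplified.replace('[', '').replace(']', '')
print(plain)  # Written in Python.
```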

gensim.corpora.wikicorpus.RE_P7 = re.compile('\\n\\[\\[[iI]mage(.*?)(\\|.*?)*\\|(.*?)\\]\\]')

Keep description of images.

gensim.corpora.wikicorpus.RE_P8 = re.compile('\\n\\[\\[[fF]ile(.*?)(\\|.*?)*\\|(.*?)\\]\\]')

Keep description of files.

gensim.corpora.wikicorpus.RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL)

Text inside <nowiki> tags.

class gensim.corpora.wikicorpus.WikiCorpus(fname, processes=None, lemmatize=None, dictionary=None, filter_namespaces=('0', ), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True, filter_articles=None)

Bases: gensim.corpora.textcorpus.TextCorpus

Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.

Supported dump formats:

  • <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2

  • <LANG>wiki-latest-pages-articles.xml.bz2

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.
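The idea of reading pages straight from the compressed file can be sketched with the standard library alone. The helper name iter_page_titles and the toy XML below are assumptions for illustration. Real dumps use namespaced tags and are far larger, and this is not the library's own extraction code.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET

def iter_page_titles(path):
    """Yield page titles from a bz2-compressed XML file without unpacking it on disk."""
    with bz2.open(path, "rb") as stream:
        # iterparse decompresses and parses incrementally, one element at a time.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "page":
                yield elem.findtext("title")
                elem.clear()  # release the content of the processed page

# Toy stand-in for a dump (no XML namespace, two pages).
toy_xml = b"<mediawiki><page><title>A</title></page><page><title>B</title></page></mediawiki>"

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.xml.bz2")
    with bz2.open(toy_path, "wb") as out:
        out.write(toy_xml)
    titles = list(iter_page_titles(toy_path))

print(titles)  # ['A', 'B']
```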

Notes

Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.

metadata

Whether to write article titles to the serialized corpus.

Type

bool

Warning

“Multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.

Examples

>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>> corpus_path = get_tmpfile("wiki-corpus.mm")
>>>
>>> wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
>>> MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping

Initialize the corpus.

Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

Parameters
  • fname (str) – Path to the Wikipedia dump file.

  • processes (int, optional) – Number of processes to run, defaults to max(1, number of CPUs - 1).

  • dictionary (Dictionary, optional) – Dictionary to use. If not provided, this scans the corpus once to determine its vocabulary. IMPORTANT: this scan takes a very long time.

  • filter_namespaces (tuple of str, optional) – Namespaces to consider.

  • tokenizer_func (function, optional) – Function that will be used for tokenization. By default, use tokenize(). If you inject your own tokenizer, it must conform to this interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str

  • article_min_tokens (int, optional) – Minimum tokens in article. Article will be ignored if number of tokens is less.

  • token_min_len (int, optional) – Minimal token length.

  • token_max_len (int, optional) – Maximal token length.

  • lower (bool, optional) – If True, convert all text to lower case.

  • filter_articles (callable or None, optional) – If set, each XML article element will be passed to this callable before being processed. Only articles for which the callable returns an XML element are processed. Returning None filters the article out, which allows filtering based on customised rules.
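A custom tokenizer that conforms to the interface described for tokenizer_func might look like the sketch below. The name simple_tokenizer and its word-splitting rule are assumptions made for illustration. It is not the library's tokenize() function.

```python
import re

def simple_tokenizer(text, token_min_len, token_max_len, lower):
    """Hypothetical tokenizer with the four-argument interface described above."""
    if lower:
        text = text.lower()
    # Split on runs of word characters and keep tokens within the length bounds.
    return [
        token for token in re.findall(r"\w+", text)
        if token_min_len <= len(token) <= token_max_len
    ]

tokens = simple_tokenizer("The Quick brown fox, a jumper", 2, 15, True)
print(tokens)  # ['the', 'quick', 'brown', 'fox', 'jumper']
```

Such a function would then be passed as the tokenizer_func argument, for example WikiCorpus(fname, tokenizer_func=simple_tokenizer).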

Warning

Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

add_lifecycle_event(event_name, log_level=20, **event)

Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.

Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.

The lifecycle_events attribute is persisted across object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.

Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.

Parameters
  • event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.

  • event (dict) –

    Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.

    This method will automatically add the following key-values to event, so you don’t have to specify them:

    • datetime: the current date & time

    • gensim: the current Gensim version

    • python: the current Python version

    • platform: the current platform

    • event: the name of this event

  • log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.

get_texts()

Iterate over the dump, yielding a list of tokens for each article that passed the length and namespace filtering.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes

This iterates over the texts. If you want vectors, just use the standard corpus interface instead of this method:

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import WikiCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>>
>>> for vec in WikiCorpus(path_to_wiki_dump):
...     pass

Yields
  • list of str – If metadata is False, yield only the list of tokens extracted from the article.

  • (list of str, (int, str)) – List of tokens (extracted from the article), page id and article title otherwise.

getstream()

Generate documents from the underlying plain text collection (of one or more files).

Yields

str – Document read from plain-text file.

Notes

Once the generator is exhausted, the self.length attribute is initialized.

init_dictionary(dictionary)

Initialize/update dictionary.

Parameters

dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.

Notes

If self.input is None, do nothing.

property input

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

preprocess_text(text)

Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.

Parameters

text (str) – Document read from plain-text file.

Returns

List of tokens extracted from text.

Return type

list of str

sample_texts(n, seed=None, length=None)

Generate n random documents from the corpus without replacement.

Parameters
  • n (int) – Number of documents we want to sample.

  • seed (int, optional) – If specified, use it as a seed for local random generator.

  • length (int, optional) – Value to use as the corpus length (because calculating the length of a corpus can be a costly operation). If not specified, __len__ will be called.

Raises

ValueError – If n is less than zero or greater than the corpus size.

Notes

Given the number of remaining documents in a corpus, we need to choose n elements. The probability for the current element to be chosen is n / remaining. If we choose it, we just decrease the n and move to the next element.
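The selection rule in the note above can be sketched in a few lines of standard-library Python. The function name sample_in_order and its signature are assumptions for illustration, not the method's actual implementation. The sketch picks each element with probability n / remaining and decrements n after every pick.

```python
import random

def sample_in_order(items, n, length, seed=None):
    """Yield n items chosen without replacement from a stream of known length."""
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and length")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    remaining = length
    for item in items:
        if n == 0:
            break
        # Choose the current item with probability n / remaining.
        if rng.random() * remaining < n:
            yield item
            n -= 1
        remaining -= 1

picked = list(sample_in_order(range(100), 5, length=100, seed=42))
print(len(picked))  # 5
```

Because selection becomes certain once n equals the number of remaining items, exactly n items are always produced, and they come out in their original order.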

Yields

list of str – Sampled document as sequence of tokens.

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save corpus to disk.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).

In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of list of (int, number)) – Corpus in BoW format.

  • id2word (Dictionary, optional) – Dictionary of corpus.

  • metadata (bool, optional) – Write additional metadata to a separate file too?

step_through_preprocess(text)

Apply the preprocessors one by one and yield the result of each step.

Warning

This is useful for debugging issues with the corpus preprocessing pipeline.

Parameters

text (str) – Document text read from plain-text file.

Yields

(callable, object) – Pre-processor, output from pre-processor (based on text)
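The step-by-step behaviour described above can be pictured with a generic sketch. The function step_through and the three string methods used as stand-in preprocessors are assumptions for illustration. The real method works on the corpus's own filters and tokenizer.

```python
def step_through(text, preprocessors):
    """Apply each callable in turn, yielding (callable, output) after every step."""
    output = text
    for func in preprocessors:
        output = func(output)
        yield func, output

# Stand-in pipeline: trim whitespace, lowercase, then split into tokens.
steps = list(step_through("  Hello World  ", [str.strip, str.lower, str.split]))
for func, result in steps:
    print(func.__name__, repr(result))
```

Inspecting the intermediate outputs this way makes it easier to see which step of a pipeline produces an unexpected result.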

gensim.corpora.wikicorpus.extract_pages(f, filter_namespaces=False, filter_articles=None)

Extract pages from a MediaWiki database dump.

Parameters
  • f (file) – File-like object.

  • filter_namespaces (list of str or bool) – Namespaces that will be extracted.

Yields

tuple of (str or None, str, str) – Title, text and page id.

gensim.corpora.wikicorpus.filter_example(elem, text, *args, **kwargs)

Example function for filtering arbitrary documents from a Wikipedia dump.

The custom filter function is called _before_ tokenisation and should work on the raw text and/or XML element information.

The filter function gets the entire context of the XML element passed into it, but you can of course choose not to use some or all parts of the context. Please refer to gensim.corpora.wikicorpus.extract_pages() for the exact details of the page context.

Parameters
  • elem (etree.Element) – XML etree element

  • text (str) – The text of the XML node

  • namespace (str) – XML namespace of the XML element

  • title (str) – Page title

  • page_tag (str) – XPath expression for page.

  • text_path (str) – XPath expression for text.

  • title_path (str) – XPath expression for title.

  • ns_path (str) – XPath expression for namespace.

  • pageid_path (str) – XPath expression for page id.
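A custom filter with the same call shape might look like the sketch below. The function keep_non_list_pages and its rule are hypothetical. The return convention (the element to keep an article, None to drop it) follows the description of filter_articles in this document and assumes the title is passed as a keyword argument, as the parameter list above suggests.

```python
import xml.etree.ElementTree as ET

def keep_non_list_pages(elem, text, *args, **kwargs):
    """Hypothetical filter: drop pages without text and pages titled 'List of ...'."""
    title = kwargs.get("title") or ""
    if text is None or title.startswith("List of "):
        return None  # article is filtered out
    return elem      # article is kept

# Stand-in for the XML element of a page.
page = ET.Element("page")

dropped = keep_non_list_pages(page, "Some text", title="List of rivers")
kept = keep_non_list_pages(page, "Some text", title="Danube")
print(dropped is None, kept is page)  # True True
```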

Example

>>> import gensim.corpora
>>> filter_func = gensim.corpora.wikicorpus.filter_example
>>> dewiki = gensim.corpora.WikiCorpus(
...     './dewiki-20180520-pages-articles-multistream.xml.bz2',
...     filter_articles=filter_func)

gensim.corpora.wikicorpus.filter_wiki(raw, promote_remaining=True, simplify_links=True)

Filter out wiki markup from raw, leaving only text.

Parameters
  • raw (str) – Unicode or utf-8 encoded string.

  • promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.

  • simplify_links (bool) – Whether links should be simplified keeping only their description text.

Returns

raw without markup.

Return type

str

gensim.corpora.wikicorpus.find_interlinks(raw)

Find all interlinks to other articles in the dump.

Parameters

raw (str) – Unicode or utf-8 encoded string.

Returns

List of tuples in format [(linked article, the actual text found), …].

Return type

list
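The returned format can be illustrated with a simplified sketch built on the RE_P16 pattern listed among the module constants. The function interlinks_sketch is an assumption for illustration. It skips any markup filtering and is not the library's implementation.

```python
import re

# Pattern copied from RE_P16, re-declared so the sketch is self-contained.
RE_P16 = re.compile(r'\[{2}(.*?)\]{2}')

def interlinks_sketch(raw):
    """Return (linked article, displayed text) pairs found in raw wiki markup."""
    pairs = []
    for inner in RE_P16.findall(raw):
        # "[[target|label]]" carries both parts; "[[target]]" uses the target as label.
        target, _sep, label = inner.partition('|')
        pairs.append((target, label or target))
    return pairs

links = interlinks_sketch("See [[Danube|the river]] and [[Vienna]].")
print(links)  # [('Danube', 'the river'), ('Vienna', 'Vienna')]
```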

gensim.corpora.wikicorpus.get_namespace(tag)

Get the namespace of tag.

Parameters

tag (str) – Namespace or tag.

Returns

Matched namespace or tag.

Return type

str
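For illustration only, the sketch below assumes that tags arrive in the ElementTree style "{namespace-uri}local-name" and pulls out the part between the braces. The helper name namespace_of and the sample tag are hypothetical, and the library function may apply additional checks.

```python
import re

def namespace_of(tag):
    """Return the namespace URI of an ElementTree-style tag, or '' if there is none."""
    match = re.match(r"^\{(.*?)\}", tag)
    return match.group(1) if match else ""

# Made-up example tag in the "{uri}name" form.
ns = namespace_of("{http://www.mediawiki.org/xml/export-0.10/}page")
print(ns)  # http://www.mediawiki.org/xml/export-0.10/
```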

gensim.corpora.wikicorpus.init_to_ignore_interrupt()

Enables interruption ignoring.

Warning

Should only be used when master is prepared to handle termination of child processes.

gensim.corpora.wikicorpus.process_article(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)

Parse a Wikipedia article, extract all tokens.

Notes

Set the tokenizer_func parameter (default is tokenize()) for languages like Japanese or Thai to perform better tokenization. The tokenizer_func needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).

Parameters
  • args ((str, str, int)) – Article text, article title, page identifier.

  • tokenizer_func (function) – Function for tokenization (default is tokenize()). Needs to have the interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.

  • token_min_len (int) – Minimal token length.

  • token_max_len (int) – Maximal token length.

  • lower (bool) – Convert article text to lower case?

Returns

List of tokens from article, title and page id.

Return type

(list of str, str, int)

gensim.corpora.wikicorpus.remove_file(s)

Remove the ‘File:’ and ‘Image:’ markup, keeping the file caption.

Parameters

s (str) – String containing ‘File:’ and ‘Image:’ markup.

Returns

Copy of s with all the ‘File:’ and ‘Image:’ markup replaced by their corresponding captions.

Return type

str

gensim.corpora.wikicorpus.remove_markup(text, promote_remaining=True, simplify_links=True)

Filter out wiki markup from text, leaving only text.

Parameters
  • text (str) – String containing markup.

  • promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.

  • simplify_links (bool) – Whether links should be simplified keeping only their description text.

Returns

text without markup.

Return type

str

gensim.corpora.wikicorpus.remove_template(s)

Remove MediaWiki template markup.

Parameters

s (str) – String containing markup template.

Returns

Copy of s with all the MediaWiki template markup removed.

Return type

str

Notes

Since templates can be nested, it is difficult to remove them using regular expressions.
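One way to cope with nesting is to count opening and closing brace pairs instead of relying on a single regular expression. The sketch below takes that approach. The function strip_templates is an assumption for illustration and is not the library's implementation.

```python
def strip_templates(s):
    """Return a copy of s with every {{...}} block removed, including nested ones."""
    kept = []
    depth = 0  # how many templates are currently open
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                kept.append(s[i])  # keep only characters outside all templates
            i += 1
    return "".join(kept)

cleaned = strip_templates("a {{x {{y}} z}} b")
print(repr(cleaned))  # 'a  b'
```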

gensim.corpora.wikicorpus.tokenize(content, token_min_len=2, token_max_len=15, lower=True)

Tokenize a piece of text from Wikipedia.

Set token_min_len, token_max_len as character length (not bytes!) thresholds for individual tokens.

Parameters
  • content (str) – String without markup (see filter_wiki()).

  • token_min_len (int) – Minimal token length.

  • token_max_len (int) – Maximal token length.

  • lower (bool) – Convert content to lower case?

Returns

List of tokens from content.

Return type

list of str
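The difference between character length and byte length matters for non-ASCII text. The short sketch below uses a hypothetical helper, within_length, whose defaults mirror token_min_len and token_max_len above. It shows a token that passes a character-based check even though its UTF-8 encoding is longer than the upper bound.

```python
def within_length(token, token_min_len=2, token_max_len=15):
    """Check a token against character-count (not byte-count) thresholds."""
    return token_min_len <= len(token) <= token_max_len

# Ten copies of "ü": 10 characters, but 20 bytes when encoded as UTF-8.
word = "\u00fc" * 10

print(len(word), len(word.encode("utf-8")))  # 10 20
print(within_length(word))                   # True
```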