scripts.segment_wiki – Convert wikipedia dump to json-line format

This script extracts plain text from a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, named like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 (e.g. the roughly 14 GB dump at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1, by default), decompressing on the fly and extracting plain text from the articles and their sections.

For each extracted article, it prints its title, section names and plain text section contents, in json-line format.

How to use

  1. Process Wikipedia dump with this script

    python -m gensim.scripts.segment_wiki -i -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
    
  2. Read the output in a simple way:

>>> from gensim import utils
>>> import json
>>>
>>> # iterate over the plain text data we just created
>>> with utils.open('enwiki-latest.json.gz', 'rb') as f:
>>>     for line in f:
>>>         # decode each JSON line into a Python dictionary object
>>>         article = json.loads(line)
>>>
>>>         # each article has a "title", a list of "interlinks" (present because of the -i flag above)
>>>         # and parallel lists of "section_titles" and "section_texts".
>>>         print("Article title: %s" % article['title'])
>>>         print("Interlinks: %s" % article['interlinks'])
>>>         for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>             print("Section title: %s" % section_title)
>>>             print("Section text: %s" % section_text)
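If all you need is each article as one plain-text blob (e.g. to feed a tokenizer), the parallel section lists can simply be joined back together. A minimal sketch, reusing the output file created in step 1 (the separator string is an arbitrary choice):

>>> from gensim import utils
>>> import json
>>>
>>> with utils.open('enwiki-latest.json.gz', 'rb') as f:
>>>     for line in f:
>>>         article = json.loads(line)
>>>         # glue the section contents back together into one plain-text document
>>>         full_text = '\n\n'.join(article['section_texts'])
>>>         print("%s: %d characters" % (article['title'], len(full_text)))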

Notes

Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, or 10 MB of XML per second) on an 8-core Intel i7-7700 @ 3.60 GHz.

Command line arguments

...
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to MediaWiki database dump (read-only).
  -o OUTPUT, --output OUTPUT
                        Path to output file (stdout if not specified). If it ends in .gz or .bz2, the output file will be automatically compressed (recommended!).
  -w WORKERS, --workers WORKERS
                        Number of parallel workers for multi-core systems. Default: 7.
  -m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER
                        Ignore articles with fewer characters than this (article stubs). Default: 200.
  -i, --include-interlinks
                        Include a mapping for interlinks to other articles in the dump. The mappings format is: "interlinks": [("article_title_1", "interlink_text_1"), ("article_title_2", "interlink_text_2"), ...]
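
For illustration, when -i is passed, each interlink pair survives in the decoded JSON as a two-element list (JSON has no tuples), so it can be unpacked directly. A minimal sketch, assuming the enwiki-latest.json.gz output from step 1 above:

>>> from gensim import utils
>>> import json
>>>
>>> with utils.open('enwiki-latest.json.gz', 'rb') as f:
>>>     article = json.loads(next(f))  # just the first article, for brevity
>>>     for target_title, anchor_text in article['interlinks']:
>>>         print("%s -> %s" % (anchor_text, target_title))
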
gensim.scripts.segment_wiki.extract_page_xmls(f)

Extract pages from a MediaWiki database dump.

Parameters

f (file) – File descriptor of MediaWiki dump.

Yields

str – XML strings for page tags.
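
A minimal sketch of iterating over the raw <page> XML strings, assuming the enwiki dump file name used above (the function itself does no decompression, so the caller opens the .bz2):

>>> from gensim import utils
>>> from gensim.scripts.segment_wiki import extract_page_xmls
>>>
>>> with utils.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     for page_xml in extract_page_xmls(f):
>>>         print(page_xml[:80])  # raw MediaWiki markup wrapped in <page>...</page>
>>>         break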

gensim.scripts.segment_wiki.segment(page_xml, include_interlinks=False)

Parse the content inside a page tag.

Parameters
  • page_xml (str) – Content from page tag.

  • include_interlinks (bool) – Whether or not interlinks should be parsed.

Returns

Structure contains (title, [(section_heading, section_content), …], (Optionally) [(interlink_article, interlink_text), …]).

Return type

(str, list of (str, str), (Optionally) list of (str, str))
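
A minimal sketch combining extract_page_xmls with segment to inspect a single article (with include_interlinks left at its default, the return value is just the (title, sections) pair); the dump file name is the same example name used above:

>>> from gensim import utils
>>> from gensim.scripts.segment_wiki import extract_page_xmls, segment
>>>
>>> with utils.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     page_xml = next(extract_page_xmls(f))
>>>
>>> title, sections = segment(page_xml)
>>> for section_heading, section_content in sections:
>>>     print("%s (%d characters)" % (section_heading, len(section_content)))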

gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200, workers=None, include_interlinks=False)

Extract article titles and sections from a MediaWiki bz2 database dump.

Parameters
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.

  • min_article_character (int, optional) – Minimum number of characters an article must contain to be kept (titles and leading whitespace are not counted).

  • workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.

  • include_interlinks (bool) – Whether or not interlinks should be included in the output.

Yields

(str, list of (str, str), (Optionally) list of (str, str)) – Structure contains (title, [(section_heading, section_content), …], (Optionally) [(interlink_article, interlink_text), …]).
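
A minimal sketch of streaming segmented articles directly from Python, without writing an intermediate file (unpacking by index, since the yielded tuple gains a third element when include_interlinks=True):

>>> from gensim.scripts.segment_wiki import segment_all_articles
>>>
>>> for article in segment_all_articles('enwiki-latest-pages-articles.xml.bz2', min_article_character=200):
>>>     title, sections = article[0], article[1]
>>>     print("%s: %d sections" % (title, len(sections)))
>>>     break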

gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200, workers=None, include_interlinks=False)

Write article title and sections to output_file (or stdout, if output_file is None).

The output format is one article per line, in json-line format with the following fields:

'title' - title of the article,
'section_titles' - list of section titles,
'section_texts' - list of section contents,
(Optional) 'interlinks' - list of interlinks in the article.

Parameters
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.

  • output_file (str or None) – Path to output file in json-lines format, or None for printing to stdout.

  • min_article_character (int, optional) – Minimum number of characters an article must contain to be kept (titles and leading whitespace are not counted).

  • workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.

  • include_interlinks (bool) – Whether or not interlinks should be included in the output.
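
A minimal sketch of the programmatic equivalent of the command-line call in step 1 above, with the same example file names:

>>> from gensim.scripts.segment_wiki import segment_and_write_all_articles
>>>
>>> segment_and_write_all_articles(
>>>     'enwiki-latest-pages-articles.xml.bz2',
>>>     'enwiki-latest.json.gz',
>>>     min_article_character=200,  # default: skip article stubs
>>>     workers=None,               # defaults to max(1, cpu_count() - 1)
>>>     include_interlinks=True,    # same effect as the -i flag
>>> )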