
scripts.segment_wiki – Convert wikipedia dump to json-line format

This script extracts plain text out of a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, with a name like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 (e.g. the ~14 GB https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1, by default), decompressing on the fly and extracting plain text from the articles and their sections.

For each extracted article, it prints its title, section names and plain text section contents, in json-line format.
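For illustration, one decoded output line has roughly the following shape. This is a made-up toy article, not taken from a real dump; the "interlinks" field only appears when interlinks are requested (see --include-interlinks below):

    >>> # hypothetical decoded article, for illustration only
    >>> article = {
    ...     "title": "Example article",
    ...     "section_titles": ["Introduction", "History"],
    ...     "section_texts": ["Plain text of the lead section ...", "Plain text of the History section ..."],
    ...     "interlinks": {"Other article": "link text as it appears in the page"},  # only with --include-interlinks
    ... }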

How to use

  1. Process Wikipedia dump with this script

    python -m gensim.scripts.segment_wiki -i -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
    
  2. Read the output in a simple way

    >>> from smart_open import smart_open
    >>> import json
    >>>
    >>> # iterate over the plain text data we just created
    >>> for line in smart_open('enwiki-latest.json.gz'):
    ...     # decode each JSON line into a Python dictionary object
    ...     article = json.loads(line)
    ...
    ...     # each article has a "title", a mapping of "interlinks" (because of -i above)
    ...     # and parallel lists of "section_titles" and "section_texts"
    ...     print("Article title: %s" % article['title'])
    ...     print("Interlinks: %s" % article['interlinks'])
    ...     for section_title, section_text in zip(article['section_titles'], article['section_texts']):
    ...         print("Section title: %s" % section_title)
    ...         print("Section text: %s" % section_text)
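If you want to feed the extracted text into downstream gensim models, one possible next step is to tokenize each section with gensim.utils.simple_preprocess. This is a minimal sketch, not part of the script itself; the helper name iter_section_tokens is made up:

    >>> from gensim.utils import simple_preprocess
    >>> from smart_open import smart_open
    >>> import json
    >>>
    >>> # stream lists of tokens, one list per section, e.g. to feed a gensim Dictionary or Word2Vec later
    >>> def iter_section_tokens(path='enwiki-latest.json.gz'):
    ...     for line in smart_open(path):
    ...         article = json.loads(line)
    ...         for section_text in article['section_texts']:
    ...             yield simple_preprocess(section_text)
    >>>
    >>> tokens = next(iter_section_tokens())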
    

Notes

Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.

Command line arguments

...
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to MediaWiki database dump (read-only).
  -o OUTPUT, --output OUTPUT
                        Path to output file (stdout if not specified). If it ends in .gz or .bz2, the output file will be automatically compressed (recommended!).
  -w WORKERS, --workers WORKERS
                        Number of parallel workers for multi-core systems. Default: 7.
  -m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER
                        Ignore articles with fewer characters than this (article stubs). Default: 200.
  -i, --include-interlinks
                        Include a mapping of interlinks to other articles in the dump. The mapping format is: "interlinks": {"article_title_1": "interlink_text_1", "article_title_2": "interlink_text_2", ...}
gensim.scripts.segment_wiki.extract_page_xmls(f)

Extract pages from a MediaWiki database dump.

Parameters:
  • f (file) – File descriptor of MediaWiki dump.
Yields:
  str – XML strings for page tags.
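A minimal usage sketch (assuming the enwiki dump filename from above; smart_open decompresses the .bz2 transparently):

    >>> from gensim.scripts.segment_wiki import extract_page_xmls
    >>> from smart_open import smart_open
    >>>
    >>> with smart_open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    ...     first_page = next(extract_page_xmls(f))
    ...     print(first_page[:200])  # raw XML of the first page element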
gensim.scripts.segment_wiki.segment(page_xml, include_interlinks=False)

Parse the content inside a page tag.

Parameters:
  • page_xml (str) – Content from page tag.
  • include_interlinks (bool) – Whether or not interlinks should be parsed.
Returns:

(str, list of (str, str), (Optionally) dict of (str, str)) – Structure contains (title, [(section_heading, section_content), …], (Optionally) {interlinks}).

Return type:

(str, list of (str, str), (Optionally) dict of (str, str))
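A minimal sketch combining extract_page_xmls and segment to parse the first page of a dump (same filename assumption as above; note that the first pages of a real dump may be redirects or other special pages, so the output can be short):

    >>> from gensim.scripts.segment_wiki import extract_page_xmls, segment
    >>> from smart_open import smart_open
    >>>
    >>> with smart_open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    ...     page_xml = next(extract_page_xmls(f))
    ...     title, sections, interlinks = segment(page_xml, include_interlinks=True)
    ...     print(title, len(sections), len(interlinks))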

gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200, workers=None, include_interlinks=False)

Extract article titles and sections from a MediaWiki bz2 database dump.

Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • min_article_character (int, optional) – Minimal number of characters per article (excluding titles and leading gaps); shorter articles are skipped.
  • workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.
  • include_interlinks (bool) – Whether or not interlinks should be included in the output.
Yields:

(str, list of (str, str), (Optionally) dict of (str, str)) – Structure contains (title, [(section_heading, section_content), …], (Optionally) {interlinks}).
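For example, a sketch under the same filename assumption (without include_interlinks each yielded item is just a (title, sections) pair):

    >>> import itertools
    >>> from gensim.scripts.segment_wiki import segment_all_articles
    >>>
    >>> # print the titles and section headings of the first 3 sufficiently long articles
    >>> stream = segment_all_articles('enwiki-latest-pages-articles.xml.bz2', min_article_character=200, workers=2)
    >>> for title, sections in itertools.islice(stream, 3):
    ...     print(title, [heading for heading, _ in sections])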

gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200, workers=None, include_interlinks=False)

Write article title and sections to output_file (or stdout, if output_file is None).

The output format is one article per line, in json-line format with the following fields:

'title' - title of the article,
'section_titles' - list of section titles,
'section_texts' - list of plain text section contents,
(Optional) 'interlinks' - mapping of interlinks to other articles in the dump.
Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • output_file (str or None) – Path to output file in json-lines format, or None for printing to stdout.
  • min_article_character (int, optional) – Minimal number of characters per article (excluding titles and leading gaps); shorter articles are skipped.
  • workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.
  • include_interlinks (bool) – Whether or not interlinks should be included in the output.
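A rough Python equivalent of the command-line call from the How to use section (same filenames as above):

    >>> from gensim.scripts.segment_wiki import segment_and_write_all_articles
    >>>
    >>> segment_and_write_all_articles(
    ...     'enwiki-latest-pages-articles.xml.bz2', 'enwiki-latest.json.gz',
    ...     min_article_character=200, workers=None, include_interlinks=True)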