scripts.segment_wiki – Convert Wikipedia dump to json-line format¶
This script extracts plain text out of a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, typically named <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 (e.g. the 14 GB dump at https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
It streams through all the XML articles using multiple cores (#cores - 1 by default), decompressing on the fly and extracting plain text from the articles and their sections.
For each extracted article, it prints the title, section names and plain-text section contents, in json-line format.
How to use¶
Process a Wikipedia dump with this script:
python -m gensim.scripts.segment_wiki -i -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
Read the output in a simple way:
>>> from gensim import utils
>>> import json
>>>
>>> # iterate over the plain text data we just created
>>> with utils.open('enwiki-latest.json.gz', 'rb') as f:
>>>     for line in f:
>>>         # decode each JSON line into a Python dictionary object
>>>         article = json.loads(line)
>>>
>>>         # each article has a "title", an (optional) list of "interlinks" and matching
>>>         # lists of "section_titles" and "section_texts"
>>>         print("Article title: %s" % article['title'])
>>>         print("Interlinks: %s" % article['interlinks'])
>>>         for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>             print("Section title: %s" % section_title)
>>>             print("Section text: %s" % section_text)
Notes
Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.
Command line arguments¶
...
-h, --help show this help message and exit
-f FILE, --file FILE Path to MediaWiki database dump (read-only).
-o OUTPUT, --output OUTPUT
Path to output file (stdout if not specified). If ends in .gz or .bz2, the output file will be automatically compressed (recommended!).
-w WORKERS, --workers WORKERS
Number of parallel workers for multi-core systems. Default: max(1, #cores - 1).
-m MIN_ARTICLE_CHARACTER, --min-article-character MIN_ARTICLE_CHARACTER
Ignore articles with fewer characters than this (article stubs). Default: 200.
-i, --include-interlinks
Include a mapping for interlinks to other articles in the dump. The mappings format is: "interlinks": [("article_title_1", "interlink_text_1"), ("article_title_2", "interlink_text_2"), ...]
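For example, to inspect the interlinks of the first article in the output (a minimal sketch, assuming the dump was processed with -i and written to enwiki-latest.json.gz as above; note that JSON has no tuples, so each pair is decoded as a two-element list):
>>> from gensim import utils
>>> import json
>>>
>>> with utils.open('enwiki-latest.json.gz', 'rb') as f:
>>>     article = json.loads(next(f))  # first article only
>>>
>>> for target_title, anchor_text in article['interlinks']:
>>>     print("%r links to the article %r" % (anchor_text, target_title))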
- gensim.scripts.segment_wiki.extract_page_xmls(f)¶
Extract pages from a MediaWiki database dump.
- Parameters
f (file) – File descriptor of MediaWiki dump.
- Yields
str – XML strings for page tags.
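For example, to peek at the first few raw <page> elements (a short sketch; the dump filename is assumed, and utils.open decompresses the .bz2 stream on the fly):
>>> from gensim import utils
>>> from gensim.scripts.segment_wiki import extract_page_xmls
>>>
>>> with utils.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     for i, page_xml in enumerate(extract_page_xmls(f)):
>>>         print(page_xml[:80])  # beginning of each raw <page> element
>>>         if i == 2:  # stop after the first three pages
>>>             break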
- gensim.scripts.segment_wiki.segment(page_xml, include_interlinks=False)¶
Parse the content inside a page tag.
- Parameters
page_xml (str) – Content from page tag.
include_interlinks (bool) – Whether or not interlinks should be parsed.
- Returns
Structure contains (title, [(section_heading, section_content), …], (Optionally) [(interlink_article, interlink_text), …]).
- Return type
(str, list of (str, str), (Optionally) list of (str, str))
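For example, segmenting the first page of a dump (a sketch only; the dump filename is assumed, and the first page may well be a redirect or other non-article page):
>>> from gensim import utils
>>> from gensim.scripts.segment_wiki import extract_page_xmls, segment
>>>
>>> with utils.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     page_xml = next(extract_page_xmls(f))
>>>
>>> title, sections = segment(page_xml)  # include_interlinks is False, so a 2-tuple
>>> print("Title: %s" % title)
>>> for heading, content in sections:
>>>     print("%s: %d characters" % (heading, len(content)))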
- gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200, workers=None, include_interlinks=False)¶
Extract article titles and sections from a MediaWiki bz2 database dump.
- Parameters
file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
min_article_character (int, optional) – Minimum number of characters in an article (excluding titles and leading whitespace).
workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.
include_interlinks (bool) – Whether or not interlinks should be included in the output.
- Yields
(str, list of (str, str), (Optionally) list of (str, str)) – Structure contains (title, [(section_heading, section_content), …], (Optionally) [(interlink_article, interlink_text), …]).
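For example, to look at the first three articles without writing anything to disk (a sketch; with include_interlinks=True each yielded item is a 3-tuple):
>>> from itertools import islice
>>> from gensim.scripts.segment_wiki import segment_all_articles
>>>
>>> stream = segment_all_articles('enwiki-latest-pages-articles.xml.bz2', include_interlinks=True)
>>> for title, sections, interlinks in islice(stream, 3):
>>>     print("%s: %d sections, %d interlinks" % (title, len(sections), len(interlinks)))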
- gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200, workers=None, include_interlinks=False)¶
Write article title and sections to output_file (or stdout, if output_file is None).
The output format is one article per line, in json-line format with 4 fields:
'title' - title of the article, 'section_titles' - list of section titles, 'section_texts' - list of section contents, (Optional) 'interlinks' - list of interlinks in the article.
- Parameters
file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
output_file (str or None) – Path to output file in json-lines format, or None for printing to stdout.
min_article_character (int, optional) – Minimum number of characters in an article (excluding titles and leading whitespace).
workers (int or None) – Number of parallel workers, max(1, multiprocessing.cpu_count() - 1) if None.
include_interlinks (bool) – Whether or not interlinks should be included in the output.
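For example, the programmatic equivalent of the command-line invocation shown above (a sketch; the .gz suffix makes the output compressed automatically):
>>> from gensim.scripts.segment_wiki import segment_and_write_all_articles
>>>
>>> segment_and_write_all_articles(
>>>     'enwiki-latest-pages-articles.xml.bz2',
>>>     'enwiki-latest.json.gz',
>>>     min_article_character=200,  # skip stubs shorter than this
>>>     workers=None,  # defaults to max(1, cpu_count() - 1)
>>>     include_interlinks=True,  # add the 'interlinks' field to each output line
>>> )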