scripts.segment_wiki – Convert wikipedia dump to json-line format

CLI script for extracting plain text out of a raw Wikipedia dump. The input is an xml.bz2 file provided by MediaWiki, named like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 (e.g. the 14 GB https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).

It streams through all the XML articles using multiple cores (#cores - 1 by default), decompressing on the fly and extracting plain text from the articles and their sections.

For each extracted article, it prints its title, section names and plain text section contents, in json-line format.
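For illustration, a single output line could look like the following sketch (the article title and texts are made-up placeholders; the field names match the format documented for segment_and_write_all_articles below):

>>> import json
>>>
>>> # one article per line; each line is a self-contained JSON object
>>> print(json.dumps({
>>>     'title': 'Example article',
>>>     'section_titles': ['Introduction', 'History'],
>>>     'section_texts': ['Plain text of the lead section ...', 'Plain text of the History section ...'],
>>> }))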

Examples

python -m gensim.scripts.segment_wiki -h

python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, or 10 MB of XML per second) on an 8-core Intel i7-7700 @ 3.60GHz.

You can then read the created output (~6.1 GB gzipped) with:

>>> from smart_open import smart_open
>>> import json
>>>
>>> # iterate over the plain text data we just created
>>> for line in smart_open('enwiki-latest.json.gz'):
>>>     # decode each JSON line into a Python dictionary object
>>>     article = json.loads(line)
>>>
>>>     # each article has a "title" and a list of "section_titles" and "section_texts"
>>>     print("Article title: %s" % article['title'])
>>>     for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>         print("Section title: %s" % section_title)
>>>         print("Section text: %s" % section_text)
gensim.scripts.segment_wiki.extract_page_xmls(f)

Extract pages from a MediaWiki database dump.

Parameters: f (file) – File descriptor of the MediaWiki dump.

Yields: str – XML strings for <page> tags.
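A minimal usage sketch, assuming a locally downloaded dump (the file name below is an example) opened with the standard bz2 module:

>>> import bz2
>>> from gensim.scripts.segment_wiki import extract_page_xmls
>>>
>>> # stream raw <page>...</page> XML fragments out of the compressed dump
>>> with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     for page_xml in extract_page_xmls(f):
>>>         print(len(page_xml))  # each item is one page's XML as a string
>>>         break  # stop after the first page for this example
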
gensim.scripts.segment_wiki.segment(page_xml)

Parse the content inside a <page> tag.

Parameters: page_xml (str) – Content of a <page> tag.

Returns: Structure containing (title, [(section_heading, section_content), …]).

Return type: (str, list of (str, str))
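A sketch combining extract_page_xmls and segment to parse pages one by one (again assuming a local dump file):

>>> import bz2
>>> from gensim.scripts.segment_wiki import extract_page_xmls, segment
>>>
>>> with bz2.open('enwiki-latest-pages-articles.xml.bz2', 'rb') as f:
>>>     first_page = next(extract_page_xmls(f))
>>>     title, sections = segment(first_page)
>>>     for section_heading, section_content in sections:
>>>         print(title, section_heading)
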
gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200, workers=None)

Extract article titles and sections from a MediaWiki bz2 database dump.

Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • min_article_character (int, optional) – Minimum number of characters in an article (not counting titles and leading gaps).
  • workers (int or None) – Number of parallel workers; defaults to max(1, multiprocessing.cpu_count() - 1) if None.

Yields: (str, list of (str, str)) – Structure containing (title, [(section_heading, section_content), …]).
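A usage sketch that streams parsed articles straight from a dump (the file name is again an example):

>>> from gensim.scripts.segment_wiki import segment_all_articles
>>>
>>> # iterate over (title, [(section_heading, section_content), ...]) tuples
>>> for title, sections in segment_all_articles('enwiki-latest-pages-articles.xml.bz2', min_article_character=200):
>>>     print(title, len(sections))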

gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200, workers=None)

Write article titles and sections to output_file (or to stdout, if output_file is None).

The output format is one article per line, in json-line format with 3 fields:

  • 'title' – title of the article,
  • 'section_titles' – list of section titles,
  • 'section_texts' – list of section contents.
Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • output_file (str or None) – Path to output file in json-lines format, or None for printing to stdout.
  • min_article_character (int, optional) – Minimum number of characters in an article (not counting titles and leading gaps).
  • workers (int or None) – Number of parallel workers; defaults to max(1, multiprocessing.cpu_count() - 1) if None.
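The same processing can be driven programmatically instead of through the CLI; a sketch assuming input and output file names of your choosing (as in the CLI example above, a .gz output name yields compressed output):

>>> from gensim.scripts.segment_wiki import segment_and_write_all_articles
>>>
>>> # write one JSON article per line into a gzipped output file
>>> segment_and_write_all_articles(
>>>     'enwiki-latest-pages-articles.xml.bz2',
>>>     'enwiki-latest.json.gz',
>>>     min_article_character=200,
>>>     workers=None,  # default: max(1, multiprocessing.cpu_count() - 1)
>>> )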