
scripts.segment_wiki – Convert a Wikipedia dump to json-line format


Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (the typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2), extract titles, section names and section content, and save them in json-line format. Each output line contains 3 fields:

'title' (str) – title of the article,
'section_titles' (list of str) – titles of its sections,
'section_texts' (list of str) – content of those sections.
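
For illustration, a single output line could look like this (the values are made-up placeholders):

{"title": "Example article", "section_titles": ["Introduction", "History"], "section_texts": ["Text of the introduction ...", "Text of the history section ..."]}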

English Wikipedia dumps are available from https://dumps.wikimedia.org/enwiki/. Approximate processing time for the full dump is 2.5 hours (i7-6700HQ, SSD).

Examples

Convert a Wikipedia dump to json-lines format:

python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest.json.gz

Read the json-lines dump:

>>> import json
>>>
>>> from smart_open import smart_open  # handles transparent .gz decompression
>>>
>>> # iterate over the json-lines file we just created
>>> for line in smart_open('enwiki-latest.json.gz'):
>>>     # decode each JSON line into a Python dict
>>>     article = json.loads(line)
>>>
>>>     # each article has 'title', 'section_titles' and 'section_texts' fields
>>>     print("Article title: %s" % article['title'])
>>>     for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>         print("Section title: %s" % section_title)
>>>         print("Section text: %s" % section_text)

gensim.scripts.segment_wiki.extract_page_xmls(f)

Extract pages from a MediaWiki database dump.

Parameters:
  • f (file) – File descriptor of the MediaWiki dump.
Yields:
  str – XML strings for page tags.
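
A minimal usage sketch, assuming a local dump file opened with the same smart_open helper used in the example above (the dump filename is a placeholder):

>>> from gensim.scripts.segment_wiki import extract_page_xmls
>>> from smart_open import smart_open
>>>
>>> # open the compressed dump; smart_open decompresses .bz2 transparently
>>> with smart_open('enwiki-latest-pages-articles.xml.bz2') as f:
>>>     # take the raw XML of the first <page> element
>>>     first_page_xml = next(extract_page_xmls(f))
>>> print(first_page_xml[:100])
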
gensim.scripts.segment_wiki.segment(page_xml)

Parse the content inside a page tag.

Parameters:
  • page_xml (str) – Content from the page tag.
Returns:
  Structure containing (title, [(section_heading, section_content), …]).
Return type:
  (str, list of (str, str))
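
A sketch combining extract_page_xmls and segment on a single page (the dump filename is a placeholder):

>>> from gensim.scripts.segment_wiki import extract_page_xmls, segment
>>> from smart_open import smart_open
>>>
>>> with smart_open('enwiki-latest-pages-articles.xml.bz2') as f:
>>>     page_xml = next(extract_page_xmls(f))
>>>
>>> # split the raw page XML into a title and (heading, content) pairs
>>> title, sections = segment(page_xml)
>>> print("Title: %s" % title)
>>> for section_heading, section_content in sections:
>>>     print("Section heading: %s" % section_heading)
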
gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200)

Extract article titles and sections from a MediaWiki bz2 database dump.

Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • min_article_character (int, optional) – Minimal number of characters for an article to be kept (titles and leading gaps are not counted).
Yields:
  (str, list of (str, str)) – Structure containing (title, [(section_heading, section_content), …]).
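
A minimal sketch of streaming articles straight from a compressed dump (the filename is a placeholder):

>>> from gensim.scripts.segment_wiki import segment_all_articles
>>>
>>> # yields one (title, [(section_heading, section_content), ...]) pair per article
>>> for title, sections in segment_all_articles('enwiki-latest-pages-articles.xml.bz2'):
>>>     print("%s: %d sections" % (title, len(sections)))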

gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200)

Write article titles and sections to output_file. The output_file is a json-line file with 3 fields:

'title' – title of the article,
'section_titles' – list of titles of its sections,
'section_texts' – list of content from those sections.
Parameters:
  • file_path (str) – Path to MediaWiki dump, typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2.
  • output_file (str) – Path to output file in json-lines format.
  • min_article_character (int, optional) – Minimal number of characters for an article to be kept (titles and leading gaps are not counted).
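
For example, a call roughly equivalent to the command-line invocation shown earlier might look like this (paths are placeholders; writing a .gz file directly assumes the output is handled by smart_open, as in the examples above):

>>> from gensim.scripts.segment_wiki import segment_and_write_all_articles
>>>
>>> # write one JSON object per line, keeping only articles with at least 200 characters
>>> segment_and_write_all_articles(
>>>     'enwiki-latest-pages-articles.xml.bz2',
>>>     'enwiki-latest.json.gz',
>>>     min_article_character=200,
>>> )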