corpora.wikicorpus – Corpus from a Wikipedia dump


Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes

If the pattern package is installed, this module lemmatizes each token instead of applying the plain alphabetic tokenizer.

See gensim.scripts.make_wiki for a canned (example) command-line script based on this module.

gensim.corpora.wikicorpus.ARTICLE_MIN_WORDS = 50: Ignore shorter articles (after full preprocessing).

gensim.corpora.wikicorpus.IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk']: MediaWiki namespaces that ought to be ignored.

gensim.corpora.wikicorpus.RE_P0 = re.compile('<!--.*?-->', re.DOTALL): Comments.

gensim.corpora.wikicorpus.RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL): Footnotes.

gensim.corpora.wikicorpus.RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL): Math content.

gensim.corpora.wikicorpus.RE_P11 = re.compile('<(.*?)>', re.DOTALL): All other tags.

gensim.corpora.wikicorpus.RE_P12 = re.compile('(({\\|)|(\\|-(?!\\d))|(\\|}))(.*?)(?=\\n)'): Table formatting.

gensim.corpora.wikicorpus.RE_P13 = re.compile('(?<=(\\n[ ])|(\\n\\n)|([ ]{2})|(.\\n)|(.\\t))(\\||\\!)([^[\\]\\n]*?\\|)*'): Table cell formatting.

gensim.corpora.wikicorpus.RE_P14 = re.compile('\\[\\[Category:[^][]*\\]\\]'): Categories.

gensim.corpora.wikicorpus.RE_P15 = re.compile('\\[\\[([fF]ile:|[iI]mage)[^]]*(\\]\\])'): Remove File and Image templates.

gensim.corpora.wikicorpus.RE_P16 = re.compile('\\[{2}(.*?)\\]{2}'): Capture interlinks text and article linked.

gensim.corpora.wikicorpus.RE_P17 = re.compile('(\\n.{0,4}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))'): Table markup.

gensim.corpora.wikicorpus.RE_P2 = re.compile('(\\n\\[\\[[a-z][a-z][\\w-]*:[^:\\]]+\\]\\])+$'): Links to languages.

gensim.corpora.wikicorpus.RE_P3 = re.compile('{{([^}{]*)}}', re.DOTALL): Template.

gensim.corpora.wikicorpus.RE_P4 = re.compile('{{([^}]*)}}', re.DOTALL): Template.

gensim.corpora.wikicorpus.RE_P5 = re.compile('\\[(\\w+):\\/\\/(.*?)(( (.*?))|())\\]'): Remove URL, keep description.

gensim.corpora.wikicorpus.RE_P6 = re.compile('\\[([^][]*)\\|([^][]*)\\]', re.DOTALL): Simplify links, keep description.

gensim.corpora.wikicorpus.RE_P7 = re.compile('\\n\\[\\[[iI]mage(.*?)(\\|.*?)*\\|(.*?)\\]\\]'): Keep description of images.

gensim.corpora.wikicorpus.RE_P8 = re.compile('\\n\\[\\[[fF]ile(.*?)(\\|.*?)*\\|(.*?)\\]\\]'): Keep description of files.

gensim.corpora.wikicorpus.RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL): Content of nowiki tags (originally labelled "External links").
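
To see how a few of these patterns behave, the sketch below recompiles three of them with the standard `re` module, so it runs without gensim installed. The sample string is made up, and the replacement used for RE_P5 (keeping capture group 3, the link description) is an assumption for illustration; this is not a description of how the module chains the patterns internally.

```python
import re

# Patterns copied from the listing above and compiled locally.
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL)   # footnotes
RE_P14 = re.compile(r'\[\[Category:[^][]*\]\]')              # categories
RE_P5 = re.compile(r'\[(\w+):\/\/(.*?)(( (.*?))|())\]')      # external URLs

# Hypothetical snippet of wiki markup, invented for this example.
sample = (
    'Gensim<ref name="a">A citation.</ref> is documented at '
    '[https://example.org the project site]. [[Category:Software]]'
)

text = RE_P1.sub('', sample)    # drop the footnote
text = RE_P14.sub('', text)     # drop the category link
text = RE_P5.sub(r'\3', text)   # assumed replacement: keep only the description
print(text)
```

Group 3 of RE_P5 includes the space that separates the URL from its description, so some extra whitespace is left behind and would need to be normalised separately.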

class gensim.corpora.wikicorpus.WikiCorpus(fname, processes=None, lemmatize=False, dictionary=None, filter_namespaces=('0', ), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True, filter_articles=None)

Bases: gensim.corpora.textcorpus.TextCorpus

Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.

Supported dump formats:

<LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
<LANG>wiki-latest-pages-articles.xml.bz2

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

Notes

Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.

metadata

Whether to write article titles to the serialized corpus.

Type: bool

Warning

“Multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.

Examples

>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>> corpus_path = get_tmpfile("wiki-corpus.mm")
>>>
>>> wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
>>> MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping

Initialize the corpus.

Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

Parameters

fname (str) – Path to the Wikipedia dump file.
processes (int, optional) – Number of processes to run. Defaults to max(1, number of CPUs - 1).
lemmatize (bool) – Use lemmatization instead of simple regexp tokenization. Defaults to True if you have the pattern package installed.
dictionary (Dictionary, optional) – Dictionary to use. If not provided, the corpus is scanned once to determine its vocabulary. IMPORTANT: this scan takes a really long time.
filter_namespaces (tuple of str, optional) – Namespaces to consider.
tokenizer_func (function, optional) – Function that will be used for tokenization. By default, use tokenize(). If you inject your own tokenizer, it must conform to this interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str
article_min_tokens (int, optional) – Minimum number of tokens in an article. Articles with fewer tokens are ignored.
token_min_len (int, optional) – Minimal token length.
token_max_len (int, optional) – Maximal token length.
lower (bool, optional) – If True, convert all text to lower case.
filter_articles (callable or None, optional) – If set, each XML article element is passed to this callable before being processed. Only articles for which the callable returns an XML element are processed; returning None filters an article out, so custom rules can exclude articles.
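
The tokenizer_func interface described above can be illustrated with a small stand-in. This is a sketch only: `simple_tokenizer` is a hypothetical name, and splitting on `\w+` with inclusive length bounds is an assumption, not the behaviour of the built-in tokenize().

```python
import re


def simple_tokenizer(text, token_min_len, token_max_len, lower):
    """Split on non-word characters and keep tokens within the length bounds."""
    if lower:
        text = text.lower()
    return [
        token for token in re.findall(r"\w+", text)
        if token_min_len <= len(token) <= token_max_len
    ]


tokens = simple_tokenizer("Wikipedia is a free encyclopedia", 2, 15, True)
print(tokens)
```

A function with this four-argument shape is what the tokenizer_func parameter expects, for example `WikiCorpus(fname, tokenizer_func=simple_tokenizer)`.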

Warning

Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

get_texts()

Iterate over the dump, yielding a list of tokens for each article that passed the length and namespace filtering.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes

This iterates over the texts. If you want vectors, just use the standard corpus interface instead of this method:

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import WikiCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>>
>>> for vec in WikiCorpus(path_to_wiki_dump):
...     pass

Yields

list of str – If metadata is False, only the list of tokens extracted from the article.
(list of str, (int, str)) – Otherwise, the list of tokens extracted from the article together with the page id and article title.

getstream()

Generate documents from the underlying plain text collection (of one or more files).

Yields: str – Document read from plain-text file.

Notes

Once the generator is exhausted, the self.length attribute is initialized.

init_dictionary(dictionary)

Initialize/update dictionary.

Parameters: dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.

Notes

If self.input is None, do nothing.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters

fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save(): Save object to file.

Returns: Object loaded from fname.
Return type: object
Raises: AttributeError – When called on an object instance instead of class (this is a class method).

preprocess_text(text)

Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.

Parameters: text (str) – Document read from plain-text file.
Returns: List of tokens extracted from text.
Return type: list of str

sample_texts(n, seed=None, length=None)

Generate n random documents from the corpus without replacement.

Parameters

n (int) – Number of documents we want to sample.
seed (int, optional) – If specified, use it as a seed for local random generator.
length (int, optional) – Value to use as the corpus length, because calculating the length of a corpus can be costly. If not specified, __len__ is called.

Raises

ValueError – If n is less than zero or greater than the corpus size.

Notes

Given the number of remaining documents in a corpus, we need to choose n elements. The probability for the current element to be chosen is n / remaining. If we choose it, we just decrease the n and move to the next element.

Yields: list of str – Sampled document as sequence of tokens.
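
The selection rule in the notes above can be sketched in a few lines of plain Python. This is a minimal illustration of the technique under stated assumptions, not the library's code: `sample_documents` is a hypothetical name, and the corpus length is passed in explicitly.

```python
import random


def sample_documents(documents, n, length, seed=None):
    """Yield n documents from an iterable of known length in a single pass."""
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and the corpus length")
    rng = random.Random(seed)  # local generator, so global state is untouched
    remaining = length
    for doc in documents:
        if n == 0:
            break
        # Choose the current element with probability n / remaining.
        if rng.random() < n / remaining:
            n -= 1
            yield doc
        remaining -= 1


picked = list(sample_documents(range(100), 10, 100, seed=1))
print(picked)
```

Because the probability reaches 1 once the number of remaining documents equals the number still needed, exactly n documents are produced, in corpus order and without repeats.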

save(*args, **kwargs): Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save corpus to disk.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).

In this case, save_corpus() is called automatically by serialize(), which runs save_corpus() and saves the index at the same time.

Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus() directly.

Parameters

fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?

step_through_preprocess(text)

Apply each preprocessor in turn and yield its result.

Warning

This is useful for debugging issues with the corpus preprocessing pipeline.

Parameters: text (str) – Document text read from plain-text file.
Yields: (callable, object) – Pre-processor, output from pre-processor (based on text)
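
The idea of stepping through a pipeline can be sketched independently of gensim. The function below is a hypothetical stand-in that takes the three stages named in preprocess_text() (character filters, a tokenizer, token filters) as plain arguments; `step_through` and `drop_short` are illustrative names only.

```python
def step_through(text, character_filters, tokenizer, token_filters):
    """Yield (preprocessor, output) after every stage of a three-part pipeline."""
    for char_filter in character_filters:
        text = char_filter(text)
        yield char_filter, text
    tokens = tokenizer(text)
    yield tokenizer, tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)
        yield token_filter, tokens


def drop_short(tokens):
    """Hypothetical token filter: discard tokens shorter than three characters."""
    return [t for t in tokens if len(t) >= 3]


steps = list(step_through("An Example Text", [str.lower], str.split, [drop_short]))
for func, output in steps:
    print(func.__name__, "->", output)
```

Printing the intermediate output of each stage makes it easy to see which preprocessor removed or altered a given token.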

gensim.corpora.wikicorpus.extract_pages(f, filter_namespaces=False, filter_articles=None)

Extract pages from a MediaWiki database dump.

Parameters

f (file) – File-like object.
filter_namespaces (list of str or bool) – Namespaces that will be extracted.

Yields

tuple of (str or None, str, str) – Title, text and page id.
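
For readers unfamiliar with streaming XML, the following sketch shows one way to pull (title, text, page id) tuples out of a dump-like document with the standard library. It is a simplified illustration, not the module's implementation: `iter_pages` and the tiny in-memory `MINI_DUMP` are hypothetical, and the element layout is assumed to follow the usual MediaWiki export structure.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up dump with one article (ns 0) and one talk page (ns 1).
MINI_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Anarchism</title><ns>0</ns><id>12</id>
    <revision><text>Some article text.</text></revision></page>
  <page><title>Talk:Anarchism</title><ns>1</ns><id>13</id>
    <revision><text>Discussion.</text></revision></page>
</mediawiki>"""


def iter_pages(f, filter_namespaces=False):
    """Yield (title, text, page id) for each <page>, streaming with iterparse."""
    for _event, elem in ET.iterparse(f, events=("end",)):
        # Tags look like "{namespace-uri}page"; split off the URI prefix.
        uri, _, local = elem.tag.rpartition("}")
        if local != "page":
            continue
        prefix = uri + "}" if uri else ""
        ns = elem.findtext(prefix + "ns")
        if not filter_namespaces or ns in filter_namespaces:
            title = elem.findtext(prefix + "title")
            text = elem.findtext(f"{prefix}revision/{prefix}text")
            pageid = elem.findtext(prefix + "id")
            yield title, text, pageid
        elem.clear()  # release the finished page so memory use stays flat


pages = list(iter_pages(io.BytesIO(MINI_DUMP), filter_namespaces=("0",)))
print(pages)
```

Clearing each page element after use is what lets a parser of this kind walk a very large dump without holding it in memory.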

gensim.corpora.wikicorpus.filter_example(elem, text, *args, **kwargs)

Example function for filtering arbitrary documents from a Wikipedia dump.

The custom filter function is called before tokenisation and should work on the raw text and/or XML element information.

The filter function receives the entire context of the XML element, but you can of course choose not to use some or all parts of that context. Please refer to gensim.corpora.wikicorpus.extract_pages() for the exact details of the page context.

Parameters

elem (etree.Element) – XML etree element
text (str) – The text of the XML node
namespace (str) – XML namespace of the XML element
title (str) – Page title
page_tag (str) – XPath expression for page.
text_path (str) – XPath expression for text.
title_path (str) – XPath expression for title.
ns_path (str) – XPath expression for namespace.
pageid_path (str) – XPath expression for page id.

Example

>>> import gensim.corpora
>>> filter_func = gensim.corpora.wikicorpus.filter_example
>>> dewiki = gensim.corpora.WikiCorpus(
...     './dewiki-20180520-pages-articles-multistream.xml.bz2',
...     filter_articles=filter_func)

gensim.corpora.wikicorpus.filter_wiki(raw, promote_remaining=True, simplify_links=True)

Filter out wiki markup from raw, leaving only text.

Parameters

raw (str) – Unicode or utf-8 encoded string.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.

Returns

raw without markup.

Return type

str

gensim.corpora.wikicorpus.find_interlinks(raw)

Find all interlinks to other articles in the dump.

Parameters: raw (str) – Unicode or utf-8 encoded string.
Returns: List of tuples in format [(linked article, the actual text found), …].
Return type: list
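
The return format can be illustrated with the RE_P16 pattern from the constants listing. This is a rough sketch under assumptions: `interlinks` is a hypothetical helper, it skips any markup filtering, and it treats everything after the first `|` as the displayed text.

```python
import re

RE_P16 = re.compile(r'\[{2}(.*?)\]{2}')  # copied from the listing: interlinks


def interlinks(raw):
    """Return (linked article, displayed text) pairs for each [[...]] link."""
    pairs = []
    for inner in RE_P16.findall(raw):
        target, _, label = inner.partition('|')
        # A link without a pipe displays the article title itself.
        pairs.append((target, label or target))
    return pairs


found = interlinks(
    "[[Guido van Rossum]] created [[Python (programming language)|Python]]."
)
print(found)
```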

gensim.corpora.wikicorpus.get_namespace(tag)

Get the namespace of tag.

Parameters: tag (str) – Namespace or tag.
Returns: Matched namespace or tag.
Return type: str

gensim.corpora.wikicorpus.init_to_ignore_interrupt(): Enables interruption ignoring.

Warning

Should only be used when master is prepared to handle termination of child processes.

gensim.corpora.wikicorpus.process_article(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)

Parse a Wikipedia article, extract all tokens.

Notes

Set the tokenizer_func parameter (default is tokenize()) for languages like Japanese or Thai to get better tokenization. The tokenizer_func needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).

Parameters

args ((str, bool, str, int)) – Article text, lemmatize flag (if True, lemmatize() will be used), article title, page identifier.
tokenizer_func (function) – Function for tokenization (default is tokenize()). Needs to have the interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert article text to lower case?

Returns

List of tokens from article, title and page id.

Return type

(list of str, str, int)

gensim.corpora.wikicorpus.remove_file(s)

Remove the ‘File:’ and ‘Image:’ markup, keeping the file caption.

Parameters: s (str) – String containing ‘File:’ and ‘Image:’ markup.
Returns: Copy of s with all the ‘File:’ and ‘Image:’ markup replaced by their corresponding captions.
Return type: str
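
One plausible way to keep only the caption is to match the markup with RE_P15 from the constants listing and take the text after the last `|`. The helper below is a hypothetical sketch along those lines; `keep_captions` and the sample string are not part of the module.

```python
import re

RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])')  # copied from the listing


def keep_captions(s):
    """Replace each File:/Image: markup with the text after its last '|'."""
    for match in RE_P15.finditer(s):
        markup = match.group(0)
        # Strip the closing "]]" and keep the final pipe-separated field.
        caption = markup[:-2].split('|')[-1]
        s = s.replace(markup, caption, 1)
    return s


cleaned = keep_captions("[[File:Cat.jpg|thumb|A sleeping cat]] Cats sleep a lot.")
print(cleaned)
```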

gensim.corpora.wikicorpus.remove_markup(text, promote_remaining=True, simplify_links=True)

Filter out wiki markup from text, leaving only text.

Parameters

text (str) – String containing markup.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.

Returns

text without markup.

Return type

str

gensim.corpora.wikicorpus.remove_template(s)

Remove MediaWiki template markup.

Parameters: s (str) – String containing template markup.
Returns: Copy of s with all the MediaWiki template markup removed.
Return type: str

Notes

Since templates can be nested, it is difficult to remove them using regular expressions.
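
A common alternative to regular expressions for nested delimiters is a depth counter. The sketch below shows that general technique on `{{...}}` spans; `strip_templates` is a hypothetical name and this simplified scanner is not claimed to match the module's own handling of edge cases.

```python
def strip_templates(s):
    """Drop every {{...}} span, including nested ones, using a depth counter."""
    out = []
    depth = 0
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1          # leaving the innermost open template
            i += 2
        else:
            if depth == 0:
                out.append(s[i])  # keep text that is outside all templates
            i += 1
    return "".join(out)


stripped = strip_templates("A {{infobox|name={{PAGENAME}}}} B")
print(repr(stripped))
```

A non-recursive pattern such as RE_P3 can only match innermost templates in one pass, which is why a counter (or repeated substitution) is needed for nesting.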

gensim.corpora.wikicorpus.tokenize(content, token_min_len=2, token_max_len=15, lower=True)

Tokenize a piece of text from Wikipedia.

Set token_min_len, token_max_len as character length (not bytes!) thresholds for individual tokens.

Parameters

content (str) – String without markup (see filter_wiki()).
token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert content to lower case?

Returns

List of tokens from content.

Return type

list of str
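
The distinction between character length and byte length matters for non-Latin scripts. The sketch below illustrates it with a hypothetical `within_length` helper; treating both bounds as inclusive is an assumption made for the example.

```python
def within_length(tokens, token_min_len=2, token_max_len=15):
    """Keep tokens whose length in characters lies within the bounds."""
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]


word = "日本語"
char_len = len(word)                  # counts characters
byte_len = len(word.encode("utf-8"))  # counts UTF-8 bytes
print(char_len, byte_len)

kept = within_length(["a", "wiki", word], token_min_len=2, token_max_len=4)
print(kept)
```

Measured in bytes, the three-character token would exceed a maximum of 4 and be dropped; measured in characters, it is kept.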
