corpora.wikicorpus
– Corpus from a Wikipedia dump¶Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.
Uses multiprocessing internally to parallelize the work and process the dump more quickly.
Notes
If you have the pattern package installed, this module lemmatizes each token instead of using the plain alphabetic tokenizer.
See gensim.scripts.make_wiki for a canned (example) command-line script based on this module.
gensim.corpora.wikicorpus.
ARTICLE_MIN_WORDS
= 50¶Ignore shorter articles (after full preprocessing).
gensim.corpora.wikicorpus.
IGNORED_NAMESPACES
= ['Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk']¶MediaWiki namespaces that ought to be ignored.
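A minimal sketch of how such a namespace list can be used to skip non-article pages. It assumes that a page belongs to a namespace when its title starts with the namespace name followed by a colon; the helper name `is_ignored_title` is hypothetical and not part of this module.

```python
IGNORED_NAMESPACES = [
    'Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki',
    'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk',
]


def is_ignored_title(title, ignored=IGNORED_NAMESPACES):
    """Return True if `title` starts with an ignored namespace prefix such as 'Category:'."""
    # Assumption: namespace membership is encoded as a "<Namespace>:" title prefix.
    return any(title.startswith(ns + ":") for ns in ignored)


titles = ["Anarchism", "Category:Political theories", "Talk:Anarchism", "Template:Infobox"]
kept = [t for t in titles if not is_ignored_title(t)]
print(kept)
```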
gensim.corpora.wikicorpus.
RE_P0
= re.compile('<!--.*?-->', re.DOTALL)¶Comments.
gensim.corpora.wikicorpus.
RE_P1
= re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL)¶Footnotes.
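The two patterns above can be tried out with the standard `re` module alone. The sketch below copies RE_P0 and RE_P1 as printed here and substitutes each match with an empty string; using an empty replacement is an assumption about how the patterns are meant to be applied.

```python
import re

# Patterns copied from the listing above.
RE_P0 = re.compile('<!--.*?-->', re.DOTALL)                 # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL)   # footnotes

raw = (
    'Water boils at 100 C.<!-- check units -->'
    '<ref name="a">Smith 2001</ref> It freezes at 0 C.<ref name="a" />'
)

no_comments = RE_P0.sub('', raw)        # drop the HTML comment
clean = RE_P1.sub('', no_comments)      # drop both paired and self-closing <ref> tags
print(clean)
```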
gensim.corpora.wikicorpus.
RE_P10
= re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL)¶Math content.
gensim.corpora.wikicorpus.
RE_P11
= re.compile('<(.*?)>', re.DOTALL)¶All other tags.
gensim.corpora.wikicorpus.
RE_P12
= re.compile('(({\\|)|(\\|-(?!\\d))|(\\|}))(.*?)(?=\\n)')¶Table formatting.
gensim.corpora.wikicorpus.
RE_P13
= re.compile('(?<=(\\n[ ])|(\\n\\n)|([ ]{2})|(.\\n)|(.\\t))(\\||\\!)([^[\\]\\n]*?\\|)*')¶Table cell formatting.
gensim.corpora.wikicorpus.
RE_P14
= re.compile('\\[\\[Category:[^][]*\\]\\]')¶Categories.
gensim.corpora.wikicorpus.
RE_P15
= re.compile('\\[\\[([fF]ile:|[iI]mage)[^]]*(\\]\\])')¶Remove File and Image templates.
gensim.corpora.wikicorpus.
RE_P16
= re.compile('\\[{2}(.*?)\\]{2}')¶Capture interlink text and the linked article.
gensim.corpora.wikicorpus.
RE_P17
= re.compile('(\\n.{0,4}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))')¶Table markup
gensim.corpora.wikicorpus.
RE_P2
= re.compile('(\\n\\[\\[[a-z][a-z][\\w-]*:[^:\\]]+\\]\\])+$')¶Links to languages.
gensim.corpora.wikicorpus.
RE_P3
= re.compile('{{([^}{]*)}}', re.DOTALL)¶Template.
gensim.corpora.wikicorpus.
RE_P4
= re.compile('{{([^}]*)}}', re.DOTALL)¶Template.
gensim.corpora.wikicorpus.
RE_P5
= re.compile('\\[(\\w+):\\/\\/(.*?)(( (.*?))|())\\]')¶Remove URL, keep description.
gensim.corpora.wikicorpus.
RE_P6
= re.compile('\\[([^][]*)\\|([^][]*)\\]', re.DOTALL)¶Simplify links, keep description.
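A short sketch of what "keep description" means for RE_P5 and RE_P6. The patterns are copied from the listing above; the replacement strings (`\3` for RE_P5 and `\2` for RE_P6) are an assumption chosen so that the description group survives, and the sample strings are made up for illustration.

```python
import re

# Patterns copied from the listing above.
RE_P5 = re.compile('\\[(\\w+):\\/\\/(.*?)(( (.*?))|())\\]')
RE_P6 = re.compile('\\[([^][]*)\\|([^][]*)\\]', re.DOTALL)

# External link: the URL is dropped, the description (group 3) is kept.
url_link = "[http://example.org Example site]"
description = RE_P5.sub('\\3', url_link).strip()

# Internal link: "target|label" is reduced to "label" (group 2).
# The outer brackets are left over and need a later clean-up step.
wiki_link = "[[Python (programming language)|Python]]"
simplified = RE_P6.sub('\\2', wiki_link)

print(description)
print(simplified)
```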
gensim.corpora.wikicorpus.
RE_P7
= re.compile('\\n\\[\\[[iI]mage(.*?)(\\|.*?)*\\|(.*?)\\]\\]')¶Keep description of images.
gensim.corpora.wikicorpus.
RE_P8
= re.compile('\\n\\[\\[[fF]ile(.*?)(\\|.*?)*\\|(.*?)\\]\\]')¶Keep description of files.
gensim.corpora.wikicorpus.
RE_P9
= re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL)¶Content inside nowiki tags.
gensim.corpora.wikicorpus.
WikiCorpus
(fname, processes=None, lemmatize=False, dictionary=None, filter_namespaces=('0', ), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True, filter_articles=None)¶Bases: gensim.corpora.textcorpus.TextCorpus
Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.
Supported dump formats:
<LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
<LANG>wiki-latest-pages-articles.xml.bz2
The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.
Notes
Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.
metadata
¶Whether to write article titles to the serialized corpus.
bool
Warning
“Multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.
Examples
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>> corpus_path = get_tmpfile("wiki-corpus.mm")
>>>
>>> wiki = WikiCorpus(path_to_wiki_dump) # create word->word_id mapping, ~8h on full wiki
>>> MmCorpus.serialize(corpus_path, wiki) # another 8h, creates a file in MatrixMarket format and mapping
Initialize the corpus.
Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.
fname (str) – Path to the Wikipedia dump file.
processes (int, optional) – Number of processes to run, defaults to max(1, number of cpu - 1).
lemmatize (bool) – Use lemmatization instead of simple regexp tokenization. Defaults to True if you have the pattern package installed.
dictionary (Dictionary, optional) – Dictionary to use. If not provided, the corpus is scanned once to determine its vocabulary.
IMPORTANT: this scan takes a very long time.
filter_namespaces (tuple of str, optional) – Namespaces to consider.
tokenizer_func (function, optional) – Function that will be used for tokenization. By default, tokenize() is used.
If you inject your own tokenizer, it must conform to this interface:
tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str
article_min_tokens (int, optional) – Minimum number of tokens in an article. Articles with fewer tokens are ignored.
token_min_len (int, optional) – Minimal token length.
token_max_len (int, optional) – Maximal token length.
lower (bool, optional) – If True, convert all text to lower case.
filter_articles (callable or None, optional) – If set, each XML article element will be passed to this callable before being processed. Only articles for which the callable returns an XML element are processed. Returning None filters an article out, which allows filtering based on customised rules.
Warning
Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.
get_texts
()¶Iterate over the dump, yielding a list of tokens for each article that passed the length and namespace filtering.
Uses multiprocessing internally to parallelize the work and process the dump more quickly.
Notes
This iterates over the texts. If you want vectors, just use the standard corpus interface instead of this method:
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import WikiCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>>
>>> for vec in WikiCorpus(path_to_wiki_dump):
... pass
list of str – If metadata is False, yields only the list of tokens extracted from the article.
(list of str, (int, str)) – Otherwise, yields the list of tokens (extracted from the article) together with the page id and article title.
getstream
()¶Generate documents from the underlying plain text collection (of one or more files).
str – Document read from plain-text file.
Notes
After the generator is exhausted, the self.length attribute is initialized.
init_dictionary
(dictionary)¶Initialize/update dictionary.
dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization.
If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, nothing is done.
load
(fname, mmap=None)¶Load an object previously saved using save() from a file.
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
Object loaded from fname.
object
AttributeError – When called on an object instance instead of class (this is a class method).
preprocess_text
(text)¶Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.
text (str) – Document read from plain-text file.
List of tokens extracted from text.
list of str
sample_texts
(n, seed=None, length=None)¶Generate n random documents from the corpus without replacement.
n (int) – Number of documents we want to sample.
seed (int, optional) – If specified, use it as a seed for local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, __len__ will be called.
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in a corpus, we need to choose n elements. The probability for the current element to be chosen is n / remaining. If we choose it, we decrement n and move to the next element.
list of str – Sampled document as sequence of tokens.
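The selection rule described in the notes can be sketched on a plain Python iterable. This is a minimal illustration under stated assumptions, not the method's actual code: the function name `sample_without_replacement` is hypothetical, and the corpus length is passed in explicitly.

```python
import random


def sample_without_replacement(documents, n, length, seed=None):
    """Yield n documents chosen uniformly at random from an iterable of known length."""
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and the corpus length")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    for i, doc in enumerate(documents):
        if n == 0 or i == length:
            break
        remaining = length - i
        # Choose the current document with probability n / remaining.
        if rng.randint(1, remaining) <= n:
            n -= 1
            yield doc


corpus = [["doc%d" % i] for i in range(10)]
picked = list(sample_without_replacement(iter(corpus), n=3, length=len(corpus), seed=42))
print(picked)
```

Because the probability reaches 1 once the number of documents still needed equals the number remaining, the stream is read only once and exactly n documents are returned.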
save
(*args, **kwargs)¶Saves corpus in-memory state.
Warning
This saves only the “state” of a corpus class, not the corpus data!
For saving data, use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
save_corpus
(fname, corpus, id2word=None, metadata=False)¶Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).
In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.
Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?
step_through_preprocess
(text)¶Apply preprocessor one by one and generate result.
Warning
This is useful for debugging issues with the corpus preprocessing pipeline.
text (str) – Document text read from plain-text file.
(callable, object) – Pre-processor, output from pre-processor (based on text)
gensim.corpora.wikicorpus.
extract_pages
(f, filter_namespaces=False, filter_articles=None)¶Extract pages from a MediaWiki database dump.
f (file) – File-like object.
filter_namespaces (list of str or bool) – Namespaces that will be extracted.
tuple of (str or None, str, str) – Title, text and page id.
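A simplified sketch of streaming page extraction using only the standard library. It is not this function's implementation: the name `iter_pages`, the tiny in-memory dump and the choice to skip (rather than blank out) pages from other namespaces are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision><text>Anarchism is a political philosophy.</text></revision>
  </page>
  <page>
    <title>Talk:Anarchism</title>
    <ns>1</ns>
    <id>13</id>
    <revision><text>Discussion page.</text></revision>
  </page>
</mediawiki>"""


def iter_pages(fileobj, keep_namespaces=("0",)):
    """Yield (title, text, page id) for each page in the wanted namespaces."""
    uri = None
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if uri is None:
            # Tags look like '{namespace-uri}name'; remember the URI once.
            uri = elem.tag[1:].split("}")[0] if elem.tag.startswith("{") else ""
        prefix = "{%s}" % uri if uri else ""
        if elem.tag != prefix + "page":
            continue
        title = elem.find(prefix + "title").text
        page_ns = elem.find(prefix + "ns").text
        page_id = elem.find(prefix + "id").text
        text = elem.find("%srevision/%stext" % (prefix, prefix)).text
        if page_ns in keep_namespaces:
            yield title, text or "", page_id
        elem.clear()  # free the memory held by the processed page


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print(pages)
```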
gensim.corpora.wikicorpus.
filter_example
(elem, text, *args, **kwargs)¶Example function for filtering arbitrary documents from wikipedia dump.
The custom filter function is called before tokenisation and should work on the raw text and/or XML element information.
The filter function gets the entire context of the XML element passed into it, but you can of course choose not to use some or all parts of the context. Please refer to gensim.corpora.wikicorpus.extract_pages() for the exact details of the page context.
elem (etree.Element) – XML etree element
text (str) – The text of the XML node
namespace (str) – XML namespace of the XML element
title (str) – Page title
page_tag (str) – XPath expression for page.
text_path (str) – XPath expression for text.
title_path (str) – XPath expression for title.
ns_path (str) – XPath expression for namespace.
pageid_path (str) – XPath expression for page id.
Example
>>> import gensim.corpora
>>> filter_func = gensim.corpora.wikicorpus.filter_example
>>> dewiki = gensim.corpora.WikiCorpus(
... './dewiki-20180520-pages-articles-multistream.xml.bz2',
... filter_articles=filter_func)
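A custom filter with the same call shape as filter_example might look like the sketch below. The function name and the infobox rule are hypothetical. It follows the convention stated for the filter_articles parameter (return the element to keep an article, None to drop it), and it assumes that the page context such as title and namespace arrives as keyword arguments.

```python
import xml.etree.ElementTree as ET


def keep_articles_with_infobox(elem, text, *args, **kwargs):
    """Hypothetical filter: keep pages whose raw wikitext contains an infobox template."""
    if text is None:
        return None
    # The remaining context (title, namespace, ...) is accepted but unused here.
    return elem if "{{Infobox" in text else None


# Two stand-in page elements built by hand, only to exercise the filter.
with_box = ET.fromstring("<page><title>Berlin</title><text>{{Infobox city}} Berlin is a city.</text></page>")
without_box = ET.fromstring("<page><title>Stub</title><text>Short note.</text></page>")

kept = keep_articles_with_infobox(with_box, with_box.find("text").text, title="Berlin", namespace="0")
dropped = keep_articles_with_infobox(without_box, without_box.find("text").text, title="Stub", namespace="0")
print(kept is with_box, dropped)
```

Such a callable would be passed as filter_articles=keep_articles_with_infobox, in the same way as filter_func in the example above.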
gensim.corpora.wikicorpus.
filter_wiki
(raw, promote_remaining=True, simplify_links=True)¶Filter out wiki markup from raw, leaving only text.
raw (str) – Unicode or utf-8 encoded string.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.
raw without markup.
str
gensim.corpora.wikicorpus.
find_interlinks
(raw)¶Find all interlinks to other articles in the dump.
raw (str) – Unicode or utf-8 encoded string.
List of tuples in format [(linked article, the actual text found), …].
list
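The returned format can be illustrated with RE_P16 alone. This is a rough sketch, not the function's implementation: it skips any markup filtering, the helper name `interlinks` is hypothetical, and it assumes that a link without a `|` uses the article name as its text.

```python
import re

# Pattern copied from the RE_P16 entry above.
RE_P16 = re.compile('\\[{2}(.*?)\\]{2}')


def interlinks(raw):
    """Return [(linked article, text shown in the page), ...] for every [[...]] link."""
    pairs = []
    for inner in RE_P16.findall(raw):
        target, _, label = inner.partition("|")
        # Assumption: a link without an explicit label displays the target name.
        pairs.append((target, label or target))
    return pairs


links = interlinks("The [[Cat|cats]] chase the [[Mouse]].")
print(links)
```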
gensim.corpora.wikicorpus.
get_namespace
(tag)¶Get the namespace of tag.
tag (str) – Namespace or tag.
Matched namespace or tag.
str
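ElementTree writes namespaced tags as `{namespace-uri}name`, so extracting the namespace is a matter of reading the part between the braces. A minimal sketch, assuming that notation; the helper name `xml_namespace_of` is hypothetical and any validation the real function performs is omitted.

```python
import re


def xml_namespace_of(tag):
    """Return the namespace URI of an ElementTree tag, or '' if the tag has none."""
    match = re.match(r"^\{(.*?)\}", tag)
    return match.group(1) if match else ""


ns = xml_namespace_of("{http://www.mediawiki.org/xml/export-0.10/}mediawiki")
print(ns)
```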
gensim.corpora.wikicorpus.
init_to_ignore_interrupt
()¶Enables interruption ignoring.
Warning
Should only be used when master is prepared to handle termination of child processes.
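One common way to get this behaviour is to install an "ignore" handler for SIGINT in every worker process. The sketch below shows that pattern with the standard library; the function names are hypothetical, and whether this module does exactly this is an assumption.

```python
import multiprocessing
import signal


def ignore_interrupt():
    """Make the current process ignore SIGINT (Ctrl+C)."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def make_worker_pool(processes):
    """Create a pool whose workers ignore Ctrl+C, leaving shutdown to the parent."""
    return multiprocessing.Pool(processes, initializer=ignore_interrupt)
```

With such a pool, only the parent process sees the KeyboardInterrupt, so it has to terminate the workers itself (for example by calling the pool's terminate() method). That is why the warning above applies.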
gensim.corpora.wikicorpus.
process_article
(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)¶Parse a Wikipedia article, extract all tokens.
Notes
Set the tokenizer_func parameter (default: tokenize()) for languages like Japanese or Thai to perform better tokenization.
The tokenizer_func needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).
args ((str, bool, str, int)) – Article text, lemmatize flag (if True, lemmatize() will be used), article title, page identifier.
tokenizer_func (function) – Function for tokenization (default: tokenize()). Needs to have the interface:
tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert article text to lower case?
List of tokens from article, title and page id.
(list of str, str, int)
gensim.corpora.wikicorpus.
remove_file
(s)¶Remove the ‘File:’ and ‘Image:’ markup, keeping the file caption.
s (str) – String containing ‘File:’ and ‘Image:’ markup.
Copy of s with all the ‘File:’ and ‘Image:’ markup replaced by their corresponding captions.
str
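A sketch of caption extraction built on RE_P15 as printed above. It assumes that the caption is the last `|`-separated field of the markup; the function name is hypothetical.

```python
import re

# Pattern copied from the RE_P15 entry above.
RE_P15 = re.compile('\\[\\[([fF]ile:|[iI]mage)[^]]*(\\]\\])')


def replace_file_markup_with_caption(s):
    """Replace each [[File:...]] / [[Image:...]] block with its caption."""
    for match in RE_P15.finditer(s):
        markup = match.group(0)
        # Drop the closing ']]' and keep the last pipe-separated field.
        caption = markup[:-2].split("|")[-1]
        s = s.replace(markup, caption, 1)
    return s


result = replace_file_markup_with_caption("[[File:Cat.jpg|thumb|A sleeping cat]] on a sofa.")
print(result)
```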
gensim.corpora.wikicorpus.
remove_markup
(text, promote_remaining=True, simplify_links=True)¶Filter out wiki markup from text, leaving only text.
text (str) – String containing markup.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.
text without markup.
str
gensim.corpora.wikicorpus.
remove_template
(s)¶Remove template wikimedia markup.
s (str) – String containing markup template.
Copy of s with all the wikimedia template markup removed.
str
Notes
Since templates can be nested, it is difficult to remove them using regular expressions.
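The nesting problem is easy to reproduce: a single pass of RE_P3 removes only the innermost template. One way around it is to track brace depth while scanning the string, as in this sketch. The function name `strip_templates` is hypothetical and the scan is an illustration, not necessarily how this function works.

```python
import re

# Pattern copied from the RE_P3 entry above.
RE_P3 = re.compile('{{([^}{]*)}}', re.DOTALL)


def strip_templates(s):
    """Remove {{...}} templates, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:  # keep only characters outside any template
                out.append(s[i])
            i += 1
    return "".join(out)


nested = "A {{infobox|x={{nested|y}}}} B"
one_regex_pass = RE_P3.sub("", nested)   # the outer template survives
fully_stripped = strip_templates(nested)
print(one_regex_pass)
print(fully_stripped)
```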
gensim.corpora.wikicorpus.
tokenize
(content, token_min_len=2, token_max_len=15, lower=True)¶Tokenize a piece of text from Wikipedia.
Set token_min_len, token_max_len as character length (not bytes!) thresholds for individual tokens.
content (str) – String without markup (see filter_wiki()).
token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert content to lower case?
List of tokens from content.
list of str
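A replacement tokenizer only has to accept the same four arguments and return a list of strings. The whitespace tokenizer below is a hypothetical example of that interface; treating both length bounds as inclusive is an assumption.

```python
def whitespace_tokenizer(text, token_min_len, token_max_len, lower):
    """Split on whitespace and keep tokens whose character length is within bounds."""
    if lower:
        text = text.lower()
    # Assumption: both bounds are inclusive.
    return [tok for tok in text.split() if token_min_len <= len(tok) <= token_max_len]


tokens = whitespace_tokenizer("The quick brown fox a", 2, 5, True)
print(tokens)
```

A function like this could then be supplied through the tokenizer_func parameter of WikiCorpus or process_article.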