corpora.wikicorpus – Corpus from a Wikipedia dump

Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Notes

If you have the pattern package installed, this module will use lemmatization to obtain the lemma of each token, instead of the plain alphabetic tokenizer. The package is available at [1].

See make_wiki for a canned (example) script based on this module.

References

[1]https://github.com/clips/pattern
gensim.corpora.wikicorpus.ARTICLE_MIN_WORDS = 50

Ignore shorter articles (after full preprocessing).

gensim.corpora.wikicorpus.IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk']

MediaWiki namespaces [2] that ought to be ignored.

References

[2]https://www.mediawiki.org/wiki/Manual:Namespace
gensim.corpora.wikicorpus.RE_P0 = <_sre.SRE_Pattern object>

Comments.

gensim.corpora.wikicorpus.RE_P1 = <_sre.SRE_Pattern object>

Footnotes.

gensim.corpora.wikicorpus.RE_P10 = <_sre.SRE_Pattern object>

Math content.

gensim.corpora.wikicorpus.RE_P11 = <_sre.SRE_Pattern object>

All other tags.

gensim.corpora.wikicorpus.RE_P12 = <_sre.SRE_Pattern object>

Table formatting.

gensim.corpora.wikicorpus.RE_P13 = <_sre.SRE_Pattern object>

Table cell formatting.

gensim.corpora.wikicorpus.RE_P14 = <_sre.SRE_Pattern object>

Categories.

gensim.corpora.wikicorpus.RE_P15 = <_sre.SRE_Pattern object>

Remove File and Image templates.

gensim.corpora.wikicorpus.RE_P16 = <_sre.SRE_Pattern object>

Capture interlink text and the linked article.

gensim.corpora.wikicorpus.RE_P2 = <_sre.SRE_Pattern object>

Links to languages.

gensim.corpora.wikicorpus.RE_P3 = <_sre.SRE_Pattern object>

Template.

gensim.corpora.wikicorpus.RE_P4 = <_sre.SRE_Pattern object>

Template.

gensim.corpora.wikicorpus.RE_P5 = <_sre.SRE_Pattern object>

Remove URL, keep description.

gensim.corpora.wikicorpus.RE_P6 = <_sre.SRE_Pattern object>

Simplify links, keep description.

gensim.corpora.wikicorpus.RE_P7 = <_sre.SRE_Pattern object>

Keep description of images.

gensim.corpora.wikicorpus.RE_P8 = <_sre.SRE_Pattern object>

Keep description of files.

gensim.corpora.wikicorpus.RE_P9 = <_sre.SRE_Pattern object>

External links.

class gensim.corpora.wikicorpus.WikiCorpus(fname, processes=None, lemmatize=True, dictionary=None, filter_namespaces=('0', ), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True)

Bases: gensim.corpora.textcorpus.TextCorpus

Treat a Wikipedia articles dump as a read-only corpus.

Supported dump formats:

  • <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
  • <LANG>wiki-latest-pages-articles.xml.bz2

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

Notes

Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.

metadata

bool – Whether to write article titles to the serialized corpus.

Warning

“Multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.

Examples

>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
>>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format and mapping

Initialize the corpus.

Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

Parameters:
  • fname (str) – Path to the Wikipedia dump file.
  • processes (int, optional) – Number of processes to run; defaults to the number of CPUs minus one.
  • lemmatize (bool) – Whether to use lemmatization instead of simple regexp tokenization. Defaults to True if the pattern package is installed.
  • dictionary (Dictionary, optional) – Dictionary. If not provided, the corpus is scanned once to determine its vocabulary (this can take a very long time).
  • filter_namespaces (tuple of str) – Namespaces to consider.
  • tokenizer_func (function, optional) – Function used for tokenization; tokenize() by default. Must support the interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
  • article_min_tokens (int, optional) – Minimum number of tokens in an article; articles with fewer tokens are ignored.
  • token_min_len (int, optional) – Minimal token length.
  • token_max_len (int, optional) – Maximal token length.
  • lower (bool, optional) – If True, convert all text to lowercase.
get_texts()

Iterate over the dump, yielding list of tokens for each article.

Notes

This iterates over the texts. If you want vectors, just use the standard corpus interface instead of this method:

>>> for vec in wiki_corpus:
...     print(vec)
Yields:
  • list of str – If metadata is False, yield only the list of tokens extracted from the article.
  • (list of str, (int, str)) – Otherwise, the list of tokens extracted from the article, plus the page id and article title.
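
Example: a minimal sketch of streaming tokenized articles (the dump filename is illustrative; passing an empty dict as dictionary skips the expensive vocabulary scan):

>>> from gensim.corpora import WikiCorpus
>>>
>>> wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})  # empty dict: skip the vocabulary scan
>>> for tokens in wiki.get_texts():
...     print(tokens[:10])  # first ten tokens of the article
...     break
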
getstream()

Generate documents from the underlying plain text collection (of one or more files).

Yields:str – Document read from plain-text file.

Notes

After the generator is exhausted, the self.length attribute is initialized.

init_dictionary(dictionary)

Initialize/update dictionary.

Parameters:dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus during initialization. If None, a new dictionary will be built for the given corpus.

Notes

If self.input is None, this does nothing.

classmethod load(fname, mmap=None)

Load a previously saved object (using save()) from file.

Parameters:
  • fname (str) – Path to file that contains needed object.
  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Returns:Object loaded from fname.
Return type:object
Raises:IOError – When the method is called on an instance (it should be called from the class).
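
Example: a minimal sketch of persisting and restoring the corpus object (the filename is illustrative; only the object's state is saved, not the articles):

>>> from gensim.corpora import WikiCorpus
>>>
>>> wiki.save('wiki_corpus_state')               # see save()
>>> wiki = WikiCorpus.load('wiki_corpus_state')  # restore later without re-scanning the dump
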
preprocess_text(text)

Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.

Parameters:text (str) – Document read from plain-text file.
Returns:List of tokens extracted from text.
Return type:list of str
sample_texts(n, seed=None, length=None)

Generate n random documents from the corpus without replacement.

Parameters:
  • n (int) – Number of documents we want to sample.
  • seed (int, optional) – If specified, use it as a seed for local random generator.
  • length (int, optional) – Value to use as the corpus length (computing the length of a corpus can be a costly operation). If not specified, len(self) is used.
Raises:

ValueError – If n is less than zero or greater than the corpus size.

Notes

Given the number of remaining documents in the corpus, we need to choose n elements. The probability of choosing the current element is n / remaining. If it is chosen, we decrease n and move on to the next element.

Yields:list of str – Sampled document as sequence of tokens.
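
The selection strategy described in the notes can be sketched as a standalone illustration (this is not the gensim implementation, just the same idea applied to an in-memory list):

>>> import random
>>>
>>> def sample_without_replacement(docs, n, seed=None):
...     rng = random.Random(seed)
...     remaining = len(docs)
...     for doc in docs:
...         if n > 0 and rng.randint(1, remaining) <= n:  # P(choose) = n / remaining
...             n -= 1
...             yield doc
...         remaining -= 1
>>>
>>> list(sample_without_replacement(['a', 'b', 'c', 'd'], n=2, seed=42))
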
save(*args, **kwargs)

Save the corpus's in-memory state.

Warning

This saves only the state of the corpus object (not the corpus data itself); to save the data, please use save_corpus() instead.

Parameters:
  • *args – Variable length argument list.
  • **kwargs – Arbitrary keyword arguments.
static save_corpus(fname, corpus, id2word=None, metadata=False)

Save the given corpus to disk; should be overridden in the inheriting class.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpus formats also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which performs save_corpus() and saves the index at the same time.

Calling serialize() is preferred to calling save_corpus() directly.

Parameters:
  • fname (str) – Path to output file.
  • corpus (iterable of list of (int, number)) – Corpus in BoW format.
  • id2word (Dictionary, optional) – Dictionary of corpus.
  • metadata (bool, optional) – If True, will write some meta-information to fname too.
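
Example: a minimal sketch of serializing a corpus via serialize(), continuing the WikiCorpus example above (the filename is illustrative):

>>> from gensim.corpora import MmCorpus
>>>
>>> MmCorpus.serialize('wiki_bow.mm', wiki)  # calls save_corpus() internally and also saves a document index
>>> mm = MmCorpus('wiki_bow.mm')             # indexed access to the serialized documents
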
step_through_preprocess(text)

Apply the preprocessors one by one, yielding each one together with its output.

Warning

This is useful for debugging issues with the corpus preprocessing pipeline.

Parameters:text (str) – Document text read from plain-text file.
Yields:(callable, object) – The preprocessor and its output for text.
gensim.corpora.wikicorpus.extract_pages(f, filter_namespaces=False)

Extract pages from a MediaWiki database dump.

Parameters:
  • f (file) – File-like object.
  • filter_namespaces (list of str or bool) – Namespaces that will be extracted.
Yields:

tuple of (str or None, str, str) – Title, text and page id.
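
Example: a minimal sketch of iterating over raw pages from a dump (the dump filename is illustrative):

>>> import bz2
>>> from gensim.corpora.wikicorpus import extract_pages
>>>
>>> with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f:
...     for title, text, pageid in extract_pages(f, filter_namespaces=('0',)):
...         print(title, pageid)
...         break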

gensim.corpora.wikicorpus.filter_wiki(raw, promote_remaining=True, simplify_links=True)

Filter out wiki markup from raw, leaving only text.

Parameters:
  • raw (str) – Unicode or utf-8 encoded string.
  • promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
  • simplify_links (bool) – Whether links should be simplified keeping only their description text.
Returns:

raw without markup.

Return type:

str
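
Example: a minimal sketch on a snippet of hand-written markup (the markup is illustrative):

>>> from gensim.corpora.wikicorpus import filter_wiki
>>>
>>> raw = "{{Infobox|name=Example}} '''Anarchism''' is a [[political philosophy]]. <!-- a comment -->"
>>> text = filter_wiki(raw)  # templates and comments are stripped, the link keeps only its description text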

gensim.corpora.wikicorpus.find_interlinks(raw)

Find all interlinks to other articles in the dump.

Parameters:raw (str) – Unicode or utf-8 encoded string.
Returns:Mapping from the linked article to the actual text found.
Return type:dict
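
Example: a minimal sketch on a snippet of hand-written markup (the markup is illustrative):

>>> from gensim.corpora.wikicorpus import find_interlinks
>>>
>>> links = find_interlinks("See [[Political philosophy|political ideas]] and [[Anarchism]].")
>>> # links maps each linked article title to the text it was displayed with
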
gensim.corpora.wikicorpus.get_namespace(tag)

Get the namespace of tag.

Parameters:tag (str) – Namespace or tag.
Returns:Matched namespace or tag.
Return type:str
gensim.corpora.wikicorpus.init_to_ignore_interrupt()

Enable ignoring of keyboard interrupts (SIGINT), so that child worker processes do not terminate on interrupt.

Warning

Should only be used when master is prepared to handle termination of child processes.

gensim.corpora.wikicorpus.process_article(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)

Parse a Wikipedia article and extract all tokens.

Notes

Set the tokenizer_func parameter (the default is tokenize()) for languages like Japanese or Thai to get better tokenization. The tokenizer_func needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).

Parameters:
  • args ((str, bool, str, int)) – Article text, lemmatize flag (if True, lemmatize() will be used), article title and page identifier.
  • tokenizer_func (function) – Function for tokenization (the default is tokenize()). Needs to have the interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
  • token_min_len (int) – Minimal token length.
  • token_max_len (int) – Maximal token length.
  • lower (bool) – If True, convert the article text to lowercase.
Returns:

List of tokens from article, title and page id.

Return type:

(list of str, str, int)
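
Example: a minimal sketch with a hand-written article (title and page id are illustrative; lemmatize is set to False so the pattern package is not required):

>>> from gensim.corpora.wikicorpus import process_article
>>>
>>> raw = "'''Anarchism''' is a [[political philosophy]]."
>>> tokens, title, pageid = process_article((raw, False, 'Anarchism', 12))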

gensim.corpora.wikicorpus.remove_file(s)

Remove the ‘File:’ and ‘Image:’ markup, keeping the file caption.

Parameters:s (str) – String containing ‘File:’ and ‘Image:’ markup.
Returns:Copy of s with all the ‘File:’ and ‘Image:’ markup replaced by their corresponding captions. [3]
Return type:str
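
Example: a minimal sketch on a snippet of hand-written markup (the markup is illustrative):

>>> from gensim.corpora.wikicorpus import remove_file
>>>
>>> s = "Intro. [[File:Example.jpg|thumb|An example caption]] Outro."
>>> cleaned = remove_file(s)  # the File: block is replaced by its caption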

References

[3]http://www.mediawiki.org/wiki/Help:Images
gensim.corpora.wikicorpus.remove_markup(text, promote_remaining=True, simplify_links=True)

Filter out wiki markup from text, leaving only text.

Parameters:
  • text (str) – String containing markup.
  • promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
  • simplify_links (bool) – Whether links should be simplified keeping only their description text.
Returns:

text without markup.

Return type:

str

gensim.corpora.wikicorpus.remove_template(s)

Remove template wikimedia markup.

Parameters:s (str) – String containing markup template.
Returns:Copy of s with all the wikimedia template markup removed. See [4] for details on wikimedia templates.
Return type:str

Notes

Since templates can be nested, it is difficult to remove them using regular expressions.
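
Example: a minimal sketch on a snippet of hand-written markup (the markup is illustrative):

>>> from gensim.corpora.wikicorpus import remove_template
>>>
>>> cleaned = remove_template("{{Infobox settlement|name=Springfield}}Springfield is a city.")
>>> # the {{...}} template is stripped, leaving only the plain sentence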

References

[4]http://meta.wikimedia.org/wiki/Help:Template
gensim.corpora.wikicorpus.tokenize(content, token_min_len=2, token_max_len=15, lower=True)

Tokenize a piece of text from Wikipedia.

Set token_min_len, token_max_len as character length (not bytes!) thresholds for individual tokens.

Parameters:
  • content (str) – String without markup (see filter_wiki()).
  • token_min_len (int) – Minimal token length.
  • token_max_len (int) – Maximal token length.
  • lower (bool) – If True, convert content to lowercase.
Returns:

List of tokens from content.

Return type:

list of str
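
Example: a minimal sketch on plain text (the expected output assumes the default length thresholds, which drop the one-character token "a"):

>>> from gensim.corpora.wikicorpus import tokenize
>>>
>>> tokenize('Hello World, a tokenized sample!')
['hello', 'world', 'tokenized', 'sample']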