parsing.preprocessing – Functions to preprocess raw text

This module contains functions for parsing and preprocessing strings.

Examples

>>> from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
>>>
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3     weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
gensim.parsing.preprocessing.lower_to_unicode(text, encoding='utf8', errors='strict')

Lowercase text and convert to unicode, using gensim.utils.any2unicode().

Parameters
  • text (str) – Input text.

  • encoding (str, optional) – Encoding that will be used for conversion.

  • errors (str, optional) – Error-handling behaviour, passed as a parameter to the unicode function (Python 2 only).

Returns

Unicode version of text.

Return type

str

See also

gensim.utils.any2unicode()

Convert any string to unicode-string.
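
The behaviour described above can be approximated without gensim. The following is a minimal sketch, assuming the input is either bytes or an already-decoded string; lower_to_unicode_sketch is a hypothetical name, and the order of decoding and lowercasing is an assumption rather than the library's confirmed implementation.

```python
def lower_to_unicode_sketch(text, encoding="utf8", errors="strict"):
    """Hypothetical stand-in: decode bytes if needed, then lowercase."""
    if isinstance(text, bytes):
        # Decode raw bytes with the requested encoding and error handling.
        text = text.decode(encoding, errors=errors)
    return text.lower()


# Bytes input is decoded before lowercasing; str input is only lowercased.
from_bytes = lower_to_unicode_sketch(b"Hello W\xc3\xb6rld")
from_str = lower_to_unicode_sketch("MiXeD Case")
```

Here errors plays the same role as in bytes.decode(), so "ignore" or "replace" can be passed to tolerate malformed input.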

gensim.parsing.preprocessing.preprocess_documents(docs)

Apply DEFAULT_FILTERS to each document string.

Parameters

docs (list of str) – Input documents.

Returns

Processed documents split by whitespace.

Return type

list of list of str

Examples

>>> from gensim.parsing.preprocessing import preprocess_documents
>>> preprocess_documents(["<i>Hel 9lo</i> <b>Wo9 rld</b>!", "Th3     weather_is really g00d today, isn't it?"])
[[u'hel', u'rld'], [u'weather', u'todai', u'isn']]
gensim.parsing.preprocessing.preprocess_string(s, filters=[<function <lambda>>, <function strip_tags>, <function strip_punctuation>, <function strip_multiple_whitespaces>, <function strip_numeric>, <function remove_stopwords>, <function strip_short>, <function stem_text>])

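Apply the chosen list of filters to s.

Default list of filters: a lowercasing lambda, strip_tags(), strip_punctuation(), strip_multiple_whitespaces(), strip_numeric(), remove_stopwords(), strip_short(), stem_text().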

Parameters
  • s (str) – Input string.

  • filters (list of functions, optional) – Filters to apply to s.

Returns

Cleaned tokens extracted from s.

Return type

list of str

Examples

>>> from gensim.parsing.preprocessing import preprocess_string
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3     weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
>>>
>>> from gensim.parsing.preprocessing import strip_tags, strip_punctuation
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3     weather_is really g00d today, isn't it?"
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
>>> preprocess_string(s, CUSTOM_FILTERS)
[u'hel', u'9lo', u'wo9', u'rld', u'th3', u'weather', u'is', u'really', u'g00d', u'today', u'isn', u't', u'it']
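
As a rough illustration of what applying a list of filters amounts to, the stdlib-only sketch below runs each filter over the string in turn and then splits the result on whitespace. The names apply_filters and drop_digits are hypothetical and not part of gensim; the real function may do additional work, such as unicode conversion.

```python
import re


def apply_filters(s, filters):
    """Hypothetical sketch: feed s through each filter, then tokenize."""
    for f in filters:
        # Each filter takes a string and returns a string.
        s = f(s)
    return s.split()


def drop_digits(s):
    """Hypothetical filter: delete every run of ASCII digits."""
    return re.sub(r"[0-9]+", "", s)


tokens = apply_filters("Th3   Weather is G00D", [str.lower, drop_digits])
# tokens == ['th', 'weather', 'is', 'gd']
```

Because each filter receives the output of the previous one, the order of the list matters: lowercasing before stopword removal, for instance, changes which words match.
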
gensim.parsing.preprocessing.read_file(path)
gensim.parsing.preprocessing.read_files(pattern)
gensim.parsing.preprocessing.remove_short_tokens(tokens, minsize=3)

Remove tokens shorter than minsize chars.

Parameters
  • tokens (iterable of str) – Sequence of tokens.

  • minsize (int, optional) – Minimal length of token (inclusive).

Returns

List of tokens without short tokens.

Return type

list of str
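
The length rule can be sketched in plain Python. This is a minimal stand-in, assuming that tokens with exactly minsize characters are kept; remove_short_tokens_sketch is a hypothetical name, not the library function.

```python
def remove_short_tokens_sketch(tokens, minsize=3):
    """Hypothetical stand-in: keep tokens with at least minsize characters."""
    return [token for token in tokens if len(token) >= minsize]


kept = remove_short_tokens_sketch(["salut", "les", "amis", "du", "59"])
# kept == ['salut', 'les', 'amis']
```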

gensim.parsing.preprocessing.remove_stopword_tokens(tokens, stopwords=None)

Remove stopword tokens using the list stopwords.

Parameters
  • tokens (iterable of str) – Sequence of tokens.

  • stopwords (iterable of str, optional) – Sequence of stopwords. If None, STOPWORDS is used.

Returns

List of tokens without stopwords.

Return type

list of str
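
A plain-Python sketch of token-level stopword removal is shown below. DEMO_STOPWORDS is a tiny hypothetical set standing in for STOPWORDS, and remove_stopword_tokens_sketch is a hypothetical name; the sketch assumes tokens are compared exactly as written, with no case folding.

```python
# Hypothetical miniature stopword set, used only for this illustration.
DEMO_STOPWORDS = frozenset({"than", "but"})


def remove_stopword_tokens_sketch(tokens, stopwords=None):
    """Hypothetical stand-in: drop tokens found in the stopword collection."""
    if stopwords is None:
        # Fall back to the default set when no custom list is given.
        stopwords = DEMO_STOPWORDS
    return [token for token in tokens if token not in stopwords]


tokens = ["Better", "late", "than", "never", "but", "better"]
default_result = remove_stopword_tokens_sketch(tokens)
custom_result = remove_stopword_tokens_sketch(tokens, stopwords={"late"})
```

With the default set, "than" and "but" are dropped; with the custom set, only "late" is.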

gensim.parsing.preprocessing.remove_stopwords(s, stopwords=None)

Remove stopwords from s.

Parameters
  • s (str) – Input string.

  • stopwords (iterable of str, optional) – Sequence of stopwords. If None, STOPWORDS is used.

Returns

Unicode string without stopwords.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import remove_stopwords
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
gensim.parsing.preprocessing.split_alphanum(s)

Add spaces between digits & letters in s using RE_AL_NUM.

Parameters

s (str) – Input string.

Returns

Unicode string with spaces between digits & letters.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import split_alphanum
>>> split_alphanum("24.0hours7 days365 a1b2c3")
u'24.0 hours 7 days 365 a 1 b 2 c 3'
gensim.parsing.preprocessing.split_on_space(s)

Split line by spaces, used in gensim.corpora.lowcorpus.LowCorpus.

Parameters

s (str) – Some line.

Returns

List of tokens from s.

Return type

list of str
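
Splitting on spaces, as opposed to on any whitespace, can be sketched as follows. This is an assumption-laden stand-in: split_on_space_sketch is a hypothetical name, and the choice to trim the line and discard empty tokens is assumed rather than taken from the text above.

```python
def split_on_space_sketch(s):
    """Hypothetical stand-in: split on the space character only."""
    # Trim the ends, split on single spaces, and drop the empty strings
    # that consecutive spaces would otherwise produce.
    return [word for word in s.strip().split(" ") if word]


words = split_on_space_sketch("  a  b c ")
tabbed = split_on_space_sketch("a\tb c")
```

Note that in this sketch a tab does not separate tokens, so "a\tb" stays together as one token.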

gensim.parsing.preprocessing.stem(text)

Transform text into lowercase and stem it.

Parameters

text (str) – Input text.

Returns

Lowercased, Porter-stemmed Unicode version of text.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import stem_text
>>> stem_text("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
gensim.parsing.preprocessing.stem_text(text)

Transform text into lowercase and stem it.

Parameters

text (str) – Input text.

Returns

Lowercased, Porter-stemmed Unicode version of text.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import stem_text
>>> stem_text("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
gensim.parsing.preprocessing.strip_multiple_whitespaces(s)

Remove repeated whitespace characters (spaces, tabs, line breaks) from s and turn tabs and line breaks into spaces, using RE_WHITESPACE.

Parameters

s (str) – Input string.

Returns

Unicode string without consecutive whitespace characters.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import strip_multiple_whitespaces
>>> strip_multiple_whitespaces("salut" + '\r' + " les" + '\n' + "         loulous!")
u'salut les loulous!'
gensim.parsing.preprocessing.strip_non_alphanum(s)

Replace non-word characters in s with spaces, using RE_NONALPHA.

Parameters

s (str) – Input string.

Returns

Unicode string with non-word characters replaced by spaces.

Return type

str

Notes

Word characters are alphanumeric characters and the underscore.

Examples

>>> from gensim.parsing.preprocessing import strip_non_alphanum
>>> strip_non_alphanum("if-you#can%read$this&then@this#method^works")
u'if you can read this then this method works'
gensim.parsing.preprocessing.strip_numeric(s)

Remove digits from s using RE_NUMERIC.

Parameters

s (str) – Input string.

Returns

Unicode string without digits.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import strip_numeric
>>> strip_numeric("0text24gensim365test")
u'textgensimtest'
gensim.parsing.preprocessing.strip_punctuation(s)

Replace ASCII punctuation characters with spaces in s using RE_PUNCT.

Parameters

s (str) – Input string.

Returns

Unicode string without punctuation characters.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import strip_punctuation
>>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
u'A semicolon is a stronger break than a comma  but not as much as a full stop '
gensim.parsing.preprocessing.strip_short(s, minsize=3)

Remove words shorter than minsize characters from s.

Parameters
  • s (str) – Input string.

  • minsize (int, optional) – Minimum word length; shorter words are removed.

Returns

Unicode string without short words.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import strip_short
>>> strip_short("salut les amis du 59")
u'salut les amis'
>>>
>>> strip_short("one two three four five six seven eight nine ten", minsize=5)
u'three seven eight'
gensim.parsing.preprocessing.strip_tags(s)

Remove tags from s using RE_TAGS.

Parameters

s (str) – Input string.

Returns

Unicode string without tags.

Return type

str

Examples

>>> from gensim.parsing.preprocessing import strip_tags
>>> strip_tags("<i>Hello</i> <b>World</b>!")
u'Hello World!'