parsing.preprocessing – Functions to preprocess raw text
This module contains functions for parsing and preprocessing strings. The most notable ones are:
remove_stopwords() - remove all stopwords from a string
preprocess_string() - preprocess a string (in the default NLP sense)
Examples
>>> from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
>>>
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
STOPWORDS - Set of stopwords from Stone, Denis, Kwantes (2010).
RE_PUNCT - Regexp matching punctuation.
RE_TAGS - Regexp matching tags.
RE_NUMERIC - Regexp matching numbers.
RE_NONALPHA - Regexp matching non-alphabetic characters.
RE_AL_NUM - Regexp matching a position between letters and digits.
RE_NUM_AL - Regexp matching a position between digits and letters.
RE_WHITESPACE - Regexp matching whitespace characters.
DEFAULT_FILTERS - List of functions for string preprocessing.
gensim.parsing.preprocessing.preprocess_documents(docs)
Apply DEFAULT_FILTERS to the document strings.
Parameters: docs (list of str)
Returns: Processed documents split by whitespace.
Return type: list of list of str
Examples
>>> from gensim.parsing.preprocessing import preprocess_documents
>>> preprocess_documents(["<i>Hel 9lo</i> <b>Wo9 rld</b>!", "Th3 weather_is really g00d today, isn't it?"])
[[u'hel', u'rld'], [u'weather', u'todai', u'isn']]
gensim.parsing.preprocessing.preprocess_string(s, filters=DEFAULT_FILTERS)
Apply the list of chosen filters to s.
Default list of filters, applied in order: a lambda function, strip_tags(), strip_punctuation(), strip_multiple_whitespaces(), strip_numeric(), remove_stopwords(), strip_short(), stem_text().
Parameters: s (str); filters (list of functions, optional)
Returns: Processed strings (cleaned).
Return type: list of str
Examples
>>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
>>>
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
>>> preprocess_string(s, CUSTOM_FILTERS)
[u'hel', u'9lo', u'wo9', u'rld', u'th3', u'weather', u'is', u'really', u'g00d', u'today', u'isn', u't', u'it']
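The filter chain can be pictured as a simple loop over callables. The following is a minimal sketch using only the standard library, not gensim's implementation: `apply_filters` is a hypothetical name, and the two filters are simplified stand-ins for the module's functions.

```python
import re


def apply_filters(s, filters):
    """Hypothetical helper: run each filter over the string in order,
    then split the result on whitespace."""
    for f in filters:
        s = f(s)
    return s.split()


# Stand-in filters for illustration only (not the gensim functions).
lowercase = lambda x: x.lower()
drop_tags = lambda x: re.sub(r"<[^>]+>", "", x)

tokens = apply_filters("<i>Hello</i> <b>World</b>!", [lowercase, drop_tags])
print(tokens)  # ['hello', 'world!']
```

Because each filter takes a string and returns a string, the order of the list matters: a filter sees whatever the previous one produced.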
gensim.parsing.preprocessing.read_file(path)
gensim.parsing.preprocessing.read_files(pattern)
gensim.parsing.preprocessing.remove_stopwords(s)
Remove STOPWORDS from s.
Parameters: s (str)
Returns: Unicode string without STOPWORDS.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import remove_stopwords
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
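The output above is consistent with token-level filtering on whitespace. A minimal sketch of that idea follows; the three-word `SAMPLE_STOPWORDS` set and the function name are assumptions made for illustration, and the real STOPWORDS set is much larger.

```python
# Tiny illustrative stopword set (an assumption, not the module's STOPWORDS).
SAMPLE_STOPWORDS = frozenset({"than", "but", "never"})


def remove_stopwords_sketch(s, stopwords=SAMPLE_STOPWORDS):
    """Keep only whitespace-separated tokens that are not in the stopword set."""
    return " ".join(w for w in s.split() if w not in stopwords)


result = remove_stopwords_sketch("Better late than never, but better never late.")
print(result)  # Better late never, better late.
```

In this sketch "never," survives because the comparison is made on whole tokens, including any attached punctuation, and it is case-sensitive.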
gensim.parsing.preprocessing.split_alphanum(s)
Add spaces between digits and letters in s using RE_AL_NUM.
Parameters: s (str)
Returns: Unicode string with spaces between digits and letters.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import split_alphanum
>>> split_alphanum("24.0hours7 days365 a1b2c3")
u'24.0 hours 7 days 365 a 1 b 2 c 3'
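One way to get this behaviour is two substitution passes, one for letters followed by digits and one for digits followed by letters. The patterns below are assumptions chosen for illustration; the module's RE_AL_NUM and RE_NUM_AL may be defined differently.

```python
import re

# Assumed patterns for illustration only.
LETTERS_THEN_DIGITS = re.compile(r"([a-z]+)([0-9]+)")
DIGITS_THEN_LETTERS = re.compile(r"([0-9]+)([a-z]+)")


def split_alphanum_sketch(s):
    """Insert a space at every letter/digit boundary (lowercase ASCII letters only)."""
    s = LETTERS_THEN_DIGITS.sub(r"\1 \2", s)
    return DIGITS_THEN_LETTERS.sub(r"\1 \2", s)


result = split_alphanum_sketch("24.0hours7 days365 a1b2c3")
print(result)  # 24.0 hours 7 days 365 a 1 b 2 c 3
```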
gensim.parsing.preprocessing.stem(text)
Transform text into lowercase and stem it.
Parameters: text (str)
Returns: Unicode lowercased and Porter-stemmed version of text.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import stem
>>> stem("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
gensim.parsing.preprocessing.stem_text(text)
Transform text into lowercase and stem it.
Parameters: text (str)
Returns: Unicode lowercased and Porter-stemmed version of text.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import stem_text
>>> stem_text("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
gensim.parsing.preprocessing.strip_multiple_whitespaces(s)
Remove repeated whitespace characters (spaces, tabs, line breaks) from s and turn tabs and line breaks into spaces using RE_WHITESPACE.
Parameters: s (str)
Returns: Unicode string without consecutive whitespace characters.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_multiple_whitespaces
>>> strip_multiple_whitespaces("salut" + '\r' + " les" + '\n' + " loulous!")
u'salut les loulous!'
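The same effect can be sketched with a single substitution that collapses every run of whitespace into one space. The pattern and function name here are assumptions for illustration, not necessarily the module's RE_WHITESPACE.

```python
import re


def collapse_whitespace(s):
    """Replace each run of whitespace (spaces, tabs, line breaks) with one space."""
    return re.sub(r"\s+", " ", s)


result = collapse_whitespace("salut" + "\r" + " les" + "\n" + " loulous!")
print(result)  # salut les loulous!
```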
gensim.parsing.preprocessing.strip_non_alphanum(s)
Replace non-word characters in s with spaces using RE_NONALPHA.
Parameters: s (str)
Returns: Unicode string with word characters only, separated by spaces.
Return type: str
Notes
Word characters - alphanumeric & underscore.
Examples
>>> from gensim.parsing.preprocessing import strip_non_alphanum
>>> strip_non_alphanum("if-you#can%read$this&then@this#method^works")
u'if you can read this then this method works'
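A sketch of the idea, assuming the pattern is the standard `\W` class (anything that is not a letter, digit, or underscore). The function name is hypothetical.

```python
import re


def replace_non_word(s):
    """Replace every non-word character with a single space."""
    return re.sub(r"\W", " ", s)


result = replace_non_word("if-you#can%read$this&then@this#method^works")
print(result)  # if you can read this then this method works

# Underscores count as word characters, so they are left in place.
print(replace_non_word("snake_case-name"))  # snake_case name
```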
gensim.parsing.preprocessing.strip_numeric(s)
Remove digits from s using RE_NUMERIC.
Parameters: s (str)
Returns: Unicode string without digits.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_numeric
>>> strip_numeric("0text24gensim365test")
u'textgensimtest'
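Note that the digits are deleted outright rather than replaced with spaces, so neighbouring words are joined. A minimal sketch with an assumed pattern (the module's RE_NUMERIC may differ):

```python
import re


def drop_digits(s):
    """Delete every run of ASCII digits without inserting a separator."""
    return re.sub(r"[0-9]+", "", s)


result = drop_digits("0text24gensim365test")
print(result)  # textgensimtest
```

If word boundaries need to be preserved, splitting letters from digits first (as split_alphanum does) is one option.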
gensim.parsing.preprocessing.strip_punctuation(s)
Replace punctuation characters with spaces in s using RE_PUNCT.
Parameters: s (str)
Returns: Unicode string without punctuation characters.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_punctuation
>>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
u'A semicolon is a stronger break than a comma but not as much as a full stop '
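A rough sketch of punctuation replacement built on `string.punctuation`. The pattern is an assumption for illustration and may not match RE_PUNCT exactly; in this sketch a run of punctuation becomes one space, and existing spaces are left untouched.

```python
import re
import string

# Assumed pattern: one or more ASCII punctuation characters.
PUNCT_RUN = re.compile("[%s]+" % re.escape(string.punctuation))


def replace_punctuation(s):
    """Replace each run of punctuation characters with a single space."""
    return PUNCT_RUN.sub(" ", s)


result = replace_punctuation("Hello, world!")
print(repr(result))  # 'Hello  world '
```

The double space after "Hello" comes from the replaced comma plus the original space; a whitespace-collapsing filter applied afterwards would remove it.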
gensim.parsing.preprocessing.strip_punctuation2(s)
Replace punctuation characters with spaces in s using RE_PUNCT.
Parameters: s (str)
Returns: Unicode string without punctuation characters.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_punctuation2
>>> strip_punctuation2("A semicolon is a stronger break than a comma, but not as much as a full stop!")
u'A semicolon is a stronger break than a comma but not as much as a full stop '
gensim.parsing.preprocessing.strip_short(s, minsize=3)
Remove words shorter than minsize from s.
Parameters: s (str); minsize (int, optional)
Returns: Unicode string without short words.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_short
>>> strip_short("salut les amis du 59")
u'salut les amis'
>>>
>>> strip_short("one two three four five six seven eight nine ten", minsize=5)
u'three seven eight'
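Both outputs above are consistent with a length check on whitespace-separated tokens. A minimal sketch under that assumption (the function name is hypothetical):

```python
def drop_short_words(s, minsize=3):
    """Keep only whitespace-separated tokens with at least `minsize` characters."""
    return " ".join(w for w in s.split() if len(w) >= minsize)


print(drop_short_words("salut les amis du 59"))  # salut les amis

result = drop_short_words(
    "one two three four five six seven eight nine ten", minsize=5
)
print(result)  # three seven eight
```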
gensim.parsing.preprocessing.strip_tags(s)
Remove tags from s using RE_TAGS.
Parameters: s (str)
Returns: Unicode string without tags.
Return type: str
Examples
>>> from gensim.parsing.preprocessing import strip_tags
>>> strip_tags("<i>Hello</i> <b>World</b>!")
u'Hello World!'
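Tag removal of this kind is usually a single regular-expression substitution. The sketch below assumes a simple pattern for anything between angle brackets, which may differ from RE_TAGS, and it is not a substitute for a real HTML parser.

```python
import re


def drop_tags(s):
    """Delete anything that looks like <...>, keeping the text in between."""
    return re.sub(r"<[^>]+>", "", s)


result = drop_tags("<i>Hello</i> <b>World</b>!")
print(result)  # Hello World!
```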