parsing.preprocessing – Functions to preprocess raw text
This module contains methods for parsing and preprocessing strings.
Examples
>>> from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
>>>
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
- gensim.parsing.preprocessing.lower_to_unicode(text, encoding='utf8', errors='strict')
Lowercase text and convert to unicode, using gensim.utils.any2unicode().
- Parameters
text (str) – Input text.
encoding (str, optional) – Encoding that will be used for conversion.
errors (str, optional) – Error handling behaviour, used as parameter for unicode function (python2 only).
- Returns
Unicode version of text.
- Return type
str
See also
gensim.utils.any2unicode()
Convert any string to unicode-string.
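The behaviour described above can be approximated with the standard library alone. The sketch below is not gensim's source: the name lower_to_text is hypothetical, and it assumes Python 3, where byte strings are decoded with the given encoding and error mode before lowercasing.

```python
def lower_to_text(text, encoding="utf8", errors="strict"):
    """Decode bytes if needed, then lowercase (hypothetical stand-in)."""
    if isinstance(text, bytes):
        # errors controls how undecodable bytes are handled
        text = text.decode(encoding, errors)
    return text.lower()


lowered = lower_to_text(b"Hello WORLD")
print(lowered)
```

Passing errors='ignore' or errors='replace' in this sketch avoids an exception on malformed input at the cost of silently altering the text.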
- gensim.parsing.preprocessing.preprocess_documents(docs)
Apply DEFAULT_FILTERS to the document strings.
- Parameters
docs (list of str) – Input documents.
- Returns
Processed documents split by whitespace.
- Return type
list of list of str
Examples
>>> from gensim.parsing.preprocessing import preprocess_documents
>>> preprocess_documents(["<i>Hel 9lo</i> <b>Wo9 rld</b>!", "Th3 weather_is really g00d today, isn't it?"])
[[u'hel', u'rld'], [u'weather', u'todai', u'isn']]
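- gensim.parsing.preprocessing.preprocess_string(s, filters=[<function <lambda>>, <function strip_tags>, <function strip_punctuation>, <function strip_multiple_whitespaces>, <function strip_numeric>, <function remove_stopwords>, <function strip_short>, <function stem_text>])
Apply list of chosen filters to s.
Default list of filters, as shown in the signature: a lambda, strip_tags(), strip_punctuation(), strip_multiple_whitespaces(), strip_numeric(), remove_stopwords(), strip_short(), stem_text().
- Parameters
s (str) – Input string.
filters (list of functions, optional) – Filters to apply to s.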
- Returns
Processed strings (cleaned).
- Return type
list of str
Examples
>>> from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation
>>> preprocess_string("<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?")
[u'hel', u'rld', u'weather', u'todai', u'isn']
>>>
>>> s = "<i>Hel 9lo</i> <b>Wo9 rld</b>! Th3 weather_is really g00d today, isn't it?"
>>> CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
>>> preprocess_string(s, CUSTOM_FILTERS)
[u'hel', u'9lo', u'wo9', u'rld', u'th3', u'weather', u'is', u'really', u'g00d', u'today', u'isn', u't', u'it']
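The idea of chaining filters can be illustrated without gensim. The sketch below assumes each filter is a function from string to string, applied in list order, with the result split on whitespace at the end. The names apply_filters, drop_tags and drop_digits are hypothetical stand-ins, not gensim functions.

```python
import re


def apply_filters(s, filters):
    """Run s through each filter in order, then split on whitespace."""
    for f in filters:
        s = f(s)
    return s.split()


def drop_tags(s):
    # Hypothetical stand-in: remove anything that looks like <...>
    return re.sub(r"<[^>]*>", "", s)


def drop_digits(s):
    # Hypothetical stand-in: remove runs of ASCII digits
    return re.sub(r"[0-9]+", "", s)


tokens = apply_filters("<b>Hello</b> W0rld 42", [lambda x: x.lower(), drop_tags, drop_digits])
print(tokens)  # ['hello', 'wrld']
```

Because the filters run in sequence, their order matters: here lowercasing runs first, so later filters only ever see lowercase text.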
- gensim.parsing.preprocessing.read_file(path)
- gensim.parsing.preprocessing.read_files(pattern)
- gensim.parsing.preprocessing.remove_short_tokens(tokens, minsize=3)
Remove tokens shorter than minsize chars.
- Parameters
tokens (iterable of str) – Sequence of tokens.
minsize (int, optional) – Minimal length of token (inclusive).
- Returns
List of tokens without short tokens.
- Return type
list of str
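A minimal sketch of the length filter described above, assuming the bound is inclusive (tokens of exactly minsize characters are kept). The name keep_long_tokens is hypothetical and the code is independent of gensim.

```python
def keep_long_tokens(tokens, minsize=3):
    """Keep tokens whose length is at least minsize."""
    return [t for t in tokens if len(t) >= minsize]


kept = keep_long_tokens(["a", "to", "the", "gensim"], minsize=3)
print(kept)  # ['the', 'gensim']
```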
- gensim.parsing.preprocessing.remove_stopword_tokens(tokens, stopwords=None)
Remove stopword tokens using the list stopwords.
- Parameters
tokens (iterable of str) – Sequence of tokens.
stopwords (iterable of str, optional) – Sequence of stopwords. If None, STOPWORDS is used.
- Returns
List of tokens without stopwords.
- Return type
list of str
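The token-level stopword filter can be sketched as a membership test with a fallback default. EXAMPLE_STOPWORDS below is a tiny hypothetical set used only for illustration; it is not gensim's STOPWORDS, and drop_stopword_tokens is a hypothetical name.

```python
# Hypothetical default set, standing in for a real stopword list
EXAMPLE_STOPWORDS = frozenset({"than", "but", "never"})


def drop_stopword_tokens(tokens, stopwords=None):
    """Remove tokens found in stopwords; fall back to the default set."""
    if stopwords is None:
        stopwords = EXAMPLE_STOPWORDS
    return [t for t in tokens if t not in stopwords]


filtered = drop_stopword_tokens("Better late than never".split())
print(filtered)  # ['Better', 'late']
```

The comparison in this sketch is case-sensitive, so a capitalised token is only removed if the capitalised form is itself in the set.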
- gensim.parsing.preprocessing.remove_stopwords(s, stopwords=None)
Remove STOPWORDS from s.
- Parameters
s (str) – Input string.
stopwords (iterable of str, optional) – Sequence of stopwords. If None, STOPWORDS is used.
- Returns
Unicode string without stopwords.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import remove_stopwords
>>> remove_stopwords("Better late than never, but better never late.")
u'Better late never, better late.'
- gensim.parsing.preprocessing.split_alphanum(s)
Add spaces between digits & letters in s using RE_AL_NUM.
- Parameters
s (str) – Input string.
- Returns
Unicode string with spaces between digits & letters.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import split_alphanum
>>> split_alphanum("24.0hours7 days365 a1b2c3")
u'24.0 hours 7 days 365 a 1 b 2 c 3'
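One way to obtain this kind of splitting is with two regular-expression passes, one for letter-to-digit boundaries and one for digit-to-letter boundaries. The patterns and the name split_letters_digits below are assumptions for illustration and may differ from the RE_AL_NUM pattern gensim actually uses.

```python
import re

# Assumed patterns: lowercase ASCII letters next to ASCII digits
LETTER_DIGIT = re.compile(r"([a-z]+)([0-9]+)")
DIGIT_LETTER = re.compile(r"([0-9]+)([a-z]+)")


def split_letters_digits(s):
    """Insert a space at every letter/digit boundary."""
    s = LETTER_DIGIT.sub(r"\1 \2", s)
    return DIGIT_LETTER.sub(r"\1 \2", s)


spaced = split_letters_digits("a1b2c3")
print(spaced)  # a 1 b 2 c 3
```

Two passes are needed because a single non-overlapping substitution leaves boundaries such as "1b" untouched after "a1" has been consumed.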
- gensim.parsing.preprocessing.split_on_space(s)
Split line by spaces, used in gensim.corpora.lowcorpus.LowCorpus.
- Parameters
s (str) – Some line.
- Returns
List of tokens from s.
- Return type
list of str
- gensim.parsing.preprocessing.stem(text)
Transform text into lowercase and stem it.
- Parameters
text (str) – Input text.
- Returns
Unicode lowercased and porter-stemmed version of string text.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import stem_text
>>> stem_text("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
- gensim.parsing.preprocessing.stem_text(text)
Transform text into lowercase and stem it.
- Parameters
text (str) – Input text.
- Returns
Unicode lowercased and porter-stemmed version of string text.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import stem_text
>>> stem_text("While it is quite useful to be able to search a large collection of documents almost instantly.")
u'while it is quit us to be abl to search a larg collect of document almost instantly.'
- gensim.parsing.preprocessing.strip_multiple_whitespaces(s)
Collapse each run of whitespace characters (spaces, tabs, line breaks) in s into a single space using RE_WHITESPACE, so tabs and line breaks become spaces.
- Parameters
s (str) – Input string.
- Returns
Unicode string without consecutive whitespace characters.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import strip_multiple_whitespaces
>>> strip_multiple_whitespaces("salut" + '\r' + " les" + '\n' + " loulous!")
u'salut les loulous!'
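The same effect can be reproduced with a single substitution from the standard re module. This is a sketch under the assumption that any run of whitespace maps to one space; collapse_whitespace is a hypothetical name, and the pattern is not necessarily identical to RE_WHITESPACE.

```python
import re


def collapse_whitespace(s):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", s)


collapsed = collapse_whitespace("salut\r les\n loulous!")
print(collapsed)  # salut les loulous!
```

Leading and trailing whitespace is reduced to one space rather than removed in this sketch; call str.strip() afterwards if that matters.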
- gensim.parsing.preprocessing.strip_non_alphanum(s)
Replace non-word characters in s with spaces using RE_NONALPHA.
- Parameters
s (str) – Input string.
- Returns
Unicode string containing only word characters and spaces.
- Return type
str
Notes
Word characters are alphanumeric characters and the underscore.
Examples
>>> from gensim.parsing.preprocessing import strip_non_alphanum
>>> strip_non_alphanum("if-you#can%read$this&then@this#method^works")
u'if you can read this then this method works'
- gensim.parsing.preprocessing.strip_numeric(s)
Remove digits from s using RE_NUMERIC.
- Parameters
s (str) – Input string.
- Returns
Unicode string without digits.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import strip_numeric
>>> strip_numeric("0text24gensim365test")
u'textgensimtest'
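A standard-library sketch of digit removal, assuming only ASCII digits 0-9 are targeted. The name drop_digits is hypothetical and the pattern is an assumption, not a copy of RE_NUMERIC.

```python
import re


def drop_digits(s):
    """Delete every run of ASCII digits, leaving other characters as-is."""
    return re.sub(r"[0-9]+", "", s)


no_digits = drop_digits("0text24gensim365test")
print(no_digits)  # textgensimtest
```

Digits are deleted rather than replaced with spaces, so words on either side of a number are joined together.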
- gensim.parsing.preprocessing.strip_punctuation(s)
Replace ASCII punctuation characters with spaces in s using RE_PUNCT.
- Parameters
s (str) – Input string.
- Returns
Unicode string without punctuation characters.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import strip_punctuation
>>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
u'A semicolon is a stronger break than a comma  but not as much as a full stop '
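Replacing punctuation with spaces can be sketched using string.punctuation from the standard library. The pattern below is an assumption for illustration (each run of ASCII punctuation becomes one space) and punct_to_space is a hypothetical name.

```python
import re
import string

# Character class built from the ASCII punctuation set
PUNCT_RUN = re.compile("[%s]+" % re.escape(string.punctuation))


def punct_to_space(s):
    """Replace each run of ASCII punctuation with a single space."""
    return PUNCT_RUN.sub(" ", s)


cleaned = punct_to_space("Hello, world!")
print(repr(cleaned))  # 'Hello  world '
```

The replacement space sits next to any whitespace that was already there, which is why the result contains a double space after "Hello" and a trailing space. A whitespace-collapsing step afterwards tidies this up.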
- gensim.parsing.preprocessing.strip_short(s, minsize=3)
Remove words shorter than minsize from s.
- Parameters
s (str) – Input string.
minsize (int, optional) – Minimal word length to keep (inclusive).
- Returns
Unicode string without short words.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import strip_short
>>> strip_short("salut les amis du 59")
u'salut les amis'
>>>
>>> strip_short("one two three four five six seven eight nine ten", minsize=5)
u'three seven eight'
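The string-level variant can be sketched as split, filter, rejoin. It assumes words are whitespace-separated and that the bound is inclusive, which matches the examples above; drop_short_words is a hypothetical name.

```python
def drop_short_words(s, minsize=3):
    """Keep whitespace-separated words with at least minsize characters."""
    return " ".join(w for w in s.split() if len(w) >= minsize)


trimmed = drop_short_words("salut les amis du 59")
print(trimmed)  # salut les amis
```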
- gensim.parsing.preprocessing.strip_tags(s)
Remove tags from s using RE_TAGS.
- Parameters
s (str) – Input string.
- Returns
Unicode string without tags.
- Return type
str
Examples
>>> from gensim.parsing.preprocessing import strip_tags
>>> strip_tags("<i>Hello</i> <b>World</b>!")
u'Hello World!'
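Tag removal of this kind is typically a regular-expression substitution rather than real HTML parsing. The sketch below assumes a tag is any text between < and > with no > inside; the pattern and the name drop_tags are illustrative and may not match RE_TAGS exactly.

```python
import re


def drop_tags(s):
    """Delete anything of the form <...>, keeping the text between tags."""
    return re.sub(r"<[^>]+>", "", s)


untagged = drop_tags("<i>Hello</i> <b>World</b>!")
print(untagged)  # Hello World!
```

A pattern like this does not understand comments, scripts or attribute values that contain >, so an HTML parser is the safer choice for untrusted or complex markup.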