similarities.fastss – Fast Levenshtein edit distance
Fast approximate string similarity search using the FastSS algorithm.
class gensim.similarities.fastss.FastSS
Fast implementation of FastSS (Fast Similarity Search): https://fastss.csg.uzh.ch/
FastSS enables fuzzy search of a dynamic query (a word or string) against a static dictionary (a set of words or strings). The “fuzziness” is configurable by means of a maximum (Levenshtein) edit distance between the query string and any of the dictionary words.
Create a FastSS index. The index will contain encoded variants of all indexed words, allowing fast “fuzzy string similarity” queries.
max_dist: maximum allowed edit distance between an indexed word and a query word. Keep max_dist<=3 for acceptable performance; the number of encoded variants stored per word grows quickly with max_dist.
add
Add a string to the index.
query
Find all words from the index that are within max_dist of word.
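The indexing idea described above can be illustrated with a small pure-Python toy. This is only a sketch under stated assumptions, not gensim's implementation: the names `ToyFastSS`, `deletion_variants`, and `levenshtein` are hypothetical, the "encoded variants" are assumed to be strings with up to max_dist characters deleted, and the word-to-distance return shape of `query` is this sketch's own choice.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, max_dist):
    """Return `word` plus every string made by deleting up to `max_dist` characters."""
    variants = {word}
    for k in range(1, min(max_dist, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            drop = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return variants


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (no early exit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class ToyFastSS:
    """Hypothetical toy index for illustration; not gensim's FastSS class."""

    def __init__(self, max_dist=2):
        self.max_dist = max_dist
        self.index = defaultdict(set)  # variant -> words that produce it

    def add(self, word):
        # Store the word under each of its deletion variants.
        for key in deletion_variants(word, self.max_dist):
            self.index[key].add(word)

    def query(self, word):
        # Candidates are indexed words sharing at least one variant with the query.
        candidates = set()
        for key in deletion_variants(word, self.max_dist):
            candidates |= self.index.get(key, set())
        # Sharing a variant does not guarantee closeness, so verify each candidate.
        result = {}
        for cand in candidates:
            dist = levenshtein(word, cand)
            if dist <= self.max_dist:
                result[cand] = dist
        return result


toy = ToyFastSS(max_dist=1)
for w in ["tree", "free", "three", "house"]:
    toy.add(w)
matches = toy.query("tree")
print(sorted(matches.items()))  # [('free', 1), ('three', 1), ('tree', 0)]
```

The trade-off is visible even in the toy: indexing is more expensive because every word is stored under many keys, but a query only touches a handful of dictionary lookups instead of comparing against every indexed word.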
gensim.similarities.fastss.bytes2set()
Deserialize bytes into a set of unicode strings.
>>> bytes2set(b'a
gensim.similarities.fastss.editdist()
Return the Levenshtein distance between two strings.
Use max_dist to control the maximum distance you care about. If the actual distance is larger than max_dist, editdist returns early with the value max_dist+1. This is a performance optimization: if, for example, any distance above 2 is uninteresting to your application, call editdist with max_dist=2 and ignore any return value greater than 2.
Leave max_dist=None (default) to always return the full Levenshtein distance (slower).
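The early-return behaviour can be sketched in pure Python. The function below is a hypothetical stand-in named `bounded_editdist`, not gensim's implementation; it assumes the same convention as described above, returning max_dist+1 once the true distance is known to exceed max_dist.

```python
def bounded_editdist(s1, s2, max_dist=None):
    """Levenshtein distance, or max_dist + 1 once the distance must exceed max_dist."""
    if max_dist is not None and abs(len(s1) - len(s2)) > max_dist:
        return max_dist + 1  # the length gap alone already exceeds the bound
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (c1 != c2))) # substitution or match
        if max_dist is not None and min(curr) > max_dist:
            return max_dist + 1  # every cell in this row is over the bound
        prev = curr
    if max_dist is not None and prev[-1] > max_dist:
        return max_dist + 1
    return prev[-1]


print(bounded_editdist("kitten", "sitting"))              # 3
print(bounded_editdist("kitten", "sitting", max_dist=1))  # 2, i.e. max_dist + 1
```

The row-minimum check is safe because the final distance can never be smaller than the smallest value in any intermediate row, so once a whole row exceeds the bound the remaining rows need not be computed.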
gensim.similarities.fastss.indexkeys()
Return the set of index keys (“variants”) of a word.
>>> indexkeys('aiu', 1)
{'aiu', 'iu', 'au', 'ai'}
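The documented output is consistent with keys being the word itself plus every string obtained by deleting up to max_dist characters. A minimal sketch under that assumption follows; `deletion_keys` is a hypothetical name, and the real function may differ in details such as encoding or key truncation.

```python
def deletion_keys(word, max_dist):
    """Return word plus all strings reachable by deleting up to max_dist characters."""
    keys = {word}
    frontier = {word}
    for _ in range(max_dist):
        # Delete one more character from every string found in the previous round.
        frontier = {s[:i] + s[i + 1:] for s in frontier for i in range(len(s))}
        keys |= frontier
    return keys


print(sorted(deletion_keys("aiu", 1)))  # ['ai', 'aiu', 'au', 'iu']

# The number of keys per word grows quickly with max_dist.
for d in range(4):
    print(d, len(deletion_keys("similarity", d)))
```

The second loop shows why a small max_dist matters for index size: each additional allowed deletion multiplies the number of keys stored for every word.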
gensim.similarities.fastss.set2bytes()
Serialize a set of unicode strings into bytes.
>>> set2bytes({u'a', u'b', u'c'})
b'a
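A set-to-bytes round trip of this kind can be sketched as follows. The helper names `pack` and `unpack` and the separator are assumptions of this sketch, since the example output above is cut off; the sketch also sorts the strings so the byte output is deterministic, which the real function may not do.

```python
SEP = "\x00"  # assumed separator; it must not occur inside any stored string


def pack(strings):
    """Join the strings with SEP and encode the result as UTF-8 bytes."""
    return SEP.join(sorted(strings)).encode("utf8")


def unpack(data):
    """Decode UTF-8 bytes and split on SEP; empty input yields an empty set."""
    return set(data.decode("utf8").split(SEP)) if data else set()


words = {"a", "b", "c"}
assert unpack(pack(words)) == words
```

Packing a set into a single bytes value keeps each index entry compact, at the cost of decoding and splitting whenever the entry is read back.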