`similarities.fastss` – Fast Levenshtein edit distance

Fast approximate string similarity search using the FastSS algorithm.

class gensim.similarities.fastss.FastSS(words=None, max_dist=2)

Fast implementation of FastSS (Fast Similarity Search): https://fastss.csg.uzh.ch/

FastSS enables fuzzy search of a dynamic query (a word or string) against a static dictionary (a set of words or strings). The “fuzziness” is configured through a maximum Levenshtein edit distance between the query string and any of the dictionary words.

Create a FastSS index. The index will contain encoded variants of all indexed words, allowing fast “fuzzy string similarity” queries.

max_dist: maximum allowed edit distance of an indexed word to a query word. Keep max_dist<=3 for sane performance.

add(word): Add a string to the index.

query(word, max_dist=None): Find all words from the index that are within edit distance max_dist of word.
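The indexing idea described above can be illustrated with a minimal pure-Python sketch. It is not the gensim class: `TinyFastSS`, `variants` and `levenshtein` are hypothetical names, and returning a dict of word to distance is an assumption made for this example only. The sketch stores every deletion variant of each indexed word, looks up the variants of the query to collect candidates, and then verifies each candidate with a real edit-distance computation.

```python
from collections import defaultdict
from itertools import combinations


def variants(word, max_dist):
    """All strings formed by deleting 0..max_dist characters from `word`."""
    n = len(word)
    return {
        ''.join(kept)
        for k in range(min(max_dist, n) + 1)
        for kept in combinations(word, n - k)
    }


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (no early exit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class TinyFastSS:
    """Hypothetical toy index for illustration; not the gensim implementation."""

    def __init__(self, words=(), max_dist=2):
        self.max_dist = max_dist
        self.db = defaultdict(set)  # deletion variant -> indexed words producing it
        for word in words:
            self.add(word)

    def add(self, word):
        for key in variants(word, self.max_dist):
            self.db[key].add(word)

    def query(self, word, max_dist=None):
        max_dist = self.max_dist if max_dist is None else max_dist
        if max_dist > self.max_dist:
            raise ValueError("query max_dist exceeds the max_dist of the index")
        # Candidates share at least one deletion variant with the query.
        candidates = set()
        for key in variants(word, max_dist):
            candidates |= self.db.get(key, set())
        # Sharing a variant is necessary but not sufficient, so verify each one.
        result = {}
        for cand in candidates:
            dist = levenshtein(word, cand)
            if dist <= max_dist:
                result[cand] = dist
        return result


index = TinyFastSS(['hello', 'help', 'world'], max_dist=2)
print(sorted(index.query('helo').items()))  # [('hello', 1), ('help', 1)]
```

The number of variants per word grows combinatorially with max_dist, which is one way to see why small values of max_dist keep the index manageable.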

gensim.similarities.fastss.bytes2set()

Deserialize bytes into a set of unicode strings.

>>> bytes2set(b'a

gensim.similarities.fastss.editdist()

Return the Levenshtein distance between two strings.

Use max_dist to control the maximum distance you care about. If the actual distance is larger than max_dist, editdist will return early, with the value max_dist+1. This is a performance optimization – for example if anything above distance 2 is uninteresting to your application, call editdist with max_dist=2 and ignore any return value greater than 2.

Leave max_dist=None (default) to always return the full Levenshtein distance (slower).
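The early-return behaviour can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's implementation: `bounded_editdist` is a hypothetical name, and the sketch relies on the fact that the minimum of each dynamic-programming row never decreases, so the computation can stop as soon as a whole row exceeds max_dist.

```python
def bounded_editdist(s1, s2, max_dist=None):
    """Levenshtein distance, capped at max_dist + 1 when max_dist is given."""
    if max_dist is not None and abs(len(s1) - len(s2)) > max_dist:
        return max_dist + 1  # the length gap alone already exceeds the budget
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        if max_dist is not None and min(curr) > max_dist:
            return max_dist + 1  # no alignment can get back under the budget
        prev = curr
    dist = prev[-1]
    if max_dist is not None and dist > max_dist:
        return max_dist + 1
    return dist


print(bounded_editdist('kitten', 'sitting'))                # 3
print(bounded_editdist('abcdefgh', 'zzzzzzzz', max_dist=2)) # 3, i.e. max_dist + 1
```

In the second call the true distance is 8, but the function stops after three rows and reports max_dist + 1, which the caller treats as "too far".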

gensim.similarities.fastss.indexkeys()

Return the set of index keys (“variants”) of a word.

>>> indexkeys('aiu', 1)
{'aiu', 'iu', 'au', 'ai'}
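The doctest above is consistent with generating every string obtained by deleting up to max_dist characters from the word. A minimal sketch of that idea, assuming this deletion-neighbourhood interpretation (`deletion_variants` is a hypothetical name, not part of gensim):

```python
def deletion_variants(word, max_dist):
    """Return `word` plus every string made by deleting up to `max_dist` characters."""
    variants = {word}
    frontier = {word}
    for _ in range(max_dist):
        # Delete one more character from every string produced in the previous round.
        frontier = {s[:i] + s[i + 1:] for s in frontier for i in range(len(s))}
        variants |= frontier
    return variants


print(sorted(deletion_variants('aiu', 1)))  # ['ai', 'aiu', 'au', 'iu']
```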

gensim.similarities.fastss.set2bytes()

Serialize a set of unicode strings into bytes.

>>> set2bytes({u'a', u'b', u'c'})
b'a
