similarities.fastss – Fast Levenshtein edit distance

Fast approximate string similarity search using the FastSS algorithm.

class gensim.similarities.fastss.FastSS

Fast implementation of FastSS (Fast Similarity Search): https://fastss.csg.uzh.ch/

FastSS enables fuzzy search of a dynamic query (a word or string) against a static dictionary (a set of words or strings). The “fuzziness” is configurable by means of a maximum edit distance (Levenshtein) between the query string and any of the dictionary words.

Create a FastSS index. The index will contain encoded variants of all indexed words, allowing fast “fuzzy string similarity” queries.

max_dist: maximum allowed edit distance between an indexed word and a query word. Keep max_dist<=3 for sane performance: the number of variants stored per word grows quickly with max_dist.

add

Add a string to the index.

query

Find all words from the index that are within max_dist of word.
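
The add/query flow described above can be illustrated with a small pure-Python sketch. This is not gensim's implementation: the class ToyFastSS, the helpers variants and levenshtein, and the shape of the returned dictionary (distance mapped to a sorted list of words) are all assumptions made for illustration. The idea is that every indexed word is stored under each of its deletion variants, and a query looks up its own variants and then verifies the candidates with a real edit distance.

```python
from collections import defaultdict
from itertools import combinations


def variants(word, max_dist):
    """Hypothetical helper: all strings made by deleting up to max_dist characters."""
    out = set()
    for k in range(min(max_dist, len(word)) + 1):
        for chars in combinations(word, len(word) - k):
            out.add("".join(chars))
    return out


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (no early exit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class ToyFastSS:
    """Illustrative stand-in for a FastSS index; names and layout are assumptions."""

    def __init__(self, max_dist=2):
        self.max_dist = max_dist
        self.db = defaultdict(set)  # variant -> set of indexed words

    def add(self, word):
        # Store the word under each of its deletion variants.
        for key in variants(word, self.max_dist):
            self.db[key].add(word)

    def query(self, word, max_dist=None):
        if max_dist is None:
            max_dist = self.max_dist
        if max_dist > self.max_dist:
            raise ValueError("query max_dist exceeds the max_dist used to build the index")
        # Collect candidates that share at least one variant with the query.
        candidates = set()
        for key in variants(word, max_dist):
            candidates |= self.db.get(key, set())
        # Verify each candidate with the real edit distance.
        results = {d: [] for d in range(max_dist + 1)}
        for cand in sorted(candidates):
            d = levenshtein(word, cand)
            if d <= max_dist:
                results[d].append(cand)
        return results


index = ToyFastSS(max_dist=2)
for w in ("hello", "help", "world"):
    index.add(w)
print(index.query("helo"))
```

The verification step matters because sharing a deletion variant is necessary but not sufficient for two words to be within max_dist of each other.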

gensim.similarities.fastss.bytes2set()

Deserialize bytes into a set of unicode strings.

>>> bytes2set(b'a
gensim.similarities.fastss.editdist()

Return the Levenshtein distance between two strings.

Use max_dist to control the maximum distance you care about. If the actual distance is larger than max_dist, editdist will return early, with the value max_dist+1. This is a performance optimization – for example if anything above distance 2 is uninteresting to your application, call editdist with max_dist=2 and ignore any return value greater than 2.

Leave max_dist=None (default) to always return the full Levenshtein distance (slower).
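
The early-return behavior described above can be sketched in pure Python. This is an illustrative version under stated assumptions, not the library's code: the name bounded_editdist is hypothetical, and the cutoff relies on the fact that the minimum value of each dynamic-programming row never decreases from one row to the next.

```python
def bounded_editdist(s1, s2, max_dist=None):
    """Levenshtein distance, capped at max_dist + 1 when max_dist is given."""
    if s1 == s2:
        return 0
    if len(s1) > len(s2):
        s1, s2 = s2, s1  # keep s1 as the shorter string
    # The length difference is a lower bound on the distance.
    if max_dist is not None and len(s2) - len(s1) > max_dist:
        return max_dist + 1

    previous = list(range(len(s1) + 1))
    for j, c2 in enumerate(s2, start=1):
        current = [j]
        for i, c1 in enumerate(s1, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[i] + 1,          # deletion
                               current[i - 1] + 1,       # insertion
                               previous[i - 1] + cost))  # substitution
        # Every cell in this row already exceeds the limit, so stop early.
        if max_dist is not None and min(current) > max_dist:
            return max_dist + 1
        previous = current

    result = previous[-1]
    if max_dist is not None and result > max_dist:
        return max_dist + 1
    return result


print(bounded_editdist("kitten", "sitting"))              # full distance
print(bounded_editdist("kitten", "sitting", max_dist=2))  # capped at max_dist + 1
```

A caller that only cares about distances up to 2 would treat any return value above 2 as "too far", without needing to know the true distance.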

gensim.similarities.fastss.indexkeys()

Return the set of index keys (“variants”) of a word.

>>> indexkeys('aiu', 1)
{'aiu', 'iu', 'au', 'ai'}
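
One way to produce such a set of variants is to delete up to max_dist characters from the word in every possible combination. The sketch below assumes that is what "variants" means here; the function name deletion_variants is hypothetical. It reproduces the output shown above and illustrates why the keys are useful: two words that are one edit apart share at least one variant.

```python
from itertools import combinations


def deletion_variants(word, max_dist):
    """Return all strings obtained by deleting 0..max_dist characters from word."""
    result = set()
    for deleted in range(min(max_dist, len(word)) + 1):
        keep = len(word) - deleted
        # combinations() preserves the original character order.
        for positions in combinations(range(len(word)), keep):
            result.add("".join(word[i] for i in positions))
    return result


print(sorted(deletion_variants("aiu", 1)))  # ['ai', 'aiu', 'au', 'iu']

# "hello" and "helo" are one edit apart and share the variant "helo".
print(deletion_variants("hello", 1) & deletion_variants("helo", 1))
```
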
gensim.similarities.fastss.set2bytes()

Serialize a set of unicode strings into bytes.

>>> set2bytes({u'a', u'b', u'c'})
b'a
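
A round trip between a set of strings and a single bytes value can be sketched as follows. The separator is an assumption of this sketch (a NUL character, which is unlikely to occur inside ordinary words), and the names pack_words and unpack_words are hypothetical. The input is sorted only to make the output deterministic.

```python
SEPARATOR = "\x00"  # assumption: a character that never appears inside a word


def pack_words(words):
    """Join the words with the separator and encode the result as UTF-8 bytes."""
    return SEPARATOR.join(sorted(words)).encode("utf8")


def unpack_words(data):
    """Inverse of pack_words; empty bytes decode to an empty set."""
    return set(data.decode("utf8").split(SEPARATOR)) if data else set()


packed = pack_words({"a", "b", "c"})
print(unpack_words(packed) == {"a", "b", "c"})
```

Storing each set as one bytes value keeps the index compact, at the cost of a decode step on every lookup.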