similarities.fastss – Fast Levenshtein edit distance
Fast approximate string similarity search using the FastSS algorithm.
- class gensim.similarities.fastss.FastSS(words=None, max_dist=2)
Fast implementation of FastSS (Fast Similarity Search): https://fastss.csg.uzh.ch/
FastSS enables fuzzy search of a dynamic query (a word or string) against a static dictionary (a set of words or strings). The “fuzziness” is configurable through a maximum Levenshtein edit distance between the query string and any of the dictionary words.
Create a FastSS index. The index will contain encoded variants of all indexed words, allowing fast “fuzzy string similarity” queries.
max_dist: maximum allowed edit distance between an indexed word and a query word. Keep max_dist<=3 for acceptable performance, since the number of encoded variants stored per word grows quickly with max_dist.
- add(word)
Add a string to the index.
- query(word, max_dist=None)
Find all words from the index that are within max_dist of word.
- gensim.similarities.fastss.bytes2set()
Deserialize bytes into a set of unicode strings.
>>> bytes2set(b'a
- gensim.similarities.fastss.editdist()
Return the Levenshtein distance between two strings.
Use max_dist to control the maximum distance you care about. If the actual distance is larger than max_dist, editdist will return early, with the value max_dist+1. This is a performance optimization – for example if anything above distance 2 is uninteresting to your application, call editdist with max_dist=2 and ignore any return value greater than 2.
Leave max_dist=None (default) to always return the full Levenshtein distance (slower).
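The early-return behaviour described above can be sketched in plain Python as follows. The function name `bounded_editdist` and its row-by-row dynamic-programming layout are assumptions for illustration only; gensim's `editdist` may be implemented differently.

```python
def bounded_editdist(s1, s2, max_dist=None):
    """Levenshtein distance, capped at max_dist + 1 when max_dist is given."""
    if max_dist is not None and abs(len(s1) - len(s2)) > max_dist:
        return max_dist + 1  # the length gap alone already exceeds the cap
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # match or substitute
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1  # row minima never decrease, so stop early
        prev = cur
    dist = prev[-1]
    if max_dist is not None and dist > max_dist:
        return max_dist + 1
    return dist


print(bounded_editdist("kitten", "sitting"))              # 3
print(bounded_editdist("kitten", "sitting", max_dist=1))  # 2, i.e. max_dist + 1
```

A caller that only cares about distances up to 2 would pass `max_dist=2` and treat any return value greater than 2 as "too far".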
- gensim.similarities.fastss.indexkeys()
Return the set of index keys (“variants”) of a word.
>>> indexkeys('aiu', 1)
{'aiu', 'iu', 'au', 'ai'}
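The example output is consistent with generating variants by deleting up to max_dist characters from the word. A minimal sketch of that idea follows; the helper name `deletion_variants` is hypothetical, and the deletion-based interpretation is an assumption drawn from the example rather than a statement about gensim's code.

```python
def deletion_variants(word, max_dist):
    """Return word plus every string reachable by deleting up to max_dist characters."""
    keys = {word}
    frontier = {word}
    for _ in range(max_dist):
        # Delete one more character from each string produced in the previous round.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        keys |= frontier
    return keys


print(sorted(deletion_variants("aiu", 1)))  # ['ai', 'aiu', 'au', 'iu']

# The number of keys per word grows quickly with max_dist.
for d in range(4):
    print(d, len(deletion_variants("similarity", d)))
```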
- gensim.similarities.fastss.set2bytes()
Serialize a set of unicode strings into bytes.
>>> set2bytes({u'a', u'b', u'c'})
b'a
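A round trip between a set of strings and bytes can be sketched as below. The helper names `pack_words` and `unpack_words` are hypothetical, and the NUL-byte separator is an assumption of this sketch, not something confirmed by the text above; the words are sorted only to make the output deterministic.

```python
SEPARATOR = "\x00"  # assumed delimiter; must not occur inside any stored word


def pack_words(words):
    """Join a set of strings with SEPARATOR and encode the result as UTF-8 bytes."""
    return SEPARATOR.join(sorted(words)).encode("utf8")


def unpack_words(data):
    """Inverse of pack_words; empty bytes map to the empty set."""
    return set(data.decode("utf8").split(SEPARATOR)) if data else set()


packed = pack_words({"a", "b", "c"})
print(packed)                # b'a\x00b\x00c'
print(unpack_words(packed) == {"a", "b", "c"})  # True
```

Storing each bucket as a single bytes value keeps the index compatible with simple key-value stores, at the cost of decoding on every lookup.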