java - Improving performance of fuzzy string matching against a dictionary -

- January 15, 2015

so i'm working using secondstring fuzzy string matching, have large dictionary compare (with each entry in dictionary has associated non-unique identifier). using hashmap store dictionary.

when want fuzzy string matching, first check see if string in hashmap , iterate through of other potential keys, calculating string similarity , storing k,v pair/s highest similarity. depending on dictionary using can take long time ( 12330 - 1800035 entries ). there way speed or make faster? writing memoization function/table way of speeding up, can else think of better way improve speed of this? maybe different structure or else i'm missing.

many in advance,

nathan

what looking bktree (bk-tree) combined levenshtein distance algorithm. lookup performance in bktree depends on how "fuzzy" search is. fuzzy defined number of distance (edits) between search word , matches.

here blog on subject: http://blog.notdot.net/2007/4/damn-cool-algorithms-part-1-bk-trees

some notes on performance: http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html

notes on http://en.wikipedia.org/wiki/levenshtein_distance algorithm.

also, here bk-tree written in java. should give idea of interface: http://code.google.com/p/java-bk-tree/