for a little project i need to compare a string against a large set of strings. it should not only match the exact strings, it also should match strings which are similar. to find out if two strings are similar there exists an algorithm called “levenshtein”. it takes two strings as an argument and returns the distance between these strings. if the distance is zero the strings are equal. the bigger the distance the more the strings differ.
to slow
i use it to compare strings which are about 200 chars long and there are at the moment 40′000 strings. to compare one string to the existing set i need to call the levenshtein algorithm 40′000 times. because the algorithm itself is not super fast it takes a long time to do the comparison. i took an implementation from
Levenshtein: Java Implementation to test and i might get this implementation faster, if i write an optimized version, but i doubt that i get out much. the problem is the 40′000 calls to it.
you don’t need 40′000 levenshtein calls
one attempt to make it faster is to rule out some strings by comparing lengths first. if you say your maximum distance for two strings to match is 10 then you can discard strings which difference in character count is bigger than 10. it’s already much faster, for the 40′000 and for my usage almost fast enough but the faster the better. in this case i still have to check one strings length against 40′000 other strings length. this can be speed up by putting the sets of the strings with the same length into an array where the index is the length of the sets strings. so i don’t have to compare it against all strings. if l is the length of the string i want to test then i need to test it only against the strings in the sets for the indexes with a length from l-10 to l+10.
probably i can rule out with this technique even more strings, by example count the number of words in the strings instead of the number of characters or the number of vowels. these approaches could be combined together and it probably will speed it up another bit. but due to statistics the result would probably be about the same like in the case where i use the total length of the string.
how much faster is it?
i measured the time by feeling and don’t know how fast it is exactly. with the string length approach i needed to test only about 85% of the strings against each other. this is faster but still 6′000 hits. if the set of strings to test against gets bigger i’ll soon have the same problem. if i include the other two approaches or similar one i might get it maybe to 90% or even 95% but this still wont help if the set is getting ten or a hundred times bigger.
a better approach?
a better approach would probably be to do the test against all strings in one go. for this the levenshtein algorithm has to be rewritten and i don’t know if it’s even possible. because i’m not a mathematician i might have problem with this approach but i’ll try it and post it here if it works.