
I have a set of $n = $ 100 million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the edit distance) between two strings is the minimum number of insertions, deletions and/or replacements required to convert one string into the other.
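
For concreteness, here is the textbook dynamic-programming computation of that distance (a minimal sketch; the function name and example strings are just illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace (free if characters match)
            ))
        prev = curr
    return prev[-1]

# e.g. levenshtein("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTAGGT") == 1
```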

This should be possible in $O((d + 1)^{2d + 1} \cdot l \cdot n)$ time with a Levenshtein transducer, which takes a single query string at a time and finds all the matches in a set with Levenshtein distance $\le d$. However, using the implementation at that link, it appears to take more like $O(n \log n)$ time rather than $O(n)$, and to use more than 200 GB of memory.

Is there an alternate $O(n)$ approach that might be faster in practice? The Levenshtein transducer is more general than it needs to be for this application, since it finds matches for each string independently and doesn't exploit the fact that you're comparing every string against every other string.
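
To make that per-query behaviour concrete, here is a rough stand-in in Python: a trie over the string set, walked with a pruned Levenshtein DP row. This is not the transducer library's actual implementation (that precompiles a universal automaton); it only mirrors the one-query-at-a-time interface, and all names are illustrative.

```python
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)

def build_trie(strings):
    root = make_trie()
    for s in strings:
        node = root
        for ch in s:
            node = node[ch]
        node['$'] = s          # '$' marks end of word (safe for a 4-letter DNA alphabet)
    return root

def query(root, word, d):
    """Return all stored strings within Levenshtein distance d of word."""
    matches = []
    first_row = list(range(len(word) + 1))   # DP row for the empty trie prefix

    def walk(node, prev_row):
        for ch, child in node.items():
            if ch == '$':
                # prev_row[-1] is the distance between this full prefix and word
                if prev_row[-1] <= d:
                    matches.append(child)
                continue
            # extend the DP row by one character of the trie prefix
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + (word[j - 1] != ch)))
            # prune: every deeper row is >= min(row), so no match can lie below here
            if min(row) <= d:
                walk(child, row)

    walk(root, first_row)
    return matches
```

Calling `query` once per string reproduces exactly the independent per-string lookups described above; nothing is shared across the $n$ calls, which is the redundancy an all-vs-all method could hope to exploit.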

edited by D.W. · asked by 1''
  • Your current algorithm is $\Omega(n^2)$, which seems infeasible for your parameters. So I don't really understand your statement that the algorithm "take more like $O(n \log n)$ time". – Yuval Filmus Feb 18 '16 at 19:42
  • I think this or something very similar has been asked before. Definitely related: this and this. Community votes, please: is this a duplicate? – Raphael Feb 18 '16 at 19:49
  • Your output has worst-case size in $\Omega(n^2)$ so no algorithm can be faster in the worst case. – Raphael Feb 18 '16 at 19:50
  • @Raphael Thanks, the first one is almost a duplicate except that it's with Damerau-Levenshtein distance. They say that the naive solution with dynamic programming is O(n^2). I'd argue that the worst-case output size is O(n) since each string has a finite number of possible neighbours (depending on d and l). Also, the Levenshtein transducer's runtime is supposed to be O(1) in the size of the set, per query string. – 1'' Feb 18 '16 at 20:34
  • To quote the link in the question, "Forget about performing a linear scan over your dictionary to find all terms that are sufficiently-close to the user's query, using a quadratic implementation of the Levenshtein distance or Damerau-Levenshtein distance, these babies find all the terms from your dictionary in linear time on the length of the query term (not on the size of the dictionary, on the length of the query term)." – 1'' Feb 18 '16 at 20:35
  • Relevant: the BK-Tree, a data structure that claims to solve your problem in $O(n \log n)$ time (a rough sketch follows this thread). – Rainer P. Feb 18 '16 at 23:28
  • @RainerP. Thanks for the reference. The Levenshtein transducer appears to be strictly better since it takes O(n) time? – 1'' Feb 18 '16 at 23:41
  • Correct me if I am wrong (I don't know a thing about Levenshtein transducers), but it seems to me that you still need to feed every string to the transducer, which makes it $O(n^2)$. The BK-Tree seems to do a word vs. dictionary lookup in $O(\log n)$ and thus a dictionary vs. dictionary lookup in $O(n \log n)$. The tree construction also takes $O(n \log n)$. – Rainer P. Feb 18 '16 at 23:50
  • The transducer has a word vs dictionary lookup in O(1) and a dictionary vs dictionary lookup in O(n), if I understand it correctly. – 1'' Feb 19 '16 at 01:17
  • 1. What's the size of the alphabet? In other words, for each symbol, how many possibilities are there? 2. I think your quoted complexity for the Levenshtein transducer is wrong: in that complicated expression, I suspect the $n$ should be an $n^2$. – D.W. Feb 19 '16 at 07:22
  • @D.W. It's a bioinformatics application, so there are 4 letters. I agree the n seems like it's wrong, but Lemma 9.0.3 of the paper and the author of the Levenshtein transducer library both seem to say it's n, not n^2. – 1'' Feb 19 '16 at 09:13
  • I don't see how you're getting that from Lemma 9.0.3 -- I think you might want to read it again. The lemma says the running time for a single query is $O(\max(nl,l))$. (Here $nl$ counts the total length of the "text", since you have a set of $n$ words each of length $l$.) For this we need to do $n$ queries, so the total running time will be $O(n^2l)$, based on that lemma. The claim by the author of the library is informal and should not be taken that seriously. – D.W. Feb 19 '16 at 17:00
  • I noticed some time ago that strings with small Levenshtein distance tend to be near each other at many of their entries in a suffix array or BWT. It's similar to k-gram counting, but smaller. – KWillets Feb 19 '16 at 20:10
  • @D.W. I realize there's some ambiguity in 9.0.3 - does a "text of words of length h" mean that there are h words in the text, or that the words are h letters long? – 1'' Feb 19 '16 at 23:41
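
For reference, here is a minimal sketch of the BK-Tree that Rainer P. mentions above (names are illustrative; it accepts any metric, e.g. the `levenshtein` function from the first sketch):

```python
class BKTree:
    """Minimal BK-tree sketch. Each node stores a word plus children keyed by
    their distance to that word; the triangle inequality lets a search skip any
    child whose key lies outside [dist - tol, dist + tol]."""

    def __init__(self, distance_fn):
        self.distance = distance_fn
        self.root = None               # (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, tol):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= tol:
                results.append(word)
            # only subtrees at distance within tol of d can contain matches
            for k, child in children.items():
                if d - tol <= k <= d + tol:
                    stack.append(child)
        return results
```

Building the tree is $n$ inserts and each insert or search walks a single root-to-leaf path, which is where Rainer P.'s $O(n \log n)$ figure comes from, assuming those paths stay short.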