Find all pairs of strings in a set with Levenshtein distance < d

Question

I have a set of $n = $ 100 million strings of length $l = 20$, and for each string in the set, I would like to find all the other strings in the set with Levenshtein distance $\le d = 4$ from that string. The Levenshtein distance (also called the edit distance) between two strings is the number of insertions, deletions and/or replacements required to convert one string into the another.

This should be possible in $O((d + 1)^{2d + 1} \cdot l \cdot n)$ time with a Levenshtein transducer, which takes a single query string at a time and find all the matches in a set with Levenshtein distance $\le d$. However, using the implementation at that link, it appears to take more like $O(n \log n)$ time rather than $O(n)$, and uses more than 200 GB of memory.

Is there an alternate $O(n)$ approach that might be faster in practice? The Levenshtein transducer is more general than it needs to be for this application, since it finds matches for each string independently and doesn't exploit the fact that you're comparing every string against every other string.

Your current algorithm is $\Omega(n^2)$, which seems infeasible for your parameters. So I don't really understand your statement that the algorithm "take more like $O(n \log n)$ time". — Yuval Filmus, Feb 18 '16 at 19:42
I think this or somthing very similar has been asked before. Definitely related: this and this. Community votes, please: is this a duplicate? — Raphael, Feb 18 '16 at 19:49
Your output has worst-case size in $\Omega(n^2)$ so no algorithm can be faster in the worst case. — Raphael, Feb 18 '16 at 19:50
@Raphael Thanks, the first one is almost a duplicate except that it's with Damerau-Levenshtein distance. They say that the naive solution with dynamic programming is O(n^2). I'd argue that the worst-case output size is O(n) since each string has a finite number of possible neighbours (depending on d and l). Also, the Levenshtein transducer's runtime is supposed to be O(1) in the size of the set, per query string. — 1'', Feb 18 '16 at 20:34
To quote the link in the question, "Forget about performing a linear scan over your dictionary to find all terms that are sufficiently-close to the user's query, using a quadratic implementation of the Levenshtein distance or Damerau-Levenshtein distance, these babies find all the terms from your dictionary in linear time on the length of the query term (not on the size of the dictionary, on the length of the query term)." — 1'', Feb 18 '16 at 20:35
Relevant: the BK-Tree, a datastructure that claims to solve your problem in $O(n~log~n)$ time. — Rainer P., Feb 18 '16 at 23:28
@RainerP. Thanks for the reference. The Levenshtein transducer appears to be strictly better since it takes O(n) time? — 1'', Feb 18 '16 at 23:41
Correct me if I am wrong (I don´t know a thing about levenstein transducers), but it seems to me that you still need to feed any string to any transducer, what makes it $O(n^2)$. The BK-Tree seems to do a word vs. dictionary lookup in $O(log~n)$ and thus a dictionary vs. dictionary lookup in $O(n~log~n)$. The tree construction takes also $O(n~log~n)$. — Rainer P., Feb 18 '16 at 23:50
The transducer has a word vs dictionary lookup in O(1) and a dictionary vs dictionary lookup in O(n), if I understand it correctly. — 1'', Feb 19 '16 at 01:17
@D.W. It's a bioinformatics application, so there are 4 letters. I agree the n seems like it's wrong, but lemma 9.0.3 of the paper and the author of the Levenshtein transducer library both seem to say it's n, not n^2. — 1'', Feb 19 '16 at 09:13
I don't see how you're getting that from Lemma 9.0.3 -- I think you might want to to read it again. The lemma says the running time for a single query is $O(\max(nl,l))$. (Here $nl$ counts the total length of the "text", since you have a set of $n$ words each of length $l$.) For this we need to do $n$ queries, so the total running time will be $O(n^2l)$, based on that lemma. The claim by the author of the library is informal and should not be taken that seriously. — D.W., Feb 19 '16 at 17:00
I noticed some time ago that strings with small Levenshtein distance tend to be near each other at many of their entries in a suffix array or BWT. It's similar to the k-gram counting, but smaller. — KWillets, Feb 19 '16 at 20:10
@D.W. I realize there's some ambiguity in 9.0.3 - does a "text of words of length h" mean that there are h words in the text, or that the words are h letters long? — 1'', Feb 19 '16 at 23:41

D.W. · Accepted Answer · 2019-09-24T15:49:18.590

There's a "trick" you can use that might potentially speed up your algorithm a little: shingling. No guarantees that it'll necessarily help in your particular case, though.

Lemma. If the edit distance between two words $w,x$ is $\le 4$, and the two words both have length 20, then there exists some 4-gram that is common to both words (i.e., some $u$ of length 4 that is a substring of both $w$ and $x$).

This lemma then suggests the algorithmic trick. Build a table with $4^4$ buckets, one bucket for each possible $4$-gram. Given a word $w$, extract all of its 4-grams (there will be 17 of them) and insert it into those 17 buckets. Do this for each word.

Now, if there is a pair of words at edit distance $\le 4$, then there must be some bucket that contains both words. So, for each bucket, we search the set of words within that bucket for a pair of words at edit distance $\le 4$. Use whatever algorithm you like for that task. I would suggest trying BK-trees or metric trees.

Will this trick help? The only way to know for sure is to try it. But here's the intuition. We'll have 256 buckets. Each word will get mapped into 17 buckets. Thus, heuristically, we expect each bucket to have $n'=17n/256$ words in it. If we have an algorithm that can search for pairs of words at edit distance $\le 4$ among a set of words in $T(n)$ time, then applying that directly to the original set of words takes $T(n)$ time. In contrast, after applying this trick, we apply it once to each bucket. Now if $256 T(17n/256)$ is smaller than $T(n)$, this trick might yield improvements in running time. My prediction is that it'll yield improvements for $\Theta(n^2)$-runtime algorithms, but not for $O(n \lg n)$ or $O(n)$ time algorithms.

This trick can be combined with any algorithm for looking for a pair of words with edit distance $\le 4$ in a set of $n$ words.

The "trick" can be improved further. Define a "slate" of the word $w$ to be a pair $(i,u)$ where $i$ is a position ($1 \le i \le 17$) and $u$ is the 4-gram that occurs at position $i$ within $w$. From each word you get 17 slates. Now we get a refined lemma:

Lemma. If the edit distance between two words $w,x$ of length 20 is $\le 4$, then there exists a slate $(i,u)$ of $w$ and a slate $(j,v)$ of $x$ such that $u=v$ and $|i \le j| \le 2$.

(It might be possible to improve the latter condition to $|i \le j| \le 1$, but I haven't tried to check.)

We'll do the same thing as before, but identifying each bucket with a slate. We get $20 \times 256$ different buckets. Now, for each word $w$, extract its 17 slates and store the word in the corresponding 17 buckets. Let $B_{i,u}$ denote the set of words in the bucket for the slate $(i,u)$. Finally, for each slate $(i,u)$, we check whether there exists a word $w \in B_{i-2,u} \cup B_{i-1,u} \cup B_{i,u} \cup B_{i+1,u} \cup B_{i+2,u}$ that is within edit distance $\le 4$ of some word in $B_{i,u}$.

Heuristically, we expect each bucket to have $n'=17n/(20 \times 256) \approx 0.0033 n$ words in it. We expect the running time for each bucket to be something like $5 T(n')$, and we need to do that $20 \times 256$ times, for a total running time of $25600 T(0.0033n)$.

Is this better than the prior trick? The only way to know for sure is to try it. My prediction is that this'll be better for $\Theta(n^2)$-runtime algorithms, but worse for $O(n \lg n)$ or $O(n)$ time algorithms.

Finally, I don't think you have the running time of your scheme right. I believe the Levenshtein transducer automata approach will have running time $\Omega(n^2)$ in your setting, as several folks have outlined in the comment thread under your question.

@BenoîtLegat, the poster specified in a comment that the alphabet size is 4 (perhaps these are sequences of DNA bases): see https://cs.stackexchange.com/questions/53299/find-all-pairs-of-strings-in-a-set-with-levenshtein-distance-d/53353?noredirect=1#comment111813_53299. But yeah, if not for that obscure comment, that would be a good catch. :-) — D.W., Sep 24 '19 at 15:47
Thanks for the quick reply, I had indeed missed that from the question. — Benoît Legat, Sep 27 '19 at 12:09

Find all pairs of strings in a set with Levenshtein distance < d

1 Answers1

Linked