
I'm having trouble figuring out the upper bound running time for this scenario:

Input:

  • $N$: number of strings
  • $M$: upper bound on string length
  • $T$: threshold for edit distance (two strings with a Damerau-Levenshtein edit distance lower than $T$ are considered "duplicates")

Expected values:

  • $N \approx 1,000,000$
  • $M \approx 200$
  • $T \leq 2$

The algorithm should do the following:

For each string in the list, find all other strings in that list with an edit distance smaller than the threshold, and mark them as "duplicates" (e.g. add them to some other list that tracks duplicates).
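
To make the counting concrete, here is a minimal sketch of the naive pairwise approach I have in mind (the names `find_duplicates` and `dist` are just placeholders; `dist` stands for whatever Damerau-Levenshtein routine ends up being used):

```python
def find_duplicates(strings, threshold, dist):
    """Naive pairwise scan: flag every string whose distance to some other
    string in the list is below the threshold.

    Makes N * (N - 1) / 2 calls to dist, so the total cost is
    O(N^2) times the cost of one distance computation."""
    duplicates = set()
    n = len(strings)
    for i in range(n):
        for j in range(i + 1, n):
            if dist(strings[i], strings[j]) < threshold:
                duplicates.add(i)
                duplicates.add(j)
    return duplicates
```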

I'm having trouble working out the upper bound for the optimal solution.

What is the tightest upper bound achievable by such an algorithm?

I guess first I need to understand the best algorithm for the edit distance itself (the standard dynamic program is $O(M^2)$ for two strings of length at most $M$, right?), and then the naive approach is simply $N^2$ times that. So it is surely slower than $O(N^2)$, but my question is: how much slower?
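
For reference, this is the dynamic program I have in mind: the restricted ("optimal string alignment") variant of Damerau-Levenshtein, which runs in $O(M^2)$ time for two strings of length at most $M$ (a sketch only, function name is my own):

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Standard dynamic program: O(len(a) * len(b)) time and space,
    i.e. O(M^2) when both strings have length at most M."""
    la, lb = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[la][lb]
```

Plugging this into the pairwise scan gives $O(N^2 M^2)$ overall, which is the bound I'd like to improve on, or at least know whether it is tight.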

  • It depends. What is the context in which you have run across this question? If it is a practical situation, there are potentially better algorithms than computing the edit distance $N^2$ times. If it is practical, can you give us a sense of the rough size of $N$, $M$, and $T$? Also, what research have you done? We expect you to do a significant amount of research before asking. There are lots of resources on how to compute the edit distance and the asymptotic running time of doing so. – D.W. Jun 09 '14 at 19:17
  • @D.W. Thanks! It's for finding duplicate addresses. $N$ can be very large; $M$ is the maximum length of a possible address (any address in the world). I would assume the longest possible address (longest street name + place name) is still below 200 chars; for simplicity we can assume an even smaller $M$, e.g. 50 will most likely cover 99% of the cases. $N$ can be in the range of several millions; let's say 1 million for simplicity. $T$ is probably 2-3 max, although some address variations swap whole words, not just chars (e.g. "Foo Avenue" / "Avenue of Foo"), but let's ignore this :) – Eran Medan Jun 09 '14 at 20:26
  • Great, and what's $T$? There are some algorithms that are much more efficient than pairwise edit distances, but they depend heavily on $T$. – D.W. Jun 09 '14 at 20:32
  • 2
    Note that "edit distance < $T$" is not transitive, i.e. you might have words $u$, $v$, and $w$, where $u$ and $v$ resp. $v$ and $w$ are duplicates of each other by your definition, but $u$ and $w$ are not. How do you want to deal with such cases? Also, the info from your previous comment should be incorporated into the question, in order to make it self contained. – FrankW Jun 09 '14 at 21:38
  • @D.W. sorry, I edited the comment, T is probably between 1 and 3. – Eran Medan Jun 10 '14 at 15:41
  • The "expected values" are utterly irrelevant for O-bounds. 2) There is no single upper O-bound. Do you mean $\Theta$? 3) Optimal algorithms are notoriously hard to find. Are you satisfied with any "good" one? How good is good enough for you? 4) What FrankW said. If you don't fix one element, the problem is not well-defined.
  • – Raphael Jun 11 '14 at 06:19
  • 1
    @Raphael, 1) That's why I asked for the context and whether this is a practical situation. Based upon the 2nd comment, the answer appears to be yes, it's a practical situation (in which case typical values are relevant for choosing which algorithms are likely to be most suitable in his context). Probably the question should be edited to ask for a good algorithm for his situation rather than the best possible big-O bounds. – D.W. Jun 11 '14 at 16:09
  • @Raphael thanks for the comments! I guess I shouldn't have used "expected values" since it has a statistical meaning, I meant more of "assumed values" e.g. an upper bound. – Eran Medan Jun 11 '14 at 18:49