
I'm having trouble figuring out the upper bound running time for this scenario:

Input:

  • $N$: number of strings
  • $M$: upper bound on string length
  • $T$: threshold for edit distance (two strings with a Damerau-Levenshtein edit distance lower than $T$ are considered "duplicates")

Expected values:

  • $N \approx 1,000,000$
  • $M \approx 200$
  • $T \leq 2$

The algorithm should do the following:

For each string in the list, find all other strings in that list with an edit distance smaller than the threshold, and mark them as "duplicates" (e.g. add them to some other list that tracks duplicates).
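
To make the counting concrete, here is a minimal sketch of the naive pairwise approach I have in mind (the names `find_duplicates` and `dist` are just placeholders; `dist` stands for whatever Damerau-Levenshtein routine ends up being used):

```python
def find_duplicates(strings, threshold, dist):
    """Naive pairwise scan: flag every string whose distance to some other
    string in the list is below the threshold.

    Makes N * (N - 1) / 2 calls to dist, so the total cost is
    O(N^2) times the cost of one distance computation."""
    duplicates = set()
    n = len(strings)
    for i in range(n):
        for j in range(i + 1, n):
            if dist(strings[i], strings[j]) < threshold:
                duplicates.add(i)
                duplicates.add(j)
    return duplicates
```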

I'm having trouble working out the upper bound for the optimal solution.

What is the tightest upper bound achievable by such an algorithm?

I guess first I need to understand the best algorithm for the edit distance itself (the standard dynamic program is $O(M^2)$ for two strings of length at most $M$, right?), and then the naive approach is simply $N^2$ times that. So it is surely slower than $O(N^2)$, but my question is: how much slower?
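
For reference, this is the dynamic program I have in mind: the restricted ("optimal string alignment") variant of Damerau-Levenshtein, which runs in $O(M^2)$ time for two strings of length at most $M$ (a sketch only, function name is my own):

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Standard dynamic program: O(len(a) * len(b)) time and space,
    i.e. O(M^2) when both strings have length at most M."""
    la, lb = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[la][lb]
```

Plugging this into the pairwise scan gives $O(N^2 M^2)$ overall, which is the bound I'd like to improve on, or at least know whether it is tight.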

  • It depends. What is the context in which you have run across this question? If it is a practical situation, there are potentially better algorithms than computing the edit distance $N^2$ times. If it is practical, can you give us a sense of the rough size of $N$, $M$, and $T$? Also, what research have you done? We expect you to do a significant amount of research before asking. There are lots of resources on how to compute the edit distance and the asymptotic running time of doing so. – D.W. Jun 09 '14 at 19:17
  • @D.W. Thanks! It's for finding duplicate addresses. $N$ can be very large; $M$ is the maximum length of a possible address (any address in the world). I would assume the longest possible address (longest street name + place name) is still below 200 chars; for simplicity we can assume an even smaller $M$, e.g. 50 will most likely cover 99% of the cases. $N$ can be in the range of several millions; let's say 1 million for simplicity. $T$ is probably 2-3 max, although some address variations swap whole words, not just chars (e.g. "Foo Avenue" / "Avenue of Foo"), but let's ignore this :) – Eran Medan Jun 09 '14 at 20:26
  • Great, and what's $T$? There are some algorithms that are much more efficient than pairwise edit distances, but they depend heavily on $T$. – D.W. Jun 09 '14 at 20:32
  • 2
    Note that "edit distance < $T$" is not transitive, i.e. you might have words $u$, $v$, and $w$, where $u$ and $v$ resp. $v$ and $w$ are duplicates of each other by your definition, but $u$ and $w$ are not. How do you want to deal with such cases? Also, the info from your previous comment should be incorporated into the question, in order to make it self contained. – FrankW Jun 09 '14 at 21:38
  • @D.W. sorry, I edited the comment, T is probably between 1 and 3. – Eran Medan Jun 10 '14 at 15:41
  • The "expected values" are utterly irrelevant for O-bounds. 2) There is no single upper O-bound. Do you mean $\Theta$? 3) Optimal algorithms are notoriously hard to find. Are you satisfied with any "good" one? How good is good enough for you? 4) What FrankW said. If you don't fix one element, the problem is not well-defined.
  • – Raphael Jun 11 '14 at 06:19
  • 1
    @Raphael, 1) That's why I asked for the context and whether this is a practical situation. Based upon the 2nd comment, the answer appears to be yes, it's a practical situation (in which case typical values are relevant for choosing which algorithms are likely to be most suitable in his context). Probably the question should be edited to ask for a good algorithm for his situation rather than the best possible big-O bounds. – D.W. Jun 11 '14 at 16:09
  • @Raphael thanks for the comments! I guess I shouldn't have used "expected values" since it has a statistical meaning, I meant more of "assumed values" e.g. an upper bound. – Eran Medan Jun 11 '14 at 18:49