
This question has been prompted by Efficient data structures for building a fast spell checker.

Given two strings $u,v$, we say they are $k$-close if their Damerau–Levenshtein distance¹ is small, i.e. $\operatorname{LD}(u,v) \leq k$ for a fixed $k \in \mathbb{N}$. Informally, $\operatorname{LD}(u,v)$ is the minimum number of deletion, insertion, substitution and (neighbour) swap operations needed to transform $u$ into $v$. It can be computed in $\Theta(|u|\cdot|v|)$ time by dynamic programming. Note that $\operatorname{LD}$ is a metric; in particular, it is symmetric.
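
For concreteness, here is a minimal sketch of that dynamic program (in Python; it computes the restricted variant, also known as optimal string alignment, in which each transposed pair is edited at most once; the name `dl_distance` is mine, not taken from any library):

```python
def dl_distance(u: str, v: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(u), len(v)
    # d[i][j] = distance between the prefixes u[:i] and v[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
            # adjacent transposition (swap of neighbouring characters)
            if i > 1 and j > 1 and u[i - 1] == v[j - 2] and u[i - 2] == v[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

assert dl_distance("ab", "ba") == 1   # one neighbour swap
assert dl_distance("aa", "aab") == 1  # one insertion
```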

The question of interest is:

Given a set $S$ of $n$ strings over $\Sigma$ with lengths at most $m$, what is the cardinality of

$\qquad \displaystyle S_k := \{ w \in \Sigma^* \mid \exists v \in S.\ \operatorname{LD}(v,w) \leq k \}$?

As even two strings of the same length can have different numbers of $k$-close strings², a general formula/approach may be hard (impossible?) to find. Therefore, we might have to compute the number explicitly for every given $S$, leading us to the main question:

What is the (time) complexity of finding the cardinality of the set $\{w\}_k$ (that is, $S_k$ for the singleton set $S = \{w\}$) for (arbitrary) $w \in \Sigma^*$?

Note that the desired quantity can be exponential in $|w|$ (and the naive candidate space $\Sigma^{\leq |w|+k}$ certainly is), so explicit enumeration is not desirable. An efficient algorithm would be great.

If it helps, it can be assumed that we do indeed have a (large) set $S$ of strings, that is, that we solve the first highlighted question.


  1. Possible variants include using the Levenshtein distance instead.
  2. Consider $aa$ and $ab$. The sets of $1$-close strings over $\{a,b\}$ are $\{a, aa, ab, ba, aaa, baa, aba, aab\}$ (8 words) and $\{a, b, aa, bb, ab, ba, aab, bab, abb, aba\}$ (10 words), respectively.
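
These counts can be checked by brute force, reusing the `dl_distance` sketch from above. This enumerates all of $\Sigma^{\leq |w|+k}$, which is exponential — exactly the kind of computation the question asks to avoid — but it serves as a reference for tiny inputs:

```python
from itertools import product

def close_strings(w: str, k: int, alphabet: str):
    """All words within restricted Damerau-Levenshtein distance k of w,
    found by exhaustive enumeration (exponential, reference only)."""
    result = set()
    for length in range(len(w) + k + 1):   # longer words cannot be k-close
        for letters in product(alphabet, repeat=length):
            v = "".join(letters)
            if dl_distance(w, v) <= k:
                result.add(v)
    return result

print(len(close_strings("aa", 1, "ab")))  # 8
print(len(close_strings("ab", 1, "ab")))  # 10
```
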
Raphael
  • Isn't the highlighted question basically a k-nearest neighbour search? More specifically, I'm thinking about spatial indices. There are data structures that support efficient k-NN queries with an arbitrary metric (with some constraints), such as the M-tree and its variants. Am I missing something, or do you think this would work? – Juho May 09 '12 at 18:26
  • @mrm Sure, that would work -- if I were to write down all exponentially many words up to some length (which I don't want to do), compute all pairwise alignments (which I want to circumvent) and then build the tree. – Raphael May 10 '12 at 00:01
  • @mrm: Now that I think about it, finding the $k$ nearest neighbours does not solve the problem. We want to find all neighbours (up to a fixed distance). – Raphael May 10 '12 at 07:25
  • Right, it's a range query search then. I think there's quite a bit of research on the subject, with huge amounts of data and large databases. But regardless, I see your point now. Maybe there's a more clever way :) – Juho May 10 '12 at 09:32
  • A couple of rather easy observations: (1) if only deletions are allowed, then the (second) problem is polynomial; (2) a bound for the count is $O\bigl((|w|+k)^k\bigr)$. – rgrig May 10 '12 at 21:56
  • @Raphael, I am almost positive (i.e. I have code that I am very sure does this, based off of this paper) that the Levenshtein distance can be computed in $O(\max(|u|, |v|))$ time. – soandos May 10 '12 at 23:51
  • Perhaps you can compute a DFA recognizing the set $S_k$ (I believe this is known as a Levenshtein automaton) and then use the standard dynamic programming algorithm to count the number of words accepted by this DFA. I see statements that this automaton can be constructed in $O(|w|)$ time when $k$ is fixed, but I can't quite figure out what the dependence on $k$ is. – D.W. Apr 25 '19 at 16:22
  • @D.W. Interesting! Intuitively, I'd have said it has to be exponential (as the finite (!) languages we handle here most certainly are) -- but a DFA can't have more than $|\Sigma| \cdot |Q| \in O(|\Sigma| \cdot |w|)$ transitions, so it can't be too bad? – Raphael Apr 25 '19 at 16:35
  • @D.W. Of course, if we create an NFA for a set of words using the standard construction and then determinise, all bets are off again. Maybe one can be smarter. – Raphael Apr 25 '19 at 16:37
  • @Raphael, the cool thing about the classic Schulz & Mihov paper is that they construct the DFA directly (without going through an NFA and then determinising). However, their paper only describes the dependence on $|w|$ and not the dependence on $k$; probably because they were primarily interested in small values of $k$ for their application. – D.W. Apr 25 '19 at 16:39
  • @D.W. That's cool, but I meant using Thompson's construction to build the automaton for $S_k$ from the ones built by Schulz & Mihov, which, if I understood you correctly, each handle only a single word? – Raphael Apr 25 '19 at 16:48
  • @Raphael, Hmm. On further reflection I realize I don't understand what this question is asking. What is $\{w\}_k$? That notation isn't defined. Schulz & Mihov's algorithm constructs a DFA that accepts all words $v$ that are at distance $\le k$ from some fixed word $w$. I was thinking of using that to count the number of words at distance $\le k$ from a single fixed word $w$. On further reflection, I'm not sure if that is what you are asking or not. – D.W. Apr 25 '19 at 16:54
  • @D.W. No, I'm looking at a set of words $S$ (think spell-checker dictionary), with $S_k$ being defined in the question. For $S = \{w\}$ I short-handed to $\{w\}_k$; certainly not the best notation I ever introduced, erm. – Raphael Apr 25 '19 at 17:09
  • @D.W. I just realized I completely ignored the second block-quote question I asked. Sorry -- you seem to have a good answer for that! Mind posting it? (An answer for the more general question (first block quote) follows, even constructively. It seems Schulz & Mihov have quite extensive work on how to solve the motivating problem, too!) – Raphael Apr 25 '19 at 17:12
  • I'm not sure that my answer is any good because it might be that the size of the DFA is exponential in $k$, which wouldn't be interesting (you could just enumerate words with breadth-first search with probably the same complexity). So I'm not sure whether I have a good answer or not. By the way, I encourage you to edit the question to clean it up -- right now it uses non-standard notation $\{w\}_k$ without defining it. – D.W. Apr 25 '19 at 17:13
  • @D.W. I see, further investigation is needed. (I disagree. $\{w\}$ is a set of strings, so the notation -- ill-considered as it may be -- is defined in the first block quote.) – Raphael Apr 25 '19 at 17:34
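
To make the counting idea from D.W.'s comments concrete, here is a minimal sketch (Python; all names are mine). It uses the plain Levenshtein distance, i.e. without transpositions, which happens to coincide with the Damerau–Levenshtein distance on the $aa$ example from footnote 2. Instead of the direct Schulz & Mihov construction, it determinises the Levenshtein NFA for $(w, k)$ on the fly and then counts accepted words length by length; for a set $S$ one would union the NFAs first, as discussed in the comments.

```python
def count_close_words(w: str, k: int, alphabet: str) -> int:
    """Count distinct words v with plain Levenshtein distance LD(w, v) <= k
    by determinising the Levenshtein NFA for (w, k) via subset construction
    and counting the words accepted by the resulting DFA, length by length."""
    n = len(w)

    def closure(states):
        # epsilon moves: deleting w[i] costs one error and reads no character
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < n and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return frozenset(seen)

    def step(states, c):
        nxt = set()
        for i, e in states:
            if i < n and w[i] == c:
                nxt.add((i + 1, e))          # match
            if e < k:
                if i < n:
                    nxt.add((i + 1, e + 1))  # substitution
                nxt.add((i, e + 1))          # insertion
        return closure(nxt)

    def accepting(states):
        return any(i == n for i, _ in states)

    start = closure({(0, 0)})
    counts = {start: 1}                   # DFA state -> number of words of the current length
    total = 1 if accepting(start) else 0  # the empty word
    for _ in range(n + k):                # longer words cannot be within distance k
        new_counts = {}
        for st, c_words in counts.items():
            for c in alphabet:
                nst = step(st, c)
                if nst:
                    new_counts[nst] = new_counts.get(nst, 0) + c_words
        counts = new_counts
        total += sum(c_words for st, c_words in counts.items() if accepting(st))
    return total

print(count_close_words("aa", 1, "ab"))  # 8, matching footnote 2
```

The running time here depends on how many distinct DFA states (subsets of NFA states) actually arise, which is exactly the dependence on $k$ that the comments leave open.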

2 Answers

1

See Levenshtein's paper. It contains bounds on the number of strings obtained from insertions and deletions applied to a string. If $n$ is the length of the string and the string is binary, then the maximum number of nearest neighbours in the Levenshtein distance is $\Theta(n^2)$. It is comparatively harder to say anything about $k$-nearest neighbours, but one can get bounds. These should give you an estimate of the complexity.

Ankur
  • Thanks, but this is neither the correct metric, nor is a binary alphabet sufficient (though alphabet size probably has no qualitative impact). I don't speak Russian so I can't check how easily the results can be transferred. – Raphael May 10 '12 at 16:26
  • Bounds seem easy to find, but the question asks for an exact count. Am I wrong @Raphael? – rgrig May 10 '12 at 17:06
  • There is an English version of Levenshtein's paper that you should be able to find; it also contains bounds for general alphabet. – Ankur May 10 '12 at 20:04
  • @rgrig: The question asks for the precise number, but (good) bounds would be appreciated. – Raphael May 14 '12 at 16:09
0

If your $k$ is fixed and you are allowed to do pre-processing, then this is something you might be able to try:

  1. Construct a graph such that the nodes are words and an edge exists between two nodes if the distance between those two words is 1.
  2. Get the adjacency matrix corresponding to that graph (say $M$)
  3. Compute $M^k$

Now, you may be able to use the final matrix to answer all the queries. If you can store $M, M^2, M^4, M^8, \ldots$, you might be able to answer queries for a larger range of $k$ instead of a fixed $k$; of course, one pays here with the cost of the matrix multiplications.
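
To illustrate, a rough sketch of steps 1–3 (Python; the function name and the use of NumPy are my choices, and `dist` stands for any edit-distance routine, e.g. the `dl_distance` sketch in the question). As the comments below point out, if the node set is just $S$ this can only tell us about the part of $S_k$ that lies inside $S$; moreover, $k$-step reachability in the graph may miss pairs at edit distance $\leq k$ whose intermediate words are not in the node set.

```python
import numpy as np

def reachable_within_k(words, k, dist):
    """Steps 1-3 above: build the 'edit distance exactly 1' graph on `words`,
    then use boolean matrix powers to get reachability within at most k steps."""
    n = len(words)
    # Step 1 + 2: adjacency matrix M (with the diagonal set, so that powers
    # capture 'at most k' rather than 'exactly k' steps).
    M = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            M[i, j] = (i == j) or dist(words[i], words[j]) == 1
    # Step 3: R[i, j] is True iff words[j] is reachable from words[i] within
    # k steps; computed iteratively so the intermediate counts stay small.
    R = np.eye(n, dtype=bool)
    A = M.astype(np.int64)
    for _ in range(k):
        R = (R.astype(np.int64) @ A) > 0
    return R
```

With `words = sorted(S)` and `dist = dl_distance`, the row of `R` for a query word lists the members of $S$ it can reach, which is at best $S_k$ restricted to $S$.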

TenaliRaman
  • This is a rather naive procedure, isn't it? Computing all pairwise distances and performing breadth-first search up to depth $k$ is already more efficient. – Raphael May 14 '12 at 16:17
  • I am assuming that you mean breadth-first search in the graph constructed above. In which case, you will be doing the search for every query you do. That would be no better than enumeration (which you specified in your question that you didn't want to do). In my reply above, I compute $M^k$ as a pre-processing step, which has to be done just once. After that, for every query one has to just go through a row/column of that matrix, thereby giving a faster response time. – TenaliRaman May 14 '12 at 18:07
  • Well, both ways can hide their "real" effort as preprocessing. Note that $M$ is exponentially big in the maximum length $m$, so "just going through a row/column" is not efficient. Computing the distances themselves is not the bottleneck here. (You would need $\sum_{i=1}^k M^i$, by the way.) – Raphael May 14 '12 at 19:38
  • Actually $M$ is just $\text{num\_words} \times \text{num\_words}$. Also, it is Boolean and possibly very sparse. Do you see why? – TenaliRaman May 14 '12 at 19:56
  • Yes, and no. $S_k$ contains all close words, and there are exponentially many words, i.e. $\text{num\_words} = 2^m$. I edited the question to clarify. – Raphael May 14 '12 at 19:58
  • The $M$ above depends on $S$ and not $S_k$. – TenaliRaman May 14 '12 at 22:16
  • Then I can not read off $S_k$, and your post does not answer the question. – Raphael May 15 '12 at 08:42
  • I see, I just re-read your definition of $S_k$. You are right that you cannot read off $S_k$. You can only read off $S_k$ restricted to $S$, but obviously that's not what you are looking for. – TenaliRaman May 15 '12 at 09:15