I need an algorithm to search for substrings. I checked different resources, and it seems that the best-known algorithms are Boyer–Moore and Knuth–Morris–Pratt.

However, as far as I understand, these operate on "regular" strings, but what I need is a substring search on a circular string.

A circular string is a string characterized only by its size and the order of its elements, i.e. ABCD is the same as BCDA, CDAB and DABC.

A source/query example that should succeed:

Source string: EFxxxABCxxxxxD
Query string:  DEF

Do you know of any references on substring search on circular strings? Any advice on how to do this?

kebs
  • Do you have more details about the problem? Can the circular string contain fewer elements than the substring pattern? Do you use the same pattern repeatedly, so that it might be worth compiling it (as in Boyer–Moore or KMP)? – babou May 18 '15 at 17:04
  • @babou Q1: no. Q2: I don't understand; what do you mean by "it might be worth compiling it"? At present, I have implemented what @Marc-Johnston suggested. It seems to work, but I haven't done extensive tests yet. – kebs May 18 '15 at 17:05
  • I guess my question may not have been properly stated. If you consider the KMP algorithm, there is a cost in building the table, which is O(m), m being the query string size. If the same query string is used many times, then this cost may be ignored, as it is amortized over many queries. In that case, considering the cost of concatenating small loops makes sense when assessing complexity. But if you include the table creation in the cost, then the discussion of concatenation cost is pretty much pointless. I am working on how to avoid most of the concatenation, hence the question. – babou May 18 '15 at 17:28
  • When you reply "Q1: no", you mean that the source string is always larger than the query string, right? – babou May 18 '15 at 17:30
  • "You mean that the source string is always larger than the query string": yes. – kebs May 18 '15 at 17:32
  • @babou Regarding your previous comment, I see what you mean. At present, my concerns are more about other parts of the "global" algorithm I'm working on (this is only a "sub"-problem) than about optimizing. I have rather small strings (~10^2), and the query string will be small too (1-10). But maybe consider posting an answer? – kebs May 18 '15 at 17:36
  • I did post an answer, but removed it as there is a subtle graph problem I have to solve better (my solution was not correct because of missing cases). But it concerns only the case when the query is larger than the source. – babou May 18 '15 at 17:44

1 Answer

Create a temporary source string by concatenating the source string with itself until the temporary string is at least twice as long as the search string. The source string must be concatenated with itself at least once, even if it is already long enough.

Then perform a simple (non-circular) search on that temporary string.
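
A minimal sketch of this approach in Python, for illustration only. The function name `circular_find` is hypothetical, and instead of the "twice the search string" rule it uses the tighter bound of $n + m - 1$ characters suggested in the comments below; Python's built-in `str.find` stands in for the "simple (non-circular) search".

```python
def circular_find(source: str, query: str) -> int:
    """Return the index in `source` where `query` starts when `source`
    is read circularly, or -1 if there is no match.

    Hypothetical helper: the name and the return convention are
    assumptions made for this example.
    """
    n, m = len(source), len(query)
    if m == 0:
        return 0
    if n == 0:
        return -1

    # n + m - 1 characters cover every start position 0..n-1,
    # including matches that wrap around the end of the source.
    needed = n + m - 1
    repeats = -(-needed // n)  # ceiling division
    extended = (source * repeats)[:needed]

    # Ordinary (non-circular) substring search on the extended string.
    return extended.find(query)


if __name__ == "__main__":
    print(circular_find("EFxxxABCxxxxxD", "DEF"))
```

With the example from the question, the match starts at the final `D` of the source and wraps around to the leading `EF`.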

Marc Johnston
  • If the source string is of length $n$ and the query string of length $m$, then this algorithm is $O(n + m)$ if KMP is run on the resulting instance, which is the same complexity as KMP on a noncircular string, which is optimal. However, you should duplicate the string until it has at least $n + m - 1$ characters, as @kebs's example shows. – Bryce Sandlund May 15 '15 at 18:43
  • Agreed ... this algorithm is pretty much O(n + m) ... although a binary search tree may be able to make it O(log n + m) – Marc Johnston May 15 '15 at 19:02
  • I updated the algorithm to denote a minimum of 1 concatenation is required. – Marc Johnston May 16 '15 at 01:02
  • Also note this ... the circular problem is really solved only with the concatenation step. Then the problem becomes an algorithm for substring searching. To truly improve the big O for the original problem, the concatenation algorithm needs to be improved, not the substring search. – Marc Johnston May 16 '15 at 01:05
  • Sorry, what do you mean by "improve the concatenation algorithm"? To me, concatenation is not an algorithm but an operation ("AB" + "CD" = "ABCD"). Or maybe you are suggesting that a partial concatenation can be enough? – kebs May 16 '15 at 12:50
  • I thought about it a little bit... And really the source string only has to be n + m - 1 in length ... doubling the source string in length each time would be wasteful... n+m-1 would be the least amount of characters ... which reduces the search string length – Marc Johnston May 16 '15 at 16:52
  • Another optimization might be... Instead of performing concatenation... implement the substring search manually and do a (mod)ular loop of n+m-1 characters on the source string... depending on how that's implemented it may or may not be faster because built-in substring searches are usually highly optimized – Marc Johnston May 16 '15 at 16:55
  • It can be done without concatenation, using an optimized substring search algorithm. This makes the complexity dependent only on the size of the source string, i.e. just $O(n)$ assuming preprocessing of the query string. cc @BryceSandlund – babou May 18 '15 at 00:09
  • Accepting, with a warning for the future reader: read the comments, some more optimization is possible. Or maybe you could edit your answer? – kebs May 18 '15 at 17:08
  • @babou "assuming preprocessing of the query string" <- this takes $O(m)$ time. I assume you are referring to KMP. – Bryce Sandlund May 29 '15 at 14:29
  • @BryceSandlund Yes, KMP. But there is that discussion on whether one must increase the source string to the size of the pattern when the pattern is longer. From a complexity viewpoint, worrying about this makes sense only if you abstract away the $O(m)$ preprocessing, by considering it amortized over many searches. This led me to a problem I do not know how to solve ... but it could be a nice question. – babou May 29 '15 at 14:45
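
The concatenation-free variant discussed in these comments (a modular loop over $n + m - 1$ characters of the source, combined with KMP) could be sketched roughly as below. This is only an illustration under assumptions: the names `build_failure_table` and `circular_kmp_find` are hypothetical, and the code is a textbook KMP adapted with `i % n` indexing, not necessarily what any commenter had in mind.

```python
def build_failure_table(pattern: str) -> list:
    """Standard KMP failure table: fail[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it.
    Takes O(m) time for a pattern of length m."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def circular_kmp_find(source: str, query: str) -> int:
    """Return the start index of `query` in the circular string `source`,
    or -1. No concatenated copy of the source is built."""
    n, m = len(source), len(query)
    if m == 0:
        return 0
    if n == 0:
        return -1

    fail = build_failure_table(query)
    k = 0  # number of query characters currently matched
    # Scan n + m - 1 positions; i % n wraps around the end of the source.
    for i in range(n + m - 1):
        c = source[i % n]
        while k > 0 and c != query[k]:
            k = fail[k - 1]
        if c == query[k]:
            k += 1
        if k == m:
            return i - m + 1  # always in the range 0..n-1
    return -1


if __name__ == "__main__":
    print(circular_kmp_find("EFxxxABCxxxxxD", "DEF"))
```

The scan itself visits $n + m - 1$ characters and the table costs $O(m)$, so the total stays $O(n + m)$; what is saved is the temporary copy of the source. For strings of the sizes mentioned in the question (~10^2), a built-in search on a concatenated copy may well be just as fast in practice.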