Finding a match in two collections

Question

Let $A$ and $B$ be two one-dimensional, finite collections of unsigned integers (e.g. arrays). Furthermore, $card(A) = a < b = card(B)$. Both collections are sorted in ascending order. There is at least one item $x$ which is contained in both $A$ and $B$.

Question: what is the fastest algorithm to find the smallest $x$ and what is its $T(n)$ in Big O notation?

Note: card() means the size of the array (say: card(A) = 10 means a declaration in C/C++ would be int A[10] with indices 0...9).

Note 2: I am new to CS.SE and learning CS as an enthusiast, so I have not yet been exposed to any fancy algorithm that solves this. My initial (naive) guess would be a brute-force search, but this is obviously not efficient for large $A$ and $B$. Your thoughts and advice would be highly appreciated.

The priority is practical efficiency, so something around polynomial time-algorithm would be nice. Note that the number of elements in both collections will be very large (above 10^30).

Welcome to CS.SE! 1. Does $card(A)$ count the number of distinct integers in $A$, or the length of the array $A$? 2. Can you specify more precisely what you want the output to be? Do you want the algorithm to output any value that's present in both $A$ and $B$ (if there are multiple it doesn't matter which), or to output all values in common? 3. Do you care more about practical running time or theoretical worst-case? 4. What approaches have you considered? What's the fastest algorithm you've found so far? I encourage you to edit your question to improve it with this information. — D.W., Jul 21 '16 at 07:08
You seem to have created multiple accounts, see here for how to merge them. — Tom van der Zanden, Jul 21 '16 at 08:01
In your (pending) edit you mention the collections will have a very large number of elements "above 10^30". That is a completely impractical amount of information, as you would need $10^{17}$ one-terabyte hard-disks in order to store all that (even if each element is just one bit). I don't think you can do better than $O(n)$ for this problem, and you can achieve $O(n)$ by looping over both arrays simultaneously. — Tom van der Zanden, Jul 21 '16 at 08:06
And what did you try? Where did you get stuck? It's very hard to give meaningful help without knowing what level you're at. (And "beginner" doesn't help us -- some beginners have deep knowledge of some small areas; others don't know much about anything.) — David Richerby, Jul 21 '16 at 08:20
If arrays is what you have, this is a rather basic programming question. If we can pick other data structures for sets things become more interesting. — Raphael, Jul 21 '16 at 09:02
"The priority is practical efficiency, so something around polynomial time-algorithm would be nice." -- that's ... not even wrong. Please read this. Given that this problem can be solved in quadratic time by naive brute-force, and linearly with basic ideas, asking for "something around polynomial" is a weird request. You may also see here for how you can analyse algorithms and gain better intuition about what is and is not "obviously inefficient". — Raphael, Jul 21 '16 at 09:09

Kaho Chan · Answer 1 · 2016-07-21T10:16:52.947

Some ideas for a special case.

If all the elements lie in a limited range N (every element of A and B is greater than zero and smaller than N), then you can solve this task in O(N*log(|A| + |B|)) time as follows:

for i <- 0, 1, ..., N:
    binary search in A to see whether i is in A
    binary search in B to see whether i is in B
    if i is both in A and B then
        x <- i
        break
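
A minimal Python sketch of the loop above, assuming A and B are ascending lists held in memory and that N is known in advance. The names contains and smallest_common_by_range are made up for illustration, and the sample data is invented.

```python
from bisect import bisect_left


def contains(sorted_seq, value):
    """Binary search: True if value occurs in the ascending sequence."""
    pos = bisect_left(sorted_seq, value)
    return pos < len(sorted_seq) and sorted_seq[pos] == value


def smallest_common_by_range(a, b, n):
    """Try candidates 0..n in ascending order and return the first one
    that occurs in both a and b, or None if there is no such value."""
    for i in range(n + 1):
        if contains(a, i) and contains(b, i):
            return i
    return None


# Invented sample data: 7 and 9 are shared, 7 is the smallest.
A = [1, 4, 7, 9]
B = [2, 3, 7, 9, 11]
print(smallest_common_by_range(A, B, 12))
```

Because the candidates are tried in ascending order, the first hit is the smallest common element.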

This solution is practical when N is not too large (for example 10^7): it then solves the problem even when |A| and |B| are as large as 10^30 (though collections of that size cannot practically be stored anywhere).

The calculation (taking the logarithm to base 10): $N*\log_{10}{(|A| + |B|)} = 10^7 * \log_{10}{(2*10^{30})} \approx 3*10^8$
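
The size of this estimate depends on the base of the logarithm. As a rough check, assuming the same figures as above (N = 10^7 and |A| + |B| = 2*10^30), the sketch below evaluates the bound for base 10 and for base 2, the latter being closer to the number of comparisons a binary search makes. The function name estimated_steps is made up for illustration.

```python
import math


def estimated_steps(n, total_size, base):
    """Evaluate n * log_base(total_size), the bound discussed above."""
    return n * math.log(total_size, base)


N = 10**7
TOTAL = 2 * 10**30  # |A| + |B| with both collections at 10^30 elements

print(f"base 10: {estimated_steps(N, TOTAL, 10):.2e}")
print(f"base 2:  {estimated_steps(N, TOTAL, 2):.2e}")
```

With base 10 the result is about 3*10^8; with base 2 it is about 10^9, so the estimate stays within the same practical range either way.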

Besides, I do not think you can find a practically efficient algorithm for the general case when card(A) and card(B) are as large as 10^30. The general algorithm, in my opinion, is at least O(|A|+|B|), as follows:

ia <- 0
ib <- 0
while A[ia] != B[ib] do
    if A[ia] > B[ib] then
        ib <- ib + 1
    else
        ia <- ia + 1
x <- A[ia]
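
A runnable Python sketch of the same two-index walk, assuming both inputs are ascending lists. The bounds checks in the loop condition are an addition that the pseudocode above does not have; they make the function return None instead of running off the end when no common element exists. The name smallest_common and the sample data are made up for illustration.

```python
def smallest_common(a, b):
    """Walk both ascending sequences at once and return the smallest
    value present in both, or None if they share no value."""
    ia, ib = 0, 0
    # Added bounds checks: stop as soon as either sequence is exhausted.
    while ia < len(a) and ib < len(b):
        if a[ia] == b[ib]:
            return a[ia]
        if a[ia] > b[ib]:
            ib += 1  # b is behind, advance it
        else:
            ia += 1  # a is behind, advance it
    return None


# Invented sample data: 7 and 9 are shared, 7 is the smallest.
print(smallest_common([1, 4, 7, 9], [2, 3, 7, 9, 11]))
```

Each iteration advances one index or returns, so the loop runs at most |A| + |B| times.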

edited Jul 21 '16 at 10:16

answered Jul 21 '16 at 08:51

Kaho Chan

My bad, the second one is probably the easiest algorithm, and quite efficient. Your analysis is off, though; it's not "at least O(...)" but quite certainly in $O(\max(|A|, |B|))$ if implemented properly (you are missing array bound checks). – Raphael Jul 21 '16 at 10:06
The first solution is for the case when |A| and |B| are extremely large (for example 10^30) and the normal solution won't work. For the complexity, consider the case A = [..., x] and B = [..., x] then the complexity is always |A|+|B|, thus max(A,B) doesn't fit. As for the boundary check, the problem states that there must be one x there, so no need to check then. – Kaho Chan Jul 21 '16 at 10:11
1) I understand. I don't think that $N$ will be known, though. It's a strong assumption. 2) $O(x + y) = O(\max(x, y))$. But writing $O(x + y)$ is more appropriate here, agreed. – Raphael Jul 21 '16 at 14:32
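
For reference, the identity $O(x + y) = O(\max(x, y))$ used in the last comment follows from a two-sided bound that holds for all non-negative $x$ and $y$:

$$\max(x, y) \le x + y \le 2\max(x, y)$$

With $x = |A|$ and $y = |B|$, the two expressions differ by at most a factor of 2, which Big O notation ignores.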
