
There is a set $S$. We don't know the exact elements of $S$, but only know its cardinality $|S|$.

Now we have some guesses at $S$, i.e., $\{ S_i \}_{i=1}^n$. For each $S_i$, we know its exact elements, as well as the cardinalities $|S_i \backslash S|$ and $|S \backslash S_i|$.

The goal is to infer all possible sets $S$ that satisfy the given guesses. Note that the universe from which $S$ and all the guesses are drawn is given, but may be arbitrarily large.

So, is this problem well defined and studied in the set-theory or combinatorics literature? And is there any elegant algorithm for it, instead of brute-force search?


Here we give an example:

  • Already know that $|S|=4$.
  • Guess 1, $S_1 = \{1,2,3,4,5\}$, $|S_1\backslash S|=1$, $|S\backslash S_1|=0$. Then $S$ has 5 possibilities (all subsets of $S_1$ with 4 elements).
  • Guess 2, $S_2 = \{2,3,4,5\}$, $|S_2\backslash S|=1$, $|S\backslash S_2|=1$. Then $S$ is reduced to only 4 possibilities ($\{2,3,4,5\}$ is impossible).
  • Guess 3, $S_3 = \{0,2,3,5\}$, $|S_3\backslash S|=1$, $|S\backslash S_3|=1$. Then $S$ must be $\{1,2,3,5\}$ (a brute-force check appears below).
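
For concreteness, the brute-force search we would like to avoid looks like the following Python sketch (the universe $\{0, \dots, 5\}$ is an assumption made only for this example; in general it may be arbitrarily large):

```python
from itertools import combinations

# Brute force: enumerate every candidate S with |S| = 4 and keep those
# consistent with all guesses. Universe {0, ..., 5} assumed for this example.
universe = range(6)
guesses = [
    ({1, 2, 3, 4, 5}, 1, 0),  # (S_i, |S_i \ S|, |S \ S_i|)
    ({2, 3, 4, 5}, 1, 1),
    ({0, 2, 3, 5}, 1, 1),
]

candidates = [set(c) for c in combinations(universe, 4)]
for S_i, a, b in guesses:
    candidates = [S for S in candidates
                  if len(S_i - S) == a and len(S - S_i) == b]

print(candidates)  # [{1, 2, 3, 5}]
```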

1 Answer


As it happens, one of the very first questions on math.SE (question 639!) was about a variant of this question where $|S| = 1, 2$ but you only learned whether or not $S \cap S_i$ was non-empty. There's some good discussion there and also on MathOverflow; this is known as combinatorial group testing.


Write $X$ for the universe containing every set in question, and write $|X| = n, |S| = k$. Because $|S|$ and $|S_i|$ are known, learning $|S \setminus S_i|$ and $|S_i \setminus S|$ is equivalent to learning $|S \cap S_i| = |S_i| - |S_i \setminus S|$. This is an integer between $0$ and $k$, so we have $k+1$ possible answers, which means after $g$ guesses we have at most $(k+1)^g$ possible sequences of answers. There are ${n \choose k}$ possibilities for $S$, which we can lower bound as

$${n \choose k} \ge \frac{(n-k)^k}{k!} \ge \frac{1}{ek} \left( \frac{e(n-k)}{k} \right)^k$$

(as in this previous answer). So we need $g$ to satisfy $(k+1)^g \ge {n \choose k}$, hence we need at least

$$g \ge \left\lceil \frac{\log {n \choose k}}{\log (k+1)} \right\rceil \ge \left\lceil \frac{k \log \frac{e(n-k)}{k} - \log ek}{\log (k+1)} \right\rceil$$

guesses in the worst case (the "information-theoretic lower bound"). In general the growth is roughly like $\frac{k \log n}{\log k}$. Note that we can always assume WLOG that $k \le \frac{n}{2}$; if $k > \frac{n}{2}$ then instead of working with $S$ we should work with its complement which will be smaller.
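
As a quick Python sketch of this lower bound (the function name is mine):

```python
import math

# Information-theoretic lower bound: we need (k+1)^g >= C(n, k), i.e.
# g >= ceil(log C(n, k) / log(k + 1)). Replacing k by min(k, n - k)
# implements the WLOG reduction to k <= n/2 mentioned above.
def lower_bound(n: int, k: int) -> int:
    k = min(k, n - k)
    return math.ceil(math.log(math.comb(n, k)) / math.log(k + 1))

print(lower_bound(1024, 10))  # 23, matching the example at the end
```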

Here is a strategy that takes at most $\boxed{ k \lceil \log_2 n \rceil }$ guesses. We will identify $X$ with the set of non-negative integers $\{ 0, \dots, n-1 \}$ and then write these integers in binary, so that we are searching for a set $S$ of $k$ binary strings of length $\lceil \log_2 n \rceil$ (padding with leading zeros). As a warmup, here's the optimal strategy for $k = 1$: we take $S_i$ to be the set of integers whose $i^{th}$ digit in their binary expansion is $1$. Then learning $|S \cap S_i|$ tells us the $i^{th}$ digit in the binary expansion of the single mystery string, so after $\lceil \log_2 n \rceil$ guesses we know $S$ exactly.
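
In code, the $k = 1$ warmup might look like the following sketch (the oracle, which answers $|S \cap S_i|$, and the function name are my own framing; digits are indexed from the least significant):

```python
import math

# k = 1: the i-th guess is the set of integers whose i-th binary digit is 1,
# so the oracle's answer |S ∩ S_i| is exactly that digit of the mystery element.
def find_single(n, oracle):
    bits = math.ceil(math.log2(n))
    x = 0
    for i in range(bits):
        S_i = {y for y in range(n) if (y >> i) & 1}
        x |= oracle(S_i) << i  # answer is 0 or 1 here
    return x

secret = {42}
print(find_single(1024, lambda S_i: len(secret & S_i)))  # 42
```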

The general strategy is a more complicated version of this. We take our first guess to be the same as above: the set of integers whose first digit in their binary expansion is $1$ (we could also start from the last digit; it doesn't matter too much). This tells us, among our $k$ mystery strings, how many of them start with $0$ and how many of them start with $1$.

From here, in general we will store a set of possible prefixes that the $k$ mystery strings can start with, together with a multiplicity describing how many of the mystery strings start with each prefix. We increase the length of these prefixes by asking, for each prefix $w$, how many of the mystery strings begin with $w0$; since we already know how many of the mystery strings begin with $w$, this also tells us how many begin with $w1$. This requires at most $k$ guesses (since we necessarily store at most $k$ prefixes), and after these $k$ guesses we've increased the length of all our stored prefixes by one.

We continue this procedure until our prefixes have length $\lceil \log_2 n \rceil$, at which point we've identified the digits in every element of $S$. The $i^{th}$ step, where we learn our prefixes up to length $i$, requires $\min(2^{i-1}, k)$ guesses, which for simplicity we'll just bound as $k$. In total, this gives us a procedure that takes a little under $k \lceil \log_2 n \rceil$ guesses, as desired.
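
Here is a Python sketch of the whole procedure under the same oracle model (all names are mine, and I refine prefixes from the most significant digit):

```python
import math

def find_set(n, k, oracle):
    # Map each stored prefix (a tuple of binary digits) to the number of
    # mystery strings beginning with it; the empty prefix starts at k.
    L = math.ceil(math.log2(n))
    prefixes = {(): k}
    for depth in range(L):
        refined = {}
        for w, count in prefixes.items():
            w0 = w + (0,)
            # Guess S_i = all x whose first depth+1 binary digits spell w0.
            S_i = {x for x in range(n)
                   if tuple((x >> (L - 1 - j)) & 1
                            for j in range(depth + 1)) == w0}
            c0 = oracle(S_i)  # how many mystery strings begin with w0
            if c0:
                refined[w0] = c0
            if count - c0:  # the rest begin with w1
                refined[w + (1,)] = count - c0
        prefixes = refined
    # Every surviving prefix now has full length L and multiplicity 1.
    return {sum(b << (L - 1 - j) for j, b in enumerate(w)) for w in prefixes}

secret = {3, 141, 592, 653}
queries = 0
def oracle(S_i):
    global queries
    queries += 1
    return len(secret & S_i)

print(find_set(1024, 4, oracle), queries)  # recovers secret in <= 4 * 10 guesses
```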

To put some specific numbers to this, suppose $n = 1024, k = 10$, so we're looking for $10$ mystery binary strings among the binary strings of length $10$. The information-theoretic lower bound is that this requires at least $\left\lceil \frac{\log {1024 \choose 10}}{\log 11} \right\rceil = 23$ guesses in the worst case. Our strategy takes at most $k \lceil \log_2 n \rceil = 100$ guesses, which is not optimal but could be worse. Counting a bit more carefully, our strategy takes at most $1 + 2 + 4 + 8 + 10 \cdot 6 = 75$ guesses.
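
That careful count is just $\sum_{i=1}^{\lceil \log_2 n \rceil} \min(2^{i-1}, k)$; as a quick check:

```python
# Step i refines at most min(2^(i-1), k) stored prefixes, one guess each.
print(sum(min(2 ** (i - 1), 10) for i in range(1, 11)))  # 1+2+4+8+10*6 = 75
```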
