Best way we know search for an integer

Question

Basically, you have $n$ integers. The data structure is for your choice, it is ok to do polynomial time preprocessing on them. Then you have multiple questions "Is an integer $k$ in the set?"

My question is about the best way we know in terms of time/space to answer this query. Ultimately, the time should be guaranteed (not randomized) and the space must be polynomial of $n$.

Firstly, one can use let's say 2-3 Trees or another type of search trees to do this in $O(log n)$ time and $O(n)$ space.

Secondly, one might say that it can be done in $O(1)$ with hashing, however I should note that this time is random and we are looking only for deterministic algorithms. Universal hashing requires a lot of space.

So, one way to approach the main question is to answer if there is a perfect hashing constructable in poly-time and poly-space?

Another approaches are also appreciated. Thanks.

how big is the set from which k-values are drawn? Are you going to make deletions from your structure, or will be an insert/search only data structure? — Curious_Dim, Jan 26 '18 at 23:59
No modifications required. The integers themselves are not too big, but the number of queries is significant (might be multiple queries, though not guaranteed). — Eugene, Jan 27 '18 at 00:08
By saying significant number of queries, do you mean different type of queries? — Curious_Dim, Jan 27 '18 at 00:24
Significant number of queries "Is an integer $k$ in the set? (with different $k$'s obviously). Hope this helps. — Eugene, Jan 27 '18 at 00:26
will there are deletions from the set? i.e. will you ever remove an integer $k$ from the set so subsequent search queries should return "there is no such integer in the set" ? — Curious_Dim, Jan 27 '18 at 00:29
Polynomial in what? Best in what sense? I also think this is too broad in its current form, as you are asking for the result of a big literature review. — Raphael, Jan 27 '18 at 00:37

score 2 · Answer 1 · answered Jan 27 '18 at 00:46

A very simple solution if your integers are not too big, let's say biggest integer is $U-1$ and there are $U$ different integers (between 0 and $U-1$), is to build a table T (bit matrix) of size $U$ with all bits set to $0$. Upon occurrence of item $k$ set $T(k)\leftarrow 1$.

If you want to search for an item $k$ you simply check if $T(k)$ is $1$ or $0$. This solution takes $\Theta(U)$ space and $\Theta(1)$ time per inserted item, or per search. If $U$ is much less than $n$ and you don't make deletions then I think this solution is ideal in your case. If you make deletions of $U > n$ then more sophisticated solutions are needed

Pseudonym · Answer 2 · 2018-01-31T04:53:32.330

You probably want to look into the field of succinct rank/select data structures. Assuming that the set is static or semi-static, there is a good collection of data structures in Okanohara and Sadakane's paper, Practical Entropy-Compressed Rank/Select Dictionary.

We will suppose you want to represent a subset $S \subset \{0\ldots n-1\}$, where $\left| S \right| = m$. There are ${n \choose m}$ such subsets, so we need at least $\log_2 {n \choose m}$ bits to represent any subset. What you want is a data structure which uses about that much space, and which supports a membership test in close to constant time.

First, a quick note on measuring time vs space requirements. In the case of time, we customarily measure in big-oh (e.g. $O(1)$) because what we're really interested in, the time measured on a clock, depends on the language, compiler, hardware, etc.

In the case of space requirements, however, we can measure the space usage in bits, because we customarily measure space in bits, not in physical units (e.g. cubic metres of RAM).

If information theory tells us that we need at least $f(n)$ bits to store some data structure. Then:

If we have a data structure which uses $f(n) + O(1)$ bits, we call it implicit.
If we have a data structure which uses $f(n) + O(f(n)) = f(n) (1 + O(1))$ bits, we call it compact. Intuitively, the $O(1)$ means "constant relative overhead".
If we have a data structure which uses $f(n) + o(f(n)) = f(n) (1 + o(1))$ bits, we call it succinct. Intuitively this means that as the data structure gets bigger, the relative overhead eventually becomes negligible.

Additionally, we will use the word RAM model and assume that the machine word size is $\Theta(\log n)$, that is, that an "integer" can be stored in a constant number of words.

If the subset is "dense", that is if $m \approx \frac{n}{2}$, then $\log_2 {n \choose m} \approx n$, so in that case you can't do better than a bit vector.

Things get more interesting when the set is sparse or almost full. If $m > \frac{n}{2}$, you can always store the complement of the subset instead. So we can just consider the case of a sparse subset where $m < \frac{n}{2}$.

The esp variant from that paper uses $n H_0(S) + o(n)$ bits of storage, and supports both rank and select queries in $O(1)$ time assuming the word-RAM model. Since a membership test can be trivially implemented with two rank queries, this supports membership tests in $O(1)$ time, and the space requirement is essentially a zero-order entropy compressed representation of the subset plus some overhead. This is pretty good, but if $m \ll n$, the $o(n)$ overhead term dominates the space requirement.

Because of this, the sdarray variant is more practical for sparse sets. The space requirement for sdarray is $m \log_2 \frac{n}{m} + 2m + o(m)$ bits, although if you only need membership tests, this can be reduced to $m \log_2 \frac{n}{m} + m + o(m)$ bits. Note if $m\ll n$, $\log_2 {n \choose m} \approx m \log_2 \frac{n}{m}$ and so this data structure is succinct. A membership test query takes $O(\log \frac{n}{m}) + O(\log^4 m/\log n)$ time in the worst case, but is typically constant in practice.

Good stuff! The worst-case running time of the sdarray seems like it might be worse than using a balanced binary search tree (which is like $O(\log m)$, I think) -- does that sound right? — D.W., Jan 27 '18 at 01:53
I just looked at the paper again, and I think it would be more accurate to characterise that term as $O(\min{ \log m, \log \frac{n}{m} })$.
The main factor that determines the time behaviour is how evenly-distributed the subset is. The more even the distribution, the more likely you are to get constant time queries. Otherwise, it falls back to binary search on some range. — Pseudonym, Jan 27 '18 at 04:46

Best way we know search for an integer

2 Answers2