
Is there a data structure that efficiently supports the following operations?

  1. Add set
  2. Query whether any subset of a set has been added.

This could be implemented with linear overhead by testing every added set during every query. Can it be implemented more efficiently? Small probabilities of false positives/negatives are acceptable (e.g. Bloom-filter style).
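For reference, the linear-scan baseline described above might look like the following minimal sketch. It assumes the reading in which a query asks whether some previously added set is a subset of the query set; the class and method names are hypothetical, chosen only for illustration.

```python
class NaiveSubsetIndex:
    """Baseline: store every added set and scan all of them on each query."""

    def __init__(self):
        self._sets = []

    def add(self, s):
        # Freeze the set so later mutation by the caller cannot affect us.
        self._sets.append(frozenset(s))

    def contains_subset_of(self, query):
        # O(n) subset tests, where n is the number of added sets.
        q = frozenset(query)
        return any(s <= q for s in self._sets)


idx = NaiveSubsetIndex()
idx.add({1, 2})
hit = idx.contains_subset_of({1, 2, 3})   # {1, 2} is a subset of the query
miss = idx.contains_subset_of({1, 3})     # no added set fits inside {1, 3}
```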

  • Is it even possible to check in o(n) time whether some set was added, where n is the number of added sets? How? – Albert Hendriks May 02 '17 at 18:50
  • @alberthendriks If all the sets are singleton you can do it in log time, so I'm wondering if it can be done more efficiently in this more general setting – Elliot Gorokhovsky May 02 '17 at 18:51
  • Can you give more details about your sets? Are they sets of integers? If not, do they at least have a weak ordering? – orlp May 02 '17 at 22:12
  • @orlp They are sets of integers, I should have said that. I don't think that matters though because you can always hash to get something orderable. – Elliot Gorokhovsky May 02 '17 at 22:13
  • @RenéG Are these integers bounded or not? If yes, what is the bound? Regardless, please edit this information into the question. – orlp May 02 '17 at 22:15
  • @orlp They are bounded but again it doesn't matter because you can always hash to get bounded, because total number of elements is bounded. – Elliot Gorokhovsky May 02 '17 at 22:16
  • By "Query whether any subset of a set has been added", do you mean "Determine whether any subset of some given query set has been added"? (Another interpretation is "After adding some sets, determine if any of them is a subset of another". A data structure supporting the first interpretation can be used to answer queries of this type, but not necessarily vice versa, unless efficient set deletion is also possible.) – j_random_hacker May 05 '17 at 18:15
  • Related: https://cs.stackexchange.com/questions/75915/state-of-the-art-of-subset-set-containment-and-partial-match-queries – xavierm02 May 28 '17 at 00:07

1 Answer


Let's say all your sets are finite subsets of $\mathbb N$. Let $S\subseteq \mathcal P( \mathbb N)$ denote your set of sets.

You want two operations:

  • $O_1(S,s')$: For any $s'\subseteq \mathbb N$, add $s'$ to $S$

  • $O_2(S,s')$: For any $s'\subseteq \mathbb N$, is there some $s\in S$ so that $s\subseteq s'$?


Here are a few ideas to speed things up:

  • You're going to test whether one set is a subset of another a lot, so you should keep the size $|s|$ of each set $s$ available in $O(1)$. Then, when you need to test if $s\subseteq s'$, you start by checking if $|s|\le |s'|$, and if not, you can return false right away. If you do have $|s|\le |s'|$, then you just run the normal slow test.

  • Note that if you have $s_1\in S$ and $s_2\in S$ with $s_1\subseteq s_2$, then $s_2\subseteq s'$ implies $s_1\subseteq s'$. So you don't need to keep $s_2$ in $S$ for $O_2$, and you can represent $S$ by a set of sets so that $s\in S$ and $s\subsetneq s'$ implies $s'\not \in S$. In other words, you only need to keep track of the sets in $S$ that are minimal for inclusion. This can be implemented pretty efficiently: when adding a set $s'$, go through all sets $s\in S$ with $|s|\le |s'|$ (ordered by increasing cardinality); if some $s\subseteq s'$, then don't add $s'$, because it won't be minimal (or is already in $S$). Otherwise, add $s'$ and then, among the sets $s\in S$ with $|s'|<|s|$, remove those with $s'\subseteq s$ (because they are no longer minimal).

  • Keep a set $t$ that's equal to the union of all sets in $S$. Then, instead of running $O_2(S,s')$, you can run $O_2(S,s'\cap t)$ instead (because if for some $s\in S$, $s\subseteq s'$, then since $s\subseteq t$, $s\subseteq s'\cap t$ and, if $s\subseteq s'\cap t$, then $s\subseteq s'\cap t \subseteq s'$).
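As a sanity check of the last two ideas, here is a small brute-force sketch (not part of the original answer; the helper names `query` and `minimal_sets` and the example sets are made up for illustration). It compares a plain scan over all stored sets with a scan over only the minimal sets, using the query intersected with the union $t$, over every query drawn from a small universe.

```python
from itertools import combinations


def query(sets, q):
    """Plain scan: is some stored set a subset of q?"""
    return any(s <= q for s in sets)


def minimal_sets(sets):
    """Keep only the sets that have no proper subset among the stored sets."""
    return {s for s in sets if not any(t < s for t in sets)}


stored = {frozenset(x) for x in [{1, 2}, {1, 2, 3}, {4}, {2, 4, 5}]}
minimal = minimal_sets(stored)          # {1, 2} and {4} survive
union = frozenset().union(*stored)      # the set t from the third idea

# Every subset of {1, ..., 7}; 6 and 7 lie outside the union on purpose.
universe = range(1, 8)
all_queries = [frozenset(c) for r in range(8) for c in combinations(universe, r)]

# Pruning to minimal sets and intersecting with t never changes the answer.
agree = all(query(stored, q) == query(minimal, q & union) for q in all_queries)
```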

With these ideas in mind, I'd represent $S$ by a dictionary $d$ (implemented as a doubly linked list of pairs $(key,value)$ with the keys in increasing order) so that $d(k)$ is a doubly linked list containing exactly the minimal (for inclusion) sets in $S$ of cardinality $k$.

O1(S,s')
  if O2(S,s')
    return
  k := |s'|
  if d(k) doesn't exist
    d(k) := new_doubly_linked_list()
  add(d(k),s')
  S.t := union(S.t, s')
  for each key j of d so that |s'|+1 <= j
    for s in d(j)
      if subset(s', s)
        remove(d(j), s)
_O2(S,s')
  for each key k of d so that k <= |s'|
    for s in d(k)
      if subset(s,s')
        return true
  return false

O2(S,s')
  return _O2(S,inter(S.t,s'))

(Notice that even though I didn't do it explicitly in the code of O1, you can do a single traversal of the doubly linked list representing d.)
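A runnable Python sketch of the pseudocode above could look as follows. It is only one possible rendering: the names `MinimalSetIndex`, `by_size`, `union`, `add` and `contains_subset_of` are hypothetical, and it swaps the doubly linked lists for a dict of hash sets keyed by cardinality, which is an assumption made for brevity rather than the structure described above.

```python
from collections import defaultdict


class MinimalSetIndex:
    """Stores only inclusion-minimal sets, grouped by cardinality."""

    def __init__(self):
        self.by_size = defaultdict(set)  # cardinality -> set of frozensets
        self.union = frozenset()         # plays the role of S.t

    def contains_subset_of(self, query):
        """O2: is some stored set a subset of the query?"""
        q = frozenset(query) & self.union        # drop irrelevant elements
        for k in sorted(self.by_size):
            if k > len(q):                       # larger sets cannot fit
                break
            if any(s <= q for s in self.by_size[k]):
                return True
        return False

    def add(self, new_set):
        """O1: add a set, keeping only inclusion-minimal sets."""
        s = frozenset(new_set)
        if self.contains_subset_of(s):           # s would not be minimal
            return
        # Remove strictly larger stored sets that contain s.
        for k in [k for k in self.by_size if k > len(s)]:
            self.by_size[k] = {t for t in self.by_size[k] if not s <= t}
            if not self.by_size[k]:
                del self.by_size[k]
        self.by_size[len(s)].add(s)
        self.union |= s                          # the union only ever grows


idx = MinimalSetIndex()
idx.add({1, 2, 3})
idx.add({1, 2})                 # evicts {1, 2, 3}, which is no longer minimal
found = idx.contains_subset_of({1, 2, 9})
not_found = idx.contains_subset_of({1, 3})
```

As in the pseudocode, the union is never shrunk when a set is evicted; that keeps the query correct and only makes the intersection slightly less selective.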

I don't think this improves much in the worst case, but on average it should.

xavierm02
  • These look like useful practical improvements. But I think you mean that $d$ should be a linked list of map data structures (which could themselves be linked lists ordered by key -- though hashtables or balanced trees would be much faster). – j_random_hacker May 05 '17 at 18:40
  • Also regarding your second suggestion, in the worst case there can still be $\binom{n}{n/2}$ sets, none of which includes another, according to Sperner's Theorem. This occurs when the universe has $n$ elements, and you add all $\frac{n}{2}$-size sets of them. – j_random_hacker May 05 '17 at 18:43
  • Finally, I think you have the "direction" of the query the wrong way around -- I take the OP's question to ask whether any set has yet been added that contains the query set. Of course this can be remedied easily (by DeMorgan's law, you could even just "wrap" your implementation with functions that "invert" all input sets (i.e., take the symmetric difference with the universe) and also invert the return value of the query). – j_random_hacker May 05 '17 at 18:50
  • In your second bullet, s should be a strict subset to imply non-membership. Anyway, I think this answer is probably the best possible, so I'll accept it. – Elliot Gorokhovsky May 06 '17 at 21:11
  • @j_random_hacker Right, but you only need a set (i.e. a map to unit). If I denote by $\leq_{lex}$ the lexicographical order on sets given by seeing them as functions from the universe to $\{0,1\}$, then the map could be implemented using that order and balanced trees. Or alternatively, represent it as a binary decision diagram and put the value associated to a key in the corresponding leaf. – xavierm02 May 09 '17 at 12:13
  • Now that I think of it, maybe the whole thing could be implemented using reduced ordered binary decision diagrams and a set representing the union. The "keep only minimal sets" heuristic would be done automagically because we take them reduced. O1 would be slower because making sure the BDD is reduced can take a while but O2 would be linear in the size of s'. – xavierm02 May 09 '17 at 12:25