
I have a collection $U$ of sets, where each set has size at most 95 (one possible element per printable ASCII character). For example, $\{h,r,l,a\}$ is one set, and $U = \{\{h,r,l,a\}, \{l,e,d\}, \ldots\}$. The number of sets in $U$ is nearly a million, and most sets in $U$ contain 8-20 elements.

I am looking for a data structure for storing a collection of sets that supports the following operations:

  1. set matching, e.g. check whether the set $\{h,r,l,a\}$ is present in $U$
  2. subset matching, e.g. check whether the set $\{h,r,l\}$ is a subset of any set in $U$
  3. superset matching, e.g. check whether the set $\{h,r,l,a,s\}$ is a superset of any set in $U$
  4. union matching, e.g. check whether the set $\{h,r,l,a,e,d\}$ is a union of sets in $U$
  5. approximate set matching, e.g. checking whether the set $\{h,r,l,e\}$ is present in $U$ should return true

In particular, we can assume that once the data structure is built, no modifications are made but only queries of the above type (the structure is static).

I was thinking of a trie data structure, but it requires the data to be stored in some order. So I would have to store every set as a bit vector, but then the trie becomes a binary decision tree. Am I heading in the right direction? Any pointers will be appreciated.
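For reference, the bit-vector encoding mentioned above and the linear-scan baseline can be sketched as follows. This is only a minimal illustration, not an efficient solution: every query still visits all of $U$. The function names are hypothetical, and the notion of "approximate" used here (symmetric difference of at most `max_diff` elements) is an assumption, since the question does not define it.

```python
# Minimal sketch: each set becomes a 95-bit integer (printable ASCII = codes 32..126).
# All names here are hypothetical helpers, not part of any library.

def to_mask(chars):
    """Encode an iterable of printable ASCII characters as an integer bit vector."""
    mask = 0
    for c in chars:
        mask |= 1 << (ord(c) - 32)
    return mask

def is_member(q, masks):
    """Operation 1: q is exactly one of the stored sets."""
    return any(q == s for s in masks)

def subset_of_any(q, masks):
    """Operation 2: q has no bit outside some stored set s."""
    return any(q & ~s == 0 for s in masks)

def superset_of_any(q, masks):
    """Operation 3: some stored set s has no bit outside q."""
    return any(s & ~q == 0 for s in masks)

def is_union_of_members(q, masks):
    """Operation 4: q is a union of stored sets iff the union of all
    stored sets contained in q already equals q."""
    acc = 0
    for s in masks:
        if s & ~q == 0:
            acc |= s
    return acc == q

def approx_member(q, masks, max_diff):
    """Operation 5 (assumed meaning): some stored set differs from q
    in at most max_diff elements (size of the symmetric difference)."""
    return any(bin(q ^ s).count("1") <= max_diff for s in masks)

U = [to_mask("hrla"), to_mask("led")]
```

With this encoding, each comparison is a couple of word-level operations, but the scan over $U$ remains linear in the number of sets.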

Raphael
Curious
  • The trivial solution would be to compare the input set with every set in U. I am looking for a more efficient solution. The number of sets in U is over a million. – Curious Mar 02 '15 at 06:56
  • So, with bitvector representation, I have to compare my input against every set in U? – Curious Mar 02 '15 at 09:46
  • Which operations need to be sublinear? I don't think you can have all operations sublinear. So a set contains at most 4 elements? And approximate set matching is when it differs by only one element? And the position of the elements in your sets is not important? – user3613886 Mar 02 '15 at 09:54
  • But the question talks about searching subset, superset in collection U. Most of the data structures related to set talk about searching an element in a set. – Curious Mar 02 '15 at 10:07
  • Can we talk about at least subset and superset operations? – Curious Mar 02 '15 at 10:10
  • @Curious here or here; if it is static, you have the memory, and you want all operations to be sublinear, then build a separate structure for each operation. This is imo the only way to achieve sublinearity for all operations. – user3613886 Mar 02 '15 at 10:14
  • In the future, when you cross-post, please link back to the other versions of the question http://cstheory.stackexchange.com/q/30655/8067 "datastructure for collection of sets" http://stackoverflow.com/q/28802695 "Data structure for approximate set matching" – Zsbán Ambrus Mar 02 '15 at 10:41
  • What is the underlying problem? This smells like maybe there's a better solution if you start over. – Raphael Mar 02 '15 at 12:01

1 Answer


Generically, these are sometimes called subset/containment dictionaries. The fact that you had partial matching in your question (but deleted it) is actually not a coincidence, because subset/containment queries and partial matching are equivalent problems for sets.

You probably want to use a UBTree (unlimited branching tree) for this; it's basically a modified trie. See Hoffmann and Koehler (1998) for more details.

You could also have a look at the more recent set-trie data structure proposed by Savnik (2013) for the same purpose. (If you miss the color in the graphs in that preprint paper and don't have access to the official Springer publication [in which the colors aren't missing], there's a precursor to that paper which has almost the same graphs/info and no missing colors.)
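To give a rough idea of how a trie over sorted elements answers these queries, here is a minimal sketch in the spirit of the set-trie. It is not the code from either paper: the class and method names are hypothetical, nodes are plain dictionaries, and no tuning is attempted. Each stored set is inserted as its sorted element sequence, and a flag marks nodes where a stored set ends.

```python
class SetTrie:
    """Sketch of a trie keyed on sorted set elements (hypothetical names)."""

    def __init__(self):
        self.root = {"children": {}, "end": False}

    def insert(self, s):
        node = self.root
        for x in sorted(set(s)):
            node = node["children"].setdefault(x, {"children": {}, "end": False})
        node["end"] = True

    def contains(self, s):
        """Exact match: s is one of the stored sets."""
        node = self.root
        for x in sorted(set(s)):
            node = node["children"].get(x)
            if node is None:
                return False
        return node["end"]

    def exists_subset(self, q):
        """True if some stored set is a subset of q (q is a superset of it)."""
        q = sorted(set(q))

        def walk(node, i):
            if node["end"]:
                return True
            # Only elements of q may appear on the path, in increasing order.
            for j in range(i, len(q)):
                child = node["children"].get(q[j])
                if child is not None and walk(child, j + 1):
                    return True
            return False

        return walk(self.root, 0)

    def exists_superset(self, q):
        """True if some stored set is a superset of q (q is a subset of it)."""
        q = sorted(set(q))

        def walk(node, i):
            if i == len(q):
                # All of q is matched; any stored set at or below this node works.
                return node["end"] or bool(node["children"])
            for x, child in node["children"].items():
                if x < q[i] and walk(child, i):       # extra element in stored set
                    return True
                if x == q[i] and walk(child, i + 1):  # matched next element of q
                    return True
            return False  # children larger than q[i] can never contain q[i]

        return walk(self.root, 0)


trie = SetTrie()
for s in ("hrla", "led"):
    trie.insert(s)
```

In the question's terminology, "subset matching" for a query $Q$ corresponds to `exists_superset(Q)` and "superset matching" to `exists_subset(Q)`.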

Both papers (Hoffmann & Koehler, and Savnik) include experimental results, so you can get some idea of what they can handle in practice. Both seem to handle data sets with a cardinality of $U$ around 1M.

If you somehow have TCAM hardware (or the money for it), there's a way to leverage that for subset/superset queries. You can actually do both subset and superset queries in parallel, assuming you have enough TCAM words (2x |U|). Since TCAM words can be configured to be 144 bits wide and you have only 95 bits/letters to test, you wouldn't even need to bother with Bloom filters/hashing; you'd have an exact test using TCAM. This is trivial enough that I'll even say here how: the {0, 1} bit vector corresponding to each set in your U is simply turned into a {0, *} vector for subset queries and into a {1, *} vector for superset queries.
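The ternary encoding can be simulated in software to check the logic. The sketch below is only an illustration under assumptions: each TCAM word is modelled as a hypothetical `(value, care)` pair of integers, where a 0 bit in `care` plays the role of `*`, and the lookup loops over entries that real TCAM hardware would compare in parallel.

```python
WIDTH = 95                      # one bit per printable ASCII character
FULL = (1 << WIDTH) - 1

def to_mask(chars):
    """Encode printable ASCII characters (codes 32..126) as a bit vector."""
    mask = 0
    for c in chars:
        mask |= 1 << (ord(c) - 32)
    return mask

def subset_entry(s):
    """{0, *} word: 1 bits of s become '*', 0 bits must be 0 in the key."""
    return (0, FULL & ~s)

def superset_entry(s):
    """{1, *} word: 1 bits of s must be 1 in the key, 0 bits become '*'."""
    return (s, s)

def ternary_match(key, entry):
    """A key matches when it agrees with value on every cared-about bit."""
    value, care = entry
    return (key ^ value) & care == 0

def tcam_lookup(key, table):
    # Software stand-in for the parallel hardware comparison.
    return any(ternary_match(key, e) for e in table)

U = [to_mask("hrla"), to_mask("led")]
subset_table = [subset_entry(s) for s in U]      # answers: is key a subset of some s?
superset_table = [superset_entry(s) for s in U]  # answers: is key a superset of some s?
```

The two tables together account for the 2x |U| words mentioned above.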

A more general problem tackled in ETI (no free copy, sorry) is finding sets that satisfy a given similarity predicate. For example, the similarity measure can be $J(Q,S)=\frac{|Q\cap S|}{|Q\cup S|}$ and the query for a given $Q$ can be to find all $S$ (in the database) with $J(Q,S)\geq 0.3$. Both the constant and the similarity measure are user-definable as a predicate in ETI. (The $J$ given as an example here is called the Jaccard index.)
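To make the example predicate concrete, here is a brute-force sketch of the Jaccard query. It is not how ETI works internally; it scans the whole database, and the function names and the convention for two empty sets are assumptions made for illustration.

```python
def jaccard(q, s):
    """Jaccard index |Q ∩ S| / |Q ∪ S| of two iterables treated as sets."""
    q, s = set(q), set(s)
    union = q | s
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(q & s) / len(union)

def similar_sets(q, database, threshold=0.3):
    """Return every stored set S with J(q, S) >= threshold (linear scan)."""
    return [s for s in database if jaccard(q, s) >= threshold]

database = [frozenset("hrla"), frozenset("led")]
hits = similar_sets("hrle", database, threshold=0.3)
```

For the query $\{h,r,l,e\}$, the two stored sets score $3/5$ and $2/5$ respectively, so both pass the $0.3$ threshold.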

  • I was also thinking of modifying the trie data structure for set operations. The paper by Savnik pointed me in the right direction. Also, the more general similarity measure is the Tversky index (http://en.wikipedia.org/wiki/Tversky_index). The Jaccard index and Dice's coefficient are just special cases of the Tversky index. – Curious Mar 04 '15 at 07:06