
I have a sequence of n non-distinct numbers and I want to find the number of combinations of half the size.

For example:

Given 0.5, 0.5, 1, 1 there are 3 combinations of size 2:
0.5, 0.5
1, 1
1, 0.5

If I have all numbers that are the same, e.g. 1, 1, 1, 1, then there is only 1 combination of size 2: 1, 1.

Is there an algorithm/formula I can use to get what I want?

jignatius

1 Answer


The short answer is that there is no known formula for counting these solutions in polynomial time. The topic has been broached in previous questions, but it appears in a variety of contexts, so a clear statement of the known theory bears repeating. [Multinomial coefficients, as suggested in a comment on the question, solve a different kind of problem: counting the arrangements (permutations) of items with specified repetitions, e.g. the MISSISSIPPI problem.]

You ask about splitting a multiset $M$ into two equal-size sub-multisets. Obviously, a necessary and sufficient condition for this to be possible is that the size of $M$ (counting multiplicities of elements) is an even integer, let's say $2n$. However, counting the number of ways this can be done depends on the details of those multiplicities (repetitions of items). So we begin by taking a step back and going over the parameters needed to define $M$ as a multiset.

The basic difference between a set and a multiset is the requirement that every element in a set is distinguishable; there are no repetitions. So every multiset $M$ has an underlying set $S$ which consists of the same elements but taken without regard to possible repetitions. Conversely, we can prescribe any multiset $M$ by specifying the underlying set $S$ that contains all its elements and a function $d:S \to \mathbb N$ which tells us the number of times each element is repeated in $M$. In this narrow sense the repetition counts are always positive integers.

In some contexts (including the present one!) it can be convenient to allow $S$ to be strictly bigger than the underlying set of $M$ and to assign "zero repetitions" to any elements appearing in $S$ but not in $M$. The utility of this generalization will appear shortly, but we can already hint that in splitting $M = M_1 \cup M_2$, it may happen that some elements of $M$ would not appear in both sub-multisets, yet using the same set $S$ to prescribe both $M_1,M_2$ will simplify our expressions.

For each $i\in S$ we have a repetition count $d(i)$ for the number of times element $i$ appears in $M$. Assume that the total size of $M$ is a positive even integer $2n$, so that our task is to count the number of sub-multisets $M_1 \subseteq M$ which have size $n$. Equivalently, we are counting the number of ways to write $M = M_1 \cup M_2$ where the sizes of the sub-multisets $M_1,M_2$ are equal.
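As a small worked illustration (the symbols $c_i$ and $x$ are my own notation, not part of the setup above): choosing $M_1$ amounts to choosing, for each $i \in S$, a count $c_i$ with $0 \le c_i \le d(i)$ such that the $c_i$ sum to $n$. The number of such choices is the coefficient of $x^n$ in a product of short polynomials, one per element of $S$:

$$ \#\{M_1 \subseteq M : |M_1| = n\} \;=\; [x^n] \prod_{i \in S} \left(1 + x + x^2 + \cdots + x^{d(i)}\right) $$

For the Question's first example $0.5, 0.5, 1, 1$ we have two elements with $d = 2$ each and $n = 2$:

$$ (1 + x + x^2)^2 = 1 + 2x + 3x^2 + 2x^3 + x^4, $$

whose $x^2$ coefficient is $3$. For $1, 1, 1, 1$ there is a single element with $d = 4$, the polynomial is $1 + x + x^2 + x^3 + x^4$, and the $x^2$ coefficient is $1$. Both agree with the counts given in the Question.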

Let $d_1:S\to \mathbb N$ and $d_2:S\to \mathbb N$ be the respective functions that provide the repetition counts for $M_1,M_2$. By construction we have $d_1(i)+d_2(i)=d(i)$ for each $i\in S$. If the set $S$ is $\{1,2,\ldots,m\}$, then these values can be visualized by what statisticians call a $2\times m$ contingency table:

$$ \begin{array}{|c|c|c|c|c} \hline d_1(1) & d_1(2) & \ldots & d_1(m) & n \\ \hline d_2(1) & d_2(2) & \ldots & d_2(m) & n \\ \hline d(1) & d(2) & \ldots & d(m) & 2n \end{array} $$

where the marginal sums are shown as bottom row and rightmost column.
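For inputs given as an explicit list, as in the Question, the top row of such a table can be counted directly with a small dynamic program. The following is only a sketch under my own assumptions: the function name `count_half_size_submultisets` is hypothetical, and the input is taken to be a Python list of hashable values. Its running time grows with the total size $2n$ itself rather than with the number of digits needed to write the multiplicities, so it is a pseudo-polynomial method, not a polynomial-time formula.

```python
from collections import Counter


def count_half_size_submultisets(items):
    """Count sub-multisets of `items` that contain exactly half of its elements.

    Hypothetical helper for illustration: it counts the ways to fill the
    top row d_1(1..m) of the table with 0 <= d_1(i) <= d(i) and row sum n.
    """
    multiplicity = Counter(items)          # the function d on the underlying set S
    total = sum(multiplicity.values())
    if total % 2 != 0:
        return 0                           # odd total size: no equal split exists
    n = total // 2

    # ways[s] = number of ways to pick s items from the elements seen so far
    ways = [1] + [0] * n
    for d in multiplicity.values():
        updated = [0] * (n + 1)
        for size, w in enumerate(ways):
            if w == 0:
                continue
            # take between 0 and d copies of the current element, never exceeding n
            for take in range(min(d, n - size) + 1):
                updated[size + take] += w
        ways = updated
    return ways[n]


print(count_half_size_submultisets([0.5, 0.5, 1, 1]))  # 3
print(count_half_size_submultisets([1, 1, 1, 1]))      # 1
```

The two printed values match the examples in the Question. When all items are distinct the count reduces to the ordinary binomial coefficient, e.g. $\binom{4}{2} = 6$ for four distinct values.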

Counting the exact number of ways to fill in even such a simple contingency table with the given row and column sums is known to be a hard problem. More precisely, from Sampling Contingency Tables (1995) by Dyer, Kannan, and Mount:

We will show that exactly counting contingency tables is hard, even in a very restricted case...

Theorem 1 The problem of determining the exact number of contingency tables with prescribed row and column sums is $\#P$-complete, even in the $2\times n$ case.

Here the $\#P$-completeness of this family of computations implies that if a polynomial-time algorithm for them exists, then $P=NP$; whether that equality holds is famously an open problem. Counting contingency tables is nonetheless important in practice, so statisticians are interested in approximating these counts or in methods for sampling the tables uniformly.

A fairly recent (2015) paper in this area is Random Sampling of Contingency Tables via Probabilistic Divide-and-Conquer by DeSalvo and Zhao. Their references provide a sense of the history of such research.

hardmath