
I have a set of multisets $S = \{ X_1, \dots, X_K\}$ where $X_i \subset \mathbb{R}$. I need to find an optimal partition of $S$ into $L^*, R^*$ such that $E(L^*) + E(R^*)$ is minimized. Denote $K(X) = \cup_{I \in X} I$; then $E(X) := \sum_{i \in K(X)} |i - \text{median}(K(X))|$, where $|\cdot|$ is the absolute value. Each $X_i$ may contain duplicate elements, and all operations are on multisets: in $K(X)$, $\cup$ is a union of multisets, and in $E(X)$, $\sum$ adds "with repetition" (repeated elements are summed multiple times).
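To make the objective concrete, here is a minimal Python sketch of $K$, $E$ and an exhaustive search over all $2^K$ partitions, usable only as a reference on tiny inputs. The function names are illustrative, multisets are represented as plain lists, and treating an empty side as contributing 0 is an assumption, since the problem statement does not say how an empty side is scored.

```python
from itertools import product
from statistics import median

def K(X):
    """Multiset union: pool the members of X, keeping repetitions."""
    return [t for I in X for t in I]

def E(X):
    """Sum of absolute deviations of K(X) from its median.
    Assumption: an empty side contributes 0."""
    pooled = K(X)
    if not pooled:
        return 0
    m = median(pooled)
    return sum(abs(t - m) for t in pooled)

def brute_force(S):
    """Minimum of E(L) + E(R) over all 2^K ways to split S (exponential time)."""
    best = float("inf")
    for mask in product((0, 1), repeat=len(S)):
        L = [X for X, side in zip(S, mask) if side == 0]
        R = [X for X, side in zip(S, mask) if side == 1]
        best = min(best, E(L) + E(R))
    return best
```

For example, `brute_force([[0, 1], [2, 3]])` keeps the two multisets apart, each contributing 1.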

I want to prove that this problem is hard, but I don't see a straightforward way to prove it is NP-complete. What I did instead was assume extra information: supposing I know the medians of both $K(L^*)$ and $K(R^*)$, I can show that finding the optimal partition is an integer linear programming problem, which is NP-complete. Can I conclude that the original problem is at least NP-complete?


For the sake of sharing, here is a wrong attempt I made at proving NP-completeness of the problem with extra information.

I was converting this "given $m_L = \text{median}(L), m_R = \text{median}(R)$ and $S$, find $L$ and $R$" problem into an integer programming feasibility problem. Thanks to D.W.'s help, I now see that I should do the opposite. Maybe doing the opposite is also possible, so I am sharing my effort in case it helps.

We can collect two vectors $A, B$ of length $K$: for every $X_i \in S$, compute its imbalance with respect to $m_L$ and $m_R$. Namely, $A[i] = \sum_{j \in X_i} \text{sign}(m_L - j)$ and $B[i] = \sum_{j \in X_i} \text{sign}(m_R - j)$, where $$\text{sign}(x) = \begin{cases} 1 &\text{if }x>0\\ 0 &\text{if }x=0\\ -1 &\text{if }x<0 \end{cases}$$

Now we can introduce a boolean indicator vector $I \in \{0, 1\}^K$. When $I[i] = 1$, we assign $X_i$ to $L$; when $I[i] = 0$, we assign $X_i$ to $R$. The constraints are:

$$ \begin{aligned} \langle I, A \rangle &= 0 \\ \langle 1-I, B \rangle &= 0 \end{aligned} $$

Since $m_L$ is the median of $L$, the imbalances of the sets assigned to $L$ must sum to 0. The same applies to $m_R$. Here $A, B$ are the coefficients and $I$ is the variable. If this problem has a solution, then we have found $L$ and $R$.
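As a small illustration of this construction, the following Python sketch builds $A$ and $B$ and checks the two inner-product constraints for a given indicator vector. The function names are made up for the example, and it only evaluates the constraints as written; it does not search for a feasible $I$.

```python
def sign(x):
    """sign(x) as defined above: 1, 0 or -1."""
    return (x > 0) - (x < 0)

def imbalance_vector(S, m):
    """One entry per multiset X_i: the sum of sign(m - j) over j in X_i."""
    return [sum(sign(m - j) for j in X) for X in S]

def satisfies_constraints(S, I, m_L, m_R):
    """Check <I, A> = 0 and <1 - I, B> = 0 for a 0/1 vector I."""
    A = imbalance_vector(S, m_L)
    B = imbalance_vector(S, m_R)
    dot_A = sum(i * a for i, a in zip(I, A))
    dot_B = sum((1 - i) * b for i, b in zip(I, B))
    return dot_A == 0 and dot_B == 0
```

With `S = [[0, 1], [2, 3]]`, `m_L = 0.5` and `m_R = 2.5`, the vectors are `A = [0, -2]` and `B = [2, 0]`, so `I = [1, 0]` satisfies both constraints while `I = [0, 1]` does not.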

Clearly, the newly formed problem is an integer programming problem, which is well known to be NP-complete. Whether it is possible to convert any problem of that type into our set partition problem in polynomial time is where I still need some help.

yupbank

1 Answer


No. Your argument is not valid; it doesn't prove NP-completeness. You have shown that an ILP solver can be used to solve your problem. But an ILP solver can be used to solve easy problems, too, so it doesn't rule out the possibility that your problem might be easy (solvable in polynomial time). The reduction needs to go the other way. I suggest studying standard material on NP-completeness and reductions, e.g., "What is the definition of P, NP, NP-complete and NP-hard?", "What are common techniques for reducing problems to each other?" and "How do I construct reductions between problems to prove a problem is NP-complete?".

As it happens, I believe your problem can be solved in polynomial time; in other words, I believe it is not hard. I believe the following algorithm solves your problem:

  • Enumerate over all possibilities $m_L,m_R$ for $\text{median}(K(L))$ and $\text{median}(K(R))$. For each such possibility:

    • Find the partition $L,R$ that minimizes $E(L)+E(R)$, subject to the requirement that $\text{median}(K(L))=m_L$ and $\text{median}(K(R))=m_R$.
  • Output the best solution found at any of these stages.

I show below how to do the innermost step in polynomial time. This gives a polynomial-time algorithm for your problem, as there are only $O(n^2)$ possibilities for $m_L,m_R$ (i.e., each $m_L,m_R$ must be an element of $\cup_i X_i$).

We'll use dynamic programming for this. Let $\text{imb}(T,m)$ be defined by

$$\text{imb}(T,m) = (\sum_{t \in T} \text{sgn}(t-m), \sum_{t \in T} (t=m))$$

where the sum is "with repetition", and where

$$\text{sgn}(x) = \begin{cases} 1 &\text{if }x>0\\ 0 &\text{if }x=0\\ -1 &\text{if }x<0 \end{cases}$$

$$(x=y) = \begin{cases} 1 &\text{if }x=y\\ 0 &\text{otherwise} \end{cases}$$
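A minimal Python sketch of $\text{imb}$, with multisets represented as lists, may help here. The helper `add` is a hypothetical name introduced only to show that the imbalance of a multiset union is the componentwise sum of the imbalances, which is the property that lets it be accumulated one $X_k$ at a time.

```python
def sgn(x):
    """sgn(x) as defined above: 1, 0 or -1."""
    return (x > 0) - (x < 0)

def imb(T, m):
    """Pair (sum of sgn(t - m), number of t equal to m), with repetition."""
    first = sum(sgn(t - m) for t in T)
    second = sum(1 for t in T if t == m)
    return (first, second)

def add(p, q):
    """Componentwise sum of two imbalance pairs (hypothetical helper)."""
    return (p[0] + q[0], p[1] + q[1])

X1 = [3, 4, 4, 4, 6, 6, 8]
X2 = [1, 4]
# List concatenation plays the role of multiset union.
assert imb(X1 + X2, 4) == add(imb(X1, 4), imb(X2, 4))
```

For `X1` and `m = 4` the pair is `(2, 3)`: one element below 4, three above it, and three equal to it.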

Define the imbalance of a partition $L,R$ with respect to $m_L,m_R$ to be the pair $(\vec{b_L},\vec{b_R})$ where $\vec{b_L}=\text{imb}(K(L),m_L)$ and $\vec{b_R}=\text{imb}(K(R),m_R)$. Then we can reformulate the problem as follows:

Given $m_L,m_R$, find $L,R$ that minimizes $E(L)+E(R)$ subject to the requirement that the imbalance of $L,R$ with respect to $m_L,m_R$ is consistent with $m_L$ being the median of $K(L)$ and $m_R$ being the median of $K(R)$ (or report $+\infty$ if no such partition exists).

Now we can solve that using dynamic programming. Define $A[k,\vec{b_L},\vec{b_R}]$ to be the smallest possible value of $E(L)+E(R)$ subject to the requirements that $L,R$ are a partition of $X_1,\dots,X_k$ with imbalance $(\vec{b_L},\vec{b_R})$ with respect to $m_L,m_R$ (or $+\infty$ if no such partition exists). Then $A$ satisfies the recurrence relation

$$\begin{aligned} A[k,\vec{b_L},\vec{b_R}] = \min\Bigl(&A[k-1,\vec{b_L}-\text{imb}(X_k,m_L),\vec{b_R}] + \sum_{i \in X_k} |i - m_L|,\\ &A[k-1,\vec{b_L},\vec{b_R}-\text{imb}(X_k,m_R)] + \sum_{i \in X_k} |i - m_R|\Bigr). \end{aligned}$$

Using this recurrence, you can fill in the array $A$ in $O(Kn^4)$ time, where $n=|\cup_i X_i|$ is the number of elements (counting repetitions): initialize every entry to $+\infty$ except $A[0,(0,0),(0,0)]=0$, then fill it in, in order of increasing $k$. Finally, scan all $A[K,(a,b),(c,d)]$ such that (i) either $|a|<b$ and $b>0$, or $a=0$, and (ii) either $|c|<d$ and $d>0$, or $c=0$; the smallest such value is the value of the best partition for this choice of $m_L,m_R$.
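The recurrence and the final scan can be sketched in Python as follows. This is a hedged illustration rather than a reference implementation: the names `dp_cost` and `solve` are made up, the table is stored sparsely as a dictionary of reachable imbalance states instead of a dense array, and the candidate set in `solve` (all elements plus all pairwise midpoints) is an assumption motivated by the even-sized case raised in the comments.

```python
from itertools import combinations_with_replacement

INF = float("inf")

def imb(T, m):
    """Pair (sum of sgn(t - m), number of t equal to m), with repetition."""
    return (sum((t > m) - (t < m) for t in T), sum(1 for t in T if t == m))

def passes_scan(a, b):
    """Final-scan test from the text: |a| < b (which forces b > 0), or a = 0."""
    return abs(a) < b or a == 0

def dp_cost(S, m_L, m_R):
    """Best cost for a fixed pair (m_L, m_R), or inf if no state passes the scan."""
    states = {((0, 0), (0, 0)): 0}  # only A[0, (0,0), (0,0)] is finite
    for X in S:
        d_L, d_R = imb(X, m_L), imb(X, m_R)
        c_L = sum(abs(t - m_L) for t in X)
        c_R = sum(abs(t - m_R) for t in X)
        nxt = {}
        for (b_L, b_R), cost in states.items():
            to_L = ((b_L[0] + d_L[0], b_L[1] + d_L[1]), b_R)  # put X in L
            to_R = (b_L, (b_R[0] + d_R[0], b_R[1] + d_R[1]))  # put X in R
            for key, c in ((to_L, cost + c_L), (to_R, cost + c_R)):
                if c < nxt.get(key, INF):
                    nxt[key] = c
        states = nxt
    ok = [c for (b_L, b_R), c in states.items()
          if passes_scan(*b_L) and passes_scan(*b_R)]
    return min(ok, default=INF)

def solve(S):
    """Outer loop over candidate medians.
    Assumption: candidates are all elements plus all pairwise midpoints."""
    values = [t for X in S for t in X]
    cands = sorted({(p + q) / 2
                    for p, q in combinations_with_replacement(values, 2)})
    return min(dp_cost(S, m_L, m_R) for m_L in cands for m_R in cands)
```

On `S = [[0, 1], [2, 3]]`, `dp_cost(S, 0.5, 2.5)` gives 2, whereas `dp_cost(S, 0, 2)` gives infinity because neither side passes the scan test when the proposed median is one of the two middle elements of an even-sized side. That is why this sketch also tries midpoints; a dense array indexed by imbalance would match the stated $O(Kn^4)$ bound more literally.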

D.W.
  • thanks for the detailed answer! even though there is only $O(n^2)$ possibilities for $m_L, m_R$. but I fail to follow how can you determine if a specific $m_L, m_R$ pair is feasible to have at least one corresponding partition in polynomial time. – yupbank May 10 '21 at 20:26
  • @yupbank, That's done using dynamic programming, as explained in my answer. Without knowing why you're unable to follow it, it's hard to know what to say to help you understand (or what issue might exist with my solution). – D.W. May 10 '21 at 20:28
  • @yupbank, I've edited the answer to be even more concrete about this. – D.W. May 10 '21 at 20:30
  • trying to unroll the initial few steps of $A$: $A[0, b_L, b_R] = +\infty$, $A[1, ?, ?] = \min(A[0, -\text{imb}(X_1, m_L), 0], A[0, 0, -\text{imb}(X_1, m_R)])$? why would $A[K, 0, 0]$ be filled if an arbitrary $m_L, m_R$ was chosen in the first place? – yupbank May 10 '21 at 20:39
  • and for context, I had a very similar idea previously https://math.stackexchange.com/a/4012889/472765 , and then I started to question myself about the correctness – yupbank May 10 '21 at 20:42
  • @yupbank, sorry, I got the initialization wrong. See edited answer. You might need to work through the details of the dynamic programming algorithm to see if it works and if any adjustments are needed. – D.W. May 10 '21 at 20:50
  • I'm still missing the $b_L, b_R$ index when filling for $A[1, b_L, b_R]$ – yupbank May 10 '21 at 20:54
  • @yupbank, I suspect you'll need to study this more on your own and see if you can work out the details yourself; or else figure out how to formulate a precise, concrete question. This site isn't designed to support extensive back-and-forth conversations, interaction, or tutorial/teaching. – D.W. May 10 '21 at 20:58
  • dynamic programming aside, this is a greedy algorithm. and from a high level, it relies on the order of $X_1, X_2, \cdots X_K$, which if I shuffle it, then the greedy algorithm might end up in a different conclusion. it won't guarantee the optimality of the final solution – yupbank May 10 '21 at 21:17
  • @yupbank, this is not a greedy algorithm, and I believe the solution it provides does not depend on the order of the $X_1,\dots,X_K$. I believe it does guarantee optimality of the final solution. – D.W. May 10 '21 at 23:27
  • So the size of $A$ is $Knn$, but if the initial value is $A[0, 0, 0] = 0$ and everything else is $+\infty$, then no matter how the iteration is performed using the recurrence relation, the value can only be among $\min(0, +\infty)$ or $\min(+\infty, +\infty)$ or $\min(0, 0)$. not sure how you can associate $E(L)+E(R)$ with the value assignment of $A$. – yupbank May 11 '21 at 00:06
  • @yupbank, oh, good point, my recurrence relation was badly broken. Sorry for all the errors! Thank you for pointing this out. No wonder this solution didn't make any sense to you. I've edited to (hopefully) correct the recurrence relation. Take a look and see if it looks right. I hope there aren't any other errors, but there might well be. I apologize about that. – D.W. May 11 '21 at 03:06
  • No worries, and thanks for your patience with me:) – yupbank May 11 '21 at 03:08
  • Let’s assume $imb(X_1,m_L)=-2$ then to obtain $A[1,0,0]$, using the recurring equation, we need to access $A[0,-2,0]$ ? – yupbank May 11 '21 at 03:54
  • @yupbank, sure, imbalance can be positive or negative. I suggest working through some small examples to see how it could work, and if there are some details that are unclear, trying to fill them in yourself. It might help to do some practice with dynamic programming problems, as my answer is written for someone who is already familiar with that. – D.W. May 11 '21 at 04:20
  • I've updated my problem with the wrong attempt I tried to prove the NP-completeness, and my gut feeling tells me the opposite conversion is possible. would love to hear your comments too – yupbank May 11 '21 at 04:22
  • I believe this produces an optimal solution, though I found your earlier non-DP-based description in the comment much clearer. I'd like to take a shot at showing optimality of that in my own answer, and then if you like you could incorporate any parts you like from that? – j_random_hacker May 11 '21 at 05:04
  • @j_random_hacker, I believe my earlier algorithm was flawed. When I went to write it up, I realized there were problems with my proposed algorithm, hence the more complicated one shown here. One reason is that it's not clear how to deal with "ties" (where you can add the set to either $L$ or $R$ while increasing the objective function the same amount either way) in my prior attempt, if that makes any sense. If you think you can address that or can prove it correct anyway, I look forward to seeing it! – D.W. May 11 '21 at 06:12
  • If my current thinking is correct, that kind of "tie" only means that your earlier algorithm can produce a different solution (than some optimal solution that we may assume exists) -- but not a different-cost solution, i.e., it will still produce some optimal solution. Let's see if my current thinking pans out ;) – j_random_hacker May 11 '21 at 06:19
  • @j_random_hacker, my concern is that one of those two choices might make the median "correct" and the other might not, so the choice seems like it matters. If there are hundreds of such ties, then it seems non-trivial to figure out how to resolve each one to make all the medians work out. I don't know if I'm thinking about that right or not. – D.W. May 11 '21 at 08:58
  • i think i get this algorithm now: basically, it's factorizing the exponential placement space $O(2^K)$ into the imbalance space $O(n^2)$. For any placements resulting in the same imbalance, we only need the minimum objective. the objective function is monotonic, so step $K$ can rely on the step $K-1$ solutions. this is amazing, thank you! I'll implement it, run some simulations, and report back. – yupbank May 11 '21 at 15:32
  • I've got some bad news: the result is not optimal. I'm still trying to understand why, but in my simulation, even if I give it the optimal $m_L$ and $m_R$, it is not able to get the optimal objective, e.g. S =[array([0, 1, 8, 9, 0]), array([9, 4, 0, 1, 9]), array([4, 6, 8, 1, 8]), array([8, 6, 4, 3, 0])] , where the optimal m_l, m_r are 1.0 and 4.0. https://colab.research.google.com/drive/1AZ0ySIwxmktc3wI1tHI9y9k_z7CDIweO?usp=sharing – yupbank May 11 '21 at 19:51
  • the optimal solution is located in $A[K, 0, 4], A[K, 0, 5], A[K, 0, 7]$ instead of $A[K, 0, 0]$. felt like missing some edge case handling.. – yupbank May 11 '21 at 19:57
  • @yupbank, on that example, what value of the objective function does the algorithm give, with what partition? And what is the optimal partition, and what value of the objective function does it give? Have you tried working through that example by hand to see what is happening and to see if each $A[\cdot, \cdot, \cdot]$ cell holds the correct value according to the definition I gave in the answer? – D.W. May 11 '21 at 23:12
  • so it's the $imb(T, m)$ logic that needs to be refined. will share the correction too. now I really appreciate this neat solution! – yupbank May 11 '21 at 23:12
  • to elaborate, when $X_i=[3, 4, 4, 4, 6, 6, 8], m_L=4$, the current $imb(X_i, m_L)$ would return 2, which is not right; it should be 0, as $X_i$ is balanced with $m_L$ as median – yupbank May 11 '21 at 23:21
  • @yupbank: There's a bug in your implementation of E(): median = median or np.median(x) means that if 0 is passed as a median, it will be ignored and the median of x used instead. (Not sure if this affects the issue you're having.) – j_random_hacker May 11 '21 at 23:24
  • no, that bit is fine. it's the $imb$ definition, I'll change the output of $imb$ into explicit (# of elements in T strictly less than m, # of elements in T equal to m, # of elements in T strictly greater than m) to resolve that – yupbank May 11 '21 at 23:27
  • So to quantify the imbalance state of one side of a partition, we need two dimensions instead of one. (# bigger than m - # smaller than m, # equals to m). which would make the worst-case state-space complexity $O(n^4)$? for example, even if we have a minimum $A[K][0][0]$ with $m_L, m_R$, if $m_L$ is not in the $L$ side, then it's still invalid as a solution. – yupbank May 11 '21 at 23:49
  • @yupbank, ahh, I see the issue! Yup, looks like you are right. Still polynomial time, but $O(n^4)$ instead of $O(n^2)$. Would you like to suggest an edit to show a correct[ed] algorithm? – D.W. May 11 '21 at 23:58
  • yes, of course! – yupbank May 12 '21 at 00:00
  • and here is a working version implementation https://colab.research.google.com/drive/1tH0JUp0Th2uBqT0dvsuCzr3tLA09grFS?usp=sharing – yupbank May 12 '21 at 00:32
  • @yupbank, thank you for the corrections and edits and improvements! – D.W. May 12 '21 at 00:37
  • @yupbank: Note that your new implementation has a bug with solutions where a median is half way between two input points, so e.g. construct_imbalance_space([np.array([0, 1]), np.array([2, 3])], 0.5, 2.5) returns inf instead of 2. D.W.'s edits fix the update condition. – j_random_hacker May 12 '21 at 01:10
  • this is not a bug, in your example 0.5 and 2.5 is only a proposal for $m_L, m_R$, we need to ensure both of them exist in $S$, and I helped a bit on @D.W.'s edit, this checking logic is included too – yupbank May 12 '21 at 01:18
  • Don't you wish to recover the optimal median pair, in addition to the cost? – j_random_hacker May 12 '21 at 01:19
  • (If it's really not a bug, then you might as well exclude all medians of even-sized sets from generate_pair_median(), as they have no effect on the algorithm at present.) – j_random_hacker May 12 '21 at 01:24
  • I just realised that once you have an optimal partition, you can recover the medians easily, so there's no need to try medians of even-sized sets even if you do want an optimal median pair. – j_random_hacker May 12 '21 at 01:30
  • hmm... you are right, there might still be edge cases that need to be addressed. and yeah the generate_pair_median function really would produce redundant proposals (I was only doing this to ensure there is indeed a polynomial solution). And if I remember correctly, you still have a solution to share! – yupbank May 12 '21 at 01:31
  • @j_random_hacker yeah, I think the "even median" is being accidentally addressed, and the generate_pair_median doesn't need the second combinations_with_replacement and corresponding median. many thanks! – yupbank May 12 '21 at 01:50
  • Provided you recompute the medians from an optimal partition, I think you're good. I was stuck on the idea that you need a path in the DP matrix corresponding to the optimal solution, but due to a quirk of how medians work, you actually don't here! I never had ideas for other algorithms to solve this problem. I did think I could prove D.W.'s original "guess+greedy" algorithm correct, but having thought about his response I no longer think it's possible to salvage unfortunately. – j_random_hacker May 12 '21 at 01:55
  • I think $O(Kn^2)$ time can be restored by a technique equivalent to adding distinct, tiny values to each input value. Let $u_j$ be the $j$-th number in the multiset union of the $X_i$, in some arbitrary order. Replace it with the pair $(u_j, j)$, which we will treat as an "extended number". When performing arithmetic with an "extended number", it reduces to just its first element, but comparisons change: $(u_i, i) < (u_j, j) \iff u_i < u_j \lor (u_i = u_j \land i < j)$. I'll see if this pans out and if so write it up. – j_random_hacker May 12 '21 at 02:17
  • Drop me an email ([email protected]) and we can move discussion over there? The problem is part of a machine learning project I’m studying. And I’m planning to write a paper around it, and collaborations are very welcome! Or if it’s too much, I’ll cite this answer instead. – yupbank May 12 '21 at 03:33
  • @j_random_hacker, oh, very nice, that's a lovely solution! Looks promising to me! – D.W. May 12 '21 at 04:38
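
The "extended number" comparison proposed in the comments happens to coincide with Python's built-in lexicographic ordering of tuples, so the tie-breaking part of the idea can be sketched as below. This illustrates only the tie-breaking; the function names are hypothetical, and whether the full $O(Kn^2)$ algorithm works out was left open in the thread.

```python
def extend(S):
    """Replace the j-th pooled number u_j by the pair (u_j, j)."""
    out, j = [], 0
    for X in S:
        row = []
        for u in X:
            row.append((u, j))
            j += 1
        out.append(row)
    return out

def imb_ext(T, m):
    """Imbalance pair under tuple order: (u_i, i) < (u_j, j) iff
    u_i < u_j, or u_i == u_j and i < j."""
    a = sum((t > m) - (t < m) for t in T)
    b = sum(1 for t in T if t == m)
    return (a, b)

S_ext = extend([[3, 4, 4], [4, 6]])
pooled = [t for X in S_ext for t in X]
m = pooled[2]  # the extended number (4, 2)
# All extended numbers are distinct, so the "equal to m" count is 0 or 1.
assert imb_ext(pooled, m) == (0, 1)
```

Arithmetic such as $|i - m|$ would use only the first component of each pair, e.g. `abs(t[0] - m[0])`.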