
Suppose you have a dependency graph of "packages" registered in the ecosystem of a given programming language. We can model each package as a tuple (name, version) where name is the package name (like Plots) and version is a semantic version number (like 1.2.3).

Each package has a set of dependencies, which may include constraints. Perhaps Plots-1.2.3 depends on JSON with a constraint >= 1.0 && < 3.0. Thus, there could be a number of JSON-X.Y.Z versions that are compatible with Plots-1.2.3.

The goal of this problem is to construct a "minimal" set of nodes to form a useful package snapshot.

Let's define a "useful" package snapshot as follows:

  1. It contains the latest version of every package.
  2. Every package in the snapshot is "resolvable" (EDIT: as long as it's resolvable in the original ecosystem), i.e. the snapshot must include a compatible version of every necessary dependency of that package.
  3. Every two packages in the snapshot are "pairwise resolvable," meaning that if there is a working set of dependencies to install them simultaneously in the original ecosystem, then there is also such a working set in the snapshot. (EDIT: note that when you're installing a set of packages, you only get to choose one version of a particular dependency package. Thus, if packages P and Q both depend on dependency D, pairwise resolvability means that there must be one single version of D compatible with both P and Q in the snapshot, as long as such is true in the full ecosystem.)
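
To make rules 2 and 3 concrete, here is a minimal sketch of a resolvability check by backtracking. The toy ecosystem, the `(low, high)` half-open range format, and the names `ECOSYSTEM` and `resolve` are all assumptions made up for illustration; a real ecosystem would have richer constraint syntax and would need a proper solver at scale.

```python
# Hypothetical toy ecosystem: (name, version) -> {dependency: (low, high)},
# where a dependency version v is acceptable when low <= v < high.
ECOSYSTEM = {
    ("Plots", (1, 2, 3)): {"JSON": ((1, 0, 0), (3, 0, 0))},
    ("Tables", (2, 0, 0)): {"JSON": ((2, 0, 0), (4, 0, 0))},
    ("JSON", (1, 5, 0)): {},
    ("JSON", (2, 1, 0)): {},
    ("JSON", (3, 0, 0)): {},
}


def resolve(roots, available, ecosystem=ECOSYSTEM):
    """Pick one version per package name so that every root and every
    transitive dependency is satisfied using only nodes in `available`.
    Returns {name: version}, or None when no such choice exists."""

    def extend(chosen, pending):
        if not pending:
            return chosen
        (name, low, high), rest = pending[0], pending[1:]
        if name in chosen:  # version already fixed: only check the constraint
            return extend(chosen, rest) if low <= chosen[name] < high else None
        for n, v in sorted(available, reverse=True):  # try newest first
            if n == name and low <= v < high:
                new = [(d, lo, hi) for d, (lo, hi) in ecosystem[(n, v)].items()]
                result = extend({**chosen, name: v}, rest + new)
                if result is not None:
                    return result
        return None

    chosen, pending = {}, []
    for name, version in roots:
        # A root must be present, and two roots cannot pin different versions.
        if (name, version) not in available or chosen.get(name, version) != version:
            return None
        chosen[name] = version
        pending += [(d, lo, hi) for d, (lo, hi) in ecosystem[(name, version)].items()]
    return extend(chosen, pending)
```

In this toy data, a snapshot that drops JSON-2.1.0 still resolves Plots and Tables individually (via JSON-1.5.0 and JSON-3.0.0 respectively), but `resolve` returns `None` for the pair, even though the pair resolves in the full ecosystem. That is exactly the kind of snapshot rule 3 is meant to exclude.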

By this definition the full ecosystem is a useful snapshot. But the key part of this goal is "minimal." I want to find the smallest useful package snapshot that satisfies the rules above. I feel like there must be some prior art or theory that can help here. Graph coloring? Satisfiability? Please help :)


In case it's of interest, this is a real problem -- I want to package such a snapshot in the Nix package manager, so that a user can express something like "Please give me an environment with packages A, B, and C" and get a working environment with the latest version of those packages, as long as such is possible. There are about 80,000 nodes in the real graph. The solution doesn't need to be perfectly optimal but the smaller I can get it, the better.

tom
  • "Pairwise resolvable" seems poorly motivated and unimportant in practice. I wonder if you are solving an artificial problem (that is also artificially harder than it needs to be). – D.W. Nov 28 '22 at 07:29
  • On the contrary, I think pairwise resolvable is very important and is exactly what I want. When I was thinking about ad-hoc ways to do this, it was easy to come up with package bounds and strategies (like "just take the latest version of every dependency") that violate this. I don't want users to find that suddenly two packages can't be installed alongside each other when they can in the original ecosystem. In fact, if it were feasible to do 3-way resolvability or arbitrary N-way for some N, I'd be interested in that too. – tom Nov 28 '22 at 20:27
  • I just edited to clarify rule 2 that a package must be resolvable only if it's resolvable in the original ecosystem. With that clarification, the full set of nodes is such a snapshot. So such a snapshot does always exist. – tom Nov 28 '22 at 20:27
  • I understand, but I continue to suspect that your "pairwise resolvable" requirement is an instance of an XY problem. If the goal is to make sure that users can install any subset of packages they want, then there are likely to be better ways to achieve that goal. I doubt very much that 2 is special and there is an absolute requirement to satisfy 2-way resolvability but no hard requirement to satisfy 3-way resolvability. – D.W. Nov 28 '22 at 21:52
  • Oh you're right about 2 not being special. In the intended use-case, users will only install a handful of packages, so my intuition is that 2 is enough for the system to work well. But I'd welcome better ways to achieve good resolvability. If you'd like to solve it for ∞-way resolvability then please do :) – tom Nov 28 '22 at 21:55
  • I'm pointing out that your list of requirements is most likely too strict. If it isn't absolutely necessary for 3-way resolvability to be a hard requirement, it most likely isn't absolutely necessary for 2-way resolvability to be a hard requirement, either. Most likely there are other solutions that would be OK. By including an unnecessary requirement, I suspect you may be ruling out solutions that would actually be acceptable for users. It may lead you to unnecessarily reject solutions that would in practice be good enough, because it's too expensive for them to ensure 2-way resolvability. – D.W. Nov 28 '22 at 22:38
  • If you have another way to formulate it, I'd love to hear it. It seems like a reasonable formulation to me because it captures exactly what will matter to users. I'm totally open to alternatives as long as they capture the spirit of the requirement, which is that users will not discover subsets of packages unexpectedly can't be installed together. – tom Nov 28 '22 at 23:54
  • To be extra clear: the ultimate goal would be ∞-way resolvability. But I posed it using (strictly easier) 2-way resolvability because a) I think it would be acceptable and b) I was hoping to solicit answers involving SAT solvers and I was afraid it would be too hard with ∞-way resolvability. – tom Nov 29 '22 at 00:03
  • I know, but I suspect you are taking a flawed approach, because you might be inappropriately rejecting solutions (that don't ensure 2-way resolvability) that may be acceptable in practice. I understand your goal is to avoid a situation where a user tries to install two packages and is told they can't, but I suspect there are other ways to avoid that. I suspect this is an XY problem: there are factors which are not stated in the question which lead you to assume that the best way to achieve your ultimate goal is to ensure 2-resolvability, but there might be other ways to achieve that goal. – D.W. Nov 29 '22 at 02:38
  • Okay, your suspicion is noted :). I've tried my best to pose the right question. I even formulated a way to think about the different levels of difficulty ("N-way resolvability") for this class of question. There's nothing unstated that I can think of. If I think of a better question later, I'll ask that! But for now this is the best formulation I've got. I'd welcome suggestions. – tom Nov 29 '22 at 02:54
  • My ideas: A) If you can install package P and you can install package Q (separately), I don't understand what would ever prevent you from installing both P and Q. B) What will you do if a user tries to install 3 packages that aren't 3-way resolvable? What prevents you from using the same response for if a user installs 2 packages that aren't 2-way resolvable? C) if the user tries to install a package that isn't resolvable with all the others, solve a new SAT/ILP instance at that time to pick some packages based on the updated knowledge about that pair of packages. – D.W. Nov 29 '22 at 03:17
  • I'm going to hit the button to move this discussion to chat as it suggests--never done this before, hope it works... – tom Nov 29 '22 at 03:35
  • I think it would help if you stated in the question that you can only select at most one version of each package (you're not allowed to choose two or more versions of the same package). – D.W. Nov 29 '22 at 16:25
  • You're allowed to add two or more versions to the snapshot! It's just the case that when the user selects a set of desired packages (and their dependencies) for their particular environment, the environment can only contain one version of each package. (This is how the build systems of most languages work.) I'll try to edit the question to make this clear. – tom Nov 30 '22 at 05:44

1 Answer


One standard approach for package managers these days seems to be to use a SAT solver (or ILP solver). That seems like a plausible solution. All of your requirements can be expressed in SAT (or ILP).

If you use SAT, you will need to constrain the number of packages selected, using a cardinality constraint: see "Reduce the following problem to SAT" (also, "Encoding 1-out-of-n constraint for SAT solvers" has a few methods that can be generalized to this). Then, do binary search to find the minimal number of packages that can be selected. If you use ILP, see "Express boolean logic operations in zero-one integer linear programming (ILP)" for tips on expressing your constraints within ILP.
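
As a rough sketch of the cardinality-constraint-plus-binary-search idea, the following uses a naive "at most k" encoding and a brute-force search standing in for a real SAT solver. Everything here is illustrative: the variable numbering, `EXAMPLE_CLAUSES`, and the function names are assumptions, the clauses cover only rules 1 and 2 for a tiny one-level example, and the naive encoding grows combinatorially, so a practical attempt would use a compact encoding (such as a sequential counter or totalizer) and an actual solver.

```python
from itertools import combinations, product


def at_most_k(variables, k):
    """Naive CNF for 'at most k of `variables` are true':
    every (k+1)-subset must contain at least one false variable."""
    return [[-v for v in subset] for subset in combinations(variables, k + 1)]


def brute_force_sat(num_vars, clauses):
    """Stand-in for a SAT solver: try every assignment, return a model or None.
    Literal +i means variable i is true, -i means it is false (1-based)."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return bits
    return None


def minimize_true_vars(num_vars, clauses):
    """Binary search for the smallest k such that `clauses` together with
    'at most k variables true' is satisfiable. Returns (k, model) or None."""
    best = brute_force_sat(num_vars, clauses)
    if best is None:
        return None
    variables = list(range(1, num_vars + 1))
    lo, hi = 0, num_vars  # invariant: satisfiable with at most `hi` true
    while lo < hi:
        mid = (lo + hi) // 2
        model = brute_force_sat(num_vars, clauses + at_most_k(variables, mid))
        if model is not None:
            best, hi = model, mid
        else:
            lo = mid + 1
    return lo, best


# Hypothetical numbering: 1=Plots-1.2.3, 2=JSON-1.5.0, 3=JSON-2.1.0,
# 4=Tables-2.0.0, 5=JSON-3.0.0. One variable per node: true = in the snapshot.
EXAMPLE_CLAUSES = [
    [1], [4], [5],   # rule 1: the latest version of every package is included
    [-1, 2, 3],      # rule 2: Plots needs JSON-1.5.0 or JSON-2.1.0
    [-4, 3, 5],      # rule 2: Tables needs JSON-2.1.0 or JSON-3.0.0
]
```

Calling `minimize_true_vars(5, EXAMPLE_CLAUSES)` searches for the smallest snapshot size for which the clauses remain satisfiable. The binary search is valid because feasibility is monotone: anything satisfiable with at most k nodes is also satisfiable with at most k+1.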

One approach to enforce 2-resolvability is to add a constraint for every pair of packages that they be simultaneously resolvable. However, a more efficient way might be to use lazy enforcement: initially, ignore those constraints, and solve the SAT/ILP instance without such a restriction. Then, check the resulting solution for any violations of 2-resolvability. For each pair of packages $P,Q$ that fail to be 2-resolvable with that solution, add a new constraint that $P,Q$ must be simultaneously resolvable, then solve the resulting modified SAT/ILP instance. Repeat until all such pairs are resolvable.
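
The lazy-enforcement loop might look roughly like the sketch below. It assumes a deliberately flattened model that is not from the question: top-level packages `P`, `Q`, `R` each accept a set of versions of a dependency, dependencies have no dependencies of their own, and an exhaustive search over subsets (`smallest_snapshot`) stands in for the SAT/ILP solve. All names and data are hypothetical.

```python
from itertools import combinations

# Hypothetical flat ecosystem: for each top-level package, the versions of
# each dependency it accepts. Dependencies themselves depend on nothing.
NEEDS = {
    "P": {"D": {1, 2}},
    "Q": {"D": {2, 3}},
    "R": {"D": {3}},
}
ALL_VERSIONS = {"D": {1, 2, 3}}


def co_resolvable(pkgs, snapshot):
    """True if, for every dependency used by `pkgs`, the snapshot holds a
    single version that all of those packages accept."""
    for dep in {d for p in pkgs for d in NEEDS[p]}:
        ok = set(snapshot[dep])
        for p in pkgs:
            if dep in NEEDS[p]:
                ok &= NEEDS[p][dep]
        if not ok:
            return False
    return True


def smallest_snapshot(enforced):
    """Stand-in for the SAT/ILP solve: the smallest set of dependency
    versions (always including each latest version) under which every
    group in `enforced` is co-resolvable."""
    latest = {(d, max(vs)) for d, vs in ALL_VERSIONS.items()}
    everything = {(d, v) for d, vs in ALL_VERSIONS.items() for v in vs}
    optional = sorted(everything - latest)
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            snap = {d: set() for d in ALL_VERSIONS}
            for d, v in latest | set(extra):
                snap[d].add(v)
            if all(co_resolvable(g, snap) for g in enforced):
                return snap


def lazy_snapshot():
    """Solve, check for 2-resolvability violations, add constraints, repeat."""
    # Start with single-package resolvability only (rule 2).
    enforced = [(p,) for p in NEEDS if co_resolvable((p,), ALL_VERSIONS)]
    rounds = 0
    while True:
        rounds += 1
        snap = smallest_snapshot(enforced)
        violated = [
            pair for pair in combinations(sorted(NEEDS), 2)
            if co_resolvable(pair, ALL_VERSIONS) and not co_resolvable(pair, snap)
        ]
        if not violated:
            return snap, rounds
        enforced.extend(violated)  # only the pairs that actually failed
```

On this toy data the first solve picks versions that satisfy each package alone but break the pair `P`, `Q`; one added constraint fixes it. The appeal of the lazy scheme is that only pairs that actually fail ever become constraints, which may be far fewer than all pairs.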

Another approach is to ask that there is a high probability that the set $\mathcal{S}$ of packages is simultaneously resolvable, where $\mathcal{S}$ is chosen according to some probability distribution that approximately mimics the packages a user is likely to ask to install. To achieve that, one approach would be to randomly sample $n$ such sets, $S_1,\dots,S_n$, from this distribution, then add constraints requiring that all of $S_1,\dots,S_n$ be simultaneously resolvable. Another variant of that is to add a constraint that requires that at least $0.9n$ of $S_1,\dots,S_n$ be simultaneously resolvable (using another cardinality constraint, if you are using SAT).
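
A small sketch of the sampling side of this idea follows. The popularity weights, the `CONFLICTS` set (pairs a candidate snapshot is assumed unable to co-install), and all function names are invented for illustration, and resolvability is reduced to "contains no conflicting pair", which is much cruder than a real dependency check. The sketch only shows how sampled request sets could be drawn and how the 0.9 threshold could be evaluated for a candidate snapshot; in the SAT/ILP formulation the same sampled sets would become constraints instead.

```python
import random
from itertools import combinations

# Hypothetical install weights approximating what users ask for.
POPULARITY = {"P": 50, "Q": 30, "R": 15, "S": 5}
# Hypothetical: pairs that some candidate snapshot cannot install together.
CONFLICTS = {frozenset({"R", "S"})}


def sample_request(rng, max_size=3):
    """Draw one set of distinct packages, weighted by popularity."""
    size = rng.randint(1, max_size)
    names, weights = zip(*POPULARITY.items())
    request = set()
    while len(request) < size:
        request.add(rng.choices(names, weights=weights)[0])
    return frozenset(request)


def resolvable(request):
    """Simplified stand-in: a request works unless it contains a known conflict."""
    return not any(
        frozenset(pair) in CONFLICTS for pair in combinations(sorted(request), 2)
    )


def acceptance_rate(n, seed=0):
    """Fraction of n sampled requests that the candidate snapshot can satisfy."""
    rng = random.Random(seed)  # seeded so repeated runs give the same samples
    samples = [sample_request(rng) for _ in range(n)]
    return sum(resolvable(s) for s in samples) / n


def good_enough(n, threshold=0.9, seed=0):
    """The 'at least 0.9n of the sampled sets resolve' variant as a check."""
    return acceptance_rate(n, seed) >= threshold
```

Because rarely requested packages are rarely sampled, a conflict between two unpopular packages costs little under this criterion, which is the point of weighting by a realistic request distribution.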

D.W.
  • Can SAT solvers actually try to minimize the number of variables they set to true? That seems like a different problem to me. Remember, including every node in the original ecosystem is a valid solution--the goal here is to minimize. – tom Nov 28 '22 at 20:33
  • ILP seems more likely. But could you suggest how you'd encode it? It seems like you would get a painful number of rules, in particular from the pairwise resolvability constraint. Although I think it might be doable if you solve it for one dependency at a time... – tom Nov 28 '22 at 20:37
  • @tom, see edited answer. – D.W. Nov 29 '22 at 16:24
  • Got it--I like the lazy enforcement idea. Seems like a good amount of stuff to try here so I'll have to start messing with solvers and see how it works. Thanks! – tom Nov 30 '22 at 06:06