
I have had problems accepting the complexity-theoretic view of "efficiently solved by a parallel algorithm", which is given by the class NC:

NC is the class of problems that can be solved by a parallel algorithm in time $O(\log^c n)$ on $p(n) \in O(n^k)$ processors, for some constants $c, k \in \mathbb{N}$.

We can assume a PRAM.

My problem is that this does not seem to say much about "real" machines, that is, machines with a finite number of processors. Now I am told that "it is known" that we can "efficiently" simulate an $O(n^k)$-processor algorithm on $p \in \mathbb{N}$ processors.

What does "efficiently" mean here? Is this folklore or is there a rigorous theorem which quantifies the overhead caused by simulation?

What I am afraid happens is this: I have a problem with a sequential $O(n^k)$ algorithm and also an "efficient" parallel algorithm which, when simulated on $p$ processors, also takes $O(n^k)$ time (which is all that can be expected at this level of granularity if the sequential algorithm is asymptotically optimal). In this case, there is no speedup whatsoever as far as we can see; in fact, the simulated parallel algorithm may be slower than the sequential algorithm. That is, I am really looking for statements more precise than $O$-bounds (or a declaration of the absence of such results).
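
To make the scenario concrete, here is the kind of toy instance I have in mind (standard textbook bounds; $T_p$ denotes the running time on $p$ processors): summing $n$ numbers takes $\Theta(n)$ time sequentially, while pairwise tree summation on an EREW PRAM takes $\Theta(\log n)$ time using $n/2$ processors. The parallel algorithm still performs $\Theta(n)$ operations in total, so any simulation on a machine with a fixed number $p$ of processors needs time

$$T_p \in \Omega\!\left(\frac{n}{p} + \log n\right) = \Omega(n) \quad \text{for constant } p,$$

which is the same $O$-class as the sequential algorithm.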

Raphael
  • Brent's theorem? – cic May 03 '12 at 09:27
  • Do you mean $T_p < \frac{W}{p} + D$? If so, this is (afaik) only applicable in certain circumstances and also does not immediately allow one to translate runtimes. Or if it does, please elaborate in an answer. – Raphael May 03 '12 at 09:49
  • NC answers the question "is it possible to trade off more hardware for less run time?" You may want to restrict yourself to constant hardware, and this is similar to restricting yourself to constant memory, a better model for some problems. For a practical use, see carry-lookahead adders: more hardware so that the addition of $N$ bits is done in $O(\log N)$. – AProgrammer May 04 '12 at 04:33

2 Answers


If you assume that the number of processors is bounded by a constant, then you are right that a problem being in NC does not mean much in practice. Since any algorithm on a PRAM with $k$ processors and parallel time $t$ can be simulated on a single-processor RAM in $O(kt)$ time, the parallel time and the sequential time can differ only by a constant factor if $k$ is a constant.
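
As a rough illustration of this simulation argument, here is a minimal Python sketch (the PRAM program is modeled, somewhat naively, as a per-processor step function acting on a shared array; the names and the toy workload are only illustrative):

```python
# Round-robin simulation of a k-processor PRAM on one processor.
# Each "processor" is a function step(pid, shared, t) performing that
# processor's work for parallel time step t. Simulating t_max parallel
# steps therefore costs O(k * t_max) sequential steps.

def simulate_pram(step, k, t_max, shared):
    """Simulate k PRAM processors for t_max parallel steps, one by one."""
    for t in range(t_max):        # each parallel step ...
        for pid in range(k):      # ... costs k sequential steps
            step(pid, shared, t)
    return shared

# Toy workload: pairwise tree summation of 8 values in 3 parallel steps.
def sum_step(pid, shared, t):
    stride = 1 << t               # distance between the two cells to add
    i = pid * 2 * stride
    if i + stride < len(shared):
        shared[i] += shared[i + stride]

data = [1, 2, 3, 4, 5, 6, 7, 8]
simulate_pram(sum_step, k=len(data) // 2, t_max=3, shared=data)
print(data[0])  # 36, the sum of the input
```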

However, if you assume that you can prepare a computer with more processors as the input size grows, then a problem being in NC means that as long as you can prepare more processors, the running time will be “very short” or, more precisely, polylogarithmic in the input size. If you think that this assumption is unrealistic, compare it to the assumption of unbounded memory: actual computers have only a finite amount of space, but in the study of algorithms and complexity, we almost always assume that a computational device does not have a constant upper bound on space. In practice, this means that we can prepare a computer with more memory as the input size grows, which is how we usually use computers in the real world. NC models an analogous situation in parallel computation.
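
For a rough sense of scale (back-of-the-envelope numbers only, using the tree summation sketch above): with $n = 10^6$ values and a machine that grows with the input ($n/2$ processors), the parallel algorithm needs about $\log_2 n \approx 20$ parallel steps, whereas any fixed machine with $p$ processors needs on the order of $n/p$ steps, e.g. roughly $62{,}500$ steps for $p = 16$. The polylogarithmic guarantee behind NC speaks only to the first regime.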

Tsuyoshi Ito
  • Yes, parallelising on constantly many cores can only yield constant speedup. That is inherent and sadly hidden in $O$-terms. The (imho) interesting question is: can I get (optimal) speedup $k$, or only $k/2$, or $k-1$? While the assumption of infinite memory can be justified by the availability of lots of RAM (and, technically, by adding the hard disk), this is not generally true for processors. Typical (personal) machines have 16 or fewer cores nowadays. In other words, you can use "normal" results up to relevant problem sizes, but many parallel results only up to $n \leq 20$. – Raphael May 03 '12 at 12:22
  • @Raphael: The question of whether a certain problem belongs to NC or not does not model your question. I am not saying that your question is uninteresting; I am just saying that NC is not the right complexity class to model that. – Tsuyoshi Ito May 03 '12 at 13:40
  • I am actually happy to hear that; a person claims otherwise, though, not necessarily about NC but about complexity-theoretic results in general. What about other classes? – Raphael May 03 '12 at 14:43
  • A correction: A problem being in NC means that the running time is polylogarithmic if the number of processors is a sufficiently large polynomial in the input size. In the arguably more realistic scenario where the number of processors is a fixed polynomial like $O(\sqrt{n})$, or a slower non-constant function like $O(\log n)$, membership in NC doesn't formally imply anything at all. – JeffE May 04 '12 at 12:01
  • @JeffE: That is not a correction. I only wrote “prepare more processors” without giving its rigorous meaning (because I thought that doing so would obscure the point). – Tsuyoshi Ito May 15 '12 at 19:50