95

I understand that GPUs are generally used to do LOTS of calculations in parallel. I understand why we would want to parallelize processes in order to speed things up. However, GPUs aren't always better than CPUs, as far as I know.

What kinds of tasks are GPUs bad at? When would we prefer CPU over GPU for processing?

Discrete lizard
ChocolateOverflow
  • 6
    Sounds like a dupe of https://superuser.com/questions/308771/why-are-we-still-using-cpus-instead-of-gpus – levininja Feb 25 '20 at 19:54

13 Answers

143

GPUs are bad at doing one thing at a time. A modern high-end GPU may have several thousand cores, but these are organized into SIMD blocks of 16 or 32. If you want to compute 2+2, you might have 32 cores each compute an addition operation, and then discard 31 of the results.
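
To make this concrete, here is a minimal CUDA sketch (the kernel name, launch configuration, and host code are made up for illustration, not part of the answer): a single 2+2 addition still occupies a full 32-lane warp, and 31 of the 32 identical results are simply dropped.

```cuda
// Illustrative sketch: computing one 2+2 still schedules a whole 32-lane warp.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one_pair(const int *a, const int *b, int *out) {
    // All 32 lanes of the warp run in lockstep and compute the same sum;
    // only lane 0's copy is kept, the other 31 results are thrown away.
    int sum = a[0] + b[0];
    if (threadIdx.x == 0) out[0] = sum;
}

int main() {
    int ha = 2, hb = 2, hout = 0, *da, *db, *dout;
    cudaMalloc(&da, sizeof(int));
    cudaMalloc(&db, sizeof(int));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(da, &ha, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(int), cudaMemcpyHostToDevice);

    // One block of 32 threads: the hardware schedules a whole warp even
    // though only one result is wanted.
    add_one_pair<<<1, 32>>>(da, db, dout);
    cudaMemcpy(&hout, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 2 = %d\n", hout);

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```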

GPUs are bad at doing individual things fast. GPUs only recently topped the one-gigahertz mark, something that CPUs did more than twenty years ago. If your task involves doing many things to one piece of data, rather than one thing to many pieces of data, a CPU is far better.
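
As an illustration of such a serial workload, here is a CUDA sketch (the LCG chain is a hypothetical example, not from the answer): every step depends on the previous one, so the GPU side can only ever use a single thread, while one high-clocked CPU core runs the identical loop and will typically finish much sooner.

```cuda
// Hypothetical workload: a long chain of dependent updates to a single value.
// Step i needs the result of step i-1, so extra GPU threads cannot help.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__host__ __device__ unsigned long long lcg_step(unsigned long long v) {
    // One step of a linear congruential generator: each step depends on the last.
    return v * 6364136223846793005ULL + 1442695040888963407ULL;
}

__global__ void chain_gpu(unsigned long long *x, int steps) {
    unsigned long long v = *x;
    for (int i = 0; i < steps; ++i) v = lcg_step(v);
    *x = v;
}

int main() {
    const int steps = 10000000;
    unsigned long long seed = 42, hx, *dx;
    cudaMalloc(&dx, sizeof(hx));
    cudaMemcpy(dx, &seed, sizeof(hx), cudaMemcpyHostToDevice);
    cudaFree(0);                              // create the CUDA context before timing

    auto g0 = std::chrono::steady_clock::now();
    chain_gpu<<<1, 1>>>(dx, steps);           // a single GPU thread: nothing to parallelize
    cudaDeviceSynchronize();
    auto g1 = std::chrono::steady_clock::now();
    cudaMemcpy(&hx, dx, sizeof(hx), cudaMemcpyDeviceToHost);

    auto c0 = std::chrono::steady_clock::now();
    unsigned long long cv = seed;
    for (int i = 0; i < steps; ++i) cv = lcg_step(cv);   // same chain on one CPU core
    auto c1 = std::chrono::steady_clock::now();

    long long gpu_ms = std::chrono::duration_cast<std::chrono::milliseconds>(g1 - g0).count();
    long long cpu_ms = std::chrono::duration_cast<std::chrono::milliseconds>(c1 - c0).count();
    printf("GPU: %llu in %lld ms, CPU: %llu in %lld ms\n", hx, gpu_ms, cv, cpu_ms);
    cudaFree(dx);
    return 0;
}
```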

GPUs are bad at dealing with data non-locality. The hardware is optimized for working on contiguous blocks of data. If your task involves picking up individual pieces of data scattered around your data set, the GPU's incredible memory bandwidth is mostly wasted.
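
A rough CUDA sketch of the access-pattern difference (kernel names, sizes, and the scatter pattern are illustrative assumptions): a contiguous copy lets each warp's loads coalesce into a few wide transactions, while a gather through a random index array turns almost every load into its own 32-128 byte transaction, most of which goes unused. Profiling the two kernels (e.g. with Nsight Compute) should show the gap in achieved useful bandwidth.

```cuda
// Illustrative kernels: the same copy done with contiguous reads versus
// reads scattered through an index array.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_contiguous(const float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Neighbouring threads read neighbouring addresses: a warp's 32 loads
    // coalesce into a handful of wide memory transactions.
    if (i < n) b[i] = a[i];
}

__global__ void copy_scattered(const float *a, const int *idx, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Neighbouring threads read unrelated addresses: each load pulls in a
    // 32-128 byte chunk of which only 4 bytes are used.
    if (i < n) b[i] = a[idx[i]];
}

int main() {
    const int n = 1 << 24;                       // 16M floats (64 MB)
    float *a, *b;
    int *idx;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    unsigned int r = 12345u;
    for (int i = 0; i < n; ++i) {
        a[i] = (float)i;
        r = r * 1664525u + 1013904223u;          // simple LCG to build a scatter pattern
        idx[i] = (int)(r % (unsigned)n);
    }

    int threads = 256, blocks = (n + threads - 1) / threads;
    copy_contiguous<<<blocks, threads>>>(a, b, n);
    copy_scattered<<<blocks, threads>>>(a, idx, b, n);
    cudaDeviceSynchronize();

    printf("spot check: b[0] = %.0f\n", b[0]);   // force the result back to the host
    cudaFree(a); cudaFree(b); cudaFree(idx);
    return 0;
}
```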

Mark
  • Concerning "If your task involves doing many things to one piece of data... a CPU is far better": did you mean doing many things sequentially (one after another) to one piece of data? From what I understand of your answer, GPUs are in general better at doing many things in parallel, which could be interpreted as doing many things to copies of a single thing (and merging after computation). Given this ambiguity, perhaps you could also say whether GPUs can do many different computations in parallel almost as well as a single computation in parallel? – a.t. Feb 25 '20 at 13:33
  • 8
    @a.t.: If you copy a single thing, it becomes multiple things. Then you can perform operations on those multiple things, but if it's not pointless (e.g. discarding 31 of 32 results), you have to collect the results, which takes time (a rough sketch of this copy-compute-collect pattern follows the comments). – jamesqf Feb 25 '20 at 17:09
  • 2
    AMD GPUs have a scalar engine, although it's limited to operations useful for address arithmetic. OTOH, modern CPUs put most of their ALUs into SIMD engines, though with more limited functionality than GPU SIMDs. Moreover, Ice Lake has 512-bit CPU SIMD engines, while GPU SIMDs are only 256-bit. – Bulat Feb 25 '20 at 17:33
  • 2
    A modern GPU runs at 2 GHz executing 1 operation per cycle, while a modern CPU runs at 5 GHz performing 4 scalar or 2 SIMD operations per cycle. – Bulat Feb 25 '20 at 17:39
  • 3
    The claim about data non-locality is incorrect. Both CPU and GPU read memory in 32-128 byte chunks. But an NVidia GPU at 2 GHz can perform 2 billion random reads per second, while a desktop CPU can perform fewer than 100 million reads per second (I tested it on my Haswell 4770 with dual-channel DDR3 memory, so it may be higher on server/DDR4 CPUs). – Bulat Feb 25 '20 at 17:43
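
Picking up on jamesqf's copy-compute-collect point above, here is a minimal CUDA sketch (hypothetical kernel, not from any commenter): one value is broadcast to all 32 lanes of a warp, each lane does its own work on the copy, and merging the per-lane results then costs an explicit warp-level reduction.

```cuda
// Hypothetical example: broadcast one value to a warp, compute 32 different
// partial results, then pay for collecting them with a warp-level reduction.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_compute_collect(const float *x, float *out) {
    // Every lane gets its own copy of the single input value...
    float v = x[0];
    // ...and applies a lane-specific operation to it.
    float partial = v * (threadIdx.x + 1);
    // Collecting the results is extra work: a warp-level tree reduction.
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffffu, partial, offset);
    if (threadIdx.x == 0) out[0] = partial;   // lane 0 holds the merged result
}

int main() {
    float hx = 2.0f, hout = 0.0f, *dx, *dout;
    cudaMalloc(&dx, sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(dx, &hx, sizeof(float), cudaMemcpyHostToDevice);
    copy_compute_collect<<<1, 32>>>(dx, dout);
    cudaMemcpy(&hout, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("merged result: %.1f\n", hout);    // 2 * (1+2+...+32) = 1056
    cudaFree(dx); cudaFree(dout);
    return 0;
}
```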