Threads lockstep and conditions in compute shader

Question

I'm working with DirectCompute, but I suppose this question applies to GPU programming in general.

As far as I know, the threads in a group execute in lockstep. That means every instruction is executed by every thread at the same time, right? But what if only one thread out of 1024 enters an if/else branch? Will the other 1023 just wait, or will the lockstep be broken?

score 5 · Accepted Answer · edited Apr 13 '17 at 13:00

Not all threads execute in lockstep with each other. They are split into smaller groups, and only the threads within one such group are locked to each other.

This means that if only 1 thread out of all of them enters a branch, only the one group containing that thread needs to enter the branch, while all the other groups skip it.

The group that diverges executes both branches, and each of its threads throws away the result of the branch it didn't need to take.
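To make the behaviour above concrete, here is a minimal CPU-side sketch in Python, not real shader code. It assumes a hypothetical lockstep group size of 4 (`WAVE_SIZE`) to keep the output short, and it models "throwing away the result" as an execution mask that skips inactive lanes. The names `run_wave` and `run_dispatch` are made up for this illustration.

```python
WAVE_SIZE = 4  # assumption for readability; real hardware uses larger groups


def run_wave(thread_ids, condition, then_fn, else_fn):
    """Simulate one lockstep group stepping through an if/else."""
    mask = [condition(t) for t in thread_ids]
    results = [None] * len(thread_ids)
    executed = []  # which sides of the branch this group had to walk through

    # The "then" side is executed only if at least one lane needs it.
    if any(mask):
        executed.append("then")
        for lane, t in enumerate(thread_ids):
            if mask[lane]:  # inactive lanes keep no result from this side
                results[lane] = then_fn(t)

    # The "else" side is executed only if at least one lane needs it.
    if not all(mask):
        executed.append("else")
        for lane, t in enumerate(thread_ids):
            if not mask[lane]:
                results[lane] = else_fn(t)

    return results, executed


def run_dispatch(num_threads, condition, then_fn, else_fn):
    """Split the threads into lockstep groups and run each one."""
    out, paths = [], []
    for start in range(0, num_threads, WAVE_SIZE):
        ids = list(range(start, min(start + WAVE_SIZE, num_threads)))
        results, executed = run_wave(ids, condition, then_fn, else_fn)
        out.extend(results)
        paths.append(executed)
    return out, paths


# Only thread 0 takes the "then" side.
results, paths = run_dispatch(8, lambda t: t == 0, lambda t: t + 100, lambda t: -t)
print(results)  # [100, -1, -2, -3, -4, -5, -6, -7]
print(paths)    # [['then', 'else'], ['else']]
```

Only the first group pays for both sides of the branch; the second group never touches the "then" side at all.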

For more information see this question.

answered May 26 '16 at 12:02 by ratchet freak · edited Apr 13 '17 at 13:00 by Community

1

@nikitablack depends on the hardware and is kinda impossible to extract naively from any marketing information. – ratchet freak May 26 '16 at 12:13
And one more question. Let's say one thread in a group reads from some buffer inside a branch based on that thread's id. Let's say the buffer size is 1, so only the first thread can read and all other threads would read out of bounds. Is that safe? And one more thing: a read from system memory is very slow, but all threads will perform it even if they don't need to, right? – nikitablack May 26 '16 at 12:16
1

@nikitablack that's going to depend on a few things: 1) whether the driver is smart enough to forward the constant in the test into the then clause, 2) whether the hardware can do a "masked read", 3) what exactly happens on out-of-bounds reads – ratchet freak May 26 '16 at 13:12
2

@nikitablack The lockstep group size is 32 for NVIDIA and 64 for AMD (they call them "warps" and "wavefronts" respectively). For Intel the size is variable from 8 to 32, determined by the shader compiler. Also, out-of-bounds reads are defined to just return zero in DirectX, and I'm pretty sure it's the same for GL and the other APIs. – Nathan Reed May 26 '16 at 17:13
1

For the reason Nathan just described, you should always make your work group size a multiple of 64 to avoid wasting shader invocations in the last warp / wavefront of each group. If you don't need to use shared variables, exactly 64 is usually a good size. – russ May 27 '16 at 07:57
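The waste russ mentions can be estimated with a bit of arithmetic. The following is a rough Python sketch, assuming each work group is padded up to a whole number of lockstep groups; `wasted_lanes` is a hypothetical helper name, not part of any API.

```python
def wasted_lanes(group_size, wave_size):
    """Idle lanes per work group when it is padded to whole lockstep groups."""
    waves = -(-group_size // wave_size)  # ceiling division
    return waves * wave_size - group_size


print(wasted_lanes(100, 64))   # 28: two 64-wide groups are needed for 100 threads
print(wasted_lanes(64, 64))    # 0
print(wasted_lanes(1024, 64))  # 0
print(wasted_lanes(32, 64))    # 32: half of the single 64-wide group sits idle
```

A group size that is a multiple of 64 is also a multiple of 32, so under this assumption it leaves no idle lanes at either of the widths quoted above for NVIDIA and AMD.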
