I am writing a program that can be structured in two ways:

  • Render a full-screen quad and pass some additional data to the fragment shader, which then has to run a for-loop.
  • Move the loop to the CPU side and issue the full-screen quad draw call repeatedly. The output is rendered to a texture; I would have two textures and swap the active one (write target) with the inactive one (previous output, read-only).

Which is likely to be faster? I think option 1, but on the other hand loops are not very efficient in GLES shaders. I would also need to calculate texture coordinates on the fly, which according to Apple's documentation is slower than using coordinates that come straight from the vertex shader unchanged.
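To make the two options concrete, here is a minimal CPU-side sketch in Python rather than real GLES code. The per-pixel function `step` is a hypothetical stand-in for the loop body, and the lists stand in for render targets. It only shows that the two structures compute the same result when each iteration depends on the pixel's own previous value; it says nothing about which is faster on a GPU.

```python
def step(value, i):
    """Hypothetical loop body: one iteration applied to a single pixel value."""
    return value * 0.5 + i

def single_pass(pixels, steps):
    # Option 1: one draw call, the fragment shader loops over all steps per pixel.
    out = []
    for p in pixels:
        for i in range(steps):
            p = step(p, i)
        out.append(p)
    return out

def ping_pong(pixels, steps):
    # Option 2: one draw call per step, alternating between two render targets.
    read, write = list(pixels), [0.0] * len(pixels)
    for i in range(steps):
        for idx, p in enumerate(read):   # one full-screen pass
            write[idx] = step(p, i)
        read, write = write, read        # swap the read and write targets
    return read
```

In the second version every step reads a whole target and writes a whole target, which is the extra work the single-pass version avoids.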

Edit:

The second option can be improved as follows:

  • I can put all the quads into a single buffer and draw them with one draw call (each quad has a different depth but is still full-screen), where the depth acts as the "loop" variable. In the shader I read this "depth" and use blending (which is available on iOS 6 via gl_LastFragData[0]).
Martin Perry
  • Random guess: I'd say it really depends on the shader performance of your GPU. The better your GPU, the better off you are with the loop in the shader. Regarding your note: of course recalculating something is slower than just using whatever you got passed. I guess in the end this whole question is about GPU load vs. CPU/memory/bus load. – Mario Oct 05 '13 at 09:40
  • More than likely the GPU. The GPU has tons of cores and crazy multi-threading. Transferring data to the GPU is usually pretty slow. – RandyGaul Oct 05 '13 at 09:43
  • @RandyGaul Transferring data to the GPU is slow, but I will reuse a single vertex buffer with one quad. The other data will just be textures and render targets, so they stay on the GPU. – Martin Perry Oct 05 '13 at 09:46
  • @MartinPerry Well think about it: you call a function that does an action on the GPU. That is a draw call, and takes a similar amount of time to relay to the GPU as does sending some buffer. This is why you hear so much talk about "lowering draw calls". The ideal use of a GPU involves as little communication at all to/from the GPU. – RandyGaul Oct 05 '13 at 09:48
  • @RandyGaul See my edit... – Martin Perry Oct 05 '13 at 10:00
  • I understand from a friend of mine who works as a game developer that a for loop in a shader is a performance killer. It's faster to render the entire scene multiple times than to use a loop. Maybe you can find more info on the subject by taking a look at deferred shading. – Thomas Oct 05 '13 at 14:48
  • @Thomas Well, deferred rendering is a huge performance killer on mobile devices. They have low fillrate. – Martin Perry Oct 05 '13 at 15:31
  • Is the loop a uniform loop? It will be unrolled usually and super efficient on the GPU if it's uniform. It may or may not be slow if it's non-uniform depending on a large number of factors, including the specifics of the loop and target hardware. Try both in your app, profile them on target devices, see which is faster for you in this circumstance. – Sean Middleditch Oct 05 '13 at 18:08
  • @SeanMiddleditch The loop will be uniform (fixed step count), but it can break out early depending on an if condition. – Martin Perry Oct 06 '13 at 09:31
  • If the break is not inside a uniform if-statement, that makes it a non-uniform loop. It helps to understand why non-uniform flow makes shaders "slow", which mostly comes down to how shaders share a certain set of execution resources (including the instruction pointer) within a subset of the execution cores. – Sean Middleditch Oct 06 '13 at 17:49
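A rough way to picture the shared instruction pointer mentioned above is the following sketch. It is a simplified model, not a description of any particular GPU. It assumes fragments run in lockstep groups and that a group keeps iterating until its slowest fragment leaves the loop. The group size of 4 and the iteration counts are made up for illustration.

```python
def lockstep_cost(iterations_per_lane, group_size):
    """Total loop iterations issued when lanes run in lockstep groups.

    Simplified model: a group keeps iterating until its slowest lane exits,
    so every lane in the group pays for the longest loop in that group.
    """
    total = 0
    for start in range(0, len(iterations_per_lane), group_size):
        group = iterations_per_lane[start:start + group_size]
        total += max(group) * len(group)
    return total

uniform = [16] * 8                      # every fragment runs all 16 steps
divergent = [16, 1, 1, 1, 1, 1, 1, 1]   # one fragment runs 16 steps, the rest break after 1

print(lockstep_cost(uniform, 4))        # 128
print(lockstep_cost(divergent, 4))      # 68
print(sum(divergent))                   # 23 iterations of useful work
```

Under this model the early break only saves work when all fragments in a group break together. A single long-running fragment makes its whole group pay for the full loop.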

1 Answer

There's only one way to be sure: implement both and benchmark them on the hardware you care about. I'd be surprised if multiple passes were quicker though, because of the extra reading and writing of render targets required.
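As a back-of-the-envelope illustration of that extra render-target traffic, here is a small Python estimate. It assumes each intermediate target is written once and read once per pixel, and it ignores caching, compression, and tile memory, so treat it as a rough figure only. The resolution, pixel format, and pass count in the example are hypothetical.

```python
def extra_target_traffic_bytes(width, height, bytes_per_pixel, passes):
    """Rough estimate of render-target traffic added by splitting a loop into passes.

    Assumes each pass before the last writes an intermediate target once per pixel,
    and the following pass reads it back once per pixel.
    """
    pixels = width * height
    intermediate = max(passes - 1, 0)
    return intermediate * pixels * bytes_per_pixel * 2   # one write + one read

# Hypothetical example: 1024x768 target, 4 bytes per pixel (RGBA8), 16 passes.
print(extra_target_traffic_bytes(1024, 768, 4, 16) / (1024 * 1024))  # 90.0 (MiB per frame)
```

A single-pass loop keeps its intermediate values in registers, so none of this traffic occurs.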

Multiple identical passes are generally only useful to work round the limitations of what a single pass can do (e.g. shader instruction count limits).

The cases when multiple passes are quicker tend to be one of:

  • The shader can be much simpler with more passes (e.g. a separable blur).

  • The extra pass does something quick and simple which makes the rest of the work much cheaper. For example downsizing the source render target, or filling in the stencil buffer.
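The separable blur case can be sketched on the CPU as follows. This is a plain-Python box blur with clamped edges, chosen here only as an illustration, not GLES code. The single-pass version reads (2r+1)² texels per pixel, while the two-pass version reads 2r+1 texels per pixel in each pass. For r = 4 that is 81 reads against 18.

```python
def clamp(i, n):
    return max(0, min(n - 1, i))

def blur_2d(img, r):
    # Single pass: (2r+1)^2 texture reads per pixel.
    h, w = len(img), len(img[0])
    k = 2 * r + 1
    return [[sum(img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1) for dx in range(-r, r + 1)) / (k * k)
             for x in range(w)] for y in range(h)]

def blur_1d(img, r, horizontal):
    # One separable pass: 2r+1 texture reads per pixel.
    h, w = len(img), len(img[0])
    k = 2 * r + 1

    def tap(y, x, d):
        return img[y][clamp(x + d, w)] if horizontal else img[clamp(y + d, h)][x]

    return [[sum(tap(y, x, d) for d in range(-r, r + 1)) / k
             for x in range(w)] for y in range(h)]

def blur_two_pass(img, r):
    # Horizontal pass into an intermediate target, then a vertical pass.
    return blur_1d(blur_1d(img, r, True), r, False)
```

Both functions produce the same image up to floating-point rounding, so the extra pass pays for itself once the kernel is large enough.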

Adam
  • @Krom Stern Well, actual profiling takes a long time because I need to build both versions (and it's not that simple, since I described a simplified version of the problem here). That's why I asked first. – Martin Perry Oct 06 '13 at 09:30