How to improve this shader's performance?

Question

I have a scene with 150,000 instances, rendered with GLSL and OpenGL 4.0. Shader A is twice as slow as shader B: with shader A I get 20 fps, with shader B 40 fps on average. What can I do to improve shader A?

Shader A:

#version 400

struct Light {
   vec3 position;
   vec3 intensities; //a.k.a the color of the light
   float ambientCoefficient;
   float attenuation;
};

uniform bool useLight;
uniform mat4 modelMatrix;
uniform bool useTex;
uniform sampler2D tex;
uniform Light light;
uniform vec4 diffuseColor;

in vec2 fragTexCoord;
in vec3 fragNormal;
in vec3 fragVert;

out vec4 finalColor;

void main() {
    vec3 normal = normalize(transpose(inverse(mat3(modelMatrix))) * fragNormal);
    vec3 surfacePos = vec3(modelMatrix * vec4(fragVert, 1));

    vec4 surfaceColor = vec4(0,0,0,1);

    if(useTex) {
        surfaceColor = texture(tex, fragTexCoord);
    }
    else {
        //surfaceColor = diffuseColor;
        surfaceColor = vec4(0,1,0,1);
    }

    if(useLight) {
        vec3 surfaceToLight = normalize(light.position - surfacePos);

        //ambient
        vec3 ambient = light.ambientCoefficient * surfaceColor.rgb * light.intensities;

        //diffuse
        float diffuseCoefficient = max(0.0, dot(normal, surfaceToLight));
        vec3 diffuse = diffuseCoefficient * surfaceColor.rgb * light.intensities;

        //attenuation
        float distanceToLight = length(light.position - surfacePos);
        float attenuation = 1.0 / (1.0 + light.attenuation * pow(distanceToLight, 2));

        //linear color (color before gamma correction)
        vec3 linearColor = ambient + attenuation*(diffuse);

        //final color (after gamma correction)
        vec3 gamma = vec3(1.0/2.2);
        finalColor = vec4(pow(linearColor, gamma), surfaceColor.a);
    }
    else {
        finalColor = surfaceColor;
    }
}

Shader B:

#version 400

struct Light {
   vec3 position;
   vec3 intensities; //a.k.a the color of the light
   float ambientCoefficient;
   float attenuation;
};

uniform bool useLight;
uniform mat4 modelMatrix;
uniform bool useTex;
uniform sampler2D tex;
uniform Light light;
uniform vec4 diffuseColor;

in vec2 fragTexCoord;
in vec3 fragNormal;
in vec3 fragVert;

out vec4 finalColor;

void main() {
    finalColor = vec4(0,0,0.7,1);
}

@LukeG I agree; however, I would not be the least bit surprised if it got a lot more traction here. OpenGL is perhaps a little niche there versus bread-and-butter here. An analogy would be asking for advice on a shell script on Unix SE. – Jared Smith, Aug 25 '16 at 15:50
@LukeG - it's also the case that there's nothing specifically wrong with this code when reviewed in isolation. One must also consider the platform it is run on, a GPU, and the performance characteristics of that platform in order to get the fuller picture. – Maximus Minimus, Aug 25 '16 at 17:16
I might be missing something here; if so, please forgive me. But are you just asking why the code with considerably fewer operations is faster than the other? – Doddy, Aug 26 '16 at 09:43

score 15 · kolenda · Accepted Answer · 2016-08-25T11:11:47.053

First, precompute as much data as you can and avoid recomputing the same values for every pixel.

Your shader contains this expression:

transpose(inverse(mat3(modelMatrix)))

This inverts the matrix, which is not a trivial operation. The input is the same for every pixel, so the result is the same too, yet the shader recomputes it for every pixel. Compute it once on the CPU before rendering and pass the result as another uniform matrix, as you already do with modelMatrix.
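A minimal sketch of the shader side of this change. The uniform name normalMatrix is hypothetical, and the sketch assumes the application uploads transpose(inverse(mat3(modelMatrix))) whenever modelMatrix changes. The output is only a debug visualisation so that the fragment stays short:

```glsl
#version 400

// Hypothetical uniform: computed on the CPU as
// transpose(inverse(mat3(modelMatrix))) and uploaded when modelMatrix changes.
uniform mat3 normalMatrix;

in vec3 fragNormal;

out vec4 finalColor;

void main() {
    // One mat3 * vec3 multiply per fragment; no per-fragment inverse().
    vec3 normal = normalize(normalMatrix * fragNormal);

    // Debug output: remap the normal from [-1, 1] to [0, 1] and show it as a colour.
    finalColor = vec4(normal * 0.5 + 0.5, 1.0);
}
```

In shader A, the same replacement would apply to the first line of main(); the rest of the lighting code would stay unchanged.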

Later, you normalize the (light.position - surfacePos) vector and also compute its length separately, which results in two sqrt operations instead of one.
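One possible way to share that work is sketched below. It computes the squared distance with a dot product, reuses it for the attenuation (which only needs the square), and normalizes with a single inversesqrt. The normal handling is simplified and assumes fragNormal is already in world space; the variable names toLight and distSq are illustrative:

```glsl
#version 400

struct Light {
   vec3 position;
   vec3 intensities;
   float ambientCoefficient;
   float attenuation;
};

uniform Light light;
uniform mat4 modelMatrix;

in vec3 fragNormal;
in vec3 fragVert;

out vec4 finalColor;

void main() {
    vec3 surfacePos = vec3(modelMatrix * vec4(fragVert, 1.0));

    // Compute the difference vector once and reuse it.
    vec3 toLight = light.position - surfacePos;
    float distSq = dot(toLight, toLight);                 // squared distance, no sqrt
    vec3 surfaceToLight = toLight * inversesqrt(distSq);  // the only root for this vector

    // The attenuation needs the squared distance, so pow() is not required.
    float attenuation = 1.0 / (1.0 + light.attenuation * distSq);

    // Simplified diffuse term (assumes fragNormal is in world space).
    float diffuseCoefficient = max(0.0, dot(normalize(fragNormal), surfaceToLight));
    finalColor = vec4(attenuation * diffuseCoefficient * light.intensities, 1.0);
}
```

Whether this is measurably faster depends on the compiler and hardware, so it is worth benchmarking rather than assuming.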

Additionally, depending on your hardware, you may find that using ifs in a pixel shader lowers your performance. If that's the case, you could prepare a few distinct versions of your shader and batch your instances according to their useLight and useTex properties.
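One common way to build such versions from a single source is the GLSL preprocessor. The sketch below assumes the application compiles one program per combination and inserts the matching #define lines right after the #version line. The macro names USE_TEX and USE_LIGHT and the lightColor uniform are hypothetical, and the lighting is reduced to a single multiply to keep the example short:

```glsl
#version 400

// Assumption: the application injects these defines per variant.
// Both are enabled here; remove one to get a different variant.
#define USE_TEX
#define USE_LIGHT

uniform sampler2D tex;
uniform vec3 lightColor;   // hypothetical stand-in for the full Light struct

in vec2 fragTexCoord;

out vec4 finalColor;

void main() {
#ifdef USE_TEX
    vec4 surfaceColor = texture(tex, fragTexCoord);
#else
    vec4 surfaceColor = vec4(0.0, 1.0, 0.0, 1.0);
#endif

#ifdef USE_LIGHT
    // Placeholder for the real lighting computation.
    finalColor = vec4(surfaceColor.rgb * lightColor, surfaceColor.a);
#else
    finalColor = surfaceColor;
#endif
}
```

Each compiled variant then contains no runtime branch; the cost moves to switching programs between batches, which is why grouping instances by variant matters.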

EDIT:

You may also try lowering the version declared in your shaders to the lowest one that supports the features you use. In theory it shouldn't matter much, but in practice the result may differ depending on the drivers and hardware vendor (e.g. a GPU that supports OpenGL 4.0 is often fast in OpenGL 3.0 but very slow in 4.0, though you need to test this for your specific case).

edited Aug 25 '16 at 11:11

answered Aug 25 '16 at 11:04


If vec3 normal depends only on fragment variables, wouldn't the GPU know that and avoid the redundant computation? – user68854 Aug 25 '16 at 11:17
@user68854 - GPUs generally don't work like that: they typically just plough through the work. Your shader compiler might be able to identify this, but it might not, and whether it does or not is not defined by the GL spec; in other words you shouldn't rely on it. – Maximus Minimus Aug 25 '16 at 11:19
@user68854 I don't think so. While your shader compiler may optimize some code for you, it can't do this across distinct executions of your shader; it simply doesn't have a 'scratchpad' to put that common data in. So it MAY optimize your double sqrt case, but even if it realized that the result of your inverse is constant, there's no place on the GPU to store this data. It just doesn't work this way (AFAIK). – kolenda Aug 25 '16 at 11:27
@kolenda It might be interesting to ask on http://computergraphics.stackexchange.com/ if such an optimization is possible. – porglezomp Aug 26 '16 at 01:02

This confuses me: transpose(inverse(mat3(modelMatrix))). Shouldn't the transpose of a 3x3 rotation matrix already be the inverse? – Sidar Nov 26 '17 at 15:25
@Sidar No, because that is a model matrix, which may contain a scale component. Scaling breaks the orthonormal properties of the matrix, so the inverse is no longer equal to the transpose. Taking the transpose of the inverse like that is how a normal matrix is calculated, because the 3x3 portion of the model matrix alone will not properly transform normals when non-uniform scaling is involved. See this for more info on such matrices. – Lemon Drop Jun 08 '21 at 14:36

score 3 · Answer 2 · answered Aug 25 '16 at 11:01

Looking at this line in particular:

vec3 normal = normalize(transpose(inverse(mat3(modelMatrix))) * fragNormal);

Working out the inverse of a matrix is very taxing. It should be precalculated on the CPU instead of forcing the GPU to waste time on it for every fragment.

score 3 · Answer 3 · answered Aug 25 '16 at 11:08

There's not a whole lot you can do; the simple reality is that shader A does more work than shader B, so it's always going to run slower.

We can close the gap somewhat, however. I can't give you definite figures for how much, because it all depends on the performance characteristics of the rest of your program, so treat these as general good practice.

transpose(inverse(mat3(modelMatrix)))

That's a lot of inversions and transpositions per frame. Do it on the CPU instead, one time only (or at least only when modelMatrix changes), and send the inverse/transpose matrix as an additional uniform. If ALU operations are a bottleneck for you, this should give you the biggest increase.

if(useTex) {

Branching is not the death that it used to be, but you can still avoid it here (and save a uniform slot) by creating a 1x1 texture (of the appropriate colour) and binding that instead.
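On the shader side, the 1x1 texture trick might look like the sketch below. It assumes the application binds either the real texture or a 1x1 texture of the fallback colour to tex before each draw, and that the sampler uses a wrap mode such as the default GL_REPEAT, so any texture coordinate returns the single texel:

```glsl
#version 400

uniform sampler2D tex;   // the real texture, or a 1x1 texture of the flat colour

in vec2 fragTexCoord;

out vec4 finalColor;

void main() {
    // No useTex uniform and no branch: untextured instances sample the
    // 1x1 texture, which yields the same colour for every coordinate.
    finalColor = texture(tex, fragTexCoord);
}
```

The trade-off is one texture fetch for untextured instances in exchange for removing the branch, so it is worth measuring both versions.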

if(useLight) {

More branching. In this case there's no obvious alternative (such as the 1x1 texture) so I'd encourage you to split this condition off into a third shader and benchmark both (i.e. 2 shaders with a branch versus 3 shaders without). Depending how often you need to change shaders you may or may not see a performance difference when compared to branching.

A 1x1 texture seems like an experimental solution; it would be better to make another shader. Anyway, why is branching costly? Is a conditional jump on a GPU core expensive? – user68854, Aug 25 '16 at 11:21
@user68854 - GPUs don't work the same way as CPUs. Older GPUs, in particular, had no native support for branching at all but instead executed both sides of the branch then used a step instruction to select the correct one. Nowadays branching is cheaper (but still not free) so it's worth benching if a shader change is cheaper or not. Using a 1x1 texture is a well-known trick, please see http://stackoverflow.com/questions/22703166/in-glsl-is-it-better-to-branch-or-look-up-from-a-dummy-texture for example. – Maximus Minimus, Aug 25 '16 at 11:54
Branching may be costly because of pipeline optimizations. While your GPU executes one operation, it is in fact already preparing the next few. If it encounters an if, it may prepare for only one branch; if the other one is taken, that work is wasted and has to be restarted. The same happens on CPUs. The trick with the 1x1 texture may or may not help: it all depends on the specific shader, GPU vendor, architecture, drivers, etc., so you just need to test it yourself. – kolenda, Aug 25 '16 at 12:02
