WebGL packing/unpacking functions that can roundtrip all typical 32-bit floats

Question

I have a WebGL circuit simulator. One of the problems it has is that, due to using quite a lot of intermediate float textures as it simulates, it doesn't work on various mobile devices. They only support byte textures.

My intended solution to this problem is to encode the high-precision (i.e. 32-bit) floats as bytes. Every output float is packed into a nearly-IEEE format (I put the sign bit at the other end to avoid a few shifts, I don't do denormalized values, and I don't do infinities/NaNs). Similarly, every input is unpacked before being used.

I have found various blog posts and answers related to this task out on the internet (example 1, example 2, example 3), but I haven't found any that work properly on all finite non-denormalized floats.

The problem I'm running into is precision. I want to round trip the floats without introducing any error, but I can't seem to make a shader that preserves all 23 bits of the mantissa. There always seems to be some rounding on some machine that loses the last bit, though I can perturb the cases where the rounding happens and it happens differently on the various machines I've tested on.

Here is my packing method:

vec4 packFloatIntoBytes(float val) {
    if (val == 0.0) {
        return vec4(0.0, 0.0, 0.0, 0.0);
    }

    float mag = abs(val);
    float exponent = floor(log2(mag));
    // Correct log2 approximation errors.
    exponent += float(exp2(exponent) <= mag / 2.0);
    exponent -= float(exp2(exponent) > mag);

    float mantissa;
    if (exponent > 100.0) {
        // Not sure why this needs to be done in two steps for the largest float to work.
        // Best guess is the optimizer rewriting '/ exp2(e)' into '* exp2(-e)',
        // but exp2(-128.0) is too small to represent.
        mantissa = mag / 1024.0 / exp2(exponent - 10.0) - 1.0;
    } else {
        mantissa = mag / float(exp2(exponent)) - 1.0;
    }

    float a = exponent + 127.0;
    mantissa *= 256.0;
    float b = floor(mantissa);
    mantissa -= b;
    mantissa *= 256.0;
    float c = floor(mantissa);
    mantissa -= c;
    mantissa *= 128.0;
    float d = floor(mantissa) * 2.0 + float(val < 0.0);
    return vec4(a, b, c, d) / 255.0;
}

And here's my unpacking method:

float unpackBytesIntoFloat(vec4 v) {
    float a = floor(v.r * 255.0 + 0.5);
    float b = floor(v.g * 255.0 + 0.5);
    float c = floor(v.b * 255.0 + 0.5);
    float d = floor(v.a * 255.0 + 0.5);

    float exponent = a - 127.0;
    float sign = 1.0 - mod(d, 2.0)*2.0;
    float mantissa = float(a > 0.0)
                   + b / 256.0
                   + c / 65536.0
                   + floor(d / 2.0) / 8388608.0;
    return sign * mantissa * exp2(exponent);
}

This method is close. It works on my laptop, as far as I can tell. But it doesn't work on my Nexus tablet. For example, the float -0.20717763900756836 should be encoded as [124, 168, 76, 193]. When I unpack that then repack it on the Nexus tablet the output is one ulp lower: [124, 168, 76, 191] (which encodes -0.2071776[5390872955]). Close, but I want perfect.

Mostly I'm at a loss trying to figure out where the precision is being destroyed in this method. Changes almost seem to have random effects, like replacing x * exp2(n) with x / exp2(-n) might fix an error in one place but introduce an error in another place.

Is there any exact way to pack floats into bytes, without losing precision?

Some example values that such a method should work on:

var testValues = new Float32Array([
    0,
    0.5,
    1,
    2,
    -1,
    1.1,
    42,
    16777215,
    16777216,
    16777218,
    0.9999999403953552, // An ulp below 1.
    1.0000001192092896, // An ulp above 1.
    Math.pow(2.0, -126), // Smallest non-denormalized 32-bit float.
    0.9999999403953552 * Math.pow(2.0, 128), // Largest finite 32-bit float.
    Math.PI,
    Math.E
]);

I'm voting to close this question as off-topic because it's about representing floating-point numbers; nothing here is specific to graphics. — Dan Hulme, Oct 20 '16 at 11:40
@DanHulme There seem to be quite a lot of questions on this SE that are "how does X work" with respect to GPUs and opengl/dx. I happen to be only indirectly using the results for drawing things, but that doesn't affect whether the question is useful to other people. Closing feels like splitting hairs to me, and the line I'm over is not at all clear to new users. You're going to be spending all your mod time closing questions if they all have to be sufficiently abstract computer science tasks and no concrete "make GPU do X" questions. — Craig Gidney, Oct 20 '16 at 13:38
We're still quite a new site, so we don't yet have a lot of consistency in closing questions. This one in particular feels to me like a better fit for SO than for us, because it's "how do I do this specific thing with this library" - about WebGL as a programming environment rather than as a graphics platform. But I'm only one close-voter, so we'll see how the community thinks as a whole. Either way, thank you for helping us to think about our scope. — Dan Hulme, Oct 20 '16 at 17:22
On the subject of the question itself, do you get the same problem just from calling the two functions, with no texture? Can you be sure the errors aren't being introduced by some texture filtering? — Dan Hulme, Oct 20 '16 at 17:25
@DanHulme Not sure what you mean by "just calling them". I do have Javascript equivalents of the two functions. The javascript variants work on all the given test values, though that was easier since the intermediate calculations are done with 64 bit precision. I use the JS methods as a comparison tool when debugging the shaders. The shaders do give the right answers on my laptop... just not on tablets and phone. — Craig Gidney, Oct 20 '16 at 17:56
I personally feel this is pretty on topic. The heart of real time computer graphics is often "how can i get my data to the GPU in an efficient and accurate way" and getting a float to a shader is right in that vein. This comes up all the time when doing shadertoy things for example, and likely the best person to answer would be one of those folks who have solved it there, on shadertoy or in demoscene type code, who would be here IMO. — Alan Wolfe, Oct 20 '16 at 17:58
@Craig Gidney example 1 is incorrect but the fixed version works well with values between 0 and 1 (i use a similar version to convert a 24 bit depth buffer value into rgb8 texture with no precision loss). — Raxvan, Oct 27 '16 at 12:57
@Raxvan How did you "fix" it? Floats get more precise as you travel to 0, but the linked example-1 method is a 32-bit fixed-point thing. There's nearly a billion representable values in the range from 2^-126 to 2^-32, and they all get totally trashed by that kind of approach. — Craig Gidney, Oct 27 '16 at 14:51
@Craig Gidney take the example 1 as it is and replace all 256.0 values with 255.0. just remember that that method works for values between 0 and 1 actually ever 1 just [0..1) — Raxvan, Oct 27 '16 at 15:40
@Raxvan Oh. You should replace some of the 256s, but not all of them. And it does round the many values under 2^-32 into 0. — Craig Gidney, Oct 27 '16 at 15:42
Just want to make sure that it's not the sampler. Make sure you are using nearest neighbour sampling only. The sampler will attempt to produce a linear combination of a pixel and its neighbours, so don't do that. If you are just slightly off centre then you might be weighing in a value from a neighbouring pixel. — Wyck, Jan 09 '18 at 18:59
@Wyck Yes, I was using nearest neighbor sampling: https://github.com/Strilanc/Quirk/blob/master/src/webgl/WglTexture.js#L168 — Craig Gidney, Jan 09 '18 at 21:27

WebGL packing/unpacking functions that can roundtrip all typical 32-bit floats

0 Answers0