Here are my two implementations. The first one relies on branches, which can be a problem on in-order CPUs, but it uses very little memory:
#include <stdint.h>

/* This method is faster than the OpenEXR implementation (very often
 * used, e.g. in Ogre), with the additional benefit of rounding, inspired
 * by James Tursa’s half-precision code. */
static inline uint16_t float_to_half_branch(uint32_t x)
{
    uint16_t bits = (x >> 16) & 0x8000; /* Get the sign */
    uint16_t m = (x >> 12) & 0x07ff; /* Keep one extra bit for rounding */
    unsigned int e = (x >> 23) & 0xff; /* Using int is faster here */

    /* If zero, or denormal, or exponent underflows too much for a denormal
     * half, return signed zero. */
    if (e < 103)
        return bits;

    /* If NaN, return NaN. If Inf or exponent overflow, return Inf. */
    if (e > 142)
    {
        bits |= 0x7c00u;
        /* If exponent was 0xff and one mantissa bit was set, it means NaN,
         * not Inf, so make sure we set one mantissa bit too. */
        bits |= e == 255 && (x & 0x007fffffu);
        return bits;
    }

    /* If exponent underflows but not too much, return a denormal */
    if (e < 113)
    {
        m |= 0x0800u;
        /* Extra rounding may overflow and set mantissa to 0 and exponent
         * to 1, which is OK. */
        bits |= (m >> (114 - e)) + ((m >> (113 - e)) & 1);
        return bits;
    }

    bits |= ((e - 112) << 10) | (m >> 1);
    /* Extra rounding. An overflow will set mantissa to 0 and increment
     * the exponent, which is OK. */
    bits += m & 1;
    return bits;
}
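Note that the function takes the float’s raw bit pattern, not the float itself. Here is a minimal usage sketch (not part of the original code, assuming a C99 compiler) that type-puns a float through memcpy:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 3.14159265f;
    uint32_t x;
    /* Copy the float into its IEEE 754 bit pattern; memcpy avoids
     * strict-aliasing problems. */
    memcpy(&x, &f, sizeof(x));
    uint16_t h = float_to_half_branch(x);
    printf("%f -> 0x%04x\n", f, h); /* prints 3.141593 -> 0x4248 */
    return 0;
}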
The second one uses only lookup tables; this costs some memory and can therefore suffer from cache misses if the function is called only occasionally:
/* These macros implement a finite iterator useful to build lookup
 * tables. For instance, S64(0) will call S1(x) for all values of x
 * between 0 and 63.
 * Since each level expands to four copies of the previous one, the
 * exponential growth can put significant stress on the compiler. */
#define S4(x) S1((x)), S1((x)+1), S1((x)+2), S1((x)+3)
#define S16(x) S4((x)), S4((x)+4), S4((x)+8), S4((x)+12)
#define S64(x) S16((x)), S16((x)+16), S16((x)+32), S16((x)+48)
#define S256(x) S64((x)), S64((x)+64), S64((x)+128), S64((x)+192)
#define S1024(x) S256((x)), S256((x)+256), S256((x)+512), S256((x)+768)
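To see what these macros do, here is a hypothetical illustration that is not part of the conversion code: with S1 defined to square its argument, S16(0) expands at compile time to the first sixteen squares.

/* Hypothetical example only: fills the array with 0, 1, 4, ..., 225. */
#define S1(n) ((n) * (n))
static int const first_squares[16] = { S16(0) };
#undef S1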
/* Lookup table-based algorithm from “Fast Half Float Conversions”
 * by Jeroen van der Zijp, November 2008. No rounding is performed,
 * and some NaN values may be incorrectly converted to Inf. */
static inline uint16_t float_to_half_nobranch(uint32_t x)
{
    static uint16_t const basetable[512] =
    {
#define S1(i) (((i) < 103) ? 0x0000 : \
               ((i) < 113) ? 0x0400 >> (113 - (i)) : \
               ((i) < 143) ? ((i) - 112) << 10 : 0x7c00)
        S256(0),
#undef S1
#define S1(i) (0x8000 | (((i) < 103) ? 0x0000 : \
                         ((i) < 113) ? 0x0400 >> (113 - (i)) : \
                         ((i) < 143) ? ((i) - 112) << 10 : 0x7c00))
        S256(0),
#undef S1
    };

    static uint8_t const shifttable[512] =
    {
#define S1(i) (((i) < 103) ? 24 : \
               ((i) < 113) ? 126 - (i) : \
               ((i) < 143 || (i) == 255) ? 13 : 24)
        S256(0), S256(0),
#undef S1
    };

    uint16_t bits = basetable[(x >> 23) & 0x1ff];
    bits |= (x & 0x007fffff) >> shifttable[(x >> 23) & 0x1ff];
    return bits;
}
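Because the table version truncates where the branch version rounds, the two functions can disagree by one ULP. Here is a quick sketch of my own (not from either reference implementation) with an input chosen to sit exactly between two half-precision values:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.00048828125f; /* 1 + 2^-11, halfway between two halves */
    uint32_t x;
    memcpy(&x, &f, sizeof(x));
    printf("branch:   0x%04x\n", float_to_half_branch(x));   /* 0x3c01, rounded up */
    printf("nobranch: 0x%04x\n", float_to_half_nobranch(x)); /* 0x3c00, truncated */
    return 0;
}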