
A preface: it's well known that IEEE 754 defines five rounding modes (in terms of the 2008 edition, with my abbreviations):

  1. rounding to nearest, ties to even (RNE) - the default mode for binary arithmetic;
  2. rounding to nearest, ties away from zero (RNA) - required only for decimal arithmetic, rarely supported for binary arithmetic;
  3. rounding toward zero (RZ);
  4. rounding toward positive infinity (RPI);
  5. rounding toward negative infinity (RMI).

In the vast majority of cases RNE is used: it is the required default for binary calculations, and many people are unaware of the other modes, or the modes are unavailable (e.g. the Java and Python standard libraries provide no way to change the rounding mode of binary floating-point arithmetic).

In RNE and RNA, there is an explicit requirement to generate infinity when a result is sufficiently larger than the largest representable finite number. Quoting the IEEE 754 draft literally (in TeX-like notation): "an infinitely precise result with magnitude at least b^emax * (b − 0.5 * b^(1−p)) shall round to ∞ with no change in sign". Technically, this is identical to another description of the same approach: a single extra value, the one that would be the next representable number if the exponent range were unlimited (2^128 for "single", 2^1024 for "double"), is added to the supported set during the exact calculation, is allowed as a target point for rounding, and is then converted to infinity when the final result of the operation is packed. (UPD: the variant from @gnasher729, rounding with limited mantissa precision but unlimited exponent and then fitting the result into the exponent limit, is also suitable and even easier to describe.) This corresponds well to the role of "infinity" in floating-point arithmetic, which is to mark a past range overflow, even though formally the result is closer to any finite number than to real infinity.
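For "double" (b = 2, emax = 1023, p = 53) the quoted bound works out to 2^1023 * (2 − 2^−53) = 2^1024 − 2^970, that is, DBL_MAX plus half the spacing between the largest doubles. A minimal sketch of checking this in Python under the default RNE mode, assuming the platform's `float` is IEEE 754 binary64 and that addition is carried out in double precision (the names `HALF_ULP` and `JUST_BELOW` are mine, not from the standard):

```python
import math
import sys

DBL_MAX = sys.float_info.max                 # 2**1024 - 2**971
HALF_ULP = 2.0 ** 970                        # half the spacing of doubles just below 2**1024
JUST_BELOW = HALF_ULP * (1.0 - 2.0 ** -53)   # the double immediately below HALF_ULP

# The exact sum equals the bound 2**1024 - 2**970, so RNE has to return infinity.
at_threshold = DBL_MAX + HALF_ULP

# The exact sum is slightly under the bound, so RNE rounds back down to DBL_MAX.
below_threshold = DBL_MAX + JUST_BELOW

print(math.isinf(at_threshold))
print(below_threshold == DBL_MAX)
```

On a platform that meets these assumptions, both lines should print `True`.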

Unlike the "to nearest" rounding modes, the directed ones dictate a fundamentally different approach: for RZ and RMI, +∞ is never generated on overflow. The maximal finite number in "double" (DBL_MAX) is approximately 1.8e308. If one multiplies 1e308 by 1e308, the result is +∞ for RNE, RNA and RPI, but DBL_MAX for RZ and RMI. When multiplying 1e308 by -1e308, the result is -∞ for RNE, RNA and RMI, but -DBL_MAX for RZ and RPI. This "rounding" is fundamentally a catastrophic accuracy loss (and I won't be ashamed of this emphasis): about 308 decimal orders of magnitude, or 1024 binary ones. (Yes, I see that overflow is signaled in such a case, at least in my tests, and according to the standard. But it's unclear where exactly the overflow happened, and the bogus finite value can spoil subsequent results.)
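Since Python cannot change the rounding mode of binary floats, the sketch below uses the standard `decimal` module as a stand-in: as far as I know it applies the same overflow rule per rounding mode. The toy format (9 digits of precision, Emax = 9) and the helper name `multiply_in_mode` are assumptions made purely for illustration; 1e9 * 1e9 plays the role of 1e308 * 1e308.

```python
from decimal import (Context, Decimal, Overflow,
                     ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Mapping from the abbreviations used in the question to decimal's mode names.
MODES = {
    "RNE": ROUND_HALF_EVEN,
    "RNA": ROUND_HALF_UP,    # ties away from zero
    "RZ": ROUND_DOWN,        # toward zero
    "RPI": ROUND_CEILING,    # toward +infinity
    "RMI": ROUND_FLOOR,      # toward -infinity
}

def multiply_in_mode(a, b, mode):
    """Multiply in a toy decimal format; return (result, overflow_flag)."""
    # traps=[] makes overflow set a flag instead of raising an exception.
    ctx = Context(prec=9, Emax=9, Emin=-9, rounding=MODES[mode], traps=[])
    result = ctx.multiply(Decimal(a), Decimal(b))
    return result, bool(ctx.flags[Overflow])

for mode in MODES:
    positive, _ = multiply_in_mode("1e9", "1e9", mode)
    negative, overflowed = multiply_in_mode("1e9", "-1e9", mode)
    print(mode, positive, negative, overflowed)
```

In this toy format the largest finite value is 9.99999999E+9, and the pattern matches the one described for doubles: the overflow flag is raised in every mode, but RZ and RMI return the largest finite value for the positive product, while RZ and RPI return its negation for the negative one.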

So, finally, the question: why don't the directed rounding modes round to infinity when the result of an operation is far enough outside the representable range, as the "to nearest" modes do? Is this a legacy issue, or an intentional approach? In the latter case, what was the goal?

Netch
  • Why is this rounding catastrophic? – Robert Harvey Nov 29 '15 at 18:24
  • @RobertHarvey "catastrophic accuracy loss", not "catastrophic rounding", sorry for the overly complex construction. – Netch Nov 29 '15 at 18:26
  • Why is it catastrophic accuracy loss if the resulting number is already outside of the representable floating-point range? That's what the infinity is telling you; it's essentially an "out of range" error. Infinity has no accuracy or precision whatsoever. – Robert Harvey Nov 29 '15 at 18:29
  • @RobertHarvey it seems you've misread the post. 1e308*1e308 with RMI or RZ results in the finite number DBL_MAX (1.797693134862316e+308), not infinity, so it can't "tell" anything that infinity tells. – Netch Nov 29 '15 at 18:32
  • I wasn't aware that the spec was that specific. I'm inclined to agree with @gnasher's assessment below (his reasoning about imbibed spirits notwithstanding). In any case, if you need the kind of accuracy that you seem to require at the outer edges of a double-precision floating point number, perhaps a different choice of numeric type is in order. – Robert Harvey Nov 29 '15 at 18:38
  • The only possible reasoning I can think of is to get rid of overflows at all costs. I would however question the practical usefulness of these rounding modes in the first place, particularly RMI and RPI. – biziclop Nov 29 '15 at 19:53
  • This is an educated guess, not an answer: implementations perform calculations with a higher precision than results, and rounding occurs as part of conversion of this higher-precision form to double. I would agree, however, that the best explanation is alcohol. – kdgregory Nov 29 '15 at 23:04
  • @kdgregory whatever we may think of the IEEE guys, alcohol isn't a proper explanation for work done by a committee over a few years, with multiple review iterations. I would suspect a viral mental concept, but, anyway, this is the question: how this concept is named and described. – Netch Nov 30 '15 at 03:45
  • @biziclop if RMI can result in -∞, it's not "get rid of overflows at all costs", sorry. – Netch Nov 30 '15 at 03:47
  • I'm voting to close this question as off-topic because no answer would solve a problem. If a correct answer is found, it's just a curiosity (and I'm guessing a definitive correct answer will not be found). – Scant Roger Nov 30 '15 at 05:52
  • @ScantRoger that's why I posted it here and not on the main SO site. We have a tool whose features aren't used at all, or are misused, or are used without proper caution. Am I missing something important that could lead me to better results? This isn't off-topic for the "programmers" site; it just needs a response from anybody who knows the answer. – Netch Nov 30 '15 at 07:04
  • Are you sure about that? My understanding is that rounding is performed as if there was no limit to the numeric range, and then the result is replaced by infinity if it doesn't fit within the double range. In all rounding modes. The reason why I'm saying this is that if you understand the spec correctly, then the only explanation is too much alcohol while writing the spec :-) – gnasher729 Nov 29 '15 at 17:46
  • When a computation is performed in round-negative mode, the result should be the largest representable number that is not greater than the arithmetically-correct result, if such a number exists. When multiplying 1.0E+300 * 1.0E+300 in RMI mode, the largest representable double is smaller than the arithmetically-correct result. The difference between that value and the arithmetically-correct result may be atypically large, but that doesn't make it any less valid as a result. – supercat Feb 27 '16 at 00:13

0 Answers