
The following describes my understanding of floating point representations.

(For numbers $b, m, E_{min}, E_{max} \in \mathbb{N} \setminus \{0\}$, $b>1$, $E_{min} \leq E_{max}$...)

Let $F(b, m, E_{min}, E_{max})$ be the set of all real numbers $x \in \mathbb{R}$ that can be represented as

$x = \sigma \cdot (\sum_{i=0}^{m-1} s_i b^{-i}) \cdot b^E$

where $\sigma \in \{-1, 1\}$, $s_i \in \{0, ..., b-1\}$, $s_0 \neq 0$, $E \in \{E_{min}, ..., E_{max}\}$.

$F(b, m, E_{min}, E_{max})$ is then called a floating point range.

The mapping of an arbitrary real number $x \in \mathbb{R}$ to its closest floating point representation $x' \in F(b, m, E_{min}, E_{max})$, where the floating point range is given, is called rounding.

Now my question is how to do the rounding for an arbitrary real number, given an arbitrary floating point range (where the base $b$ is not necessarily $2$ or $10$). Once I have this, I want to know how I can convert a floating point representation from one floating point range into another floating point range, for example one with a different base. There won't always be an exact representation, so I have to find the closest one again. How do I do this?

Thank you in advance.

1 Answer


To convert a real number $x$ to the floating-point range $F(b,m, E_\min, E_\max),$ follow the procedure described in this answer to an earlier question.

To convert from the floating-point range $F(b,m, E_\min, E_\max)$ to a different floating-point range, $F(b',m', E'_\min, E'_\max),$ just take the (rounded) number you found in the range $F(b,m, E_\min, E_\max)$ (which is a real number, call it $\bar x$) and follow the procedure above, using the parameters of $F(b',m', E'_\min, E'_\max)$ (for example, find $E'$ such that $b'^{E'} \leq |\bar x| < b'^{E'+1}$). You don't need to be concerned with how the new rounded number will compare with the original number $x$; don't even think about the original number $x$ while doing the conversion.
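The procedure referred to above is not reproduced here, so the following is only a minimal sketch of one way the rounding could be done with exact rational arithmetic. The function names `round_to_range` and `value` are hypothetical, and several choices are assumptions rather than part of the definition: ties are rounded away from zero, and zero, underflow and overflow raise an error (the set $F$ as defined contains no zero, because $s_0 \neq 0$).

```python
from fractions import Fraction

def round_to_range(x, b, m, e_min, e_max):
    """Round x to the nearest element of F(b, m, e_min, e_max).

    Returns (sigma, digits, E) with digits = [s_0, ..., s_{m-1}].
    Assumptions: ties round away from zero; zero and out-of-range
    exponents raise ValueError.
    """
    x = Fraction(x)
    if x == 0:
        raise ValueError("0 is not representable because s_0 != 0")
    sigma = 1 if x > 0 else -1
    a = abs(x)
    # Find E with b**E <= a < b**(E+1).
    E = 0
    while Fraction(b) ** (E + 1) <= a:
        E += 1
    while Fraction(b) ** E > a:
        E -= 1
    # Scale so the significand is an integer with m base-b digits,
    # i.e. scaled lies in [b**(m-1), b**m).
    scaled = a / Fraction(b) ** (E - (m - 1))
    n = int(scaled + Fraction(1, 2))  # floor(scaled + 1/2)
    if n == b ** m:  # rounding carried into an extra digit
        n //= b
        E += 1
    if not e_min <= E <= e_max:
        raise ValueError("exponent outside [e_min, e_max]")
    digits = []
    for _ in range(m):
        n, d = divmod(n, b)
        digits.append(d)
    digits.reverse()
    return sigma, digits, E

def value(sigma, digits, E, b):
    """Exact real value sigma * (sum_i s_i b^{-i}) * b^E as a Fraction."""
    return sigma * sum(d * Fraction(b) ** (E - i) for i, d in enumerate(digits))

# Round x = 1000/3 into F(10, 4, 1, 5), then convert the result to F(3, 5, 1, 20).
rep10 = round_to_range(Fraction(1000, 3), 10, 4, 1, 5)  # (1, [3, 3, 3, 3], 2)
x_bar = value(*rep10, 10)                               # 3333/10
rep3 = round_to_range(x_bar, 3, 5, 1, 20)               # (1, [1, 1, 0, 1, 0], 5)
```

The second call only ever sees $\bar x = 333.3$, never the original $1000/3$, which mirrors the advice above. Passing a Python `float` would round the exact binary value of that float, so decimal inputs are better given as `Fraction` or as strings such as `Fraction("9999.6")`.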

You can do the arithmetic for the conversion in base $b$ if you find that practical; otherwise, represent $\bar x = \sigma \cdot (\sum_{i=0}^{m-1} s_i b^{-i}) \cdot b^E$ in whatever way you can write it exactly and still do arithmetic on it (for example, a base-ten integer times a power of $b$).
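As an illustration of the "integer times a power of $b$" form, here is a small sketch that turns the digits $s_0, \ldots, s_{m-1}$ into an ordinary integer $N$ and an exponent $k = E - (m-1)$, so that $\bar x = N \cdot b^k$ exactly. The helper name `as_integer_times_power` is hypothetical, and Python's `Fraction` is just one convenient way to keep the arithmetic exact.

```python
from fractions import Fraction

def as_integer_times_power(sigma, digits, E, b):
    """Return (N, k) such that x_bar = N * b**k exactly.

    N is the signed integer whose base-b digits are s_0 ... s_{m-1};
    k = E - (m - 1) shifts the radix point back into place.
    """
    m = len(digits)
    n = 0
    for d in digits:
        n = n * b + d  # Horner's scheme in base b
    return sigma * n, E - (m - 1)

# Base-3 digits 1,1,0,1,0 with E = 5: N = 111 (written in base ten), k = 1.
N, k = as_integer_times_power(1, [1, 1, 0, 1, 0], 5, 3)
x_bar = N * Fraction(3) ** k  # exact value 333
```

Once $\bar x$ is held in this form, finding $E'$ and the new digits in base $b'$ needs only integer and rational arithmetic.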

David K