7

I want to know the algorithm for converting a given float (e.g., 3.14) to its binary representation in memory.

I read this Wikipedia page, but it only covers the conversion in the other direction.

Let me quickly give the details, from the same Wikipedia page:

(Image: the 32-bit floating-point representation, showing the sign, exponent, and fraction bit fields.)

As you know, the value of the floating-point number is calculated as:

$value = (-1)^{sign} \times 2^{exponent-127} \times fraction$

where

$ exponent = 2^2 + 2^3 + 2^4 + 2^5 + 2^6 $

$ fraction = 1 + 2^{-2} $

in this example, so the value works out to $1.25 \times 2^{124-127} = 0.15625.$ Please check the Wikipedia page for more detailed info.
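
For example, this direction can be checked with a quick Python sketch that evaluates the formula above for a given 32-bit pattern, field by field (the helper name is arbitrary):

    def decode_binary32(pattern):
        sign = pattern >> 31
        exponent = (pattern >> 23) & 0xFF                  # here: 124
        fraction = 1 + (pattern & 0x7FFFFF) / 2.0 ** 23    # here: 1 + 2**-2
        return (-1) ** sign * 2.0 ** (exponent - 127) * fraction

    print(decode_binary32(0b0_01111100_01000000000000000000000))   # 0.15625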

So we can calculate the float value from a given binary representation, but how can we do the opposite algorithmically?

Thanks in advance.

Sait
  • In what form is the input given? – dtldarek May 19 '12 at 12:59
  • Your question is unclear and there is a serious risk of confusion. What is the input representation, among decimal [with fractional part] and IEEE single-precision floating-point? What is the output representation, among binary integer and string of 0/1 [or other]? – Dec 28 '15 at 08:30

4 Answers

5


Example: Convert 50.75 (in base 10) to binary.

First step (converting 50 (in base 10) to binary):

  1. We divide 50 by 2, which gives 25 with a remainder of 0.
  2. Next, we divide 25 by 2, which gives 12 with a remainder of 1.
  3. We continue like this until the quotient reaches 0: dividing 12, 6, 3, and 1 in turn gives remainders 0, 0, 1, and 1.
  4. We read the remainders from last to first (bottom to top), which gives 50 = 110010 in binary.

Second step (converting 0.75 (in base 10) to binary):

  1. We multiply 0.75 by 2, which gives 1.5. We keep only the integer part, 1, as the first fractional bit. Then we take 1.5 - 1 = 0.5 and continue.
  2. We multiply 0.5 by 2, which gives 1. We keep the integer part, 1, and since 1 - 1 = 0 we are done.
  3. We read the kept integer parts from first to last (top to bottom), which gives 0.75 = 0.11 in binary. Putting the two parts together, 50.75 = 110010.11 in binary.

This method can also be used to convert a number to octal or hexadecimal: the methodology is the same, just divide and multiply by 8 or 16 instead of 2.
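
Here is a minimal Python sketch of the method (the function name and the cap on fractional digits are my own choices; the cap matters because a fraction such as 0.1 never terminates in binary):

    def to_binary(value, max_frac_digits=24):
        # Integer part: repeated division by 2; the remainders are read
        # from last to first.  Assumes value >= 0.
        int_part = int(value)
        frac_part = value - int_part
        int_digits = ""
        while int_part > 0:
            int_part, remainder = divmod(int_part, 2)
            int_digits = str(remainder) + int_digits
        int_digits = int_digits or "0"

        # Fractional part: repeated multiplication by 2; the integer
        # parts are read from first to last.
        frac_digits = ""
        while frac_part > 0 and len(frac_digits) < max_frac_digits:
            frac_part *= 2
            frac_digits += str(int(frac_part))
            frac_part -= int(frac_part)

        return int_digits + ("." + frac_digits if frac_digits else "")

    print(to_binary(50.75))   # prints 110010.11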

Georan
  • Can I get a comment with the down vote? – Georan Oct 18 '15 at 16:01
  • The next step after the above is to convert to binary scientific notation so that the number is of the form $1.YYYYYYY \times 2^e$. If $e$ falls in the range of ordinarily represented numbers (not so big that it becomes $\mathrm{Inf}$, not so small that the number is subnormal), then the digits in $YYYYYYY$ become the fraction (mantissa) part of the floating-point number. I'm not 100% sure how sign bits are handled (e.g., a two's-complement flip as with integers?), but you can inspect the binary representation of numbers to learn. – Sean Lake Sep 05 '16 at 04:21
2

I'll consider a general fixed-precision floating-point system, which includes any of the standard computer floating-point number types such as the single- or double-precision formats of the IEEE 754 standard. But I will only consider the normal numbers within such a system, not the zeros, denormals, infinities, and other special cases.

A system of floating-point representation of this general type is described by four integers: \begin{array}{cl} b & \text{the base, also known as the radix,}\\ m & \text{the number of digits in the significand, also called the mantissa,}\\ E_\min & \text{the minimum value of the exponent,}\\ E_\max & \text{the maximum value of the exponent,} \end{array} where $b>1,$ $m>0,$ and $E_\max \geq E_\min.$

For example, the IEEE 754-2008 binary32 (single-precision) floating-point format illustrated in the question sets $b=2,$ $m=24,$ $E_\min=-126,$ and $E_\max=127.$ Note that the most significant digit of the significand is always $1$ in this format, so it is omitted from the stored bits (only the other $23$ bits of the significand need to be stored), and this format reserves two of the possible $8$-bit exponent values for special cases, so that the exponents for normal numbers are stored as $8$-bit integers ranging from $00000001_\mathrm{binary} = 1_\mathrm{decimal}$ to $11111110_\mathrm{binary} = 254_\mathrm{decimal}$; adding the exponent bias $-127$ to these values determines the minimum and maximum exponent values.

The value of a particular floating-point number in such a system is $$ \sigma \times n \times b^{E-m+1} = \sigma \times \left(\sum_{k=0}^{m-1} s_{-k}b^{-k}\right) \times b^E \tag1$$ where $\sigma \in \{-1,1\},$ $n$ is an $m$-digit integer in base $b$ representing the significand (a most significant digit $s_0 \in \{1,\ldots,b-1\}$ followed by $m-1$ less significant digits $s_{-1}$ through $s_{-(m-1)}$ selected from $\{0,\ldots,b-1\}$), and $E$ is an integer such that $E_\min\leq E \leq E_\max.$

To convert a real number $x$ to the floating-point number system with parameters $b,$ $m,$ $E_\min,$ and $E_\max,$ proceed as follows:

  1. Find $E$ such that $b^E \leq \lvert x\rvert < b^{E+1}.$ There are various ways to do this: take the integer part of $\log_b \lvert x\rvert$; multiply or divide $\lvert x\rvert$ by powers of $b$ to get a number in the range $[1, b)$ and count how many powers of $b$ were needed; or take positive or negative powers of $b$ until you find one near enough to $\lvert x\rvert.$

  2. If the result of the previous step does not satisfy $E_\min \leq E \leq E_\max,$ you have an exception (overflow or underflow) and must act according to however you have decided to deal with such exceptions. Otherwise, take $y = \lvert x\rvert \times b^{m-E-1}.$

  3. Round $y$ to the nearest integer, $n,$ using whatever rounding rule you have selected. For the IEEE 754 "round to nearest, ties to even" rule, let $n = \left\lceil y - \frac12\right\rceil$ if $\lfloor y\rfloor$ is even and $n = \left\lfloor y + \frac12\right\rfloor$ if $\lfloor y\rfloor$ is odd; for "round to nearest, ties away from zero," let $n = \left\lfloor y + \frac12\right\rfloor.$ (If rounding yields $n = b^m,$ replace it with $n = b^{m-1}$ and increase $E$ by $1,$ rechecking that $E \leq E_\max.$)

  4. Convert $n$ to base $b$ using any of the usual methods for converting an integer to a base-$b$ representation. The digits of this representation are $s_0, s_{-1}, \ldots, s_{-(m-1)},$ with $s_0$ the most significant digit.

  5. Set $\sigma$ to the sign of $x.$ You now have all the parts of the floating-point value shown in Equation $1$.

To produce the bitwise representation of an IEEE 754 binary floating-point number as illustrated in the question, set the sign bit to $0$ if $\sigma=1,$ $1$ if $\sigma=-1.$ Subtract the exponent bias from $E,$ write the result as an unsigned binary integer, and set the bits of the exponent to that value, padding on the left with zeros to fill the prescribed number of bits of the exponent. Finally, set the bits of the "fraction" to the bits $s_{-1},\ldots,s_{-(m-1)}.$


Example: Convert $3.14$ to the IEEE 754-2008 binary32 format. We have $b=2,$ $m=24,$ $E_\min=-126,$ and $E_\max=127.$

We find that $b^1 = 2^1 \leq \lvert 3.14\rvert < 2^2 = b^2,$ so $E = 1.$ Since $E_\min \leq 1\leq E_\max,$ we set $y = \lvert 3.14\rvert \times b^{m-E-1} = \lvert 3.14\rvert\times 2^{22} = 13170114.56.$ Following either of the "round to nearest" rules, we get $n = 13170115.$ Converting this to binary, $n = 110010001111010111000011_\mathrm{binary}.$

Since $3.14$ is positive, we put $0$ in bit $31.$ The exponent bias for this format is $-127,$ so since $E=1$ and $1 - (-127) = 128 = 10000000_\mathrm{binary},$ we put $10000000$ in bits $23$ through $30,$ inclusive. Finally, we put $10010001111010111000011$ (obtained by removing the most significant bit of $n$) in bits $0$ through $22,$ inclusive.
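
For concreteness, here is a Python sketch of the whole procedure for binary32 (the function name is mine, and only normal, finite, nonzero inputs are handled, as in the steps above). The last two lines check the result against the encoding produced by Python's struct module:

    import math
    import struct

    def to_binary32(x):
        # IEEE 754-2008 binary32 parameters as defined in this answer:
        # base, significand digits, exponent range, exponent bias.
        b, m, e_min, e_max, bias = 2, 24, -126, 127, -127

        sigma = -1 if x < 0 else 1
        ax = abs(x)

        # Step 1: find E with b**E <= |x| < b**(E+1).
        e = math.floor(math.log2(ax))
        if b ** (e + 1) <= ax:      # guard against log2 rounding error
            e += 1
        elif ax < b ** e:
            e -= 1
        assert e_min <= e <= e_max, "overflow/underflow not handled here"

        # Step 2: scale so that rounding yields an m-digit significand.
        y = ax * b ** (m - e - 1)

        # Step 3: round to nearest, ties to even.
        n = math.ceil(y - 0.5) if math.floor(y) % 2 == 0 else math.floor(y + 0.5)
        if n == b ** m:             # rounding spilled into an extra digit
            n //= b
            e += 1                  # (a full version would recheck e <= e_max)

        # Steps 4-5: assemble the sign bit, the biased exponent, and the
        # 23 stored fraction bits (the hidden leading 1 is dropped).
        sign_bit = 0 if sigma == 1 else 1
        exponent_bits = e - bias            # e - (-127) = e + 127
        fraction_bits = n - b ** (m - 1)
        return (sign_bit << 31) | (exponent_bits << 23) | fraction_bits

    print(format(to_binary32(3.14), "032b"))
    # 01000000010010001111010111000011
    print(format(struct.unpack("<I", struct.pack("<f", 3.14))[0], "032b"))
    # the same bit pattern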

David K
0

I'm no expert but I came across this question whilst trying to figure this out for myself and think I've got a handle on it.

The binary representation of any positive number can be determined uniquely, because if you take any positive number n and apply (a runnable Python version of the idea, which also counts the halvings and doublings):

    exponent = 0        # counts the halvings (+1) and doublings (-1)
    while True:
        if n >= 2:
            n /= 2
            exponent += 1
        elif n < 1:
            n *= 2
            exponent -= 1
        else:
            break

you will end up with a number n, with 1 <= n < 2.

  • If you divide 2.0 by 2 you get 1.0, which breaks the loop.
  • If you take 0.9999... * 2, you end up with something slightly less than 2 (but at least 1), which also breaks the loop.
So a break condition always exists.

  • If you multiply any number 1 <= n < 2 by 2, you get something greater than or equal to 2.
  • If you divide any number 1 <= n < 2 by 2, you get something less than 1.
This means the number you end up with (and hence the exponent) is also unique.

The number of times you divide or multiply by 2 will determine the power of two (exponent).

If you take the resulting number 1 <= n < 2, it can always be uniquely expressed as 1 + a(1/2) + b(1/4) + c(1/8) + d(1/16) + ... where each coefficient a, b, c, d, ... is either 1 or 0. (The expansion may ultimately be recurring.)

Think of a circle. Split it in two: now there are two halves. Split one half in two and you now have 1/2 + 1/4 + 1/4. Split one quarter in two and you have 1/2 + 1/4 + 1/8 + 1/8. Effectively, by doing this indefinitely, you end up with the sum 1/2 + 1/4 + 1/8 + ... = 1.

If we take any given number, we can decide whether to include each segment of that circle (each term in the sequence). Take the float 1.85, for example.

  • We start with 1 (1.)
  • 0.85 > 0.5 (1/2) so we include 1/2 (1.1), leaving 0.35
  • 0.35 > 0.25 -> include 1/4 (1.11), leaving 0.1
  • 0.1 < 0.125 -> don't include 1/8 (1.110)
  • 0.1 > 0.0625 -> include 1/16 (1.1101), leaving 0.0375
  • 0.0375 > 0.03125 -> include 1/32 (1.11011), leaving 0.00625
  • 0.00625 < 0.015625 -> don't include 1/64 (1.110110)
  • 0.00625 < 0.0078125 -> don't include 1/128 (1.1101100)

And so on until you've used up all available bits.

This should be pretty easy to implement in code. There may well be more efficient ways to do it though - I haven't seen any implementations or played around with it myself! Have fun.
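
For instance, here is a minimal Python sketch of the greedy expansion just described (the function name and default bit count are illustrative):

    def fraction_bits(n, num_bits=7):
        # Expand a number 1 <= n < 2 as 1.bbbbbbb in binary: at each step,
        # include the next piece (1/2, 1/4, 1/8, ...) if it still fits.
        bits = ""
        remainder = n - 1
        piece = 0.5
        for _ in range(num_bits):
            if remainder >= piece:
                bits += "1"
                remainder -= piece
            else:
                bits += "0"
            piece /= 2
        return "1." + bits

    print(fraction_bits(1.85))   # prints 1.1101100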

0

I'm not sure how familiar you are with binary representations: I assume you have some basic knowledge. If I'm not making sense, just ask. If you just want the fixed-point binary representation, you can just take the fraction and add the (hidden) most significant bit, which is assumed to be 1.

If you just want the integer value (rounded towards zero), you can shift the result left by $\max(-24, \text{exponent} - 127)$ bits (this can be negative, so this might mean that you have to shift it to the right). Now negate the result if the sign bit is set.

If you want a fixed-point binary representation, shift the result left by $\text{exponent}$ bits, and negate the result if the sign bit is set. Now you always have a fixed-point representation (the same way the fraction in the 24 bits is represented, only in that case the MSB, which is one, is missing) of a maximum of 152 bits. The last 127 bits are bits 'after the dot'.

You might need quite some memory for this: significantly more than for a normal 32- or 64-bit binary number.
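
Here is a minimal Python sketch of the integer-value case (my own illustration: it reads the bit pattern with Python's struct module, re-attaches the hidden bit, and shifts; the shift by exponent - 150 is the shift described above, re-expressed relative to the 24-bit significand taken as an integer):

    import struct

    def truncate_via_bits(x):
        # Read the raw binary32 pattern of x, then round toward zero
        # using only integer bit operations (normal numbers only).
        pattern = struct.unpack("<I", struct.pack("<f", x))[0]
        sign = pattern >> 31
        exponent = (pattern >> 23) & 0xFF
        significand = (pattern & 0x7FFFFF) | (1 << 23)  # re-attach hidden 1

        # The significand carries 23 fraction bits, so the binary point
        # sits exponent - 127 - 23 = exponent - 150 bits from its right end.
        shift = exponent - 150
        value = significand << shift if shift >= 0 else significand >> -shift
        return -value if sign else value

    print(truncate_via_bits(50.75))   # prints 50
    print(truncate_via_bits(-3.14))   # prints -3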

Ruben