\myheading{IEEE 754 round-off errors}

\renewcommand{\CURPATH}{IEEE754/roundoff}

\leveldown{}

\myheading{The example from Wikipedia}

This fragment from the \href{https://en.wikipedia.org/wiki/Guard_digit}{Wikipedia article} has caught my attention:

\begin{lstlisting}[caption=Copypasted from Wikipedia]
An example of the error caused by floating point roundoff is illustrated in the following C code.

int main(){
   double a;
   int i;

   a = 0.2; 
   a += 0.1; 
   a -= 0.3;

   for (i = 0; a < 1.0; i++) 
       a += a;

   printf("i=%d, a=%f\n", i, a);

   return 0;
}

It appears that the program should not terminate. Yet the output is :

i=54, a=1.000000
\end{lstlisting}

How is that possible?

First of all, I wrote a Wolfram Mathematica code to unpack IEEE 754 single/double values,
but with absolute precision (like \textit{bignum}), not bounded by 32 or 64 bits.

\begin{lstlisting}[caption=Wolfram Mathematica code]
DoubleExpLen = 11;

DoubleMantissaLen = 52;

DoubleBits = DoubleExpLen + DoubleMantissaLen + 1;

DoubleMaxExponent = BitShiftLeft[1, DoubleExpLen - 1] - 1;

DoubleExpMask = BitShiftLeft[1, DoubleExpLen] - 1;

DoubleMantissaMask = BitShiftLeft[1, DoubleMantissaLen] - 1;

convertDoubleSubnormal[sign_, mantissa_] := (-1)^
   sign*(mantissa/(BitShiftLeft[
      1, ((DoubleBits - (DoubleExpLen + 1)) + DoubleMaxExponent - 1)]))

convertDoubleNaN[sign_, exp_, mantissa_] := 
 If[mantissa == 0 , If[sign == 0, "Inf", "-Inf"], "NaN"]

convertDoubleNormal[sign_, exp_, mantissa_] := (-1)^
   sign*(BitShiftLeft[1, DoubleMantissaLen] + 
     mantissa)/(2^(DoubleMantissaLen - (exp - DoubleMaxExponent)))

convertDouble[sign_, exp_, mantissa_] := 
 If[exp == DoubleExpMask, convertDoubleNaN[sign, exp, mantissa], 
  If[exp == 0, convertDoubleSubnormal[sign, mantissa], 
   convertDoubleNormal[sign, exp, mantissa]]]

convertDoubleQWORD[dw_] := 
 convertDouble[BitShiftRight[dw, DoubleBits - 1], 
  BitAnd[BitShiftRight[dw, DoubleMantissaLen], DoubleExpMask], 
  BitAnd[dw, DoubleMantissaMask]]
\end{lstlisting}

The closest value to 0.1 is...

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625
\end{lstlisting}

What are the previous and the next numbers? I decrease/increase mantissa by one bit here:

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a - 1], 100]
Out[]= 0.09999999999999999167332731531132594682276248931884765625

In[]:= N[convertDoubleQWORD[16^^3fb999999999999a + 1], 100]
Out[]= 0.10000000000000001942890293094023945741355419158935546875
\end{lstlisting}

You see, the first is not exact, but the best approximation possible, in the double-precision IEEE 754 format.

You can't store 0.1 as exact value by the same reason you can't represent $\frac{1}{3}$ value in any binary format.
Well, maybe \href{https://en.wikipedia.org/wiki/Ternary_computer}{unbalanced ternary computer} could,
but it will have problems in representing $\frac{1}{2}$ value.

Now I souped up the code from Wikipedia:

\lstinputlisting{\CURPATH/1.c}

\begin{lstlisting}[basicstyle=\scriptsize,caption=Output]
0.1: 0 01111111011 1001100110011001100110011001100110011001100110011010 0x3fb999999999999a
0.2: 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
0.3: 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.300000 3.000000e-01 0 01111111101 0011001100110011001100110011001100110011001100110100 0x3fd3333333333334
a -= 0.3;
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
...
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00
\end{lstlisting}

After summation of 0.2 and 0.1, FPU reports FE\_INEXACT exception in FPU status word.
It warns: the result can't "fit" into double IEEE 754 exactly/precisely.
The sum in form of 64-bit value is 0x3fd3333333333334.
But at the beginning of our program, we dumped 0.3, and it is 0x3fd3333333333333.
The difference is one lowest bit of mantissa.

Let's see what Mathematica can say about the exact sum:

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= a = N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625

In[]:= b = N[convertDoubleQWORD[16^^3fc999999999999a], 100]
Out[]= 0.200000000000000011102230246251565404236316680908203125

In[]:= a + b
Out[]= 0.3000000000000000166533453693773481063544750213623046875
\end{lstlisting}

That is the result FPU tries to shove into double-precision IEEE 754 register, but can't.

... while the best approximation of 0.3 is:

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= N[convertDoubleQWORD[16^^3fd3333333333333], 100]
Out[]= 0.299999999999999988897769753748434595763683319091796875
\end{lstlisting}

And the result produced by our code (lowest bit of mantissa incremented):

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100]
Out[]= 0.3000000000000000444089209850062616169452667236328125
\end{lstlisting}

Since the rounding mode set by default is FE\_TONEAREST, the FPU rounded it to the nearest value.

Now what is that difference by one lowest bit in mantissa?

\begin{lstlisting}[caption=Wolfram Mathematica notebook]
In[]:= N[convertDoubleQWORD[16^^3fd3333333333334], 100] - 
 N[convertDoubleQWORD[16^^3fd3333333333333], 100]

Out[]= 5.5511151231257827021181583404541015625*10^-17
\end{lstlisting}

This is the 'noise' left in register after subtraction, that isn't equal to zero:

\begin{lstlisting}[basicstyle=\scriptsize,caption=Output]
0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
a!=0
i=0 0.000000 5.551115e-17 0 01111001001 0000000000000000000000000000000000000000000000000000 0x3c90000000000000
i=1 0.000000 1.110223e-16 0 01111001010 0000000000000000000000000000000000000000000000000000 0x3ca0000000000000
i=2 0.000000 2.220446e-16 0 01111001011 0000000000000000000000000000000000000000000000000000 0x3cb0000000000000
i=3 0.000000 4.440892e-16 0 01111001100 0000000000000000000000000000000000000000000000000000 0x3cc0000000000000
...
\end{lstlisting}

When that 'noise' gets summed with itself, it grows:

\begin{lstlisting}[basicstyle=\scriptsize,caption=Output]
...
i=49 0.031250 3.125000e-02 0 01111111010 0000000000000000000000000000000000000000000000000000 0x3fa0000000000000
i=50 0.062500 6.250000e-02 0 01111111011 0000000000000000000000000000000000000000000000000000 0x3fb0000000000000
i=51 0.125000 1.250000e-01 0 01111111100 0000000000000000000000000000000000000000000000000000 0x3fc0000000000000
i=52 0.250000 2.500000e-01 0 01111111101 0000000000000000000000000000000000000000000000000000 0x3fd0000000000000
i=53 0.500000 5.000000e-01 0 01111111110 0000000000000000000000000000000000000000000000000000 0x3fe0000000000000
stop. i=54, a=1.000000 1.000000e+00
\end{lstlisting}

However, you can switch rounding mode to FE\_TOWARDZERO:

\begin{lstlisting}[basicstyle=\scriptsize,caption=Output]
a = 0.2;
0.200000 2.000000e-01 0 01111111100 1001100110011001100110011001100110011001100110011010 0x3fc999999999999a
current rounding direction:  FE_TONEAREST
a += 0.1;
FE_INEXACT
0.299999 2.999999e-01 0 01111111101 0011001100110011001100110011001100110011001100110011 0x3fd3333333333333
a -= 0.3;
0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
a==0
i=0 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=1 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=2 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
i=3 0.000000 0.000000e+00 0 00000000000 0000000000000000000000000000000000000000000000000000 0x0000000000000000
...
\end{lstlisting}

Still FE\_INEXACT value, but that works, because rounding gets other direction.
The sum is ....33333 (not ....33334).
(FE\_DOWNWARD would work as well.)

But in a real-world code it's problematic to set rounding mode for each operation...

\myheading{Conversion from double-precision to single-precision and back}

Now this is even more arcane. We add 0.1 to zero and then subtract it, but the result is not zero:

\lstinputlisting[caption=Stripped version]{\CURPATH/2_clean.c}

\lstinputlisting[caption=Souped up version]{\CURPATH/2.c}

\begin{lstlisting}[caption=Output]
0.1: 0 01111011 10011001100110011001101 0x3dcccccd
a = 0.0;
0.000000 0.000000e+00 0 00000000 00000000000000000000000 0x00000000
FE_INEXACT
a += 0.1;
0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
FE_INEXACT
a -= 0.1;
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
i=0 0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
i=1 0.000000 2.980232e-09 0 01100010 10011001100110011001101 0x314ccccd
i=2 0.000000 5.960465e-09 0 01100011 10011001100110011001101 0x31cccccd
i=3 0.000000 1.192093e-08 0 01100100 10011001100110011001101 0x324ccccd
...
i=25 0.050000 5.000000e-02 0 01111010 10011001100110011001101 0x3d4ccccd
i=26 0.100000 1.000000e-01 0 01111011 10011001100110011001101 0x3dcccccd
i=27 0.200000 2.000000e-01 0 01111100 10011001100110011001101 0x3e4ccccd
i=28 0.400000 4.000000e-01 0 01111101 10011001100110011001101 0x3ecccccd
i=29 0.800000 8.000000e-01 0 01111110 10011001100110011001101 0x3f4ccccd
i=30, a=1.600000 1.600000e+00
\end{lstlisting}

How is that possible the 'noise' gets appeared if we just add a constant value and subtract it?

Luckily, \href{https://beginners.re}{I have some assembly knowledge}, and I can get into the x86-64 code:

\lstinputlisting{\CURPATH/2.s}

But even without that knowledge, we can say something about that code.
It uses 'cvtss2sd' and 'cvtsd2ss' instructions, which converts single-precision value to double-precision and back.

These are:
"Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value"
and
"Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value".

The 'addsd' and 'subsd' instructions are:
"Add Scalar Double-Precision Floating-Point Values"
and
"Subtract Scalar Double-Precision Floating-Point Value".

(The whole operation is happens in SIMD registers.)
Also, the 0.1 constant is encoded as a 64-bit double-precision value (.LC6).

In other words, everything calculates here in double-precision, but stored as single-precision values.

The first FE\_INEXACT exception raises when double-precision zero is summed with double-precision 0.1 value and shoved into
single-precision register.

\begin{lstlisting}
In[]:= N[convertFloatDWORD[16^^3dcccccd], 100]
Out[]= 0.100000001490116119384765625
\end{lstlisting}

This is the best possible approximation of 0.1 in single-precision.
Now you convert it to double-precision.
You'll get something like 0.10000000149011611... But this is not the best possible approximation of 0.1 in double-precision!
The best possible approximation is:

\begin{lstlisting}
In[]:= N[convertDoubleQWORD[16^^3fb999999999999a], 100]
Out[]= 0.1000000000000000055511151231257827021181583404541015625
\end{lstlisting}

You subtract $ \approx 0.1000000000000000055511151...$ (best possible approximation) from $ \approx 0.10000000149011611...$
and get $ \approx 0.00000000149011611...$
Of course, it gets normalized, the mantissa is shifted and exponent aligned, leaving $\approx 1.490116 \cdot 10^{-9}$.
After converting this to a single-precision value, it left as the 'noise':

\begin{lstlisting}
...
0.000000 1.490116e-09 0 01100001 10011001100110011001101 0x30cccccd
a!=0
...
\end{lstlisting}

Again, if summing the 'noise' with itself, it grows.

What if it's compiled for 80x87 FPU support? As it was done before SIMD?

\begin{lstlisting}
% gcc -mfpmath=387 2.c -S -masm=intel -lm

...

        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        faddp   st(1), st
        fstp    DWORD PTR -8[rbp]

...

        fld     DWORD PTR -8[rbp]
        fld     QWORD PTR .LC8[rip]
        fsubp   st(1), st
        fstp    DWORD PTR -8[rbp]
...


.LC8:
        .long   2576980378
        .long   1069128089
\end{lstlisting}

You see, one operand for faddp/fsubp instruction (add/sub) loaded as a 32-bit value (single-precision),
but another (which is constant) as a 64-bit value (double-precision).

The solution: switching 'float' to 'double' in that code would eliminate problem.

\myheading{The moral of the story}

IEEE 754 is hard.
Mathematically flawless expressions and algorithms can go crazy out of the blue.

\href{https://en.wikipedia.org/wiki/Round-off_error}{Wikipedia: Round-off error}.

A well-known introduction is
\href{https://cr.yp.to/2005-590/goldberg.pdf}{What Every ComputerScientist Should Know About Floating-Point Arithmetic}
by David Goldberg.

The \href{http://www.numerical.recipes/}{Numerical Recipes} book is also recommended,
as well as D.Knuth's TAOCP. % TODO acronym

The field is \href{https://en.wikipedia.org/wiki/Numerical_analysis}{numerical analysis}.

\myheading{All the files}

(Including the Wolfram Mathematica notebook I used.)

\MakeURL{\RepoPath\CURPATH/files}

\levelup{}