# Numerical Reproducibility

Numerical reproducibility is important, but its importance can be overstated. A dice simulator which always throws a double six is suspect, and so too is any Monte Carlo algorithm which always gives precisely the same result. In such cases, the variation between runs can alert one to the probable accuracy of the result.

But what of simpler numerical problems? Surely they are entirely deterministic and portable between different computers? Possibly not.

## The Maths Library

For simple operations such as addition and multiplication, standards such as IEEE-754 help to ensure that answers are consistent, if not always precise. They often cannot be precise, for a computer can store neither 1/3 nor 1/5 as a finite-length binary fraction, just as humans can write the second as a finite-length decimal fraction, but not the first.
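This is easy to see in Python, where the `fractions` module can compare the double actually stored with the exact rational it approximates (a quick sketch of the point above, nothing more):

```python
from fractions import Fraction

# 1/5 is exact as a decimal fraction (0.2), but the nearest
# double to it is a binary fraction slightly above 1/5.
print(Fraction(0.2) == Fraction(1, 5))   # False
print(Fraction(0.2) > Fraction(1, 5))    # True

# 1/3 has a finite representation in neither base.
print(Fraction(1 / 3) == Fraction(1, 3))  # False
```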

More complicated operations, including the trigonometric functions, come with a lesser guarantee. Consider the following short piece of Python:

```
python3 -c 'import math ; print(math.cos(2*math.pi/23))'
```

Run it on a Linux computer and on a Mac, and slightly different answers may be obtained.

```
Linux: 0.9629172873477992
MacOS: 0.9629172873477994
```

Those answers differ by just one in the least significant bit, and the exact result lies almost exactly half-way between them.

Python will have used the system maths library: GNU's on Linux, and Apple's on MacOS, and they are not identical. A reasonable expectation is that the difference between the result returned and the exact result should be less than one least significant bit, and both meet that condition in this case.
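The size of that discrepancy can be measured in Python itself, using `math.ulp` to obtain the value of one least significant bit at this magnitude (a small sketch, reusing the two results quoted above):

```python
import math

linux_result = 0.9629172873477992
macos_result = 0.9629172873477994

# Express the gap between the two results in units in the
# last place (ulps) at this magnitude.
gap_in_ulps = (macos_result - linux_result) / math.ulp(linux_result)
print(gap_in_ulps)
```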

Most other languages, including compiled languages such as C and Fortran, will use the system-provided maths library. Some may even offer faster, less accurate alternatives, or versions optimised for operating on a whole vector at once.

So once one starts using trigonometric, or other transcendental, functions, perfect reproducibility between different systems may be lost.

## Fused Multiply Add

Most modern CPUs have a fused multiply add (FMA) instruction. This is a single instruction which calculates `a*b+c` for `a`, `b` and `c` floating point numbers.

It differs from the traditional two instruction approach in one important respect. The traditional approach calculates `a*b`, rounds it, then adds `c` and rounds again. The FMA approach calculates `a*b` exactly, adds `c`, and then rounds. So the answer can be different, and, when it is, it is more accurate.

IBM introduced the FMA instruction in 1990 with the birth of its POWER range of CPUs. In the Intel x86-compatible world, it first appeared in 2013 with Intel's Haswell range of CPUs and AMD's Piledriver range. Apple's ARM-based CPUs (M1, M2, etc) all have it.

Many languages, including C, permit the use of FMA, accepting that it may produce different answers. As a short example:

```c
#include <stdio.h>

int main(){
  double a,b,c;
  a=(1<<30)-1;
  b=-((1<<30)+1);
  c=1<<30;
  c=c*c;
  a=a*b+c;
  printf("%f\n",a);
  return(0);
}
```

Without FMA this will print zero. The variable `a` is set to 2^{30}-1, and `b` to -(2^{30}+1). Those values can be stored exactly. Their product is -(2^{60}-1). That cannot be stored exactly in a standard 64-bit double precision value, so it gets rounded to -2^{60}. The variable `c` is set to 2^{30}, then 2^{60}, both values being exactly storable. The sum of `a*b` and `c` is thus zero.

With FMA, one is printed, for the product `a*b` is not rounded. One is mathematically the correct answer.

```
Apple-ARM$ clang test.c
Apple-ARM$ ./a.out
1.000000
Pi-5$ clang test.c
Pi-5$ ./a.out
1.000000
Pi-5$ gcc test.c
Pi-5$ ./a.out
0.000000
Pi-5$ clang -ffp-contract=off test.c
Pi-5$ ./a.out
0.000000
Linux-PC$ clang test.c
Linux-PC$ ./a.out
0.000000
Linux-PC$ clang -mfma test.c
Linux-PC$ ./a.out
1.000000
Linux-PC$ gcc test.c
Linux-PC$ ./a.out
0.000000
Linux-PC$ gcc -mfma test.c
Linux-PC$ ./a.out
0.000000
```

Above, clang always uses FMA, except when told not to, and except on a Linux PC, where compatibility with old CPUs lacking the FMA instruction might be required. Conversely, for this example it seems impossible to make gcc use FMA at all, although it does use it in other circumstances.

### Bad FMA

It might seem that FMA only ever increases accuracy. This is not so. Consider

```c
#include <stdio.h>

int main(){
  double a,b,c,d,x;
  a=c=(1<<30)+1;
  b=d=(1<<30)-1;
  x=a*b-c*d;
  printf("%f\n",x);
  return(0);
}
```

One might hope that this would always print zero, as `a*b-c*d` with `a=c` and `b=d` is surely zero. But no. Suppose the compiler interprets this as

```c
tmp=c*d;
x=FMA(a,b,-tmp); /* i.e. a*b-tmp */
```

Now the product `c*d` is rounded to double precision immediately after the multiplication, which means rounding up from 2^{60}-1 to 2^{60}. But the product `a*b` is not rounded. Only after the addition does any rounding occur.

If the compiler does choose to use FMA in this sort of manner, and clang-14 will here, then the result is minus one. It is not even necessary to use a shift as large as 30 to trigger this; a mere 27 suffices.
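The shift-27 case can be checked with the same exact-arithmetic emulation in Python (a sketch; `fma_emulated` is my own helper, not a standard function):

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # One rounding: a*b+c computed exactly, then rounded once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = c = float(2**27 + 1)
b = d = float(2**27 - 1)

# Both products round identically (2**54-1 rounds up to 2**54),
# so the plain difference is zero.
print(a * b - c * d)             # 0.0

# With one product fused, c*d is rounded up but a*b is not.
tmp = c * d
print(fma_emulated(a, b, -tmp))  # -1.0
```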

Note that FMA is sometimes abbreviated FMAC (Fused Multiply ACcumulate), using an old term for addition in computers.

These examples used clang version 14 and gcc version 12.