Numerical Reproducibility

Numerical reproducibility is important, but its importance can be overstated. A dice simulator which always throws a double-six is suspect, and so too is any Monte Carlo algorithm which always gives precisely the same result. In such cases, the variation between runs can alert one to the probable accuracy of the result.

But what of simpler numerical problems? Surely they are entirely deterministic and portable between different computers? Possibly not.

The Maths Library

For simple operations such as addition and multiplication, standards such as IEEE-754 help to ensure that answers are consistent, if not always exact. They often cannot be exact, for a computer can store neither 1/3 nor 1/5 as a finite-length binary fraction, just as humans can write the second as a finite-length decimal fraction, but not the first.
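
This is easy to demonstrate. The short C program below (a minimal sketch) prints both fractions to twenty decimal places, more than a double can meaningfully hold, so the stored, rounded values become visible:

#include <stdio.h>

int main(){
  /* Neither 1/3 nor 1/5 has a finite binary expansion, so each is
     stored as the nearest representable double, and printing to 20
     decimal places exposes the rounding. */
  printf("1/3 = %.20f\n", 1.0/3.0);
  printf("1/5 = %.20f\n", 1.0/5.0);
  return(0);
}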

More complicated operations, including the trigonometric functions, come with a lesser guarantee. Consider the following short piece of Python:

python3 -c 'import math ; print(math.cos(2*math.pi/23))'

Run it on a Linux computer and on a Mac, and slightly different answers may be obtained.

  Linux: 0.9629172873477992
  MacOS: 0.9629172873477994

Those answers differ by just one in the least significant bit, and the exact result lies almost exactly half-way between them.

Python will have used the system maths library: GNU's on Linux and Apple's on macOS, and they are not identical. A reasonable expectation is that the result returned should differ from the exact result by less than one in the least significant bit, and both libraries meet that condition here.

Most other languages, including compiled languages such as C and Fortran, will use the system-provided maths library. Some may even offer faster, less accurate alternatives, or versions optimised for operating on a whole vector at once.
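
The Python one-liner above can be repeated in C, calling the system library directly. The following is a minimal sketch: M_PI comes from POSIX rather than ISO C, and -lm may be needed when linking on Linux.

#include <stdio.h>
#include <math.h>

int main(){
  /* Print cos(2*pi/23) with 17 significant digits, enough to
     distinguish two adjacent doubles. */
  printf("%.17g\n", cos(2*M_PI/23));
  return(0);
}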

So once one starts using trigonometric, or other transcendental, functions, perfect reproducibility between different systems may be lost.

Fused Multiply Add

Most modern CPUs have a fused multiply add (FMA) instruction. This is a single instruction which calculates a*b+c for floating-point numbers a, b and c.

It differs from the traditional two-instruction approach in one important respect: the traditional approach calculates a*b, rounds it, then adds c and rounds again, whereas FMA calculates a*b exactly, adds c, and rounds only once, at the end. So the answer can be different, and, when it is, it is more accurate.
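
C99 also exposes the fused operation directly as the library function fma() in <math.h>, which is required to round only once even when implemented in software. The following minimal sketch uses 2^27+1 and 2^27-1, whose product 2^54-1 just fails to fit in a double's 53-bit significand (on Linux, -lm may be needed when linking):

#include <stdio.h>
#include <math.h>

int main(){
  double a,b,c,p;
  a=(1<<27)+1;   /* 2^27+1, exactly representable */
  b=(1<<27)-1;   /* 2^27-1, so a*b = 2^54-1 exactly */
  c=1<<27;
  c=-c*c;        /* -2^54, exactly representable */

  p=a*b;         /* the product is rounded here, to 2^54 */
  printf("multiply, round, then add: %f\n", p+c);
  printf("fma():                     %f\n", fma(a,b,c));
  return(0);
}

The first line prints zero, the second minus one, the latter being the mathematically exact answer.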

IBM introduced the FMA instruction in 1990 with the birth of its POWER range of CPUs. In the Intel x86-compatible world it arrived with AMD's Piledriver CPUs in 2012 and Intel's Haswell range in 2013. Apple's ARM-based CPUs (M1, M2, etc.) all have it.

Many languages, including C, permit the use of FMA, accepting that it may produce different answers. As a short example:

#include <stdio.h>

int main(){
  double a,b,c;
  a=(1<<30)-1;     /* 2^30-1, exactly representable as a double */
  b=-((1<<30)+1);  /* -(2^30+1), also exactly representable */
  c=1<<30;
  c=c*c;           /* 2^60, a power of two, exactly representable */
  a=a*b+c;         /* a candidate for contraction into a single FMA */

  printf("%f\n",a);
  return(0);
}

Without FMA this will print zero. The variable a is set to 2^30-1, and b to -(2^30+1). Those values can be stored exactly. Their product, being a difference of two squares, is -(2^60-1). That cannot be stored exactly in a standard 64-bit double precision value, so it gets rounded to -2^60. The variable c is set to 2^30, then 2^60, both values being exactly storable. The sum of a*b and c is thus zero.

With FMA, one is printed, for the product a*b is not rounded before the addition. One is the mathematically correct answer.

Apple-ARM$ clang test.c 
Apple-ARM$ ./a.out
1.000000

Pi-5$ clang test.c 
Pi-5$ ./a.out 
1.000000
Pi-5$ gcc test.c
Pi-5$  ./a.out 
0.000000

Pi-5$ clang -ffp-contract=off test.c
Pi-5$ ./a.out 
0.000000

Linux-PC$ clang test.c 
Linux-PC$ ./a.out 
0.000000
Linux-PC$ clang -mfma test.c 
Linux-PC$ ./a.out 
1.000000
Linux-PC$ gcc test.c
Linux-PC$ ./a.out 
0.000000
Linux-PC$ gcc -mfma test.c
Linux-PC$ ./a.out 
0.000000

Above, clang always uses FMA, except when told not to, and except on a Linux PC, where by default the code must remain compatible with older x86 CPUs which lack the FMA instruction unless -mfma is given. Conversely, for this example it seems impossible to make gcc use FMA at all, although it does use it in other circumstances.
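
Contraction can also be controlled in the source rather than on the command line: the C standard defines a pragma for the purpose, although compiler support varies (clang honours it, whereas gcc does not fully implement it). A minimal sketch applying it to the example above:

#include <stdio.h>

/* Forbid contracting a*b+c into a single FMA throughout this file. */
#pragma STDC FP_CONTRACT OFF

int main(){
  double a,b,c;
  a=(1<<30)-1;
  b=-((1<<30)+1);
  c=1<<30;
  c=c*c;

  printf("%f\n",a*b+c);  /* 0.000000 wherever the pragma is honoured */
  return(0);
}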

Bad FMA

It might seem that FMA only ever increases accuracy. This is not so. Consider

#include <stdio.h>

int main(){
  double a,b,c,d,x;
  a=c=(1<<30)+1;   /* 2^30+1 */
  b=d=(1<<30)-1;   /* 2^30-1 */
  x=a*b-c*d;       /* mathematically zero, since a=c and b=d */

  printf("%f\n",x);
  return(0);
}

One might hope that this would always print zero, as a*b-c*d with a=c and b=d is surely zero. But no. Suppose the compiler interprets this as

  tmp=c*d;
  x=FMA(a,b,-tmp); /* i.e. a*b-tmp */

Now the product c*d is rounded to double precision immediately after the multiplication, which means rounding up from 2^60-1 to 2^60. But the product a*b is not rounded. Only after the addition does any rounding occur.

If the compiler does choose to use FMA in this sort of manner, and clang 14 will here, then the result is minus one. It is not even necessary to use a shift as large as 30 to trigger this; a mere 27 suffices.
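
That interpretation can be reproduced explicitly with the library function fma(), without relying on any particular compiler's choices; a minimal sketch:

#include <stdio.h>
#include <math.h>

int main(){
  double a,b,c,d,tmp;
  a=c=(1<<30)+1;
  b=d=(1<<30)-1;

  tmp=c*d;                       /* rounded up from 2^60-1 to 2^60 */
  printf("%f\n", fma(a,b,-tmp)); /* a*b kept exact: (2^60-1)-2^60 = -1 */
  return(0);
}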

Note that FMA is sometimes abbreviated FMAC (Fused Multiply ACcumulate), using an old term for addition in computers.

These examples used clang version 14 and gcc version 12.