Complex optimisation in Fortran

Whilst Fortran is generally thought of as being easy for a compiler to optimise, it does not follow that all compilers produce optimal code. One area often thought to be weak is the optimisation of expressions involving complex numbers, because "important" people rarely use them. ("Important" here means people paying high licence fees, rather than theoretical physicists and applied mathematicians with discounted academic licences.)

I first wrote these pages at the end of 2018, and one can still see the old pages. But it is now March 2023: a good time to revisit the results to see if anything has changed, and to include new compilers such as ifx.

These pages give a snapshot of the state of some major Fortran compilers on x86_64 Linux in March 2023. It should not be used for purchasing decisions if it is no longer the first half of 2023, or, indeed, probably at all: very synthetic benchmarks like these are not very representative of performance on real-world code. However, these pages might be of some educational interest. Click on the code snippets below for a link to the corresponding full page.

Results

Timings on a 3 GHz Kaby Lake, all with -O3 -fopenmp-simd and an option to target the AVX2 instruction set (the precise flags are given below). There is the slight complication that the Kaby Lake likes short loops to be 32-byte aligned, but many compilers merely align them to 16-byte boundaries. This can produce results which are sometimes fast and sometimes slow, depending on the alignment caused by code outside of the loop. Both gfortran and nvfortran suffer from this.
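(For gfortran one can experiment with removing this source of variation by forcing wider alignment via GCC's -falign-loops option. This was not used for the timings below, and the source file name here is purely illustrative:)

$ gfortran-12 -O3 -fopenmp-simd -march=skylake -falign-loops=32 scalez.f90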

                     conjg   mulz    scalez  scaler  scalei  mean
ifort                0.18ns  0.38ns  0.29ns  0.30ns  0.22ns  0.27ns
nvfortran            0.26ns  0.53ns  0.41ns  0.41ns  0.20ns  0.36ns
gfortran 12          0.34ns  0.69ns  0.51ns  0.31ns  0.35ns  0.44ns
gfortran 11          0.34ns  1.10ns  0.45ns  0.43ns  0.51ns  0.57ns
ifx (2023.1, 2023.2) 0.87ns  1.08ns  0.76ns  0.34ns  0.75ns  0.76ns
flang (AOCC)         0.34ns  1.30ns  0.79ns  0.77ns  0.79ns  0.80ns
ifx (2022.2)         0.87ns  2.09ns  1.42ns  1.42ns  1.42ns  1.44ns

The columns correspond, in order, to the code snippets below.

Comments

Such synthetic tests never tell the whole story. Real loops generally combine more arithmetic operations, and it would be interesting to see how well the compilers cope then. But the performance differences here are large, particularly between Intel's soon-to-be-discontinued ifort and its replacement ifx.
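As a sketch of the sort of loop meant (illustrative only, not one of the tests timed above), a complex axpy combines a scale with an add:

  complex :: a, b(*), c(*)
  do i=1,n
     c(i)=a*b(i)+c(i)
  enddo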

Compiler versions used

$ ifort --version
ifort (IFORT) 2021.7.1 20221019
$ nvfortran --version
nvfortran 22.11-0 64-bit target on x86-64 Linux -tp haswell
$ gfortran --version
GNU Fortran (Ubuntu 11.2.0-19ubuntu1) 11.2.0
$ gfortran-12 --version
GNU Fortran (GCC) 12.1.0
$ flang-amd --version
AMD clang version 14.0.6 (CLANG: AOCC_4.0.0-Build#434 2022_10_28) (based on LLVM Mirror.Version.14.0.6)
$ ifx --version
ifx (IFORT) 2022.2.1 20221020
$ ifx --version
ifx (IFX) 2023.1.0 20230320

All were run with -O3 -fopenmp-simd -march=skylake, save for the Intel compilers, which need -qopenmp-simd, and the Nvidia compiler, which needs no SIMD option but does need -march=host to stop it generating AVX-512 instructions (which the Kaby Lake used here does not support).
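By way of example, the Intel runs were therefore of the form below (the source file name is illustrative):

$ ifort -O3 -qopenmp-simd -march=skylake mulz.f90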

Code Snippets

Optimising conjugation

  do i=1,n
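     ! conjugation just negates the imaginary part: ideally a single sign flip per element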
     c(i)=conjg(c(i))
  enddo
  

Optimising multiplying complex by complex

  complex :: a(*)
  do i=1,n
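     ! a full complex multiply: four real multiplies and two adds per element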
     c(i)=a(i)*c(i)
  enddo
  

Optimising scaling complex by complex

  complex :: a
  do i=1,n
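     ! as mulz, but a is loop-invariant, so its real and imaginary parts can stay in registers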
     c(i)=a*c(i)
  enddo
  

Optimising scaling complex by real

  real :: a
  do i=1,n
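     ! only two real multiplies per element, and no adds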
     c(i)=a*c(i)
  enddo
  

Optimising scaling complex by i

  do i=1,n
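     ! (x,y)*i = (-y,x): just a swap and a sign flip, no multiplies needed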
     c(i)=c(i)*(0d0,1d0)
  enddo
  

These were all timed using a vector length of 1,000 (assumed to fit in L1 cache) and repeating the loop 10,000,000 times. This should make memory references as cheap as possible, and thus make differences in compute effort easier to see.
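The real harness is in the tar file below; a minimal sketch of the idea, with illustrative names, single precision assumed, and the scale-by-complex kernel, would be something like:

  program time_scalez
    implicit none
    integer, parameter :: n=1000, reps=10000000
    complex :: a, c(n)
    integer :: i, j
    integer(kind=8) :: t0, t1, rate
    a=(0.6,0.8)      ! (approximately) unit modulus, so c neither overflows nor underflows
    c=(1.0,0.0)
    call system_clock(t0,rate)
    do j=1,reps
       !$omp simd
       do i=1,n
          c(i)=a*c(i)
       enddo
    enddo
    call system_clock(t1)
    print *,'ns per element:',1d9*(t1-t0)/(rate*dble(n)*dble(reps))
    print *,sum(c)   ! use the result, to stop the compiler eliding the loops entirely
  end program time_scalez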

Tar file of example code