complex optimisation in Fortran
Whilst Fortran is generally thought of as being easy for a compiler to optimise, it does not follow that all compilers produce optimal code. One area often thought to be weak is the optimisation of expressions involving complex numbers, because "important" people rarely use them. ("Important" here means people paying high licence fees, rather than theoretical physicists and applied mathematicians with discounted academic licences.)
I first wrote these pages at the end of 2018, and one can still see the old pages. But it is now March 2023, and a good time to revisit the results to see if anything has changed, and to include new compilers such as ifx.
These pages give a snapshot of the state of some major Fortran compilers on x86_64 Linux in March 2023. They should not be used for purchasing decisions if it is no longer the first half of 2023, or, indeed, probably at all, for very synthetic benchmarks like these are not very representative of performance on real-world code. However, these pages might be of some educational interest. Click on the code snippets below for a link to the corresponding full page.
Results
Timings on a 3GHz Kaby Lake, all with -O3, an OpenMP SIMD flag, and something to target the AVX2 instruction set (the exact flags for each compiler are listed below). There is the slight complication that the Kaby Lake likes short loops to be 32-byte aligned, but many compilers merely align them to 16-byte boundaries. This can produce results which are sometimes fast and sometimes slow, depending on the alignment caused by code outside the loop. Both gfortran and nvfortran suffer from this.
| Compiler | conjg | mul | z scale | i scale | r scale | mean |
|---|---|---|---|---|---|---|
| ifort | 0.18ns | 0.38ns | 0.29ns | 0.30ns | 0.22ns | 0.27ns |
| nvfortran | 0.26ns | 0.53ns | 0.41ns | 0.41ns | 0.20ns | 0.36ns |
| gfortran 12 | 0.34ns | 0.69ns | 0.51ns | 0.31ns | 0.35ns | 0.44ns |
| gfortran 11 | 0.34ns | 1.10ns | 0.45ns | 0.43ns | 0.51ns | 0.57ns |
| ifx (2023.1, 2023.2) | 0.87ns | 1.08ns | 0.76ns | 0.34ns | 0.75ns | 0.76ns |
| flang (AOCC) | 0.34ns | 1.30ns | 0.79ns | 0.77ns | 0.79ns | 0.80ns |
| ifx (2022.2) | 0.87ns | 2.09ns | 1.42ns | 1.42ns | 1.42ns | 1.44ns |
Comments
Such synthetic tests never tell the whole story. Real loops generally combine more arithmetic operations, and it would be interesting to see how well the compilers cope then. But the performance differences here are large, particularly between Intel's soon-to-be-discontinued ifort and its replacement, ifx.
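To make that concrete, a hypothetical fused loop of the sort real code might contain is sketched below; alpha and b are my inventions, not part of the benchmarks on these pages.

```
! A hypothetical fused kernel combining a complex scale, a complex
! multiply and a conjugation in one pass; the compiler must now keep
! several real/imaginary temporaries live per element.
complex :: alpha, b(*)
do i=1,n
   c(i)=alpha*c(i)+b(i)*conjg(c(i))
enddo
```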
Compiler versions used
```
$ ifort --version
ifort (IFORT) 2021.7.1 20221019
$ nvfortran --version
nvfortran 22.11-0 64-bit target on x86-64 Linux -tp haswell
$ gfortran --version
GNU Fortran (Ubuntu 11.2.0-19ubuntu1) 11.2.0
$ gfortran-12 --version
GNU Fortran (GCC) 12.1.0
$ flang-amd --version
AMD clang version 14.0.6 (CLANG: AOCC_4.0.0-Build#434 2022_10_28) (based on LLVM Mirror.Version.14.0.6)
$ ifx --version
ifx (IFORT) 2022.2.1 20221020
$ ifx --version
ifx (IFX) 2023.1.0 20230320
```
All were run with -O3 -fopenmp-simd -march=skylake, save for the Intel compilers, which need -qopenmp-simd, and the Nvidia compiler, which needs no SIMD option but does need -march=host to stop it generating AVX-512 instructions.
Code Snippets
Optimising conjugation
```
do i=1,n
   c(i)=conjg(c(i))
enddo
```
Optimising multiplying complex by complex
```
complex :: a(*)
do i=1,n
   c(i)=a(i)*c(i)
enddo
```
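For comparison, the real arithmetic hiding inside that loop looks something like the sketch below: four real multiplies and two real adds per element, which helps explain why mul is the most expensive column in the table above. The temporaries tr and ti are purely illustrative.

```
! The naive real-arithmetic expansion of c(i)=a(i)*c(i): four real
! multiplies and two real adds per element. tr and ti are
! illustrative temporaries only.
real :: tr, ti
do i=1,n
   tr=real(a(i))*real(c(i))-aimag(a(i))*aimag(c(i))
   ti=real(a(i))*aimag(c(i))+aimag(a(i))*real(c(i))
   c(i)=cmplx(tr,ti)
enddo
```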
Optimising scaling complex by complex
```
complex :: a
do i=1,n
   c(i)=a*c(i)
enddo
```
Optimising scaling complex by real
```
real :: a
do i=1,n
   c(i)=a*c(i)
enddo
```
Optimising scaling complex by i
```
do i=1,n
   c(i)=c(i)*(0d0,1d0)
enddo
```
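Multiplication by i needs no floating-point multiplies at all: i(x+iy) = -y+ix, so it is just a swap of the real and imaginary parts with one sign flip. A sketch of the form a good compiler can reduce this loop to, assuming c is double complex as the (0d0,1d0) literal suggests:

```
! Multiplying by i is a swap of real and imaginary parts plus one
! sign flip; no floating-point multiplies are needed.
do i=1,n
   c(i)=cmplx(-aimag(c(i)),dble(c(i)),kind(0d0))
enddo
```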
These were all timed using a vector length of 1,000 (assumed to fit in L1 cache), repeated 10,000,000 times. This should make memory references as cheap as possible, and thus make it easier to see differences in compute effort.
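For completeness, a minimal sketch of such a timing harness is below. The program structure, the names, and the use of system_clock and an !$omp simd directive are my assumptions; the original full pages contain the actual benchmark code.

```
! A minimal sketch of a timing harness matching the description
! above: a length-1000 vector, repeated 10,000,000 times. Names and
! structure are assumptions, not the original benchmark code.
program bench
  implicit none
  integer, parameter :: n=1000, reps=10000000
  complex(kind(1d0)) :: c(n)
  integer :: i, j
  integer(kind=8) :: t0, t1, rate
  c=(1d0,0.5d0)
  call system_clock(t0,rate)
  do j=1,reps
!$omp simd
     do i=1,n
        c(i)=conjg(c(i))
     enddo
  enddo
  call system_clock(t1)
  print *, 'time per element:', &
       1d9*dble(t1-t0)/dble(rate)/(dble(n)*dble(reps)), 'ns'
  ! Use the result so the work is not dead code; in practice more
  ! care may be needed to stop a clever compiler noticing that
  ! conjugating an even number of times is the identity.
  print *, c(1)
end program bench
```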