matmul optimisation in Fortran

Fortran has an intrinsic operation for matrix-matrix multiplication. Different compilers optimise this differently, with obvious choices including: expanding it inline as nested loops, calling a routine in the compiler's own run-time library, or calling an optimised external BLAS library.

To test what compilers do, a simple example code was used, matmul.f90. It takes about two minutes to run (or rather less if the compiler does a good job!). The kernel is

  c = c + matmul(a,b)

where a, b and c are square matrices. The BLAS matmul call, dgemm, is quite general, calculating

  c = alpha*a*b + beta*c

where alpha and beta are scalars. So the test used here can be reduced to a single call to dgemm with alpha=beta=1, as shown by dgemm.f90. It could also be expanded as

  temp = matmul(a,b)
  c = c + temp

where temp is a temporary array, or, if expanded as nested loops, no temporary storage is required.
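As a sketch (assuming n by n real(kind=8) matrices a, b and c, each with leading dimension n; the actual dgemm.f90 is not reproduced here), the single dgemm call and the no-temporary nested-loop expansion might look like:

  ! c = 1*a*b + 1*c, i.e. c = c + matmul(a,b)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)

and

  ! Loop order chosen so that the innermost loop runs down a column,
  ! matching Fortran's column-major storage
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do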

Matrix multiplication of square matrices may be an old and classic benchmark, but the single-core performance here varies by more than a factor of ten between compilers, even when all are run at high optimisation. The tests were all performed on x86_64 Linux.
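The GFLOPS figures quoted below assume the standard operation count of 2n^3 for one update c = c + matmul(a,b) of n by n matrices. A minimal timing harness along these lines (a sketch only, not the actual matmul.f90, and the matrix size is an arbitrary choice) is

  program time_matmul
    implicit none
    integer, parameter :: n = 2000
    real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
    real(kind=8) :: t0, t1
    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)
    call random_number(b)
    c = 0
    call cpu_time(t0)
    c = c + matmul(a,b)
    call cpu_time(t1)
    ! one multiply and one add for each of the n^3 terms
    print *, 'GFLOPS: ', 2d0*real(n,8)**3/((t1-t0)*1d9)
  end program time_matmul

Note that cpu_time reports CPU time summed over threads, so for the multi-threaded runs below wall-clock timing via system_clock would be the fairer measure.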

gfortran 9.2.0

Gfortran replaces the matmul intrinsic with a call to _gfortran_matmul_r8 from its own run-time library. This function does automatic CPU detection, using AVX when appropriate, and achieved around 20 GFLOPS on a 3 GHz Skylake.

One can also compile with the flag -fexternal-blas, in which case one must provide a suitable BLAS library to link against. Using OpenBLAS produced 42 GFLOPS on a single core, and 150 GFLOPS running on all four cores of the test computer.

flang, AOCC 2.3.0

Again matmul is replaced by a call to the compiler's run-time library. However, this one is not as good as gfortran's. It achieves just under 4 GFLOPS, and makes no use of SSE or AVX instructions, so immediately loses a factor of eight on potential peak performance.

There seems to be no option for using an external BLAS library, even though AMD does supply one in the form of BLIS.

AOCC 2.2.0 behaved differently. At low optimisation levels it also called its run-time library, but that library was considerably better, using SSE and achieving 10 GFLOPS. At higher optimisation levels it uses f90_mmul_real8 rather than f90_matmul_real8_i8, which drops performance to just under 6 GFLOPS.

Using the 2.3.0 run-time library with an executable compiled with 2.2.0 results in the performance of 2.3.0.

ifort 19.1.3

Ifort's approach of placing array temporaries on the stack means that it may be necessary to compile with the flag -heap-arrays to avoid stack overflows with large matrices.

Compiled with no further options, the performance is around 7 GFLOPS for small arrays, dropping to 3.5 GFLOPS for large arrays. No library is called, and blocking is either absent or not completely effective. AVX is not used.

Compiled with -Ofast it improves to around 11 GFLOPS for all sizes, but still uses no AVX instructions.

Adding -mavx2 does result in the use of AVX instructions, and the performance improves to 17 GFLOPS. It fails, though, to use any FMA instructions. This variation in performance with optimisation level suggests that ifort is generating inline nested loops rather than calling a library routine.

Compiling with -qopt-matmul results in a call to matmul_mkl_f64_. Note, however, that this will use a threaded version of MKL regardless of whether one specifies -mkl=sequential. It is fast, achieving 46 GFLOPS when run with OMP_NUM_THREADS=1 and 170 GFLOPS on all four cores.

When compiled with -qopt-matmul ifort still needs -heap-arrays. This suggests that it is calculating

  temp = matmul(a,b)
  c = c + temp

and using BLAS for the matrix multiplication only, and not the addition.

The inline approach of ifort is fast for small matrices (under about 40x40). When -qopt-matmul is used I believe that it does run-time switching between inline code and MKL depending on the matrix size.

nvfortran 20.7

Nvfortran uses a call to pgf90_mmul_real8, which achieves 10.5 GFLOPS using SSE code. Compiling with -O2 or above still appears to use pgf90_mmul_real8, but the performance drops to 2.4 GFLOPS. The same effect is seen with PGI F90 19.4 when the LLVM backend is used, but not with PGI's own backend.

Sun/Oracle f95 (from Studio 12.6)

I believe that this compiler is no longer being developed.

With no options, it calls __f95_dgemm_ and achieves 9 GFLOPS using SSE instructions.