matmul optimisation in Fortran
Fortran has an intrinsic operation for matrix-matrix multiplication. Different compilers optimise this differently, with obvious choices including:
- Insert three nested loops early in parsing
- Replace it by a call to the compiler's own run-time matmul routine
- Replace it by a call to an external BLAS library
To test what compilers do, a simple example code was used, matmul.f90. It takes about two minutes to run (or rather less if the compiler does a good job!). The kernel is

  c = c + matmul(a,b)

where a, b and c are square matrices. The BLAS matmul call, dgemm, is quite general, calculating

  c = alpha*a*b + beta*c

where alpha and beta are scalars. So the test used here can be reduced to a single call to dgemm with alpha=beta=1, as shown by dgemm.f90. It could also be expanded as

  temp = matmul(a,b)
  c = c + temp

where temp is a temporary array, or, if expanded as nested loops, no temporary storage is required.
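The shape of the benchmark, and the equivalence between the intrinsic and the single dgemm call, can be sketched as below. This is an illustrative sketch only: the matrix size n and repeat count nrep are assumptions, not the values used in matmul.f90.

```fortran
! Minimal sketch of the benchmark kernel; n and nrep are illustrative guesses.
program matmul_bench
  implicit none
  integer, parameter :: n = 1000, nrep = 10
  real(kind(1d0)), dimension(n,n) :: a, b, c
  integer :: i
  call random_number(a)
  call random_number(b)
  c = 0d0
  do i = 1, nrep
     c = c + matmul(a, b)       ! the intrinsic form tested here
  end do
  ! The equivalent single BLAS call, with alpha=beta=1, would be:
  !   call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 1d0, c, n)
  write(*,*) 'trace of c:', sum([(c(i,i), i=1,n)])
end program matmul_bench
```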
Matrix multiplication of square matrices may be an old and classic benchmark, but the performance here varies by more than a factor of ten between compilers, even on a single core, and even with every compiler run at high optimisation. The tests were all performed on x86_64 Linux.
gfortran 9.2.0
Gfortran replaces the matmul intrinsic with a call to _gfortran_matmul_r8 from its own run-time library. This function does automatic CPU detection, using AVX when appropriate, and achieved around 20 GFLOPS on a 3 GHz Skylake.
One can also choose to compile with the flag -fexternal-blas, and then one must also provide a suitable BLAS library to link against. Using OpenBLAS produced 42 GFLOPS on a single core, and 150 GFLOPS running on all four cores of the test computer.
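Assuming OpenBLAS is installed somewhere the linker can find it, the external-BLAS build might look like the following; the exact library name and the OPENBLAS_NUM_THREADS pinning are assumptions about a typical Linux install, not details from the original test:

```shell
# Lower matmul() to BLAS calls, then link against OpenBLAS.
gfortran -O2 -fexternal-blas matmul.f90 -lopenblas -o matmul
# OpenBLAS threads itself by default; pin it to one core
# to measure single-core performance.
OPENBLAS_NUM_THREADS=1 ./matmul
```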
flang, AOCC 2.3.0
Again the matmul intrinsic is replaced by a call to the compiler's run-time library. However, this one is not as good as gfortran's: it achieves just under 4 GFLOPS and makes no use of SSE or AVX instructions, so it immediately loses a factor of eight on potential peak performance.
There seems to be no option for using an external BLAS library, even though AMD does supply one in the form of BLIS.
AOCC 2.2.0 behaved differently. At low optimisation levels it used a call to its run-time library, but its run-time library was considerably better, using SSE and achieving 10 GFLOPS. At higher optimisation levels, however, it used f90_mmul_real8 rather than f90_matmul_real8_i8, which dropped its performance to just under 6 GFLOPS.
Using the 2.3.0 runtime library with an executable compiled with 2.2.0 results in the performance of 2.3.0.
ifort 19.1.3
The interesting approach to array temporaries used by ifort means that it may be necessary to compile with the flag -heap-arrays.
Compiled with no further options, the performance is around 7 GFLOPS for small arrays, dropping to 3.5 GFLOPS for large arrays. No library is called, and blocking is either absent or not completely effective. AVX is not used.
Compiled with -Ofast it improves to around 11 GFLOPS for all sizes, but still uses no AVX instructions.
Adding -mavx2 does result in the use of AVX instructions, and the performance improves to 17 GFLOPS. It fails, though, to use any FMA instructions. This variation in performance with optimisation level suggests that ifort is using some form of nested loop insertion.
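For reference, the inlined expansion a compiler might insert is just three nested loops, and needs no temporary array. The sketch below uses one cache-friendly loop ordering for Fortran's column-major storage (unit stride on the innermost index); this is an illustration of the technique, not necessarily the ordering ifort actually emits:

```fortran
! Naive triple-loop equivalent of c = c + matmul(a,b) for n x n matrices.
subroutine mm_naive(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real(kind(1d0)), intent(in)    :: a(n,n), b(n,n)
  real(kind(1d0)), intent(inout) :: c(n,n)
  integer :: i, j, l
  do j = 1, n
     do l = 1, n
        do i = 1, n        ! innermost loop walks down a column: unit stride
           c(i,j) = c(i,j) + a(i,l) * b(l,j)
        end do
     end do
  end do
end subroutine mm_naive
```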
Compiling with -qopt-matmul results in a call to matmul_mkl_f64_. However, note that this will use a threaded version of MKL regardless of whether one specifies -mkl=sequential or not. It is fast, achieving 46 GFLOPS when run with OMP_NUM_THREADS=1 and 170 GFLOPS on all four cores.
When compiled with -qopt-matmul ifort still needs -heap-arrays. This suggests that it is calculating

  temp = matmul(a,b)
  c = c + temp

and using BLAS for the matrix multiplication only, and not for the addition.
The inline approach of ifort is fast for small matrices (under about 40x40). When -qopt-matmul is used, I believe that it does run-time switching between inline code and MKL depending on the matrix size.
nvfortran 20.7
Nvfortran uses a call to pgf90_mmul_real8, which achieves 10.5 GFLOPS using SSE code. Compiling with -O2 or above still appears to use pgf90_mmul_real8, but the performance drops to 2.4 GFLOPS. The same effect is seen with PGI F90 19.4 when the LLVM backend is used, but not with PGI's own backend.
Sun/Oracle f95 (from Studio 12.6)
I believe that this compiler is no longer being developed.
With no options, it calls __f95_dgemm_ and achieves 9 GFLOPS using SSE instructions.