matmul optimisation in Fortran
Fortran has an intrinsic function, matmul, for matrix-matrix multiplication. Different compilers optimise it differently, with obvious choices including:
- Insert three nested loops early in parsing
- Replace by call to own matmul routine
- Replace by call to external BLAS library
To test what compilers do, a simple example code, matmul.f90, was used. It takes about two minutes to run (or rather less if the compiler does a good job!). The kernel is
c = c + matmul(a,b)
where a, b and c are square matrices. The BLAS matmul call, dgemm, is quite general, computing
c = alpha*a*b + beta*c
where alpha and beta are scalars. So the test used here can be reduced to a single call to dgemm with alpha=beta=1, as shown by dgemm.f90. It could also be split into a multiplication into a temporary array followed by an addition, where temp is a temporary array holding the product, or, if expanded as nested loops, no temporary storage is required.
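As a sketch of these three equivalent forms (array names follow the text; the size n and the use of allocatable arrays are assumptions, since matmul.f90 itself is not reproduced here):

```fortran
program kernel_sketch
  implicit none
  integer, parameter :: n = 1000   ! size is an assumption
  integer :: i, j, k
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  external :: dgemm                ! from any BLAS library

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0d0

  ! The kernel as tested, using the intrinsic:
  c = c + matmul(a, b)

  ! The same update as a single BLAS call, with alpha = beta = 1:
  call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 1d0, c, n)

  ! The nested-loop expansion, which needs no temporary storage:
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
end program kernel_sketch
```

Each of the three forms adds the product a*b into c once; linking requires a BLAS library (e.g. -lblas or -lopenblas).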
Matrix multiplication for square matrices may be an old and classic benchmark, but the performance here, even on a single core, varies by more than a factor of ten between different compilers, even when all are run at high optimisation. The tests were all performed on x86_64 Linux.
gfortran
Gfortran replaces the matmul intrinsic with a call to _gfortran_matmul_r8 from its own run-time library. This function does automatic CPU detection, using AVX when appropriate, and achieved around 20 GFLOPS on the 3GHz test computer.
One can also choose to compile with the -fexternal-blas flag, in which case one must also provide a suitable BLAS library to link against. Using OpenBLAS produced 42 GFLOPS on a single core, and 150 GFLOPS running on all four cores of the test computer.
flang, AOCC 2.3.0
Again matmul is replaced by a call to the compiler's run-time library. However, this one is not as good as gfortran's: it achieves just under 4 GFLOPS, and makes no use of SSE or AVX instructions, so it immediately loses a factor of eight on potential peak performance (AVX holds four doubles per register, and FMA doubles the throughput again).
There seems to be no option for using an external BLAS library, even though AMD does supply one in the form of BLIS.
AOCC 2.2.0 behaved differently. At low optimisation levels it also called its run-time library, but that library was considerably better, using SSE and achieving 10 GFLOPS. At higher optimisation levels it instead calls f90_matmul_real8_i8, which drops the performance to just under 6 GFLOPS.
Using the 2.3.0 runtime library with an executable compiled with 2.2.0 results in the performance of 2.3.0.
ifort
The interesting approach to array temporaries used by ifort means that it may be necessary to compile with the -heap-arrays option for large matrices.
Compiled with no further options, the performance is around 7 GFLOPS for small arrays, dropping to 3.5 GFLOPS for large arrays. No library is called, and blocking is either absent or not completely effective. AVX is not used.
With -Ofast it improves to around 11 GFLOPS for all sizes, but still uses no AVX instructions. Adding -mavx2 does result in the use of AVX instructions, and the performance improves to 17 GFLOPS. It fails, though, to use any FMA instructions. This variation in performance with optimisation level suggests that ifort is using some form of nested loop insertion.
Adding -qopt-matmul results in a call to matmul_mkl_f64_. However, note that this will use a threaded version of MKL regardless of whether one specifies -mkl=sequential or not. It is fast, achieving 46 GFLOPS when OMP_NUM_THREADS=1 and 170 GFLOPS on all four cores.
When compiled with -qopt-matmul, ifort still requires -heap-arrays. This suggests that it is still creating an array temporary, and using BLAS for the matrix multiplication only, and not the addition.
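A sketch of what this observed behaviour suggests — a hypothetical reconstruction, not ifort's actual generated code: dgemm performs only the multiplication (beta = 0) into a temporary, and the addition is done separately.

```fortran
! Hypothetical reconstruction of the behaviour described above:
! the product goes into an array temporary via dgemm with beta = 0,
! and the addition to c is a separate whole-array operation.
subroutine matmul_add(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)    :: a(n,n), b(n,n)
  double precision, intent(inout) :: c(n,n)
  double precision, allocatable   :: temp(:,:)
  external :: dgemm

  allocate(temp(n,n))
  ! multiplication only: temp = a*b
  call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, temp, n)
  ! addition done outside BLAS
  c = c + temp
end subroutine matmul_add
```

Passing beta = 1 and c directly, as dgemm.f90 does, would avoid both the temporary and the separate addition.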
The inline approach of ifort is fast for small matrices (under
about 40x40). When
-qopt-matmul is used I believe that
it does run-time switching between inline code and MKL depending on
the matrix size.
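The kind of dispatch this would imply can be sketched as follows; the structure is a pure guess, with only the 40x40 threshold taken from the observation above.

```fortran
! Purely illustrative run-time switch between inline loops and BLAS,
! of the kind -qopt-matmul appears to perform. The 40x40 threshold
! comes from the small-matrix observation above; everything else is
! a guess, not ifort's actual code.
subroutine dispatch_matmul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)    :: a(n,n), b(n,n)
  double precision, intent(inout) :: c(n,n)
  integer :: i, j, k
  external :: dgemm

  if (n < 40) then
     ! small matrices: simple inline loops, no call overhead
     do j = 1, n
        do k = 1, n
           do i = 1, n
              c(i,j) = c(i,j) + a(i,k)*b(k,j)
           end do
        end do
     end do
  else
     ! large matrices: hand off to an optimised BLAS (MKL in ifort's case)
     call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 1d0, c, n)
  end if
end subroutine dispatch_matmul
```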
PGI F90
PGI replaces matmul with a call to pgf90_mmul_real8, which achieves 10.5 GFLOPS using SSE code. Compiling at higher optimisation levels still appears to use pgf90_mmul_real8, but the performance drops to 2.4 GFLOPS. The same effect is seen with PGI F90 19.4 when the LLVM backend is used, but not with PGI's own backend.
Sun/Oracle f95 (from Studio 12.6)
I believe that this compiler is no longer being developed.
With no options, it calls
__f95_dgemm_ and achieves
9 GFLOPS using SSE instructions.