real*complex optimisation in Fortran
subroutine scale(a,c,n)
  integer :: i,n
  real(kind(1d0)) :: a
  complex(kind(1d0)) :: c(*)
!$omp simd
  do i=1,n
     c(i)=a*c(i)
  enddo
end subroutine scale
This looks like something for which it is fairly easy to produce optimal code, even when targeting AVX2. But many compilers fail, in many different ways. What sort of code might one expect?
        movq    (n), %rdi            # Loop exit
        xorl    %rdx, %rdx           # Zero loop counter
        lea     (c), %rax            # Load pointer to c
        vbroadcastsd (a), %ymm0      # Copy scalar a into
                                     # all four parts of %ymm0
.loop
        vmulpd  (%rax), %ymm0, %ymm1 # load c(i), multiply by a,
                                     # store in %ymm1
        vmovupd %ymm1, (%rax)        # store c(i)
        addq    $32, %rax            # update pointer
        addq    $2, %rdx             # update loop counter
        cmpq    %rdi, %rdx           # compare counter with exit condition
        jb      .loop                # conditional jump
So the above code works on four doubles, or two double complexes, in each iteration, and hence increments the loop counter by two. It is not absolutely optimal, but it is a reasonable start.
It does do two clever things. Firstly, the load of a and its broadcast across a vector register are done outside of the loop. Secondly, although the Fortran standard is clear that real*complex should be interpreted by promoting the real to a complex, this code simplifies that and does not perform a full complex-complex multiplication.
Produces something similar to the above, but with the loop unrolled twice.
Promotes to a full complex-complex multiplication, and uses scalar registers throughout -- traditional unoptimised code.
Promotes to a full complex-complex multiplication, but now uses the full width of the AVX2 registers, so the loop body looks like
.L4:
        vmovupd (%rdx), %ymm0
        addl    $1, %ecx
        addq    $32, %rdx
        vpermpd $177, %ymm0, %ymm1
        vmovapd %ymm0, %ymm2
        vmulpd  %ymm5, %ymm1, %ymm1
        vfmsub132pd %ymm4, %ymm1, %ymm2
        vfmadd132pd %ymm4, %ymm1, %ymm0
        vshufpd $10, %ymm0, %ymm2, %ymm0
        vmovupd %ymm0, -32(%rdx)
        cmpl    %edi, %ecx
        jb      .L4
-O3 -fno-signed-zeros -march=haswell
The addition of -fno-signed-zeros allows gcc to optimise away the multiplications by zero: without it, gcc must preserve the sign of zero results, so the terms involving the zero imaginary part of the promoted a cannot simply be discarded. It now produces
.L4:
        vmulpd  (%rdx), %ymm2, %ymm1
        addl    $1, %ecx
        addq    $32, %rdx
        vmovupd %ymm1, -32(%rdx)
        cmpl    %edi, %ecx
        jb      .L4
(Version 8.2.0 is slightly cleverer in that it uses %rdx as both pointer offset and loop counter, saving one integer register and one integer add.)
PGI Fortran 18.10,
.loop
        vxorpd  %xmm0, %xmm0, %xmm0
        vmovsd  (%rdi), %xmm0
        movslq  %ecx, %rdx
        shlq    $4, %rdx
        vmovddup %xmm0, %xmm0
        vmovupd -16(%rdx,%rsi), %xmm1
        vmulpd  %xmm1, %xmm0, %xmm0
        vmovupd %xmm0, -16(%rdx,%rsi)
        addl    $1, %ecx
        subl    $1, %eax
        testl   %eax, %eax
        jg      .LB1_307
Only two-element vectors are used, and a is reloaded and re-duplicated on every loop iteration.
PGI Fortran 18.10,
.LB1_400:
        movl    %eax, %r8d
        andl    $2, %r8d
        jne     .LB1_409
        prefetchnta 80(%rcx)
        vunpckhpd %xmm0, %xmm0, %xmm2
        vunpcklpd %xmm0, %xmm0, %xmm1
        vmovupd (%rcx), %ymm4
        subl    $2, %eax
        vinsertf128 $1, %xmm2, %ymm1, %ymm3
        vmulpd  %ymm4, %ymm3, %ymm1
        vmovupd %ymm1, (%rcx)
        addq    $32, %rcx
        testl   %eax, %eax
        jg      .LB1_400
Considerably better, in that the full length of the AVX2 registers is now used, prefetching is used, and a is cached in a register. However, it caches a as a scalar, not a vector, and reconstructs the vector on each iteration. The prefetching is also not beneficial for the test used, as discussed in more detail on the conjugation page.
I failed to persuade this compiler (12.6) to vectorise this code at all, even with -xarch=avx2 -O3 -xvector=simd -xivdep and a !DIR$ IVDEP directive before the loop.
Does it matter?
Times for 10,000,000 calls with n=1000. There are 2×10¹⁰ multiplications, so 2s would be equivalent to 10 GFLOPS. A 3.7GHz CPU was used, capable of performing four multiplications per clock cycle with AVX2 instructions, i.e. 14.8 GFLOPS, so a theoretical minimum of 1.35s.
gfortran -O3 -fno-signed-zeros        1.96s