real*complex optimisation in Fortran

subroutine scale(a,c,n)
  integer :: i,n
  real(kind(1d0)) :: a
  complex(kind(1d0)) :: c(*)

!$omp simd
  do i=1,n
     c(i)=a*c(i)
  enddo
end subroutine scale

This looks like something for which it is fairly easy to produce optimal code, even when targeting AVX2. But many compilers fail, in many different ways. What sort of code might one expect?

  movq          (n),%rdi         # Loop exit
  xorl           %rdx,%rdx       # Zero loop counter
  lea           (c),%rax         # Load pointer to c
  vbroadcastsd  (a), %ymm0       # Copy scalar a into
                                 # all four parts of %ymm0
.loop
  vmulpd    (%rax), %ymm0, %ymm1 # load c(i), multiply by a,
                                 #  store in %ymm1
  vmovupd   %ymm1, (%rax)        # store c(i)
  addq      $32, %rax            # update pointer
  addq      $2, %rdx             # update loop counter
  cmpq      %rdi, %rdx           # compare counter with exit condition
  jb .loop                       # conditional jump

So the above code works on four doubles, or two double complexes, in each iteration, and hence increments the loop counter by two. It is not absolutely optimal, but it is a reasonable start.

It does do two clever things. Firstly, the load of a into a scalar register, and its broadcast across all elements of a vector register, is done outside of the loop. Secondly, although the Fortran standard is clear that real*complex should be evaluated by promoting the real to a complex, this code simplifies that and does not perform a full complex-complex multiplication.

Intel compiler

ifort 19.0.1, -march=core-avx2

Produces something similar to the above, but with the loop unrolled twice.

GCC

gfortran 7.3.0/8.2.0, -march=haswell -fopenmp-simd

Promotes to a full complex-complex multiplication, and uses scalar registers throughout -- traditional unoptimised code.

gfortran 7.3.0/8.2.0, -O3 -march=haswell -fopenmp-simd

Promotes to a full complex-complex multiplication, but now uses the full width of the AVX2 registers, so the loop body looks like

.L4:
        vmovupd (%rdx), %ymm0
        addl    $1, %ecx
        addq    $32, %rdx
        vpermpd $177, %ymm0, %ymm1
        vmovapd %ymm0, %ymm2
        vmulpd  %ymm5, %ymm1, %ymm1
        vfmsub132pd     %ymm4, %ymm1, %ymm2
        vfmadd132pd     %ymm4, %ymm1, %ymm0
        vshufpd $10, %ymm0, %ymm2, %ymm0
        vmovupd %ymm0, -32(%rdx)
        cmpl    %edi, %ecx
        jb      .L4

gfortran 7.3.0/8.2.0, -O3 -fno-signed-zeros -march=haswell -fopenmp-simd

The addition of -fno-signed-zeros allows GCC to optimise away the multiplications by zero. It now produces

.L4:
        vmulpd  (%rdx), %ymm2, %ymm1
        addl    $1, %ecx
        addq    $32, %rdx
        vmovupd %ymm1, -32(%rdx)
        cmpl    %edi, %ecx
        jb      .L4

(Version 8.2.0 is slightly cleverer in that it uses %rdx as both pointer offset and loop counter, saving one integer register and one integer add.)

PGI

PGI Fortran 18.10, -tp=haswell

.loop
      vxorpd  %xmm0, %xmm0, %xmm0
      vmovsd  (%rdi), %xmm0
      movslq  %ecx, %rdx
      shlq    $4, %rdx
      vmovddup        %xmm0, %xmm0
      vmovupd -16(%rdx,%rsi), %xmm1
      vmulpd  %xmm1, %xmm0, %xmm0
      vmovupd %xmm0, -16(%rdx,%rsi)
      addl    $1, %ecx
      subl    $1, %eax
      testl   %eax, %eax
      jg      .loop

Only two-element (xmm) vectors are used, and a is reloaded and duplicated on every loop iteration.

PGI Fortran 18.10, -tp=haswell -O3

.LB1_400:
        movl    %eax, %r8d
        andl    $2, %r8d
        jne     .LB1_409
        prefetchnta     80(%rcx)
        vunpckhpd       %xmm0, %xmm0, %xmm2
        vunpcklpd       %xmm0, %xmm0, %xmm1
        vmovupd (%rcx), %ymm4
        subl    $2, %eax
        vinsertf128     $1, %xmm2, %ymm1, %ymm3
        vmulpd  %ymm4, %ymm3, %ymm1
        vmovupd %ymm1, (%rcx)
        addq    $32, %rcx
        testl   %eax, %eax
        jg      .LB1_400

Considerably better, in that the full width of the AVX2 registers is now used, prefetching is used, and a is cached in a register. However, it caches a as a scalar rather than as a broadcast vector, and reconstructs the vector on each iteration (the vunpckhpd / vunpcklpd / vinsertf128 sequence). The prefetching is also of no benefit for the test used, as discussed in more detail on the conjugation page.

Sun/Oracle Studio

I failed to persuade this compiler (12.6) to vectorise this code at all, even with -xarch=avx2 -O3 -xvector=simd -xivdep -fsimple=2 and !DIR$ IVDEP before the loop.

Does it matter?

Times for 10,000,000 calls with n=1000. There are 2x10^10 multiplications, so 2s would be equivalent to 10 GFLOPS. A 3.7GHz CPU was used, capable of performing four multiplications per clock cycle with AVX2 instructions, i.e. 14.8 GFLOPS, so a theoretical minimum of 1.35s.

  ifort                             1.78s
  gfortran -O3 -fno-signed-zeros    1.96s
  pgf90 -O3                         4.14s
  gfortran -O3                      4.14s
  pgf90                             7.66s
  sunf90 -O3                        8.51s
  sunf90                           22.84s
  gfortran                         28.93s