Vector conjugation in Fortran

subroutine v_conjg(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

!$omp simd
  do i=1,n
     c(i)=conjg(c(i))
  enddo
end subroutine v_conjg

This looks like something for which it is fairly easy to produce optimal code, even when targeting AVX2. But many compilers fail, in many different ways. What sort of code might one expect?

  movq          (n),%rdi         # Load n, the loop bound
  xorl           %rdx,%rdx       # Zero loop counter
  lea           (c),%rax         # Load pointer to c
  vmovupd  (256bit const), %ymm0 # Suitable constant to invert
                                 # two signs in ymm register
.loop:
  vxorpd    (%rax), %ymm0, %ymm1 # load c(i), xor to change signs,
                                 #  store in %ymm1
  vmovupd   %ymm1, (%rax)        # store c(i)
  addq      $32, %rax            # update pointer
  addq      $2, %rdx             # update loop counter
  cmpq      %rdi, %rdx           # compare counter with exit condition
  jb .loop                       # conditional jump

So the above code works on four doubles, or two double complexes, in each iteration, and hence increments the loop counter by two. It is not absolutely optimal, but it is a reasonable start.

Intel compiler

ifort 19.0.1, -march=core-avx2

Produces something similar to the above, but with the loop unrolled four times.

GCC

gfortran 7.3.0/8.2.0, -O3 -march=haswell -fopenmp-simd

.L4:
   vmovupd       8(%rax), %ymm3           # load 32 bytes starting at the
                                          # imaginary part of c(i)
   vunpcklpd     40(%rax), %ymm3, %ymm0   # interleave with the next 32
                                          # bytes: %ymm0 now holds the
                                          # imaginary parts of c(i)..c(i+3)
                                          # in the order i, i+2, i+1, i+3
   addq          $64, %rax                # increment pointer
   vpermpd       $216, %ymm0, %ymm0       # permute to order i, i+1, i+2, i+3
   vxorpd        %ymm2, %ymm0, %ymm0      # change signs of all elements of
                                          # %ymm0
   vmovlpd       %xmm0, -56(%rax)         # write imaginary part of c(i)
   vmovhpd       %xmm0, -40(%rax)         # write imaginary part of c(i+1)
   vextractf128  $0x1, %ymm0, %xmm0       # move top half of %ymm0 to %xmm0
   vmovlpd       %xmm0, -24(%rax)         # write imaginary part of c(i+2)
   vmovhpd       %xmm0, -8(%rax)          # write imaginary part of c(i+3)
   cmpq          %rcx, %rax
   jb            .L4

This saves on xors at the expense of lots of shuffling. The four 64-bit writes will not be any faster than the two 256-bit writes that Intel's style would use.

gfortran 7.3.0/8.2.0, -march=haswell -fopenmp-simd

Without optimisation flags specified, gfortran not only fails to vectorise: it loads the real and imaginary parts separately into two registers, xors the sign of the imaginary part, and then writes the two parts back out again. The code generated for the real part alone really is

        movq    -24(%rbp), %rsi     # %rsi points to c
        movq    %rdx, %rcx          # %rdx is i
        salq    $4, %rcx            # %rcx is now 16*i
        addq    %rsi, %rcx          # %rcx is now c+16*i
                                    # i.e. the address of c(i)
        vmovsd  (%rcx), %xmm1       # load real part of c(i) to %xmm1
        [...]
        movq    -24(%rbp), %rsi
        movq    %rdx, %rcx
        salq    $4, %rcx
        addq    %rsi, %rcx
        vmovsd  %xmm1, (%rcx)

A rather slow way of doing nothing!

pgf90, -O3 -tp=haswell -Mllvm

The PGI compiler takes a very different approach. There is no vectorisation, and it only loads and stores the imaginary parts of c. It does the sign reversal with a vsubsd, a scalar subtraction from zero, rather than with an xor. It unrolls by a factor of eight, which must help its speed.

pgf90, -O3 -tp=haswell

When using its own code generator, not LLVM, the PGI compiler does produce code very similar to the example at the top of this page. It does not unroll, does maintain the loop counter separately from the pointer to c(i), and has a prefetch instruction on each iteration. The timing test here is deliberately contained within the L1 cache, so the prefetching will be of no benefit. It is also quite expensive in a simple un-unrolled loop, for it is done as

.loop:
  movl        %eax, %edx
  andl        $2, %edx
  jne         .skip
  prefetchnta 80(%rcx)
.skip:

It can be suppressed with -Mnoprefetch.

Does it matter?

Times for 10,000,000 calls with n=1000. The clock speed of the CPU used was 3.7GHz, so one clock-cycle per loop iteration (10^10 cycles in total) would result in a time of about 2.7s.

ifort                        1.39s
pgf90 -O3 -Mnoprefetch       2.14s
pgf90 -O3 -Mllvm             2.73s
gfortran -O3                 2.75s
pgf90 -O3                    3.03s
sunf90 -O3                   6.45s
gfortran (no optimisation)  23.73s