Vector conjugation in Fortran
  subroutine v_conjg(c,n)
    integer :: i,n
    complex(kind(1d0)) :: c(*)
  !$omp simd
    do i=1,n
       c(i)=conjg(c(i))
    enddo
  end subroutine v_conjg
This looks like something for which it is fairly easy to produce optimal code, even when targeting AVX2. But many compilers fail, in many different ways. What sort of code might one expect?
          movq    (n), %rdi             # Loop exit
          xorl    %rdx, %rdx            # Zero loop counter
          lea     (c), %rax             # Load pointer to c
          vmovupd (256bit const), %ymm0 # Suitable constant to invert
                                        #   two signs in ymm register
  .loop:
          vxorpd  (%rax), %ymm0, %ymm1  # load c(i), xor to change signs,
                                        #   store in %ymm1
          vmovupd %ymm1, (%rax)         # store c(i)
          addq    $32, %rax             # update pointer
          addq    $2, %rdx              # update loop counter
          cmpq    %rdi, %rdx            # compare counter with exit condition
          jb      .loop                 # conditional jump
So the above code works on four doubles, or two double complexes, in each iteration, and hence increments the loop counter by two. It is not absolutely optimal, but it is a reasonable start.
One compiler produces something similar to the above, but with the loop unrolled four times.
-O3 -march=haswell -fopenmp-simd
  .L4:
          vmovupd      8(%rax), %ymm3        # move c(i) and c(i+1) into %ymm3
          vunpcklpd    40(%rax), %ymm3, %ymm0 # move imaginary parts of c(i+2)
                                             #   c(i+3) and %ymm3 into %ymm0
          addq         $64, %rax             # increment pointer
          vpermpd      $216, %ymm0, %ymm0    # permute %ymm0
          vxorpd       %ymm2, %ymm0, %ymm0   # change signs of all elements
                                             #   of %ymm0
          vmovlpd      %xmm0, -56(%rax)      # write imaginary part of c(i)
          vmovhpd      %xmm0, -40(%rax)      # write imaginary part of c(i+1)
          vextractf128 $0x1, %ymm0, %xmm0    # move top half of %ymm0 to %xmm0
          vmovlpd      %xmm0, -24(%rax)      # write imaginary part of c(i+2)
          vmovhpd      %xmm0, -8(%rax)       # write imaginary part of c(i+3)
          cmpq         %rcx, %rax
          jb           .L4
This saves on xors at the expense of lots of shuffling. The four 64-bit writes will not be any faster than the two 256-bit writes that Intel's style would use.
Without optimisation flags specified, not only does gfortran not vectorise, but it loads the real and imaginary parts separately into two separate registers, xors the sign of the imaginary part, and then writes the two parts back out again. The activity on the real part really is
          movq   -24(%rbp), %rsi  # %rsi points to c
          movq   %rdx, %rcx       # %rdx is i
          salq   $4, %rcx         # %rcx is now 16*i
          addq   %rsi, %rcx       # %rcx is now c+16*i,
                                  #   i.e. the address of c(i)
          vmovsd (%rcx), %xmm1    # load real part of c(i) to %xmm1
          [...]
          movq   -24(%rbp), %rsi
          movq   %rdx, %rcx
          salq   $4, %rcx
          addq   %rsi, %rcx
          vmovsd %xmm1, (%rcx)
A rather slow way of doing nothing!
-O3 -tp=haswell -Mllvm
The PGI compiler takes a very different approach. No vectorisation, and it loads and stores only the imaginary parts of c. It does the sign reversal with a vsubsd, a scalar subtraction (from zero), rather than an xor. It unrolls by a factor of eight, which must help its speed.
When using its own code generator, not LLVM, the PGI compiler does produce code very similar to the example at the top of this page. It does not unroll, does maintain the loop counter separately from the pointer to c(i), and has a prefetch instruction on each iteration. The timing test here is deliberately contained within L1 cache, so the prefetching will be of no benefit. It is also quite expensive (in a simple un-unrolled loop), for it is done as
  .loop:
          movl        %eax, %edx
          andl        $2, %edx
          jne         .skip
          prefetchnta 80(%rcx)
  .skip:
It can be suppressed with -Mnoprefetch.
Does it matter?
Times for 10,000,000 calls with n=1000, i.e. 10^10 loop iterations in total. The clock speed of the CPU used was 3.7GHz, and so one clock-cycle per loop iteration would result in a time of 2.7s.
  pgf90 -O3 -Mnoprefetch   2.14s
  pgf90 -O3 -Mllvm         2.73s