Scaling by i in Fortran

Following the theme of the page on scaling a complex vector by a real number, this page discusses the even simpler task of scaling a complex vector by the imaginary constant i. All that needs to be done is to exchange the real and imaginary parts and reverse the sign of one of them: i*(a+ib) = -b + ia. But what do current compilers make of this?

Firstly, there are many different ways of expressing this operation. Here we discuss:

subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

!$omp simd
  do i=1,n
     c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i
but with many variants, such as:
  c(i)=c(i)*cmplx(0d0,1d0,kind(1d0))
or
  c(i)=cmplx(-aimag(c(i)),real(c(i)),kind(1d0))
or
   c(i)=conjg(c(i))
   c(i)=cmplx(aimag(c(i)),real(c(i)),kind(1d0))
or
  zi=cmplx(0d0,1d0,kind(1d0))
!$omp simd
  do i=1,n
     c(i)=c(i)*zi
  enddo
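Written out in full, the exchange variant used in the timings below might look like the following (a sketch only; the name scale_i_swap is mine):

subroutine scale_i_swap(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

!$omp simd
  do i=1,n
     ! swap real and imaginary parts, negating the new real part
     c(i)=cmplx(-aimag(c(i)),real(c(i)),kind(1d0))
  enddo
end subroutine scale_i_swap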
One could even give up on Fortran, lose all hope of portability, and write using Intel's C AVX2 intrinsics:
#include <stdint.h>
#include <immintrin.h>

/* Multiply a double-complex array by i, two elements per 256-bit register.
   Assumes n is even. __m256d_u is the unaligned variant of __m256d. */
void scale_i_(__m256d_u *c, int *n){
  int i,n2;
  /* _mm256_set_pd lists elements high to low, so the low double of each
     128-bit lane is -0.0: xor'ing with it flips only those signs */
  __m256d x=_mm256_set_pd(0.0,-0.0,0.0,-0.0);
  n2=*n/2;

  for(i=0;i<n2;i++){
    *c=_mm256_permute_pd(*c,5);  /* swap re/im within each 128-bit lane */
    *c=_mm256_xor_pd(*c,x);      /* negate the new real parts */
    c++;
  }
}

This style makes it very clear that there are no multiplications, and just two registers are needed, one to store the magic constant for xor'ing the signs, and one to hold the elements of c.
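For completeness, a sketch of calling this routine from Fortran, assuming the common Linux convention that the Fortran name scale_i maps to the symbol scale_i_ with arguments passed by reference:

program call_scale_i
  complex(kind(1d0)) :: c(1000)
  integer :: n
  n=1000               ! must be even: the C loop works on pairs of elements
  c=(1d0,2d0)
  call scale_i(c,n)    ! resolves to the C routine scale_i_ above
  print *, c(1)        ! i*(1+2i) = (-2,1)
end program call_scale_i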

*(0d0,1d0)

Unless otherwise noted, identical code is produced for *cmplx(0d0,1d0,kind(1d0)).

gfortran

Versions 7.3.0 and 8.2.0 are not significantly different.

-march=haswell -fopenmp-simd: scalar, and full complex-complex multiplication.

-O3 -march=haswell -fopenmp-simd: fully vectorised, with full complex-complex multiplication from 8.2.0: one vpermpd, two vfmadd???pds, and a vshufpd. 7.3.0 unrolls twice but cleverly uses just two vfmadd??pds in the unrolled loop, and no other arithmetic instructions; it appears to have eliminated the multiplications by one. Less clever is that this loop contains six vpermpds along with a pair of vshufpds and an asymmetric pair of vunpcklpd/vunpckhpds.

-O3 -fno-signed-zeros -march=haswell -fopenmp-simd: fully vectorised, with only shuffles and an xor. Each two AVX2 registerfuls take six vpermpds, two vshufpds, two vunpck?pds, and one vxorpd. The code packs the imaginary parts from two registers into a single register, performs a single xor to change their signs, and then unpacks. The extra packing and unpacking to save a single xor is not efficient.

-O3 -march=sandybridge -fopenmp-simd: similar to the above, but with no FMA instructions, and so slower in the cases where FMA was previously used.

ifort

-march=core-avx2: loop unrolled four times, and has one vmulpd, one vfmaddsub213pd, one vunpckhpd and one vmovddup per un-unrolled iteration.

Adding -Ofast changes nothing; similarly -fp-model fast=2.

Changing to -march=corei7-avx to exclude the use of FMA instructions still produces very similar code, now with eight multiplies per loop, all of which will be by zero or one! This is slightly slower than the FMA version.

Exchange

gfortran

-O3 -march=haswell -fopenmp-simd: the code is identical to that produced with -O3 -fno-signed-zeros and multiplying by (0d0,1d0). Spotting the equivalence of these two ways of writing the code is impressive. The code produced, less so.

The version with the conjugate is neither vectorised nor unrolled, so the loop contains two scalar loads, one scalar xor, and two scalar stores. It nevertheless runs slightly faster than the vectorised exchange version.

ifort

Ifort unrolls the loop eight times, with two vperm2f128, one vunpckhpd, one vunpcklpd and half a vxorpd instruction per un-unrolled iteration. The heavy unrolling leaves it with no spare register to cache the constant for the xor, so that is reloaded every time it is used. However, turning off unrolling still does not result in this constant being cached in a register.

AVX intrinsics

AVX intrinsics might be closer to assembler than C, but they are still not assembler, and there is considerable variation in how compilers convert them to assembler. A reasonably good conversion, from gcc -O3, is:

.L3:
        vpermpd $177, (%rdi), %ymm0
        addq    $32, %rdi
        vxorpd  %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3

Intel's conversion is similar, but safer as it uses vmovupd, as does gcc 8.2.0's; gcc 7.3.0's code will segfault if the array is not 32-byte aligned. PGI's conversion is:

.loop
  vmovupd (%rdi), %ymm2
  vpermilpd $5, %ymm2, %ymm0
  vmovupd %ymm0, (%rdi)
  vmovupd (%rdi), %ymm2
  vxorpd %ymm1, %ymm2, %ymm0
  vmovupd %ymm0, (%rdi)
  addq    $32, %rdi
  cmpq    %rax, %rdi
  jb      .loop

This is very literal, with c being stored back to memory after the permute, and then reloaded into a (different) register for the xor. Still excellent when compared with gcc's unoptimised conversion…

Run times

The code was timed over 10,000,000 calls to the subroutine with n=1000. The CPU used was a Kaby Lake with a clock speed of 3.7GHz, so one clock-cycle per (scalar) loop iteration would correspond to a time of 2.7s.
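The harness itself is not shown; a minimal sketch of one, timing the scale_i subroutine above, might be:

program time_scale_i
  integer, parameter :: n=1000, ncalls=10000000
  complex(kind(1d0)) :: c(n)
  integer :: j
  integer(kind=8) :: t0, t1, rate
  c=(1d0,1d0)                   ! any non-trivial data will do
  call system_clock(t0,rate)
  do j=1,ncalls
     call scale_i(c,n)          ! link against any of the versions above
  enddo
  call system_clock(t1)
  print *, 'time =', real(t1-t0)/real(rate), 's'
end program time_scale_i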

AVX intrinsics, gcc -O3                          2.16s
*(0d0,1d0), ifort                                2.39s
*(0d0,1d0), ifort -march=corei7-avx              2.71s
AVX intrinsics, ifort/icc                        2.78s
AVX intrinsics, gcc-8 -O3                        2.78s
AVX intrinsics, pgi -Mllvm                       2.78s
AVX intrinsics, pgi                              3.26s
*(0d0,1d0), gfortran-8 -O3                       3.96s
*(0d0,1d0), gfortran-8 -O3 -march=sandybridge    4.14s
*(0d0,1d0), pgf90                                4.77s
exchange, ifort                                  5.44s
exchange, pgf90 -Mllvm                           5.44s
exchange/conjg, gfortran -O3                     5.49s
*(0d0,1d0), gfortran -O3 -fno-signed-zeros       6.78s
exchange, gfortran -O3                           6.78s
*(0d0,1d0), gfortran-7 -O3                       6.79s
*(0d0,1d0), gfortran-7 -O3 -march=sandybridge    7.45s
*(0d0,1d0), pgf90 -Mllvm                         8.16s
exchange, pgf90                                  9.52s
AVX intrinsics, gcc                             14.89s
exchange, gfortran                              23.51s
*(0d0,1d0), gfortran                            25.38s
exchange/conjg, gfortran                        41.23s

I believe that the theoretical minimum time for this test is 1.35s, and the lowest I have achieved (by hand-unrolling the AVX intrinsics example) is 1.67s.

Conclusions

This example is a little simple, but it shows that full complex-complex multiplication can be as fast as reordering a vector in order to multiply by i, and much faster if the reordering is done badly. There is no difference between AVX and AVX2 for the exchange method, but the addition of FMA in AVX2 does improve the speed of the methods using multiplication.

But the reordering needs just two registers (one to hold the elements and one extra for the constant to be xored), whereas the full complex-complex multiplication seems to need six. Furthermore, although the throughput may be similar, the latency for the exchange method is two clock-cycles (two instructions, each of single-cycle latency), whereas for the full complex-complex multiplication it is about ten cycles (the multiply and FMA are four cycles each on Skylake, plus two further cycles for the vunpckhpd and vmovddup).

In more complex loops this will favour the exchanging method, so it is a pity that compilers are so poor at it.

It is also an example where gfortran's -fno-signed-zeros is counter-productive, particularly for gfortran 8.2.0, and where PGI's optional use of LLVM is very mixed: bad for full multiplication, but good for exchange.