# Scaling by i in Fortran

Following the theme of the page on scaling a complex vector by a real number, this page discusses the even simpler prospect of scaling a complex vector by the imaginary constant i. All that needs to be done is the exchange of the real and imaginary parts, and a sign reversal for one of them. But what do current compilers make of this?

Firstly, there are many different ways of expressing this operation. Here we discuss:

```fortran
subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)
!$omp simd
  do i=1,n
     c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i
```

but with many variants, such as:

```fortran
c(i)=c(i)*cmplx(0d0,1d0,kind(1d0))
```

or

```fortran
c(i)=cmplx(-aimag(c(i)),real(c(i)),kind(1d0))
```

or

```fortran
c(i)=conjg(c(i))
c(i)=cmplx(aimag(c(i)),real(c(i)),kind(1d0))
```

or

```fortran
zi=cmplx(0d0,1d0,kind(1d0))
!$omp simd
do i=1,n
   c(i)=c(i)*zi
enddo
```

One could even give up on Fortran, lose all hope of portability, and write using Intel's C AVX2 intrinsics:

```c
#include <stdint.h>
#include <immintrin.h>

void scale_i_(__m256d_u *c, int *n){
  int i,n2;
  __m256d x=_mm256_set_pd(0.0,-0.0,0.0,-0.0);
  n2=*n/2;
  for(i=0;i<n2;i++){
    *c=_mm256_permute_pd(*c,5);
    *c=_mm256_xor_pd(*c,x);
    c++;
  }
}
```

This style makes it very clear that there are no multiplications, and just two registers are needed, one to store the magic constant for xor'ing the signs, and one to hold the elements of c.

## *(0d0,1d0)

Unless otherwise noted, identical code is produced for `*cmplx(0d0,1d0,kind(1d0))`.

### gfortran

Versions 7.3.0 and 8.2.0 are not significantly different.

`-march=haswell -fopenmp-simd`
: scalar, and full complex-complex multiplication.

`-O3 -march=haswell -fopenmp-simd`
: fully vectorised, and full complex-complex multiplication. From
8.2.0 this comprises one vpermpd, two vfmadd???pd's, and a vshufpd.
7.3.0 unrolls twice but cleverly uses just two vfmadd???pd's in the
unrolled loop, and no other arithmetic instructions. It appears to
have eliminated the multiplications by one. Less clever is that this
loop contains six vpermpds along with a pair of vshufpds and an
asymmetric pair of vunpcklpd/vunpckhpds.

`-O3 -fno-signed-zeros -march=haswell -fopenmp-simd`
: fully vectorised, with only shuffles and an xor. Each two AVX2
register-fulls take six vpermpds, two vshufpds, two vunpck?pds, and
one vxorpd. The code packs the imaginary parts of two vectors into
a single register, performs a single xor to change their signs,
and then unpacks. The extra packing and unpacking to save a single
xor is not efficient.

`-O3 -march=sandybridge -fopenmp-simd`
: similar to the above, but with no FMA instructions, so slower where
FMA was previously used.

### ifort

`-march=core-avx2`
: loop unrolled four times, with one vmulpd, one vfmaddsub213pd,
one vunpckhpd and one vmovddup per un-unrolled iteration.

Adding `-Ofast` changes nothing; similarly `-fp-model fast=2`.

Changing to `-march=corei7-avx` to exclude the use of FMA
instructions still produces very similar code, now with eight multiplies
per loop, all of which will be by zero or one! This is slightly slower than
the FMA version.

## Exchange

### gfortran

`-O3 -march=haswell -fopenmp-simd`
: the code is identical to that produced with `-O3 -fno-signed-zeros`
and multiplying by `(0d0,1d0)`. Spotting the equivalence of these two
ways of writing the code is impressive. The code produced, less so.

The version with the conjugate is completely unvectorised, and is not unrolled either. So the loop contains two scalar loads, one scalar xor, and two scalar stores. It nevertheless runs slightly faster than the vectorised exchange code.

### ifort

Ifort unrolls the loop eight times, with two vperm2f128, one vunpckhpd, one vunpcklpd and half a vxorpd instruction per un-unrolled iteration. The heavy unrolling leaves it with no spare register to cache the constant for the xor, so that is reloaded every time it is used. However, turning off unrolling still does not result in this constant being cached in a register.

## AVX intrinsics

AVX intrinsics might be closer to assembler than C, but they are
still not assembler, and there is considerable variation in how
compilers convert them to assembler. A reasonably good conversion,
from `gcc -O3`, is

```
.L3:
        vpermpd $177, (%rdi), %ymm0
        addq    $32, %rdi
        vxorpd  %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
```

Intel's conversion is similar, but safer as it uses `vmovupd`, as
does gcc 8.2.0 -- gcc 7.3.0's code will segfault if the array is not
32-byte aligned. PGI's conversion is

```
.loop
        vmovupd   (%rdi), %ymm2
        vpermilpd $5, %ymm2, %ymm0
        vmovupd   %ymm0, (%rdi)
        vmovupd   (%rdi), %ymm2
        vxorpd    %ymm1, %ymm2, %ymm0
        vmovupd   %ymm0, (%rdi)
        addq      $32, %rdi
        cmpq      %rax, %rdi
        jb        .loop
```

This is very literal, with `c` being stored back to memory after the
permute, and then reloaded into a (different) register for the xor.
Still excellent when compared with gcc's unoptimised conversion…

## Run times

The code was tested by timing 10,000,000 calls to the subroutine with n=1000. The clock speed of the CPU used, a Kaby Lake, was 3.7GHz, so one clock-cycle per loop iteration would correspond to a time of 2.7s.

| method | time |
|---|---|
| AVX intrinsics, gcc -O3 | 2.16s |
| *(0d0,1d0), ifort | 2.39s |
| *(0d0,1d0), ifort -march=corei7-avx | 2.71s |
| AVX intrinsics, ifort/icc | 2.78s |
| AVX intrinsics, gcc-8 -O3 | 2.78s |
| AVX intrinsics, pgi -Mllvm | 2.78s |
| AVX intrinsics, pgi | 3.26s |
| *(0d0,1d0), gfortran-8 -O3 | 3.96s |
| *(0d0,1d0), gfortran-8 -O3 -march=sandybridge | 4.14s |
| *(0d0,1d0), pgf90 | 4.77s |
| exchange, ifort | 5.44s |
| exchange, pgf90 -Mllvm | 5.44s |
| exchange/conjg, gfortran -O3 | 5.49s |
| *(0d0,1d0), gfortran -O3 -fno-signed-zeros | 6.78s |
| exchange, gfortran -O3 | 6.78s |
| *(0d0,1d0), gfortran-7 -O3 | 6.79s |
| *(0d0,1d0), gfortran-7 -O3 -march=sandybridge | 7.45s |
| *(0d0,1d0), pgf90 -Mllvm | 8.16s |
| exchange, pgf90 | 9.52s |
| AVX intrinsics, gcc | 14.89s |
| exchange, gfortran | 23.51s |
| *(0d0,1d0), gfortran | 25.38s |
| exchange/conjg, gfortran | 41.23s |

I believe that the theoretical minimum time for this test is 1.35s (two complex elements per clock-cycle), and the lowest I have achieved (by hand-unrolling the AVX intrinsic example) is 1.67s.

## Conclusions

This example is admittedly simple. It shows that full complex-complex multiplication can be as fast as reordering a vector to multiply by i, and much faster when the re-ordering is done badly. There is no difference between AVX and AVX2 for the exchange method, but the addition of FMA in AVX2 does improve the speed of the methods that use multiplication.

But the reordering needs just two registers (one extra for the constant to be xored), whereas the full complex-complex multiplication seems to need six. Furthermore, although the throughput may be similar, the latency for the exchange method is two clock-cycles (two instructions, each single cycle latency). For the full complex-complex multiplication it is about ten cycles (the multiply and FMA are four cycles each on Skylake, and two further cycles for the unpckhpd and vmovddup).

In more complex loops this will favour the exchanging method, so it is a pity that compilers are so poor at it.

It is also an example where gfortran's `-fno-signed-zeros` is
counter-productive, particularly for gfortran 8.2.0, and where PGI's
optional use of LLVM is very mixed: bad for full multiplication, but
good for exchange.