# Scaling by i in Fortran

Following the theme of the page on scaling a complex vector by a real number, this page discusses the even simpler prospect of scaling a complex vector by the imaginary constant i. All that needs to be done is the exchange of the real and imaginary parts, and a sign reversal for one of them. But what do current compilers make of this?

Firstly, there are many different ways of expressing this operation. Here we discuss:

```fortran
subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

!$omp simd
  do i=1,n
    c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i
```
but with many variants, such as:
```fortran
  c(i)=c(i)*cmplx(0d0,1d0,kind(1d0))
```
or
```fortran
  c(i)=cmplx(-aimag(c(i)),real(c(i)),kind(1d0))
```
or
```fortran
  c(i)=conjg(c(i))
  c(i)=cmplx(aimag(c(i)),real(c(i)),kind(1d0))
```
or
```fortran
  zi=cmplx(0d0,1d0,kind(1d0))
!$omp simd
  do i=1,n
    c(i)=c(i)*zi
  enddo
```
One could even give up on Fortran, lose all hope of portability, and write using Intel's C AVX2 intrinsics:
```c
#include <stdint.h>
#include <immintrin.h>

/* Trailing underscore matches gfortran's default name mangling. */
void scale_i_(__m256d_u *c, int *n){
  int i,n2;
  /* Sign mask: -0.0 in the low lane of each complex pair. */
  __m256d x=_mm256_set_pd(0.0,-0.0,0.0,-0.0);
  n2=*n/2;                       /* two complex doubles per AVX register */

  for(i=0;i<n2;i++){
    *c=_mm256_permute_pd(*c,5);  /* swap real and imaginary parts */
    *c=_mm256_xor_pd(*c,x);      /* negate the new real parts */
    c++;
  }
}
```

This style makes it very clear that there are no multiplications, and just two registers are needed, one to store the magic constant for xor'ing the signs, and one to hold the elements of c.

## *(0d0,1d0)

Unless otherwise noted, identical code is produced for `*cmplx(0d0,1d0,kind(1d0))`.

### gfortran

Versions 7.3.0 and 8.2.0 are not significantly different.

`-march=haswell -fopenmp-simd`: scalar, and full complex-complex multiplication.

`-O3 -march=haswell -fopenmp-simd`: fully vectorised, but still the full complex-complex multiplication. From 8.2.0 this is one vpermpd, two vfmadd???pd instructions, and a vshufpd. 7.3.0 unrolls twice but cleverly uses just two vfmadd??pd's in the unrolled loop, and no other arithmetic instructions: it appears to have eliminated the multiplications by one. Less clever is that this loop contains six vpermpds along with a pair of vshufpds and an asymmetric pair of vunpcklpd/vunpckhpds.

`-O3 -fno-signed-zeros -march=haswell -fopenmp-simd`: fully vectorised, and only shuffles and an xor: each two AVX2 registers' worth of data takes six vpermpds, two vshufpds, two vunpck?pds, and one vxorpd. The code packs the imaginary parts of two vectors into a single register, performs a single xor to change their signs, and then unpacks. All that packing and unpacking to save a single xor is not efficient.

`-O3 -march=sandybridge -fopenmp-simd`: similar to the above, but with no FMA instructions, so slower where FMAs were previously used.

### ifort

`-march=core-avx2`: loop unrolled four times, and has one vmulpd, one vfmaddsub213pd, one vunpckhpd and one vmovddup per un-unrolled iteration.

Adding `-Ofast` changes nothing, and nor does `-fp-model fast=2`.

Changing to `-march=corei7-avx` to exclude the use of FMA instructions still produces very similar code, now with eight multiplies per loop, all of which will be by zero or one! This is slightly slower than the FMA version.

## Exchange

### gfortran

`-O3 -march=haswell -fopenmp-simd`: the code is identical to that produced with `-O3 -fno-signed-zeros` and multiplying by `(0d0,1d0)`. Spotting the equivalence of these two ways of writing the code is impressive. The code produced, less so.

The version with the conjugate is neither vectorised nor unrolled, so the loop contains two scalar loads, one scalar xor, and two scalar stores. It nevertheless runs slightly faster than the single-statement exchange.

### ifort

Ifort unrolls the loop eight times, with two vperm2f128, one vunpckhpd, one vunpcklpd and half a vxorpd instruction per un-unrolled iteration. The heavy unrolling leaves it with no spare register to cache the constant for the xor, so that is reloaded every time it is used. However, turning off unrolling still does not result in this constant being cached in a register.

## AVX intrinsics

AVX intrinsics might be closer to assembler than C, but they are still not assembler, and there is considerable variation in how compilers convert them to assembler. A reasonably good conversion, from `gcc -O3` is

```asm
.L3:
        vpermpd $177, (%rdi), %ymm0
        vxorpd  %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
```

Intel's conversion is similar, but safer as it uses `vmovupd`, as does gcc 8.2.0 -- gcc 7.3.0's code will segfault if the array is not 32-byte aligned. PGI's conversion is

```asm
.loop
        vmovupd (%rdi), %ymm2
        vpermilpd $5, %ymm2, %ymm0
        vmovupd %ymm0, (%rdi)
        vmovupd (%rdi), %ymm2
        vxorpd  %ymm1, %ymm2, %ymm0
        vmovupd %ymm0, (%rdi)
        cmpq    %rax, %rdi
        jb      .loop
```

This is very literal, with `c` being stored back to memory after the permute, and then reloaded into a (different) register for the xor. Still excellent when compared with gcc's unoptimised conversion…

## Run times

The code was timed over 10,000,000 calls to the subroutine with n=1000. The CPU used was a Kaby Lake with a clock speed of 3.7GHz, so one clock-cycle per loop iteration would give a time of 2.7s.

| Code | Compiler and options | Time |
|------|----------------------|------|
| AVX intrinsics | gcc -O3 | 2.16s |
| *(0d0,1d0) | ifort | 2.39s |
| *(0d0,1d0) | ifort -march=corei7-avx | 2.71s |
| AVX intrinsics | ifort/icc | 2.78s |
| AVX intrinsics | gcc-8 -O3 | 2.78s |
| AVX intrinsics | pgi -Mllvm | 2.78s |
| AVX intrinsics | pgi | 3.26s |
| *(0d0,1d0) | gfortran-8 -O3 | 3.96s |
| *(0d0,1d0) | gfortran-8 -O3 -march=sandybridge | 4.14s |
| *(0d0,1d0) | pgf90 | 4.77s |
| exchange | ifort | 5.44s |
| exchange | pgf90 -Mllvm | 5.44s |
| exchange/conjg | gfortran -O3 | 5.49s |
| *(0d0,1d0) | gfortran -O3 -fno-signed-zeros | 6.78s |
| exchange | gfortran -O3 | 6.78s |
| *(0d0,1d0) | gfortran-7 -O3 | 6.79s |
| *(0d0,1d0) | gfortran-7 -O3 -march=sandybridge | 7.45s |
| *(0d0,1d0) | pgf90 -Mllvm | 8.16s |
| exchange | pgf90 | 9.52s |
| AVX intrinsics | gcc | 14.89s |
| exchange | gfortran | 23.51s |
| *(0d0,1d0) | gfortran | 25.38s |
| exchange/conjg | gfortran | 41.23s |

I believe that the theoretical minimum time for this test is 1.35s, and the lowest I have achieved (by hand-unrolling the AVX intrinsic example) is 1.67s.

## Conclusions

This example is rather simple, yet it shows that full complex-complex multiplication can be as fast as reordering the vector elements to multiply by i, and much faster if the reordering is done badly. There is no difference between AVX and AVX2 for the exchange method, but the addition of FMA in AVX2 does improve the speed of the methods using multiplication.

But the reordering needs just two registers (one extra for the constant to be xored), whereas the full complex-complex multiplication seems to need six. Furthermore, although the throughput may be similar, the latency for the exchange method is two clock-cycles (two instructions, each single cycle latency). For the full complex-complex multiplication it is about ten cycles (the multiply and FMA are four cycles each on Skylake, and two further cycles for the unpckhpd and vmovddup).

In more complex loops this will favour the exchanging method, so it is a pity that compilers are so poor at it.

It is also an example where gfortran's `-fno-signed-zeros` is counter-productive, particularly for gfortran 8.2.0, and where PGI's optional use of LLVM is very mixed: bad for full multiplication, but good for exchange.