ifort CPU options

Intel's compiler suite, and in particular its Fortran compiler, has a good reputation for performance amongst those writing numerically-intensive code. However, there are a couple of issues to be aware of in its instruction-set choices.

Vectorisation of the complex type

In general Intel's approach to vectorisation is that each element of a vector register performs a separate iteration of a loop. This leads to several restrictions on the contents of the loop -- no function calls, no conditionals, no unexpected exits, and every iteration independent. The details are a little more complicated, but in general vectorisation occurs over consecutive iterations, not within a single iteration.
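To illustrate those restrictions, here is a sketch in C (for brevity; the same shapes apply to Fortran DO loops) of a hypothetical loop which satisfies them and one which does not:

```c
#include <stddef.h>

/* Vectorisable: every iteration is independent, and there are no
   function calls, no conditionals and no early exits, so consecutive
   iterations can be packed into the lanes of a vector register. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Not vectorisable as written: iteration i reads the result of
   iteration i-1, so consecutive iterations cannot run in parallel
   vector lanes. */
void prefix_sum(size_t n, double *x)
{
    for (size_t i = 1; i < n; i++)
        x[i] = x[i] + x[i - 1];
}
```

A vectorising compiler will typically report the first loop as vectorised and the second as having a loop-carried dependence.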

A possible exception is the use of the complex datatype. A complex number looks as though two-element vector registers were designed for it, but this is not quite true. It is quite hard to perform complex-complex multiplication using Intel's original SSE2 instruction set, so Intel's compiler defaults to treating complex numbers as little more than a pair of scalars.

But, as soon as one tells the compiler to use the SSE3 instruction set, as introduced with the "Prescott" revision of the Pentium 4 in 2004, this changes: it will then keep a complex variable in a single two-element vector most of the time, even if no vectorisation of the loops is possible. So, if you are using complex data, you may want to change the compiler's default instruction set choice. (As far as I can tell, the only 64-bit x86-64 CPUs which do not support SSE3 are the original Athlon64 and Opteron; AMD updated its CPUs in 2005 to support SSE3.)
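The difficulty is visible in the scalar arithmetic; a sketch in C (Fortran's COMPLEX type performs the same operations):

```c
typedef struct { double re, im; } cplx;

/* (a.re + i*a.im) * (b.re + i*b.im):
   the real part subtracts the cross product a.im*b.im while the
   imaginary part adds a.im*b.re, so one lane must subtract while the
   other adds, and b's elements are needed both in order and swapped.
   SSE3's movddup (duplicate one element) and addsubpd (subtract in
   the low lane, add in the high lane) map onto this directly; SSE2
   has no cheap equivalent, hence the compiler's default treatment of
   complex values as pairs of scalars. */
cplx cmul(cplx a, cplx b)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

For example, cmul applied to 1+2i and 3+4i yields -5+10i.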

Instruction set choices

Intel's CPUs have evolved considerably over time, in part by adding new instructions. A recent CPU (Haswell and newer) supports vectors of four double-precision elements, and can start two fused multiply-add instructions each clock-cycle, so a total of 16 DP FLOPS per Hz. But if running an instruction sequence compatible with the original Core2 or Core i7, then its performance is just four DP FLOPS per Hz, as there will be no fused multiply-add instructions, and no use of vector lengths greater than two.
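The arithmetic behind these figures can be written out explicitly; a small C sketch:

```c
/* Peak double-precision flops per clock cycle: the number of units
   that can start an operation each cycle, times vector lanes per
   operation, times flops per operation. */
int peak_dp_flops_per_hz(int units, int lanes, int flops_per_op)
{
    return units * lanes * flops_per_op;
}

/* Haswell: two FMA units, 256-bit vectors (4 doubles), and an FMA
   counts as two floating-point operations (a multiply and an add),
   so peak_dp_flops_per_hz(2, 4, 2) gives 16.

   Core2-compatible code: an add and a multiply can start per cycle,
   each on a 128-bit vector (2 doubles), one flop per lane, so
   peak_dp_flops_per_hz(2, 2, 1) gives 4. */
```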

For some, life is easy: all the machines they run on have the same CPUs, and they simply compile for that CPU. For others, faced with a mixture of CPUs, or distributing binaries to other people, the choices are harder.

Intel's compilers can produce multiple copies of speed-critical sections of code, and switch between them at run-time depending on the processor's features. This is almost ideal, save that they refuse to switch when running on AMD processors, such as Athlons and Zens.
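The mechanism can be sketched with the GCC/Clang builtin __builtin_cpu_supports; this is an illustration of feature-based dispatch, not Intel's actual dispatcher:

```c
/* Sketch of run-time code-path selection on x86 using a GCC/Clang
   builtin.  A multi-versioned executable makes a choice like this
   once at start-up; Intel's generated dispatcher additionally checks
   the CPU vendor string, which is why AMD CPUs fall through to the
   baseline path however capable they are. */
const char *best_path(void)
{
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
    if (__builtin_cpu_supports("avx"))
        return "avx";
    if (__builtin_cpu_supports("sse3"))
        return "sse3";
    return "baseline";
}
```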

Base, and alternatives

Intel's compilers produce executables containing a "base" execution path, which will execute on non-Intel CPUs and on Intel CPUs not capable of one of the other paths, together with alternative paths available only to Intel CPUs supporting the necessary features.

The "base" path is set by -m or -march= options. The most useful ones are:

  -msse3               (also with -ax and -x)
  -mavx
  -march=corei7        (effectively sse3, not with -ax or -x)
  -march=core-avx
  -march=core-avx2     (avx with FMA)

Optional code paths are added with the -ax option. This may contain multiple, comma-separated, targets, and the target names are the same as the -march= options above.

The option -x is similar to -march= save that it requires an Intel CPU for the base execution path. One could also view it as omitting the base version, and having just a single alternative, Intel-only, path.

The option -xHost sets the architecture to that of the computer used for the compilation. If used on a non-Intel CPU it produces an executable which checks for the required features at run-time, but which will run on non-Intel CPUs of suitable capability. There is no corresponding -mHost.

AVX-512 support, -[a]xcommon-avx512, currently has no corresponding -m or -march= option.

For a demonstration of this on an AMD Ryzen CPU:

pc27:~/hello$ ifort hello.f90 
pc27:~/hello$ ./a.out 
 Hello, F90 world.                       
pc27:~/hello$ ifort -msse3 hello.f90 
pc27:~/hello$ ./a.out 
 Hello, F90 world.                       
pc27:~/hello$ ifort -axsse3 hello.f90 
pc27:~/hello$ ./a.out 
 Hello, F90 world.                       
pc27:~/hello$ ifort -xsse3 hello.f90 
pc27:~/hello$ ./a.out 
Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2 and SSE3 instructions.

Without proof, I assert that the first does not use SSE3 on any CPU. The second (-msse3) will use SSE3 on all CPUs; the third (-axsse3) uses SSE3 on Intel CPUs but not AMD ones, for which it carries an SSE2 code path. The final one (-xsse3) simply refuses to run on non-Intel CPUs, even if they support the relevant instruction set.

With the -m options there is no runtime checking of whether the CPU supports the required instructions. A mismatch will cause an illegal instruction exception once an unsupported instruction is encountered. So, using a Core2,

pc45:~/hello$ ifort -march=core-avx2 hello.f90
pc45:~/hello$ ./a.out 
forrtl: severe (168): Program Exception - illegal instruction

Recommendations

Compiling with -msse3 -axavx would seem to be a good start, especially if you suspect that FMA will be of little benefit. If you find FMA attractive, then -msse3 -axavx,core-avx2 is an option.

If you only ever run on one architecture, then -xHost seems simplest.

AMD and AVX

Most of Intel's CPUs from Haswell onwards contain two floating-point execution units, each capable of performing an add, a multiply, or a fused multiply-add on a 256-bit vector. AMD has concentrated more on integer performance, and its floating-point execution units are less impressive: two units capable of addition on 128-bit vectors, and two capable of multiplication on 128-bit vectors.

AMD's Zen (Ryzen and Threadripper) architecture supports the 256-bit AVX2 instructions, but executes each one in a single 128-bit unit, starting the two halves in consecutive cycles. Had it instead been presented with a pair of 128-bit vector instructions, these could have been issued in the same cycle to two different execution units. So for AMD using 128-bit vectors can be faster, and using 256-bit ones is rarely much of an advantage (though it does save on instruction decoding). For Intel the use of 256-bit vectors is likely to be faster than 128-bit ones. So the above recommendation, which restricts AMD CPUs to 128-bit vectors, is not so bad in practice...

Caveat Lector

I believe that the above is correct, as of 2018. Please advise me of any errors!

On Intel CPUs, to see how many instructions of which vector length were used, the mflops utility may help.