Intel has a useful, well-optimised Math Kernel Library (MKL), which is a component of its compiler suite. It provides an optimised BLAS and FFTs, amongst other things. Herewith a few notes on its use.
To link against it, the option is simply:
-mkl
in the position that one would normally specify a library, but note there is no l prefix. This links with the threaded version. To obtain the serial version use
icc -o wonder my_wonder_code.c -mkl=sequential
The threaded version defaults to using as many threads per process as there are cores in the node. If you are already running as many MPI processes as there are cores in the node, this oversubscribes the node, so you should either link with the serial library or set the environment variable OMP_NUM_THREADS to 1 at runtime.
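For a hybrid run which uses fewer MPI processes than there are cores, one might instead divide the cores between the processes. The following is only a rough sketch: the helper threads_per_rank and the figures of 16 cores and 4 processes are invented for illustration, and only OMP_NUM_THREADS comes from the notes above.

```shell
#!/bin/sh
# Hypothetical helper: share the cores of a node evenly between MPI processes.
# Arguments: $1 = cores in the node, $2 = MPI processes on the node.
threads_per_rank() {
    cores=$1
    ranks=$2
    t=$((cores / ranks))
    # Never ask for fewer than one thread per process.
    if [ "$t" -lt 1 ]; then t=1; fi
    echo "$t"
}

# Illustrative figures: 16 cores shared by 4 MPI processes.
OMP_NUM_THREADS=$(threads_per_rank 16 4)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With as many processes as cores this gives one thread per process, which is the case described above.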
MKL automatically uses the newer features (AVX, AVX2, AVX-512) of Intel processors which it recognises. Its feature detection is not excellent: in particular, very new Intel processors can confuse it, and cause it to run a basic SSE2 code path. One has a small amount of control over this via the environment variable MKL_ENABLE_INSTRUCTIONS, which can be set (currently) to one of SSE4_2, AVX, AVX2 or AVX512. Its meaning is confusing, and depends on how MKL classifies the processor.
If the code is run on a recognised Intel processor, it will limit the instruction set used to the lesser of the recognised features of the processor, and the setting of this variable. So one can use this to disable (rather than enable) parts of the instruction set.
If the code is run on an unrecognised Intel processor, it will limit the instruction set used to the lesser of the actual features of the processor, and the setting of this variable. So if one's MKL fails to recognise a Kaby Lake CPU, with this variable unset one gets the basic SSE2 code-path, and with it set to AVX2 one gets an AVX2 code-path.
If the CPU is non-Intel (e.g. AMD), a basic SSE2 code-path is used regardless of the setting of this variable. This is likely to be much slower than the CPU is capable of. In this case, one may be much better off with OpenBLAS.
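The three cases above can be summarised in a toy model. This is merely a sketch of the rules as described in these notes, not of MKL's actual dispatch code: the function names are invented, the ordering SSE2 < SSE4_2 < AVX < AVX2 < AVX512 is assumed, and for a recognised processor the recognised features are taken to be the same as the actual ones.

```shell
#!/bin/sh
# Rank of each instruction set, lowest first. Only the ordering matters.
isa_rank() {
    case "$1" in
        SSE2)   echo 0 ;;
        SSE4_2) echo 1 ;;
        AVX)    echo 2 ;;
        AVX2)   echo 3 ;;
        AVX512) echo 4 ;;
        *)      echo "unknown instruction set: $1" >&2; return 1 ;;
    esac
}

# Print whichever of two instruction sets is the lesser.
lesser_isa() {
    a=$(isa_rank "$1") || return 1
    b=$(isa_rank "$2") || return 1
    if [ "$a" -le "$b" ]; then echo "$1"; else echo "$2"; fi
}

# mkl_code_path VENDOR RECOGNISED CPU_ISA [SETTING]
#   VENDOR      intel, or anything else for a non-Intel CPU
#   RECOGNISED  yes or no: does this MKL recognise the processor?
#   CPU_ISA     best instruction set the CPU really supports
#   SETTING     value of MKL_ENABLE_INSTRUCTIONS, omitted if unset
mkl_code_path() {
    vendor=$1; recognised=$2; cpu=$3; setting=${4:-}
    if [ "$vendor" != intel ]; then
        # Non-Intel: basic SSE2 whatever the setting.
        echo SSE2
    elif [ -z "$setting" ]; then
        # Variable unset: full features only if the CPU is recognised.
        if [ "$recognised" = yes ]; then echo "$cpu"; else echo SSE2; fi
    else
        # Variable set: the lesser of the CPU's features and the setting.
        lesser_isa "$cpu" "$setting"
    fi
}

# The Kaby Lake example from the text: unrecognised, but AVX2-capable.
mkl_code_path intel no AVX2         # variable unset: prints SSE2
mkl_code_path intel no AVX2 AVX2    # variable set to AVX2: prints AVX2
```

In practice one would simply set MKL_ENABLE_INSTRUCTIONS in the environment before running the program; the model is only meant to make clear which code-path to expect.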
For more information, see Intel's documentation.
To see how many instructions of which vector length were used, the mflops utility may help.