mflops and Intel performance counters
Intel CPUs have hardware event counters which can be used to count a diverse range of events whilst a program runs. As someone interested in the performance of numerical code, I find these very useful for quickly reporting the number of MFLOPS which a program has achieved whilst running. If the number is a good percentage of the CPU's theoretical maximum performance, then there is little scope for further tuning, and if not, there may be scope.
These counters can also be used to determine the proportion of operations which execute on vector lengths of two, four, and, where relevant, eight.
This page restricts consideration to double precision arithmetic, and Linux.
Intel has long provided hardware counters, which count with zero overhead (reading them does incur a slight overhead). They are documented in chapter 19, volume 3B, of Intel's Software Developer's Manual. Unlike almost all other aspects of Intel's series of x86 processors, there is no guaranteed compatibility between different generations of CPU. Each CPU offers a different selection of events one can count, with different numeric codes to describe them.
The Linux kernel has supported reading these counters via perf_event_open for a long while (it was introduced in the 2.6 series, and updated in the 3.x series). However, glibc provides no wrapper for it, which puts some off. Here we show some demonstration code which uses the counters, and link to other projects.
The Linux interface is documented by man perf_event_open. The call returns a file descriptor, which can be read like any other.
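As an illustration of the interface (not the mflops code itself), here is a minimal sketch which opens a counter and reads it through the returned file descriptor. It assumes x86_64 Linux, where the syscall number is 298, and it counts the kernel's generic "instructions retired" event rather than any CPU-specific floating point event. The names PerfEventAttr, open_counter and read_counter are invented for this example, and only the first 64 bytes of the attribute structure are described.

```python
import ctypes
import errno
import os
import platform
import struct

SYS_PERF_EVENT_OPEN = 298        # assumption: x86_64 syscall number
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
FLAG_EXCLUDE_KERNEL = 1 << 5     # bit 5 of the attribute flag word
FLAG_EXCLUDE_HV = 1 << 6         # bit 6 of the attribute flag word


class PerfEventAttr(ctypes.Structure):
    """First 64 bytes of struct perf_event_attr (PERF_ATTR_SIZE_VER0)."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),
    ]


def open_counter(event_type, config):
    """Return a file descriptor counting `config` for the calling process."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        raise OSError(errno.ENOSYS, "this sketch assumes x86_64 Linux")
    attr = PerfEventAttr()
    attr.type = event_type
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = config
    attr.flags = FLAG_EXCLUDE_KERNEL | FLAG_EXCLUDE_HV  # user-space only
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # Arguments: attr, pid=0 (this process), cpu=-1 (any), group_fd=-1, flags=0
    fd = libc.syscall(ctypes.c_long(SYS_PERF_EVENT_OPEN), ctypes.byref(attr),
                      ctypes.c_long(0), ctypes.c_long(-1),
                      ctypes.c_long(-1), ctypes.c_ulong(0))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def read_counter(fd):
    """Read the 64-bit count with an ordinary read() on the descriptor."""
    return struct.unpack("=Q", os.read(fd, 8))[0]


if __name__ == "__main__":
    try:
        fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS)
    except OSError as exc:
        print("perf_event_open failed:", exc)
    else:
        start = read_counter(fd)
        sum(range(100000))               # some work to measure
        print("instructions retired:", read_counter(fd) - start)
        os.close(fd)
```

A CPU-specific event would instead be requested as a raw event, with the numeric code taken from Intel's documentation for the CPU in question.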
The hardware usually provides four counters, and the counters are generally 48 bits wide. The code presented here makes no attempt to check for overflow. A counter incrementing at 4GHz will overflow after about 19.5 hours.
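The overflow time follows directly from the counter width and the rate at which it increments. A small sketch of the arithmetic, in which the function name overflow_hours is invented for this example:

```python
def overflow_hours(bits=48, rate_hz=4e9):
    """Hours until a counter `bits` wide wraps when incremented at `rate_hz`."""
    return 2 ** bits / rate_hz / 3600.0


if __name__ == "__main__":
    print(f"{overflow_hours():.1f} hours")   # prints "19.5 hours"
```

An event which fires less often than once per clock cycle takes correspondingly longer to overflow.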
This code counts events described by Intel as:
These show the number of double precision instructions executed with vector lengths of one, two, four and eight respectively. The total number of floating point operations is then trivially calculated, and, if the time is known, the MFLOPS achieved too. It is also trivial to calculate the percentage of scalar operations which occurred at the various vector lengths. Intel makes life easy, in that a fused multiply-add instruction adds two to the relevant counter.
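The calculation described above might look like the following sketch. It is not taken from the mflops source: the function name summarise and the counter values in the example are made up purely to show the arithmetic.

```python
VECTOR_LENGTHS = (1, 2, 4, 8)


def summarise(counts, seconds):
    """counts[i] is the counter value for VECTOR_LENGTHS[i]."""
    # Each counted instruction performs one operation per vector element.
    ops = [n * vl for n, vl in zip(counts, VECTOR_LENGTHS)]
    total = sum(ops)
    percent = [100.0 * o / total for o in ops] if total else [0.0] * len(ops)
    return {
        "total_ops": total,
        "mflops": total / seconds / 1e6,
        "percent": percent,
    }


if __name__ == "__main__":
    # Invented counter values, not measurements.
    result = summarise((1000, 500, 250, 1000), seconds=0.001)
    print("Total FP ops:", result["total_ops"])
    print(f"MFLOPS: {result['mflops']:.1f}")
    for vl, pc in zip(VECTOR_LENGTHS, result["percent"]):
        print(f"Vector length {vl}: {pc:.2f}%")
```

Because a fused multiply-add already adds two to its counter, no separate correction for FMA instructions is needed.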
$ mflops dgesv5k
Linpack 5000x5000
   factor     solve     total    mflops      unit     ratio
1.580E+00 0.000E+00 1.580E+00 5.277E+04 3.790E-05 2.821E+01
Time: 2.13905s
Total FP ops: 8.4182e+10
MFLOPS: 39354.8
Vector length 1: 0.51%
Vector length 2: 0.12%
Vector length 4: 0.03%
Vector length 8: 99.34%
Av vector length: 7.70
This shows the Linpack benchmark being run, and the last eight lines are produced by the mflops utility. The benchmark reports 52.77 GFLOPS, but if one includes the initialisation overheads this drops to the 39.35 GFLOPS reported by this utility.
The vectorisation report means that 99.34% of scalar operations occurred with a vector length of eight. The percentage of instructions which had a vector length of eight is only about 95.6%. Thus the average vector length of an instruction is 7.7.
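The step from percentages of operations to percentages of instructions can be checked with a short sketch. The function name instruction_view is invented here; the input figures are those of the report above.

```python
def instruction_view(op_percent):
    """op_percent maps vector length -> % of scalar operations at that length."""
    # An instruction of vector length vl accounts for vl scalar operations,
    # so dividing by vl gives relative instruction counts.
    instr = {vl: p / vl for vl, p in op_percent.items()}
    total_instr = sum(instr.values())
    share = {vl: 100.0 * n / total_instr for vl, n in instr.items()}
    average = sum(op_percent.values()) / total_instr
    return share, average


if __name__ == "__main__":
    share, average = instruction_view({1: 0.51, 2: 0.12, 4: 0.03, 8: 99.34})
    print(f"{share[8]:.1f}% of instructions, average length {average:.2f}")
    # prints "95.6% of instructions, average length 7.70"
```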
But one can immediately conclude that this code is well-vectorised, and achieving a good proportion of the peak theoretical performance of the CPU (a 2.1GHz Xeon Gold with a theoretical peak of around 65 GFLOPS).
The mflops code can also report raw counter values (with the -v flag), and can count events related to IPL rather than MFLOPS (with the -ipl flag). What it cannot do is count MFLOPS or vectorisation on Haswell or Broadwell CPUs -- they lack relevant events. It works on Core 2, Nehalem, Sandy Bridge, Ivy Bridge, Skylake and Kaby Lake. It does not (currently) support AMD CPUs.
This code relies on being able to read the performance counter events without being privileged. Some modern distributions do not permit this by default (Ubuntu 16.04LTS, on which it was developed, does). If this is a problem,
# echo 1 > /proc/sys/kernel/perf_event_paranoid
(as root) will fix it, at, presumably, some risk to security.
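Before changing the setting, one can check its current value. The following sketch reads the same file; the helper names are invented, and the assumption that any value of 1 or below is sufficient is taken from the command above rather than from the kernel documentation.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def parse_level(text):
    """Convert the file's contents (e.g. "2\\n") to an integer."""
    return int(text.strip())


def needs_change(level, wanted=1):
    """True if the setting is stricter than the value suggested in the text."""
    return level > wanted


if __name__ == "__main__":
    if PARANOID_PATH.exists():
        level = parse_level(PARANOID_PATH.read_text())
        state = "too strict" if needs_change(level) else "permissive enough"
        print(f"perf_event_paranoid = {level} ({state})")
    else:
        print("perf_event_paranoid not found; is this Linux?")
```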
The results for threaded code may not be sensible. If the code is threaded with OpenMP, setting the environment variable OMP_NUM_THREADS to 1 may help.
Other projects using the kernel's interface to the hardware performance counters: