operf & kcachegrind

Linux on x86_64 has access to hardware-based performance counters which provide low-overhead event counting. One way of accessing these is via the OProfile set of utilities. A useful GUI for analysing the results is KCachegrind, the most recent version of which can be found under KDE applications.
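Both tools are usually available from a distribution's package manager (typically as oprofile and kcachegrind, though names vary). Assuming both are on the PATH, a quick check that they are installed:

$ operf --version
$ kcachegrind --version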

The simple example code profiled here consists of two source files, and is written in Fortran. It can be downloaded with:

$ curl -O https://www.mjr19.org.uk/sw/op2kcg/bench.f90
$ curl -O https://www.mjr19.org.uk/sw/op2kcg/smoothsort.f90

and then compiled with:

  $ gfortran -g -O0 -c smoothsort.f90
  $ gfortran -g -O0 bench.f90 smoothsort.o

The use of -g is necessary to include line-number information in the executable. For this simple example the use of -O0 is necessary to prevent inlining from removing most of the call graph. It should be possible to use any Fortran compiler, perhaps with slight changes to the above options.

(Note that at higher optimisation levels it may be necessary to add -fno-omit-frame-pointer, e.g.

  $ gfortran -g -O3 -fno-omit-frame-pointer ...

in order for the call tree to be available.)
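For instance, an optimised build of this example with gfortran might look like:

  $ gfortran -g -O3 -fno-omit-frame-pointer -c smoothsort.f90
  $ gfortran -g -O3 -fno-omit-frame-pointer bench.f90 smoothsort.o

The profiles shown below, however, were collected from the -O0 build.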

The code can then be run under operf. As operf produces a deep tree of sample files (an oprofile_data directory in the current working directory), one will wish to do this on a local disk, not a remote network volume.

$ operf -gl ./a.out 
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Profiler started
 Smooth sort
       Size      time/element
          10   5.30000008E-08
         100   1.35999997E-07
        1000   1.88000001E-07
       10000   2.36999995E-07
      100000   3.01999989E-07
     1000000   4.41999987E-07

Profiling done.

(There may be a spurious warning about lost samples.)
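The warning about kernel profiling at the start of operf's output is harmless if only user-space samples are wanted. To collect kernel samples too, the setting it mentions can be relaxed (as root; the change does not survive a reboot):

$ sudo sysctl -w kernel.kptr_restrict=0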

To convert the data from the operf run into a form which KCachegrind can read, this example uses op2kcg.

$ curl -O https://www.mjr19.org.uk/sw/op2kcg/op2kcg
$ chmod +x op2kcg
$ ./op2kcg -o out.dat
$ kcachegrind out.dat &

This should result in something like:

[KCachegrind screenshot]
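For a quick sanity check without KCachegrind, OProfile's own reporting tool can produce a flat profile directly from the oprofile_data tree, with something like:

$ opreport --symbols ./a.out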

Interpretation of the Above

By default operf samples on the CPU's clock (unhalted cycles), so the sample count is proportional to time.
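Other hardware events can be sampled instead, using operf's -e option. The available event names are CPU-specific and can be listed with ophelp; the event below is merely illustrative:

$ operf -e LLC_MISSES:6000 -gl ./a.out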

Flat Profile

The colours of the functions have no meaning other than to assist in their identification.

The line for interheap_sift means that 81.82% of samples were in this routine and its children, with 36.17% of the samples in the routine itself (and therefore 45.65% in its one child, sift_in). Similar consideration of smooth_heapify gives 10.57% of samples in it and its children, and 2.98% in it, so 7.59% in its children. Clicking on smooth_heapify reveals this to be 6.55% in sift_in and 1.04% in interheap_sift. So the total percentage of samples expected in sift_in is 52.20%, close to the 52.44% reported.
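Written out, the arithmetic is:

  81.82% - 36.17% = 45.65%   (sift_in when called from interheap_sift)
  10.57% -  2.98% =  7.59%   (children of smooth_heapify)
   7.59% =  6.55% + 1.04%    (sift_in and interheap_sift respectively)
  45.65% +  6.55% = 52.20%   (expected total in sift_in, vs 52.44% reported)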

Call Graph

Because operf simply samples unmodified code, it cannot count the number of calls made. But, by capturing the stack at intervals, it can see the different routes by which a function is called. Here it shows that the subroutine sift_in, in the module sorts, is reached via two different routes. The call counts are simply set to one.

There is a rather complicated way to fix this.


Source Code

Unfortunately calls are listed at the start of each function, not at the points in the function where they occur. This is due to how opreport -gcwf reports stack traces, and is true in both the source and assembler listings.

Here 18% of the total samples are attributed to a single line:

if (array(right).lt.array(left)) then

Assembler ("machine code") can be shown in the same pane as the call graph, via the Machine Code tab.

Navigation

A right click on KCachegrind's panes reveals a context-sensitive menu.

/no-vmlinux

This is a reference to time spent in the kernel. When operf is not given an uncompressed kernel image (via --vmlinux), all kernel-mode samples are attributed to the pseudo-file no-vmlinux.
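If kernel time is of real interest, operf can be pointed at a kernel image with symbols via --vmlinux. The path is distribution-dependent, and the one below is merely illustrative:

$ operf --vmlinux /usr/lib/debug/lib/modules/$(uname -r)/vmlinux -gl ./a.out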