operf & kcachegrind
Linux on x86_64 has access to hardware-based performance counters which provide low-overhead event counting. One way of accessing these is via the OProfile set of utilities. A useful GUI for analysing the results is KCachegrind, the most recent version of which can be found under KDE applications.
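Once the OProfile utilities are installed (package names vary by distribution), the events the local CPU can count may be listed; a minimal sketch:
$ ophelp            # lists the hardware events this CPU offers
$ operf --version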
The simple example code profiled here consists of two source files, and is written in Fortran. It can be downloaded with:
$ curl -O https://www.mjr19.org.uk/sw/op2kcg/bench.f90
$ curl -O https://www.mjr19.org.uk/sw/op2kcg/smoothsort.f90
and then compiled with
$ gfortran -g -O0 -c smoothsort.f90
$ gfortran -g -O0 bench.f90 smoothsort.o
The use of -g is necessary to include line number information in the executable. For this simple example the use of -O0 is necessary to stop inlining from making most of the call graph disappear. It should be possible to use any Fortran compiler, perhaps with slight changes to the above options. (Note that at higher optimisation levels it may be necessary to add -fno-omit-frame-pointer, e.g.
$ gfortran -g -O3 -fno-omit-frame-pointer ...
in order for the call tree to be available.)
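To illustrate why this matters, here is a minimal, compilable sketch (hypothetical code, not part of the benchmark): at -O3 gfortran will typically inline inner into outer, so inner never gains a stack frame and disappears from the recorded call graph.

module demo
  implicit none
contains
  subroutine inner(x)        ! at -O3 this is usually inlined into outer
    real, intent(inout) :: x
    x = x*x + 1.0
  end subroutine inner

  subroutine outer(x)
    real, intent(inout) :: x
    call inner(x)            ! once inlined, no call and no stack frame remain
  end subroutine outer
end module demo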
The code can then be run. As operf produces a deep tree with lots of temporary files, one will wish to do this on a local disk, not a remote network volume.
$ operf -gl ./a.out
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Profiler started
 Smooth sort
     Size    time/element
       10    5.30000008E-08
      100    1.35999997E-07
     1000    1.88000001E-07
    10000    2.36999995E-07
   100000    3.01999989E-07
  1000000    4.41999987E-07
Profiling done.
(There may be a spurious warning about lost samples.)
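If kernel samples are wanted, the restriction mentioned in that message can be lifted (as root); a minimal sketch:
$ sudo sysctl kernel.kptr_restrict=0
This is not needed for the purely user-space profile considered here.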
To extract the data from the operf run, this example uses op2kcg.
$ curl -O https://www.mjr19.org.uk/sw/op2kcg/op2kcg
$ chmod +x op2kcg
$ ./op2kcg -o out.dat
$ kcachegrind out.dat &
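The same data can also be summarised as text by OProfile's own reporting tool, which reads the oprofile_data directory that operf created, e.g.
$ opreport --symbols ./a.out
though KCachegrind's presentation is rather easier to navigate.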
This should result in KCachegrind opening and displaying the collected profile.
Interpretation of the Above
By default operf samples the CPU's clock, so sample count is proportional to time.
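Other events can be requested explicitly with -e; a sketch (the event name and sample interval here are illustrative and CPU-dependent; ophelp lists the valid choices):
$ operf -gl -e CPU_CLK_UNHALTED:500000 ./a.out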
Flat Profile
The colours of the functions have no meaning other than to assist in their identification.
The line for interheap_sift means that 81.82% of samples were in this routine and its children, with 36.17% of the samples being in the routine itself (and therefore 45.65% in its one child, sift_in). Similar consideration of smooth_heapify gives 10.57% of samples in it and its children, 2.98% in it, so 7.59% in its children. Clicking on smooth_heapify itself reveals this to be 6.55% in sift_in and 1.04% in interheap_sift. So the total percentage of samples expected in sift_in is 52.20%, which is similar to the 52.44% reported...
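In other words, a routine's inclusive figure is its self figure plus the inclusive figures attributed to its callees, and the numbers above can be checked (all taken from the profile just described):
interheap_sift:  81.82% incl. = 36.17% self + 45.65% in sift_in
smooth_heapify:  10.57% incl. =  2.98% self +  7.59% in children
                  7.59%       =  6.55% in sift_in + 1.04% in interheap_sift
sift_in total:   45.65% + 6.55% = 52.20%   (cf. the 52.44% reported)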
Call Graph
Because operf simply samples unmodified code, it cannot count the number of calls made. But, by capturing the stack at intervals, it can see different call routes. So here it shows that the subroutine sift_in in the module sorts is called via two different routes. The counts are set to one. There is a rather complicated way to fix this.
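For concreteness, the structure is of the following shape (a compilable but purely illustrative sketch, not the real smoothsort.f90): sift_in is a leaf reachable from two distinct callers, so two routes appear in the graph even though the true call counts are unknown.

module sorts_sketch              ! hypothetical stand-in for module sorts
  implicit none
contains
  subroutine sift_in(x)          ! leaf routine: two routes lead here
    real, intent(inout) :: x
    x = x + 1.0
  end subroutine sift_in

  subroutine interheap_sift(x)   ! route 1: interheap_sift -> sift_in
    real, intent(inout) :: x
    call sift_in(x)
  end subroutine interheap_sift

  subroutine smooth_heapify(x)   ! route 2: smooth_heapify -> sift_in
    real, intent(inout) :: x
    call sift_in(x)
  end subroutine smooth_heapify
end module sorts_sketch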
There are some further notes on interpreting the call graph.
Source Code
Unfortunately calls are listed at the start of each function, not at the points in the function where they occur. This is due to how opreport -gcwf reports stack traces, and is true in both the source and assembler listings.
Here 18% of the total samples are being attributed to a single line:
if (array(right).lt.array(left)) then
Assembler ("machine code") can be shown in the call graph pane.
Navigation
A right click on KCachegrind's panes reveals a context-sensitive menu.
/no-vmlinux
This is a reference to a call to the kernel, real or imagined(?!). As operf was not given a vmlinux image with symbols, any kernel samples are lumped together under this placeholder name.
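Should one actually want kernel samples resolved to symbols, a matching vmlinux image can be supplied, with kptr_restrict relaxed as above; a sketch, assuming the distribution's kernel debug symbols are installed (the path varies):
$ operf -gl --vmlinux /usr/lib/debug/boot/vmlinux-$(uname -r) ./a.out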