MJ Rutter's Linpack Page

Linpack is a benchmark which has been around since 1979, so it has the advantage of stability. It involves solving a system of simultaneous linear equations, usually many thousands of them. One major reason for its relevance today is that it is used to generate the Top500 list of the world's fastest supercomputers.

No benchmark which attempts to reduce a computer's performance to a single number is going to be perfect. Linpack has an emphasis on floating point performance rather than memory bandwidth, and it allows an arbitrary amount of code (and algorithm) tuning, provided that reasonable accuracy is maintained.

I believe one of its main uses today is in discovering how well optimised a given linear algebra package is. Most will offer to solve simultaneous linear equations, but not all do so in an efficient, cache-friendly manner, and not all make use of multiple cores. Indeed, on the computer I am using, I seem to have half a dozen different libraries which independently implement the dgesv routine, which is the core of Linpack. Hence this page, which looks at their performance, and a corresponding page of Linpack sources.
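
For the curious, dgesv can be called directly from Python. A minimal sketch, assuming a scipy built against some LAPACK (scipy is not one of the libraries timed below):

  import numpy as np
  from scipy.linalg.lapack import dgesv   # low-level wrapper around LAPACK's dgesv

  n = 1000
  a = np.random.rand(n, n) - 0.5          # random coefficient matrix
  b = a.sum(axis=1).reshape(n, 1)         # RHS chosen so the exact solution is all ones

  lu, piv, x, info = dgesv(a, b)          # LU factorise and solve in one call
  assert info == 0 and np.allclose(x, 1.0)

  # numpy.linalg.solve reaches the same LAPACK routine, via whichever
  # BLAS/LAPACK library numpy was built against:
  x2 = np.linalg.solve(a, b)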

Results

Results, in GFLOPS, for Linpack run in many different ways. The array size was 5000x5000 and the computer was a quad core 3.4 GHz Haswell, with a theoretical peak performance of 54.4 GFLOPS per core. The Haswell was introduced in 2013 and was not superseded until late 2015. The machine used was running OpenSuSE 13.1.

MKL, threaded                                   125 GFLOPS
numpy, OpenBLAS, threaded                       105 GFLOPS
Matlab                                           95 GFLOPS
MKL, serial                                      43 GFLOPS
numpy, OpenBLAS, serial                          39 GFLOPS
C++, eigen, g++ -O3                              12 GFLOPS
Fortran/C, NAG Mk 24                            4.9 GFLOPS
numpy, netlib                                   3.3 GFLOPS
octave, netlib                                  3.3 GFLOPS
Fortran/C, netlib                               3.3 GFLOPS
Fortran, original source                        2.4 GFLOPS
Java, Oracle java                               2.1 GFLOPS
Java, gcj -fno-bounds-check -fno-store-check    1.8 GFLOPS
Java, gcj                                       1.4 GFLOPS
C++, eigen, g++                                 0.3 GFLOPS
Java, Oracle java -Xint                         0.1 GFLOPS

And a page of the sources for the above.

Problem

The original implementation of dgesv has no awareness of caches and uses a single thread. This was entirely appropriate for the machines for which it was written. Since the 1970s CPU floating point performance has increased much more than main memory performance, and multiple levels of caches are now ubiquitous. So too are multiple cores per CPU. A version of Linpack which recognises these changes performs much better.
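
To make the contrast concrete, here is a sketch, in numpy and purely for illustration (the original is Fortran), of the kind of unblocked, right-looking factorisation the 1970s code performs: every step updates the whole trailing matrix with a rank-1 operation, so for a large matrix the data is streamed through memory once per column with no cache reuse, and nothing is threaded.

  import numpy as np

  def lu_unblocked(a):
      """Gaussian elimination with partial pivoting, one column at a time.
      Illustrative of the original unblocked approach: a level-1 style update
      of the whole trailing matrix at every step, no blocking, one thread."""
      a = a.copy()
      n = a.shape[0]
      piv = np.arange(n)
      for k in range(n - 1):
          p = k + np.argmax(np.abs(a[k:, k]))       # choose pivot row
          if p != k:
              a[[k, p]] = a[[p, k]]                 # swap rows
              piv[[k, p]] = piv[[p, k]]
          a[k+1:, k] /= a[k, k]                     # multipliers (column of L)
          a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])  # rank-1 trailing update
      return a, piv

A blocked implementation such as LAPACK's dgetrf instead factorises a narrow panel of columns this way and applies the accumulated update to the trailing matrix as a single matrix-matrix multiply, which is both cache-friendly and straightforward to thread.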

Conclusions

The default LAPACK library used by gcc and gfortran, as well as by python2, python3 and octave, was unimpressive, managing 3.3 GFLOPS (and using just one core). This is less than 2% of the theoretical peak performance of the computer. By switching to better LAPACK libraries, especially those provided with the Intel compiler suite, much better performance was obtained: over thirty times faster. This difference in speed, produced purely by software, is so large that an early 2008 MacBook Pro, with a 2.4 GHz Core2 processor but with a reasonable system LAPACK library, can achieve 7.7 GFLOPS on the python benchmark, comfortably beating a desktop seven years younger.

There are faster free versions of LAPACK, such as ATLAS and OpenBLAS. If python and octave had been built to use these, and they can be, they would have been much faster. OpenBLAS is pretty much a complete replacement for the netlib LAPACK library, but ATLAS lacks some symbols and tends not to work as a post-compilation swap. OpenBLAS is not only over ten times faster on a single core, but can also utilise multiple cores.
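
One practical note, not from the runs above: when comparing the serial and threaded rows, the number of BLAS threads is usually pinned through environment variables, which must be set before the library is loaded. A sketch for the numpy case:

  import os
  # These must be set before numpy loads its BLAS/LAPACK shared library.
  os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS builds
  os.environ["MKL_NUM_THREADS"] = "1"        # MKL builds
  os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-threaded builds generally

  import numpy as np   # dgesv and friends will now stick to one core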

(OpenSuSE 13.1, which was used here, has a mechanism for switching between netlib and ATLAS. It is a pity that it seems not to work with either python/numpy, which is left with unresolved symbols, or octave, which segfaults.)

The eigen C++ package uses its own implementation of LAPACK, although it can also be configured to use MKL. I believe Matlab uses MKL internally.

The commercial NAG numerical library offers a choice between its own implementation of LAPACK, or using a third party implementation. It recommends the latter for speed, with good reason it would seem.

Java, when compiled (including by the JIT), is within a factor of two of the speed of the compiled languages here. When run interpreted (-Xint) it slows by more than a factor of twenty, producing the slowest result here.

The bottom line is that good libraries, which include both MKL and OpenBLAS, offer respectable performance. Some libraries still in use in 2015 are very much poorer, by such a large factor that they could make a current computer slower than a ten-year-old one.

(Another machine I have access to, running Ubuntu 12.04.5 LTS, has a default LAPACK library which is almost a factor of three better than this, i.e. still over a factor of three slower than it could be.)

Maths

The coefficient array is filled with random numbers in the range -0.5 to +0.5. The RHS vector is filled with the sum of each row of the matrix, so that the correct answer is that all unknowns are precisely one. The number of floating-point operations required, assuming LU factorisation with partial pivoting and conventional matrix-matrix multiplication, is 2N³/3 + 2N². (See further notes on operation count.)
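
As a concrete illustration of this setup (a numpy sketch, rather than one of the sources linked above), the following builds the matrix, solves it, checks the answer, and converts the elapsed time into GFLOPS using the operation count above:

  import time
  import numpy as np

  n = 5000
  a = np.random.rand(n, n) - 0.5        # coefficients uniform in [-0.5, 0.5)
  b = a.sum(axis=1)                     # RHS = row sums, so every unknown should be 1

  t0 = time.time()
  x = np.linalg.solve(a, b)             # LAPACK dgesv underneath
  elapsed = time.time() - t0

  flops = 2.0 * n**3 / 3.0 + 2.0 * n**2
  print("max error %.2e, %.1f GFLOPS" % (abs(x - 1.0).max(), flops / elapsed / 1e9))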