Sorting in Fortran

Whilst sort routines are part of the standard library in many languages, this is not so in Fortran. As I recently produced a collection of various sort algorithms in Fortran, they are now recorded here in case they are of use to anyone else.

All the routines simply sort an array of real values into ascending order. They are provided with very little explanation or other documentation, and certainly without any guarantees of correctness or fitness for purpose.

Two files are provided, sorts.f90 and bench_all.f90. After downloading both, one can try:

  $ gfortran -O3 -c sorts.f90
  $ gfortran -O3 bench_all.f90 sorts.o
  $ ./a.out

to obtain output like

 Bubble sort                        Shell sort                  
       Size      time/element             Size      time/element
          10   5.99999979E-08                10   3.19999991E-08
         100   2.39999991E-07               100   7.10000023E-08
        1000   2.15999989E-06              1000   1.04999998E-07
       10000   2.13399999E-05             10000   1.77000004E-07
                                         100000   2.24999994E-07
                                        1000000   3.33000003E-07
 Insertion sort                     Dual pivot quicksort        
       Size      time/element             Size      time/element
          10   9.99999994E-09                10   1.69999996E-08
         100   3.99999998E-08               100   3.69999995E-08
        1000   3.30000006E-07              1000   5.60000011E-08
       10000   3.15000011E-06             10000   7.59999992E-08
      100000   3.12699995E-05            100000   9.69999974E-08
                                        1000000   1.16000002E-07

The very different scaling of the different algorithms is clear. The time per element for bubble sort and insertion sort is linear in the size of the array being sorted, so the total time for the sort increases quadratically with array size. In contrast, for the dual pivot quicksort, which for large arrays is the fastest routine given, the time per element barely increases as the array to be sorted grows larger. However, for very small arrays the complexities of the order N.log(N) algorithms mean that an insertion sort is faster.

The routines currently in sorts.f90 are, in approximately increasing order of performance:

Order N2

N^2 scaling

Better than order N2, worse than N.log(N)

Generally order N.log(N)

N.log(N) scaling

Order N


Note that the performance difference between the different N.log(N) algorithms is not great, and the ordering can be changed by minor optimisation differences. For the N2 algorithms, the ordering is very size dependent. The inline selection sort crosses the binary insertion sort algorithm at about 30, and the (intrinsic) selection sort at about 60, in the example given.

For an array size of 1000, the N.log(N) methods all gave times of between 56,000 and 150,000. One would therefore use these in preference to the N2 methods whose times start at 180,000. But it may be surprising that for this size the double insertion sort is the best N2 algorithm, at 180,000, followed by the pair insertion sort at 240,000, then the binary insertion sort at 310,000, and the plain insertion sort scores 320,000. At a small size of around 70, three of the insertion sorts are almost indistinguishable in performance, and the binary version is a factor of two slower.

(Why is the inlined selection sort algorithm ultimately slower than the non-inlined version? The Fortran compiler used expanded the intrinsic maxloc function in a way which contained no conditional branches. It was unable to eliminate the conditional branches from the explicitly inlined code.)

The radix sorts are interesting for their linear scaling, lack of any if statements (i.e. unpredictable branches), and being entirely based on integer operations. Casting a double to an int is not easy in Fortran, and even in C it risks breaking the standard. The (slightly) more readable version is also a stable sort; equal elements retain their ordering. The more optimised version is not, in that equal negative elements reverse their ordering.

The linear scaling of radix sorts is a bit of a cheat. Rather than being N.log(N), they are N.W, where W is the size in bits of the items being sorted. So single precision can be sorted about twice as fast as the double precision example here.

Both radix sorts, and in particular the optimised version, are examples of how not to code. Uncommented, unreadable, and probably containing subtle bugs as a result. They are very slow at small sizes, and the advantages of this optimised version evaporate once it is out of cache.