# Linear Congruential Random Numbers and Parallelisation

Linear congruential pseudorandom number generators have a bad press: they tend to have short repeat periods, and n-tuples do not fill an n-dimensional space uniformly, but rather lie on hyperplanes. They are well described in a Wikipedia article.

But sometimes a fast, and poor, generator is all that is needed. Linear congruential generators have the advantage that their state is small, and they are easy to step forwards without needing to calculate every intermediate value in the sequence.

Thus one can write OpenMP code to fill in an array with random numbers, and obtain the same results no matter how many threads are used.

The basic equation is

x_{n+1}=(a*x_{n}+c) mod m

For well chosen a, m and c the repeat length can be up to m. For speed m is often chosen to be a power of two, so that the modulus operation becomes simply a bitwise and. A popular set of choices is:

m=2^{31}

a=1103515245

c=12345

The formula is easy to step forwards large distances because

x_{n+2}=(a*((a*x_{n}+c) mod m + c) mod m

= ((a^{2} mod m)x + (a*c+c) mod m ) mod m

which is the same equation as before, but with a replaced by
a^{2} mod m, and c by (a*c+c) mod m. (Note
that this new choice of a and c leads to a sequence with half the repeat
length of the original sequence.)

This can be repeated to work out quickly how to step forwards four,
eight, sixteen, thirty-two etc. steps of the original sequence at
once. To step forwards an arbitrary number of steps, N, requires
order log_{2}N calculations. (That it takes so little effort
to jump forwards in this sequence might be regarded as proof that it
is not truely random.)

## Random numbers and parallelisation

When parallelising code reliant on random numbers, it is useful for debugging if precisely the same random numbers are used no matter how many parallel processes are running. This does not mean the same random numbers on each parallel process, but rather the same sequence that the serial code would have produced if one combines all the parallel parts.

The ability to take a linear congruential generator and readily produce another which produces every nth value from the same sequence, or to step forwards quickly a large number of values and then calculate every value in the original sequence, can be useful. As an example this page considers parallelising the linear congruential generator using OpenMP.

A rather conservative piece of Fortran for filling a 1D array with random numbers in the range 0<=x<1 might be:

subroutine pran(rnd,seed) implicit none integer, intent(inout) :: seed real(kind(1d0)), intent(out) :: rnd(:) integer, parameter :: long=selected_int_kind(15) integer(long), parameter :: m=2147483647_long integer(long) :: s s=seed do i=1,size(rnd) s=iand(s*1103515245+12345,m) rnd(i)=s/(m+1d0) enddo seed=s end subroutine pran

(This code makes no assumptions about how the multiplication of default integers behaves under overflow. It uses 64 bit integers for everything which might overflow a 32 bit integer.)

Suppose we actually wish to jump forwards and start filling the
array `rnd`

at some offset from the beginning. This we
can do by recalculating `s`

after a loop involving just
log_{2}(offset) iterations. That loop looks like

a=1 c=0 ap=1103515245 cp=12345 p=1 do while (p<=ioffset) if (iand(p,ioffset).ne.0) then c=iand(ap*c+cp,m) a=iand(a*ap,m) end if cp=iand(ap*cp+cp,m) ap=iand(ap*ap,m) p=p*2 enddo s=iand(seed*a+c,m)

An example of this using OpenMP and Fortran to fill in an array is
given as `prand.f90`

. This
example lets each thread operate on a block of elements, rather than
interleaving the elements in the array which each thread
modifies. Interleaving would cause cache line conflicts.

The example also needs to ensure that the final seed written back is the same one which the unparallelised code would have produced.

One could argue that this example is pointless: the run-time is approximately zero, even for arrays of many millions of elements. But it is the concept that counts! If one wished to use a modulus which was not a power of two, so than an explicit modulus operation was needed, then the compute costs would be significantly higher, and parallelisation more useful.

And, given that test programs are expected, see
`prand_test.f90`

for the
same with a short test program appended.

Finally, for those who prefer it,
`prand_test.c`

, the same in
C. One might argue that a lot of the modulus operations are
unnecessary, and that the slightly less readable
`prand2_test.c`

suffices. In
C/C++, unsigned integer overflow is well-defined, whereas Fortran
has no unsigned integer type, and signed integer overflow is less
well defined.

## Timings

One timed test of the above OpenMP C code produced the following data on a quad core 3.1GHz Haswell.

Size | ns per number | |
---|---|---|

OMP off | OMP on | |

2 | 3.48 | 536.22 |

20 | 1.57 | 54.90 |

200 | 1.65 | 6.34 |

2000 | 1.63 | 0.94 |

20000 | 1.63 | 0.47 |

200000 | 1.62 | 0.41 |

2000000 | 1.64 | 0.50 |

20000000 | 1.64 | 0.75 |

(Compiled with "`gcc -O3 -mavx2`".)

This shows clearly that OMP parallelisation has an overhead. Eventually it wins by about a factor of four (until the final two rows, with data sizes of over 8MB, fall out of the last level cache and the memory starts slowing progress slightly). But for sizes under about 1,000 elements, it is slower.