Accelerating apt-get

TCM's computers run Ubuntu, and are installed by using Ubuntu's installer to produce a very basic system, and then running a shell script which installs over 2,000 extra packages using apt-get. On some machines this script was taking over an hour to run, so this page describes how to accelerate things.

Apt-get uses dpkg, and both are very concerned to keep things in a state from which they can resume should the system crash. After all, if dpkg gets too confused one can no longer install even security patches. Like almost all OSes, Linux caches disk writes: a write to disk appears to the application to complete long before the data are actually written to permanent storage. With ext3 in most situations the data would be sent to the drive within five seconds of the write, and the ordering of writes of data and metadata was guaranteed. The performance optimisations of ext4 remove these guarantees: metadata may be written before data, and data writes may be delayed by about a minute.

Applications have a way of ensuring that their data do not linger in these caches. They can call fsync() on a file, which guarantees to flush all changes to it to the disk, and, on good OSes such as recent versions of Linux, through the disk drive's own small cache to permanent storage. As the disk drive has no idea to which files data in its cache belong, the last step involves flushing the entirety of the disk drive's cache. But fsync() negates most of the point of write caches, in that it greatly reduces opportunities for re-ordering writes for efficiency, and for combining multiple writes to the same disk block.
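As an illustration of the pattern described above, here is a minimal sketch of how an application might write a file and then insist that it reaches the disk. The function name write_durably is hypothetical, and this is not a claim about how dpkg itself is written.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations such as fsync() */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, and report success only
   once fsync() has returned. Returns 0 on success, -1 on error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n == -1) { close(fd); return -1; }
        p += n;
        len -= (size_t)n;
    }

    /* Without this call the data may sit in the kernel's cache for a while. */
    if (fsync(fd) == -1) { close(fd); return -1; }

    return close(fd);
}
```

A program that does this for each of many thousands of small files pays for a flush every time, which is where the time goes.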

In the case of installing a machine, it is also all unnecessary. If a crash occurs during installation, then reinstalling is much simpler than trying to work out how to resume. Provided that the install ends with a clean shutdown, all data will be correctly written, and no calls to fsync() are needed.

Disabling fsync

With applications which use shared libraries, it is very simple to modify, or disable, a library call. One can write one's own library, and force its functions to be used in preference by setting the environment variable LD_PRELOAD. So here is some minimal code to behave like fsync(), even returning an error if called with an invalid file descriptor, but otherwise doing nothing:

#include <fcntl.h>

int fsync(int fd){
  if (fcntl(fd,F_GETFL)==-1) return -1; /* fcntl() has set errno, e.g. to EBADF */
  return 0;
}

This can then be made into a shared library:

$ gcc -c -fPIC nosync.c
$ gcc -shared -o libnosync.so nosync.o

Finally one needs to run apt-get with LD_PRELOAD set to the full path to the above library:

$ LD_PRELOAD=/path/to/libnosync.so apt-get install foo
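Some programs flush data with fdatasync() or sync() rather than fsync(). The following is a sketch of a variant of nosync.c which neutralises all three in the same way. Whether this makes any further difference to apt-get is an assumption which I have not measured, and check_fd is simply a helper name chosen here.

```c
#include <fcntl.h>
#include <unistd.h>

/* Fail for invalid descriptors, as the real calls would. */
static int check_fd(int fd)
{
    return fcntl(fd, F_GETFL) == -1 ? -1 : 0;  /* fcntl() sets errno */
}

int fsync(int fd)     { return check_fd(fd); }
int fdatasync(int fd) { return check_fd(fd); }
void sync(void)       { /* deliberately does nothing */ }
```

It is compiled and preloaded exactly as before. Note that any sync command run with this library preloaded would also do nothing, so LD_PRELOAD should be set for the apt-get command only.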

On machines with spinning disks this generally reduces the installation time for my 2,000+ packages (served from a local mirror over a 1 Gbit/s link) by a factor of between two and three. The nervous might wish to run the sync command explicitly after the last apt-get, but it is not necessary.
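For a rough idea of what fsync() costs on a particular disk, one can time a series of small writes with and without it. The sketch below is a crude illustration rather than a proper benchmark; the function name time_writes and the 4 KiB block size are arbitrary choices made here.

```c
#define _POSIX_C_SOURCE 200809L  /* request clock_gettime() and fsync() */
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: write count blocks of 4 KiB to path, calling
   fsync() after each block if do_sync is non-zero.
   Returns the elapsed time in seconds, or -1.0 on error. */
double time_writes(const char *path, int count, int do_sync)
{
    char block[4096];
    struct timespec t0, t1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1.0;

    memset(block, 'x', sizeof block);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < count; i++) {
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block ||
            (do_sync && fsync(fd) == -1)) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing time_writes(path, 1000, 1) with time_writes(path, 1000, 0) on a file on the disk in question gives some feel for the difference, although the result will depend heavily on the drive and the filesystem.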

Of course if your machine does crash after using the above, then apt is likely to be left in an unrecoverable state, with packages partially installed, and I take no responsibility for this.

Or, to put it another way, a very similar project names the resulting library "libeatmydata".