Accelerating apt-get
TCM's computers run Ubuntu, and are installed by using Ubuntu's installer to produce a very basic system, and then running a shell script which installs over 2,000 extra packages using apt-get. On some machines this script was taking over an hour to run, so this page describes how to accelerate things.
Apt-get uses dpkg, and both are very concerned to keep things in a state from which they can resume should the system crash. After all, if dpkg gets too confused one can no longer install even security patches. Like almost all OSes, Linux caches disk writes: a write to disk appears to the application to complete long before the data are actually written to permanent storage. With ext3 in most situations the data would be sent to the drive within five seconds of the write, and the ordering of writes of data and metadata was guaranteed. The performance optimisations of ext4 remove these guarantees: metadata may be written before data, and data writes may be delayed by about a minute.
Applications have a way of ensuring this does not happen. They can
call fsync() on a file, which guarantees to flush all changes to it
to the disk, and, on good OSes such as recent versions of Linux,
through the disk drive's own small cache to permanent storage. As
the disk drive has no idea to which files data in its cache belong,
the last step involves flushing the entirety of the disk drive's
cache. But fsync() negates most of the point of write caches, in
that it greatly reduces opportunities for re-ordering writes for
efficiency, and of combining multiple writes to the same disk block.
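To make the cost concrete, here is a minimal sketch of the kind of write-then-fsync() pattern a cautious application might use when replacing a file. It is an illustration of the general technique only, not dpkg's actual code; the function name durable_write and the ".tmp" suffix are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace `path` with `len` bytes from `buf` so that, after a crash,
   `path` holds either its old contents or the new ones, not a mixture.
   Returns 0 on success, -1 on failure. Hypothetical helper. */
int durable_write(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int r = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (r < 0 || (size_t)r >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* write() may write fewer bytes than asked, so loop until done. */
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) { close(fd); unlink(tmp); return -1; }
        p += n;
        len -= (size_t)n;
    }

    /* Force the data to permanent storage before the rename exposes it.
       This is the slow step: it waits for the disk. */
    if (fsync(fd) == -1) { close(fd); unlink(tmp); return -1; }
    if (close(fd) == -1) { unlink(tmp); return -1; }

    /* rename() atomically swaps the fully written file into place. */
    if (rename(tmp, path) == -1) { unlink(tmp); return -1; }
    return 0;
}
```

Every call waits for the disk once, so a program that does this for thousands of small files pays that wait thousands of times. A fuller version would also fsync() the containing directory so that the rename itself is durable; that is omitted here for brevity.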
In the case of installing a machine, it is also all unnecessary. If
a crash occurs during installation, then reinstalling is much
simpler than trying to work out how to resume. Provided that the
install ends with a clean shutdown, all data will be correctly
written, and no calls to fsync() are needed.
Disabling fsync
With applications which use shared libraries, it is very simple to
modify, or disable, a library call. One can write one's own library,
and force its functions to be used in preference by setting the
environment variable LD_PRELOAD. So here is some minimal code to
behave like fsync(), even returning an error if called with an
invalid file descriptor, but otherwise doing nothing:
#include <fcntl.h>

/* Replacement fsync(): validate the descriptor, then do nothing. */
int fsync(int fd){
  if (fcntl(fd,F_GETFL)==-1) return -1; /* fcntl has set errno, e.g. EBADF */
  return 0;
}
This can then be made into a shared library:
$ gcc -c -fPIC nosync.c
$ gcc -shared -o libnosync.so nosync.o
Finally one needs to run apt-get with LD_PRELOAD set to the full
path to the above library:
$ LD_PRELOAD=/path/to/libnosync.so apt-get install foo
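Before trusting a preloaded replacement, it may be worth checking that it still reports errors as the real call does. POSIX specifies that fsync() on an invalid descriptor returns -1 with errno set to EBADF. The sketch below tests whichever fsync() is currently in effect; the function name fsync_rejects_bad_fd is invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Returns 1 if fsync() on an invalid descriptor fails in the way POSIX
   specifies (return value -1, errno == EBADF), and 0 otherwise.
   Hypothetical helper for checking a preloaded replacement. */
int fsync_rejects_bad_fd(void)
{
    errno = 0;
    return fsync(-1) == -1 && errno == EBADF;
}
```

Called from a small main() in a dynamically linked program, this can be run once normally and once with LD_PRELOAD set to the library; both runs should give the same answer.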
On machines with spinning disks this generally reduces the
installation time for my 2,000+ packages (served from a local mirror
over a 1 Gbit/s link) by a factor of between two and three. The
nervous might wish to run the sync command explicitly after the last
apt-get, but it is not necessary.
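A rough way to see the effect on a particular machine, without running a full install, is to time a loop that writes many small files and calls fsync() on each. The sketch below does that; the function name time_fsync_writes, the file name pattern and the payload are all invented for this example, and it is no substitute for timing the real apt-get run.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Create `count` small files under `dir`, calling fsync() on each, and
   remove them again. Returns elapsed wall-clock seconds, or -1.0 on
   error. Hypothetical helper for a rough comparison only. */
double time_fsync_writes(const char *dir, int count)
{
    static const char payload[] = "some package data\n";
    const ssize_t want = (ssize_t)(sizeof payload - 1);
    char path[4096];
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1)
        return -1.0;

    for (int i = 0; i < count; i++) {
        int r = snprintf(path, sizeof path, "%s/fsync_test_%d.tmp", dir, i);
        if (r < 0 || (size_t)r >= sizeof path)
            return -1.0;

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd == -1)
            return -1.0;

        /* One write and one fsync() per file, as a cautious installer might do. */
        int ok = write(fd, payload, (size_t)want) == want && fsync(fd) == 0;
        close(fd);
        unlink(path);
        if (!ok)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) == -1)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling this from a small main() with a directory on the disk of interest and a count of a few thousand, once normally and once with LD_PRELOAD set to the library, should show the difference between real and disabled fsync() calls. The absolute numbers depend heavily on the drive and filesystem.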
Of course, if your machine does crash after using the above, then apt is likely to be left in an unrecoverable state, with packages partially installed, and I take no responsibility for this.
Or, to put it another way, a very similar project names the resulting library "EatMyData".