Optimal use of glibc's fseek()

The fseek() function is the standard (C89 and later) call for positioning the file position indicator for a stream. At first glance it seems like a very light-weight function, as it does no I/O. Its one obvious problem is the use of long, not off_t, for the offset, so on those 64 bit platforms on which long is only 32 bits, one cannot seek by more than +/-2GB.

But fseek() as implimented in glibc is not at all light-weight. It suffers from two potentially-serious performance issues. Sufficiently serious that I have just caused an analysis code to speed up by a factor of over twenty by combining large fseek()s, and replacing small ones by fread()s.

fseek() can imply fread()

In order to optimise the fread() following an fseek(), glibc usually seeks to a multiple of the page size, and then reads any amount left over. The result of this read is stored in an internal buffer.

So, if to read a file of fixed-length records, one writes code such as

  for i=1 to num_records
    if (is_wanted(i))
      fread(buffer,1,record_length,stream)
      process(buffer)
    else
      fseek(stream,record_length,SEEK_CUR)
    endif
  enddo

then it is likely that a read of a fraction of a page-size will occur at the start of every record, even those that are being skipped. If most records are skipped, it may be significantly faster to write

  skip_bytes=0
  for i=1 to num_records
    if (is_wanted(i))
      if (skip_bytes)
        fseek(stream,skip_bytes,SEEK_CUR)
        skip_bytes=0
      endif
      fread(buffer,1,record_length,stream)
      process(buffer)
    else
      skip_bytes+=record_length
    endif
  enddo
  if (skip_bytes)
    fseek(stream,skip_bytes,SEEK_CUR)
    skip_bytes=0
  endif

Small seeks are slow

Glibc does much buffering. If one attempts to read a small number of bytes, glibc will read a page or so into an internal buffer, then return the data requested. If this is followed by another small read, then no further read is needed, for the data are already in glibc's internal buffer. So a small read can be very fast, with no system call being made by glibc. However, fseek() always results in a system call. In other words

  int junk;
  fread(&junk,1,4,stream);

is likely to be much faster than

  fseek(stream,4,SEEK_CUR);

Two demonstrations

Large seeks

#include<stdio.h>
#include<stdlib.h>

int main(void)
{
  int i;
  FILE* file;
  file=fopen("/bin/sh","r");

  for(i=0;i<5;i++)
    fseek(file,10000,SEEK_CUR);

  fclose(file);
  return 0;
}

A silly piece of code which seeks on the file /bin/sh which is bound to exist and be readable.

$ gcc test.c
$ strace ./a.out 
[...]
openat(AT_FDCWD, "/bin/sh", O_RDONLY)   = 3
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=125688, ...}, AT_EMPTY_PATH) = 0
lseek(3, 10000, SEEK_CUR)               = 10000
lseek(3, 16384, SEEK_SET)               = 16384
read(3, "\363\17\36\372H"..., 3616) = 3616
lseek(3, 28672, SEEK_SET)               = 28672
read(3, "1\322H\2155\237"..., 4096) = 4096
lseek(3, 36864, SEEK_SET)               = 36864
read(3, "\362\0\0\17\37D"..., 4096) = 4096
lseek(3, 49152, SEEK_SET)               = 49152
read(3, "\0H\203x\10\0\17"..., 4096) = 4096
close(3)                                = 0

The first seek is the expected seek of 10000. The next becomes a seek to 16384 plus a read of 3616 bytes (16384+3616=20000). The next becomes a seek to 28672 plus a read of 4096 bytes, which seems like an overshoot on the target of 30000, although glibc will correctly account for this. It is surprising to see four reads from a code which does no reads whatsoever.

Lesson: avoid calling fseek after fseek. After fseek one should call fread (or fwrite).

Small seeks

#include<stdio.h>
#include<stdlib.h>

int main(void)
{
  char c;
  int i;
  FILE* file;
  file=fopen("/bin/sh","r");

  for(i=0;i<10;i++)
    fseek(file,4,SEEK_CUR);

  fread(&c,1,1,file);
  printf("Byte read was %d\n",(int)c);
  fclose(file);
  return 0;
}

Again a silly piece of code which seeks on the file /bin/sh which is bound to exist and be readable.

$ gcc test.c
$ strace ./a.out 
[...]
openat(AT_FDCWD, "/bin/sh", O_RDONLY)   = 3

lseek(3, 4, SEEK_CUR)                   = 4
lseek(3, 0, SEEK_SET)                   = 0
read(3, "\177ELF\2\1\1\0", 8)           = 8
lseek(3, 0, SEEK_SET)                   = 0
read(3, "\177ELF\2\1\1\0\0"..., 4096)   = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096
lseek(3, 4096, SEEK_SET)                = 4096

write(1, "Byte read was 120\n", 18)     = 18

So the first seek of four bytes was issued as a seek of four bytes. The next seek of four bytes resulted in seeking to the beginning of the file, and then reading eight bytes. The next seek of four bytes resulted in seeking back to the beginning of the file, and then reading 4096 bytes. The final seven seeks all resulted in a seek to an offset of 4096 bytes. The actual read in the code of just one byte resulted in no read at all, because that byte could be returned by the read produced by the third seek.

Lesson: very small fseek()s may be better replaced by fread()s to a dummy buffer. Small fread()s are well optimised and do not result in more system calls than necessary.

Glib's code

The relevant code from glibc, from wfileops.c, is something like:

/* Try to seek to a block boundary, to improve kernel page management. */
  new_offset = offset & ~(_IO_buf_end - _IO_buf_base - 1);
  delta = offset - new_offset;

  _IO_SYSSEEK (fp, new_offset, 0);

  if (delta)
     _IO_SYSREAD (fp, _IO_buf_base, _IO_buf_end - _IO_buf_base);

but with decent error-checking, and many other issues taken care of too.

Further reading

The above was greatly aided by comments in this Stack Overflow article.