What’s the better way to encode a bunch of data: as text or in binary?

My first book, “BGP,” (over at the competition) is about the Border Gateway Protocol, the protocol that internet providers use to tell each other which blocks of IP addresses they use, so that packets can find their way from one ISP to another. This protocol is now about 13 years old, and it looks like it’s not going to hold the internet together for another 13. One of the issues is the sheer amount of data that the protocol must transport: when two routers run BGP, generally one must tell the other about nearly 200,000 individual blocks of IP addresses when they connect.

Thinking about a successor to BGP, I wondered what the best way would be to encode all this data. Today, it’s done in a binary format that was created especially for BGP. Obviously that doesn’t help code reuse. So i asked about the IETF’s current practices in this area on the Internet Engineering Task Force’s general discussion list.

The answers were interesting, although I was a bit disappointed that there is no such best practice. Some people advocated text-based protocols in general and/or XML in particular for their versatility, but I wasn’t sure this was such a great idea for this particular type of application: I’ve always suspected that parsing text is a very processor intensive task. So what better way to find out than by doing a little comparison?

I didn’t want to write an XML parser as that’s no way to start the week, so I decided to compare two approaches: reading 10 million lines each containing an integer value from a text file, and reading the same values from a binary file where each value is stored as an exact copy of an integer in memory. The time consuming code for the first approach looks like this:

  while (!feof(file))
    {
      fscanf(file, \"%i\", &j);
      tot += j;
    }

This took about 6 seconds on a low end Pentium 4 machine running FreeBSD 6. The binary equivalent:

  for (i = 0; i < 10000000; i++)
    {
      read(file, &j, sizeof(j));
      tot += j;
    }

To my astonishment, the binary code ran a lot slower at 12 seconds. Compiling with -O2 helped a bit: now the text and binary code were the same speed at just under 6 seconds.

I’m not sure what I expected exactly, but certainly not this. The text code should be slower if only because the text file was about twice the size of the binary file. It occurred to me that reading one integer at a time may not be the fastest way to get data from disk into memory, so I rewrote the binary code in order to read bigger chunks from the file:

  int file, i, j, k[10000], tot = 0;
	
  for (i = 0; i < 1000; i++)
    {
      read(file, k, sizeof(k));
      for (j = 0; j < 10000; j++)
        tot += k[j];
    }

As I said, I didn’t know what to expect exactly, but I was surprised once again. This code ran faster. A lot faster. Compiled with -O2 it only took 25 milliseconds to read 10 million values from a 40 megabyte file! That’s more than 200 times faster than reading the same information from a text file.

Now obviously there are many situations where the amount of data and/or the time spent parsing it is so small compared to everything else that’s happening, that using a binary data encoding is not worth the effort. But in cases where performance is even remotely an issue, there is no question: forget text and use binary encoding for internal non-user-facing data representation.

UPDATE

In addition to the comments below, I also got a reaction on the IETF list. I did implement Bill’s truncated suggestion to do fread() for the binary stuff, which made it run at about 1300 ms. But I think that’s not really a realistic test do, because it slows down the binary implementation for no real reason. I think it’s better to see whether it’s possible to squeeze out more speed from the text implementation. So I tried:

  for (i = 0; i < 1000; i++)
    {
      read(file, k, sizeof(k));
      p = k;
      for (j = 0; j < 10000; j++)
        {
          l = atoi(p);
          tot += l;
          p += 7;
        }
    }

I first tried this with sscanf() but apparently I made a mistake or there is something strange going on with the implementation, because it didn’t finish in a reasonable amount of time. With atoi() it worked a lot better: 1100 ms, more than five times as fast as the original text version. Finally, I made a quick and dirty atoi() implementation to squeeze out even more speed:

  do
    {
      c = *(text++);
      if (c < '0' || c > ‘9′)
        return result;
      result = result * 10 + c - ‘0′;
    }
  while (1);

This made the whole thing another factor 4 faster at about 270 ms. That’s still more than 10 times slower than the binary implementation, but the real problem here is that optimizing the text version in this way (I even hardcoded the string lengths…) makes it a lot more complex than either a naive text implementation or a binary implementation. My conclusion: text is 10 times slower than binary even when optimized, and another 10 times or so slower in a more general purpose implementation.