Documentation as an indicator of code quality

I often have to shunt around Lotus Domino databases, as well as all kinds of log files and bundles of XML data. I’ve got a cable modem connection to my home office, but still, uploads can take a while. So data compression is still important to me, and the newer LZMA algorithm can make a big difference.

Here’s an example using a random database template:

bytes	file
9437184	dct.ntf
2618157	dct.ntf.gz
2256744	dct.ntf.bz2
1806715	dct.ntf.lz
1784916	dct.ntf.7z
1775175	dct.ntf.lzma

Switching from bzip2 to LZMA is a bigger improvement than the switch from gzip to bzip2 was. However, notice that there are three LZMA-compressed files. The first was created using lzip; the second, using 7-zip; the third, using lzma-utils, now renamed xz or xz-utils.
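For anyone who wants to repeat this kind of comparison on their own files, here is a minimal sketch using only Python's standard library. It is not how the table above was produced: the standard library has no lzip or 7-zip writer, so those two are left out, and the function names are just illustrative. FORMAT_XZ is the container xz-utils writes, and FORMAT_ALONE is the older .lzma container.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` under each stdlib codec."""
    return {
        "raw": len(data),
        "gz": len(gzip.compress(data, compresslevel=9)),
        "bz2": len(bz2.compress(data, compresslevel=9)),
        # .xz container, as written by xz-utils
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ, preset=9)),
        # legacy .lzma container, as written by the old lzma-utils
        "lzma": len(lzma.compress(data, format=lzma.FORMAT_ALONE, preset=9)),
    }


def report(sizes: dict) -> str:
    """Format sizes largest-first, with each size as a percentage of raw."""
    raw = sizes["raw"]
    rows = sorted(sizes.items(), key=lambda item: -item[1])
    return "\n".join(
        f"{size}\t{name}\t{100 * size / raw:.1f}%" for name, size in rows
    )
```

Calling `print(report(compressed_sizes(open(path, "rb").read())))` on a file of your own gives a table in the same shape as the one above. The numbers will vary a lot with the kind of data, so treat any single file as an anecdote.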

I don’t like that there are competing LZMA archive formats, but since that’s the world I’m in, I had to choose one. The obvious answer is to go with the one with the best compression, but the gap between lzip and lzma-utils is only about 0.3% of the original file size, and in my view that isn’t enough of a difference to make it the sole criterion. So I tried my usual approach to evaluating open source software: I looked at the user interface and documentation.
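The 0.3% figure appears to be the lzip versus lzma-utils gap measured against the original file size. A quick check of that reading, using only the sizes from the table (the variable and function names are mine):

```python
# Sizes in bytes, copied from the table above.
ORIGINAL = 9437184    # dct.ntf
LZIP = 1806715        # dct.ntf.lz
LZMA_UTILS = 1775175  # dct.ntf.lzma

gap = LZIP - LZMA_UTILS


def percent(part, whole):
    """Express `part` as a percentage of `whole`, to two decimals."""
    return round(100 * part / whole, 2)


print(gap)                     # 31540
print(percent(gap, ORIGINAL))  # 0.33
print(percent(gap, LZIP))      # 1.75
```

Measured against the compressed size instead, the gap is a little under 2%, which is still small.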

7-zip fails immediately because it doesn’t behave like a Unix program. It produces 10 lines of output when successfully compressing a single file, and there doesn’t seem to be any way to get it to shut up. So, it’s a two-horse race.

The xz-utils site has no documentation I can find. Searching the Ubuntu documentation site locates the xz man page, however.

In contrast, lzip’s web site has a link to go straight to the user manual and tutorial. (Yes, there’s also a standard Unix man page.)

Comparing the two, I see that xz has many more options. It has all kinds of switches to specify how much memory it uses, tweak various internal details of the LZMA algorithm, and filter the data. None of these options are adequately explained. To quote Ted Nelson quoting Roger Gregory, “An option means the programmer didn’t have a clear idea of what the module was supposed to do.” Or as Steve Krug puts it, “Don’t make me think.”

In contrast, lzip’s user interface is much simpler, and closer to the Unix philosophy of “do one thing, and do it well”. The only two tweaks to the LZMA algorithm lzip provides are adequately explained if you know the basics of how compression algorithms tend to work, and there’s a table showing how they correspond to the compression levels -0 to -9. The only borderline gratuitous option is to split the compressed file into chunks, and that’s at least a useful one. It also gets the SI units right.

So, lzip wins by a landslide on UI and documentation.

You might be thinking I’m being superficial here; surely documentation alone isn’t a good way to evaluate software? So this time, I took a look at the source code.

Lzip’s source is around 5,680 lines of code (excluding comments), supplied with an autoconfigure script and test suite. It compiles to an 85,753-byte binary.

XZ Utils’ source is around 31,183 lines of code (excluding comments). It compiles to a 516,779-byte binary.

In XZ’s favor, its source code is much better commented (30% vs 10% comment-to-code ratio). Then again, at roughly 5.5× the source and 6× the binary size, it had better be. So on balance, I think lzip still wins.
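A comment-to-code ratio like this can be approximated with a very crude line classifier. The sketch below is not the tool behind the figures above (a proper line counter handles far more cases); it is a hypothetical heuristic for C-style sources that counts a line as a comment only if it starts with `//`, starts with `/*`, or sits inside a block comment. Trailing comments on a code line count as code, and comment markers inside string literals are not handled.

```python
def count_lines(source: str):
    """Return (code_lines, comment_lines) for C-style source text.

    Blank lines are ignored. This is a rough heuristic, not a parser.
    """
    code = comment = 0
    in_block = False  # True while inside an unterminated /* ... */
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_block:
            comment += 1
            if "*/" in line:
                in_block = False
        elif line.startswith("//"):
            comment += 1
        elif line.startswith("/*"):
            comment += 1
            # Stay in block mode unless the comment closes on this line.
            if "*/" not in line[2:]:
                in_block = True
        else:
            code += 1
    return code, comment


def comment_ratio(source: str) -> float:
    """Comment lines divided by code lines (0.0 if there is no code)."""
    code, comment = count_lines(source)
    return comment / code if code else 0.0
```

Running `comment_ratio` over every source file in a tree and weighting by line count gives a figure comparable in spirit to the ones quoted here, though the exact numbers depend on the counting rules.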

So once again, my “look at the documentation” heuristic worked. The question is, are there any good exceptions to prove the rule? That is, examples of excellent code that has terrible documentation?

Update: The GNU project decided to go with xz rather than lzip, of course, and implemented xz support rather than lzip support in GNU tar.