� Resolutions | Main | Update on the Patriarch �
December 31, 2003
Compression
Today I found a great compression utility called 7-Zip. It's tiny. The installation executable is only 920KB. It supports a wide number of compression schemes including RAR (which is my current utility of choice - the one I licensed and paid for.) and BZIP2.
What the heck is .bz2? I've never heard of it, but apparently it's a butt-kicking format that does up to 40% better than .gz. This is news to me. According to Breton Chapin:
In general, Burrows-Wheeler isn't as fast as LZ variants.
Theoretically, it takes O(n) time to sort the input. This assumes
that strings of any length can be compared instantly. Practical
algorithms for sorting for BWT actually take n log n, or n*a, or n log
a where a is the length of the longest match, because arbitrarily long
strings cannot be compared in O(1) time. Memory requirements of these
algorithms vary considerably, but all are many times the size of the
block, whereas LZ variants get by with a relatively small fixed amount
of memory. bzip2 uses 2 algorithms. bzip2 starts with an n*a which
works great if the data is not too repetitive as then a will be small.
If it's taking too long to sort the data, bzip2 switches to an n *
log n algorithm. gzip uses an O(n) algorithm that really does run in
O(n). If you were compressing 50M in one big block (bzip2's max is
900K blocks) with an n log n algorithm, you could expect it to take
around 25 to 26 (log base 2 of 50M) times as long as an n time
algorithm.Then there are implementation details. Because BWT uses so much more
memory than LZ, performance of a system's memory (caching, especially)
becomes an issue, for both compression and decompression. Also, bzip2
makes its comparisons one byte at a time, rather than one 32 (or 64
bit) word at a time. (Don't know what gzip does there.) I modified
bzip2 to do comparisons 4 bytes at a time (on little endian machines)
and got about 10% to 15% improvement in speed of compression.If you think bzip2 is slow, try some of the older PPM implementations
from before people realized PPM could be as fast as BWT. On the other
end, the old Unix "compress" is faster than gzip, and back in the 80's
people sometimes complained about gzip's "slowness", asking if gzip's
better compression was worth it.
Wow. It's been so long since I've heard a real computer scientist speak. You learn something new every day.
Posted by mbowen at December 31, 2003 11:35 AM
Trackback Pings
TrackBack URL for this entry:
http://www.visioncircle.org/mt/mt-tb.cgi/1281