Compression solutions


For the 2004 compression assignment please see this page.

AIMS students' solutions
Kamga Morgan Magloire POSITION CODE 1442 bits
Simon Kasanda & Iyabo Bello run-length method + code generated by the Huffman Algorithm. 837 bits
Isaac Osunmakinde run-length method, Huffman (assuming no runs longer than 500?) 838 bits
Lubna and Gibson run-length and Huffman (with length N supplied at uncompressor) (Gibson's page) 831 bits
James Malm Block code, length 8 1590 bits
abel & naval & richard Huffman codes and run length approach 837 bits
Martial Ndeffo Mbah Huffman codes and run length approach 845 bits
Naval and Atumbe Jules Baruani run lengths (not Huffman) 913 bits


Some general compression programs I wrote

For a general huffman algorithm written in perl, see huffman.p. For an OCTAVE-FRIENDLY version, see huffmano.p.

For a little C program illustrating how a symbol code could be implemented using structures, see (broken links, sorry) structexample.c and structexample2.c.

For a full Huffman program written in C, based on the above ideas, see huffman.c. This program runs the huffman algorithm then immediately implements either the encoder or the decoder.


Benchmarks for Neil's file
MethodFile size (bits)
(some methods include additional file information)
No compression 10000   
gzip 2448   
gzip --best 1912   
zip 3400   
compress 3720   
bzip2 1808   
Theoretical limit:
Shannon information content 827.8

SOLUTIONS

Notes on an approach using runlength codes

postscript | pdf

Summary of results for Neil's file

Method

file size (bits) program size (lines)
Arithmetic coding 829 220
Huffman using run lengths 831 230
Golomb code 838 129
General purpose run length code 1286 136
Position code (14-bit positions) 1442 71
Position code (64-bit blocks, 6-bit positions) 878 71

Details of solutions

Huffman using run lengths

In my runlength/Huffman encoder, I chose a maximum runlength of 69 for the reasons explained in this document (postscript) | (pdf).

I assume the length of the file is known to the decoder; this knowledge allows the compressed file to be about 6 bits shorter. If I had to communicate the length of the file, or indicate the end of the file somehow, I would have to add roughly 6 bits again.

My encoder is a C program, RLencode.c; the decoder is a perl program, RLdecode.p. I am pretty sure that these programs will work on any files.

The usage is

 RLencode   < ~mackay/compress/filep.01.10000 > file.RLZ
 RLdecode.p < file.RLZ > decoded
The encoder gets Neil's file into 831 bits.

If the source filelength is changed from 10000, please add the Nmax=blah argument to inform the decoder of the correct filelength. Thus:

RLdecode.p Nmax=10000 < file.RLZ > decoded

Arithmetic coding

It's hard to make an arithmetic code that works perfectly. ACencode and ACdecode work on all test files I have tried, but I am still not certain they will always always work!

Arithmetic coding is nice because it lends itself to adaptive models, corresponding for example to the belief that the bias of the bent coin is not known, and should be inferred as we go; or to the belief that the bias of the bent coin might change with time.

The two programs have two choices of compilation options, corresponding to two possible adaptive models. One model (well suited to the competition problem) asserts that the bias of the coin is known to be accurately very close to 0.01; the other asserts that the bias is unknown and could be anything in the ballpark 0.01-0.99. This choice is determined in the file ACdefns.h, which is included at compilation time by both programs.

The programs are used thus:

 make ACdecode
 make ACencode
 ACencode < ~mackay/compress/filep.01.10000 > file.ACZ
 ACdecode < file.ACZ > file.decoded
This encoder gets Neil's file into 829 bits. The decoder makes use of the known source file length, N=10000.

The results achieved by arithmetic coding are especially impressive for even larger files. For example, I made a million-bit source file using randNchooseM.p N=1000000 M=10000 > million.dat and compared ACencode with the runlength encoder RLencode. The compressed file lengths were

   80797 file.ACZ
   81025 file.RLZ

Code C_alpha
n traditional
binary
length c_alpha(n)
1 1 1 1
2 10 2 010
3 11 2 011
4 100 3 00100
5 101 3 00101
6 110 3 00110
. . . .
45 101101 6 00000101101

Run length code with cheap encoder for integers

We can make a really small compression algorithm that is reasonably well suited to sparse files by simply counting the run lengths of 1s and 0s, then coding those integers into binary with a simple uniquely decodeable code.

The program IRL.c encodes the runlengths using the code C_alpha, described in Chapter 7 of Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers).

The same program can function as an encoder or a decoder; add the command line argument "-1" to select the decoder.

 encoding usage: IRL < ~mackay/compress/secret/CompressMe3 > file.IRLZ
 decoding usage: IRL -1 <  file.IRLZ > decoded
This program does not use an explicit probabilistic model; instead, it uses an implicit probabilistic model defined by the chosen codelengths for integers. For example, according to C_alpha, the implicit probabilities of 2 and 3 are both 1/8, and the probabilities of 4, 5, 6, and 7 are all 1/32.

This program gets Neil's file into 1286 bits.

Position code

I wrote a position code program in perl. There are two interesting ways to use it.

First, the simplest position code encodes the location of each 1 in a 14-bit string. The compressed length is 1442 bits. The decoder (which must know the file length N) makes an all-zero file then reads the compressed file 14 bits at a time, adding 1s to the all-zero file. It stops when it reaches the end of the compressed file.

  position.p multipleblocks=0 bits=14 ../filep.01.10000 > encoded.pos
  position.p multipleblocks=0 bits=14 decode=1 encoded.pos  > recovered

With a small modification, position codes can do better. We encode the file in blocks of size b bits, where 2b=64 is the best choice; and encode each 1 in b bits. To uniquely specify the encoder's progress from one block to the next, we must include a code for the number of 1s in a block. The optimal code for this number of 1s would have lengths given by log of the appropriate Binomial distribution; but for simplicity, I use a unary code: a 0 marks the end of a block and a 1 indicates that another 1 exists in the current block.

  position.p bits=6 ../filep.01.10000 > encoded.pos
  position.p bits=6 decode=1 encoded.pos  > recovered
The compressed file size is 878 bits.

Golomb code

The Golomb code is a very simple code that is both a runlength code and an approximate arithmetic coder. The encoder has just two adjustable parameters:

  • one bit (here set to 0), which identifies the more probable symbol, and
  • an integer m (here set to 6 or 7) whose value defines the implicit probability of the less probable symbol, via
    p1 ~= ln(2) 2-m.
    We define M = 2m.

To encode a file, the Golomb encoder outputs a 1 every time the stream contains M consecutive 0s. Whenever it encounters (for some r between 0 and M-1) a string of r consecutive 0s followed by a 1, it outputs a 0 followed by the integer r encoded as an m-bit binary number.

This encoder may be viewed as a special case of the runlength-based Huffman code with a maximum runlength constrained to be a power of 2, and all runs assigned equal implicit probability.

One may also view it as an approximate arithmetic coder; adaptation may be performed by adjusting m. The Golomb coder was the starting point for the Z-coder, an excellent compression algorithm used inside djvu.

The programs, which depend on the header file Gdefns.h, are used thus:

 make GolombEncode
 make GolombDecode
 GolombEncode < filep.01.10000  > file.GZ
 GolombDecode < file.GZ  > file.decoded

This encoder gets the sparse file into 870 bits (when m=7) and 838 bits (when m=6). That's very close to optimal, isn't it! The decoder does not make use of a known source file length; when it hits an EOF symbol, it stops decoding and terminates the file correctly.


Further ideas for other solutions

Position code plus `bits back'

Here's a fun idea. To encode a file of length N=10,000, of which roughly 100 of the bits are 1s, we could encode the position of each of the 1s in the file. Since each position can be represented by a 14-bit integer, the compressed file length will be roughly 100x14 = 1400 bits.

Now, that's some way off from the expected Shannon information content, 800 bits or so. Why?

Well, an encoding of all the positions has redundancy in it in the sense that the encoder is free to choose the order of the encoded bit-positions. This freedom means that the encoder has the opportunity to encode not only the 100 bit-positions but also an arbitrary choice of one from the 100! (one hundred factorial) possible permutations. In order to make that choice, the encoder could sub-contract to his friend Fred, who also wants to communicate over this channel, the decision about the choice of permutation. Receiving the permuted string of bits, the receiver can then deduce not only the sparse file but also Fred's choice. And Fred can use that choice to convey another file of size log2(100!) bits, which is very roughly 100 (log2 100 - 1), or 600 bits.

So the net cost of communicating this way is (total transmitted length in bits) - (length in bits of Fred's hidden transmission) ~= 1400 - 600 ~= 800 bits.

Which is the expected Shannon information content!

This idea is called bits-back coding. We encode in an arbitrary way, that requires some extra bits to resolve the arbitrariness; then claim back the cost of those extra bits by selling them to Fred. Now we get to the really fun bit: can you write a compression method such that the encoder himself plays the role of Fred? - i.e., the encoder chooses the permutation of the bit-positions in a way that conveys some of the bit-positions!

Related concepts

  • How should Fred turn his 600-bit file into a permutation of the 100 bits? We need an efficient and practical program. And a decoder that turns a permutation back into bits.
  • A nearby concept is this: imagine that Joe wants to communicate an unordered selection of r objects from N objects. How should he encode this selection in bits?


David MacKay