Compression solutions


For the 2005 compression assignment please see this page.

AIMS students' solutions, 2005
Fabrice Talla Nobibon Huffman code; run lengths up to 70 837 bitsRequires decoder to know the uncompressed filelength, N
Kengni Ncheuguim Emmanuel Huffman code; run lengths 838 bits
Inambao Wakwinji with Felicien Muamba Mukanya (Jeje) and K Coffart Mogale Position code 1434 bits
Joël Ravelomanantsoa-Ratsimihah Runlength, Huffman 832 bits
Chodziwadziwa Whiteson Kabudula golomb code with multiplier m = 8 1604 bits
Chodziwadziwa Whiteson Kabudula golomb code with multiplier m = 64 838 bits
Sorel-Platini Ewake Block code, blocks of size 10 bits; Huffman 1000 bits
Maisson Abdella(MEISHO) and MOHAMED ELSHAZLI (SHOSHO) Position code
Biniam zerai Tedlla Position code
Evidence Simbarashe Matangi Golomb code 838 bits
Hind ali Mohamed Ahmed and Saeed Mohamed Taha 837 bits
Kelvin Muzundu and Sara position / run-lengths 1442 bits
Olufemi Olusola Odegbile , Boilaji, and Isaac Huffman code for runlengths 831 bits
Ayoub Basheer Mohammed Basheer Position code
Herbert Hove and Thabo Moreto Runlengths using 9bit binary code
Tahina Rakotoniaina Huffman, blocks of length n:
n = 2      5146
n = 4      2730
n = 5      2263
n = 8      1582
n = 10    1368
1368
Mohamed Adam Run-length code 870-1090 bits, depending on the file
Onyekwelu Uzodinma Okeke and Emmanuel Naziga run-length method 837 bits.
Henry Osita Mbah, Attah Doomnull and Fareo gideon runlength coding. Golomb code, six digits. 838


Some compression programs I wrote

2005 SOLUTIONS IN PYTHON

Solutions written in python

For a general huffman algorithm written in perl, see huffman.p. For an OCTAVE-FRIENDLY version, see huffmano.p.

For a little C program illustrating how a symbol code could be implemented using structures, see structexample.c and structexample2.c.

For a full Huffman program written in C, based on the above ideas, see huffman.c. This program runs the huffman algorithm then immediately implements either the encoder or the decoder.


Benchmarks for Neil's file
MethodFile size (bits)
(some methods include additional file information)
No compression 10000   
gzip 2448   
gzip --best 1912   
zip 3400   
compress 3720   
bzip2 1808   
Theoretical limit:
Shannon information content 827.8

2004 SOLUTIONS in C and perl

Notes on an approach using runlength codes

postscript | pdf

Summary of results for Neil's file

Method

file size (bits) program size (lines)
Arithmetic coding 829 220
Huffman using run lengths 831 230
Golomb code 838 129
General purpose run length code 1286 136
Position code (14-bit positions) 1442 71
Position code (64-bit blocks, 6-bit positions) 878 71

Details of solutions

Huffman using run lengths

In my runlength/Huffman encoder, I chose a maximum runlength of 69 for the reasons explained in this document (postscript) | (pdf).

I assume the length of the file is known to the decoder; this knowledge allows the compressed file to be about 6 bits shorter. If I had to communicate the length of the file, or indicate the end of the file somehow, I would have to add roughly 6 bits again.

My encoder is a C program, RLencode.c; the decoder is a perl program, RLdecode.p. I am pretty sure that these programs will work on any files.

The usage is

 RLencode   < ~mackay/compress/filep.01.10000 > file.RLZ
 RLdecode.p < file.RLZ > decoded
The encoder gets Neil's file into 831 bits.

If the source filelength is changed from 10000, please add the Nmax=blah argument to inform the decoder of the correct filelength. Thus:

RLdecode.p Nmax=10000 < file.RLZ > decoded

Arithmetic coding

It's hard to make an arithmetic code that works perfectly. ACencode and ACdecode work on all test files I have tried, but I am still not certain they will always always work!

Arithmetic coding is nice because it lends itself to adaptive models, corresponding for example to the belief that the bias of the bent coin is not known, and should be inferred as we go; or to the belief that the bias of the bent coin might change with time.

The two programs have two choices of compilation options, corresponding to two possible adaptive models. One model (well suited to the competition problem) asserts that the bias of the coin is known to be accurately very close to 0.01; the other asserts that the bias is unknown and could be anything in the ballpark 0.01-0.99. This choice is determined in the file ACdefns.h, which is included at compilation time by both programs.

The programs are used thus:

 make ACdecode
 make ACencode
 ACencode < ~mackay/compress/filep.01.10000 > file.ACZ
 ACdecode < file.ACZ > file.decoded
This encoder gets Neil's file into 829 bits. The decoder makes use of the known source file length, N=10000.

The results achieved by arithmetic coding are especially impressive for even larger files. For example, I made a million-bit source file using randNchooseM.p N=1000000 M=10000 > million.dat and compared ACencode with the runlength encoder RLencode. The compressed file lengths were

   80797 file.ACZ
   81025 file.RLZ

Code C_alpha
n traditional
binary
length c_alpha(n)
1 1 1 1
2 10 2 010
3 11 2 011
4 100 3 00100
5 101 3 00101
6 110 3 00110
. . . .
45 101101 6 00000101101

Run length code with cheap encoder for integers

We can make a really small compression algorithm that is reasonably well suited to sparse files by simply counting the run lengths of 1s and 0s, then coding those integers into binary with a simple uniquely decodeable code.

The program IRL.c encodes the runlengths using the code C_alpha, described in Chapter 7 of Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers).

The same program can function as an encoder or a decoder; add the command line argument "-1" to select the decoder.

 encoding usage: IRL < ~mackay/compress/secret/CompressMe3 > file.IRLZ
 decoding usage: IRL -1 <  file.IRLZ > decoded
This program does not use an explicit probabilistic model; instead, it uses an implicit probabilistic model defined by the chosen codelengths for integers. For example, according to C_alpha, the implicit probabilities of 2 and 3 are both 1/8, and the probabilities of 4, 5, 6, and 7 are all 1/32.

This program gets Neil's file into 1286 bits.

Position code

I wrote a position code program in perl. There are two interesting ways to use it.

First, the simplest position code encodes the location of each 1 in a 14-bit string. The compressed length is 1442 bits. The decoder (which must know the file length N) makes an all-zero file then reads the compressed file 14 bits at a time, adding 1s to the all-zero file. It stops when it reaches the end of the compressed file.

  position.p multipleblocks=0 bits=14 ../filep.01.10000 > encoded.pos
  position.p multipleblocks=0 bits=14 decode=1 encoded.pos  > recovered

With a small modification, position codes can do better. We encode the file in blocks of size b bits, where 2b=64 is the best choice; and encode each 1 in b bits. To uniquely specify the encoder's progress from one block to the next, we must include a code for the number of 1s in a block. The optimal code for this number of 1s would have lengths given by log of the appropriate Binomial distribution; but for simplicity, I use a unary code: a 0 marks the end of a block and a 1 indicates that another 1 exists in the current block.

  position.p bits=6 ../filep.01.10000 > encoded.pos
  position.p bits=6 decode=1 encoded.pos  > recovered
The compressed file size is 878 bits.

Golomb code

The Golomb code is a very simple code that is both a runlength code and an approximate arithmetic coder. The encoder has just two adjustable parameters:

  • one bit (here set to 0), which identifies the more probable symbol, and
  • an integer m (here set to 6 or 7) whose value defines the implicit probability of the less probable symbol, via
    p1 ~= ln(2) 2-m.
    We define M = 2m.

To encode a file, the Golomb encoder outputs a 1 every time the stream contains M consecutive 0s. Whenever it encounters (for some r between 0 and M-1) a string of r consecutive 0s followed by a 1, it outputs a 0 followed by the integer r encoded as an m-bit binary number.

This encoder may be viewed as a special case of the runlength-based Huffman code with a maximum runlength constrained to be a power of 2, and all runs assigned equal implicit probability.

One may also view it as an approximate arithmetic coder; adaptation may be performed by adjusting m. The Golomb coder was the starting point for the Z-coder, an excellent compression algorithm used inside djvu.

The programs, which depend on the header file Gdefns.h, are used thus:

 make GolombEncode
 make GolombDecode
 GolombEncode < filep.01.10000  > file.GZ
 GolombDecode < file.GZ  > file.decoded

This encoder gets the sparse file into 870 bits (when m=7) and 838 bits (when m=6). That's very close to optimal, isn't it! The decoder does not make use of a known source file length; when it hits an EOF symbol, it stops decoding and terminates the file correctly.


Further ideas for other solutions

Position code plus `bits back'

Here's a fun idea. To encode a file of length N=10,000, of which roughly 100 of the bits are 1s, we could encode the position of each of the 1s in the file. Since each position can be represented by a 14-bit integer, the compressed file length will be roughly 100x14 = 1400 bits.

Now, that's some way off from the expected Shannon information content, 800 bits or so. Why?

Well, an encoding of all the positions has redundancy in it in the sense that the encoder is free to choose the order of the encoded bit-positions. This freedom means that the encoder has the opportunity to encode not only the 100 bit-positions but also an arbitrary choice of one from the 100! (one hundred factorial) possible permutations. In order to make that choice, the encoder could sub-contract to his friend Fred, who also wants to communicate over this channel, the decision about the choice of permutation. Receiving the permuted string of bits, the receiver can then deduce not only the sparse file but also Fred's choice. And Fred can use that choice to convey another file of size log2(100!) bits, which is very roughly 100 (log2 100 - 1), or 600 bits.

So the net cost of communicating this way is (total transmitted length in bits) - (length in bits of Fred's hidden transmission) ~= 1400 - 600 ~= 800 bits.

Which is the expected Shannon information content!

This idea is called bits-back coding. We encode in an arbitrary way, that requires some extra bits to resolve the arbitrariness; then claim back the cost of those extra bits by selling them to Fred. Now we get to the really fun bit: can you write a compression method such that the encoder himself plays the role of Fred? - i.e., the encoder chooses the permutation of the bit-positions in a way that conveys some of the bit-positions!

Related concepts

  • How should Fred turn his 600-bit file into a permutation of the 100 bits? We need an efficient and practical program. And a decoder that turns a permutation back into bits.
  • A nearby concept is this: imagine that Joe wants to communicate an unordered selection of r objects from N objects. How should he encode this selection in bits?


David MacKay