Compression assignment 2004


This is the main assignment for the information theory course. (You are not obliged to do this assignment, but I think you will find it educational and fun.) Work in groups of two if you wish.

Neil has created a huge file containing the results of ten thousand tosses of a bent coin with f=0.01. (Don't print out this file!)

Your task is: Compress Neil's file.

Or, to be more precise, write two programs: a compressor and an uncompressor. The compressor should read in a file like Neil's file (~mackay/compress/filep.01.10000) and should write out another file, which, for simplicity, I would recommend making a plain text file containing 0s and 1s. (But you can use other formats, such as binary files, if you wish.)

The uncompressor should have the property that when it reads in the compressed file, it writes out the original file.

If I give your program another file similar to Neil's, the compressor and uncompressor should both work on that file too. [So you can't make a fake uncompressor that simply prints out Neil's file!]

We would like the compressed file to be as small as possible.

You may write the programs in any language you wish. C, octave, perl, and python all have their advantages.

Further information, only relevant once we have discussed Huffman coding

If it's helpful to you, I can supply a Huffman coding algorithm (in ~mackay/bin/huffman.p) that produces optimal symbol codes for probability distributions that you choose. [Further information.]

Benchmarks for Neil's file
MethodFile size (bits)
(some methods include additional file information)
No compression 10000   
gzip 2448   
gzip --best 1912   
zip 3400   
compress 3720   
bzip2 1808   
Theoretical limit:
Shannon information content 827.8

When you have finished

I encourage you to make a web-page, describing the method you used, how you decided on the parameters of your method, and (if it works) how many BITS your method compresses Neil's file into.


David MacKay