Compression assignment 2005


This is the main assignment for the information theory course. You may work in pairs if you wish.

You have a choice: you can do the bent coin task described here, or, if you prefer, write a compression program and uncompression program to solve any other task you think is interesting (for example, text compression, or image compression).

The bent coin compression problem

Neil has created a huge file containing the results of ten thousand tosses of a bent coin with f=0.01. (Don't print out this file!)

For your convenience, there are two copies of Neil's file: one with each outcome (0/1) on a separate line; and one where all the outcomes are concatenated on a single line. (Don't print out either file!)

Your task is: Compress Neil's file.

Or, to be more precise, write two programs: a compressor and an uncompressor. The compressor should read in a file like Neil's file (~mackay/compress/filep.01.10000) and should write out another file, which, for simplicity, I would recommend making a plain text file containing 0s and 1s. (But you can use other formats, such as binary files, if you wish.)

The uncompressor should have the property that when it reads in the compressed file, it writes out the original file.

Benchmarks for Neil's file
MethodFile size (bits)
(some methods include additional file information)
No compression 10000   
gzip 2448   
gzip --best 1912   
zip 3400   
compress 3720   
bzip2 1808   
Theoretical limit:
Shannon information content 827.8

If I give your program another file similar to Neil's, the compressor and uncompressor should both work on that file too. [So you can't make a fake uncompressor that simply prints out Neil's file!]

We would like the compressed file to be as small as possible.

You may write the programs in any language you wish; we recommend python.

Huffman coding algorithm

If it's helpful to you, a Huffman coding algorithm (in ~mackay/python/compression/huffman/Huffman.py) can produce optimal symbol codes for probability distributions that you choose. [Further information.]

Solutions

Students' solutions | My solutions written in python

Python tips

Here is an example of a 'compression' algorithm that simply copies the source file to the 'compressed' file.

This package illustrates how to have a test() function that is executed when you type C-c C-c in emacs. It sets itself some simple tests, and if it passes all of those, it continues and attempts a harder test, "compressing" the BentCoinFile into tmp.zip.

For python enthusiasts:

Here is an example of a 'compression' algorithm that simply copies the source file to the 'compressed' file.

This package illustrates

  1. how to have a test() function that is executed when you type C-c C-c in emacs;
  2. how to have different functions that are executed when the program is run from a shell, reading from stdin and writing to stdout

Usage:
  compressing: 
     python copy.py < /home/mackay/compress/BentCoinFile > tmp
  uncompressing:
     python copy.py --uncompress < tmp > tmp2 

When you have finished

I encourage you to make a web-page, describing the method you used, how you decided on the parameters of your method, and (if it works) how many BITS your method compresses Neil's file into.


David MacKay