|
In my runlength/Huffman encoder, I chose
a maximum runlength of 69 for the reasons explained in
this document (postscript)
| (pdf).
I assume the length
of the file is known to the decoder; this knowledge allows the compressed
file to be about 6 bits shorter. If I had to communicate the length of the
file, or indicate the end of the file somehow,
I would have to add roughly 6 bits again.
My encoder is a C program, RLencode.c; the
decoder is a perl program, RLdecode.p.
I am pretty sure that these programs will work on any files.
The usage is
RLencode < ~mackay/compress/filep.01.10000 > file.RLZ
RLdecode.p < file.RLZ > decoded
The encoder gets Neil's file into 831 bits.
If the source filelength is changed from 10000, please add the Nmax=blah
argument to inform the decoder of the correct filelength. Thus:
RLdecode.p Nmax=10000 < file.RLZ > decoded
|
|
It's hard to make an arithmetic code that works perfectly.
ACencode and ACdecode
work on all test files I have tried, but I
am still not certain they will always always work!
Arithmetic coding is nice because it lends itself to
adaptive models, corresponding for example to the belief
that the bias of the bent coin is not known, and should
be inferred as we go; or to the belief that the bias of
the bent coin might change with time.
The two programs have two choices of compilation options,
corresponding to two possible adaptive models. One
model (well suited to the competition problem) asserts
that the bias of the coin is known to be accurately very close
to 0.01; the other asserts that the bias is unknown and could
be anything in the ballpark 0.01-0.99.
This choice is determined in the file ACdefns.h, which
is included at compilation time by both programs.
The programs are used thus:
make ACdecode
make ACencode
ACencode < ~mackay/compress/filep.01.10000 > file.ACZ
ACdecode < file.ACZ > file.decoded
This encoder gets Neil's file into 829 bits.
The decoder makes use of the known source file length, N=10000.
The results achieved by arithmetic coding are especially
impressive for even larger files.
For example, I made a million-bit source file using
randNchooseM.p N=1000000 M=10000 > million.dat
and compared ACencode with the runlength encoder RLencode.
The compressed file lengths were
80797 file.ACZ
81025 file.RLZ
|
| Code C_alpha |
| n | traditional binary | length | c_alpha(n) |
| 1 | 1 | 1 | 1 |
| 2 | 10 | 2 | 010 |
| 3 | 11 | 2 | 011 |
| 4 | 100 | 3 | 00100 |
| 5 | 101 | 3 | 00101 |
| 6 | 110 | 3 | 00110 |
| . | . | . | . |
| 45 | 101101 | 6 | 00000101101 |
|
We can make a really small compression algorithm
that is reasonably well suited to sparse files
by simply counting the run lengths of 1s and 0s, then coding
those integers into binary with a simple uniquely
decodeable code.
The program IRL.c
encodes the runlengths using the code C_alpha,
described in Chapter 7 of
Information Theory, Inference, and Learning Algorithms. (Ch 7: Codes for Integers).
The same program can function as an encoder or a decoder;
add the command line argument "-1" to select the decoder.
encoding usage: IRL < ~mackay/compress/secret/CompressMe3 > file.IRLZ
decoding usage: IRL -1 < file.IRLZ > decoded
This program does not use an explicit probabilistic model; instead,
it uses an implicit probabilistic model defined by
the chosen codelengths for integers. For example, according to C_alpha, the
implicit probabilities of 2 and 3 are both 1/8, and the probabilities of
4, 5, 6, and 7 are all 1/32.
This program gets Neil's file into 1286 bits.
|
|
I wrote a position code program in perl.
There are two interesting ways to use it.
First, the simplest position code encodes the location
of each 1 in a 14-bit string. The compressed length is 1442 bits.
The decoder (which must know the file length N)
makes an all-zero file then reads the compressed file 14 bits
at a time, adding 1s to the all-zero file.
It stops when it reaches the end of the compressed file.
position.p multipleblocks=0 bits=14 ../filep.01.10000 > encoded.pos
position.p multipleblocks=0 bits=14 decode=1 encoded.pos > recovered
With a small modification, position codes can do better.
We encode the file in blocks of size b bits, where 2b=64
is the best choice; and encode each 1 in b bits. To uniquely specify
the encoder's progress from one block to the next, we
must include a code for the number of 1s in a block.
The optimal code for this number of 1s would have lengths
given by log of the appropriate Binomial distribution; but
for simplicity, I use a unary code: a 0 marks the end
of a block and a 1 indicates that another 1 exists in
the current block.
position.p bits=6 ../filep.01.10000 > encoded.pos
position.p bits=6 decode=1 encoded.pos > recovered
The compressed file size is 878 bits.
|
|
The Golomb code is a very simple code that is both a runlength code
and an approximate arithmetic coder.
The encoder has just two adjustable parameters:
-
one bit (here set to 0),
which identifies the more probable symbol, and
-
an integer m (here set to 6 or 7)
whose value defines the implicit probability of the less probable symbol, via
p1 ~= ln(2) 2-m.
We define M = 2m.
To encode a file, the Golomb encoder outputs a 1 every time the stream contains
M consecutive 0s.
Whenever it encounters (for some r between 0 and M-1) a string of r
consecutive 0s followed by a 1, it outputs a 0
followed by the integer r encoded as an m-bit binary number.
This encoder may be viewed as a special case of the runlength-based Huffman code with a
maximum runlength constrained
to be a power of 2, and all runs assigned equal implicit probability.
One may also view it as an approximate arithmetic coder; adaptation may be performed
by adjusting m. The Golomb coder was the starting point for the Z-coder,
an excellent compression algorithm used inside
djvu.
The programs, which depend on the header file Gdefns.h,
are used thus:
make GolombEncode
make GolombDecode
GolombEncode < filep.01.10000 > file.GZ
GolombDecode < file.GZ > file.decoded
This encoder gets the sparse file into 870 bits (when m=7)
and 838 bits (when m=6). That's very close to optimal, isn't it!
The decoder does not make use of a known source file length; when it hits
an EOF symbol, it stops decoding and terminates the file correctly.
|
|