Huffman encoding [DRAFT]

There's quite a lot on the web about data compression. Some of it is highly technical; other pages glory in saying "Well, you're probably doing a school project or something, so we're not going to do your assignment for you. Assuming you've done your homework and know Huffman coding..".

We don't have many such axes to grind. We believe if you do have an assignment, and are one of those people who reads the solution before you try the assignment, then either you're very bright, or you're a total schmuck who won't amount to much! Either way, here's our simple approach to Huffman compression. We do assume that you have rudimentary programming skills, understand simple binary arithmetic, and know what ASCII is.

Introduction

You may wish to skip to the meat, but why not read our tiny intro? In 1951, David Huffman wrote a little post-graduate assignment which was destined to become a computer-programming classic. Huffman encoding is perhaps to computer science what fast Fourier transforms are to electrical engineering - a dietary staple that crops up everywhere. Huffman's scheme is neat, for it allows us to compress data optimally, in a predictable, quick fashion. "Hang on", you cry - "surely it cannot be optimal?" Well, that depends on what you mean by optimal. Huffman is optimal, provided you are looking at your data as made up of a given alphabet of 'things' that you want to represent in the shortest way you can. 'Letters' in your alphabet will be optimally represented, with more common letters taking up fewer bits, and less frequently used letters being allocated far longer bit strings to represent them. This doesn't mean that you cannot achieve more compression by looking at the relationships between letters, and encoding these. But, at the end of the day, when you have a whole lot of 'identifiable letters' lumped together, Huffman is often the way to go. This is why we see Huffman encoding used together with other compression schemes, such as the notorious Lempel-Ziv-Welch compression used in GIF files (notorious because Unisys suddenly started enforcing their LZW patent a few years back, a silly move which angered not a few webizens).

When we say 'optimal', we should also mention that there are different ways of applying Huffman compression. For example, you might have a body (corpus) of text that you want to compress, and know from the start the relative frequencies of all the letters in this text. You could then design a 'static' compression scheme and apply it to the text. Conversely, you might have only the first part of the text (and already want to start compressing) - you might then use an 'adaptive' scheme of Huffman encoding! Or you might have several texts in different languages - each one might need its own scheme for optimal compression. In addition, if you're sending your compressed data down a wire, note that you also have to tell the person on the other end something about how the data are compressed - and this information about how the compression has been arranged takes up space, too. There are also tricks and traps in actually doing the compression - and some data sets are inherently unfriendly when it comes to doing what Huffman does, which is to represent the letter frequencies in the form of a 'binary tree'.

Compression

It's obvious if you count the frequencies of letters in a text - say Charles Dickens' A Tale of Two Cities - that some letters occur more frequently than others. In most large English texts, the letter 'e' is by far the most common (there are one or two exceptions)! Now let's try some trickery. We know that text is commonly stored as ASCII codes - and for plain 7-bit ASCII characters stored in 8-bit bytes, the highest bit of each byte is zero. Let's go through our text, and wherever we encounter the 8-bit code for 'e' (01100101), we replace it by the single bit '1'. We save seven bits for every letter 'e'! If we now read our text from left to right, we can see that if we first encounter a '0', then this '0' and the seven bits that follow it make up an ordinary 8-bit ASCII code; but if the first bit is a '1', then this alone represents an 'e', and we can move on to the next bit and examine it in turn, to see whether it's another 'e', or another 8-bit character!
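
To make the bit-juggling concrete, here is a minimal Python sketch of this toy scheme (the function names are ours, and we assume plain 7-bit ASCII input):

  def encode_e(text):
      # Toy scheme: 'e' becomes the single bit '1'; every other character
      # keeps its full 8-bit ASCII code, whose leading bit is 0.
      bits = ''
      for ch in text:
          if ch == 'e':
              bits += '1'
          else:
              bits += format(ord(ch), '08b')    # e.g. 's' -> '01110011'
      return bits

  def decode_e(bits):
      out, i = '', 0
      while i < len(bits):
          if bits[i] == '1':                    # a lone '1' is an 'e'
              out += 'e'
              i += 1
          else:                                 # a '0' starts a full 8-bit code
              out += chr(int(bits[i:i + 8], 2))
              i += 8
      return out

  coded = encode_e("see the tree")
  print(len(coded), decode_e(coded))            # 61 bits instead of 96, text intact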

We can see that the above approach is far from optimal - there are other letters, apart from 'e', that occur frequently. Each letter has its own peculiar frequency, so it would seem appropriate to represent some common letters with bit strings of one particular length, and other, less common letters with longer bit strings. Just as we did with our clumsy e-xample, we need to uniquely distinguish between different strings, depending on the particular bit(s) they start with. In our example, every representation of 'e' began (and ended there and then) with a binary one; everything else started with '0'. You can also work out that it's only really worthwhile using a single bit for the letter 'e' if nearly half of the letters are 'e'. If a quarter of the letters are 'e', then you might represent this letter by two bits, and so on.

Let's have a look at how Huffman designed an optimal approach. When he designed his scheme, Huffman was faced with a problem. Consider a text, and an alphabet of symbols together with their frequencies in that text, for example (swiped from Lanaki's cryptography lesson):

Letter use frequencies in "A Tale of Two Cities": 
E:    72881    12.4%    
T:    52397     8.9%    
A:    47072     8.0%    
O:    45116     7.6%    
N:    41316     7.0%    
I:    39710     6.7%    
H:    38334     6.5%    
S:    36770     6.2%    
R:    35946     6.1%    
D:    27487     4.6%    
L:    21479     3.6%    
U:    16218     2.7%    
M:    14928     2.5%    
W:    13835     2.3%    
C:    13223     2.2%    
F:    13152     2.2%    
G:    12121     2.0%    
Y:    11849     2.0%    
P:     9452     1.6%
B:     8163     1.3%
V:     5044     0.8%
K:     4631     0.7%
Q:      655     0.1%
X:      637     0.1%
J:      623     0.1%
Z:      213     0.0%

Total letter count = 586747
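
Counting such frequencies for yourself is easy. Here's how it might look in Python (the file name is just a placeholder for whatever text you have lying around):

  from collections import Counter

  # Tally the letters A-Z in a text file and print a table like the one above.
  text = open('two_cities.txt').read().upper()
  counts = Counter(ch for ch in text if 'A' <= ch <= 'Z')
  total = sum(counts.values())
  for letter, n in counts.most_common():
      print('%s: %8d  %4.1f%%' % (letter, n, 100.0 * n / total))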

You find that you rapidly get bogged down if you start assigning codes beginning with the most commonly occurring letters. There's a vast number of ways you can combine frequencies, once you start moving down and looking at less and less frequently-occurring letters. Try it for yourself! Huffman's answer was devastatingly simple - just start at the other end, with the least common letters! The trick is to take the frequencies of the two least common characters, and add them together. You now have a 'metacharacter' that stands in place of these two characters. You next take all the remaining frequencies (including that of the new metacharacter), decide which two characters are now the least common, and repeat the process until you have only two characters left (at least one must be a metacharacter, unless your alphabet has only two characters). You then work back down to your initial characters, but this time, every time you encounter a metacharacter and split it into two, you assign a '0' to the representation of one character (or metacharacter), and a '1' to the other branch.

We'll try an example in a moment. But you can immediately get a feel for what you're doing. Every time you move back down towards the infrequently used characters, as you split into either 0 or 1, you are creating a binary tree, with branches, and sub-branches, and sub-sub-branches.. Note that only very rarely will such trees have branches that are completely symmetrical - if a letter is very common, its 'leaf' will occur very near the 'root' of the tree, making the tree lopsided.

An example

Consider the following simple alphabet, with corresponding frequencies:

Letter Frequency (%)
E 36
S 19
A 17
D 14
R 10
Z 4

We progressively apply our Huffman algorithm, thus:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      E=36      DRZSA=64
          S=19      S=19      DRZ=28    SA=36     E=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

You can see what we've done - first we set up our table with the initial values and frequencies. We then add the frequencies for R and Z (call this RZ), and then re-rank everything so that items are still sorted by frequency. We repeat this again and again, until after Pass 4, we have only two items left - the metacharacter DRZSA, and the character E.

Turning this into a binary tree is then simple - let's split our tree into two, binary '1' for E, and binary '0' for DRZSA. E is a 'leaf', so we turn to DRZSA, which we in turn split into two - '01' for DRZ, and '00' for SA. We repeat the process for both metacharacters, deriving '011' for D, and '010' for RZ. Similarly SA is split into '001' for A, and '000' for S. But wait, we still need to sort out RZ, so we split it into '0101' (R) and '0100' (Z). We're finished. Here's a schematic of the binary tree we've created:

      E (1)            D (011)
     /                /
root            DRZ (01)        R (0101)
     \        /       \       /
      \      /         RZ (010)
     DRZSA (0)               \
             \      A (001)   Z (0100)
              \    /
              SA (00)
                   \
                    S (000)

Notice several things:

  - The most frequent letter, E, gets the shortest bit string, while the rare letters R and Z get the longest.
  - No bit string is a prefix of any other, so a decoder reading from left to right always knows where one character ends and the next begins.
  - As promised above, the tree is lopsided rather than symmetrical.

We can now make a table of bit strings for each character:

Bit string table
Character Bit string Length
E 1 1
D 011 3
A 001 3
S 000 3
R 0101 4
Z 0100 4

You can see that despite the constraints of the algorithm, we still have room to play! There were several points where we made 'arbitrary choices'. This is a good thing, as we will soon discover. For there are problems..
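
As a cross-check on the hand-worked table above, here is a minimal Python sketch of the merge-the-two-smallest procedure, using a priority queue (the function name and the tie-breaking are ours). Because of the arbitrary choices just mentioned, the exact bit strings - and, where frequencies tie, even the lengths - can come out differently from one implementation to another, but the total cost is the same either way, as we'll see under 'Alternatives' below:

  import heapq

  def huffman_codes(freqs):
      # Repeatedly merge the two least frequent (meta)characters, and at each
      # split prepend '0' to one branch and '1' to the other.
      # Each heap entry is (frequency, tie-breaker, {character: code so far}).
      heap = [(f, i, {c: ''}) for i, (c, f) in enumerate(freqs.items())]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          f1, _, lo = heapq.heappop(heap)       # least frequent
          f2, _, hi = heapq.heappop(heap)       # next least frequent
          merged = {c: '0' + code for c, code in lo.items()}
          merged.update({c: '1' + code for c, code in hi.items()})
          count += 1
          heapq.heappush(heap, (f1 + f2, count, merged))
      return heap[0][2]

  codes = huffman_codes({'E': 36, 'S': 19, 'A': 17, 'D': 14, 'R': 10, 'Z': 4})
  for c in sorted(codes, key=lambda ch: (len(codes[ch]), ch)):
      print(c, codes[c])
  # With this particular tie-breaking we happen to land on the other, equally
  # good tree (see 'Alternatives' below): A, E and S get 2 bits, D gets 3,
  # and R and Z get 4 - still 242 bits per hundred letters.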

Doing the compression

Okay, so we now have our (optimal?) tree that represents the Huffman codes for all of our characters. The question is, how do we use it? (We'll leave actual encoding of the binary tree itself for later). There are two distinct problems. These are:

  1. Encoding our text corpus
  2. Telling someone how to decode the encoded data

Encoding

Conceptually, this is simple. All one does is take the string we wish to encode, and then, using a table of lengths and bit-strings constructed as above, for every 'E' we encounter, substitute '1' (binary); for every 'A', we substitute '001', and so on. The mechanics of doing so are mildly tricky, however!

One way of doing things is to set aside binary words in memory that are big enough to contain the longest binary Huffman string. Also set up an array of lengths of each binary string corresponding to a particular character. We then work sequentially through the string to be translated (don't do this just yet), looking up the binary word for each character, and then tacking the correct number of bits onto our output string. If you set things up correctly, this should be quick. Oh! Remember to somehow record the length of the finished output string, so extra 0's or whatever at the end don't get mis-interpreted as data!
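
Here is a minimal Python sketch of that packing step, using the bit string table above (the function name is ours, and we keep the table as plain strings of '0's and '1's for clarity rather than speed):

  def pack(text, codes):
      # Tack each character's bits onto an output buffer, flushing whole bytes,
      # and return (packed bytes, number of valid bits).
      out = bytearray()
      acc, nbits = 0, 0                  # bit accumulator and how full it is
      for ch in text:
          for b in codes[ch]:
              acc = (acc << 1) | int(b)
              nbits += 1
              if nbits == 8:             # a full byte: flush it
                  out.append(acc)
                  acc, nbits = 0, 0
      total_bits = 8 * len(out) + nbits
      if nbits:                          # pad the final partial byte with 0s
          out.append(acc << (8 - nbits))
      return bytes(out), total_bits      # keep the true bit length!

  codes = {'E': '1', 'D': '011', 'A': '001', 'S': '000', 'R': '0101', 'Z': '0100'}
  packed, nbits = pack("ADDRESSEE", codes)
  print(nbits, packed.hex())             # 22 2dac0c - 22 bits instead of 72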

Decoding - Canonical Huffman

Decoding is a bit more subtle, and actually colours our encoding scheme! We need to tell the 'decoder' how things were encoded, and the encoder therefore needs to find an efficient algorithm to do just that.

First, the decoder needs to know exactly how the string was encoded. For this, we employ the trickery we referred to above. The magic term is 'canonical Huffman'. The word canon comes from a Greek/Latin word meaning a 'rule', often actually a rather arbitrary edict. Our edict is that, when encoding the bit strings:

  1. Wherever we split a (meta)character into two, the '1' always goes to the branch that ends up with the shorter bit strings.
  2. Characters whose bit strings turn out to be the same length are allocated their codes in alphabetical order.

The reason for this strange set of rules will become apparent. But first, let's take the Huffman table from above:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      E=36      DRZSA=64
          S=19      S=19      DRZ=28    SA=36     E=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

We next construct a tree, slightly different from the one above, because we will now always allocate the '1's to shorter branches:

    E(1)          A(011)
   /             /
root       SA (01)
   \      /      \
   DRZSA(0)       S(010)
          \       D(001)
           \     /
           DRZ(00)      Z(0001)
                 \     /
                  RZ(000)
                       \
                        R(0000)

But you can see there's a problem - for although we've meticulously allocated 1's to the shorter codes (E gets a one, DRZSA gets a zero; D gets 001 while RZ gets 000), we haven't preserved alphabetical order! Take A, S and D, which have the same lengths. To preserve alphabetical order, the value for A should be 001, D 010, and S 011. Also note that even if we swop around S and A, making S=011 and A=010, D is still misplaced.

The solution is to create our tree as above, allocating codes, and then, for each set of characters with bit strings of the same length, to re-allocate their codes in alphabetical order! So, for example, with A, D and S, which are all of length three, we re-allocate values - A becomes 001, D 010 and S 011. Note that nothing has really changed, as we still need three bits for each character. We just re-arranged things slightly, thus:

    E(1)               |A(001)
   /             /     |
root       SA (01)     |
   \      /      \     |
   DRZSA(0)            |D(010)
          \            |S(011)
           \     /
           DRZ(00)      Z(0001)
                 \     /
                  RZ(000)
                       \
                        R(0000)

Why?

Well, the answer is that if we use canonical Huffman coding to encode a text, then all the person on the other end needs in order to decode the text is a list of the lengths of all of the characters in the alphabet! You can easily see why this is, by working through our example .. if we know that there's only one code of length 1 then '1' must refer to E, so everything starting with 0 must refer to bit strings with length greater than 1. If there are no characters of length 2, then we move onto length 3, where there are now four possibilities (000, 001, 010 and 011 - remember that E gobbled up the first '1', so 100 and so on are not possibilities).

Knowing that A, S and D are of length three, we can work out what the corresponding codes are - because shorter strings eat the 'ones', 000 cannot refer to A, S or D (it is needed as the prefix of the two four-bit codes). Because we preserve alphabetical order, the values must be A=001, D=010, and S=011.

In summary, for the person receiving a canonical Huffman-encoded text to decode the text, all she needs to know is the length (in bits) of the code for each of the characters. She can work out all the rest! Here is our table of bit strings - note that the actual strings are redundant information, now that we have used canonical Huffman encoding.

Canonical bit string table
Character Bit string Length
E 1 1
A 001 3
D 010 3
S 011 3
R 0000 4
Z 0001 4
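
Here's a minimal Python sketch of that reconstruction, following the two rules above (the '1's go to the shorter codes, and equal-length codes run in alphabetical order). The function name is ours; fed nothing but the lengths, it reproduces the table exactly:

  def canonical_codes(lengths):
      # Rebuild each bit string from (character -> code length) alone.
      codes = {}
      code = 0                             # next free value at this length
      for length in range(max(lengths.values()), 0, -1):
          for ch in sorted(c for c, n in lengths.items() if n == length):
              codes[ch] = format(code, '0%db' % length)
              code += 1
          # Move up to codes one bit shorter, rounding up so that nothing
          # already assigned acquires a shorter code as its prefix.
          code = (code + 1) >> 1
      return codes

  print(canonical_codes({'E': 1, 'A': 3, 'D': 3, 'S': 3, 'R': 4, 'Z': 4}))
  # E -> 1, A -> 001, D -> 010, S -> 011, R -> 0000, Z -> 0001

Decoding is then just a matter of reading bits one at a time until the bits seen so far match one of these strings - the prefix property guarantees there is never more than one candidate.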

Alternatives

Think for a moment. We could also have arranged things thus:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      AS=36     EDRZ=64
          S=19      S=19      DRZ=28    E=36      AS=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

(All we did was meld E and DRZ, rather than SA (here written AS) and DRZ as we did above. Perfectly legal, as E and AS have identical frequency weightings of 36.) The corresponding canonical Huffman tree (drawn before the alphabetical re-shuffle, which changes no lengths) is:

           A(10)
          /
      AS(1)
     /    \
    /      S(11)
root       E(01)
     \     /
      EDRZ(0)      D(001)
           \      /
            DRZ(00)     R(0000)
                  \    /
                   RZ(000)
                       \
                        Z(0001)

Now you can see that even a canonical Huffman tree needn't be unique. We still have some scope for playing! So which variant is better? If you think about it, in our first tree, the letter 'E' was encoded using just a single bit - here it gets two, but then here, A and S also get two bits rather than three. Consider a hundred characters. Using either encoding system, the number of bits required will be:

    total bits = sum over all letters of (letter count) x (bits per letter)

In other words, add up the product of the bit size of each letter and the number of occurrences of that letter, to get the total bits.

For the first system, this works out at:

36*1 + 19*3 + 17*3 + 14*3 + 10*4 + 4*4 = 242 bits

For the second, we get:

36*2 + 19*2 + 17*2 + 14*3 + 10*4 + 4*4 = 242 bits

Intriguing, isn't it? Now you have to ask yourself, is this just luck, or always the case?
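
If you'd like to check the arithmetic (or experiment with other frequency tables), the sum is a one-liner in Python - the frequencies and code lengths below are just the two variants from this example:

  freqs  = {'E': 36, 'S': 19, 'A': 17, 'D': 14, 'R': 10, 'Z': 4}
  first  = {'E': 1, 'S': 3, 'A': 3, 'D': 3, 'R': 4, 'Z': 4}   # lengths, first tree
  second = {'E': 2, 'S': 2, 'A': 2, 'D': 3, 'R': 4, 'Z': 4}   # lengths, second tree
  print(sum(freqs[c] * first[c] for c in freqs),
        sum(freqs[c] * second[c] for c in freqs))             # 242 242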

Practical Points

{We need to make a few notes on e.g. Fibonacci}

NB: ceil(log2(alphabetsize)) is the maximum number of 1's in a row! i.e. a code word may be longer than this, but then its high-order bits must be zero, which helps with encoding!

No Huffman code can be longer than alphabetsize - 1.
