COMPRESSION ALGORITHMS

In this section we will examine some of the most important compression algorithms developed to date.

Shannon-Fano Encoding

This is a basic information-theoretic algorithm that uses a top-down approach. Messages are ranked in descending order of probability, and the table is then divided into two parts with approximately the same total count.

Binary zero is assigned to the upper section, and binary one to the lower section. The process is then repeated on each part until no further division is possible.

This method ensures that the most common messages get fewer bits while the less common messages get more. The code is also instantaneously decodable, since no code word forms the initial part of another code word.

Example:

Encoding for the Shannon-Fano Algorithm:

· A top-down approach

1. Sort symbols according to their frequencies/probabilities, e.g., ABCDE.

2. Recursively divide into two parts, each with approx. same number of counts.[i]

      Symbol   Count   log(1/p)   Code   Subtotal (# of bits)
      ------   -----   --------   ----   --------------------
        A        15      1.38      00            30
        B         7      2.48      01            14
        C         6      2.70      110           18
        D         6      2.70      110           18
        E         5      2.96      111           15

                          TOTAL (# of bits):     89
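The sort-and-divide procedure above can be sketched in Python. This toy version reproduces the table's codes; the split point is chosen to make the counts of the two halves as nearly equal as possible:

```python
def shannon_fano(symbols):
    """Build Shannon-Fano codes for (symbol, count) pairs sorted by descending count."""
    codes = {}

    def divide(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(count for _, count in items)
        # find the split point that makes the two parts' counts most nearly equal
        best_i, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_i, best_diff = i, diff
        divide(items[:best_i], prefix + "0")  # upper part gets binary 0
        divide(items[best_i:], prefix + "1")  # lower part gets binary 1

    divide(symbols, "")
    return codes

table = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
codes = shannon_fano(table)
# codes == {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
total_bits = sum(count * len(codes[sym]) for sym, count in table)
# total_bits == 89, matching the table's total
```

The first split falls between B and C (22 counts versus 17), which is why A and B get two-bit codes while D and E need three.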

Huffman Coding

 

Huffman coding is a more sophisticated and efficient technique. It converts the characters in a data file to binary codes, assigning the most common characters the shortest codes and the least common characters the longest. [ii]

 

These binary codes, or code words, are derived by listing the symbol probabilities in descending order.

The lowest two probabilities are combined to give a new probability which is the sum of the two.

The process is then repeated until all the probabilities have been combined and the final sum is 1.

At each combination, 0 and 1 are assigned to the two probabilities: 0 to the upper division and 1 to the lower division.

Finally, to obtain the binary code for each symbol, the diagram is read from right to left.

 

I will now demonstrate how to use a Huffman coding diagram with the following probabilities:

 

X1 = 0.25,  X2 = 0.25,  X3 = 0.2,  X4 = 0.15,  X5 = 0.08,  X6 = 0.04,  X7 = 0.02,  X8 = 0.01

Combining the two lowest probabilities at each step, with 0 assigned to the upper branch and 1 to the lower branch of every combination:

   P(X7) 0.02 + P(X8) 0.01 = 0.03
   P(X6) 0.04 + 0.03       = 0.07
   P(X5) 0.08 + 0.07       = 0.15
   P(X4) 0.15 + 0.15       = 0.30
   P(X2) 0.25 + P(X3) 0.20 = 0.45
   P(X1) 0.25 + 0.30       = 0.55
   0.45       + 0.55       = 1.00

Reading the assignments back from right to left, from the final combination towards each symbol, gives the code words:

   X1 = 10      X2 = 00      X3 = 01        X4 = 110
   X5 = 1110    X6 = 11110   X7 = 111110    X8 = 111111
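The combination process can also be sketched in Python. This toy version uses a min-heap; because the exact 0/1 labels depend on tie-breaking, it reports the code lengths, which are what determine the compressed size:

```python
import heapq

def huffman_code_lengths(probs):
    """Repeatedly combine the two lowest probabilities; each combination
    adds one bit to the code of every symbol inside the combined group."""
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    tiebreak = len(heap)  # keeps tuple comparison away from the symbol lists
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, group1 + group2))
        tiebreak += 1
    return lengths

probs = {"X1": 0.25, "X2": 0.25, "X3": 0.2, "X4": 0.15,
         "X5": 0.08, "X6": 0.04, "X7": 0.02, "X8": 0.01}
lengths = huffman_code_lengths(probs)
# X1, X2, X3 get 2-bit codes; X7 and X8 get 6-bit codes
avg = sum(p * lengths[s] for s, p in probs.items())
# average code length: 2.55 bits per symbol
```

The average of 2.55 bits per symbol is well below the 3 bits a fixed-length code for eight symbols would need.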

Run Length Encoding

 

Run-length encoding (RLE), also known as run-length limiting (RLL), is a very simple form of data compression. The compression software scans through the file looking for runs of repeated characters; when it finds one, it stores the run as an escape character, followed by the repeated character and a count of the number of repetitions.

For example, the sequence below:

   ZZZZZZZZZZZZZZZZZZZZ

can be converted to:

   <ESC>Z<20>

Run-length encoding is a lossless data compression technique and is most useful on data that contains many such runs; for example, simple graphic images such as icons and line drawings.
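A minimal sketch of this escape-character scheme in Python. The ESC byte, the minimum run length of 4, and the single-byte count are illustrative assumptions, not a standard format, and the input must not itself contain the ESC character:

```python
ESC = "\x1b"  # escape character marking a compressed run (illustrative choice)

def rle_encode(text):
    """Store each run of 4 or more repeated characters as ESC + char + chr(count)."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i] and j - i < 255:
            j += 1
        run = j - i
        if run >= 4:  # shorter runs cost more to escape than to store literally
            out.append(ESC + text[i] + chr(run))
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

def rle_decode(data):
    out, i = [], 0
    while i < len(data):
        if data[i] == ESC:
            out.append(data[i + 1] * ord(data[i + 2]))  # char repeated count times
            i += 3
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

compressed = rle_encode("Z" * 20)
# compressed == ESC + "Z" + chr(20): 3 bytes instead of 20
```

Runs shorter than four characters are left as literals, since the three-byte escape sequence would make them larger.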

 

Run-length encoding is also often used as a pre-processor for other compression algorithms. It does not work well at all on continuous-tone images such as photographs, although JPEG uses it quite effectively on the coefficients that remain after transforming and quantizing image blocks. [iii]

LZ-77 Encoding

 

LZ77 is known as substitutional coding. It is a lossless compression algorithm: a simple but effective method for compressing text that takes advantage of the redundant nature of text.

LZ77 looks for repetitions of words and characters in a phrase. When it finds one, it replaces the repetition with a pointer back to the earlier occurrence, followed by a number indicating how many characters are to be matched.

An important part of LZ77 encoding is the buffer, or sliding window, which stores recently seen text so that the encoder can refer back to it. The size of the buffer matters: if it is too small, not many characters can be stored, and if it is too big, searching for matches becomes slower and more expensive.

 

There is a similar variation of this method called LZ78. The two algorithms are closely related: both are dictionary coders, but LZ77 refers back into the recently seen (past) data, while LZ78 builds an explicit dictionary of phrases as it goes. These two algorithms are the basis for the other LZ variations, including LZW.

 

Here is an example of how LZ 77 works:

 

Let’s look at the sentence below

 

the_rain_in_Spain_falls_mainly_in_the_plain

-- where the underscores ("_") indicate spaces. This uncompressed message is 43 bytes, or 344 bits, long.

At first, LZ-77 simply outputs uncompressed characters, since there are no previous occurrences of any strings to refer back to. In our example, these characters will not be compressed:

   the_rain_

The next chunk of the message:

   in_

-- has occurred earlier in the message, and can be represented as a pointer back to that earlier text, along with a length field. This gives:

   the_rain_<3,3>

-- where the pointer syntax means "look back three characters and take three characters from that point." There are two different binary formats for the pointer:

A flag bit with a value of 0 indicates a pointer. This is followed by a second flag bit giving the size of the pointer, with 0 indicating an 8-bit pointer and 1 indicating a 12-bit pointer. So, in binary, the pointer <3,3> would look like this:

   00 00000011 0011

The first two bits are the flag bits, indicating a pointer that is 8 bits long. The next 8 bits are the pointer value, while the last four bits are the length value. [iv]
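The sliding-window search described above can be sketched as a toy version in Python. It emits literal characters or (distance, length) pointers as plain tuples, leaving out the packed flag-bit formats; the 255-byte window and 3-character minimum match are illustrative assumptions:

```python
def lz77_compress(text, window=255, min_match=3):
    """Greedy LZ77: emit literals, or (distance, length) pointers into the window."""
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:  # only matches of 3+ characters are worth a pointer
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return out

def lz77_decompress(tokens):
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])  # copy from `dist` characters back
        else:
            out.append(token)
    return "".join(out)

message = "the_rain_in_Spain_falls_mainly_in_the_plain"
tokens = lz77_compress(message)
# the first nine tokens are the literals "the_rain_"; the tenth is the pointer (3, 3)
```

Running it on the example sentence reproduces the <3,3> pointer from the walkthrough: "in_" is found three characters back.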

LZW Coding

 

LZW coding is based on LZ78 and is an improved version of it. Instead of emitting pointers back into the text, it builds a dictionary (also called a translation table) of the strings it has seen, and outputs dictionary indices to achieve compression; rather than matching single characters only, it matches progressively longer strings.

Its "dictionary" can hold entries for up to 4096 variable-length strings. The first 256 entries (codes 0-255) are used for single characters, so the first string entry is actually code 256.

 

When compressing text files, LZW initializes the first 256 entries of the dictionary with the 8-bit ASCII character set (values 00h through FFh) as phrases. These phrases represent all possible single-byte values that may occur in the data stream, and all substrings are in turn built from these phrases. Because both LZW encoders and decoders begin with dictionaries initialized to these values, a decoder need not have the original dictionary and instead will build a duplicate dictionary as it decodes. [v]
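A sketch of the compression side in Python, with the dictionary seeded exactly as described: 256 single-byte entries, string entries starting at code 256, capped at 4096. The variable-width bit packing used by real implementations is omitted:

```python
def lzw_compress(data):
    """Classic LZW: grow the current phrase while it is in the dictionary;
    on a miss, output the phrase's code and add the extended phrase."""
    dictionary = {bytes([i]): i for i in range(256)}  # single-byte phrases 0-255
    next_code = 256                                   # first multi-byte string entry
    phrase, out = b"", []
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:
            phrase = candidate                        # keep growing the match
        else:
            out.append(dictionary[phrase])
            if next_code < 4096:                      # 12-bit code limit
                dictionary[candidate] = next_code
                next_code += 1
            phrase = bytes([value])
    if phrase:
        out.append(dictionary[phrase])
    return out

codes = lzw_compress(b"ABABABA")
# codes == [65, 66, 256, 258]: "A", "B", then dictionary entries "AB" and "ABA"
```

Note how the decoder needs none of this dictionary up front: it can rebuild the same entries in the same order as it processes the codes.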

 

This lossless compression algorithm is also used in GIF and TIFF image files.

Arithmetic Coding

 

Arithmetic coding is an improvement over Huffman coding; it achieves a better compression ratio, although it is slower.

“Arithmetic coding completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single floating point output number. The longer (and more complex) the message, the more bits are needed in the output number”. [vi]

 

Arithmetic coding works by having a probability line from 0 to 1 and assigning every symbol a range on this line based on its probability: the higher the probability, the larger the range assigned to it.
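The subdivision can be sketched as follows, using a hypothetical three-symbol alphabet. Real coders output bits from the interval and use integer arithmetic with renormalization to avoid floating-point precision limits; this sketch only shows how the interval narrows:

```python
def arithmetic_encode(message, probs):
    """Narrow the [low, high) interval once per symbol; any number
    in the final interval identifies the whole message."""
    # assign each symbol a sub-range of [0, 1) proportional to its probability
    ranges, cumulative = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (cumulative, cumulative + p)
        cumulative += p
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

probs = {"A": 0.6, "B": 0.3, "C": 0.1}  # hypothetical alphabet
low, high = arithmetic_encode("AAB", probs)
# "A" narrows the line to [0, 0.6), "AA" to [0, 0.36), "AAB" to about [0.216, 0.324)
```

A more probable message leaves a wider final interval, which is exactly why it needs fewer bits to pin down a number inside it.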

STAC (LZS) coding

 

Cisco internetworking devices use the STAC (LZS) and Predictor data compression algorithms. STAC (LZS) was developed by STAC Electronics and is based on the Lempel-Ziv compression algorithm.

 

Cisco’s IOS software employs an enhanced version of the LZS compression algorithm, which achieves high compression ratios, although it requires more CPU cycles to perform compression.

 

STAC is available in Cisco's Link Access Procedure, Balanced (LAPB), HDLC, X.25, and Frame Relay data compression solutions. FRF.9 and IPComp use the LZS compression algorithm.

 

STAC searches for redundant strings and replaces them with “tokens” which are shorter than the strings they replace. STAC also creates tables, or dictionaries, of these string matches and tokens, which consist of pointers into the previous data stream. This dictionary is built up and then used to replace the redundant strings found in new data streams. [vii]

Predictor coding

 

The Predictor compression algorithm, as its name implies, tries to predict the next sequence of characters. It uses an index to look up a sequence in its compression dictionary, then checks whether the next sequence in the data matches it. If it does, the sequence is replaced by its dictionary reference; otherwise the algorithm finds the next character sequence in the index and repeats the process.
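As an illustration of the prediction idea only (a toy guess-table predictor, not Cisco's actual algorithm): predict each byte from the previous one, so a correct guess costs a single flag bit and a miss costs the flag plus the literal byte:

```python
def predictor_encode(data):
    """Toy predictor: a 256-entry guess table keyed by the previous byte.
    A hit emits flag 1 (byte omitted); a miss emits flag 0 plus the literal."""
    table = [0] * 256
    context = 0
    flags, literals = [], []
    for byte in data:
        if table[context] == byte:
            flags.append(1)            # decoder can regenerate this byte itself
        else:
            flags.append(0)
            literals.append(byte)
            table[context] = byte      # learn: predict this byte next time
        context = byte
    return flags, literals

def predictor_decode(flags, literals):
    table = [0] * 256
    context = 0
    out, it = [], iter(literals)
    for flag in flags:
        byte = table[context] if flag else next(it)
        if not flag:
            table[context] = byte      # mirror the encoder's table updates
        out.append(byte)
        context = byte
    return bytes(out)

flags, literals = predictor_encode(b"abababab")
# after the first few misses the table predicts correctly: flags == [0, 0, 0, 1, 1, 1, 1, 1]
```

Because both sides update the guess table identically, the decoder stays in sync without ever receiving the table itself.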

 

The Predictor compression algorithm was taken from the public domain and optimized by Cisco; it is more efficient than the LZS algorithm in terms of CPU cycles, although it needs more memory.

Both of these algorithms can be used with the Point-to-Point Protocol (PPP) and the LAPB protocol.

[i] http://en.wikipedia.org/wiki/Run-length_encoding (accessed 25/12/2004)

[ii] http://www.vectorsite.net/ttdcmp1.html#m3 (accessed 26/12/2004)

[iii] http://www.cs.cf.ac.uk/Dave/Multimedia/node209.html (accessed 27/12/2004)

[iv] http://www.vectorsite.net/ttdcmp1.html#m5 (accessed 27/12/2004)

[v] http://netghost.narod.ru/gff/graphics/book/ch09_04.htm

[vi] http://dogma.net/markn/articles/arith/part1.htm

[vii] http://www.cisco.com/warp/public/cc/pd/iosw/tech/compr_wp.htm (accessed 28/12/2004)