Data, Data, Data... You can't get too much Data...
Yeah, right... You can get too much data. Or at least more than your storage can hold.

What you need then is a way to pack the data together without losing anything.

We shall see a couple of 'simple' ways of doing that.

RLE

The easiest is RLE, or 'Run Length Encoding'.
As the name implies, the data is separated into 'runs' and handled as such. A very basic RLE routine would be:

Prefix each run with a byte, letting the highest bit of the prefix Byte have this meaning:

0 - The run consists of the next Byte repeated n times. n is the value of the last 7 bits of the prefix.

1 - The run consists of the next n bytes, where n of course is the last 7 bits of the prefix.

There are a few variations on this algorithm:

n is the last 7 bits +1 (you don't have zero-length runs, after all.)

A signed prefix, where a negative value of n marks one of the two kinds of runs.

Just remember that a repeated run must be 3 bytes or longer before you actually save anything with this scheme.

Bitstuffing

A method that is not as easy to implement (but easier to understand) is bitstuffing.

This is a routine that works best on ASCII data.

Simply put, you consider the data as a stream of bits. Then you just remove the most significant bit of every Byte, and presto, the stream is only 7/8 of the original size...

The problems with this one are a low rate of compression, and a small snag when you bump into a Byte that actually needs the most significant bit. The MSB problem is solved by using a 'prefix' Byte. For example, a 0 might mean that the next symbol is 8 bits instead of 7. Of course, this means that the Byte you use as a 'prefix' also has to be prefixed wherever it occurs in the data.

These methods can of course be combined, by bitstuffing the literal runs of the first algorithm (the second kind of run).

Archives

Often, you can get a saving just by packing files together.

This is due to the fact that the file system allocates blocks of a specific size for a file: 0.5KB for each block on a diskette, but usually 2KB or more on a hard drive or compact flash. Since a file very seldom fits exactly into a multiple of these sizes, there's bound to be some loss. And if your files do fit exactly, are you certain that you have an efficient data structure in them?

To pack files together, you create a new file, and at the beginning of it you add the names and sizes of the files in the package.

You can add a pointer to where in the archive each individual file begins, but it's not necessary: you can find the same information by reading the header and adding up the file sizes.

Do remember to add checksums for the beginning of the file and for each file stored in the archive.

If you compress the files inside the archive, you should add information on how they are compressed.

BUT...

before you do anything like this, please consider your data again... Are you certain that the data is in the best possible format?
A common error is to store redundant data. I have seen data files that were 3 times the size they needed to be...

Also remember that compressing data can slow down reading and writing it, and compressed files are more vulnerable to corruption.
Consider adding a checksum of the uncompressed data in addition to the data itself.

