Compression Design

NOTE: This document will be updated when compression algorithm designs are created, or requirements from the operating system change.

Introduction

Compression is the act of removing redundant information from an information stream, representing it as it's core self. This is a difficult problem, that is well known to have a limit that can be defined by looking at the entropy of the information you wish to compress:

E = log_2 p . p

Requirements

In place compression - we have no extra RAM for the data
Small de/compression program - less than or equal to 512 bytes
Must be perfectly reliable - binary data is run as code
De/compression verification - we must know it is valid before running it

Assumptions

The following will be assumed about the data, to allow theoretical application be discovered without writing the actual application (for the purpose of sanity checking):

5% is 7 bit String data, 95% is 8 bit binary data
Data is randomly generated in the number of allocated bits
We are compressing 512 byte blocks

This means that following can also be assumed:

5% of 512 bytes is 26 bytes of String data
95% of 512 bytes is 486 bytes binary data

Ideas Exploration

Program string compression (7 bit ASCII, common data range, grey codes)
Binary dictionary compression (Replace common strings with lookup indicator)

String Compression

This version involves using a reserved byte for indicating the start and end of a String section. This means that we have to use a byte at the start, but fortunately the end of a String is indicated using a NULL byte. Best case scenario, we would have a large String, so that we get a saving of 25 - ((25 * 7) / 8) = 3. This is very little saving for just 3 bytes.

Saving: 3 out of 512 bytes

Binary Dictionary Compression

This version will be looking at taking pairs of matches and compressing them into a dictionary of fixed size, leaving an indicator in it's place. We are particularly looking for triple matches - as these are worth replacing. The probability of this occurring twice is likely because of the nature of the language, where some assembly instructions or calls will repeat several times.

A reasonable dictionary size would be:

[<1> Dictionary size][<1> Match byte][<3> Match]..[<3> Match]

The Dictionary size is the number of Match entries in the dictionary. Match byte represents the byte to look for to look for a match. The Match represents the original three bytes that were replaced, where it's position in the dictionary represents the index number next to the Match byte in the compressed stream. The following are rules for the Match byte in the compressed stream:

The Match byte should be the least represented byte in the stream to increase compression ratio.
The Match byte can be represented in the compressed stream by using [<1> Match byte][<1> 255].
The number following the Match byte in the data stream represents the index in the dictionary for the replacement bytes.

Assuming that we have three different repeats, where we see 2 repeats of each, we will compress 3 bytes * 3 patterns * 2 repeats = 18 bytes. From these 18 bytes we should save 18 - ((3 bytes * 3 patterns) + dictionary size + match byte) = 7 bytes.

Saving: 7 out of 512 bytes

Conclusion

Assuming we will have 16 programs in the final project, each 512 bytes in size - unless we save 512 / 16 = 32 bytes per program reliably - we won't even save any space with the added decompression program. The algorithms will we be significantly more difficult in assembly to write reliably, the compression program will be difficult for external people to run and there is still the issue of non-fixed size decompressed program sizes.

Unless there are significant savings - this is not worth running.