Compression Design

NOTE: This document will be updated when compression algorithm designs are created, or requirements from the operating system change.

Introduction

Compression is the act of removing redundant information from an information stream, representing it as it's core self. This is a difficult problem, that is well known to have a limit that can be defined by looking at the entropy of the information you wish to compress:

E = log_2 p . p

Requirements

Assumptions

The following will be assumed about the data, to allow theoretical application be discovered without writing the actual application (for the purpose of sanity checking):

This means that following can also be assumed:

Ideas Exploration

String Compression

This version involves using a reserved byte for indicating the start and end of a String section. This means that we have to use a byte at the start, but fortunately the end of a String is indicated using a NULL byte. Best case scenario, we would have a large String, so that we get a saving of 25 - ((25 * 7) / 8) = 3. This is very little saving for just 3 bytes.

Saving: 3 out of 512 bytes

Binary Dictionary Compression

This version will be looking at taking pairs of matches and compressing them into a dictionary of fixed size, leaving an indicator in it's place. We are particularly looking for triple matches - as these are worth replacing. The probability of this occurring twice is likely because of the nature of the language, where some assembly instructions or calls will repeat several times.

A reasonable dictionary size would be:

[<1> Dictionary size][<1> Match byte][<3> Match]..[<3> Match]

The Dictionary size is the number of Match entries in the dictionary. Match byte represents the byte to look for to look for a match. The Match represents the original three bytes that were replaced, where it's position in the dictionary represents the index number next to the Match byte in the compressed stream. The following are rules for the Match byte in the compressed stream:

Assuming that we have three different repeats, where we see 2 repeats of each, we will compress 3 bytes * 3 patterns * 2 repeats = 18 bytes. From these 18 bytes we should save 18 - ((3 bytes * 3 patterns) + dictionary size + match byte) = 7 bytes.

Saving: 7 out of 512 bytes

Conclusion

Assuming we will have 16 programs in the final project, each 512 bytes in size - unless we save 512 / 16 = 32 bytes per program reliably - we won't even save any space with the added decompression program. The algorithms will we be significantly more difficult in assembly to write reliably, the compression program will be difficult for external people to run and there is still the issue of non-fixed size decompressed program sizes.

Unless there are significant savings - this is not worth running.