A while back, I was working on a small compression algorithm for high entropy data, specifically for a tiny hobby operating system back in 2017. The idea was that I would use 256 bytes or less for decompression code that would decompress an operating system hiding in a 512 byte boot sector.
Unfortunately the project didn’t work out, but a recent post on the Pine forum got me thinking. Specifically:
This file is then lz4 compressed, and written to the spi flash of the pinetime. The pinetime decompresses, processes the blocks, and writes them to the display.
That sounds awesome :) Do you get much savings with lz4? I assume the data you are sending is already pretty high entropy?
It goes from 5046208 bytes to 4099181. Which makes it small enough to fit in the 4 MiB of spi flash
This lz4 lossless compression algorithm sounds awesome. It made me think about my previous attempts and what went wrong.
The following are some failed methods I created to try and compress high-entropy data. I thought about these quite some years ago, but it’s interesting to look back on them and see so obviously now why they couldn’t possibly work 1.
The idea was pretty simple with this one - you take a matrix and then attempt to create some abstract representation of it by performing a calculation.
Defining it better, we can give some example matrix of values , , and :
We can then sum each of the rows and columns to ‘compress’ the data: , , , . To recover the data, you simple need to solve the matrix and satisfy the summed values. This could be done trivially and the data is always perfectly recoverable.
Of course, there are no real savings here - you store four values as four different values. The area is and the summed data is . But we see the possibility of savings when we expand the matrix, for we in theory can compress 9 bytes down to 6 bytes:
And the data would be compressed like so: , , , , , . But this should be setting your bullshit sensors off - we can simply apply this to any arbitrary data any arbitrary number of times. Something is seriously wrong here…
The idea essentially was to implement a reversible bubble sort algorithm - each sort the algorithm does it leaves a symbol in place to indicate that a sort occurred. Eventually the idea is that all of the symbols would be in order, and then a table simply needs to store the frequencies of each character.
Essentially we had some string we wanted to sort, and a special symbol - where had some magic value that placed it in the middle of the data when fully sorted. If we use the string , , , we would initially check if . If so, we need to swap them. The string would then become , , , . You would then repeat this until the sorting is completed, and store the index of the last location to be sorted.
To decompress, you then reverse this operation, so , , , becomes , , , where the is removed after reversing the operation.
I implemented this and of course, it’s not perfect 2. The main issues with the algorithm are:
The original article has more details on the issues involved, but it clearly is not so great.
So, how can I proceed with more viable compression methods?
Of course this is a very decent compression algorithm, which is extremely fast and well supported. But it’s not particularly suitable for this application for multiple reasons:
I believe I will attempt to incorporate something like Huffman coding as it is very simple. Some characters in the binary are worth representing in smaller coding forms (such as a NULL byte or a jmp
instruction), whilst many characters either not represented or appear only once or twice in a single block (512 bytes).
Given that only a small subset of bytes can benefit from the coding process, it therefore makes sense to not build a full Huffman coding tree for all 256 symbols but instead only bother building a tree for the most represented data (say the first 16 or 32 bytes). After this, all remaining unrepresented bytes can be assumed to be encoded in order 3.
Why keep on working on this? Well, I still like the OS project of course - but also a compression algorithm that can work on arbitrarily small blocks of data without inter-lock dependency could also be extremely useful. One example is in the audio space, where Google are testing audio streams below 3kbps. I have also seen audio all the way down to 450bps. Compression that can then be applied on top of this could make a big difference.
One application I have been thinking about in the back of my mind is LoRa communications - at some point I really want to revisit my LoRa-chan project and get some form of audio streamed without absolutely demolishing the network.