Coffee Space

Cache Tool

Sometimes I have a good idea and then find out somebody else got there before me, but this time I actually found something seriously cool that has not been done before…

Background

I wrote the following on Hacker News (with added highlighting):

I have some 381 posts (including landing pages, etc). It takes maybe 45 minutes to build (there is no cache process). I have a faster version that skips audio processing, but it still takes minutes.

The problem is that it does several pandoc parses for each page:

Plain text - Input for audio parsing (which takes a significant amount of time, I have a version of the build that skips this).

Simple HTML - Input for RSS feed.

Full HTML - Static web page.

One change I would make to pandoc is to have the option for several outputs (as was discussed years ago [1]). I think generating multiple outputs is highly common usage.

One feature I will get around to writing (eventually) is automatic image compression. I had some success using jpegoptim and optipng on another project. This takes a while, so I want to build a command line tool that caches the output of a command if both the input file(s) and command do not change, and invalidate based on time. That way I could blindly do something like:
image="cat-pic.png"
cache-run -i $image -c "optipng -dir out/ -o 4 -strip all -fix $image"
(You might want to experiment with different optimization values based on resolution, etc).

[1] https://groups.google.com/g/pandoc-discuss/c/lex900rSpOM

The specific idea of a command-line cache could be pretty awesome in many contexts, and it would have saved my bacon on many occasions.

Crystallise

Essentially, this is a tool that caches the result of a command and invalidates it based on time, changes to the hash of the input file(s), and changes to the command line. The idea is to be able to cache long-running commands easily, without any extra tooling.
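To make the "input file(s) hash and command line" part concrete, here is a minimal sketch of how a cache key might be derived. The function names `file_digest` and `cache_key` and the choice of SHA-256 are assumptions for illustration, not decisions made in this post. Time-based invalidation is deliberately left out of the key, on the assumption that it would be handled by a timestamp stored next to the entry.

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_key(command: str, inputs: List[str]) -> str:
    """Combine the command line and the input file contents into one key.

    Hypothetical helper: any change to the command string, to the list of
    input names, or to the bytes of an input file yields a different key.
    """
    h = hashlib.sha256()
    h.update(command.encode("utf-8"))
    for name in inputs:
        h.update(b"\0")  # separator so field boundaries stay unambiguous
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(file_digest(Path(name)).encode("ascii"))
    return h.hexdigest()
```

Hashing the file contents rather than the modification time means that touching a file without changing it would still be a cache hit, at the cost of reading every input on each run.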

The usage would look something like this:

# Create a cache database with a timeout
cache create -t 30d -o cache # 30 day timeout, write cache.* files
# Run commands against a cache
cache run -d cache.db -i <file> -c <command> # One file, one command
cache run -d cache.db -i <file> -i <file> -c <command> # Two files, one command
cache clear -d cache.db # Clear the cache
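As a rough sketch, the command-line surface above could be parsed with Python's standard `argparse` module. The flags mirror the usage shown; the attribute names and the set of timeout suffixes beyond `d` are assumptions, since only `30d` appears in the example.

```python
import argparse

# Assumed unit suffixes; the usage above only shows "30d".
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeout(text: str) -> int:
    """Convert a string such as "30d" into a number of seconds."""
    number, unit = text[:-1], text[-1:]
    if unit not in UNITS or not number.isdigit():
        raise argparse.ArgumentTypeError(f"bad timeout: {text!r}")
    return int(number) * UNITS[unit]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the create / run / clear sub-commands."""
    parser = argparse.ArgumentParser(prog="cache")
    sub = parser.add_subparsers(dest="action", required=True)

    create = sub.add_parser("create", help="create a cache database")
    create.add_argument("-t", dest="timeout", type=parse_timeout, required=True)
    create.add_argument("-o", dest="output", required=True)

    run = sub.add_parser("run", help="run a command through the cache")
    run.add_argument("-d", dest="database", required=True)
    run.add_argument("-i", dest="inputs", action="append", default=[])
    run.add_argument("-c", dest="command", required=True)

    clear = sub.add_parser("clear", help="clear the cache")
    clear.add_argument("-d", dest="database", required=True)
    return parser
```

For example, `build_parser().parse_args(["run", "-d", "cache.db", "-i", "a.png", "-c", "optipng a.png"])` would collect the repeated `-i` flags into a list, which matches the "two files, one command" form.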

Some things to think about:

Not specifying the database each time - Perhaps we could read this from an environment variable if it exists.
Small command name - A long name is just cruft that takes up space on the command line.
Piping - Ideally we want to be able to handle pipes as many processing scripts will make use of these.
Output - We need to handle stdout, stderr and any file(s) the command touches. The best way to handle files is probably to do nothing, that is, to leave any existing outputs as-is. The stdout and stderr streams will likely also need to be replayed in the order in which they were generated.
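The output point can be sketched as follows: run the command, record each line together with the stream it arrived on, and later replay the recording to the matching streams. The names `capture` and `replay` are hypothetical. Note the limitation: with two separate pipes, the relative order of stdout and stderr lines is only best effort. An exact order would need both streams merged into a single pipe, which loses the information about which stream a line came from.

```python
import subprocess
import sys
import threading
from typing import List, Tuple

Chunk = Tuple[str, bytes]  # (stream name, raw line)


def capture(command: str) -> Tuple[int, List[Chunk]]:
    """Run a shell command, recording output lines in arrival order."""
    chunks: List[Chunk] = []
    lock = threading.Lock()
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def pump(name, pipe):
        # Read line by line; ordering across the two pipes is best effort.
        for line in iter(pipe.readline, b""):
            with lock:
                chunks.append((name, line))
        pipe.close()

    threads = [threading.Thread(target=pump, args=("stdout", proc.stdout)),
               threading.Thread(target=pump, args=("stderr", proc.stderr))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return proc.wait(), chunks


def replay(chunks: List[Chunk], out=None, err=None) -> None:
    """Write recorded lines back to the streams they originally came from."""
    out = sys.stdout.buffer if out is None else out
    err = sys.stderr.buffer if err is None else err
    for name, data in chunks:
        target = out if name == "stdout" else err
        target.write(data)
        target.flush()
```

On a cache hit, the tool would skip `capture` entirely and call `replay` on the stored chunks, then exit with the stored return code.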

Requirements

Out of this we have a few requirements:

Reliability - It is no good having a caching system you cannot rely on. We should use a well-known, well-trusted database for caching.
Ease of use - If the tool is too complex, people will not want to use it in their scripts, etc.
Performance - If it is slower than the original process, why did we bother using it?
Configurability - Caches are complicated. People may want to experiment with caching based on size, type, processing time, creation date, modified date, or any number of things. The cache policy should therefore be highly configurable.
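One way to picture a configurable policy is as a set of small predicates that are combined into a single decision. This is only a sketch of the idea: `RunInfo`, its fields and the helper names are all hypothetical, and a real tool might express the same thing in an embedded scripting language instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunInfo:
    """Facts about one cached run (all field names are hypothetical)."""
    duration_s: float   # how long the original command took
    output_bytes: int   # size of the captured output
    age_s: float        # seconds since the entry was stored


Policy = Callable[[RunInfo], bool]


def min_duration(seconds: float) -> Policy:
    """Accept only runs slow enough to be worth caching."""
    return lambda info: info.duration_s >= seconds


def max_size(limit: int) -> Policy:
    """Accept only runs whose captured output fits within a size limit."""
    return lambda info: info.output_bytes <= limit


def max_age(seconds: float) -> Policy:
    """Accept only entries that have not yet timed out."""
    return lambda info: info.age_s <= seconds


def all_of(*policies: Policy) -> Policy:
    """An entry is usable only if every policy accepts it."""
    return lambda info: all(p(info) for p in policies)
```

For example, `all_of(min_duration(1.0), max_age(30 * 86400))` would describe "cache anything that took at least a second, for up to 30 days".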

Design

I am thinking of using the following:

Database - I think I should be using sqlite, but the temptation is there to use a key-value store instead. The reason is simply speed: we can convert the input into a hash and very quickly look up the required output. I would like to show off u-database, but I am not yet sure it is mature enough. For example, I am not sure it handles binary data very well just yet.
Language - Part of me says to use Python for its ease of use, but on the other hand it may struggle with start-up time and add significantly to RAM usage. In that case C/C++ could serve a lot better here.
Scripting - If the language itself does not lend itself to scripting, we should add a small (reasonably good) scripting engine to handle things like policies. I am thinking of using the Elk JS engine again, as I liked it. It is not the fastest, but it is memory bounded and integrates well into C/C++.
Configuration - Of course it will be JSON, and I would like to enforce (at least in the codebase) the idea that everything is a string.
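To illustrate the sqlite option together with the "everything is a string" configuration, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `timeout_seconds` key and the function names are assumptions made for the example. The output is stored as a BLOB so binary data survives the round trip.

```python
import json
import sqlite3
import time
from typing import Dict, Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) the cache database with a single entries table."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS entries ("
               "key TEXT PRIMARY KEY, created REAL, output BLOB)")
    return db


def put(db: sqlite3.Connection, key: str, output: bytes,
        now: Optional[float] = None) -> None:
    """Store the output for a key, replacing any older entry."""
    now = time.time() if now is None else now
    db.execute("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)",
               (key, now, output))
    db.commit()


def get(db: sqlite3.Connection, key: str, config: Dict[str, str],
        now: Optional[float] = None) -> Optional[bytes]:
    """Return cached bytes, or None if missing or older than the timeout."""
    now = time.time() if now is None else now
    row = db.execute("SELECT created, output FROM entries WHERE key = ?",
                     (key,)).fetchone()
    # Config values are strings, so convert at the point of use.
    if row is None or now - row[0] > float(config["timeout_seconds"]):
        return None
    return row[1]


# Hypothetical configuration: 30 days, stored as a string.
config = json.loads('{"timeout_seconds": "2592000"}')
```

Keeping every configuration value as a string pushes the conversion to the point of use, which is a little noisier but means the loader never has to guess types.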

Next Steps

I may try to prototype something and see how well it goes. Something like this may not be easy to develop, though. Whilst the idea seems simple, it needs to be done well and there is zero room for error here.

For example, when some programs parse the output of a command, rather than keeping stdout and stderr separate, they may feed it all into a single pipe. That’s great and all, but their scripts may then expect the lines to arrive in a certain order too.

The scripting part may also take some time to make nice and fast enough. It should also be done in such a way that it makes sense for other people to come in and hack their own policies without needing to ask questions.

I will keep this updated if/when I make further progress.