Coffee Space


Cache Tool

Sometimes I have good ideas and find out somebody else got there before me, but this time I actually found something that is seriously cool that has not been done before…

Background

I wrote the following on HackerNews (with added highlighting):

I have some 381 posts (including landing pages, etc). It takes maybe 45 minutes to build (there is no cache process). I have a faster version that skips audio processing, but it still takes minutes.

The problem is that it does several pandoc parses for each page:

  1. Plain text - Input for audio parsing (which takes a significant amount of time, I have a version of the build that skips this).
  2. Simple HTML - Input for RSS feed.
  3. Full HTML - Static web page.

One change I would make to pandoc is to have the option for several outputs (as was discussed years ago [1]). I think generating multiple outputs is highly common usage.

One feature I will get around to writing (eventually) is automatic image compression. I had some success using jpegoptim and optipng on another project. This takes a while, so I want to build a command line tool that caches the output of a command if both the input file(s) and command do not change, and invalidate based on time. That way I could blindly do something like:

image="cat-pic.png"
cache-run -i $image -c "optipng -dir out/ -o 4 -strip all -fix $image"

(You might want to experiment with different optimization values based on resolution, etc).

[1] https://groups.google.com/g/pandoc-discuss/c/lex900rSpOM

The idea specifically of a command line based cache could be pretty awesome in many contexts and would have saved my bacon in many scenarios.

Crystallise

Essentially, the idea is a tool that caches the result of a command, keyed on the hash of the input file(s) and on the command line, and invalidated after a set time. The goal is to be able to cache long-running commands easily, without any extra tooling.

The usage would look something like this:

# Create a cache database with a timeout
cache create -t 30d -o cache # 30 day timeout, write cache.* files
# Run commands against a cache
cache run -d cache.db -i <file> -c <command> # One file, one command
cache run -d cache.db -i <file> -i <file> -c <command> # Two files, one command
cache clear -d cache.db # Clear the cache
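
To make the idea concrete, here is a rough shell sketch of the core lookup. It is not the eventual tool: the function names `cache_key` and `cache_run`, the directory-of-files storage (rather than a real database) and the argument order are all assumptions made for illustration. It also assumes `sha256sum` from GNU coreutils, and it only caches what the command writes to stdout, not files the command creates.

```shell
#!/usr/bin/env bash
# Sketch only: cache_key and cache_run are hypothetical names, not an existing tool.
# Assumes GNU coreutils' sha256sum is available.

# Print a key derived from the command string and the contents of each input file.
cache_key() {
  local cmd="$1" f
  shift
  for f in "$@"; do
    [ -r "$f" ] || { echo "cache_key: cannot read $f" >&2; return 1; }
  done
  {
    printf '%s\n' "$cmd"
    for f in "$@"; do
      sha256sum < "$f"
    done
  } | sha256sum | cut -d ' ' -f 1
}

# Usage: cache_run <cache-dir> <max-age-seconds> <command> [input-file...]
# Replays stored stdout when the key matches and the entry is young enough.
cache_run() {
  local dir="$1" max_age="$2" cmd="$3"
  shift 3
  local key entry now stamp status
  key="$(cache_key "$cmd" "$@")" || return 1
  entry="$dir/$key"
  mkdir -p "$dir" || return 1

  if [ -f "$entry.out" ] && [ -f "$entry.time" ]; then
    now="$(date +%s)"
    read -r stamp < "$entry.time"
    if [ $(( now - stamp )) -lt "$max_age" ]; then
      cat "$entry.out"   # cache hit: the command is not run again
      return 0
    fi
  fi

  # Miss or expired entry: run the command and keep its stdout only on success.
  if sh -c "$cmd" > "$entry.tmp"; then
    mv "$entry.tmp" "$entry.out"
    date +%s > "$entry.time"
    cat "$entry.out"
  else
    status=$?
    rm -f "$entry.tmp"
    return "$status"
  fi
}
```

Calling `cache_run .cache 2592000 "wc -l notes.txt" notes.txt` twice would run `wc` once and replay the stored output the second time, until either `notes.txt` changes, the command string changes, or 30 days pass.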

Some things to think about:

Requirements

Out of this we have a few requirements:

  1. Reliability - It’s no good having a caching system you cannot rely on. We should use a well-known, well-trusted database for caching.
  2. Ease of use - If the tool is too complex, people will not want to use it in their scripts, etc.
  3. Performance - If it is slower than the original process, why did we bother using it?
  4. Configurability - Caches are complicated. People may want to experiment with caching based on size, type, processing time, creation date, modified date, or any number of things. The cache policy should therefore be highly configurable.

Design

I am thinking of using the following:

Next Steps

I may try to prototype something and see how well it goes. Something like this may not be easy to develop, though. Whilst the idea seems simple, it needs to be done well and there is zero room for error here.

For example, some scripts that parse a program's output do not read stdout and stderr separately, but feed both into a single pipe. That’s great and all, but those scripts may then expect the combined output to arrive in a certain order too, so a cache would have to replay it in that same order.
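
One reading of the ordering concern above is that a cache which stores the two output streams separately cannot replay them in their original interleaving. A small sketch of the difference follows; `noisy`, `capture_combined` and `capture_split` are made-up names for illustration, and only the redirections themselves are standard shell.

```shell
#!/usr/bin/env bash
# Sketch only: noisy, capture_combined and capture_split are hypothetical names.

# Stand-in for a program that writes to both streams in a known order.
noisy() {
  echo "step 1"
  echo "warning: something odd" >&2
  echo "step 2"
}

# Merge stderr into stdout so a single stream keeps the order of the writes.
capture_combined() {
  "$@" 2>&1
}

# Store the streams in separate files; their relative order is no longer recorded.
capture_split() {
  local out_file="$1" err_file="$2"
  shift 2
  "$@" > "$out_file" 2> "$err_file"
}
```

With the split capture there is no way to tell afterwards whether the warning came before or after "step 2". Even the merged capture is only a best effort: programs that buffer stdout when it is not a terminal may interleave differently from what a user sees interactively.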

The scripting part may also take some time to make nice and fast enough. It should also be done in such a way that it makes sense for other people to come in and hack their own policies without needing to ask questions.

I will keep this updated if/when I make further updates.