Coffee Space



Bit Reverse


TL;DR

Lookup tables are speedy (if you have the memory to spare and are reversing many bytes), magic performs well on the right CPU, and avoiding branching is usually very good.

Introduction

As all good stories start, a problem emerged: I needed the ability to take a byte and reverse its bits. The purpose was to store a bitmap image in RAM as efficiently as possible (the target device has 64kB available and it wouldn't be possible to store a full screen buffer in there with full colour depth). A simple enough problem to solve - or so I thought.

Initially I thought there were op-codes rolling in the back of my mind that should do this sort of thing. Checking Wikipedia, I found ROL and ROR. Nope, they rotate bits rather than reverse them. Looking through the instruction set, I realized I would need to do this the "hard way".

Existing Solutions

I had a quick look on Stack Overflow for some potential solutions and was a little disappointed - I thought I could have a good crack at the problem. I'll go through some of the solutions here:

LUT

e.James wrote [accepted]:

If you are talking about a single byte, a table-lookup is probably the best bet, unless for some reason you don't have 256 bytes available.

I am a big fan of LUTs for computation, but these can often be more costly than just computing the target value.
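The quoted answer doesn't include code, so here is a minimal sketch of what a 256-byte table could look like. The names reverse_table, reverse_table_init and reverse_lut are my own, and filling the table at start-up (rather than baking it into the binary as a constant array) is just one assumption about how you might build it.

```c
#include <stdint.h>

/* 256-entry lookup table: reverse_table[b] holds b with its bits mirrored. */
static uint8_t reverse_table[256];

/* Fill the table once at start-up using a plain shift loop. */
static void reverse_table_init(void) {
    for (unsigned value = 0; value < 256; value++) {
        uint8_t in = (uint8_t)value;
        uint8_t out = 0;
        for (int bit = 0; bit < 8; bit++) {
            out = (uint8_t)((out << 1) | (in & 1u));
            in >>= 1;
        }
        reverse_table[value] = out;
    }
}

/* After initialisation, each reversal is a single indexed load. */
static uint8_t reverse_lut(uint8_t b) {
    return reverse_table[b];
}
```

The trade-off is visible here: the lookup itself is trivial, but you pay 256 bytes of RAM plus the one-off fill (or the equivalent in flash for a constant table).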

Dot Matrix

R1S8K wrote:

This one helped me with 8x8 dot matrix set of arrays.

uint8_t mirror_bits(uint8_t var)
{
    if ((var & 0x81) && (var != 0x81))var ^= 0x81;
    if ((var & 0x42) && (var != 0x42))var ^= 0x42;
    if ((var & 0x24) && (var != 0x24))var ^= 0x24;
    if ((var & 0x18) && (var != 0x18))var ^= 0x18;
    return var;
}

This is quite a neat little solution, but the cost will be quite high for those if-statements. You'll get killed on that branching.
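The snippet appears to be aiming at swapping each mirrored pair of bits only when the two bits differ. As quoted, it compares the whole byte against the mask, so an input such as 0x83 (both outer bits set, but not equal to 0x81) gets its outer bits flipped anyway. Below is a sketch of what I assume the intent to be, comparing only the masked bits; mirror_bits_fixed is a hypothetical name and this is not the original author's code.

```c
#include <stdint.h>

/* Swap each mirrored bit pair only when exactly one bit of the pair is set.
   Assumption: this is the intent of the quoted snippet. The comparison is
   made against the masked bits rather than the whole byte. */
uint8_t mirror_bits_fixed(uint8_t var) {
    /* XOR with the pair mask flips both bits, which swaps them when they differ. */
    if ((var & 0x81) && ((var & 0x81) != 0x81)) var ^= 0x81;
    if ((var & 0x42) && ((var & 0x42) != 0x42)) var ^= 0x42;
    if ((var & 0x24) && ((var & 0x24) != 0x24)) var ^= 0x24;
    if ((var & 0x18) && ((var & 0x18) != 0x18)) var ^= 0x18;
    return var;
}
```

It still carries four conditionals, so the branching concern stands even with the comparison changed.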

Magic

mascIT wrote:

Assuming that your compiler allows unsigned long long:

unsigned char reverse(unsigned char b) {
  return (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
}

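Literal magic. We pay for that multiplication (and the modulo), but the cost is worth it. The main problem is portability: unsigned long long is guaranteed to be at least 64 bits wide, but on small 8-bit or 16-bit targets the 64-bit multiply and modulo typically have to be emulated in software, so the cost depends heavily on the compiler and CPU.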
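To make the trick a little less magical, here is a sketch of the same expression with the width pinned using uint64_t from <stdint.h> and the three steps commented. The function name reverse_magic is my own, and the use of uint64_t assumes a C99 compiler that provides the exact-width types.

```c
#include <stdint.h>

/* Same multiply/mask/modulo trick with the 64-bit width made explicit. */
uint8_t reverse_magic(uint8_t b) {
    /* Multiply: lays five copies of b side by side, 8 bits apart
       (shifted left by 1, 9, 17, 25 and 33). */
    uint64_t spread = (uint64_t)b * UINT64_C(0x0202020202);

    /* Mask: keeps eight bits in total, each source bit of b exactly once,
       at positions whose remainder mod 10 is the mirrored bit index. */
    uint64_t picked = spread & UINT64_C(0x010884422010);

    /* Modulo 1023 (2^10 - 1) adds up the 10-bit groups, which drops each
       kept bit into its mirrored position in the low byte. */
    return (uint8_t)(picked % 1023);
}
```

For example, bit 0 of b ends up at position 17 after the multiply and mask, and 17 mod 10 is 7, so it lands in bit 7 of the result.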

Definition

Bob Stein wrote:

For the very limited case of constant, 8-bit input, this method costs no memory or CPU at run-time:

#define MSB2LSB(b) (((b)&1?128:0)|((b)&2?64:0)|((b)&4?32:0)|((b)&8?16:0)|((b)&16?8:0)|((b)&32?4:0)|((b)&64?2:0)|((b)&128?1:0))

I used this for ARINC-429 where the bit order (endianness) of the label is opposite the rest of the word. The label is often a constant, and conventionally in octal.

Here's how I used it to define a constant, because the spec defines this label as big-endian 205 octal.

I like the ease of use, but just like the Dot Matrix solution, every one of those ternaries is a potential branch. The nice part is that when the argument is a constant the compiler can pre-compute the result, meaning you don't pay at runtime. That said, if the input is only known at runtime, this is awful.
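To illustrate the constant case, here is a small sketch of my own (not the example from the quoted answer). EXAMPLE_LABEL is a hypothetical name, the octal value 205 is borrowed from the answer's description, and _Static_assert assumes a C11 compiler.

```c
#define MSB2LSB(b) (((b)&1?128:0)|((b)&2?64:0)|((b)&4?32:0)|((b)&8?16:0)|((b)&16?8:0)|((b)&32?4:0)|((b)&64?2:0)|((b)&128?1:0))

/* EXAMPLE_LABEL is a hypothetical name, not taken from the quoted answer.
   With a literal argument the macro is an integer constant expression,
   so it can initialise an enum constant and is evaluated at compile time. */
enum { EXAMPLE_LABEL = MSB2LSB(0205) };  /* octal 205 = 0x85, mirrored = 0xA1 */

/* C11 compile-time check: the build fails if the value is not 0xA1. */
_Static_assert(EXAMPLE_LABEL == 0xA1, "octal 205 should mirror to 0xA1");
```

Because an enum constant must be a constant expression, this only compiles if the whole macro is folded by the compiler, which is exactly the "no cost at run-time" case.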

Loop & Shift

chqrlie wrote:

Here is a simple and readable solution, portable to all conformant platforms, including those with sizeof(char) == sizeof(int):

#include <limits.h>

unsigned char reverse(unsigned char c) {
    int shift;
    unsigned char result = 0;

    for (shift = 0; shift < CHAR_BIT; shift++) {
        result <<= 1;
        result |= c & 1;
        c >>= 1;
    }
    return result;
}

It looks a bit better, but one major problem is that it sits in a loop. It's nice and cross-platform, but that's a conditional branch per iteration (unless the compiler unrolls the loop). You're also shifting twice per bit.

New Solutions

I asked a friend to come up with a solution to the problem without first seeing my solution, to see how they would solve the problem. They are relatively new to C/C++, so I wasn't expecting the best performing - but I was hoping for some out of the box thinking.

Friend

Their solution was as follows:

unsigned char reverse(unsigned char b){
  unsigned char r = 0;
  for(int i = 0; b != 0; i++){
    r |= (b % 2) << (7 - i);
    b = b / 2;
  }
  return r;
}

An interesting approach, again with the looping. I like the idea of getting the last bit with the b % 2, this was something I also considered early on.

A missed opportunity with x / 2 == x >> 1 (and x % 2 == x & 1), as the divide operation can be quite expensive. That said, the compiler would likely pick up on such simple optimizations.
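For comparison, this is a sketch of the same loop with the arithmetic written as bit operations. The name reverse_shift is my own, and I'd expect most optimising compilers to emit the same code for both forms when the operand is unsigned.

```c
/* Friend's loop with the arithmetic written as bit operations.
   For unsigned values, b % 2 equals b & 1 and b / 2 equals b >> 1. */
unsigned char reverse_shift(unsigned char b) {
    unsigned char r = 0;
    for (int i = 0; b != 0; i++) {
        r |= (unsigned char)((b & 1u) << (7 - i));  /* lowest bit to mirrored slot */
        b >>= 1;                                    /* move on to the next bit */
    }
    return r;
}
```

Note that the loop exits as soon as b reaches zero, so the run time depends on the position of the highest set bit.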

Mine

So now for my solution:

unsigned char reverse(unsigned char b){
  return ((b             ) << 7) |
         ((b & 0b00000010) << 5) |
         ((b & 0b00000100) << 3) |
         ((b & 0b00001000) << 1) |
         ((b & 0b00010000) >> 1) |
         ((b & 0b00100000) >> 3) |
         ((b & 0b01000000) >> 5) |
         ((b             ) >> 7);
}

There were several goals with this implementation:

  • Avoid loops at all costs - they tend to kill branching.
  • Avoid if-statements at all costs - they too tend to kill branching.
  • Avoid the assignment of temporary variables - try to force everything to stay in registers for the extra speed.

The approach is to effectively take the mask of the bit we are probing and then bit shift it into the opposite position. We then combine this with the results of the other operations using the | (bitwise-OR) operator.

I like this solution because it should theoretically be fast, it finishes in constant time, it should be extremely cross-platform and I wrote it (don't @ me).
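One caveat on the cross-platform claim: the 0b binary literals are a compiler extension in C before C23, and the unmasked b << 7 relies on the return conversion truncating the result to 8 bits. A sketch of a more conservative variant, assuming hex masks and an explicit mask on every term (reverse_masks is a hypothetical name):

```c
#include <stdint.h>

/* Mask-and-shift reversal using hex masks instead of 0b literals.
   Every term is masked, so nothing depends on truncation at return. */
uint8_t reverse_masks(uint8_t b) {
    return (uint8_t)(((b & 0x01u) << 7) |
                     ((b & 0x02u) << 5) |
                     ((b & 0x04u) << 3) |
                     ((b & 0x08u) << 1) |
                     ((b & 0x10u) >> 1) |
                     ((b & 0x20u) >> 3) |
                     ((b & 0x40u) >> 5) |
                     ((b & 0x80u) >> 7));
}
```

It costs two extra AND operations compared to the version above, which a compiler may well optimise away.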

Performance

So now for the proof in the pudding!

Method | Timing (seconds)
LUT | 12
Dot Matrix | Failed to finish
Magic | 14
Definition | 16
Loop & Shift | 57
Friend | 57
Mine | 15
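For anybody wanting to reproduce this kind of comparison, here is a rough sketch of a timing harness. It is not the harness behind the numbers above: time_reverse, reverse_fn and reverse_loop are hypothetical names, and calling through a function pointer prevents inlining, which will skew results compared to calling each function directly.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical harness: any byte-reversal function can be plugged in. */
typedef uint8_t (*reverse_fn)(uint8_t);

/* Reference implementation to time, a plain shift loop. */
uint8_t reverse_loop(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* Returns the CPU time in seconds spent on `iterations` calls of fn. */
double time_reverse(reverse_fn fn, unsigned long iterations) {
    volatile uint8_t sink = 0;  /* volatile keeps the calls from being optimised away */
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        sink ^= fn((uint8_t)i);
    }
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as time_reverse(reverse_loop, 1000000000UL) would then give a figure in seconds for that one method; the iteration count is an arbitrary choice.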

As expected! What wasn't expected was that the Dot Matrix solution doesn't actually work. Oof. Sucks if anybody decided to use that code in production. My take-home points are:

Which one would I use? It would still be my implementation. Although the LUT performs quite well, in low-cache, low-RAM or slow-memory situations it really doesn't pay off. Also, for one-off calculations it may not even be ready in RAM yet, so you have to pay the cost of copying it over (+).

(+) This was also the take-away for another project using a LUT for converting from RGB to HSV. Whilst on my desktop CPU it was fast, on the Raspberry Pi it was actually faster to do the calculation itself than to look it up in a table.