Coffee Space – Coffee Space

Bit Reverse

TL;DR

Look up tables are speedy (if you have the resources and are doing repetitions), magic performs well on the right CPU and avoiding branching is usually very good.

Introduction

As all good stories start, a problem emerged: I needed the ability to take a byte and to reverse the bits. The purpose was the store a bitmap image in RAM as efficiently as possible (the target device has 64kB available and it wouldn’t be possible to store a full screen buffer in there with full colour depth). A simple enough problem to solve - or so I thought.

Initially I thought there were op-codes rolling in the back of my mind that should do this sort of a thing. Checking Wikipedia, I found ROL and ROR. Nope, they don’t reverse bits. Looking through the instruction set, I realized I would need to do this the “hard way”.

Existing Solutions

I had a quick look on Stack Overflow for some potential solutions as was a little disappointed - I thought I could have a good crack at the problem. I’ll go though some of the solutions here:

LUT

e.James wrote [accepted]:

If you are talking about a single byte, a table-lookup is probably the best bet, unless for some reason you don’t have 256 bytes available.

I am a big fan of LUTs for computation, but these can often be more costly than just computing the target value.

Dot Matrix

R1S8K wrote:

This one helped me with 8x8 dot matrix set of arrays.

0001 uint8_t mirror_bits(uint8_t var)
0002 {
0003     if ((var & 0x81) && (var != 0x81))var ^= 0x81;
0004     if ((var & 0x42) && (var != 0x42))var ^= 0x42;
0005     if ((var & 0x24) && (var != 0x24))var ^= 0x24;
0006     if ((var & 0x18) && (var != 0x18))var ^= 0x18;
0007     return var;
0008 }

This is quite a neat little solution, but the cost will be quite high for those if-statements. You’ll get killed on that branching.

Magic

mascIT wrote:

Assuming that your compiler allows unsigned long long:

0009 unsigned char reverse(unsigned char b) {
0010   return (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
0011 }

Literal magic. We would pay for that multiplication, but the cost is worth it. The only problem is portability, the size of unsigned long long changes depending on the compiler and CPU.

Definition

Bob Stein wrote:

For the very limited case of constant, 8-bit input, this method costs no memory or CPU at run-time:

0012 #define MSB2LSB(b) (((b)&1?128:0)|((b)&2?64:0)|((b)&4?32:0)|((b)&8?16:0)|((b)&16?8:0)|((b)&32?4:0)|((b)&64?2:0)|((b)&128?1:0))

I used this for ARINC-429 where the bit order (endianness) of the label is opposite the rest of the word. The label is often a constant, and conventionally in octal.

Here’s how I used it to define a constant, because the spec defines this label as big-endian 205 octal.

I like the ease of use, but just like the Dot Matrix solution, this will kill your branching. One thing that could be nice is if the compiler pre-computed the result of a constant, meaning you don’t pay at runtime. That said, if you’re running at runtime, this is awful.

Loop & Shift

chqrlie wrote;

Here is a simple and readable solution, portable to all conformant platforms, including those with sizeof(char) == sizeof(int):

0013 #include <limits.h>
0014 
0015 unsigned char reverse(unsigned char c) {
0016     int shift;
0017     unsigned char result = 0;
0018 
0019     for (shift = 0; shift < CHAR_BIT; shift++) {
0020         result <<= 1;
0021         result |= c & 1;
0022         c >>= 1;
0023     }
0024     return result;
0025 }

It looks a bit better, but one major problem is that it sits in a loop. It’s nice and cross-platform, but that’s an if-statement per check. You’re also shifting twice.

New Solutions

I asked a friend to come up with solution to the problem without first seeing my solution, to see how they would solve the problem. They are relatively new to C/C++, so I wasn’t expecting the best performing - but I was hoping for some out of the box thinking.

Friend

Their solution was as follows:

0026 unsigned char reverse(unsigned char b){
0027   unsigned char r = 0;
0028   for(int i = 0; b != 0; i++){
0029     r |= (b % 2) << (7 - i);
0030     b = b / 2;
0031   }
0032   return r;
0033 }

An interesting approach, again with the looping. I like the idea of getting the last bit with the b % 2, this was something I also considered early on.

A missed opportunity with the x / 2 == x >> 2, with the divide operation being quite expensive. The compiler would likely pick up on such simple optimizations.

Mine

So now for my solution:

0034 unsigned char reverse(unsigned char b){
0035   return ((b             ) << 7) |
0036          ((b & 0b00000010) << 5) |
0037          ((b & 0b00000100) << 3) |
0038          ((b & 0b00001000) << 1) |
0039          ((b & 0b00010000) >> 1) |
0040          ((b & 0b00100000) >> 3) |
0041          ((b & 0b01000000) >> 5) |
0042          ((b             ) >> 7);
0043 }

There were several goal with this implementation:

Avoid loops at all costs - they tend to kill branching.
Avoid if-statements at all costs - they too tend to kill branching.
Avoid the assignment of temporary variables - try to force everything to stay in registers for the extra speed.

The approach is to effectively take the mask of the bit we are probing and then bit shift it into the opposite position. We then as this to the other operations with the | (bitwise-OR) operator.

I like this solution because it should theoretically be fast, it finished in constant time, it should be extremely cross-platform and I wrote it (don’t @ me).

Performance

So now for the proof in the pudding!

Method	Timing (seconds)
LUT	12
Dot Matrix	Failed to finish
Magic	14
Definition	16
Loop & Shift	57
Friends	57
Mine	15

As expected! What wasn’t expected was the Dot Matrix solution doesn’t actually work. Oof. Sucks if anybody decided to use that code in production. My take-home points are:

Branching kills performance - stop doing it in tight loops if you can avoid it!
LUTs are awesome if you have the space - especially if you can get a small LUT in the CPU cache!
Always test, even the simple functions - as you can see, they aren’t always as good as they suggest.

Which one would I use? It would still be my implementation. Although the LUT performs quite well, in low-cache, low-RAM or slow memory situations is really doesn’t pay off. Also for one-off calculations it may not even be ready in RAM yet, so you have to pay the cost of copying it over (+).

(+) This was also the take-away for another project using a LUT for converting from RGB to HSV. Whilst on my desktop CPU it was fast, on the Raspberry Pi it was actually faster to do the calculation itself than to look it up in a table.