Look up tables are speedy (if you have the resources and are doing repetitions), magic performs well on the right CPU and avoiding branching is usually very good.
As all good stories start, a problem emerged: I needed the ability to take a byte and to reverse the bits. The purpose was the store a bitmap image in RAM as efficiently as possible (the target device has 64kB available and it wouldn’t be possible to store a full screen buffer in there with full colour depth). A simple enough problem to solve - or so I thought.
Initially I thought there were op-codes rolling in the back of my mind that should do this sort of a thing. Checking Wikipedia, I found ROL
and ROR
. Nope, they don’t reverse bits. Looking through the instruction set, I realized I would need to do this the “hard way”.
I had a quick look on Stack Overflow for some potential solutions as was a little disappointed - I thought I could have a good crack at the problem. I’ll go though some of the solutions here:
e.James wrote [accepted]:
If you are talking about a single byte, a table-lookup is probably the best bet, unless for some reason you don’t have 256 bytes available.
I am a big fan of LUTs for computation, but these can often be more costly than just computing the target value.
R1S8K wrote:
This one helped me with 8x8 dot matrix set of arrays.
0001 uint8_t mirror_bits(uint8_t var) 0002 { 0003 if ((var & 0x81) && (var != 0x81))var ^= 0x81; 0004 if ((var & 0x42) && (var != 0x42))var ^= 0x42; 0005 if ((var & 0x24) && (var != 0x24))var ^= 0x24; 0006 if ((var & 0x18) && (var != 0x18))var ^= 0x18; 0007 return var; 0008 }
This is quite a neat little solution, but the cost will be quite high for those if-statements. You’ll get killed on that branching.
mascIT wrote:
Assuming that your compiler allows unsigned long long:
0009 unsigned char reverse(unsigned char b) { 0010 return (b * 0x0202020202ULL & 0x010884422010ULL) % 1023; 0011 }
Literal magic. We would pay for that multiplication, but the cost is worth it. The only problem is portability, the size of unsigned long long
changes depending on the compiler and CPU.
Bob Stein wrote:
For the very limited case of constant, 8-bit input, this method costs no memory or CPU at run-time:
0012 #define MSB2LSB(b) (((b)&1?128:0)|((b)&2?64:0)|((b)&4?32:0)|((b)&8?16:0)|((b)&16?8:0)|((b)&32?4:0)|((b)&64?2:0)|((b)&128?1:0))
I used this for ARINC-429 where the bit order (endianness) of the label is opposite the rest of the word. The label is often a constant, and conventionally in octal.
Here’s how I used it to define a constant, because the spec defines this label as big-endian 205 octal.
I like the ease of use, but just like the Dot Matrix solution, this will kill your branching. One thing that could be nice is if the compiler pre-computed the result of a constant, meaning you don’t pay at runtime. That said, if you’re running at runtime, this is awful.
chqrlie wrote;
Here is a simple and readable solution, portable to all conformant platforms, including those with
sizeof(char) == sizeof(int)
:
0013 #include <limits.h> 0014 0015 unsigned char reverse(unsigned char c) { 0016 int shift; 0017 unsigned char result = 0; 0018 0019 for (shift = 0; shift < CHAR_BIT; shift++) { 0020 result <<= 1; 0021 result |= c & 1; 0022 c >>= 1; 0023 } 0024 return result; 0025 }
It looks a bit better, but one major problem is that it sits in a loop. It’s nice and cross-platform, but that’s an if-statement per check. You’re also shifting twice.
I asked a friend to come up with solution to the problem without first seeing my solution, to see how they would solve the problem. They are relatively new to C/C++, so I wasn’t expecting the best performing - but I was hoping for some out of the box thinking.
Their solution was as follows:
0026 unsigned char reverse(unsigned char b){ 0027 unsigned char r = 0; 0028 for(int i = 0; b != 0; i++){ 0029 r |= (b % 2) << (7 - i); 0030 b = b / 2; 0031 } 0032 return r; 0033 }
An interesting approach, again with the looping. I like the idea of getting the last bit with the b % 2
, this was something I also considered early on.
A missed opportunity with the x / 2 == x >> 2
, with the divide operation being quite expensive. The compiler would likely pick up on such simple optimizations.
So now for my solution:
0034 unsigned char reverse(unsigned char b){ 0035 return ((b ) << 7) | 0036 ((b & 0b00000010) << 5) | 0037 ((b & 0b00000100) << 3) | 0038 ((b & 0b00001000) << 1) | 0039 ((b & 0b00010000) >> 1) | 0040 ((b & 0b00100000) >> 3) | 0041 ((b & 0b01000000) >> 5) | 0042 ((b ) >> 7); 0043 }
There were several goal with this implementation:
The approach is to effectively take the mask of the bit we are probing and then bit shift it into the opposite position. We then as this to the other operations with the |
(bitwise-OR) operator.
I like this solution because it should theoretically be fast, it finished in constant time, it should be extremely cross-platform and I wrote it (don’t @ me).
So now for the proof in the pudding!
Method | Timing (seconds) |
---|---|
LUT | 12 |
Dot Matrix | Failed to finish |
Magic | 14 |
Definition | 16 |
Loop & Shift | 57 |
Friends | 57 |
Mine | 15 |
As expected! What wasn’t expected was the Dot Matrix solution doesn’t actually work. Oof. Sucks if anybody decided to use that code in production. My take-home points are:
Which one would I use? It would still be my implementation. Although the LUT performs quite well, in low-cache, low-RAM or slow memory situations is really doesn’t pay off. Also for one-off calculations it may not even be ready in RAM yet, so you have to pay the cost of copying it over (+).
(+) This was also the take-away for another project using a LUT for converting from RGB to HSV. Whilst on my desktop CPU it was fast, on the Raspberry Pi it was actually faster to do the calculation itself than to look it up in a table.