Coffee Space – Coffee Space

Hash Functions

In a previous post where I released uDatabase, I discussed the important of good hashing and how it really helped with the speed of my database library.

But that go me thinking: How do I really know what I have done is better?

My use-case is somewhat unique compared to other example hash functions, as I need to be able to do the following:

Convert a char* arbitrary length to a char* of fixed length.
The first two bytes should be very high quality hashing.
The remaining bytes are only required to be ‘unique’, but quality does not so much matter. The remainder of the hash must be as fast as possible.

Use looks something like so:

0001 char* key = "some value";
0002 char hash[16];
0003 func(key, hash, 16);
0004 uint16_t shortHash = *(uint16_t*)hash;
0005 /* Do something with values */

Why? The first two bytes are used as a shorter hash for a look-up table to reduce searching, whereas the rest of the hash is simply to confirm that a given key is unique if the initial check passes. This greatly reduces computation. Bare in mind, this hash function may be run millions of times a second!

Inspiration

Whilst searching on the wild world web I found this article by Andrei Ciobanu on Implementing Hash Tables in C. He kindly offers his source code which was an inspiration for my testing.

Unfortunately I was unable to use his code or testing, as he was interested in handling hash functions that return uint32_t and I am interested in char*, as well as wanting to visualize the quality of the hashes and benchmark the performance. I therefore took the time to write my own tests.

Testing Previous Functions

I created a small piece of checking code here. It outputs results, as well as various visualisation in PGM format. There is then a script to convert the PGM files to scaled JPG files (which are displayed here). The idea is to be able to visualise the collisions of the hash function, and therefore the quality. Black areas mean large numbers of collisions, whereas white means lower numbers of collisions.

In these tests, I test the collisions for uint8_t, uint16_t and “uint24_t” (which doesn’t exist). This in turn gives us single-byte and two-byte collisions, whereas the three-byte collisions indicate the hashing quality after the initial concentration area.

Why not test more bytes for collisions? The problem is storing the collisions in RAM. These tables are already quite large! The program already consumes in excess of 500MB of RAM during testing and each additional bit to be tested doubles that!

Results

I’ll start off by giving the main results (lower is better):

func	ms	col=8	col=16	col=24
cust1	2634	16777200	16776960	16773120
cust2	2839	16777001	16773120	16711680
cust3	2993	16776960	16711680	15902144
cust4	2295	16776960	16711680	15741061

We check $2^24$ keys (16777216). We expect that for a single byte, there is no way to avoid a collision. As we increase the number of bytes, good hashing algorithms will decrease the number of collisions.

As you can see, the hashing function named cust4 runs faster and has the least collisions in each scenario.

Function 1

This was my initial “make something work” function. It did an ‘okay job’ during initial testing.

0006 char* hash_cust1(char* s, char* d, int n){
0007   int x = -1;
0008   while(s[++x]) d[x % n] = s[x] ^ (x < n ? '\0' : d[x % n] << 2);
0009   while(x < n) d[x++] = '\0';
0010   return d;
0011 }

Performance: In terms of performance, this was my second best forming algorithm. We pay heavily for the inline ternary operator and this for sure affects branch prediction.

Collisions: This was awful.

As can be seen in the one-byte output, the function was completely unable to hash char characters with large overlap.

The two-byte output shows how insanely bad this problem is, with collisions typically happening in exactly the same locations.

It obviously doesn’t improve as we look towards three-byte output. I late realised it is even worse than I thought. If $|s| \ge n$ , it overrides any previous good effort attempt to hash the first two bytes. Ouch!

Function 2

At this time I has realised that my hashing function was significantly holding back my program, so I looked to make some small changes to improve test results.

0012 char* hash_cust2(char* s, char* d, int n){
0013   int x = -1;
0014   while(s[++x]) d[x % n] = s[x] ^ (s[x + 1] << 3);
0015   while(x < n) d[x++] = '\0';
0016   return d;
0017 }

Performance: We managed to lose the ternary operator, so things were looking good! But - for some reason, this actually decreased performance. I imagine there was some cost to looking one byte into the future and performing some operation on it with the current byte. Perhaps there is some cost associated with accessing memory more than once. I would imagine that s[x + 1] would be kept in a register for the next loop, but clearly not?

Collisions: It’s still not great, but clearly a massive improvement!

As we can see, the one-byte output is now more spread out! It’s still a very repeatable, not very well spread pattern - but this is a remarkably better distribution.

Unfortunately this didn’t hold up as we look at the two-byte output. It’s still a much better distribution, but not great.

One thing this function does well is to scatter the distribution across the entire range. This is where we see that s[x + 1] come into its own - but again it is too costly.

Function 3

It’s now that I decided to focus more heavily on the entropy of those first two bytes, where here is achieved by having a dedicated line just for hashing those first two bytes repeatedly.

0018 char* hash_cust3(char* s, char* d, int n){
0019   int x = -1;
0020   while(s[++x]){
0021     d[x % n] = s[x];
0022     d[x % sizeof(uint16_t)] ^= s[x] << 3;
0023   }
0024   while(x < n) d[x++] = '\0';
0025   return d;
0026 }

Performance: As a result of this extra dedicated operation, we lose performance. For the purpose of the application though, this was worth it, as it reduced collisions and therefore increased overall performance.

Collisions: Collisions are tonnes better in this version! It’s night and day!

As you can see, for one-byte we have a much better spread, but still not perfect.

This is also the best two-byte output (so far) as we now see

Function 4

This is the ‘best’ function I could derive based on function 3 (cust3). It now uses memset to zero out the memory, which means a bunch of optimizations methods can be used to accelerate the zeroing of this space. Additionally everything is in-lined to increase simplicity.

0027 char* hash_cust4(char* s, char* d, int n){
0028   memset(d, 0, n);
0029   for(int x = 0; *s; ++x) *(int16_t*)d = (*(int16_t*)d << 5) ^ *(int16_t*)d + (d[x % n] ^= *s++);
0030   return d;
0031 }

Performance: The performance is better than function 3 (cust3), but not as as good as the functions before.

Collisions: The number of collisions are reduced, but not significantly. You can see there is a very visible pattern in the visualisation and this this will ultimately cause larger numbers of hash collisions.

The one-byte spread clearly has a pattern, but this is significantly better than previous attempts.

The two-byte pattern is clearly much better than all previous attempts - we can now be certain that hashes are distributed over the entire space.

The thee-byte visualisation shows that we still suck at distributing the data over multiple bytes. The entropy is entirely concentrated on the first two bytes.

Testing New Functions

Ideally, we want a faster function that reduces collisions with the exact same data. Is this possible? Yes.

Armed with this new knowledge, I now test some new functions (and display the results for previous algorithms during the same run window) for $2^24$ keys:

func	ms	col=8	col=16	col=24
cust1	4267	16777200	16776960	16773120
cust2	4506	16777001	16773120	16711680
cust3	5292	16776960	16711680	15902144
cust4	4712	16776960	16711680	15741061
cust5	4602	16776960	16711680	15730195
cust6	4761	16776960	16711680	15728390
cust7	4265	16776960	16711680	15632352
cust8	4468	16776960	16711680	15728624
cust9	6168	16776960	16711680	6155480
cust10	5329	16776960	16711680	6138576
cust11	5320	16776960	16711680	7083260
cust12	4988	16776960	16711680	5915585
cust13	4838	16776960	16711680	5915585

As you can see, cust12 and cust13 have the least hash collisions whilst maintaining respectable timing (compared to the cust4 benchmark).

And for the purpose of really exaggerating the timing, I run them also at $2^28$ keys (268435456):

func	ms	col=8	col=16	col=24
cust1	68992	268435440	268435200	268431360
cust2	74341	268435241	268431360	268369920
cust3	78551	268435200	268369920	267386624
cust4	70847	268435200	268369920	267386624
cust5	66663	268435200	268369920	267386644
cust6	82937	268435200	268369920	267386624
cust7	65465	268435200	268369920	261502963
cust8	65944	268435200	268369920	267386864
cust9	81174	268435200	268369920	251658253
cust10	76063	268435200	268369920	251658241
cust11	71695	268435200	268369920	251658302
cust12	66548	268435200	268369920	251658240
cust13	66037	268435200	268369920	251658240

As you can see, cust5, cust7, cust8, cust12 and cust13 all outperform cust14. We’ll now go into a little more detail about the algorithms and how they were designed…

Function 5

This is similar the function 4, but operates on the theory that we are shifting our entropy out the front of the uint16_t. Instead of shifting by 5, we instead shift by 3:

0032 char* hash_cust5(char* s, char* d, int n){
0033   memset(d, 0, n);
0034   for(int x = 0; *s; ++x) *(int16_t*)d = (*(int16_t*)d << 3) ^ *(int16_t*)d + (d[x % n] ^= *s++);
0035   return d;
0036 }

Shifting by lower values seems to reduce the number of clock-cycles required to do the shift. Ideally shift values should be as low as possible.

Performance: As can bee seen, we reduce the running time when compared to function 4 (cust4).

Collisions: We perform slightly worse on collisions, but it is worth the speed-up.

The one-byte pattern looks very good. No place in particular looks overly concentrated.

This scales well to the two-byte pattern where we see a relatively nice distribution. At this point I realised that the blocky-ness acutally comes from that XOR operation which can be addressed in the future!

Three-byte pattern still have really poor hash distribution and is causing tonnes of collisions.

Function 6

With function 6 (cust6) I experimented with using rotl16 instead of shift, my theory being that we were shifting entropy straight off the bit.

0037 char* hash_cust6(char* s, char* d, int n){
0038   memset(d, 0, n);
0039   for(int x = 0; *s; ++x) *(int16_t*)d = rotl16(*(int16_t*)d, 5) ^ *(int16_t*)d + (d[x % n] ^= *s++);
0040   return d;
0041 }

Performance: Using the in-line function took a massive performance hit!

Collisions: It made only a little difference to the number of collisions ultimately.

In the one-byte pattern you can clearly see some blocking - very interesting! There is clearly something going on here!

This is not a bad distribution at all for the two-byte distribution.

The three-byte pattern still sucks, argh!

Function 7

At this point, I realised that we could overcome the issue of losing bits out the front of the bit shift simply by using larger integers instead of rotation:

0042 char* hash_cust7(char* s, char* d, int n){
0043   memset(d, 0, n);
0044   for(int x = 0; *s; ++x) *(int32_t*)d = (*(int32_t*)d << 3) ^ *(int32_t*)d + (d[x % n] ^= *s++);
0045   return d;
0046 }

Performance: It’s obviously much faster to use larger integers rather than the dedicated rotation function.

Collisions: We saw a nice little reduction in the number of collisions, awesome!

The one-byte pattern looks quite nice again.

The two-byte pattern is still a little blocky.

The three-byte pattern still sucks…

Function 8

This was just an experiment in performing the ++x elsewhere.

0047 char* hash_cust8(char* s, char* d, int n){
0048   memset(d, 0, n);
0049   for(int x = 0; *s;) *(int16_t*)d = (*(int16_t*)d << 5) ^ *(int16_t*)d + (d[++x % n] ^= *s++);
0050   return d;
0051 }

Performance: Runs a little slower.

Collisions: A regression in reducing collisions.

Nice one-byte pattern.

A little blocky two-byte pattern, but okay-ish.

Again an awful three-byte pattern.

Function 9

At this point I was getting a little desperate, the three-byte distributions have so-far sucked! I started experimenting with purposely ensuring that the upper hash was definitely touched by our algorithm:

0052 char* hash_cust9(char* s, char* d, int n){
0053   memset(d, 0, n);
0054   for(int x = 0; *s;){
0055     *(int16_t*)d = (*(int16_t*)d << 5) ^ *(int16_t*)d + (d[++x % n] ^= *s++);
0056     *(int16_t*)(d + (x % n - 1)) ^= *(int16_t*)d << 3;
0057   }
0058   return d;
0059 }

Performance: Doing too major calculations rather than one of course had a massive impact on performance.

Collisions: This is the first time we have seen such a massive reduction in collisions!

One-byte distribution doesn’t look all so bad.

Two-byte distribution is a little blocky, but not all so bad.

The three-byte distribution is… Wait a second! That’s awesome! We have a nice spread over the entire 3-byte distribution space!

Function 10

So at this point, I needed to think of a completely different approach. I turned back to my original resource for simple hash functions and thought it would be worth also checking out the sdmb implementation, specifically the gawk implementation:

0060 char* hash_cust10(char* s, char* d, int n){
0061   memset(d, 0, n);
0062   for(int x = 0; *s; ++x)
0063     *(int32_t*)d = (d[x % n] ^= *s++) + (*(int32_t*)d << 6) + (*(int32_t*)d << 16) - *(int32_t*)d;
0064   return d;
0065 }

Performance: A reduction in performance when compared to function 9 (cust9). Still not as good as function 4 (cust4), but I’ll take it!

Collisions: Another reduction in collisions! I believe the merit here is we see multiple ‘smears’/‘spreads’ of data across the range.

The one-byte pattern looks super cool!

That two-byte pattern is quite evenly distributed too, not bad!

And look at that three-byte pattern, that’s very nice!

Function 11

We speed-up function 11 (cust11) by changing the order of the operations (long story):

0066 char* hash_cust11(char* s, char* d, int n){
0067   memset(d, 0, n);
0068   for(int x = 0; *s; ++x)
0069     *(uint32_t*)d = (*(int32_t*)d << 16) + (*(int32_t*)d << 6) + *(int32_t*)d + (d[x % n] ^= *s++);
0070   return d;
0071 }

Performance: This gives us a little speed-up.

Collisions: Collisions are for some reason slightly higher, although the algorithm should be essentially the same for the most part.

Weird one-byte spread.

Weirder two-byte spread.

Nice three-byte spread.

Function 12

Instead of bit-shifting with (*(int32_t*)d << 16), we now just cast instead with *(int16_t*)(d + 2). We also use 16 + 5 bit shifting rather than 16 + 6 for a slightly nicer distribution:

0072 char* hash_cust12(char* s, char* d, int n){
0073   memset(d, 0, n);
0074   for(int x = 0; *s; ++x)
0075     *(uint32_t*)d = *(int16_t*)(d + 2) + (*(int32_t*)d << 5) + *(int32_t*)d + (d[x % n] ^= *s++);
0076   return d;
0077 }

Performance: For our effort we get significant speed-up, out-competing funct4 once again!

Collisions: Collisions are also reduced now owing to the shifting of 16 + 5.

A very nice one-byte distribution.

And a very nice two-byte distribution to match.

The three-byte distribution, whilst not the prettiest to look at, is much better than previous attempts.

Function 13

This function is a very small iteration of the previous function.

0078 char* hash_cust13(char* s, char* d, int n){
0079   memset(d, 0, n);
0080   for(int x = 0; *s;)
0081     *(uint32_t*)d = *(int16_t*)(d + 2) + (*(int32_t*)d << 5) + *(int32_t*)d + (d[x++ % n] ^= *s++);
0082   return d;
0083 }

Performance: Slightly better performance.

Collisions: The same number of collisions. The below exist just for completeness.

Final Thoughts

I’m pretty happy with the final implementation and believe is is a significant improvement over previous attempts. That said, I still believe there are improvements that can be made in this space:

Single byte (uint8_t) addressing - Removing the dependence on large integers is ideal. Having a hash function that can be used on any architecture (from an 8 bit microcontroller to a 64 bit server) and produce the same result is clearly beneficial.
Remove the memset - Ideally we do the wiping as we setup the hash.
Spread bytes over the entire hash - Even for short keys, we should be able to sample any part of the hash and get a pretty decent distribution.

In future experiments I will look to sample the end of the hash too when looking at the distribution. I suspect they all currently absolutely suck at filling this space in which anything and we get tonnes of collisions out there.

If you make improvements based on these tests, please let me know! Better hashes are better for everybody!