Coffee Space


Listen:

uDatabase

Preview Image

Okay… I did a thing.

A few days ago I touted a database concept. I then built out the initial concept. After this I did some further work and improved the concept. Today I can now release uDatabase, an ultra simple Unix-based database library in less than 256 lines of C.

From the readme (at the time of writing):

Performance

It’s stupendously fast. How fast? During a 500k record test 1 it could write 25k records per second, read 28k records per second and deleted 40k records a second. This is all on really old hardware, writing on average 2.5kB per record.

Records vs Time Per Record

As you can see, the time per record processing is extremely fast and comes close to O(1)O(1) for a put, get and remove operation. Each record takes less that 50 microseconds to handle, with the average record during the 500k tests being 2.5kB in size.

Records vs Total Time

The total time to perform any of the put, get or remove operations is as low as 20 seconds for 500k records of 2.5kB each. In the previous update this took three times as long.

Records vs Record Rate

And as we confirmed previously, we hit approximately 25k records per second in the worst case, on my old busted hardware.

The raw results are as follows:

Function Records ms ms/record records/sec
udatabase_put() 4 1 ms 0.250000 4000.000000
udatabase_put() 128 1 ms 0.007812 128000.000000
udatabase_put() 256 2 ms 0.007812 128000.000000
udatabase_put() 512 3 ms 0.005859 170666.666667
udatabase_put() 1024 6 ms 0.005859 170666.666667
udatabase_put() 2048 14 0.006836 146285.714286
udatabase_put() 4096 61 0.014893 67147.540984
udatabase_put() 8192 91 0.011108 90021.978022
udatabase_put() 65536 1545 0.023575 42418.122977
udatabase_put() 131072 4350 0.033188 30131.494253
udatabase_put() 262144 8287 0.031612 31633.160372
udatabase_put() 524288 20876 0.039818 25114.389730
udatabase_get() 4 0 0.000000 inf
udatabase_get() 128 0 0.000000 inf
udatabase_get() 256 0 0.000000 inf
udatabase_get() 512 2 0.003906 256000.000000
udatabase_get() 1024 3 0.002930 341333.333333
udatabase_get() 2048 7 0.003418 292571.428571
udatabase_get() 4096 15 0.003662 273066.666667
udatabase_get() 8192 37 0.004517 221405.405405
udatabase_get() 65536 281 0.004288 233224.199288
udatabase_get() 131072 2246 0.017136 58357.969724
udatabase_get() 262144 4759 0.018154 55083.841143
udatabase_get() 524288 18258 0.034824 28715.521963
udatabase_remove() 4 0 0.000000 inf
udatabase_remove() 128 2 0.015625 64000.000000
udatabase_remove() 256 2 0.007812 128000.000000
udatabase_remove() 512 6 0.011719 85333.333333
udatabase_remove() 1024 12 0.011719 85333.333333
udatabase_remove() 2048 26 0.012695 78769.230769
udatabase_remove() 4096 51 0.012451 80313.725490
udatabase_remove() 8192 107 0.013062 76560.747664
udatabase_remove() 65536 1464 0.022339 44765.027322
udatabase_remove() 131072 3338 0.025467 39266.626723
udatabase_remove() 262144 7541 0.028767 34762.498342
udatabase_remove() 524288 12699 0.024221 41285.770533

Is this any good? I have no freaking idea. There are really no apples-to-apples benchmarks out there…

I found this one report for Berkeley DB written by Oracle. On four different machines they got tonnes of different results.

I found another report on GitHub testing relational databases and soem of the results look somewhat similar?

I still have no idea.

Implementation

Here I will discuss some insights I got and some nice to know stuff.

Profiling

I haven’t tried profiling with C before, but it’s genuinely pretty good. During compiling with gcc, you can add a -pg flag to allow for profiling. Next you need to run your program, which will generate a gmon.out file. After this you can the convert it into something human readable with:

0001 gprof -b <BINARY> gmon.out > analysis.out

For uDatabase, analysis.out looks like so (I have cut it to the first section of the output):

0002 Flat profile:
0003 
0004 Each sample counts as 0.01 seconds.
0005   %   cumulative   self              self     total           
0006  time   seconds   seconds    calls  ms/call  ms/call  name    
0007  71.43      0.05     0.05   245388     0.00     0.00  udatabase_search
0008  28.57      0.07     0.02        5     4.00     4.00  udatabase_service
0009   0.00      0.07     0.00   245388     0.00     0.00  udatabase_shash
0010   0.00      0.07     0.00   163592     0.00     0.00  udatabase_remove
0011   0.00      0.07     0.00    81796     0.00     0.00  udatabase_get
0012   0.00      0.07     0.00    81796     0.00     0.00  udatabase_put
0013   0.00      0.07     0.00       54     0.00     0.00  currentTimeMillis
0014   0.00      0.07     0.00        1     0.00     0.00  udatabase_close
0015   0.00      0.07     0.00        1     0.00     0.00  udatabase_signal

Here you can see how much time is spent in each function, how many calls to each function and other cool stuff. This is a great way to benchmark your improvements when speeding up functions!

Hashing

During profiling, I realised that udatabase_service() got called often - in fact, way too often. The function is responsible for expanding the lookup table, but we were doing it insanely often. It’s a very heavy function as it also cleans up the database on disk at the same time.

I initially used __unint128_t, but this is both not compiler portable or even system portable. I ended up switching to a char* as all we were essentially doing was byte matching.

After some time, it occurred to me that we were getting tonnes of hash collisions, causing udatabase_service() to get called to expand the lookup table, and then pay massive overheads from this function.

My initial hash function was as follows:

0016 /**
0017  * udatabase_shash() Convert a string to a hash.
0018  * Source: http://www.cse.yorku.ca/~oz/hash.html
0019  * @param s The NULL terminated string to be hashed.
0020  * @return The hashed number.
0021  **/
0022 __uint128_t udatabase_shash(char* s){
0023   __uint128_t h = 0;
0024   int c;
0025   while(c = *s++) h = c + (h << 6) + (h << 16) - h;
0026   return h;
0027 }

This worked well, except I no longer wanted to rely on the __uint128_t type.

Next I implemented string-based hashing and cooked up the following:

0028 /**
0029  * udatabase_shash() Convert a string to a hash.
0030  * @param s The NULL terminated string to be hashed.
0031  * @param d The target pre-allocated string of hash length for the result.
0032  * @param n The length of the hash to be created.
0033  * @return A pointer to the hashed string stored in d.
0034  **/
0035 char* udatabase_shash(char* s, char* d, int n){
0036   int x = -1;
0037   while(s[++x]) d[x % n] = s[x] ^ (x < n ? '\0' : d[x % n] << 2);
0038   while(x < n) d[x++] = '\0';
0039   return d;
0040 }

It worked fine during small test cases, but absolutely sucked in the larger tests… I got a little better performance with:

0041   while(s[++x]) d[x % n] = s[x] ^ (s[x + 1] << 3);

But it still sucked for large scale tests…

Then it hit me! Out of the result of this function (which we’ll call hash), we pull out the first two bytes and use it as a short hash like so: *(uint16_t*)hash. We then use this short form for our initial testing for matches and our lookup table acceleration.

The problem is, with the previous version of the hashing function, the first two bytes are really low entropy! What of there was a way to force high entropy onto the first two bytes… There is!

0042 /**
0043  * udatabase_shash() Convert a string to a hash.
0044  * @param s The NULL terminated string to be hashed.
0045  * @param d The target pre-allocated string of hash length for the result.
0046  * @param n The length of the hash to be created.
0047  * @return A pointer to the hashed string stored in d.
0048  **/
0049 char* udatabase_shash(char* s, char* d, int n){
0050   int x = -1;
0051   while(s[++x]){
0052     d[x % n] = s[x];
0053     d[x % sizeof(uint16_t)] ^= s[x] << 3;
0054   }
0055   while(x < n) d[x++] = '\0';
0056   return d;
0057 }

This works extremely well! I’m not away of any other C (or other language) string->string hash function, and certainly of nothing that prioritizes entropy in the first bits in order for a simple shortening cast to take advantage of.

memcmp()

Holy moly I wish I had known about this function sooner! My initial string comparison looked like this:

0058 /**
0059  * udatabase_cmp() Compare two binary strings of equal length.
0060  * @param a The first string to be compared.
0061  * @param b The second string to be compared.
0062  * @param n The length of the strings to be compared.
0063  * @return If strings are equal return 0, otherwise not equal.
0064  **/
0065 int udatabase_cmp(char* a, char* b, int n){
0066   while(n-->0) if(*a++ != *b++) break;
0067   return n + 1;
0068 }

It’s ‘okay’, but there’s a lot of checking going on in there. I did some checking online and in theory it is possible to do zero-branching string compare for fixed-length strings… But someone (cannot remember who) suggested just to forget that and checkout memcmp() instead! It’s super speedy!

Another plus is it has example the same definition and meant that I could delete my comparison function entirely!

Signal Handling

Okay, this was a stroke of genius. There are a few great points with this code:

  1. All signals of interest exist between SIGHUP and SIGKILL, and all implementations I could find number these roughly between 1 to 9, in that order. All the signal in between were also pretty interesting, so with a single line, we can register our interest in them all!
  2. We access our file pointer FILE* to close the database. This is stored in a database struct uDB. Our close() function should be able to be called by anybody, including the signal handler. So the question is, how do we give the function udatabase_signal() a reference to db when parameters are not supported for the signal handler? Check this out…
0069   for(int x = SIGHUP; x < SIGKILL; x++) signal(x, (void (*)(int))udatabase_signal);
0070   udatabase_signal(0, db);

After setting up the signals of interest, we hand over a fake signal with access to our db struct.

0071 /**
0072  * udatabase_signal() Handle a given signal and close gracefully.
0073  * @param db The database description.
0074  * @param signum The signal number indicating
0075  * @param ptr The raw pointer to the database structure.
0076  **/
0077 void udatabase_signal(int signum, void* ptr){
0078   static struct uDB* db = NULL;
0079   if((db = db != NULL ? db : ptr) != NULL && signum >= SIGHUP && signum <= SIGKILL)
0080     udatabase_close(db); // NOTE: Remains locked.
0081 }

We then read this into a static variable if we detect we have not already stored it! I still need to see exactly how this will work with multiple databases opened though… I suspect it won’t work.

Helpful Functions

There is an example implementation that uses most of the functions and allows you to perform performance tests.

The following are the main functions (at the time of writing) that I expect people to use. They are ordered in roughly the order you may choose to use them.

0082 /**
0083  * udatabase_create() Create a database on disk and closes it.
0084  * @param filename The location of the database to be create.
0085  **/
0086 void udatabase_create(char* filename);

udatabase_create() - Create a new database with a given filename. This essentially just creates a file if one doesn’t already exist 2.

0087 /**
0088  * udatabase_init() Open a database for read/write operations.
0089  * @param db The database description pointer.
0090  * @param filename The location of the database to be operated on.
0091  * @param width The width of an entire record.
0092  **/
0093 void udatabase_init(struct uDB* db, char* filename, long int width);

udatabase_init() - Load a database from disk with a given filename, into a database structure db (allocation of structure memory handled by user), with entries on disk of a given width. The purpose of having db is to allow multiple databases to be opened at the same time (although this is currently untested 3).

0094 /**
0095  * udatabase_put() Store value using key. No check if the key already exists.
0096  * @param db The database description.
0097  * @param k A NULL terminated key string to be used to insert a value.
0098  * @param v A string to be inserted into the database.
0099  * @param n The length of the string to be stored.
0100  * @return The success of the insertion.
0101  **/
0102 int udatabase_put(struct uDB* db, char* k, char* v, long int n);

udatabase_put() - Put a new record into the database db, with key k, payload value v, of payload length n. v doesn’t have to be NULL terminated, and if not, you should have a way of figuring out the length once it’s retrieved if it’s not of fixed length.

0103 /**
0104  * udatabase_search() Search for an entry in the database given a key.
0105  * @param db The database description.
0106  * @param k A NULL terminated key string value to be searched for.
0107  * @param kHash The already hashed key, otherwise NULL.
0108  * @return The table offset with correct fseek(), -1 not found.
0109  **/
0110 long int udatabase_search(struct uDB* db, char* k, char* kHash);

udatabase_search() - Search for an entry in the database db, given a key k, and pre-computed hash kHash (otherwise just set NULL). If the entry is found, you will be a value 0\ge 0, otherwise <0< 0. The disk will also be setup to this location.

This function can be used to check whether an entry exists or not.

Note: If you call this function you will be responsible for releasing the lock, as it keeps it open for the caller thread as not to let the disk position be changed.

0111 /**
0112  * udatabase_get() Get the data stored at the location if it exists.
0113  * @param db The database description.
0114  * @param k A NULL terminated key string value to be searched for.
0115  * @return The malloc'd value associated with the key, otherwise NULL.
0116  **/
0117 char* udatabase_get(struct uDB* db, char* k);

udatabase_get() - Get a value from the database db given a key k, otherwise NULL if it could not be found.

Note: The returned value is malloc()’d, so the caller must run free() on it once finished with it.

0118 /**
0119  * udatabase_remove() Remove a given record and all related entries.
0120  * @param db The database description.
0121  * @oaram k The key to be matched for the entry.
0122  * @param kHash The already hashed key, otherwise NULL.
0123  * @return The table offset, -1 not found.
0124  **/
0125 long int udatabase_remove(struct uDB* db, char* k, char* kHash);

udatabase_remove() - Remove an entry from the database db using key k. If the hash for the key is already computed, this can be given using kHash, otherwise set to NULL. The function returns the offset into the table.

Note: If you call this function you will be responsible for releasing the lock, as it keeps it open for the caller thread as not to let the table position be changed.

0126 /**
0127  * udatabase_service() Compress empty records and rebuild lookup.
0128  * @param db The database description.
0129  * @return 0 on success, otherwise -1 if another service required.
0130  **/
0131 int udatabase_service(struct uDB* db);

udatabase_service() - Perform servicing on database db. This function is called automatically, but may also be called periodically if the database is using too much disk space. This function will take a while to complete depending on which much clean-up work is required.

0132 /**
0133  * udatabase_close() Close the database and stop further operation.
0134  * @param db The database description.
0135  * @return The success of closing.
0136  **/
0137 int udatabase_close(struct uDB* db);

udatabase_close() - Close the database db and keep in a locked state, so that no other operation may be performed.

Future

That’s enough from me for now! As mentioned in the uServer project this is all part of a larger ongoing effort.

I will need to think of a project to put both uServer and uDatabase together, and run them through their paces in a live system. That’s the best way to find bugs after all!


  1. Please look at the source code as the tests are purposely designed to test the database to the limit↩︎

  2. This was historically much more useful.↩︎

  3. One question is how the signal handler would react.↩︎