Coffee Space


Listen:

Database Concept Update

Preview Image

A few days ago I mentioned the intention to flesh out a micro database library as part of my single file header libraries, a project to build small and simple Unix libraries for doing simple tasks really well.

I spent about a day coding some stuff and got something working - but it’s not nearly at a point where I would trust it or want to use it!

Features

It has the following features currently:

If it wasn’t for some performance issues, it would actually be reasonably serviceable!

Initial Interface

0001   /**
0002    * udatabase_shash() Convert a string to a hash.
0003    * Source: http://www.cse.yorku.ca/~oz/hash.html
0004    * @param s The NULL terminated string to be hashed.
0005    * @return The hashed number.
0006    **/
0007   __uint128_t udatabase_shash(char* s);
0008 
0009   /**
0010    * udatabase_sfind() Find a character in a NULL terminated string.
0011    * @param a The string to be searched.
0012    * @param c The character to be located.
0013    * @return Pointer to the character, otherwise the end of the string.
0014    **/
0015   char* udatabase_sfind(char* a, char c);
0016 
0017   /**
0018    * udatabase_scmp() Compare two NULL terminated strings, checking if equal.
0019    * @param a The first string to be checked.
0020    * @param b The substring to be checked that is NULL terminated.
0021    * @param c Treat as a NULL character in string a.
0022    * @return A pointer to the end of the match if equal, otherwise NULL.
0023    **/
0024   char* udatabase_scmp(char* a, char* b, char c);
0025 
0026   /**
0027    * udatabase_search() Search for an entry in the database given a key.
0028    * @param k A NULL terminated key string value to be searched for.
0029    * @param off Offset to begin the search.
0030    * @return The offset into the database and setup fseek(), otherwise -1.
0031    **/
0032   long int udatabase_search(char* k, long int off);
0033 
0034   /**
0035    * udatabase_get() Get the data stored at the location if it exists.
0036    * @param k A NULL terminated key string value to be searched for.
0037    * @return The malloc'd value associated with the key, otherwise NULL.
0038    **/
0039   char* udatabase_get(char* k);
0040 
0041   /**
0042    * udatabase_put() Store value using key. No check if the key already exists.
0043    * @param k A NULL terminated key string to be used to insert a value.
0044    * @param v A string to be inserted into the database.
0045    * @param n The length of the string to be stored.
0046    * @return The success of the insertion.
0047    **/
0048   int udatabase_put(char* k, char* v, int n);
0049 
0050   /**
0051    * udatabase_close() Close the database and stop further operation.
0052    * @return The success of closing.
0053    **/
0054   int udatabase_close();
0055 
0056   /**
0057    * udatabase_init() Open a database for read/write operations.
0058    * @param filename The location of the database to be operated on.
0059    * @return Initialisation success result.
0060    **/
0061   int udatabase_init(char* filename);
0062 
0063   /**
0064    * udatabase_create() Create a database on disk and closes it.
0065    * @param filename The location of the database to be create.
0066    * @param width The width of each entry in the database (max 2^16).
0067    * @return Creation success result.
0068    **/
0069   int udatabase_create(char* filename, int width);
0070 
0071   /**
0072    * udatabase_compress() Compress the database to make use of deleted records.
0073    * @return 0 on success, otherwise -1 on error.
0074    **/
0075   int udatabase_compress();

A few points to note from the original design concept:

One thing I am not happy with is the fact that the internal database variables are accessible from the outside. I will at some point figure out a way to make these ‘private’ encase somebody attempts to use them as part of the library.

Read/Write Performance

I performed some initial benchmarks for putting data into the database, and got the following results:

Function Records ms ms/record
udatabase_put() 128 57 0.445312
udatabase_put() 256 218 0.851562
udatabase_put() 512 1079 2.107422
udatabase_put() 1024 5000 4.882812
udatabase_put() 2048 21417 10.457520
udatabase_put() 4096 91334 22.298340

Instead of the O(1)O(1) performance I wanted, instead we have O(n)O(n) performance. This is because there is currently no caching, and when we initially perform the udatabase_put() function, it calls udatabase_search() which manually searched the database. As the database grows, so does the time to enter data.

Records vs Time per Record

Worst still is the total time, where we see something like O(n2)O(n^2) cost as the number of records increases. Obviously, something must be done!

Records vs Total Time

My initial guess is that we can actually get these times to something like 0.001 ms per record with near O(1)O(1) performance using a RAM-based cache system as mentioned in the concept article.

Future Intentions

I have the following future intentions:

That’s all for now folks!