Coffee Space


Listen:

Database Concept

Preview Image

For the dead social project the ‘database’ is just a series of files on disk, which worked initially, but there are now some 4k posts (at the time of writing) and 6.4k tags, each of which represent a single file.

For a PhD project I ran over a million experiments in a directory, each of which generated their own log file. This essentially broke the EXT partition and the drive later died!

Clearly, what I need is a simple, performant, flat file database.

After the relative success of writing uServer, I am wondering whether I could write something similar but for databases.

Requirements

Ideally this is all to be achieved within the 256 line limit (with comments), as was seen with uServer.

I would have the following basic requirements of the interface:

And the following requirements of the operations:

What we do not expect, but would be nice to have:

Existing

I quite like the simplicity of Google’s LevelDB, where it essentially offers the methods:

  • The basic operations are Put(key,value), Get(key), Delete(key).

It also offers tonnes of other features too.

Structure

One important consideration is the ability to read/write files really fast. Ideally this is done with an O(1)O(1) look-up where possible. We look to achieve this with the following data structure:

Proposed database structure

Hash Map Table

The first structure is the hash map table. This file would look approximately like so:

0001 [small_hash]: [data_index]

We would have some hash function H() take a key string and produce a hash key k. A small version of this would look at an index in the lookup array and point to the data file.

Our key is of size KbitsK_{bits} and we round this to the closest number of bytes KbytesK_{bytes}. The size of the pre-built look-up table is therefore Kbytes×2KbitsK_{bytes} \times 2^{K_{bits}}. Here is some example of table sizes:

kbitsk_{bits} KbytesK_{bytes} Equation Result (bytes) Results
88 11 1×281 \times 2^{8} 256 256B
1010 22 2×2102 \times 2^{10} 2,048 2kB
1616 22 2×2162 \times 2^{16} 131,072 128kB
2424 33 3×2243 \times 2^{24} 50,331,648 48MB
3232 44 4×2324 \times 2^{32} 17,179,869,184 16GB
6464 88 8×2648 \times 2^{64} 147,573,952,589,676,412,928 134.2EB

So this method is fine for small hash bit values, but doesn’t scale well as the number of bits are increased. Generally we will try to keep our hash keys under 24 bits and handle hash collisions as they occur. This allows for a maximum of 2242^{24} entries (approximately 16 million entries), which shouldn’t cause too many issues up until approximately 1 million entries or so.

We can do a little better though. If we checkout our original equation Kbytes×2KbitsK_{bytes} \times 2^{K_{bits}}, we can actually reinterpret this as Mbytes×2KbitsM_{bytes} \times 2^{K_{bits}}, where MbytesM_{bytes} are the number of bytes used for lookup. We can therefore have a maximum of 2(8×Mbytes)2^(8 \times M_{bytes}) entries. Some example sizes:

kbitsk_{bits} MbytesM_{bytes} Equation Result (bytes) Results
3232 33 3×2323 \times 2^{32} 12884901888 12GB
1616 33 3×2163 \times 2^{16} 196608 192kB

Note: It doesn’t make sense for 28×Mbytes>Kbits2^{8 \times M_{bytes}} \gt K_{bits}, as it means you have a larger address space than you can actually address.

If we are willing to break our O(1)O(1) requirement, we can actually instead use the mapping as a search accelerator, rather than a proper mapping. Instead of the hashed key kk directly reporting an index in the database, it could instead suggest that the data can be found at a multiple of the given index. This way we can reduce the value of MbytesM_{bytes}, where a value of 2 would indicate we check the same index offset every 2(8×2)2^(8 \times 2) records at a small cost of search time. This would also require that each record contain the full hash size to ensure the correct entry was found.

Another hash map table structure could be:

0002 [small_hash]: [large_hash] [data_index]

Where we loop through offsets of small_hash until we find a match for the large_hash. This would also allow for the hash table to increase in size as more data is added at a small cost of searching - this time inside the hash table rather than the data table.

Note: It may also be worth considering building the hash table in RAM for speed by doing an initial parse of the database during initialisation - rather than keeping it on disk. If small_hash is derived from large_hash then we would only need the database file in order to rebuild the hash map table.

Database

The database entries on the other hand would likely be quite boring, with a structure such as:

0003 [data_index]: [large_hash] [value]

Note: It might also be worth considering checking for data integrity.

Interface

An interface may look as follows:

0004 /**
0005  * udatabase_create()
0006  *
0007  * Create a database on disk.
0008  *
0009  * @param filename The location of the database to be create.
0010  * @return Creation success result.
0011  **/
0012 int udatabase_create(char* filename);
0013 
0014 /**
0015  * udatabase_init()
0016  *
0017  * Open a database for read/write operations.
0018  *
0019  * @param filename The location of the database to be operated on.
0020  * @param size The maximum size of a stored item.
0021  * @param cache Number of items to store in the cache.
0022  * @return Initialisation success result.
0023  **/
0024 int udatabase_init(char* filename, int size, int cache);
0025 
0026 /**
0027  * udatabase_shash()
0028  *
0029  * Convert a string to a hash.
0030  *
0031  * @param s The NULL terminated string to be hashed.
0032  * @param b The number of bits to hash the string to.
0033  * @return The hashed number.
0034  **/
0035 /* NOTE: https://gitlab.com/danbarry16/u-server/-/commit/5ebd673a2b96b8c2e1897cd1c733e6857902b904 */
0036 /* NOTE: https://stackoverflow.com/questions/7666509/hash-function-for-string */
0037 unsigned int udatabase_shash(char* s, unsigned int b);
0038 
0039 /**
0040  * udatabase_scmp()
0041  *
0042  * Compare two NULL terminated strings, checking if equal.
0043  *
0044  * @param a The first string to be checked.
0045  * @param b The substring to be checked that is NULL terminated.
0046  * @param c Treat as a NULL character in string a.
0047  * @return A pointer to the end of the match if equal, otherwise NULL.
0048  **/
0049 /* NOTE: Check userver_scmp(). */
0050 char* udatabase_scmp(char* a, char* b, char c);
0051 
0052 /**
0053  * udatabase_put()
0054  *
0055  * Put data into the database using the given key. It does not check if the key
0056  * already exists.
0057  *
0058  * @param k A NULL terminated key string to be used to insert a value.
0059  * @param v A NULL terminated value string to be inserted into the database.
0060  * @return The success of the insertion.
0061  **/
0062 int udatabase_put(char* k, char* v);
0063 
0064 /**
0065  * udatabase_get()
0066  *
0067  * Get the data stored at the location if it exists.
0068  *
0069  * @param k A NULL terminated key string value to be searched for.
0070  * @return The value associated with the key, otherwise NULL.
0071  **/
0072 char* udatabase_get(char* k);
0073 
0074 /**
0075  * udatabase_close()
0076  *
0077  * Close the database so that no further operations can be performed.
0078  *
0079  * @return Tne success of closing.
0080  **/
0081 int udatabase_close();

We should be able to make use of fseek() in order to read and write at given offsets, without needs to crawl the entire flat binary.

With regards to reading/writing structured data, using something like JSON would be nice in theory, but actually a pain to write simply and concisely. Instead I could imagine writing something like follows:

0082 struct Post{
0083   char type;
0084   unsigned long id;
0085   char* payload;
0086 ];
0087 
0088 char* buff = udatabase_get("some key");
0089 if(buff != NULL){
0090   /* Make sure the structured data is of the type we expect */
0091   if(buff[0] == 'p'){
0092     struct Post* post = (struct Post*)buff;
0093     /* TODO: Rest of code here on structured data. */
0094   }else{
0095     /* TODO: Unknown data format. */
0096   }
0097 }else{
0098   /* TODO: Handle read error. */
0099 }

Maybe another update in a few weeks when I iron out the details (and get some time).