Coffee Space – Coffee Space

Directory Sampling

This is a quick discussion regarding a theoretical attack vector I thought up many years ago. I decided to quickly code up a simulation of how it may look like whilst listening to podcast by David Starkey Talks (it’s pretty good).

Anyway, the idea is that when the web server is string matching files in the directory, strings that are closer to something that exist take longer. Imagine your server hosts index.html (but this file is hidden), one would expect that a string compare algorithm to quit as soon as it finds the first invalid character. If we tested index we would better match the existing file string when compared to something like zzzzz.

To answer the question: Is it possible to reveal non-public files via sampling? The answer appears to be: yes.

Why?

We could assume that the hidden path somehow encodes something important. Perhaps it is a hidden Youtube URL or a path that is the key to opening a file. Perhaps it is some yet to be published work unlinked to anywhere. Perhaps it is an undocumented API for interacting with your service.

The point is, there are millions of reasons why you may have a path that is undocumented and should be near impossible to accidentally stumble across, unless otherwise known about. We like to think that we always employ good security practices, but in reality this is not always true.

Years ago I was shown by a fellow student (whom I have permission to do so) that it was a bad idea to store the .git directory on CoffeeSpace, after they ran a crawler over the website and discovered my Git history. During that time I used to always regularly commit changes to unpublished articles too (but I avoid that these days as it triggers a server rebuild).

Assumptions

We assume the following about the target:

No resource/rate limiting - If the server or any networking along the route starts rate limiting us, then all bets are off.
Responds correctly to HEAD requests - We assume that we can test the existence of a file with minimal networking overhead. Re-using TCP connections would be great, a UDP connection could be even better.
Simple or zero caching - If files are cached then the return time would be exceptionally difficult to detect on the disk.

Experiment

For our simple experiment, we search for the source file test.c on disk, particularly, given the letter t, can we correctly guess the next letter is e?

We use the real disk latency (as it would be on the server), plus simulated network latency of $50$ ms ping with $\pm 10$ ms. We test each letter for a total of 100k samples each, and test just the lowercase letters a to z, for a total of $26 \times 100000 = 2600000$ samples.

Results

As you can see, with just 100k samples per symbol, we see that we correctly find te as the start of the string. We could then go on to test for the remaining symbols.

0001 $ g++ test.c -o test && ./test
0002 tq  4948093
0003 tw  4951596
0004 te  4954150
0005 tr  4950499
0006 tt  4947198
0007 ty  4950049
0008 tu  4946960
0009 ti  4950848
0010 to  4948782
0011 tp  4949851
0012 ta  4953166
0013 ts  4953580
0014 td  4951138
0015 tf  4949545
0016 tg  4948919
0017 th  4948485
0018 tj  4952604
0019 tk  4951642
0020 tl  4950705
0021 tz  4947572
0022 tx  4951750
0023 tc  4952741
0024 tv  4952960
0025 tb  4952880
0026 tn  4950464
0027 tm  4948977
0028 Test took 3044 ms, simulated time 128715154 ms
0029 Largest was 'te'

As you can see, although only taking 3 seconds of real time, in reality this attack would take some 35+ hours to reveal just one character. I believe the use of some form of Monte Carlo tree search (MCTS) could dramatically increase the search time. Further one could attack the target from multiple collaborating vectors and launch several requests at any one time from the same location, dramatically increasing parallelism.

With proper optimization, I imagine one could get the search time down to something like 1 hour. The more a-priori knowledge you have the better.

Source Code

Here is the source code I used for the tests:

0030 /* Compile: g++ test.c test */
0031 
0032 #include <cstring>
0033 #include <string>
0034 #include <sys/stat.h>
0035 #include <sys/time.h>
0036 
0037 #define PING_AVG_MS 50
0038 #define PING_VAR_MS 10
0039 #define SAMPLE 100000
0040 
0041 /* Source: https://stackoverflow.com/a/51336144/2847743 */
0042 int64_t currentTimeMillis(){
0043   struct timeval time;
0044   gettimeofday(&time, NULL);
0045   int64_t s1 = (int64_t)(time.tv_sec) * 1000;
0046   int64_t s2 = time.tv_usec / 1000;
0047   return s1 + s2;
0048 }
0049 
0050 /* Source: https://stackoverflow.com/a/12774387/2847743 */
0051 inline bool exists_test3 (const std::string& name) {
0052   struct stat buffer;
0053   return (stat (name.c_str(), &buffer) == 0);
0054 }
0055 
0056 int test_char(char c){
0057   long start = currentTimeMillis();
0058   char s[2] = { 't', c };
0059   std::string str(s);
0060   exists_test3(str);
0061   long end = currentTimeMillis();
0062   /* Simulation network delay + disk delay */
0063   return (end - start) + PING_AVG_MS + ((rand() % (PING_VAR_MS * 2)) - PING_VAR_MS);
0064 }
0065 
0066 int main(){
0067   srand(0); // Repeatable
0068   const char* symbols = "qwertyuiopasdfghjklzxcvbnm";
0069   int sLen = strlen(symbols);
0070   long* results = (long*)calloc(sLen, sizeof(long));
0071   /* Run tests */
0072   long start = currentTimeMillis();
0073   for(int y = 0; y < SAMPLE; y++){
0074     for(int x = 0; x < sLen; x++){
0075       results[x] += test_char(symbols[x]);
0076     }
0077   }
0078   long end = currentTimeMillis();
0079   /* Print results */
0080   int largest = 0;
0081   long simtime = 0;
0082   for(int x = 0; x < sLen; x++){
0083     printf("t%c\t%li\n", symbols[x], results[x]);
0084     simtime += results[x];
0085     if(results[x] > results[largest]){
0086       largest = x;
0087     }
0088   }
0089   printf("Test took %li ms, simulated time %li ms\n", end - start, simtime);
0090   printf("Largest was 't%c'\n", symbols[largest]);
0091   free(results); // Cleanup
0092 }

Note that the only file in the given directory is test.c.

Conclusion

It is possible with some work as long as some basic assumption can be made about the target. I should have continued working on this project six years ago - although to be fair, it is only really now I have the theoretical tools in order to test such a theory.