This is a quick discussion regarding a theoretical attack vector I thought up many years ago. I decided to quickly code up a simulation of how it may look like whilst listening to podcast by David Starkey Talks (it’s pretty good).
Anyway, the idea is that when the web server is string matching files in the directory, strings that are closer to something that exist take longer. Imagine your server hosts index.html
(but this file is hidden), one would expect that a string compare algorithm to quit as soon as it finds the first invalid character. If we tested index
we would better match the existing file string when compared to something like zzzzz
.
To answer the question: Is it possible to reveal non-public files via sampling? The answer appears to be: yes.
We could assume that the hidden path somehow encodes something important. Perhaps it is a hidden Youtube URL or a path that is the key to opening a file. Perhaps it is some yet to be published work unlinked to anywhere. Perhaps it is an undocumented API for interacting with your service.
The point is, there are millions of reasons why you may have a path that is undocumented and should be near impossible to accidentally stumble across, unless otherwise known about. We like to think that we always employ good security practices, but in reality this is not always true.
Years ago I was shown by a fellow student (whom I have permission to do so) that it was a bad idea to store the .git
directory on CoffeeSpace, after they ran a crawler over the website and discovered my Git history. During that time I used to always regularly commit changes to unpublished articles too (but I avoid that these days as it triggers a server rebuild).
We assume the following about the target:
HEAD
requests - We assume that we can test the existence of a file with minimal networking overhead. Re-using TCP connections would be great, a UDP connection could be even better.For our simple experiment, we search for the source file test.c
on disk, particularly, given the letter t
, can we correctly guess the next letter is e
?
We use the real disk latency (as it would be on the server), plus simulated network latency of ms ping with ms. We test each letter for a total of 100k samples each, and test just the lowercase letters a
to z
, for a total of samples.
As you can see, with just 100k samples per symbol, we see that we correctly find te
as the start of the string. We could then go on to test for the remaining symbols.
0001 $ g++ test.c -o test && ./test 0002 tq 4948093 0003 tw 4951596 0004 te 4954150 0005 tr 4950499 0006 tt 4947198 0007 ty 4950049 0008 tu 4946960 0009 ti 4950848 0010 to 4948782 0011 tp 4949851 0012 ta 4953166 0013 ts 4953580 0014 td 4951138 0015 tf 4949545 0016 tg 4948919 0017 th 4948485 0018 tj 4952604 0019 tk 4951642 0020 tl 4950705 0021 tz 4947572 0022 tx 4951750 0023 tc 4952741 0024 tv 4952960 0025 tb 4952880 0026 tn 4950464 0027 tm 4948977 0028 Test took 3044 ms, simulated time 128715154 ms 0029 Largest was 'te'
As you can see, although only taking 3 seconds of real time, in reality this attack would take some 35+ hours to reveal just one character. I believe the use of some form of Monte Carlo tree search (MCTS) could dramatically increase the search time. Further one could attack the target from multiple collaborating vectors and launch several requests at any one time from the same location, dramatically increasing parallelism.
With proper optimization, I imagine one could get the search time down to something like 1 hour. The more a-priori knowledge you have the better.
Here is the source code I used for the tests:
0030 /* Compile: g++ test.c test */ 0031 0032 #include <cstring> 0033 #include <string> 0034 #include <sys/stat.h> 0035 #include <sys/time.h> 0036 0037 #define PING_AVG_MS 50 0038 #define PING_VAR_MS 10 0039 #define SAMPLE 100000 0040 0041 /* Source: https://stackoverflow.com/a/51336144/2847743 */ 0042 int64_t currentTimeMillis(){ 0043 struct timeval time; 0044 gettimeofday(&time, NULL); 0045 int64_t s1 = (int64_t)(time.tv_sec) * 1000; 0046 int64_t s2 = time.tv_usec / 1000; 0047 return s1 + s2; 0048 } 0049 0050 /* Source: https://stackoverflow.com/a/12774387/2847743 */ 0051 inline bool exists_test3 (const std::string& name) { 0052 struct stat buffer; 0053 return (stat (name.c_str(), &buffer) == 0); 0054 } 0055 0056 int test_char(char c){ 0057 long start = currentTimeMillis(); 0058 char s[2] = { 't', c }; 0059 std::string str(s); 0060 exists_test3(str); 0061 long end = currentTimeMillis(); 0062 /* Simulation network delay + disk delay */ 0063 return (end - start) + PING_AVG_MS + ((rand() % (PING_VAR_MS * 2)) - PING_VAR_MS); 0064 } 0065 0066 int main(){ 0067 srand(0); // Repeatable 0068 const char* symbols = "qwertyuiopasdfghjklzxcvbnm"; 0069 int sLen = strlen(symbols); 0070 long* results = (long*)calloc(sLen, sizeof(long)); 0071 /* Run tests */ 0072 long start = currentTimeMillis(); 0073 for(int y = 0; y < SAMPLE; y++){ 0074 for(int x = 0; x < sLen; x++){ 0075 results[x] += test_char(symbols[x]); 0076 } 0077 } 0078 long end = currentTimeMillis(); 0079 /* Print results */ 0080 int largest = 0; 0081 long simtime = 0; 0082 for(int x = 0; x < sLen; x++){ 0083 printf("t%c\t%li\n", symbols[x], results[x]); 0084 simtime += results[x]; 0085 if(results[x] > results[largest]){ 0086 largest = x; 0087 } 0088 } 0089 printf("Test took %li ms, simulated time %li ms\n", end - start, simtime); 0090 printf("Largest was 't%c'\n", symbols[largest]); 0091 free(results); // Cleanup 0092 }
Note that the only file in the given directory is test.c
.
It is possible with some work as long as some basic assumption can be made about the target. I should have continued working on this project six years ago - although to be fair, it is only really now I have the theoretical tools in order to test such a theory.