I came across this very interesting Reddit thread on essentially printing `y\n` as fast as possible. The baseline here is GNU's implementation of `yes`, which is exceptionally fast.
kjensenxz prefaces this with:
> Compared to other Unices, GNU is outrageously fast. NetBSD’s is 139MiB/s, FreeBSD, OpenBSD, DragonFlyBSD have very similar code as NetBSD and are probably identical, illumos’s is 141MiB/s without an argument, 100MiB/s with. OS X just uses an old NetBSD version similar to OpenBSD’s, MINIX uses NetBSD’s, BusyBox’s is 107MiB/s, Ultrix’s (3.1) is 139 MiB/s, COHERENT’s is 141MiB/s.
What got me interested in thinking about this is: “can we do better?” As it turns out, yes! 1
My test machine is as follows:
```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             2600.000
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            5199.98
Virtualisation:      VT-x
L1d cache:           128 KiB
L1i cache:           128 KiB
L2 cache:            1 MiB
L3 cache:            6 MiB
```
The fans are on maximum, but it’s an old and tired machine! All of the testing included burn-in time, but bear in mind that this machine is also running lots of other random rubbish. These values are not going to be precise, but should give a rough idea of performance.
Here we will be testing a variety of programs, listing as much information about each of them as possible.
yes
This is the baseline `yes` currently installed on the test machine. The version information is as follows:
```
$ yes --version
yes (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
Licence GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie.
```
yes-orig
This is the original “iteration 4” from the Reddit post:
```c
/**
 * yes.c
 *
 * Source: https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}
```
This is ultimately the implementation we are trying to beat.
yes-prebuild_buffer
I wondered whether there would be any difference between dynamically allocating the buffer, or shipping the binary with the buffer already set up. My thinking was that it may be loaded into the same memory, therefore use a single page of cache.
```c
/**
 * yes-prebuild_buffer.c
 *
 * The idea for this modification is to pre-build the buffer used.
 **/

#include <unistd.h>

#define LEN 2
#define TOTAL 8192

char buf[TOTAL] = {
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
  /* [..] Many duplicates here! [..] */
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
};

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(write(1, buf, TOTAL));
  return 1;
}
```
The variable `buf` was of course filled with `TOTAL` bytes of alternating `y` and `\n` values. This had the effect of increasing the binary size.
yes-setbuf
With this implementation, I was wondering whether the kernel was buffering the output to `stdout` (as it sometimes does with network and disk I/O). The idea here was to override this with `setvbuf` and set the stream to fully buffered (`_IOFBF`) with our own buffer. (Note that `setvbuf` only affects stdio streams, and the loop below calls `write()` directly, so this would not be expected to change much.)
```c
/**
 * yes-setbuf.c
 *
 * The idea for this modification is to change the program's buffering.
 **/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  setvbuf(stdout, buf, _IOFBF, TOTAL);
  while(write(1, buf, TOTAL));
  return 1;
}
```
yes-reduce_branch
An additional line of enquiry was whether we were paying any cost for the `while` loop, specifically in branching:
```c
/**
 * yes-reduce_branch.c
 *
 * The idea for this modification is to reduce the branching caused by the main
 * loop.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(1){
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
  }
  return 1;
}
```
Of course, the massive assumption here is that there isn’t already tonnes of branching in `write()` or any calls it makes - which you can be sure there is.
yes-aligned
One thing it’s possible we are paying for is unaligned memory - i.e. our buffer straddles more pages than it needs to (an unaligned 8 KiB buffer can touch three pages instead of two). We can therefore allocate a page-aligned buffer and test this:
```c
/**
 * yes-aligned.c
 *
 * The idea for this modification is to force memory alignment.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = aligned_alloc(4096, TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}
```
yes-vmsplice
This was a recommendation from the Reddit thread that was completely off my radar - and hence made this exercise very much worth it. It turns out that `vmsplice` is specifically designed for mapping user memory into a pipe.
```c
/**
 * yes-vmsplice.c
 *
 * The idea for this modification is to force vmsplicing.
 **/

#define _GNU_SOURCE
#define __need_IOV_MAX

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <limits.h>
#include <sys/uio.h>

#define LEN 2
#define TOTAL (1 * 1024 * 1024)
#define IOVECS IOV_MAX

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  int i;
  struct iovec iov[IOVECS];
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  for(i = 0; i < IOVECS; i++){
    iov[i].iov_base = buf;
    iov[i].iov_len = TOTAL;
  }
  fcntl(1, F_SETPIPE_SZ, TOTAL);
  while(vmsplice(1, iov, IOVECS, 0));
  return 1;
}
```
Spoiler: it’s fast. And there are even hints that it could be faster still!
yes-single_char
One final idea was whether `putc` performs fewer checks than `write()`, therefore less branching and therefore faster:
```c
/**
 * yes-single_char.c
 *
 * The idea for this modification is to keep everything in cache.
 **/

#include <stdio.h>
#include <fcntl.h>

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(1){
    putc('y', stdout);
    putc('\n', stdout);
  }
  return 1;
}
```
All of the C programs were compiled with the following flags:
```
program=$1
echo "[>>] Building $program ..."
gcc -O2 -pipe -march=native -mtune=native $program.c -o $program 2> /dev/null
echo "[??] Build size: $(ls -sh $program)"
```
Note that the `gcc` version is as follows:
```
$ gcc --version
gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
The programs were then run with the following, which averages the throughput over 1000GB (~1TB) of data:
```
program=$1
echo "[>>] Testing $program ..."
$program | pv -a -s 1000G -S > /dev/null
```
Note that the Reddit thread correctly points out that `pv` will itself become a bottleneck at larger throughputs, as it also makes kernel calls. Still, it’s a reasonably good method for comparing throughput, though it may not represent the absolute maximum.
Without any further delay, here are the results:
| Approach | Binary Size | Speed (GiB/s) |
|---|---|---|
| `yes` | N/A | 4.17 |
| `yes-orig` | 24K | 3.94 |
| `yes-prebuild_buffer` | 32K | 3.9 |
| `yes-setbuf` | 24K | 3.6 |
| `yes-reduce_branch` | 24K | 3.74 |
| `yes-aligned` | 24K | 3.63 |
| `yes-vmsplice` | 24K | 29.8 |
| `yes-single_char` | 24K | 0.184 |
Yes, you read that correctly. Using `vmsplice` is roughly 7x faster than the standard implementation - incredible!
1. Pun intended.