
Fast Yes

Background

I came across a very interesting Reddit thread on essentially printing y\n as fast as possible. The baseline here is GNU's implementation of yes, which is already exceptionally fast.

kjensenxz prefaces this with:

Compared to other Unices, GNU is outrageously fast. NetBSD's is 139MiB/s, FreeBSD, OpenBSD, DragonFlyBSD have very similar code as NetBSD and are probably identical, illumos's is 141MiB/s without an argument, 100MiB/s with. OS X just uses an old NetBSD version similar to OpenBSD's, MINIX uses NetBSD's, BusyBox's is 107MiB/s, Ultrix's (3.1) is 139 MiB/s, COHERENT's is 141MiB/s.

What got me interested was the question: "can we do better?". As it turns out, yes 1!

Test Machine

My test machine is as follows:

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:                        3
CPU MHz:                         2600.000
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5199.98
Virtualisation:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB

The fans were on maximum, but it's an old and tired machine! Each test was given some burn-in time, but bear in mind that this machine was also running lots of other random rubbish. These values are not going to be precise, but should give a rough idea of relative performance.

Programs

Here we will be testing a variety of programs, listing as much information about each as possible.

yes

This is the baseline yes currently installed on the test machine. The version information is as follows:

$ yes --version
yes (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
Licence GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie.

yes-orig

This is the original "iteration 4" from the Reddit post:

/**
 * yes.c
 *
 * Source: https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}

This is ultimately the implementation we are trying to beat.

yes-prebuild_buffer

I wondered whether there would be any difference between dynamically allocating the buffer and having the buffer already set up inside the binary. My thinking was that it might be loaded into the same memory each time, and therefore use a single page of cache.

/**
 * yes-prebuild_buffer.c
 *
 * The idea for this modification is to pre-build the buffer used.
 **/

#include <unistd.h>

#define LEN 2
#define TOTAL 8192

char buf[TOTAL] = {
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
  /* [..] Many duplicates here! [..] */
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
};

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(write(1, buf, TOTAL));
  return 1;
}

The variable buf was of course filled with TOTAL bytes of alternating y and \n values. This had the effect of increasing the binary size.
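
Such an initialiser is tedious to write by hand; a tiny generator program along the following lines (a hypothetical helper, not part of the original post) could be used to emit it:

/**
 * gen_buf.c
 *
 * Hypothetical helper: print a C initialiser containing TOTAL bytes of
 * alternating 'y' and '\n', suitable for pasting into yes-prebuild_buffer.c.
 **/

#include <stdio.h>

#define TOTAL 8192

int main(){
  int i;
  printf("char buf[%d] = {\n", TOTAL);
  for(i = 0; i < TOTAL; i += 2){
    /* Eight 'y', '\n' pairs per output line to keep it readable. */
    printf("  'y', '\\n',");
    if((i / 2) % 8 == 7) printf("\n");
  }
  printf("};\n");
  return 0;
}

Its output can then be pasted straight into the source above.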

yes-setbuf

With this implementation, I was wondering whether the kernel was buffering the output to stdout (as it sometimes does with network and disk I/O). The idea here was to override the default with setvbuf, supplying our own buffer and full buffering (_IOFBF):

/**
 * yes-setbuf.c
 *
 * The idea for this modification is to control stdout buffering with setvbuf().
 **/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  setvbuf(stdout, buf, _IOFBF, TOTAL);
  while(write(1, buf, TOTAL));
  return 1;
}
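
One caveat worth noting: setvbuf only affects the stdio stream, and the loop above calls write(2) directly, bypassing stdio entirely, so the buffer mode most likely has no effect here. A variant that actually pushes the data through the stdio buffer (a sketch for comparison, not one of the tested programs) might look like this:

/**
 * yes-setbuf-stdio.c
 *
 * Hypothetical variant: route output through fwrite() so that the buffer
 * installed with setvbuf() is actually used.
 **/

#include <stdio.h>

#define TOTAL 8192

int main(){
  static char out[TOTAL]; /* stdio buffer handed to setvbuf()  */
  static char buf[TOTAL]; /* payload of repeated "y\n" pairs   */
  int i;
  for(i = 0; i < TOTAL; i += 2){
    buf[i] = 'y';
    buf[i + 1] = '\n';
  }
  setvbuf(stdout, out, _IOFBF, TOTAL);
  while(fwrite(buf, 1, TOTAL, stdout) == TOTAL);
  return 1;
}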

yes-reduce_branch

An additional line of enquiry was whether we were paying any cost for the while loop, specifically in branching:

/**
 * yes-reduce_branch.c
 *
 * The idea for this modification is to reduce the branching caused by the main
 * loop.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(1){
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
  }
  return 1;
}

Of course, the big assumption here is that there isn't already a tonne of branching inside write() or any calls it makes - which you can be sure there is.

yes-aligned

One thing we could be paying for is unaligned memory - i.e. our data straddling two pages instead of sitting in just one. We can therefore allocate a page-aligned buffer and test this:

/**
 * yes-aligned.c
 *
 * The idea for this modification is to force memory alignment.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = aligned_alloc(4096, TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}

yes-vmsplice

This was a recommendation from the Reddit thread that was completely off my radar - and it alone made this exercise very much worth it. It turns out that vmsplice is specifically designed for mapping user memory into a pipe.

/**
 * yes-vmsplice.c
 *
 * The idea for this modification is to force vmsplicing.
 **/

#define _GNU_SOURCE
#define __need_IOV_MAX

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <limits.h>

#define LEN 2
#define TOTAL (1 * 1024 * 1024)
#define IOVECS IOV_MAX

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  int i;
  struct iovec iov[IOVECS];
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  for(i = 0; i < IOVECS; i++){
    iov[i].iov_base = buf;
    iov[i].iov_len = TOTAL;
  }
  fcntl(1, F_SETPIPE_SZ, TOTAL);
  while(vmsplice(1, iov, IOVECS, 0));
  return 1;
}

Spoiler: it's fast. And there are even hints that it could be faster still!

yes-single_char

One final idea was to see whether putc() performs fewer checks than write(), meaning less branching and therefore higher throughput:

/**
 * yes-single_char.c
 *
 * The idea for this modification is to keep everything in cache.
 **/

#include <stdio.h>
#include <fcntl.h>

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(1){
    putc('y', stdout);
    putc('\n', stdout);
  }
  return 1;
}

Compiling & Running

All of the C programs were compiled with the following flags:

program=$1
echo "[>>] Building $program ..."
gcc -O2 -pipe -march=native -mtune=native $program.c -o $program 2> /dev/null
echo "[??] Build size: $(ls -sh $program)"

Note that the gcc version is as follows:

$ gcc --version
gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The programs were then run with the following, which reports the average throughput over 1000GB (~1TB) of data:

program=$1
echo "[>>] Testing $program ..."
$program | pv -a -s 1000G -S > /dev/null

Note that the Reddit thread correctly points out that pv itself becomes a bottleneck at higher throughputs, as it also makes kernel calls. Still, it's a reasonably good method for comparing the programs, even if it may not represent the absolute maximum throughput.
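
If pv's overhead ever becomes a concern, one alternative (a hypothetical sketch, not used for the results below) is a minimal sink that simply read()s from stdin and reports the average rate itself:

/**
 * sink.c
 *
 * Hypothetical throughput sink: read a fixed amount of data from stdin and
 * report the average rate, avoiding pv's own per-call overhead.
 **/

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (1 << 20)        /* 1 MiB read buffer    */
#define TARGET (100ULL << 30)  /* stop after ~100 GiB  */

int main(){
  static char buf[CHUNK];
  unsigned long long total = 0;
  ssize_t n;
  struct timespec t0, t1;
  double secs, gib;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  while(total < TARGET && (n = read(0, buf, CHUNK)) > 0){
    total += (unsigned long long)n;
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  gib = total / (double)(1ULL << 30);
  fprintf(stderr, "%.2f GiB in %.2fs = %.2f GiB/s\n", gib, secs, gib / secs);
  return 0;
}

It could be used in place of pv, e.g. ./yes-vmsplice | ./sink.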

Results

Without any further delay, here are the results:

Approach              Binary Size   Speed (GiB/s)
yes                   N/A           4.17
yes-orig              24K           3.94
yes-prebuild_buffer   32K           3.9
yes-setbuf            24K           3.6
yes-reduce_branch     24K           3.74
yes-aligned           24K           3.63
yes-vmsplice          24K           29.8
yes-single_char       24K           0.184

Yes, you read that correctly. Using vmsplice gives roughly 7x the throughput of GNU's yes - incredible!


  1. Pun intended.