
Fast Yes

Background

I came across a very interesting Reddit thread on essentially printing y\n as fast as possible. The baseline here is GNU's implementation of yes, which is already exceptionally fast.

kjensenxz prefaces this with:

Compared to other Unices, GNU is outrageously fast. NetBSD's is 139MiB/s, FreeBSD, OpenBSD, DragonFlyBSD have very similar code as NetBSD and are probably identical, illumos's is 141MiB/s without an argument, 100MiB/s with. OS X just uses an old NetBSD version similar to OpenBSD's, MINIX uses NetBSD's, BusyBox's is 107MiB/s, Ultrix's (3.1) is 139 MiB/s, COHERENT's is 141MiB/s.

What got me interested was the question: "can we do better?". As it turns out, yes 1!

Test Machine

My test machine is as follows:

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:                        3
CPU MHz:                         2600.000
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5199.98
Virtualisation:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB

The fans were on maximum, but it's an old and tired machine! Each test was given some burn-in time, but bear in mind that this machine was also running lots of other random rubbish. These values are not going to be precise, but should give a rough idea of relative performance.

Programs

Here we will be testing a variety of programs, listing as much information about each as possible.

yes

This is the baseline yes currently installed on the test machine. The version information is as follows:

$ yes --version
yes (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
Licence GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie.

yes-orig

This is the original "iteration 4" from the Reddit post:

/**
 * yes.c
 *
 * Source: https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}

This is ultimately the implementation we are trying to beat.

yes-prebuild_buffer

I wondered whether there would be any difference between dynamically allocating the buffer and having the buffer already set up inside the binary. My thinking was that it might be loaded into the same memory each time, and therefore use a single page of cache.

/**
 * yes-prebuild_buffer.c
 *
 * The idea for this modification is to pre-build the buffer used.
 **/

#include <unistd.h>

#define LEN 2
#define TOTAL 8192

char buf[TOTAL] = {
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
  /* [..] Many duplicates here! [..] */
  'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n', 'y', '\n',
};

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(write(1, buf, TOTAL));
  return 1;
}

The variable buf was of course filled with TOTAL bytes of alternating y and \n values. This had the effect of increasing the binary size.
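
Such an initialiser is tedious to write by hand; a tiny generator program along the following lines (a hypothetical helper, not part of the original post) could be used to emit it:

/**
 * gen_buf.c
 *
 * Hypothetical helper: print a C initialiser containing TOTAL bytes of
 * alternating 'y' and '\n', suitable for pasting into yes-prebuild_buffer.c.
 **/

#include <stdio.h>

#define TOTAL 8192

int main(){
  int i;
  printf("char buf[%d] = {\n", TOTAL);
  for(i = 0; i < TOTAL; i += 2){
    /* Eight 'y', '\n' pairs per output line to keep it readable. */
    printf("  'y', '\\n',");
    if((i / 2) % 8 == 7) printf("\n");
  }
  printf("};\n");
  return 0;
}

Its output can then be pasted straight into the source above.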

yes-setbuf

With this implementation, I was wondering whether the kernel was buffering the output to stdout (as it sometimes does with network and disk I/O). The idea here was to override the default with setvbuf, supplying our own buffer and full buffering (_IOFBF):

/**
 * yes-setbuf.c
 *
 * The idea for this modification is to control stdout buffering with setvbuf().
 **/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  setvbuf(stdout, buf, _IOFBF, TOTAL);
  while(write(1, buf, TOTAL));
  return 1;
}
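
One caveat worth noting: setvbuf only affects the stdio stream, and the loop above calls write(2) directly, bypassing stdio entirely, so the buffer mode most likely has no effect here. A variant that actually pushes the data through the stdio buffer (a sketch for comparison, not one of the tested programs) might look like this:

/**
 * yes-setbuf-stdio.c
 *
 * Hypothetical variant: route output through fwrite() so that the buffer
 * installed with setvbuf() is actually used.
 **/

#include <stdio.h>

#define TOTAL 8192

int main(){
  static char out[TOTAL]; /* stdio buffer handed to setvbuf()  */
  static char buf[TOTAL]; /* payload of repeated "y\n" pairs   */
  int i;
  for(i = 0; i < TOTAL; i += 2){
    buf[i] = 'y';
    buf[i + 1] = '\n';
  }
  setvbuf(stdout, out, _IOFBF, TOTAL);
  while(fwrite(buf, 1, TOTAL, stdout) == TOTAL);
  return 1;
}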

yes-reduce_branch

An additional line of enquiry was whether we were paying any cost for the while loop, specifically in branching:

/**
 * yes-reduce_branch.c
 *
 * The idea for this modification is to reduce the branching caused by the main
 * loop.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(1){
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
    write(1, buf, TOTAL);
  }
  return 1;
}

Of course, the big assumption here is that there isn't already a tonne of branching inside write() or any calls it makes - which you can be sure there is.

yes-aligned

One thing we could be paying for is unaligned memory - i.e. our data straddling two pages instead of sitting in just one. We can therefore allocate a page-aligned buffer and test this:

/**
 * yes-aligned.c
 *
 * The idea for this modification is to force memory alignment.
 **/

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LEN 2
#define TOTAL 8192

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = aligned_alloc(4096, TOTAL);
  int bufused = 0;
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  while(write(1, buf, TOTAL));
  return 1;
}

yes-vmsplice

This was a recommendation from the Reddit thread that was completely off my radar - and it alone made this exercise very much worth it. It turns out that vmsplice is specifically designed for mapping user memory into a pipe.

/**
 * yes-vmsplice.c
 *
 * The idea for this modification is to force vmsplicing.
 **/

#define _GNU_SOURCE
#define __need_IOV_MAX

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <limits.h>

#define LEN 2
#define TOTAL (1 * 1024 * 1024)
#define IOVECS IOV_MAX

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  char yes[LEN] = {'y', '\n'};
  char* buf = malloc(TOTAL);
  int bufused = 0;
  int i;
  struct iovec iov[IOVECS];
  while(bufused < TOTAL){
    memcpy(buf + bufused, yes, LEN);
    bufused += LEN;
  }
  for(i = 0; i < IOVECS; i++){
    iov[i].iov_base = buf;
    iov[i].iov_len = TOTAL;
  }
  fcntl(1, F_SETPIPE_SZ, TOTAL);
  while(vmsplice(1, iov, IOVECS, 0));
  return 1;
}

Spoiler: it's fast. And there are even hints that it could be faster still!

yes-single_char

One final idea was to see whether putc() performs fewer checks than write(), meaning less branching and therefore higher throughput:

/**
 * yes-single_char.c
 *
 * The idea for this modification is to keep everything in cache.
 **/

#include <stdio.h>
#include <fcntl.h>

/**
 * main()
 *
 * The main entry function.
 *
 * @return The exit status.
 **/
int main(){
  while(1){
    putc('y', stdout);
    putc('\n', stdout);
  }
  return 1;
}

Compiling & Running

All of the C programs were compiled with the following flags:

program=$1
echo "[>>] Building $program ..."
gcc -O2 -pipe -march=native -mtune=native $program.c -o $program 2> /dev/null
echo "[??] Build size: $(ls -sh $program)"

Note that the gcc version is as follows:

$ gcc --version
gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The programs were then run with the following, which reports the average throughput over 1000GB (~1TB) of data:

program=$1
echo "[>>] Testing $program ..."
$program | pv -a -s 1000G -S > /dev/null

Note that the Reddit thread correctly points out that pv itself becomes a bottleneck at higher throughputs, as it also makes kernel calls. Still, it's a reasonably good method for comparing the programs, even if it may not represent the absolute maximum throughput.
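
If pv's overhead ever becomes a concern, one alternative (a hypothetical sketch, not used for the results below) is a minimal sink that simply read()s from stdin and reports the average rate itself:

/**
 * sink.c
 *
 * Hypothetical throughput sink: read a fixed amount of data from stdin and
 * report the average rate, avoiding pv's own per-call overhead.
 **/

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (1 << 20)        /* 1 MiB read buffer    */
#define TARGET (100ULL << 30)  /* stop after ~100 GiB  */

int main(){
  static char buf[CHUNK];
  unsigned long long total = 0;
  ssize_t n;
  struct timespec t0, t1;
  double secs, gib;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  while(total < TARGET && (n = read(0, buf, CHUNK)) > 0){
    total += (unsigned long long)n;
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  gib = total / (double)(1ULL << 30);
  fprintf(stderr, "%.2f GiB in %.2fs = %.2f GiB/s\n", gib, secs, gib / secs);
  return 0;
}

It could be used in place of pv, e.g. ./yes-vmsplice | ./sink.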

Results

Without any further delay, here are the results:

Approach              Binary Size   Speed (GiB/s)
yes                   N/A           4.17
yes-orig              24K           3.94
yes-prebuild_buffer   32K           3.9
yes-setbuf            24K           3.6
yes-reduce_branch     24K           3.74
yes-aligned           24K           3.63
yes-vmsplice          24K           29.8
yes-single_char       24K           0.184

Yes, you read that correctly. Using vmsplice gives roughly 7x the throughput of GNU's yes - incredible!


  1. Pun intended.