On HackerNews, somebody posted this article from 1995! One of the interesting future predictions is for 128-bit computing:
```
FORECASTS for 64->128-bit transition:
1) If memory density continues to increase at the same rate,
   and virtual memory pressure retains the 4:1 ratio, and we think we've just
   added 32 more bits, to be consumed 2 bits/3 years, we get:
       3*32/2 = 48 years
   and I arbitrarily pick 1995 as a year when:
   a) There was noticable pressure from some customers for 4GB+
      physical memories, and a few people buying more, in "vanilla"
      systems.
   b) One can expect 4 vendors to be shipping 64-bit chips,
      i.e., not a complete oddity.
Hence, one estimate would be 1995+48 = 2043 to be in leading edge of
64->128-bit transition, based on *physical memory* pressure.
That is: the pressure comes from the wish to conveniently address the
memory that one might actually buy.
```
I then replied with the following about reasons for why 128 bit computing may see various pressures by 2043:
I think 128-bit computers will come around eventually, despite it having been declared that 64-bit is “enough”. Some pressures may come from:
- Memory addressing - As the article suggests, addressing a large amount of address space. Not just RAM, but disk control at the bit/byte level (something solid-state drives may enable and new filesystems may take advantage of). There may also be applications where you want an exabyte disk as low-speed RAM.
- Multi-byte processing - accelerating instruction sets like AVX have shown the power of processing multiple bytes at a time. One can imagine that wider registers would accelerate these workloads and would allow more multi-byte operations to happen in parallel.
- Gaming/simulation - We have seen quite a few examples where physics in games and simulations has broken down due to the inaccuracy of `double` for large values. I believe Minecraft physics, for example, used to become extremely unstable near the world border.
- Hashing - With `int32_t` and large amounts of data, you will see a lot of collisions. `int64_t` collides less often, but collisions are still likely. `int128_t` collisions are rarer but still possible. With `int256_t` (`long long` on a 128-bit processor) they would be highly unlikely. Being able to compare hashes in just a few clock cycles would be awesome.
- Custom instructions - When programs can define a custom instruction to speed up computation, 128 bits, or 16 bytes, could even be enough to contain the custom instruction and the payload.
These are just things I’ve noticed; I imagine there are others too. The prediction of 2043 still seems quite realistic, and I wouldn’t be surprised if we beat it.
I was quite disappointed to see many Linux distros give up on 32-bit support because it was too much effort to maintain. It probably points towards some crappy code that is highly dependent on the platform.
I’ll address a few of these points here…
Regarding the point about hashing: after my own experiments the other day, I realised that larger hashes are important for reducing hash collisions, and larger integers reduce the computation needed to produce them. Anyway, I open-sourced my experiments so other people can play with this.
One exciting thing 128-bit computing could enable is custom instructions. This part was not so suitable for a comment over on HackerNews, as it requires a lot of unproven speculation.
Firstly, most mixed-width instructions could then feasibly fit into a single register. This would mean they could be processed more quickly (in theory).
Secondly would be the possibility of writing a small in-place VM. You would want the first two bytes to define the custom instruction to be run and the remaining 14 to act as the VM, such that the CPU looks for the pattern `OOxx xxxx xxxx xxxx`. The instruction could look as follows:
```
+--8bit--+--8bit--+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+
    v        v       v      v      v      v      v      v
[opcode] [opcode] [byte] [byte] [byte] [byte] [byte] [byte]

+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+
   v      v      v      v      v      v      v      v
[byte] [byte] [byte] [byte] [byte] [byte] [byte] [byte]
```
You would treat the `[byte]` bytes in pairs, where the first is the VM operation and the second is data (`<vmop, data>`) - but they would be split as follows:
```
+--8bit--+--8bit--+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+
    v        v       v      v      v      v      v      v
[opcode] [opcode] [vmop] [vmop] [vmop] [vmop] [vmop] [vmop]

+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+-8bit-+
   v      v      v      v      v      v      v      v
[vmop] [data] [data] [data] [data] [data] [data] [data]
```
This means that the resulting data can easily be obtained by accessing the lower register. This would likely be the easiest way to pass parameters into the instruction.
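As a sketch, the layout above maps naturally onto a 16-byte struct (the type name is my own; this mainly checks the arithmetic of the field widths):

```c
#include <stdint.h>

/* Illustrative layout of the proposed instruction: two opcode bytes,
 * seven vmops, then seven data bytes. The data bytes fill all but one
 * byte of the low 64-bit half, so results can be read from the lower
 * register. */
typedef struct {
    uint8_t opcode[2]; /* pattern the CPU looks for ("OO") */
    uint8_t vmop[7];   /* VM operations */
    uint8_t data[7];   /* operands / results */
} custom_insn_t;

/* Must fit exactly into one 128-bit register. */
_Static_assert(sizeof(custom_insn_t) == 16, "must be 128 bits wide");
```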
A `vmop` would itself be split into an instruction and an offset. The 4-bit instruction could be one of 16 instructions (note I have not checked if this is really suitable, but it seems like a good start):
| Bit pattern | Instruction | Description |
|---|---|---|
| `0b0000` | `hlt` | Halt VM, continue execution of program |
| `0b0001` | `jmp` | Jump to a given offset in any case |
| `0b0010` | `jz` | Jump if zero to offset |
| `0b0011` | `jnz` | Jump if not zero to offset |
| `0b0100` | `movi` (this) | Move this data to offset location |
| `0b0101` | `mova` (that) | Move data at offset to this location |
| `0b0110` | `noti` (this) | Bitwise NOT on this data, store at offset |
| `0b0111` | `nota` (that) | Bitwise NOT on that data, store here |
| `0b1000` | `ori` (this) | Bitwise OR on this data, store at offset |
| `0b1001` | `ora` (that) | Bitwise OR on that data, store here |
| `0b1010` | `andi` (this) | Bitwise AND on this data, store at offset |
| `0b1011` | `anda` (that) | Bitwise AND on that data, store here |
| `0b1100` | `xori` (this) | Bitwise XOR on this data, store at offset |
| `0b1101` | `xora` (that) | Bitwise XOR on that data, store here |
| `0b1110` | `addi` (this) | Add this data to offset location |
| `0b1111` | `adda` (that) | Add data at offset to this location |
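Decoding a `vmop` under this scheme is a couple of shifts and masks. A minimal sketch, assuming the high nibble holds the instruction and the low nibble the offset (the split could equally go the other way):

```c
#include <stdint.h>

/* Hypothetical vmop decoding: 4-bit instruction in the high nibble,
 * 4-bit offset (0..15, indexing into the 16-byte register) in the low. */
typedef struct {
    uint8_t instr;  /* one of the 16 table entries, e.g. 0b0100 = movi */
    uint8_t offset; /* byte offset the instruction operates on */
} vmop_t;

static vmop_t decode_vmop(uint8_t byte) {
    vmop_t op = { (uint8_t)((byte >> 4) & 0x0F), (uint8_t)(byte & 0x0F) };
    return op;
}
```

For example, `decode_vmop(0x4A)` would yield `movi` (`0b0100`) with offset `10`.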
A subtraction is just a negative `add`, a multiplication is just multiple adds, and a division is multiple subtractions.
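A quick sketch of that reduction, using 8-bit values as in the VM (the function names are mine):

```c
#include <stdint.h>

/* Subtraction as adding the two's-complement negation: a + (-b). */
static uint8_t vm_sub(uint8_t a, uint8_t b) {
    return (uint8_t)(a + (uint8_t)(~b + 1)); /* wraps mod 256 */
}

/* Multiplication as repeated addition. */
static uint8_t vm_mul(uint8_t a, uint8_t b) {
    uint8_t acc = 0;
    while (b--)
        acc = (uint8_t)(acc + a);
    return acc;
}
```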
NOTE: A really powerful idea here that may be missed is that a `vmop` can be overwritten on the fly, and so can the original `opcode`, allowing a different VM or special instruction to be run.
WARNING: It may be desirable to throw an error flag somewhere if some maximum number of cycles is reached, rather than getting stuck in an infinite loop. The error flag is required to let the program know that this occurred. An error may also be required to let the program know that an invalid instruction was requested (i.e. the `opcode` is overwritten with something impossible).
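As a sketch of that safeguard, here is a toy interpreter loop with a hard cycle budget (all names and the budget of 256 are my own, illustrative choices; only `hlt` and `jmp` are modelled):

```c
#include <stdint.h>

enum vm_status { VM_OK, VM_ERR_CYCLES, VM_ERR_BADOP };

#define MAX_CYCLES 256

/* Run a sequence of vmops, stopping with an error flag instead of
 * spinning forever or executing an invalid instruction. */
static enum vm_status run_vm(const uint8_t *vmops, int n_ops) {
    int pc = 0;
    for (int cycles = 0; cycles < MAX_CYCLES; cycles++) {
        if (pc < 0 || pc >= n_ops)
            return VM_ERR_BADOP;          /* jumped out of bounds */
        uint8_t instr  = (uint8_t)(vmops[pc] >> 4);
        uint8_t offset = (uint8_t)(vmops[pc] & 0x0F);
        switch (instr) {
        case 0x0: return VM_OK;           /* hlt: resume the program */
        case 0x1: pc = offset; break;     /* jmp */
        default:  return VM_ERR_BADOP;    /* not modelled / invalid */
        }
    }
    return VM_ERR_CYCLES;                 /* budget exhausted */
}
```

A one-instruction program `{0x10}` jumps to itself forever and comes back with `VM_ERR_CYCLES`, while `{0x00}` halts immediately with `VM_OK`.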
I think the idea is potentially cool, but it would need testing to see if it has any real legs. It’s not clear whether useful programs can be built in the few available bytes, or whether any real speed-up would be gained from using a tiny VM like this. The fact that it can do tonnes of computation without any fetches into RAM, etc., may yield interesting results.
If somebody wants to pick this project up, I think the first step would be to build a small assembler and VM to see what kinds of programs could potentially be written within a VM of just 7 instructions (maybe 15 with a `long long` 256-bit-wide register). From there, some potentially interesting programs could be explored.
Let me know what you think!