So, this is the first new blog post in quite a while. This project had been on the backburner for a lot of people for the past year, although significant progress has been made. The current team of people is roughly the following:
PhiNotPi (myself) – Delegator in Chief. Created the Cogol language/compiler.
K Zhang – The hardware guy who has assembled the complete computer in VarLife.
El’endia Starman – Developed the online VarLife simulator and QFTASM interpreter.
Muddyfish – Is developing a new-and-improved language and compiler.
Cows quack – Is helping develop the new language and compiler.
Quartata – Is working on an LLVM backend.
Mego – Did some work on a circuit simulator, but is now working on a VarLife to Life converter.
The next few blog posts will be dedicated to individual subjects to summarize the current status of the project, which is actually very close to completion.
QFTASM (Quest for Tetris Assembly) is the “human readable” representation of the instructions that will be performed by our processor. El’endia has written an interpreter for QFTASM at his website, complete with documentation and example code.
Since we are using 4 bits for the opcode, there is a limit of 16 possible operations, of which we have currently assigned 11. Below is a more detailed explanation of each command, taken from El’endia’s documentation.
MNZ [test] [value] [dest] – Move if not zero; sets [dest] to [value] if [test] is not zero.
MLZ [test] [value] [dest] – Move if less than zero; sets [dest] to [value] if [test] is less than zero.
ADD [val1] [val2] [dest] – Add; adds [val1] to [val2] and stores the result in [dest].
SUB [val1] [val2] [dest] – Subtract; subtracts [val2] from [val1] and stores the result in [dest].
AND [val1] [val2] [dest] – Bitwise AND; bitwise ANDs together [val1] and [val2] and stores the result in [dest].
OR [val1] [val2] [dest] – Bitwise OR; bitwise ORs together [val1] and [val2] and stores the result in [dest].
XOR [val1] [val2] [dest] – Bitwise XOR; bitwise XORs together [val1] and [val2] and stores the result in [dest].
ANT [val1] [val2] [dest] – Bitwise AND-NOT; bitwise ANDs together [val1] and (NOT [val2]) and stores the result in [dest].
SL [val1] [val2] [dest] – Shift left; shifts [val1] left by [val2] bits and stores the result in [dest].
SRL [val1] [val2] [dest] – Shift right (logical); shifts [val1] right by [val2] bits and stores the result in [dest]. Doesn’t preserve sign.
SRA [val1] [val2] [dest] – Shift right (arithmetic); shifts [val1] right by [val2] bits and stores the result in [dest], while preserving sign.
As of right now, we do not have specialized JUMP commands or an unconditional MOV command, because all of those can be represented as conditional moves. We might eventually add an unconditional move command if we determine that it will provide a good performance improvement.
For example, an unconditional jump is typically represented as MLZ -1 28 0 to jump to instruction 28.
Addressing modes are represented as letter prefixes to the values. For example, the value 753 represents a hardcoded value (mode 0), while the value A753 represents the contents of address 753 (mode 1). Mode 2 is represented as B753, and mode 3 is represented as C753.
Input / Output
Input and output are represented using hard-coded RAM addresses, which serve as “ports” to other devices, like the “screen” or a “button.” This means that there are no special opcodes for I/O. With the interpreter, input is provided by manually editing a specified RAM address, and output is read from a specified address.
The interpreter requires that every line of QFTASM be prefixed by a line number, with one instruction per line. This is more for the convenience of the programmer, so that I can tell where jumps go.
Comments can be added to a line by adding a semicolon, after which everything is ignored. Comments can be hand-written or auto-generated by the compiler.
Among the opcodes that may be added to QFTASM in the future are:
NOT, which can be currently represented as ANT 65535 [value].
NAND, NOR, XNOR, which currently require two steps, but which we haven’t needed yet.
Something that’s related to multiplication, although we will probably just implement the shift-and-add algorithm in QFTASM.
Something that’s related to division or modulus, although we will probably just implement relevant algorithms in QFTASM.
The following is a very inefficient program that calculates prime numbers through trial subtraction. It was compiled by my compiler, which will be the subject of a later post. This program is runnable on the interpreter.
One of the more tedious steps of building the computer is the actual assembly, going from simple circuits to more complex ones. We have decided to actually leapfrog over this step and go straight to architectural design for a few reasons:
Circuit simulator is still a work in progress, plagued by disease and other unfortunate circumstances.
By deciding what our architecture is going to be, we will know exactly what we need to build and how we need to connect them.
Work is parallelized.
We have welcomed a couple additional people onto this project. User 7H3_H4CK3R has participated in most of the architectural design decisions featured on this page, and user Mego has taken over work on the circuit simulator.
As mentioned previously, we will be using an asynchronous design to speed up our processor. We will accomplish this by having a clock signal accompany the data everywhere it goes. This clock signal will serve as both read and write signals for the various memory devices (ROM, RAM, synchronizers, etc.).
There was much debate over the type of instructions that our processor would use. There were several options:
We could use a Transport Triggered Architecture (TTA), in which case instruction would be of the form (source, destination). We would hard-wire a lot of addresses to perform specific operations upon reading/writing, similar to the Wireworld computer.
We could have created a distinction between registers and RAM, and we could have an accumulator. Then we could have load/store commands to transfer between them. Then we could have instructions of the form (opcode, value) that could either load a certain RAM address, store a certain RAM address, or perform math, such as adding a given register to the accumulator.
We could choose to not have registers, and just use RAM for everything. Instructions would be of the form (opcode, value, value). We could either move between RAM addresses (one of which would be mapped to the program counter), or perform math in which the result would be stored in a hard-wired default RAM address (which would act like an accumulator).
We could avoid having a default RAM address by having instructions of the form (opcode, value, value, destination) so that a destination could be assigned to every result. This could also allow for stuff such as conditional moves.
We chose the fourth option because, the more detailed the instructions were, the fewer instructions we need, which I believe will speed up execution. Take an example of adding two numbers from RAM and storing the result in RAM. With TTA, that would require three moves (for the two operands and the result). With the third option, it might take two (add arguments, then move result) or three instructions (load arguments, store result). With the fourth option, it takes only one.
The fact that everything will be memory-mapped to RAM (including the PC), and that there are no special registers, make this design relatively simple.
It is beneficial to have several addressing modes for operands of each instruction. Our modes just involve sending the data through the RAM read cycle the desired number of times, which means they are easy to implement. These addressing modes take on slightly different meanings when used as a value (first two arguments of an instruction) or as a destination (the third argument), but the hardware is exactly the same.
Mode 0 is any constant hardcoded into ROM. This can be either a hardcoded value, such as an instruction that always adds 5 or always writes 17, or it can be a hardcoded destination, such as an instruction that always writes to address 42.
Mode 1 is any value that results from a single read of RAM. This would be used when we desire to add the value in RAM address 13, or if we need to read the value of address 37 to determine where our destination is.
Mode 2 represents dereferencing. If address 7 contains a pointer to the address in which our data is stored, we would use 7 as an address to read from RAM, use that data as an address to read from RAM again, and use that data as an operand in our ALU.
Mode 3 might not be used (or might not even exist, once we actually build it), but it would act as super-dereferencing. This might be useful for 2D arrays.
We have decided that 16-bit words are sufficient for our data, which will allow us to store values in the range 0 to 65535 for addressing or values -32768 to 32767 for numeric data using 2’s complement. Opcodes will probably be 4 bits, allowing for 16 instructions. Addressing modes will be 2 bits. This means that each instruction will be 4+2+16+2+16+2+16 = 58 bits long.
Stages of Execution
The steps involved in the execution of a single instruction are roughly as follows:
The address of the next instruction leaves the program counter, immediately after which the program counter is incremented. This post-increment allows for “jumps” to point directly towards the desired jump destination, rather than the instruction immediately prior to it.
The instruction address is sent to the ROM and the instruction is retrieved.
The instruction is split into its parts, being (opcode, mode, value, mode, value, mode, destination). The opcode is stored in synchronizers until the data is ready. Each of the three arguments undergoes the following process.
The mode and value (or destination) are sent to a counter.
If the counter equals zero, the current value is final, so send it to the synchronizer(s) in the next steps in the process.
Else, send the value to the read queue*, and decrement the counter. The read queue then reads the data from RAM using the value as the address, and sends the new data back to the counter. The counter cycles back to sub-step 2 above.
Once the opcode and two values are available, the opcode is used in a selector for the ALU. It selects which basic operation to perform, sending the clock signal along one of many possible paths. The ALU returns a result.
The conditional is evaluated to determine if the data should be written. For most opcodes (like ADD, OR, etc.) the conditional is automatically true. For special conditional commands, the first value is tested in one of two manners. We currently plan on having “val < 0” (sign bit) and “val != 0” tests.
If the conditional is false, return to step 1 for the next instruction.
Else, move the result to the write synchronizer, which also contains the desired write address from the end of step 3 above. The data is written, and we return to step 1.
*The read queue is one of the more “magical” components where I don’t really know how/if it will work. Ideally, it would do some parallelization between the three values. Worst-case scenario is that it ends up being incredibly inefficient without parellelization.
In order to speed up our processor, we will do a little bit of pipelining. In order for the computer to function nicely, we have to wait until after an instruction has written to RAM for the next instruction to read from RAM. It is possible do some overlapping, so that one instruction is being retrieved from ROM while another is being processed. This eliminates the delay that would be caused by having a program that is several thousand instructions long. On the other hand, this creates a branch delay slot, where the instruction immediately after a jump is executed regardless of the jump.
The GoL computer is now in the architecture design phase. There’s plenty of ideas floating around, but I don’t really have material to make a blog post about it yet, so I’m doing something different, A couple days ago, there was a sale on .xyz domains. To be more specific, each domain was $0.02 for the first year. (The sale price has since risen to $0.22, which isn’t as fun.) So, I decided #yolo and bought a couple domains. Some were for personal/friend use ($0.04 total), but I also noticed that their website has a list of “regularly priced premium” domains that were still available, specifically domains that were really short or were common words. I noticed that a couple of the premium categories were almost entirely sold out, so I spent $0.10 to buy the last ones. To be specific, these domains were:
independents (common word)
As of writing, each of those categories of domain is completely sold out. I don’t really know if any of these are particularly valuable, but I figure I’m going to try to sell them over the course of the next year or so. I think it’s worth the experience, and if I don’t manage to sell them, well I’ve wasted $0.10 on lesser things before. They’re listed for sale in a variety of places for a variety of prices, but if you’re reading this and decide that you want one, I advise contacting me directly.
The following is a somewhat cynical viewpoint on domain name trading.
Unless some glaringly miraculous opportunity arises, I think this is about it for my domain name trading. I’m no economist, but from what I can tell, the domain market has pretty much settled down. Every 4-letter .com has been registered, and while a lot of them are used for legitimate purposes (and thus won’t likely be sold), all the meaningless letter combinations just trade hands for prices that I don’t feel are deeply rooted in any tangible value that the domains have. Prices seem to be steeped in voodoo and superstition more than anything else (how to determine the “brandability” of an arbitrary 4-letter combination?). This is quite true in numerical domains, where there is nothing to determine relative value other than Chinese superstition (8 is lucky, 4 and 0 are unlucky). In normal investing, the money that the investors spend helps the company hire people, buy tangible assets, etc., which will help the company grow. On the other hand, how does a domain investor make money other than getting in the way of the domains ultimate owner, by registering/parking it first?
Imagine if there were a river on which trading boats often passed. A person buys a plot of land on the river, and ties a rope from one bank across to the other, only allowing boats through when they pay a toll. Now, many people might be willing to pay the toll, but that doesn’t mean that the person who’s collecting the money actually deserves any of it. Similarly, a company might be willing to pay $10,000 for a premium domain, but that doesn’t mean that the person who “ninja’d” the domain before them actually deserves any of it. I understand what role these investors play: using the concept of capitalism to distribute domain names to the most “deserving” recipients (large businesses who can afford to pay the cost to buy it). All this money, however, just benefits these miscellaneous third parties, who never used their domain names to host any meaningful content on the web. If anyone, it should be going either to ICANN, although I’m not quite sure what they’d do with it all, or to a previous “legitimate user” of the domain.
In the case of domain names, I’m not convinced that the current system is effective. The number of parked domains names is ridiculous. A domain investor only needs to sell a couple percent of his domains each year to turn a profit (given that the renewal fee is only like $9 or so per domain, so a $10,000 sale pays for parking over 1,000 other domains for a year), which leads to lots of domains that simply remained parked indefinitely. I’d also be interested to see statistics on what percent of those sales are just to other investors. I don’t know what ICANN will be able to do to alleviate this problem, but I want to see them take steps to increase the percentage of registered domains that are used to host meaningful content. Combined, all of the TLDs form one giant global namespace, and I believe it is being wasted as we speak.
Ultimately, it’s not what the domain name is, but rather what you make of it.
My own personal branding
One of reasons for deciding to spend $9 to get phinotpi.com, in particular, was that I desired to “claim my username.” I’ve wanted my own domain name and website since forever, and I decided to put it under one of my internet usernames, since this is probably my most *useful* internet presence (I checked, and the majority of domains relating to my real name are long gone. Even though it’s a “personal site” nothing on here is really “personal” enough to warrant having my real name posted anywhere on it). It allows me to freely use this domain for PPCG and programming-related stuff.
One of the two other domains I bought during the .xyz sale was phinotpi.xyz, which is currently serving as a redirect page. Since it looks like .xyz domains are more expensive to renew than .com domains, I probably won’t keep it. As of right now, I plan on keeping phinotpi.com indefinitely.
The hardest choice was which TLD I wanted to get as my main site. I know phinotpi.com isn’t the most elegant domain name, but I don’t have any better ideas at this point, and “.com” is typically considered the most demanded TLD. Other ideas, like “phinotp.io” (pio?) and “phinotpi.co” (almost 3x as expensive, and “pico”?), didn’t really speak to me.
Other people’s personal branding
I’ve also been checking out what domains/websites other PPCG members have, and I’ve found some pretty interesting/creative ones. I don’t know if I should mention any names here, though.
What do you think….
About my choice of phinotpi.com?
Of my investment choices? (I know they’re mediocre)
Processors are as slow as their slowest component. While true in conventional CPUs, that phrase takes on a whole new meaning in cellular automata. At any given moment, there are lots of little pieces of data flowing around in our computer, and in order for two pieces to properly interact, the selected pieces of data must be perfectly in sync with each other. In the previous post, I mentioned the choice between constant- and variable-timed ROM. That is actually a small part of a large decision: synchronous or asynchronous?
Synchronous: All of the data in the computer is kept in sync simply through wiring. All reads/writes have equal time, and all operations have equal time. Everything is as slow as the worst-case scenario (reading from furthest address, etc.). We don’t need to design any parts that can handle inputs at variable times, but we’ll need lots of delay wire to sync stuff up.
Asynchronous: The next instruction is executed as soon as the current one is finished. ROM/RAM data is available as soon as it is read, with smaller addresses being faster. Arithmetic is performed once both operands are available. We can make circuits shorter/faster, but need components that are capable of synchronizing pieces of data with each other, such as holding the data for some arbitrary amount of time until it is ready to be used.
An asynchronous design has the potential to cut average ROM/RAM access times in half. Given that the ROM may be several thousand tiles long, this will cut several thousand generations off each instruction’s execution time. Furthermore, it will reduce the complexity of wiring. This is why we have chosen the asynchronous paradigm for our processor.
The current plan to implement asynchronous computing is to have a second wire running parallel to each data wire. This wire will carry a “clock signal” that indicates the presence of data. For example, this clock signal will serve as the read/write signals to memory. The clock signal will always remain in sync with the data, but the path (and duration) of travel can vary between instructions. The computer will speed up when small addresses are being accessed, and will slow down for larger addresses.
There are, however, many times where two pieces of data must be put in sync with each other. This can work by storing one piece in a temporary “register,” which is read from when the second piece of data arrives. In order for an asynchronous design to work, this synchronization device has to be able to handle the two pieces of data coming in at any time (in practice, any multiple of 11 with some minimum separation between writing and reading).
Here’s how our synchronizer works: first, the incoming signal is sent through a serial-to-parallel converter, which works similarly to one in our ROM. It’s actually a more “true” converter: the data signal is sent down a sequence of wire splitters separated by delays, and then the “clock signal” passes across each wire and selects a single bit for each of the outgoing wires.
The core of the synchronizer is a set of length-44 loops, one for each bit of the word. These loops function similarly to the ones that might be in our RAM. As an example of using a loop to store information, below is an image of a memory loop constructed by El’endia:
There are four main components:
An AND-NOT gate that uses the write signal to destroy the information currently in the loop (which will happen in about 15 generation in this sim).
An OR gate which which adds the new data to the loop.
A wire splitter that repeatedly outputs the data stored in the loop.
An AND gate that is hooked up the the read signal to allow the data to exit the device (which will also happen in this sim).
In the synchronizer, the loop must be able to be read from at any given time (multiple of 11). So, we must have four electrons flowing around the loop to signify an “ON” bit. This means that the input bit (and write signal) must be tranformed into four bits in quick succession. My method for this is duplicate the signal twice: first by splitting the signal, delay one copy by 11 ticks, and joining them. Second, we split the signal, delay by 22 ticks, and join them.
The more difficult operation is the delay by 11 ticks. I decided to create a special component to accomplish this operation.
The second doubling is simpler: we split the wire and merge it using our typical components. This can be done in a 2×2 tile space, which there is probably little benefit in compressing. Once each of the bits (and write signal) have been amplified 4X, they are used to load the loops. The “ON” loops will emit a pulse every 11 ticks, with the “OFF” loops emitting nothing.
“The read signal is very dangerous and can attack at any time, so we must deal with it.”
Reading from the device requires two things to happen. First, the read signal travels across every output, using the same AND/Crossing multi-tile to select one electron from each loop. These electrons are then merged into a serial data stream, using OR gates separated by delay wire. The output electrons are in the same order as they were input.
A constructed synchronizer
Below is a synchronizer I constructed in VarLife. The serial-to-parallel converter is in columns 1 through 3. Column 5 has the 11-tick doubler, and columns 6 and 7 have the 22-tick doubler. Column 8 contains allows the write signal to reach each loop, with the loops themselves in columns 9 and 10. In columns 11 and 12, the read signal selects one bit from each loop. Finally, column 13 has the parallel-to-serial converter.
The Synchronizer in Practice
The synchronizer will most like be used in several places in the computer:
Storing one input for the ALU (arithmetic and logic unit) while the other is being retrieved from memory.
Storing the RAM address so that it can be synchronized with the larger data loops (probably 32 bits with 22-tick separation) of the RAM, so that data can be read and written at the correct starting and stopping points.
Pipelining/queuing: Putting one piece of data on hold while another one is using the device (probably ROM, maybe also RAM).
Each of these scenarios will require additional circuitry, which may include some or all of the following:
Something to keep track of whether or not new data is present. By this, I mean whether or not data has been written since the last read.
In the case of RAM, something that blocks the read signal if new data isn’t present and waits until the next loop cycle to try again.
In pipelining/queuing, something that marks the device as being “in use” or not. If the device is not in use, the write signal also serves as the read signal, so that the data may pass through (almost) immediately. If the device is in use, it waits until the other data exits the device, and uses that data’s clock signal as the read signal.
The bulk of the program for the computer will be stored in ROM, or Read-Only Memory. We chose to use ROM because it tends to be more compact than RAM, and the speed of memory access will be one of the bottlenecks of our computer (ROM will be massive, like 1000s of tiles).
We chose to use grid-based ROM. This means that data is stored in a grid, and the value of a particular row is accessed, sending down one bit per column of data. In order to do this, we needed two types of data cells:
A “false” bit will be a gate which passes the read signal through and the data signal through without affecting either of them.
A “true” bit will pass the read and data signals through, but will also output a 1 for the data signal if it receives a read signal.
It turns out a false bit is simply a wire crossing. The true bit requires more work. I had a sudden realization, that the “true” gate never has to deal with both signals coming in at once, since we are either reading the current row or some other row, but never both at the same time (which would mix the data together). So, a “true” gate is allowed to malfunction for that input combination. This is what I came up with:
The idea here is to line a bunch of these up, so that a single “read” signal reads out the entire word. By lining up the data wires, we allow for the data from different addresses to be passed down the grid.
Okay, so we have the data grid figured out. Next is addressing. Addressing is also done with a grid, but a different kind of grid. We have an address signal (each column receives 1 bit of the address) and a read signal (sent across each row to test that combination of bits for matching). A “false” gate allows the address bit through, but only allows the read signal through if the address bit is 0. A “true” gate allows the address bit through, but only allows the read signal through when the address bit is 1. I created two 1×2 multi-tiles to accomplish this.
The basic concept is that we line a lot of these addressing gates side-to-side (sharing the read line), and then the read signal can only get through all of them if the address matches at each bit. Then, that read signal goes on to access the data from that row from the data grid. The addressing gates are stacked vertically so the desired address is tested against each row of gates.
Parallel to Serial
So, we have data that is coming off of the data grid in parallel, how to we convert it to serial for use by the computer? It is actually very easy. The data is coming off in parallel, but isn’t synchronized horizontally, since it is produced as the read signal moves right-to-left. It turns out that to convert this to serial, all we have to do is OR the signals left-to-right (in the opposite direction) and the spacing between signals is automatically twice the width of the columns, which is exactly the separation we need.
Serial to Parallel
Much more difficult is the serial to parallel conversion we need for addressing. Performing a “true” conversion would be rather difficult, since we have to have timing circuits that are capable of waiting and selecting each bit to move down each wire. It turns out we can cheat. We can send the entire signal down each line, and the read signal for each row will take a “cross-section” of this signal which corresponds to the “correct” value of each bit. There’s a little more to this, though. Since the address signal starts going up each column in right-to-left order, and the read signal also travels in this direction, it would mean that our cross-section is skewed, and it will interact with the same bit (the last bit) of the address all the way across. To correct for this, we slow down the address with a 33:11 delay between each column, so that it enters the next column 33 ticks later instead of 11 ticks. This means that the read signal interacts with the last, then second-to-last, etc., then first bit of the address, which gives us the cross-section we want.
Putting Everything Together
Here is a 4-address, 3-bit-word ROM I constructed:
The table that represents this ROM is as follows, with data (appears big-endian, but is read off in reverse) on the left and address on the right.
The “read” signal that is send alongside the address should be 11 ticks behind (right-to-left) of the last bit in the address. It is possible to send the signal earlier but have delay wire that delays from entering the ROM it until after the address.
ROM in Practice
The ROM as I have it above is different than the exact configuration that we will eventually use. For example, the real version will probably have 16 address bits and who-knows-how-many bits of data. The main difference might be in the timing. The version above is faster to read from smaller addresses than larger ones, 22 ticks per address faster. This is both good and bad, because our computer wouldn’t be slowed unnecessarily when the wanted data is nearby, but it will have to deal with data coming back at a variable time. There are two solutions:
Return the read signal with the data to indicate when the data has been read. Then, other circuitry needs to be developed that can take data in at variable times and perform the operation once every piece of data is received, whenever that may be.
Reverse the order of the data grid, so that the data comes out the opposite side from where the data comes in (as opposed to the same side like above). Then, the data wire will have to come from the far end back to the computer. This effectively slows every read to the same speed.
Okay, so we have now created a notion of a standardized grid on which our circuits will be built. The next step involved a crucial design decision, one which will affect the design of every circuit in the computer. The question is, what format will the data be in? There are two big options:
Parallel: A byte of data will be represented as one electron on each of eight separate wires. The wires must be synchronized with each other.
Series: A byte of data will be represented as eight electrons on a single wire, carefully timed to indicate where the word begins and ends.
Real-world processors of course use parallel data, but they also don’t need to worry much about timings (race conditions are real issues, but they are caused by entirely different reasons). For example, if you are connecting 64 wires from one chip to another, and the outer wire is 5% longer than the inner wire, it doesn’t matter. In cellular automata, the exact distances matter a lot. If we want a 16-bit processor, that’s 16 bulky wires we would have to work around every corner (each of which will either flip their order or put them all out of sync). We would also have to expend additional effort to flip the order of bits or adjust the spacing between wires in order to get the data buses to connect to our circuits. The advantage of parallel data is processor speed (maybe?), the ability to stick with a more traditional architecture, and the ease of creating memory (maybe).
Serial data has the advantage that a change in word size requires a lot less redesign of the hardware. The only important factor is the distance between electrons, which matters for all circuits anyways. Serial-based circuits are also much smaller, since there is simply 1 wire per input stream. The downside is that serial data can sometimes be more clumsy to work with, since, by definition, circuits don’t see the entire word at once. Without a special counter, the circuits can’t tell where words begin or end. Performing “backwards” bitshifts is easy (with delay wire), but performing “forwards” bitshifts is hard (either using diagonal wiring to cut corners, or delaying everything else relative to the selected data).
We chose serial data for our project, primarily due to the vastly increased simplicity of wiring. Our decision was made in part by the difficulty of constructing a binary adder.
The Binary Adder
In order to get a better idea of how the computer will end up looking/working, I decided to build something larger and more complex, something that I didn’t know ahead of time how it would work. I chose to build a binary adder. Part of the reason I chose an adder was that the process of binary addition involves “carry over” between bits, which requires interaction between the bits of a number. Most of the other basic operations, like bitwise logic, are significantly simpler to implement.
The core of a binary adder is the full adder, which takes in three bits (A, B, and Carry-in) and outputs two bits (Sum and Carry-out). Below is a circuit diagram, but keep in mind that the order of inputs doesn’t matter.
The carry-in represents the “overflow” from the previous pair of bits, and carry-out represents the overflow from the current pair of bits. For parallel data, a set of full adders is used, with the C_out of one adder hooked up as an input to the next. Here is how a parallel adder would probably look for our computer, drawn with draw.io:
The idea behind a serial adder is much simpler: take a single full adder and loop the carry-out back around to carry-in. In real electronics, this would require a flip-flop to temporarily hold/delay the carry bit, but in our circuits the delay is simply given by the loop of wire. Serial adders, importantly, can only handle little-endian data, meaning that the least significant bit comes first. This means that the binary number 011 refers to the decimal number 6 (rather than a 3 with a leading zero).
The creation of a series adder took a reasonable amount of effort, but here was our second working prototype (after a string of failures). The first version was significantly larger. You can see the red corners of the 3×5 tile bounding box.
Below is an image showing the flow of information around this adder design. The carry loop is the length-22 loop in the bottom-right of the adder which rotates counter-clockwise, and this loop is packed with an AND, OR, and wire splitter. As information flows into the adder, both inputs are split, with one pair going to an XOR and the other pair going to an AND. The output of the XOR is split, with one half travelling straight to be XORed with the current carry value, resulting in the output value. The other half is ANDed with the carry, which is then ORed with the output of the first AND gate to produce the next value of the carry loop.
After an intense round of golfing, the following 2×3 tile design is probably final. It has the exact same connectivity, but the wires have been shortened. Since both designs are “lagless” and “linear,” this more compact version is superior in literally every way.
What Did We Learn?
We learned a couple things about how our computer will end up functioning:
It will use serial data (a sequence of electrons on a single wire to represent a word).
The spacing between electrons (well, the places where an electron could be) will be 22 ticks. I do have some potential worries about 22-tick technology, though, since it might lead to the creation of a lot more of multi-tile components whenever data needs to flow in a 22-tick loop. Some technology (graphics, global clock, etc.) might still use 11-tick technology.
We will probably use 2’s complement, since it does not require us to preset the carry-in of an adder. This might mean, however, that subtraction would be broken down into two steps: negation and addition. We would have a 2’s complement converter/command.
As I have mentioned in previous posts, it is important for all of the basic building blocks to have similar sizes and delays, so that they are compatible in larger circuits. The next level of abstraction is that of tiles, which each tile representing a single wire or gate. By holding ourselves to strict requirements for the underlying gates, we will be able to run our simulations in terms of tiles, without needing to simulate the underlying metapixels. I’ve said a few things about bounding boxes and delays, but now is time to get a lot more specific.
We have settled on 11×11 metapixel components. The figure below show the typical layout of a single gate. The top and left sides are input, and the right side is output. The red box is 11×11 (inclusive), and the red itself denotes cells that cannot be occupied by the gate. This is to provide the necessary 1-cell-wide padding to prevent interference between components. The output side is the only side that doesn’t need buffering, since buffering is provided by the input side of the adjacent component. The blue and green cells denote the mandatory wire buffers.
Every logic gate in the previous post can fit into a bounding box like this, and so can many of the wire components (the others won’t be used individually).
As far as timing is concerned, the distance between signals will be a multiple of 11 (often 22 for data), so a gate should be able to receive electrons every 11 ticks, and give the output exactly 11 ticks later. The only exceptions are the 22:11 and 33:11 delay wire units, which of course delay their output to a different multiple of 11.
When testing a new gate, it is important to include a length of wire on every side so that you can actually see if any erroneous/malformed electrons are produced.
The two main types of timing components (clocks and delay wires) must be special-built for 11-cell-based technology. Delay wires will be used to increase the delay between two components without having to increase the travel distance between them. Especially important is a 22:11 delay, which solves the checkerboard parity problem (moving an odd Manhattan distance in an even number of ticks). Clocks are used to synchronize the circuits and to create inverted gates. Ultimately, a couple large clocks will likely control the computer. Here are some delay wire units ( which were also previously featured on the “Building Blocks: Wires” post) and some small clocks:
Sometimes, components don’t fit into a single tile, but rather occupies the space of a group of tiles. There are typically two causes of this:
An 11×11 square is sometimes simply not enough space, and it’s not something that can be made with two separate tiles next to each other. We did consider larger tile sizes for our standard, like 16×16, but changing 11×11 to 16×16 would double the total area of the rest of the computer and dramatically increase the amount of empty space.
The given component can be made using existing components, but can be made significantly smaller by compacting its sub-components together. An example of this is the serial binary adder, which will be the subject of a future post.
In these cases, the component can be considered a “multi-tile” component, which fits neatly into a 11-wide grid but which can’t be divided into independent 11×11 squares. The multi-tile component will be treated as a single unit during simulation, without needing to simulate the different sub-components. Below is an example of a 33-clock in the form of a 2×1 multi-tile.
The timing of multi-tile components is more complicated, since their delays are influenced by their size and complexity. They should be able to receive input and give output every N*11 generations, for some integer N. The delay between input and output should be a multiple of 11 as well. A multi-tile component is considered “lagless” if the delay between input and output equals the distance between them. A component is “linear” if the input is aligned (either horizontally or vertically) with the output, so that the signal travels in a straight line.
Tile-Based Circuit Simulator
Conor is currently working on a tile-based circuit simulator, which will served as the next (and possibly final) platform for circuit design. His software will be able to simulate tile and multi-tile components without needing to worry about the underlying metapixel structure, allowing for easier development of complex circuits.
The second class of building blocks is where things get interesting: logic gates. In real-life digital design, the gates you would typically encounter are as follows:
With the exception of NOT (which has a single input), all of the gates listed above have two inputs and one output. Furthermore, they are symmetrical: swapping the order of inputs has no effect. Unfortunately, only a portion of these gates are available for use in our computer. The main issue we run into is that it is impossible for a gate to create electrons from nothing. This means that any gate with all zeroes input must give a zero output, immediately eliminating the possibility of NOT, NAND, NOR, and XNOR gates. To be more precise, those gates would require a clock as input to make sure that the outputs have the correct timing. For example, a NOT gate would be an XOR hooked up to a clock. At that point, however, we are no longer dealing with the basic building blocks.
There is some good news though. One gate left off the above list is the AND-NOT gate, which outputs true iff the first input is true and the second input is false. Unlike the gates above, this is an asymmetrical gate. Since it returns false when both inputs are false, we are able to build it as a core gate. This gate will help reduce our need for inverters.
Configuration is Key
There’s not much that isn’t important when designing logic gates. In order for a logic gate to be useful, it has to fit with the wires that we chose. Since we are arranging everything rectilinearly, each logic gate has four sides, two of which are input, and one or two of which are output. This gives rise to multiple distinct configurations of each gate. For an example, one version of a gate could take input from the left and right, while another version could take input from the left and top. These gates are distinct because there’s no sequence of rotations/reflections that transforms one to the other. We must individually, and explicitly, construct each distinct gate configuration. For symmetrical gates, there are 4 configurations, and for asymmetrical gates, there are 5.
Having a wide palette of gates is very important, since not having the right gate generally results in additional wiring or additional gates, which slows down the circuit, which means that other parts of the circuit (or computer) must be adjusted as to keep the timing in sync.
There is an important additional restriction on shape, which is that they fit into a square bounding box (of arbitrary size) with the input/output wires centered on each side. This helps the gate fit into the grid system and means they may be rotated/reflected and still fit properly without affecting timing.
Timing is Key
In order for a given logic gate to be compatible with our wiring system, they must follow an important timing principle:
The time between input and output is given by the Manhattan distance between input and output, not by the type of gate or what the input/output values are. If the logic gate were replaced by a straight or corner wire, the delay must be identical.
This rule is what ensures that the logic gates are compatible with each other, and that the next layer of abstraction (detailed in a later post) actually works. Several gates have been discovered that are completely unusable because of timing issues. In those cases, it is generally possible to correct the timing issue by simply bending some of the surrounding wires, but that generally adds to the size of the gate.
List of “Core” Logic Gates
Here is a (to-be-completed) comprehensive list of the known logic gates. Some of the animation needs to be updated to VarLife quality. The VarLife animations are also hyperlinked to an example implementation. In some images, a red bounding box (11×11, inclusive) is displayed, although that same bounding box can fit (centered) over any other gate here.
All of the possible AND gates are known.
The situation for XOR gates is similar. Missing is opposing input with two outputs.
The OR gate typically uses the B12/S1 cell type. It has, however, a second version of the adjacent-input configurations which is more compact. The larger version is often used, simply for consistency. It is likewise missing the opposing-inputs double-output version.
The AND-NOT gate is asymmetrical, so there’s five possible configurations So far, two have been found:
Adjacent inputs, single output, with input A across from the output.
Opposing inputs, single output.
Adjacent inputs, single output, with input B across from the output.
Adjacent inputs, two outputs
Missing is the following:
Opposing inputs, two outputs
What about those missing gates?
As you may have noticed, a couple theoretical gate configurations are missing. This is generally due to a lack of effort: a serious need for many of the double-output gates hasn’t arisen yet, so not much time has been spent searching for them. On the other hand, I do not believe that the double-output gates will be simple. If you find a missing gate, or a more compact version of an existing gate, please contact us. If it is compatible with our project, I will add it to this list.
Our vision for the GoL computer is one similar to real computers. That is, it’s built from logic gates, has memory, adders, clocks, etc., and is programmed in a format comparable to real assembly language. Just like regular computers, we have to start from the basics: wires and logic gates. The first task is to find implementations of those simple components in metapixels.
The order in which I present the various building blocks is a combination of discovery order, relevance, and complexity.
There’s actually a couple different options as far as wire goes. You can see some options in this simulation.
We chose the 3-wide wire, with electron tails 2 blocks behind, which is shown on the top row. There are some other variations on electron which are shown in the second row. The third row shows a type of 2-wide wire (discovered later) which has some inherit side-to-side flexibility. This 2-wide wire ends up being useful only as a single “module” within a larger wire, to help squeeze around an obstacle. The 3-wide wire is superior, in my opinion, because it is uniform (can take on any length) and handles 90 degree turns better. The electron as shown in the first row is the kind that arises most naturally in various circuits, such as the clock at the bottom. Care must be taken, however, to ensure that the tails are properly formed, since several logic gates malfunction if the tails are incorrect. Furthermore, a sporadic signal in the wire can end up reverberating to both ends.
There are probably other potential designs for a wire, using other metapixel rules or other arrangements, but I believe the B1/S and B2/S combination is one of the most useful. For comparison, wireworld electrons are B12/S (but have a third “dying” state).
I made some mention of corners in the last post, but here are some more variations:
The best kind of corner is one whose delay we can ignore, meaning that we do not have to count the number of corners to calculate the time it takes for a signal to travel down a long wire. Each corner above follows the correct Manhattan-distance-based timing, so they will all be compatible with each other and with any other circuits that obey Manhattan distance (which is not a give-in with a Moore neighborhood).
The most common corner used in our designs is the one in the top-left and bottom-right of the animation.
The corner in the top-right has the advantage of being narrower on one side, but is likewise wider on the others.
The corner in the bottom left is the widest, which can be both an advantage and disadvantage.
One important rule to keep in mind is that “something cannot be created from nothing,” when it comes to our CA, at least. The only way to generate an electron is if we already had one. The only way to increase our total electron count is through wire splitters.
The corners listed above can be converted into wire splitters, quite conveniently actually. The first corner can be transformed into a 2-way wire split (turning left and right). The second can be transformed to a 2-way turn-or-go-straight split. The third one can be converted into any kind of 2- or 3-way split.
Wire crossings allow signals to pass directly through each other. Since we are limited to a 2D world, a compact wire crossing is one of the most important building blocks. The discovery of the wire crossing actually came after a couple of the more simple logic gates. After the team did a decent amount of trial-and-error, El’endia found the wire crossing below. He introduced the B12/S1 rule, shown in red, as the fourth (and currently last) type of metapixel in our project.
I figured out a slightly more compact version, but so far I’ve never actually used it in bigger circuits:
Rarely in actual circuitry are parallel wires this close together, so the circuit will probably involve diagonal wiring to bring them together.
Diagonal wire is wire that shifts to the side as it travels forward, which enables the wire to cut corners if need be.
Or, less practically (requiring larger distances between electrons and dependent on the exact distance):
Also of importance is delay wire, which is wire that slows down the passage of electrons. Delay wire is classified by its length and delay (how long it is versus how long electrons take to flow through it). The only useful generic delay wire I’m currently aware of is this 4:3 wire by El’endia:
I have built some delay wire modules of the sizes 22:11 and 33:11 for our project. The use-cases and reasoning behind those sizes will be explained later.