Browsed by
Month: June 2016



QFTASM (Quest for Tetris Assembly) is the “human readable” representation of the instructions that will be performed by our processor.  El’endia has written an interpreter for QFTASM at his website, complete with documentation and example code.


Since we are using 4 bits for the opcode, there is a limit of 16 possible operations, of which we have currently assigned 11.  Below is a more detailed explanation of each command, taken from El’endia’s documentation.

  • MNZ [test] [value] [dest] – Move if not zero; sets [dest] to [value] if [test] is not zero.
  • MLZ [test] [value] [dest] – Move if less than zero; sets [dest] to [value] if [test] is less than zero.
  • ADD [val1] [val2] [dest] – Add; adds [val1] to [val2] and stores the result in [dest].
  • SUB [val1] [val2] [dest] – Subtract; subtracts [val2] from [val1] and stores the result in [dest].
  • AND [val1] [val2] [dest] – Bitwise AND; bitwise ANDs together [val1] and [val2] and stores the result in [dest].
  • OR [val1] [val2] [dest] – Bitwise OR; bitwise ORs together [val1] and [val2] and stores the result in [dest].
  • XOR [val1] [val2] [dest] – Bitwise XOR; bitwise XORs together [val1] and [val2] and stores the result in [dest].
  • ANT [val1] [val2] [dest] – Bitwise AND-NOT; bitwise ANDs together [val1] and (NOT [val2]) and stores the result in [dest].
  • SL [val1] [val2] [dest] – Shift left; shifts [val1] left by [val2] bits and stores the result in [dest].
  • SRL [val1] [val2] [dest] – Shift right (logical); shifts [val1] right by [val2] bits and stores the result in [dest]. Doesn’t preserve sign.
  • SRA [val1] [val2] [dest] – Shift right (arithmetic); shifts [val1] right by [val2] bits and stores the result in [dest], while preserving sign.


As of right now, we do not have specialized JUMP commands or an unconditional MOV command, because all of those can be represented as conditional moves. We might eventually add an unconditional move command if we determine that it will provide a good performance improvement.

For example, an unconditional jump is typically represented as MLZ -1 28 0 to jump to instruction 28.

Addressing Modes

Addressing modes are represented as letter prefixes to the values.  For example, the value 753 represents a hardcoded value (mode 0), while the value A753 represents the contents of address 753 (mode 1).  Mode 2 is represented as B753, and mode 3 is represented as C753.

Input / Output

Input and output are represented using hard-coded RAM addresses, which serve as “ports” to other devices, like the “screen” or a “button.”  This means that there are no special opcodes for I/O.  With the interpreter, input is provided by manually editing a specified RAM address, and output is read from a specified address.

Line Numbering

The interpreter requires that every line of QFTASM be prefixed by a line number, with one instruction per line.  This is more for the convenience of the programmer, so that I can tell where jumps go.


Comments can be added to a line by adding a semicolon, after which everything is ignored. Comments can be hand-written or auto-generated by the compiler.

Future Expansions

Among the opcodes that may be added to QFTASM in the future are:

  • NOT, which can be currently represented as ANT 65535 [value].
  • NAND, NOR, XNOR, which currently require two steps, but which we haven’t needed yet.
  • Something that’s related to multiplication, although we will probably just implement the shift-and-add algorithm in QFTASM.
  • Something that’s related to division or modulus, although we will probably just implement relevant algorithms in QFTASM.

Example Code

The following is a very inefficient program that calculates prime numbers through trial subtraction.  It was compiled by my compiler, which will be the subject of a later post.  This program is runnable on the interpreter.

0. MLZ -1 3 3;
1. MLZ -1 7 6; preloadCallStack
2. MLZ -1 2 1; beginDoWhile0_infinite_loop
3. MLZ -1 1 4; beginDoWhile1_trials
4. ADD A4 2 4;
5. MLZ -1 A3 5; beginDoWhile2_repeated_subtraction
6. SUB A5 A4 5;
7. SUB 0 A5 2;
8. MLZ A2 6 0;
9. MLZ 0 0 0; endDoWhile2_repeated_subtraction
10. MLZ A5 4 0;
11. MLZ 0 0 0; endDoWhile1_trials
12. SUB A4 A3 2;
13. MNZ A2 16 0; beginIf3_prime_found
14. MLZ 0 0 0;
15. MLZ -1 A3 1; endIf3_prime_found
16. ADD A3 2 3;
17. MLZ -1 4 0;
18. MLZ -1 1 4; endDoWhile0_infinite_loop
Architectural Design

Architectural Design

One of the more tedious steps of building the computer is the actual assembly, going from simple circuits to more complex ones. We have decided to actually leapfrog over this step and go straight to architectural design for a few reasons:

  1. Circuit simulator is still a work in progress, plagued by disease and other unfortunate circumstances.
  2. By deciding what our architecture is going to be, we will know exactly what we need to build and how we need to connect them.
  3. Work is parallelized.

We have welcomed a couple additional people onto this project. User 7H3_H4CK3R has participated in most of the architectural design decisions featured on this page, and user Mego has taken over work on the circuit simulator.

Asynchronous Design

As mentioned previously, we will be using an asynchronous design to speed up our processor. We will accomplish this by having a clock signal accompany the data everywhere it goes. This clock signal will serve as both read and write signals for the various memory devices (ROM, RAM, synchronizers, etc.).

Instruction Format

There was much debate over the type of instructions that our processor would use. There were several options:

  1. We could use a Transport Triggered Architecture (TTA), in which case instruction would be of the form (source, destination). We would hard-wire a lot of addresses to perform specific operations upon reading/writing, similar to the Wireworld computer.
  2. We could have created a distinction between registers and RAM, and we could have an accumulator. Then we could have load/store commands to transfer between them. Then we could have instructions of the form (opcode, value) that could either load a certain RAM address, store a certain RAM address, or perform math, such as adding a given register to the accumulator.
  3. We could choose to not have registers, and just use RAM for everything. Instructions would be of the form (opcode, value, value). We could either move between RAM addresses (one of which would be mapped to the program counter), or perform math in which the result would be stored in a hard-wired default RAM address (which would act like an accumulator).
  4. We could avoid having a default RAM address by having instructions of the form (opcode, value, value, destination) so that a destination could be assigned to every result. This could also allow for stuff such as conditional moves.

We chose the fourth option because, the more detailed the instructions were, the fewer instructions we need, which I believe will speed up execution. Take an example of adding two numbers from RAM and storing the result in RAM. With TTA, that would require three moves (for the two operands and the result). With the third option, it might take two (add arguments, then move result) or three instructions (load arguments, store result). With the fourth option, it takes only one.

The fact that everything will be memory-mapped to RAM (including the PC), and that there are no special registers, make this design relatively simple.

Addressing Modes

It is beneficial to have several addressing modes for operands of each instruction. Our modes just involve sending the data through the RAM read cycle the desired number of times, which means they are easy to implement. These addressing modes take on slightly different meanings when used as a value (first two arguments of an instruction) or as a destination (the third argument), but the hardware is exactly the same.

  • Mode 0 is any constant hardcoded into ROM. This can be either a hardcoded value, such as an instruction that always adds 5 or always writes 17, or it can be a hardcoded destination, such as an instruction that always writes to address 42.
  • Mode 1 is any value that results from a single read of RAM. This would be used when we desire to add the value in RAM address 13, or if we need to read the value of address 37 to determine where our destination is.
  • Mode 2 represents dereferencing. If address 7 contains a pointer to the address in which our data is stored, we would use 7 as an address to read from RAM, use that data as an address to read from RAM again, and use that data as an operand in our ALU.
  • Mode 3 might not be used (or might not even exist, once we actually build it), but it would act as super-dereferencing. This might be useful for 2D arrays.

Bit Width

We have decided that 16-bit words are sufficient for our data, which will allow us to store values in the range 0 to 65535 for addressing or values -32768 to 32767 for numeric data using 2’s complement. Opcodes will probably be 4 bits, allowing for 16 instructions. Addressing modes will be 2 bits. This means that each instruction will be 4+2+16+2+16+2+16 = 58 bits long.

Stages of Execution

The steps involved in the execution of a single instruction are roughly as follows:

  1. The address of the next instruction leaves the program counter, immediately after which the program counter is incremented. This post-increment allows for “jumps” to point directly towards the desired jump destination, rather than the instruction immediately prior to it.
  2. The instruction address is sent to the ROM and the instruction is retrieved.
  3. The instruction is split into its parts, being (opcode, mode, value, mode, value, mode, destination). The opcode is stored in synchronizers until the data is ready. Each of the three arguments undergoes the following process.
    1. The mode and value (or destination) are sent to a counter.
    2. If the counter equals zero, the current value is final, so send it to the synchronizer(s) in the next steps in the process.
    3. Else, send the value to the read queue*, and decrement the counter. The read queue then reads the data from RAM using the value as the address, and sends the new data back to the counter. The counter cycles back to sub-step 2 above.
  4. Once the opcode and two values are available, the opcode is used in a selector for the ALU. It selects which basic operation to perform, sending the clock signal along one of many possible paths. The ALU returns a result.
  5. The conditional is evaluated to determine if the data should be written. For most opcodes (like ADD, OR, etc.) the conditional is automatically true. For special conditional commands, the first value is tested in one of two manners. We currently plan on having “val < 0” (sign bit) and “val != 0” tests.
  6. If the conditional is false, return to step 1 for the next instruction.
  7. Else, move the result to the write synchronizer, which also contains the desired write address from the end of step 3 above. The data is written, and we return to step 1.

*The read queue is one of the more “magical” components where I don’t really know how/if it will work. Ideally, it would do some parallelization between the three values. Worst-case scenario is that it ends up being incredibly inefficient without parellelization.


In order to speed up our processor, we will do a little bit of pipelining. In order for the computer to function nicely, we have to wait until after an instruction has written to RAM for the next instruction to read from RAM. It is possible do some overlapping, so that one instruction is being retrieved from ROM while another is being processed. This eliminates the delay that would be caused by having a program that is several thousand instructions long. On the other hand, this creates a branch delay slot, where the instruction immediately after a jump is executed regardless of the jump.


A rough diagram of our architecture, drawn with


The GoL computer is now in the architecture design phase.  There’s plenty of ideas floating around, but I don’t really have material to make a blog post about it yet, so I’m doing something different,  A couple days ago, there was a sale on .xyz domains.  To be more specific, each domain was $0.02 for the first year.  (The sale price has since risen to $0.22, which isn’t as fun.)  So, I decided #yolo and bought a couple domains.  Some were for personal/friend use ($0.04 total), but I also noticed that their website has a list of “regularly priced premium” domains that were still available, specifically domains that were really short or were common words.  I noticed that a couple of the premium categories were almost entirely sold out, so I spent $0.10 to buy the last ones.  To be specific, these domains were:

  • qx3 (3-character)
  • jbij (4-letter)
  • ulub (4-letter)
  • xoix (4-letter)
  • independents (common word)

As of writing, each of those categories of domain is completely sold out.  I don’t really know if any of these are particularly valuable, but I figure I’m going to try to sell them over the course of the next year or so.  I think it’s worth the experience, and if I don’t manage to sell them, well I’ve wasted $0.10 on lesser things before.  They’re listed for sale in a variety of places for a variety of prices, but if you’re reading this and decide that you want one, I advise contacting me directly.


The following is a somewhat cynical viewpoint on domain name trading.

Unless some glaringly miraculous opportunity arises, I think this is about it for my domain name trading. I’m no economist, but from what I can tell, the domain market has pretty much settled down.  Every 4-letter .com has been registered, and while a lot of them are used for legitimate purposes (and thus won’t likely be sold), all the meaningless letter combinations just trade hands for prices that I don’t feel are deeply rooted in any tangible value that the domains have.  Prices seem to be steeped in voodoo and superstition more than anything else (how to determine the “brandability” of an arbitrary 4-letter combination?).  This is quite true in numerical domains, where there is nothing to determine relative value other than Chinese superstition (8 is lucky, 4 and 0 are unlucky). In normal investing, the money that the investors spend helps the company hire people, buy tangible assets, etc., which will help the company grow.  On the other hand, how does a domain investor make money other than getting in the way of the domains ultimate owner, by registering/parking it first?

Imagine if there were a river on which trading boats often passed.  A person buys a plot of land on the river, and ties a rope from one bank across to the other, only allowing boats through when they pay a toll.  Now, many people might be willing to pay the toll, but that doesn’t mean that the person who’s collecting the money actually deserves any of it.  Similarly, a company might be willing to pay $10,000 for a premium domain, but that doesn’t mean that the person who “ninja’d” the domain before them actually deserves any of it.  I understand what role these investors play: using the concept of capitalism to distribute domain names to the most “deserving” recipients (large businesses who can afford to pay the cost to buy it).  All this money, however, just benefits these miscellaneous third parties, who never used their domain names to host any meaningful content on the web.  If anyone, it should be going either to ICANN, although I’m not quite sure what they’d do with it all, or to a previous “legitimate user” of the domain.

In the case of domain names, I’m not convinced that the current system is effective.  The number of parked domains names is ridiculous.  A domain investor only needs to sell a couple percent of his domains each year to turn a profit (given that the renewal fee is only like $9 or so per domain, so a $10,000 sale pays for parking over 1,000 other domains for a year), which leads to lots of domains that simply remained parked indefinitely.  I’d also be interested to see statistics on what percent of those sales are just to other investors.  I don’t know what ICANN will be able to do to alleviate this problem, but I want to see them take steps to increase the percentage of registered domains that are used to host meaningful content.  Combined, all of the TLDs form one giant global namespace, and I believe it is being wasted as we speak.

Ultimately, it’s not what the domain name is, but rather what you make of it.

My own personal branding

One of reasons for deciding to spend $9 to get, in particular, was that I desired to “claim my username.”  I’ve wanted my own domain name and website since forever, and I decided to put it under one of my internet usernames, since this is probably my most *useful* internet presence (I checked, and the majority of domains relating to my real name are long gone.  Even though it’s a “personal site” nothing on here is really “personal” enough to warrant having my real name posted anywhere on it).  It allows me to freely use this domain for PPCG and programming-related stuff.

One of the two other domains I bought during the .xyz sale was, which is currently serving as a redirect page.  Since it looks like .xyz domains are more expensive to renew than .com domains, I probably won’t keep it.  As of right now, I plan on keeping indefinitely.

The hardest choice was which TLD I wanted to get as my main site.  I know isn’t the most elegant domain name, but I don’t have any better ideas at this point, and “.com” is typically considered the most demanded TLD.  Other ideas, like “” (pio?) and “” (almost 3x as expensive, and “pico”?), didn’t really speak to me.

Other people’s personal branding

I’ve also been checking out what domains/websites other PPCG members have, and I’ve found some pretty interesting/creative ones.  I don’t know if I should mention any names here, though.


What do you think….

  • About my choice of
  • Of my investment choices? (I know they’re mediocre)
  • About the state of the domain market?
  • Are some cool domains that I should know about?
The Synchronizer

The Synchronizer

Processors are as slow as their slowest component.  While true in conventional CPUs, that phrase takes on a whole new meaning in cellular automata.  At any given moment, there are lots of little pieces of data flowing around in our computer, and in order for two pieces to properly interact, the selected pieces of data must be perfectly in sync with each other.  In the previous post, I mentioned the choice between constant- and variable-timed ROM.  That is actually a small part of a large decision: synchronous or asynchronous?

  • Synchronous: All of the data in the computer is kept in sync simply through wiring.  All reads/writes have equal time, and all operations have equal time.  Everything is as slow as the worst-case scenario (reading from furthest address, etc.). We don’t need to design any parts that can handle inputs at variable times, but we’ll need lots of delay wire to sync stuff up.
  • Asynchronous: The next instruction is executed as soon as the current one is finished.  ROM/RAM data is available as soon as it is read, with smaller addresses being faster.  Arithmetic is performed once both operands are available.  We can make circuits shorter/faster, but need components that are capable of synchronizing pieces of data with each other, such as holding the data for some arbitrary amount of time until it is ready to be used.

An asynchronous design has the potential to cut average ROM/RAM access times in half.  Given that the ROM may be several thousand tiles long, this will cut several thousand generations off each instruction’s execution time.  Furthermore, it will reduce the complexity of wiring.  This is why we have chosen the asynchronous paradigm for our processor.

The current plan to implement asynchronous computing is to have a second wire running parallel to each data wire.  This wire will carry a “clock signal” that indicates the presence of data.  For example, this clock signal will serve as the read/write signals to memory.  The clock signal will always remain in sync with the data, but the path (and duration) of travel can vary between instructions.  The computer will speed up when small addresses are being accessed, and will slow down for larger addresses.

The Synchronizer

There are, however, many times where two pieces of data must be put in sync with each other.  This can work by storing one piece in a temporary “register,” which is read from when the second piece of data arrives.  In order for an asynchronous design to work, this synchronization device has to be able to handle the two pieces of data coming in at any time (in practice, any multiple of 11 with some minimum separation between writing and reading).

Here’s how our synchronizer works: first, the incoming signal is sent through a serial-to-parallel converter, which works similarly to one in our ROM.  It’s actually a more “true” converter: the data signal is sent down a sequence of wire splitters separated by delays, and then the “clock signal” passes across each wire and selects a single bit for each of the outgoing wires.

The core of the synchronizer is a set of length-44 loops, one for each bit of the word.  These loops function similarly to the ones that might be in our RAM.  As an example of using a loop to store information, below is an image of a memory loop constructed by El’endia:

An example memory loop, with data-in (Di), write (S), read (R), and data-out (Do).

There are four main components:

  1. An AND-NOT gate that uses the write signal to destroy the information currently in the loop (which will happen in about 15 generation in this sim).
  2. An OR gate which which adds the new data to the loop.
  3. A wire splitter that repeatedly outputs the data stored in the loop.
  4. An AND gate that is hooked up the the read signal to allow the data to exit the device (which will also happen in this sim).

In the synchronizer, the loop must be able to be read from at any given time (multiple of 11).  So, we must have four electrons flowing around the loop to signify an “ON” bit.  This means that the input bit (and write signal) must be tranformed into four bits in quick succession.  My method for this is duplicate the signal twice: first by splitting the signal, delay one copy by 11 ticks, and joining them.  Second, we split the signal, delay by 22 ticks, and join them.

The more difficult operation is the delay by 11 ticks.  I decided to create a special component to accomplish this operation.

Signal “doubler”: splits the signal, delays one copy by 11 ticks, and merges them together. It is probably possible to replace the OR with AND/XOR if desired.

The second doubling is simpler: we split the wire and merge it using our typical components.  This can be done in a 2×2 tile space, which there is probably little benefit in compressing.  Once each of the bits (and write signal) have been amplified 4X, they are used to load the loops.  The “ON” loops will emit a pulse every 11 ticks, with the “OFF” loops emitting nothing.

“The read signal is very dangerous and can attack at any time, so we must deal with it.”

Reading from the device requires two things to happen.  First, the read signal travels across every output, using the same AND/Crossing multi-tile to select one electron from each loop.  These electrons are then merged into a serial data stream, using OR gates separated by delay wire.  The output electrons are in the same order as they were input.

A constructed synchronizer

Below is a synchronizer I constructed in VarLife.  The serial-to-parallel converter is in columns 1 through 3.  Column 5 has the 11-tick doubler, and columns 6 and 7 have the 22-tick doubler.  Column 8 contains allows the write signal to reach each loop, with the loops themselves in columns 9 and 10.  In columns  11 and 12, the read signal selects one bit from each loop.  Finally, column 13 has the parallel-to-serial converter.

A sample 2-bit synchronizer. it’s actually 1 tile wider than necessary, with the serial-to-parallel converter separated from the rest of the device. As shown, it is about to store the number 3.

The Synchronizer in Practice

The synchronizer will most like be used in several places in the computer:

  • Storing one input for the ALU (arithmetic and logic unit) while the other is being retrieved from memory.
  • Storing the RAM address so that it can be synchronized with the larger data loops (probably 32 bits with 22-tick separation) of the RAM, so that data can be read and written at the correct starting and stopping points.
  • Pipelining/queuing: Putting one piece of data on hold while another one is using the device (probably ROM, maybe also RAM).

Each of these scenarios will require additional circuitry, which may include some or all of the following:

  • Something to keep track of whether or not new data is present.  By this, I mean whether or not data has been written since the last read.
  • In the case of RAM, something that blocks the read signal if new data isn’t present and waits until the next loop cycle to try again.
  • In pipelining/queuing, something that marks the device as being “in use” or not.  If the device is not in use, the write signal also serves as the read signal, so that the data may pass through (almost) immediately.  If the device is in use, it waits until the other data exits the device, and uses that data’s clock signal as the read signal.


The bulk of the program for the computer will be stored in ROM, or Read-Only Memory.  We chose to use ROM because it tends to be more compact than RAM, and the speed of memory access will be one of the bottlenecks of our computer (ROM will be massive, like 1000s of tiles).

Data Storage

We chose to use grid-based ROM.  This means that data is stored in a grid, and the value of a particular row is accessed, sending down one bit per column of data.  In order to do this, we needed two types of data cells:

  1. A “false” bit will be a gate which passes the read signal through and the data signal through without affecting either of them.
  2. A “true” bit will pass the read and data signals through, but will also output a 1 for the data signal if it receives a read signal.

It turns out a false bit is simply a wire crossing.  The true bit requires more work.  I had a sudden realization, that the “true” gate never has to deal with both signals coming in at once, since we are either reading the current row or some other row, but never both at the same time (which would mix the data together).  So, a “true” gate is allowed to malfunction for that input combination.  This is what I came up with:

A “true” ROM bit, with data being passed through, and the read signal triggering both outputs. Data wire is top-to-bottom, and read wire is right-to-left.

The idea here is to line a bunch of these up, so that a single “read” signal reads out the entire word.  By lining up the data wires, we allow for the data from different addresses to be passed down the grid.


Okay, so we have the data grid figured out.  Next is addressing.  Addressing is also done with a grid, but a different kind of grid.  We have an address signal (each column receives 1 bit of the address) and a read signal (sent across each row to test that combination of bits for matching).  A “false” gate allows the address bit through, but only allows the read signal through if the address bit is 0.  A “true” gate allows the address bit through, but only allows the read signal through when the address bit is 1.  I created two 1×2 multi-tiles to accomplish this.

A “false” address gate. It blocks the read signal when the address bit is 1.  Read signal right-to-left, address bit bottom-to-top.
A “true” addressing gate. It allows the read signal through only when the address bit is 1.

The basic concept is that we line a lot of these addressing gates side-to-side (sharing the read line), and then the read signal can only get through all of them if the address matches at each bit.  Then, that read signal goes on to access the data from that row from the data grid.  The addressing gates are stacked vertically so the desired address is tested against each row of gates.

Parallel to Serial

So, we have data that is coming off of the data grid in parallel, how to we convert it to serial for use by the computer?  It is actually very easy.  The data is coming off in parallel, but isn’t synchronized horizontally, since it is produced as the read signal moves right-to-left.  It turns out that to convert this to serial, all we have to do is OR the signals left-to-right (in the opposite direction) and the spacing between signals is automatically twice the width of the columns, which is exactly the separation we need.

Serial to Parallel

Much more difficult is the serial to parallel conversion we need for addressing.  Performing a “true” conversion would be rather difficult, since we have to have timing circuits that are capable of waiting and selecting each bit to move down each wire.  It turns out we can cheat.  We can send the entire signal down each line, and the read signal for each row will take a “cross-section” of this signal which corresponds to the “correct” value of each bit.  There’s a little more to this, though.  Since the address signal starts going up each column in right-to-left order, and the read signal also travels in this direction, it would mean that our cross-section is skewed, and it will interact with the same bit (the last bit) of the address all the way across.  To correct for this, we slow down the address with a 33:11 delay between each column, so that it enters the next column 33 ticks later instead of 11 ticks.  This means that the read signal interacts with the last, then second-to-last, etc., then first bit of the address, which gives us the cross-section we want.

Putting Everything Together

Here is a 4-address, 3-bit-word ROM I constructed:

ROM. The 4×3 data grid in top-left, 4×2 (physically 4×4) address grid to its right. The read signal travels up the rightmost column. The data is converted to series series in the bottom left, and the address is converted to parallel in the bottom center. As shown, it is set up to read from address 3.

The table that represents this ROM is as follows, with data (appears big-endian, but is read off in reverse) on the left and address on the right.

1 0 1 | 1 1 (5|3)
0 1 0 | 0 1 (2|2)
0 0 1 | 1 0 (1|1)
1 1 1 | 0 0 (7|0)

The “read” signal that is send alongside the address should be 11 ticks behind (right-to-left) of the last bit in the address.  It is possible to send the signal earlier but have delay wire that delays from entering the ROM it until after the address.

ROM in Practice

The ROM as I have it above is different than the exact configuration that we will eventually use.  For example, the real version will probably have 16 address bits and who-knows-how-many bits of data.  The main difference might be in the timing.  The version above is faster to read from smaller addresses than larger ones, 22 ticks per address faster.  This is both good and bad, because our computer wouldn’t be slowed unnecessarily when the wanted data is nearby, but it will have to deal with data coming back at a variable time.  There are two solutions:

  1. Return the read signal with the data to indicate when the data has been read.  Then, other circuitry needs to be developed that can take data in at variable times and perform the operation once every piece of data is received, whenever that may be.
  2. Reverse the order of the data grid, so that the data comes out the opposite side from where the data comes in (as opposed to the same side like above).  Then, the data wire will have to come from the far end back to the computer.  This effectively slows every read to the same speed.
Serial Data and Binary Addition

Serial Data and Binary Addition

Okay, so we have now created a notion of a standardized grid on which our circuits will be built.  The next step involved a crucial design decision, one which will affect the design of every circuit in the computer.   The question is, what format will the data be in?  There are two big options:

  1. Parallel: A byte of data will be represented as one electron on each of eight separate wires.  The wires must be synchronized with each other.
  2. Series: A byte of data will be represented as eight electrons on a single wire, carefully timed to indicate where the word begins and ends.

Real-world processors of course use parallel data, but they also don’t need to worry much about timings (race conditions are real issues, but they are caused by entirely different reasons).  For example, if you are connecting 64 wires from one chip to another, and the outer wire is 5% longer than the inner wire, it doesn’t matter.  In cellular automata, the exact distances matter a lot.  If we want a 16-bit processor, that’s 16 bulky wires we would have to work around every corner (each of which will either flip their order or put them all out of sync).  We would also have to expend additional effort to flip the order of bits or adjust the spacing between wires in order to get the data buses to connect to our circuits.  The advantage of parallel data is processor speed (maybe?), the ability to stick with a more traditional architecture, and the ease of creating memory (maybe).

A corner for parallel data, drawn in Logisim.

Serial data has the advantage that a change in word size requires a lot less redesign of the hardware.  The only important factor is the distance between electrons, which matters for all circuits anyways.  Serial-based circuits are also much smaller, since there is simply 1 wire per input stream.  The downside is that serial data can sometimes be more clumsy to work with, since, by definition, circuits don’t see the entire word at once.  Without a special counter, the circuits can’t tell where words begin or end.  Performing “backwards” bitshifts is easy (with delay wire), but performing “forwards” bitshifts is hard (either using diagonal wiring to cut corners, or delaying everything else relative to the selected data).

We chose serial data for our project, primarily due to the vastly increased simplicity of wiring.  Our decision was made in part by the difficulty of constructing a binary adder.

The Binary Adder

In order to get a better idea of how the computer will end up looking/working, I decided to build something larger and more complex, something that I didn’t know ahead of time how it would work.  I chose to build a binary adder.  Part of the reason I chose an adder was that the process of binary addition involves “carry over” between bits, which requires interaction between the bits of a number.  Most of the other basic operations, like bitwise logic, are significantly simpler to implement.

The core of a binary adder is the full adder, which takes in three bits (A, B, and Carry-in) and outputs two bits (Sum and Carry-out).  Below is a circuit diagram, but keep in mind that the order of inputs doesn’t matter.

Logic diagram of a full adder.

The carry-in represents the “overflow” from the previous pair of bits, and carry-out represents the overflow from the current pair of bits.  For parallel data, a set of full adders is used, with the C_out of one adder hooked up as an input to the next.  Here is how a parallel adder would probably look for our computer, drawn with

Parallel binary adder. An implementation would require a substantial amount of delay wire. Sometimes, a carry-in wire is connected to the first full adder to allow for 1’s complement subtraction. Without a carry-in wire, the first full adder could be replaced by a half adder.

The idea behind a serial adder is much simpler: take a single full adder and loop the carry-out back around to carry-in.  In real electronics, this would require a flip-flop to temporarily hold/delay the carry bit, but in our circuits the delay is simply given by the loop of wire.  Serial adders, importantly, can only handle little-endian data, meaning that the least significant bit comes first.  This means that the binary number 011 refers to the decimal number 6 (rather than a 3 with a leading zero).

The creation of a series adder took a reasonable amount of effort, but here was our second working prototype (after a string of failures).  The first version was significantly larger.   You can see the red corners of the 3×5 tile bounding box.

Serial adder, with 101 (5) in the top input and 10111 (29) in the left input.

Below is an image showing the flow of information around this adder design. The carry loop is the length-22 loop in the bottom-right of the adder which rotates counter-clockwise, and this loop is packed with an AND, OR, and wire splitter.  As information flows into the adder, both inputs are split, with one pair going to an XOR and the other pair going to an AND.   The output of the XOR is split, with one half travelling straight to be XORed with the current carry value, resulting in the output value.  The other half is ANDed with the carry, which is then ORed with the output of the first AND gate to produce the next value of the carry loop.

Serial adder with arrows indicating data flow.

After an intense round of golfing, the following 2×3 tile design is probably final.  It has the exact same connectivity, but the wires have been shortened.  Since both designs are “lagless” and “linear,” this more compact version is superior in literally every way.

Current best binary serial adder design, shown with inputs of 11 (3) and 011 (6).
Data flow of compressed adder.

What Did We Learn?

We learned a couple things about how our computer will end up functioning:

  1. It will use serial data (a sequence of electrons on a single wire to represent a word).
  2. The spacing between electrons (well, the places where an electron could be) will be 22 ticks.  I do have some potential worries about 22-tick technology, though, since it might lead to the creation of a lot more of multi-tile components whenever data needs to flow in a 22-tick loop.  Some technology (graphics, global clock, etc.) might still use 11-tick technology.
  3. We will probably use 2’s complement, since it does not require us to preset the carry-in of an adder.  This might mean, however, that subtraction would be broken down into two steps: negation and addition.  We would have a 2’s complement converter/command.