Step 3: Execute

Read time: 33 minutes (8360 words)

In the next phase of processing we activate the ALU to do the “number crunching”. We should already understand how that unit works, and modeling it is simple. However, we do need to make sure that we understand how data flows through the machine. Let’s take this opportunity to trace one instruction completely through the machine. We will use “Register Transfer Language” to do this.

For this exercise, we will trace the ADD instruction through the machine.

  • ADD R0, R1 ; R0 <- R0 + R1

From the AVR documentation, this instruction is encoded as follows:

  • 0000-11rd-dddd-rrrr


In the discussion that follows we will indicate clock ticks with a time marker t. After those clocks, data will move over wires during a wire “tock” indicated by a w marker.


During the fetch phase, we transfer the instruction from instruction memory to the F.D barrier:

  • t0: PC.out <- (next addr)
  • w0: IM.addr <- PC.out, <= PC.out
  • t1: IM.out <- IM[addr], INCR1.out <- INCR.in1 + 1
  • w1: F/D.inst_in <- IM.out, <- INCR1.out

That F/D register just stops signals passing on tot h next stage. A tick will send them on their way.

At this point, we have the first 16-bit chunk of instruction code available for decoding sitting at the F/D.inst point.


The decode phase is responsible for breaking out the instruction parts. In this instruction we have two register numbers to decode:

  • t2: F/D.inst_out <- F/D.inst_in
  • w2: DCD.inst_in <- F/D.inst_out ; instruction is at decoder block
  • t3: DCD.aluop <- DCD.inst_in[10-15], DCD.op1 <= DCD.inst_in[4-8], DCD.op2 <- inst_in[0-3,9], DCD.aluop <- DCD.inst_in[10-15]
  • w3: REG.in1 <- DCD.op1, REG.in2 <- DCD>op2, D/E.aluop <- DCD.aluop_in
  • t4: REG.out1 <- REG[op1], REG.out2 <- REG[[op2]
  • W4: D/E.in1 <- REG.out1, D/E.in2 <- REG.out2

At this point the barrier just before the execute stage holds the instruction for the ALU, and the two data values needed to perform the action.


In this phase, we perform the calculation in the ALU:

  • t5: D/E.out1 <- D/E.in1, D/E.out2 <- D/E.in2, D/E.aluop <- D/E.aluop_in
  • w5: ALU.op1 <- D/E.out1, ALU.op1 <- D/E.out2, ALU.op = D/E.aluop
  • t6: ALU.res <- ALU.op1 + ALU.op2, ALU.Z <- zero?
  • w6: E/W.in1 <- res, E/W.in1 <- Z, E/W.in3 <- R0

We need the destination register address to store the result we calculated in this phase. The zero flag is passed along in case it is needed later.


Finally, we need to save the result back in a register

  • t7: E/W.out1 <- res, E/W.out2 <- Z, E/W.out3 <- R0
  • w7: REG.in3 <- res, REG.in1 <- R0, REG.wr <- write, CTRL.z <- Z
  • T7: [R0] <- res, CTRL.[z] <- Z

Cycle Ends

At this point a complete instruction cycle has been completed. All calculations have been safely stored somewhere and the machine is “idle”. It is at this point that we can begin another instruction, or interrupt the normal p=instruction flow, and send the machine to another instruction entirely.

Let’s do another one, this time one that bypasses the ALU:

  • LDS Rd, k ; Rd <- [k]
  • 10001 000d dddd 0000 k16

This one is more complicated, since it needs another chunk from the instruction memory to do the job. It also involves the data memory component


There is not much different in this step:

  • t0: PC.out <- (next addr)
  • w0: IM.addr <- PC.out, <= PC.out
  • t1: IM.out <- [addr]
  • w1: F/D.in1 <- IM.out


For this instruction, we do not have two registers identified, but we do have the extra complication of needing another word from the IM.

  • t2: F/D.out1 <- F/D.in1
  • w2: DCD.in1 <- F/D.out1 ; instruction is at decoder block
  • t3: DCD.op1 < R0

There is more to do here. We need the PC so we can go back and get another word. That means we really need to move the PC across that first barrier as well. Add this RTL on the w0 step:

  • w0: F/D.in2 = PC

And on t2 add this:

  • t2: F/D.out2 <- F/D.in2

We need to increment the PC and get another word from the IM:

  • w3: INCR1.in1 <- FD.out2 (PC)
  • t4: INCR1.out <- PC + 1
  • w5: IM.addr <- INCR1.out (PC_n)
  • t6: IM.out <= [addr] = K

Now, we have all the parts needed for this step. Notice that we do not extract anything from register memory, since the register specified is just the destination.

  • w7: D/E.in1 = R0, D/E.in3 = IM.out (k)


Notice that the RTL for each stage ends by placing signals in the barrier register. That register stops the actions within a stage, and prepares the machine for the next step. The next clock tick will make that happen.


There is nothing to do involving the ALU in this step. In our current thinking, the only thing needed here is to fetch the required data from the memory.

Some designers introduce another stage here, one that only accesses the data memory. We will not do that, but will do the work here to get the required data:

  • t8: D/E.out1 <- R0, D/E.out2 <- k
  • w5: DM.addr <- k, DM.RD = 1
  • t9: DM.out <- [k]
  • w6: E/W.in1 <- [k], E/W.in3 <- R0


Finally, we again need to save the result back in a register

  • t7: E/W.out1 <- [k], E/W.out3 <- R0
  • w7: REG.in3 <- R0, REG.in1 <- R0, REG.wr <- write
  • T8: [R0] <- res

There is one missing piece to this puzzle. I said that each step ends with a wire action. We have not included that in either of these instructions. In fact, we need to make sure the PC is updated for the correct nect instruction.

For the ADD instruction, we already have the next address calculated. It was placed on the increment component back in the fetch stage. All we need to do there is complete the action to move that value back into the PC register.

  • t9: <- INCR1.out

However, for the LDS instruction, we need to figure out the next address.

  • w9 <- PC_n
  • t10: INCR1.out <= PC_n + 1

Now we feed this value to the to the program counter:

  • t10: <- PC_n

Examining Instructions

In reviewing this RTL definition of how one instruction works, we see that the needed wires are defined by where signals need to travel. If we build a matrix of these pieces of information, we can deduce how the machine is to be constructed. That will introduce both simple wires, and multiplexors to the system. If we see that two wires in this set need to place their values on the same component input pin, we need a multiplexor, set by the control unit, to figure out which wire should get through for this particular instruction. The control unit will be responsible for that, and the needed signals will be established during the decoding step.

Ultimately, we need to perform this kind of analysis for every instruction we propose including in the machine. It is not difficult, but it is tedious to do. This is part of the reason the HDL languages have been developed. These languages let you describe the flow f the data through the machine. The HDL compiler figures out the needed wires, and add in the multiplexors as needed to complete the machine.


Notice that we discussed this data flow without referring to a wiring diagram. We did postulate the existence of a set of basic parts, and defined the input and output points on each component. The signals defined must travel over wires between the two component “ports”.

Streamlining the RTL

Let’s simplify the RTL and get rid of port designations and data values. We will focus just on register names. We will also skip passing data across a register, and just show the operations involved inside (the math).


t0: IM <- PC
t1: DCD <- IM[PC]
t2: DCD[OP1] <= INST[0-3,9] ; Rd,
    DCD[OP2] <- INST[4-8]; Rr,
    ALUOP <= add
t3: ALUres <- REG[Rd] + REG[Rr]
t4: REG[Rd] <- ALUres
t5: PC <- PC + 1

Much shorter. It is possible to derive the resulting machine and get more detailed in the design from this.


t0: IM <- PC
t1: DCD [IM[PC}
t2: DCD[op2] <- INST[4-8]; Rd
t3: PC <= PC +1
t4: k <- IM[PC]
t5: MEMres <- DM[k]
t6: REG[Rd] <= MEMres
t7: PC <- PC + 1

How is the PC really going to be updated in this? Perhaps we need two increment units. One for the final update, and another to deal with the two-word instructions.

Laying Out the Machine

In real circuit design, a wiring diagram is defined using something called a “net-list”, which is just a spreadsheet holding all the parts, connection points, and needed connections. Special software takes this net-list and performs the geometric analysis to lay out the parts so the lengths of the wires can be minimized. In building electronic boards, there may be several layers of printed circuit pathways (wires) that move between component attach points. I saw a supercomputer board with a dozen layers, all designed to keep data paths as short as possible:

Short wires make fast machines!


I had the idea of adapting this technique to our simulator. If we define the machine using RTL, we might be able to derive the machine diagram automatically. Building a tool to do this is not that complex (we had a start in our first HLD parser).

As a final look at the entire system, only for another processor. Here is a full diagram. From this you can see how they use barrier registers to control the movement of signals from one stage to the next.