Implements carry-lookahead adders using circuits of logic gates. Written in Verilog HDL for Altera and Xilinx FPGAs.

This chapter introduces algorithms to reduce the delay when adding numbers. We will look at two carry-lookahead adders using logic gates. 

Adding a carry-lookahead circuit can make the carry generation much faster. The individual bit adders no longer calculate outgoing carries, but instead generate propagate and generate signals.

We define a Partial Full Adder (pfa) module with the usual ports a and b for the summands, carry input ci, the sum s, and two new signals propagate p and generate g. The propagate signal (p) indicates that the bit does not generate a carry itself, but passes through a carry from a lower bit. The generate signal (g) indicates that the bit generates a carry independent of the incoming carry. The functionality of the pfa module is expressed by the circuit and Boolean equations shown below.
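As a minimal sketch of the pfa described above: the module name `pfa_sketch` is hypothetical, and the XOR form of propagate is an assumption (an OR would also work for the carry logic, but XOR lets the sum reuse p).

```verilog
// Partial full adder sketch (hypothetical module name, not the author's code).
// Assumption: propagate is defined as a XOR b, so it can be reused for the sum.
module pfa_sketch ( input  wire a,    // summand bit
                    input  wire b,    // summand bit
                    input  wire ci,   // incoming carry
                    output wire s,    // sum bit
                    output wire p,    // propagate
                    output wire g );  // generate

    assign p = a ^ b;   // exactly one input set: an incoming carry passes through
    assign g = a & b;   // both inputs set: a carry is produced regardless of ci
    assign s = p ^ ci;  // sum bit; note there is no carry output
endmodule
```

The outgoing carry is deliberately absent; it is left to the separate carry-lookahead logic.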

For bit position $$n$$, the outgoing carry $$c_n$$ is a function of $$p_n$$, $$g_n$$ and the incoming carry $$c_{i,n}$$. Except for bit position 0, the incoming carry equals the outgoing carry of the previous pfa, $$c_{i,n}=c_{n-1}$$ \begin{align*} c_n &= g_n + c_{i,n} \cdot p_n \\ &= g_n + c_{n-1} \cdot p_n \end{align*}

For a 4-bit cla this results in the following equations for the carry-out signals: \begin{align*} c_0& = g_0 + c_i \cdot p_0 \\ c_1& = g_1 + c_0 \cdot p_1 \\ c_2& = g_2 + c_1 \cdot p_2 \\ c_3& = g_3 + c_2 \cdot p_3 \end{align*}

Substituting the $$c_{n-1}$$ terms gives:

\begin{align*} c_0 &= g_0 + c_{i} \cdot p_0 \\ c_1 &= g_1 + (g_0 + c_{i} \cdot p_0) \cdot p_1 \\ &= g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1 \\ c_2 &= g_2 + (g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1) \cdot p_2 \\ &= g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \\ c_3 &= g_3 + (g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2) \cdot p_3 \\ &= g_3 + g_2 \cdot p_3 + g_1 \cdot p_2 \cdot p_3 + g_0 \cdot p_1 \cdot p_2 \cdot p_3 +c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \cdot p_3 \end{align*}

The outgoing carries c0…3 no longer depend on each other, thereby eliminating the “ripple effect”. The outgoing carries can now be implemented with only 3 gate delays (1 for p/g generation, 1 for the ANDs and 1 for the final OR assuming gates with 5 inputs).
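The expanded equations map directly onto two-level AND/OR logic. The sketch below is an illustration under assumptions, not the implementation from the repository: the module name `cla4_carry_sketch` and the vector ports are hypothetical.

```verilog
// 4-bit carry-lookahead logic sketch (hypothetical module and port names).
// c[n] is the outgoing carry of bit n; every carry depends only on p, g and ci.
module cla4_carry_sketch ( input  wire [3:0] p,   // propagate per bit
                           input  wire [3:0] g,   // generate per bit
                           input  wire       ci,  // carry into bit 0
                           output wire [3:0] c ); // outgoing carries

    assign c[0] = g[0] | (ci & p[0]);
    assign c[1] = g[1] | (g[0] & p[1]) | (ci & p[0] & p[1]);
    assign c[2] = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])
                       | (ci & p[0] & p[1] & p[2]);
    assign c[3] = g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
                       | (g[0] & p[1] & p[2] & p[3])
                       | (ci & p[0] & p[1] & p[2] & p[3]);
endmodule
```

A synthesis tool is free to restructure these expressions, so the gate-delay count holds for the equations rather than for any particular mapped netlist.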

The circuit below gives an example of a 4-bit carry look-ahead adder.

The complexity of the carry look-ahead logic increases dramatically with the number of bits. Instead of calculating higher bit carries, one may daisy-chain the carry logic as shown for the 12-bit adder below.

An implementation can be found at GitHub

#### Results

The propagation delay $$t_{pd}$$ depends on size $$n$$ and the value of the operands. For a given size $$n$$, adding the value 1 to an operand that contains all ones causes the longest propagation delay, because the carry has to travel through every bit position. The post-map Timing Analysis tool reveals the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

To improve speed for larger word sizes, we can add a second level of carry look-ahead. To facilitate this, we extend the cla circuit by adding $$p_{i,j}$$ and $$g_{i,j}$$ outputs. The propagate signal $$p_{i,j}$$ indicates that an incoming carry propagates from bit position $$i$$ to $$j$$. The generate signal $$g_{i,j}$$ indicates that a carry is generated at bit position $$j$$, or that a carry generated at a lower bit position propagates to position $$j$$.

For a 4-bit block the equations are \begin{align*} p_{0,3} &= p_3 \cdot p_2 \cdot p_1 \cdot p_0 \\ g_{0,3} &= g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 \\ c_o &= g_{0,3} + p_{0,3} \cdot c_i \end{align*}
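The block-level signals can be sketched in Verilog as follows. The module name `cla4_group_sketch` and the port names `pg` and `gg` (standing for $$p_{0,3}$$ and $$g_{0,3}$$) are assumptions made for this illustration.

```verilog
// Group propagate/generate sketch for one 4-bit block (hypothetical names).
module cla4_group_sketch ( input  wire [3:0] p,   // propagate per bit
                           input  wire [3:0] g,   // generate per bit
                           input  wire       ci,  // carry into the block
                           output wire       pg,  // block propagate, p(0,3)
                           output wire       gg,  // block generate,  g(0,3)
                           output wire       co );// carry out of the block

    // an incoming carry passes through only if every bit propagates
    assign pg = p[3] & p[2] & p[1] & p[0];

    // a carry generated at bit 3, or generated lower and propagated upward
    assign gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                     | (p[3] & p[2] & p[1] & g[0]);

    assign co = gg | (pg & ci);
endmodule
```

A second-level lookahead unit can then treat `pg` and `gg` of each block the same way the first level treats the per-bit p and g signals.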

An implementation can be found at GitHub

#### Results

Once more, the propagation delay $$t_{pd}$$ depends on size $$N$$ and the value of the operands. For a given size $$N$$, adding the value 1 to an operand that contains all ones causes the longest propagation delay.

Once more, the post-map Timing Analysis predicts the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. As usual, the exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

### Others

Following this “Carry-lookahead adders using logic gates”, the next chapter shows an implementation of a multiplier introduced in Chapter 7 of the inquiry “How do Computers do Math?”.

Implements an adder and subtractor using circuits of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGAs.

## Adder and subtractor using logic gates

The inquiry “How do Computers do Math?” introduced the carry-propagate adder and the borrow-propagate subtractor. Here we will recapitulate and implement these using Verilog HDL.

The full adder (fa) forms the basic building block. This full adder adds two 1-bit values a and b to the incoming carry (ci), and outputs a 1-bit sum (s) and a 1-bit outgoing carry (co). The circuit and Boolean equations, shown below, give the relations between these inputs and outputs.
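A full adder with these ports can be sketched as below. The module name `fa_sketch` is hypothetical, and the carry is written in the XOR-based form; the author's `math_adder_fa_block` may be written differently.

```verilog
// 1-bit full adder sketch (hypothetical module name).
module fa_sketch ( input  wire a,    // summand bit
                   input  wire b,    // summand bit
                   input  wire ci,   // incoming carry
                   output wire s,    // sum bit
                   output wire co ); // outgoing carry

    assign s  = a ^ b ^ ci;                // odd number of ones gives sum 1
    assign co = (a & b) | (ci & (a ^ b));  // two or more ones give a carry
endmodule
```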

To build an n-bit carry-propagate adder, we combine n fa blocks. The circuit adds the least significant bits and passes the carry on to the next bit, and so on. Combining the output bits forms the sum s. The circuit shown below gives an example of a 4-bit carry-propagate adder.

We describe the circuit using array instantiation. Here, a[] and b[] are the summands, ci[] is the incoming carry, co[] is the outgoing carry and s[] is the sum.

math_adder_fa_block fa [N-1:0] ( .a  ( a ),
.b  ( b ),
.ci ( {c[N-2:0], 1'b0} ),
.s  ( s[N-1:0] ),
.co ( {s[N], c[N-2:0]}) );

The compiler will optimize the fa blocks with forced inputs, but most fa blocks will compile to an RTL netlist as shown below.

The 4-bit adder compiles into the daisy chained fa blocks as shown.

The complete Verilog HDL code along with the test bench and constraint files are available through GitHub for Xilinx (Spartan-6 LX9) and Altera (Cyclone IV DE0-Nano) boards.

#### Results

The propagation delay $$t_{pd}$$ depends on size N and the value of the operands. For a given size N, adding the value 1 to an operand that contains all ones causes the longest propagation delay, because the carry ripples through every bit position.

The worst-case propagation delays for the Altera Cyclone IV on the DE0-Nano are found using the post-map Timing Analysis tool. The exact values depend on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

### Borrow-propagate subtractor

Similar to addition, the simplest subtraction method is borrow-propagate, as introduced in Chapter 7 of the inquiry “How do Computers do Math?”. Again, we will start by building a 1-bit subtractor (fs). The inputs a and b represent the 1-bit binary numbers being subtracted, where b is subtracted from a. Output d is the difference. li and lo are the incoming and outgoing borrow/loan signals.

The outputs d and lo can be expressed as a function of the inputs. The difference d is an Exclusive-OR function, just as the sum was for addition.
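A minimal sketch of such a 1-bit subtractor follows. The module name `fs_sketch` is hypothetical, and the borrow equation is the standard full-subtractor form, which may differ in notation from the author's.

```verilog
// 1-bit full subtractor sketch computing a - b - li (hypothetical module name).
module fs_sketch ( input  wire a,    // minuend bit
                   input  wire b,    // subtrahend bit
                   input  wire li,   // incoming borrow (loan)
                   output wire d,    // difference bit
                   output wire lo ); // outgoing borrow (loan)

    assign d  = a ^ b ^ li;                   // same XOR structure as the sum
    assign lo = (~a & b) | (~(a ^ b) & li);   // borrow when a < b, or when
                                              // a == b and a borrow comes in
endmodule
```

For example, with a=0, b=1 and li=1 the block computes 0 - 1 - 1 = -2, which yields d=0 with lo=1.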

To build a 4-bit subtractor we combine four of these building blocks.

The implementation of the borrow-propagate subtractor is very similar to the adder and can be found at GitHub.

Another method to subtract is based around the fact that $$a - b = a + (-b)$$. This allows us to build a circuit that can add or subtract. When the operation input op equals 1, it subtracts b from a, otherwise it adds the values.

Under two’s complement, subtracting b is the same as adding the bit-wise complement of b and adding 1. The input b is complemented by inverting its bits (using an XOR with signal op), and 1 is added by setting the least significant carry input to 1. We can build a 4-bit adder/subtractor using fa blocks as shown below, where r is the result.

Note that the circuit also includes overflow detection for two’s complement. Overflow occurs when c2 differs from the final carry-out c3.
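The adder/subtractor and its overflow check can be sketched as below. This is an illustration under assumptions: the module name `addsub4_sketch`, the output name `ovf` and the internal carry vector are hypothetical, and the full adders are written as equations rather than as fa instances.

```verilog
// 4-bit adder/subtractor sketch with two's complement overflow detection.
// Hypothetical names; op = 0 adds, op = 1 subtracts b from a.
module addsub4_sketch ( input  wire [3:0] a,
                        input  wire [3:0] b,
                        input  wire       op,
                        output wire [3:0] r,     // result
                        output wire       ovf ); // two's complement overflow

    wire [3:0] bx = b ^ {4{op}};  // invert b when subtracting
    wire [4:0] c;                 // c[0] is the carry in, c[n+1] the carry
                                  // out of bit n (so c2 is c[3], c3 is c[4])
    assign c[0] = op;             // the "+1" of the two's complement

    genvar n;
    generate
        for (n = 0; n < 4; n = n + 1) begin : stage
            assign r[n]   = a[n] ^ bx[n] ^ c[n];
            assign c[n+1] = (a[n] & bx[n]) | (c[n] & (a[n] ^ bx[n]));
        end
    endgenerate

    assign ovf = c[4] ^ c[3];     // carry into the sign bit differs from carry out
endmodule
```

For instance, a = 4'b0111 (7) plus b = 4'b0001 (1) with op = 0 gives r = 4'b1000 and raises `ovf`, because 8 does not fit in a signed 4-bit value.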

In all these ripple-carry adders and subtractors, the carry propagates from the lowest to the highest bit position. This propagation causes a delay that is linear with the number of bits.

Moving on from this math adder and subtractor using logic gates, the next chapter explores faster circuits.

## Introduction

This article series describes implementations of math operations using circuits of logic gates. Written in Verilog HDL for Altera and Xilinx FPGAs.

Primary school teaches our students methods for addition, subtraction, multiplication and division. Computer hardware implements similar methods and performs them with astounding speed. This enables applications ranging from computer vision, process control, encryption, hearing aids and video compression to dubious practices such as high-frequency trading.

This article describes how to build such hardware. As such, it is a sequel to the inquiry “How do Computers do Math?” that in the chapter “Math Operations Using Gates” introduced conceptual circuits using logic gates. We will model the various math operations using digital gates. Here we combine the gates into circuits and describe them using the Verilog Hardware Description Language (HDL). These Verilog HDL descriptions are compiled and mapped to a Field Programmable Gate Array (FPGA).

Working knowledge of the Verilog HDL is assumed. To learn more about Verilog HDL, I recommend the book FPGA Prototyping with Verilog Examples, an online class or lecture slides. To help you get up to speed with the development boards, I wrote Getting Started documents for two popular Altera and Xilinx boards.

We aim to study algorithms for implementation in generic VLSI, and as such will not use the highly optimized carry-chain connections present on many FPGAs.

### Hail to the FPGA

Let’s take this moment to honor the virtues of the FPGA. An FPGA can do many things at the same time and still respond immediately to input events. It is generally more complicated to create and debug the same logic in an FPGA compared to a microprocessor. Because of the challenges that FPGA development poses, many systems combine FPGAs and microprocessors to get the best of both worlds. For example, to recognize objects in a video stream, one would implement the pre-processing (noise removal, normalization, edge detection) in an FPGA, but the higher-level logic on a CPU or DSP.

With an FPGA, you can put massively parallel structures in place. For example, a high-speed data stream can be distributed across the whole FPGA chip to be processed in parallel, instead of having a microprocessor deal with it sequentially.

FPGAs can accelerate machine learning algorithms, video encoding, custom algorithms, compression, indexing and cryptography. By implementing a soft microprocessor as part of the FPGA, it can also handle high-level tasks such as board management, protocol bridging and security. [IEEExplore]

### Tools

The code examples were tested on an Altera “Cyclone IV E” FPGA using Quartus Prime 16.1. Earlier code iterations used a Xilinx “Spartan-6” with their ISE Design Suite. The Verilog descriptions should work equally well on other boards or environments.

1. Demonstration
4. Multiplier
5. Faster multiplier
6. Divider
7. Square root (and conclusion)

## Starting with Xilinx

This describes how to install the development environment for Avnet’s Xilinx Spartan-6 FPGA LX9 MicroBoard under Windows 10 x64. For Altera boards, refer to Getting started with FPGA design using Altera Quartus Prime. Another interesting Xilinx-based board is the XuLA (XC3S200A).

Note 1: you need to be logged into em.avnet.com to access the links. This might be the only occasion where I had to use the Edge browser to access the Support & Download area.

Note 2: I can’t describe the install for the ChipScope Pro, because my license expired.

## Install the Xilinx ISE Design Suite

1. ISE Design Suite
• extract the tar file, and run xsetup.exe
• select ISE WebPACK
• have some patience.
:
• will start automatically
• generate a node locked license for ISE WebPACK
• the Xilinx.lic will arrive as an email attachment
:
3. Running on Windows 10
• ISE is in maintenance mode, and doesn’t support Windows 10 (or 8.x)
• According to eevblog, crashes with file dialogs in ISE and iMPACT can be prevented by turning off Smart Heap. To do so:
• rename libPortability.dll to libPortability.dll.orig
• copy libPortabilityNOSH.dll to libPortability.dll in
• C:\Xilinx\14.7\ISE_DS\ISE\lib\nt64
• C:\Xilinx\14.7\ISE_DS\common\lib\nt64 (copy dll from first location)
:
4. See if it starts
• Double-click the ISE Design Suite icon on your desktop
:

## Install Avnet Board Support

1. Install the board description (XDB)
• Install the Spartan-6 FPGA LX9 MicroBoard XDB file from em.avnet.com
• unzip, then again
• unzip avnet_edk14_3_xbd_files.zip file to the \board folder.
:
2. Install USB-to-UART driver
• Extract, and install by running CP210xVCPInstaller_x64.exe
• Verify that Serial Port “Silicon Labs CP210x USB to UART Bridge (COMx)” appears in Device Manager when the micro-USB cable is plugged in. For details see the CP210x_setup_guide.
• Walk through the examples in the board’s Getting Started Guide. Note that instead of HyperTerm, you can use PuTTY.
:
3. Install USB-to-JTAG driver
• Run the executable to install
• Verify that the “Digilent USB Controller” appears in Device Manager when the USB Type A plug is plugged in.
:
4. Install JTAG programming utility
• Extract, and follow instructions in the enclosed Digilent_Plug-in_Xilinx_v14.pdf
• Copy the files from the nt64 folder to C:\Xilinx\14.7\ISE_DS\ISE\lib\nt64\plugins\Digilent\libCseDigilent\
:
5. Later, to program using the JTAG interface
• Xilinx ISE » Tools » iMPACT
• Double-click boundary scan
• Output -> Cable setup
• select “Digilent USB JTAG cable”
• the port will show your port(s)
• speed = select speed
• click Ok. The speed field will become empty. click Ok once more.
• Right-click the boundary scan window, and select Initialize chain (with the microboard connected)
• set the configuration file (.bit)
• select as target device
• Save the prj.
:

## A first circuit

I walked through the examples in the (old) book FPGA Prototyping with Verilog Examples. It targets Spartan 3, but still seems useful. In particular see chapter 2.6.1.

1. Create a new design
• Double-click the ISE icon on the desktop
• File » New Project
• location, working directory = ..
• name = eq2
• top level src type = HDL
• Evaluation Dev Board = Avnet Spartan-6 LX9 MicroBoard
• Synthesis tool = XST
• Simulator = ISim
• Project » New Source » Verilog Module
• Enter port names
• a input bus 1 0
• b input bus 1 0
• aeqb output
• Use the text editor to enter the code eq2.v as shown below
`timescale 1ns / 1ps

module eq2(
    input [1:0] a,
    input [1:0] b,
    output aeqb
);

wire e0, e1; // internal signal declaration

eq1 eq_bit0_unit(.i0(a[0]), .i1(b[0]), .eq(e0));
eq1 eq_bit1_unit(.i0(a[1]), .i1(b[1]), .eq(e1));

assign aeqb = e0 & e1; // a and b are equal if individual bits are equal
endmodule
• Project » New Source » Verilog Module
• Enter port names
• i0 input
• i1 input
• eq output
• Use the text editor to enter the code eq1.v as shown below
`timescale 1ns / 1ps
module eq1(
    input i0,
    input i1,
    output eq
);

wire p0, p1; // internal signal declaration

assign eq = p0 | p1 ;
assign p0 = ~i0 & ~i1 ;
assign p1 = i0 & i1 ;

endmodule
• Project » New Source » Implementation Constraints File
• Enter the physical I/O pin assignments as user constraints.
• Refer to the schematics, or hardware guide for details.
• Use the text editor to enter the code eq2.ucf as shown below.
CONFIG VCCAUX=3.3;
NET a<0> LOC = B3 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-1
NET a<1> LOC = A3 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-2
NET b<0> LOC = B4 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-3
NET b<1> LOC = A4 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-4
NET aeqb LOC = P4 | IOSTANDARD = LVCMOS18;            #LED D2
• Verify
• Select the desired source file
• In the process window (below), click the ‘+’ before “Synthesize – XST”
• Double-click “Check Syntax”
• The results will be shown in the transcript at the bottom
:
2. Synthesis
• Generates a .bit file to be uploaded to the FPGA later
• Select the top-level verilog file (has a little green square in the icon)
• In the Process Window, double click “Generate Programming File”
• The transcript at the bottom will show the results
• Correct problems if needed
• Check the design summary (Process Windows » Design Summary)
:
3. Create a test bench
• Project » New Source » Verilog Test Fixture
• name = eq2_test
• associate with eq2
• add the stimulus as shown in eq2_test.v below
`timescale 1ns / 1ps

module eq2_test;

reg [1:0] a;  // inputs
reg [1:0] b;

wire aeqb;  // output

// Instantiate the Unit Under Test (UUT)
eq2 uut ( .a(a),
.b(b),
.aeqb(aeqb) );

initial begin
a = 0;  // initialize inputs
b = 0;

#100;  // wait 100 ns for global reset to finish

// stimulus starts here
a = 2'b00; b = 2'b00; #100 $display("%b", aeqb);
a = 2'b01; b = 2'b00; #100 $display("%b", aeqb);
a = 2'b01; b = 2'b11; #100 $display("%b", aeqb);
a = 2'b10; b = 2'b10; #100 $display("%b", aeqb);
a = 2'b10; b = 2'b00; #100 $display("%b", aeqb);
a = 2'b11; b = 2'b11; #100 $display("%b", aeqb);
a = 2'b11; b = 2'b01; #100 $display("%b", aeqb);
end
endmodule
:
4. Behavior Simulation
• Xilinx ISE comes packaged with the ISim simulator. It is straightforward to use and fine for basic test benches. Other choices are ModelSim (hard to install under Windows 10, in my case the install suddenly continued after >24 hours), Active-HDL, and the online tool edaplayground.com.
• Design Window (top left) » Simulation radio-button. In the drop-down list below it, select “Behavioral” view.
• Select the eq2_test.v file
• In the Process Window, double-click the Simulate Behavior Model
• Will give a Windows Security Alert for isimgui.exe. Allow it access.
• Navigate the ISim window to verify functionality. Use F7 to zoom out. We expect an output like:
:
5. Timing Simulation
• In the Design Window (top left), select “Simulation”. In the drop-down list below it, select “Post-Route”.
• Select the eq2_test.v file
• In the Process Window, double-click “Simulate Post-Place & Route Model”. This will reveal the timing delays as shown below
:
6. Configure FPGA
• Plug in the USB type B connector from the LX9 microboard
• Process Window » double-click “Configure Target Device”
• Before starting iMPACT, it will warn you that “No iMPACT project file exists”. Click OK to proceed.
• Double-click “Boundary Scan”
• Right-click in the right window, and select “Cable Setup”
• Communication Mode = “Digilent USB JTAG cable”
• Verify that the port shows up
• speed = select speed
• click OK twice
• Right-click in the right window, and select “Initialize chain”
• assign the configuration file (.bit) created earlier
• do not attach SPI or BPI PROM
• click OK
• right-click the Xilinx block, and select “Set Target Device”
• File » Save Project as “eq2” in the same directory as the source files.
• Will tell you to “Set the new project file from the Configure Target Device process properties”. Don’t worry, it seems to do this automatically. Click OK to proceed.
• Right-click the Xilinx block, and select program.
• This should report “Program Succeeded”
• Close iMPACT, and save it once more on the way out.
:
7. Give it a spin
• It is finally time to try the real FPGA board
• Input is through the DIP switches (SW1) on the left of the FPGA.
• Output is the red LED (D2) located just below the FPGA.
• We expect the LED to be “on” when switch positions 1 and 2 are identical to positions 3 and 4.
• If you prefer bigger switches, I suggest wiring up a breadboard to the PMOD1 (J5) connectors.
• Vcc and Ground are available on pin 5 (or 11) and pin 6 (or 12), respectively.
• Remember to modify the user constraints file accordingly. For reference, I attached a fairly complete user constraints file spartan6-lx9.ucf.

See Xilinx’ student area for more info

c’est tout

## Implementation

Shows an implementation of the LC-3 instruction set in Verilog HDL. Includes test benches and simulation results.

One dark Oregon winter afternoon, I said “Let’s build a micro processor”. What started as a noble thought became a rather intense but fun project.

This section describes the implementation of the LC-3 using a Field Programmable Gate Array (FPGA). An FPGA is an array of blocks with basic functionality such as a lookup table, a full adder and a flip-flop. For more information on FPGAs refer to the section Programmable Logic in the inquiry “How do computers do math?”.

The FPGA used to implement the LC-3 microprocessor is a Xilinx Spartan6, but others will fit equally well. My choice was inspired by the pricing of the development board and the fairly good free development tools. Other choices would be Altera for the FPGA, their IDE or Icarus Verilog for the synthesizer and simulator and GTKWave for the waveform viewer. Refer to the end of this article for links and references to introductory Verilog books.

### Schematic

The top level schematic is shown below.
The modules are defined using Verilog, a hardware description language (HDL) used to model digital logic. This is my first Verilog implementation, please bear with me ..

### State

#### State.v

Implementation of the LC-3 instruction set in Verilog, source file State.v:

module State(
    input clock,
    input reset,
    input [4:0] cCtrl,      // controller control signal
    input eREADY,           // external memory ready signal
    output wire pEn,        // update PC enable
    output wire fEn,        // fetch output enable
    output wire dEn,        // decode enable
    output wire [2:0] mOp,  // memory operation selector
    output wire rWe );      // register write enable

`include "UpdatePC.vh"
`include "Fetch.vh"
`include "Decode.vh"
`include "Registers.vh"
`include "MemoryIF.vh"

parameter [3:0] STATE_UPDATEPC = 4'd0,   // update program counter
                STATE_FETCH    = 4'd1,   // fetch instruction
                STATE_DECODE   = 4'd2,   // decode
                STATE_ALU      = 4'd3,   // ALU
                STATE_ADDRNPC  = 4'd4,   // calc tPC address
                STATE_ADDRMEM  = 4'd5,   // calc memory address
                STATE_INDMEM   = 4'd6,   // indirect memory address
                STATE_RDMEM    = 4'd7,   // read memory
                STATE_WRMEM    = 4'd8,   // write memory
                STATE_WRREG    = 4'd9,   // write register
                STATE_ILLEGAL  = 4'd15;  // illegal state

parameter EREADY_INA = 1'b0,  // external memory not ready
          EREADY_ACT = 1'b1,  // external memory ready
          EREADY_X   = 1'bx;

wire [1:0] iType   = cCtrl[4:3];  // instruction type (00=alu, 01=ctrl, 10=mem)
wire [1:0] maType  = cCtrl[2:1];  // memory access type (00=indaddr, 01=read, 10=write, 11=updreg)
wire       indType = cCtrl[0];    // indirect memory access type

reg [3:0] state;   // current state
reg [3:0] nState;  // next state
reg [6:0] out;     // current output signals
reg [6:0] nOut;    // next output signals

assign pEn = out[6];
assign fEn = out[5];
assign dEn = out[4];
assign mOp = out[3:1];
assign rWe = out[0];

// the combinational logic
always @(state, eREADY, iType, maType, indType, out)
    casex ({state, eREADY, iType, maType, indType})
        {STATE_UPDATEPC, EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_FETCH;    nOut = {PEN_0, FEN_1, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_FETCH,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_DECODE;   nOut = {PEN_0, FEN_0, DEN_1, MOP_NONE, RWE_0}; end
        {STATE_DECODE,   EREADY_X,   ITYPE_ALU, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ALU;      nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_DECODE,   EREADY_X,   ITYPE_CTL, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ADDRNPC;  nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_DECODE,   EREADY_X,   ITYPE_MEM, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ADDRMEM;  nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_IND, INDTYPE_X}  : begin nState = STATE_INDMEM;   nOut = {PEN_0, FEN_0, DEN_0, MOP_RD,   RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_RD,  INDTYPE_X}  : begin nState = STATE_RDMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_RD,   RWE_0}; end
        {STATE_INDMEM,   EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_RD} : begin nState = STATE_RDMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_RDI,  RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_WR,  INDTYPE_X}  : begin nState = STATE_WRMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_WR,   RWE_0}; end
        {STATE_INDMEM,   EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_WR} : begin nState = STATE_WRMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_WR,   RWE_0}; end
        {STATE_ALU,      EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_REG, INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_RDMEM,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_WRMEM,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_WRREG,    EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_ADDRNPC,  EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_FETCH,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state; nOut = out; end
        {STATE_INDMEM,   EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state; nOut = out; end
        {STATE_RDMEM,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state; nOut = out; end
        {STATE_WRMEM,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state; nOut = out; end
        default : begin nState = STATE_ILLEGAL; nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
    endcase

// the sequential logic
always @(negedge clock, posedge reset)
    if (reset) begin
        state <= STATE_UPDATEPC;
        out   <= {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0};
    end
    else begin
        state <= nState;
        out   <= nOut;
    end

endmodule

### Decode

#### Decode.vh

Implementation of the LC-3 instruction set in Verilog, header file Decode.vh:

parameter DEN_0 = 1'b0,  // Decode enable
          DEN_1 = 1'b1;

parameter [1:0] ITYPE_ALU = 2'b00,  // generalized instruction type
                ITYPE_CTL = 2'b01,
                ITYPE_MEM = 2'b10,
                ITYPE_HLT = 2'b11,
                ITYPE_X   = 2'bxx;

parameter [1:0] MATYPE_IND = 2'b00,  // generalized memory access type
                MATYPE_RD  = 2'b01,
                MATYPE_WR  = 2'b10,
                MATYPE_REG = 2'b11,
                MATYPE_X   = 2'bxx;

parameter INDTYPE_WR = 1'b0,  // generalized memory indirection type
          INDTYPE_RD = 1'b1,
          INDTYPE_X  = 1'bx;

#### Decode.v

Implementation of the LC-3 instruction set in Verilog, source file Decode.v:

module Decode(
    input clock,
    input reset,
    input en,                  // input enable
    input [15:0] eDIN,         // external memory data input
    input [2:0] psr,           // processor status register
    output [4:0] cCtrl,        // various control signals
    output [1:0] drSrc,        // selects what to write to DR
    output [2:0] uOp,          // selects ALU operation
    output aOp,                // selects Address operation
    output pNext,              // selects if PC should branch
    output [2:0] sr1ID,        // source register 1 ID
    output [2:0] sr2ID,        // source register 2 ID
    output [2:0] drID,         // destination register ID
    output wire [4:0] imm,     // lower 5 bits from IR value
    output wire [8:0] offset ); // lower 9 bits from IR value

`include "ALU.vh"
`include "Address.vh"
`include "MemoryIF.vh"
`include "DrMux.vh"
`include "UpdatePC.vh"
`include "Decode.vh"

parameter [2:0] ID_X = 3'bxxx;

// Instruction Register (ir)
// read instruction from external memory bus (after Fetch initiated the bus cycle)
reg [15:0] ir;
assign imm    = ir[4:0];  // output the lower 5 bits
assign offset = ir[8:0];  // output the lower 9 bits

always @(posedge clock, posedge reset)
    if (reset)            ir <= 16'hffff;
    else if (en == DEN_1) ir <= eDIN;

parameter [3:0] // opcodes for the instructions
    I_BR  = 4'b0000,
    I_ADD = 4'b0001,
    I_LD  = 4'b0010,
    I_ST  = 4'b0011,
    I_AND = 4'b0101,
    I_LDR = 4'b0110,
    I_STR = 4'b0111,
    I_NOT = 4'b1001,
    I_LDI = 4'b1010,
    I_STI = 4'b1011,
    I_JMP = 4'b1100,
    I_LEA = 4'b1110,
    I_HLT = 4'b1111;

reg [20:0] ctl;  // current control signal bundle

// untangle control signal bundle
assign cCtrl = ctl[ 20:16 ];  // { iType, maType, indRd }
assign uOp   = ctl[ 15:13 ];
assign aOp   = ctl[ 12 ];
assign drSrc = ctl[ 11:10 ];
assign pNext = ctl[ 9 ];
assign drID  = ctl[ 8: 6 ];
assign sr1ID = ctl[ 5: 3 ];
assign sr2ID = ctl[ 2: 0 ];

// combinational logic to determine control signals
wire [2:0] uOpAddC = (ir[5]) ? UOP_ADDIMM : UOP_ADDREG;       // candidate for uOp in case of ADD instruction
wire [2:0] uOpAndC = (ir[5]) ? UOP_ANDIMM : UOP_ANDREG;       // candidate for uOp in case of AND instruction
wire       pNextC  = |(ir[11:9] & psr) ? PNEXT_TPC : PNEXT_NPC; // candidate for pNext in case of BR instruction

always @(ir[15:12], uOpAddC, uOpAndC, pNextC)
    //                      State      State       State       ALU      Address  DrSource    UpdatePC   Registers Registers Registers
    case (ir[15:12]) //     iType      maType      indType     uOp      aOp      drSrc       pNext      drID      sr1ID     sr2ID
        I_ADD   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  uOpAddC, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6],  ir[2:0] };
        I_AND   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  uOpAndC, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6],  ir[2:0] };
        I_NOT   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  UOP_NOT, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6],  ID_X    };
        I_BR    : ctl = {ITYPE_CTL, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_X,    pNextC,    ID_X,     ID_X,     ID_X    };
        I_JMP   : ctl = {ITYPE_CTL, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_X,    PNEXT_TPC, ID_X,     ir[8:6],  ID_X    };
        I_LD    : ctl = {ITYPE_MEM, MATYPE_RD,  INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ID_X,     ID_X    };
        I_LDR   : ctl = {ITYPE_MEM, MATYPE_RD,  INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ir[8:6],  ID_X    };
        I_LDI   : ctl = {ITYPE_MEM, MATYPE_IND, INDTYPE_RD, UOP_X,   AOP_NPC, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ID_X,     ID_X    };
        I_LEA   : ctl = {ITYPE_MEM, MATYPE_REG, INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_ADDR, PNEXT_NPC, ir[11:9], ID_X,     ID_X    };
        I_ST    : ctl = {ITYPE_MEM, MATYPE_WR,  INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_X,    PNEXT_NPC, ID_X,     ID_X,     ir[11:9]};
        I_STR   : ctl = {ITYPE_MEM, MATYPE_WR,  INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_X,    PNEXT_NPC, ID_X,     ir[8:6],  ir[11:9]};
        I_STI   : ctl = {ITYPE_MEM, MATYPE_IND, INDTYPE_WR, UOP_X,   AOP_NPC, DRSRC_X,    PNEXT_NPC, ID_X,     ID_X,     ir[11:9]};
        default : ctl = {ITYPE_HLT, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_X,   DRSRC_X,    PNEXT_X,   ID_X,     ID_X,     ID_X    };
    endcase

endmodule

### UpdatePC

#### UpdatePC.vh

Implementation of the LC-3 instruction set in Verilog, header file UpdatePC.vh:

parameter PNEXT_NPC = 1'b0,  // UpdatePC branch signal
          PNEXT_TPC = 1'b1,
          PNEXT_X   = 1'bx;

parameter PEN_0 = 1'b0,  // UpdatePC enable
          PEN_1 = 1'b1;

#### UpdatePC.v

Implementation of the LC-3 instruction set in Verilog, source file UpdatePC.v:

module UpdatePC(
    input clock,
    input reset,
    input en,                // enable signal
    input [15:0] tPC,        // target program counter
    input pNext,             // if 1 then branch to tPC
    output reg [15:0] pc,    // program counter
    output reg [15:0] nPC ); // next program counter (pc+1)

`include "UpdatePC.vh"

wire [15:0] a = (pNext) ? tPC : nPC;      // if pNext==1, then jump to tPC
wire [15:0] b = (en == PEN_1) ? a : pc;   // change PC only in "Update PC" state
wire [15:0] c = b + 1'b1;                 // use carry input

always @(posedge clock, posedge reset)
    if (reset) begin
        pc  <= 16'h3000;
        nPC <= 16'h3001;
    end
    else begin
        pc  <= b;
        nPC <= c;
    end

endmodule

### Fetch

#### Fetch.vh

Implementation of the LC-3 instruction set in Verilog, header file Fetch.vh:

parameter FEN_0 = 1'b0,  // fetch enable
          FEN_1 = 1'b1;

#### Fetch.v

Implementation of the LC-3 instruction set in Verilog, source file Fetch.v:

module Fetch(
    input en,                 // output enable
    input [15:0] pc,          // program counter
    output reg iBR,           // internal bus request
    output reg [15:0] iADDR,  // internal memory address lines
    output reg iWEA );        // internal memory write enable

`include "Fetch.vh"

always @(en, pc)
    begin
        iBR   = ( en == FEN_1 ) ? 1'b1 : 1'b0;
        iADDR = ( en == FEN_1 ) ? pc   : 16'hxxxx;
        iWEA  = ( en == FEN_1 ) ? 1'b0 : 1'bx;
    end

endmodule

### Registers

#### Registers.vh

Implementation of the LC-3 instruction set in Verilog, header file Registers.vh:

parameter RWE_0 = 1'b0,  // register write enable
          RWE_1 = 1'b1;

parameter [2:0] PSR_POSITIVE = 3'b001,  // processor status register bits
                PSR_ZERO     = 3'b010,  // should match BR instruction
                PSR_NEGATIVE = 3'b100;

#### Registers.v

Implementation of the LC-3 instruction set in Verilog, source file Registers.v:

module Registers(
    input clock,
    input reset,
    input we,                // write enable
    input [2:0] sr1ID,       // source register 1 ID
    input [2:0] sr2ID,       // source register 2 ID
    input [2:0] drID,        // destination register ID
    input [15:0] dr,         // destination register value
    output reg [15:0] sr1,   // source register 1 value
    output reg [15:0] sr2,   // source register 2 value
    output reg [2:0] psr );  // processor status register

`include "Registers.vh"

reg [3:0] id;
reg [15:0] gpr [0:7];  // general purpose registers

// write the destination register value, and update Processor Status Register (psr)
always @(posedge clock, posedge reset)
    if (reset)
        for (id = 0; id < 8; id = id + 1)  // initialize all eight registers to 0
            gpr[ id ] <= 16'h0000;
    else if (we == RWE_1)                  // when enabled by the FSM
        begin
            if (dr[ 15 ])                  // update processor status register (neg,zero,pos)
                psr <= PSR_NEGATIVE;
            else if (|dr)
                psr <= PSR_POSITIVE;
            else
                psr <= PSR_ZERO;
            gpr[ drID ] <= dr;             // write the value dr to the register identified by drID
        end

// output the value of the register identified by "sr1ID" on output "sr1"
// output the value of the register identified by "sr2ID" on output "sr2"
always @(sr1ID, sr2ID, gpr[ sr1ID ], gpr[ sr2ID ])
    begin
        sr1 = gpr[ sr1ID ];
        sr2 = gpr[ sr2ID ];
    end

endmodule

### ALU

#### ALU.vh

parameter [2:0] UOP_ADDREG = 3'b000,  // ALU operation
                UOP_ADDIMM = 3'b001,
                UOP_ANDREG = 3'b010,
                UOP_ANDIMM = 3'b011,
                UOP_NOT    = 3'b100,
                UOP_X      = 3'bxxx;

#### ALU.v

module ALU(
    input [2:0] uOp,          // operation selector
    input [15:0] sr1,         // source register 1 value (SR1)
    input [15:0] sr2,         // source register 2 value (SR2)
    input [4:0] imm,          // lower 5 bits from instruction register
    output reg [15:0] uOut ); // result of ALU operation

`include "ALU.vh"

wire [15:0] imm5 = ({ {11{imm[4]}}, imm[4:0] });  // sign extend to 16 bits

always @(uOp or sr1 or sr2 or imm5)
    casex (uOp)
        3'b000: uOut = sr1 + sr2;   // ADD Mode 0
        3'b001: uOut = sr1 + imm5;  // ADD Mode 1
        3'b010: uOut = sr1 & sr2;   // AND Mode 0
        3'b011: uOut = sr1 & imm5;  // AND Mode 1
        3'b1xx: uOut = ~(sr1);      // NOT
    endcase

endmodule

### Address

#### Address.vh

parameter AOP_SR1 = 1'b0,  // address operation
          AOP_NPC = 1'b1,
          AOP_X   = 1'bx;

#### Address.v

module Address(
    input aOp,                // operation selector
    input [15:0] sr1,         // value source register 1
    input [15:0] nPC,         // next program counter (PC), always PC+1
    input [8:0] offset,       // lower 9 bits from instruction register
    output reg [15:0] aOut ); // target program counter

`include "Address.vh"

wire [15:0] offset6 = ({{10{offset[5]}}, offset[5:0]});  // sign extend the 6-bit offset
wire [15:0] offset9 = ({{ 7{offset[8]}}, offset[8:0]});  // sign extend the 9-bit offset

always @(aOp or sr1 or nPC or offset6 or offset9)
    case (aOp)
        AOP_SR1 : aOut = sr1 + offset6;  // register + offset
        AOP_NPC : aOut = nPC + offset9;  // next PC + offset
    endcase

endmodule

### MemoryIF

#### MemoryIF.vh

parameter [2:0] MOP_NONE = 3'b000,  // MemoryIF operation
                MOP_RD   = 3'b100,
                MOP_RDI  = 3'b101,
                MOP_WR   = 3'b110,
                MOP_WRI  = 3'b111;

#### MemoryIF.v

module MemoryIF(
    input [2:0] mOp,    // memory operation selector
    input [15:0] sr2,   // source register 2 value
    input [15:0] addr,  // address for read or write
    input [15:0] eDIN,  // external memory data input
    output reg iBR,     // internal bus request
    output reg 
[15:0] iADDR, // internal memory address lines output tri [15:0] eDOUT, // internal memory data output output reg iWEA ); // internal memory write enable include "MemoryIF.vh" reg [15:0] eDOUTr; assign eDOUT = eDOUTr; always @(mOp, sr2, addr, eDIN) case (mOp) MOP_RD : begin iBR=1; iWEA = 1'b0; iADDR = addr; eDOUTr = 16'hzzzz; end MOP_RDI : begin iBR=1; iWEA = 1'b0; iADDR = eDIN; eDOUTr = 16'hzzzz; end MOP_WR : begin iBR=1; iWEA = 1'b1; iADDR = addr; eDOUTr = sr2; end MOP_WRI : begin iBR=1; iWEA = 1'b1; iADDR = eDIN; eDOUTr = sr2; end default : begin iBR=0; iWEA = 1'bx; iADDR = 16'hxxxx; eDOUTr = 16'hzzzz; end endcase endmodule ### DrMux #### DrMux.vh parameter [1:0] DRSRC_ALU = 2'b00, // destination register source selector DRSRC_MEM = 2'b01, DRSRC_ADDR = 2'b10, DRSRC_X = 2'bxx; #### DrMux.v module DrMux( input [1:0] drSrc, // multiplexor selector input [15:0] eDIN, // external memory data input input [15:0] addr, // effective memory address input [15:0] uOut, // result from ALU output reg [15:0] dr ); // data that will be stored in DR include "DrMux.vh" always @(drSrc or uOut or eDIN or addr) case (drSrc) DRSRC_ALU : dr = uOut; DRSRC_MEM : dr = eDIN; DRSRC_ADDR : dr = addr; default : dr = 16'hxxxx; endcase endmodule: ### BusDriver #### BusDriver.v module BusDriver( input br0, // input 0, bus request input [15:0] iADDR0, // input 0, internal memory address lines input iWEA0, // input 0, internal memory write enable input br1, // input 1, bus request input [15:0] iADDR1, // input 1, internal memory address lines input iWEA1, // input 1, internal memory write enable output tri [15:0] eADDR, // external memory address lines output tri eWEA ); // external memory write enable assign eWEA = br1 ? iWEA1 : br0 ? iWEA0 : 1'bz; assign eADDR = br1 ? iADDR1 : br0 ? iADDR0 : 16'hzzzz; endmodule ## Functional simulation The functionality of our microprocessor can be tested by building a test bench. 
The bench will supply the clock signal and reset pulse and simulate a random access memory (RAM) containing the test program. The program is written using in a assembly language and compiled using LC3Edit. ### Test program Exercises a variety of instructions: • memory read • alu • memory write • control instructions Written in assembly language LC3Edit compiles this into the object file: As part of the compilation, LC3Edit also creates a .hex file. The contents of this file can be tweaked into a .coe file to be preloaded in the test bench memory. ### Memory The random access memory (RAM) is created using Xilinx IP’s Block Memory Generator 6.2. The following parameters are used: • native i/f, single port ram, no byte write enable, minimum area algorithm, width 16, depth 4096, write first, use ENA pin, no output registers, no RSTA pin, enable all warnings. Initialize memory from the .coe file. ### Test bench and clock/reset signals Generate a 50 MHz symmetric clock. Integrate the parts into a test bench using Verilog. timescale 1ns / 1ps module SimpleLC3_SimpleLC3_sch_tb(); reg clock; // clock (generated by test fixture) reg reset; // reset (generated by test fixture) wire [15:0] eADDR; // external address (from LC3 to memory) wire [15:0] eDIN; // external data (from memory to LC3) wire [15:0] eDOUT; // external data (from LC3 to memory) wire eWEA; // external write(~read) enable (from LC3 to memory) // Instantiate the Unit Under Test SimpleLC3 UUT ( .eDOUT(eDOUT), .eWEA(eWEA), .clock(clock), .eREADY(1’b1), // ready (always 1, for now) .eDIN(eDIN), .reset(reset), .eADDR(eADDR) ); // Instantiate the Memory, created using Xilinx IP’s Block Memory Generator 6.2: // Initialize from memory.coe, created from compiling memory.asm using LC3Edit. 
memory RAM( .clka(clock), .ena(eENA), .wea(eWEA), .addra(eADDR[11:0]), .dina(eDOUT), .douta(eDIN) ); wire eENA = |(eADDR[15:12] == 4’h3); // memory is at h3xxx initial begin clock = 0; reset = 0; #15 reset = 1; // wait for global reset to finish #22 reset = 0; end always #10 clock <= ~clock; // 20 ns clock period (50 MHz) endmodule;[/code] ### Simulation results For the functional simulation we use ISim that comes bundled with the Xilinx IDE. The simulation needs to be ran for 1600 ns. Waveform diagrams are shown below (click to enlarge) ### Timing simulation The free Xilinx IDE doesn’t support timing simulations. Instead we will use Icarus Verilog for the synthesis and simulation, GTKWave for viewing the generated waveforms, and Emacs verilog-mode for editing. We will run them natively under Linux. For those interested, Windows binaries are available from bleyer.org. This concludes the “Implementation of the LC-3 instruction set in Verilog”. ## Design Presents a CPU Design for LC-3 instruction set, that we later implement using Verilog HDL. The illustrations help visualize the design. The instruction set is based on the book Introduction to Computer Systems by Patt and Partel. For this text we push the simplicity of this little microprocessor (LC-3) even further as described in Instruction Set. ## Design The microprocessor consists of a Data Path and a Control Unit. Together they implement the various instruction phases. This section describes an architecture for the LC-3. It aims at staying true to the von Neumann architecture and instruction cycle names. However, here we assume the program counter and instruction register are in the data path. ### Data Path The schematic below shows the Data Path. We use the following conventions • The shaded blocks are modules that implement various functionality. The module names have been chosen to reflect the instruction phases. • Signals connect the blocks. 
A signal can be a single wire, or a collection of wires such as the 16 bits that represent the value of the program counter. Signal names are chosen to overlap with operand names where possible.
• The microprocessor connects to an external memory through the external interface.

#### Modules

| Module | Description |
|---|---|
| UpdatePC | Maintains the program counter, pc. |
| Fetch | Initiates the bus cycle, to read the instruction pointed to by pc. |
| Decode | Reads the instruction from the memory bus and extracts its operands. |
| Registers | Maintains the register values and processor status register. |
| ALU | Performs arithmetic and logical operations. |
| Address | Calculates the memory address for memory or control instructions. |
| MemoryIF | Initiates the external memory bus cycle to read or write data. |
| DrMux | Destination register multiplexor, selects the value that will be written to the destination register. |
| BusDriver | Simple arbiter for memory requests from Fetch and MemoryIF. |

#### Signals

| Group | Signal | Description |
|---|---|---|
| Program counters | pc | Program counter |
| | nPC | Next program counter (always has the value pc+1) |
| | tPC | Target program counter, for JMP / BR*. |
| Operands | sr1ID | Source register 1 identifier. Also used as baseRID for JMP / LDR / STR. |
| | sr2ID | Source register 2 identifier. Also used as srID for ST / STI / STR. |
| | imm | Immediate value |
| | offset | Memory address offset |
| Register values | sr1 | Value of the register identified by sr1ID |
| | sr2 | Value of the register identified by sr2ID |
| | dr | Value written to the register identified by drID |
| | psr | Value of the processor status register |
| Intermediate values | uOut | Result of the ALU operation |
| | aOut=addr=tPC | Result of the address calculation |
| External bus | eADDR | Memory address |
| | eDIN | Instruction/data being read from memory |
| | eDOUT | Data being written to memory |
| | eWEA | Write enable signal going to memory. Value 0 for read, 1 for write. |
| Internal bus | iBR0, iBR1 | Internal bus request signals |
| | iADDR0, iADDR1 | Internal memory addresses |
| | iWEA0, iWEA1 | Internal write enable signals |

#### Examples

##### Read memory

Assume: the instruction at address 3000 is 201F. Assigning the label LDv to memory location 3020, this instruction decodes to

| Address | Value | Label | Mnemonic |
|---|---|---|---|
| x3000 | x201F | | LD R0, LDv |
| x3020 | x1234 | LDv | |

Issuing a reset triggers the following sequence of events:

| # | Module | Action | Signals |
|---|---|---|---|
| 1. | UpdatePC | Resets the program counter to its initial value | pc=3000, nPC=3001 |
| 2. | Fetch | Starts a read cycle for the instruction | iBR0=1, iADDR0=3000, iWEA0=0 |
| 3. | BusDriver | Forwards the read cycle to the external memory bus | eADDR=3000, eWEA=0 |
| 4. | ExtMemory | Responds with the instruction | eDIN=201f |
| 5. | Decode | Extracts the operands | offset=1f, drID=0 |
| 6. | Address | Adds the offset to nPC | addr=3020 |
| 7. | MemoryIF | Starts a read cycle for the data | iBR1=1, iADDR1=3020, iWEA1=0 |
| 8. | BusDriver | Forwards the read cycle to the external memory bus | eADDR=3020, eWEA=0 |
| 9. | ExtMemory | Responds with the data | eDIN=1234 |
| 10. | DrMux | Selects the eDIN input | dr=1234 |
| 11. | Registers | Writes the value dr to the register identified by drID | |

##### ALU operation

Assume: pc=3003, the register R0=1234 and R1=4321. The instruction at the next address 3004 is 1801. This instruction decodes to

| Address | Value | Label | Mnemonic |
|---|---|---|---|
| x3004 | x1801 | | ADD R4, R0, R1 |

The following sequence of events will happen:

| # | Module | Action | Signals |
|---|---|---|---|
| 1. | UpdatePC | Increments the program counter | pc=3004 |
| 2. | Fetch | Starts a read cycle for the instruction | iBR0=1, iADDR0=3004, iWEA0=0 |
| 3. | BusDriver | Forwards the read cycle to the external memory bus | eADDR=3004, eWEA=0 |
| 4. | ExtMemory | Responds with the instruction | eDIN=1801 |
| 5. | Decode | Extracts the operands | sr1ID=0, sr2ID=1, drID=4 |
| 6. | Registers | Supplies the values for the registers identified by sr1ID and sr2ID | sr1=1234, sr2=4321 |
| 7. | ALU | Calculates the sum of sr1 and sr2 | uOut=5555 |
| 8. | DrMux | Selects the uOut input | dr=5555 |
| 9. | Registers | Writes the value dr to the register identified by drID | |

##### Write memory

Assume: pc=3007, register R4=AAAA and the label STIa refers to data address 3024 containing the value 3028. The instruction at the next address 3008 is B81D. This instruction decodes to

| Address | Value | Label | Mnemonic |
|---|---|---|---|
| x3008 | xB81D | | STI R4, STIa |
| x3024 | x3028 | STIa | |
| x3028 | xBAD0 | | |

The following sequence of events will happen:

| # | Module | Action | Signals |
|---|---|---|---|
| 1. | UpdatePC | Increments the program counter | pc=3008, nPC=3009 |
| 2. | Fetch | Starts a read cycle for the instruction | iBR0=1, iADDR0=3008, iWEA0=0 |
| 3. | BusDriver | Forwards the read cycle to the external memory bus | eADDR=3008, eWEA=0 |
| 4. | ExtMemory | Responds with the instruction | eDIN=b81d |
| 5. | Decode | Extracts the operands (sr2ID represents the SR operand) | sr2ID=4, offset=1d |
| 6. | Registers | Supplies the value for the register identified by sr2ID | sr2=aaaa |
| 7. | Address | Adds the offset to nPC | addr=3024 |
| 8. | MemoryIF | Starts a read cycle to retrieve the address where the value will be stored | iBR1=1, iADDR1=3024, iWEA1=0 |
| 9. | BusDriver | Forwards the read cycle to the external memory bus | eADDR=3024, eWEA=0 |
| 10. | ExtMemory | Responds with the value | eDIN=3028 |
| 11. | MemoryIF | Starts a write cycle to write the value of register R4 to address 3028 | iBR1=1, iADDR1=3028, iWEA1=1 |
| 12. | BusDriver | Forwards the write cycle to the external memory bus | eADDR=3028, eWEA=1 |

### Control unit

Instructions can be broken up into micro instructions. These can be implemented using a finite state machine (FSM), where each state corresponds to one micro instruction. The finite state machine can be visualized as shown in the figure below.
• circles represent the states, identified by a unique number and name.
• the double circle represents the initial state.
• arrows represent state transitions. Labels represent the condition that must be met for the transition to occur.
• shading is used to identify the implementation modules.
• eREADY indicates that the external memory finished a read or write operation.
• iType, maType, indType refer to the generalized instruction types generated by Decode.

#### State diagram

#### Details

• Policies:
  • State transitions are only possible during the falling edge of the clock signal (from 1 to 0);
  • Outputs to the external memory interface are driven in response to state transitions;
  • Inputs from the external memory interface are sampled on the rising edge of the clock signal (from 0 to 1);
  • Control signals change only during the falling edge of the clock signal to minimize glitches.
• Each state:
  • depends on both the input signals and the previous state;
  • generates control signals for the data path (with the help of the Decode module).
• The control unit consists of two modules:
  • State, implements the state machine, and generates state specific control signals.
  • Decode, generalizes the instruction for the state machine, and generates state independent control signals.

#### Schematic for the control unit

The next section describes the signals for the control unit in the CPU Design for LC-3 instruction set.

#### Signals for the control unit

| Group | Signal | Description |
|---|---|---|
| External interface | eREADY==1 | Indicates that the external memory finished a read or write operation. |
| | clock | Externally supplied clock |
| Internal to the State module | state | Current state |
| | nState | Next state as determined by the combinational logic |
| Generalized instruction types (bundled into cCtrl) | iType | Instruction type |
| | maType | Memory access type |
| | indType | Indirect memory access type |
| Data path control | pNext | Signals UpdatePC to change the program counter to tPC. |
| | pEn | Enables UpdatePC to change the program counter. |
| | fEn | Enables Fetch to start the external memory bus cycle to read the instruction. |
| | dEn | Enables Decode to read the instruction from the external memory bus. |
| | rWe | Enables Registers to store the value of dr in the register identified by drID. |
| | uOp | Chooses the operation and inputs of the ALU. |
| | aOp | Chooses the operation and inputs of the Address calculation. |
| | mOp | Chooses the memory operation to be performed by MemoryIF. |
| | drSrc | Selects the destination register source input on DrMux. |

The next section gives a detailed description of the modules for the CPU Design for LC-3 instruction set.

#### Modules (detailed description)

• State
  • Generates the state specific control signals for each micro instruction being executed. Refer to the signals described above for details.
• UpdatePC
  • Updates the program counter, pc, at the end of each instruction cycle. The new value is:
    • 3000 if reset is asserted, or
    • the value of the tPC input, when a JMP or BR* instruction was executed (and its condition was met), or
    • otherwise, the previous value of pc+1.
• Fetch
  • Initiates the external bus cycle (iBR0, iADDR0, iWEA0) to read the instruction from the memory location pointed to by pc.
  • The control unit will maintain this state until the external memory reports that the data is available (eREADY).
• Decode
  • Finishes the external bus cycle by reading the instruction from the external memory bus (eDIN).
  • Decodes the instruction:
    • Based on the opcode (ir[15:12]), it generalizes the instruction type for the State module (cCtrl).
    • Based on the operands (ir[11:0]), it configures the data path using state independent control signals:
      • For ALU instructions, uOp, sr1ID, sr2ID, drSrc, drID
      • For memory instructions, aOp, sr1ID (BaseR for LDR / STR), sr2ID (sr for ST / STI / STR), offset, drID
      • For control instructions, pNext, sr1ID (BaseR for JMP).
• Registers
  • Maintains the general purpose registers (R0..R7).
  • Supplies the values for the registers identified by sr1ID, sr2ID.
  • Updates the register specified by drID to the value dr when rWe is asserted.
• ALU
  • The input uOp selects both the operation type and inputs.
  • For ADD do if ir[5]==1 then uOut=sr1+imm5, else uOut=sr1+sr2.
  • For AND do if ir[5]==1 then uOut=sr1&imm5, else uOut=sr1&sr2.
  • For NOT do uOut=~sr1.
• Address
  • Input aOp selects both the calculation type and inputs.
  • For BR* do aOut=nPC+offset9.
  • For LD / LDI / LEA / ST / STI, do aOut=nPC+offset9.
  • For JMP / LDR / STR, do aOut=sr1+offset6.
  • Note that aOut is connected to the addr input on MemoryIF, and the tPC input on UpdatePC.
• MemoryIF
  • Input mOp selects the memory access mode and inputs.
  • For LDI / STI, under the direction of the Control Unit, it first initiates a memory read cycle for addr (aOut). The Control Unit will maintain this state until the external memory reports that the data is available (eREADY). It then takes the value read from memory (eDIN), and
    • for LDI, it initiates a read cycle for address eDIN;
    • for STI, it initiates a write cycle to write the value sr2 to address eDIN.
  • For the other instructions, it takes the address, and
    • for LD / LDR, reads the value from addr (aOut);
    • for LEA, does nothing;
    • for ST / STR, writes the value sr2 to addr (aOut).
  • The control unit will maintain this state until the external memory reports that the data is available (eREADY).
• DrMux*
  • Input drSrc selects the value that will be written to the destination register.
  • For ADD / AND / NOT do forward uOut to dr.
  • For LD / LDR / LDI do forward eDIN to dr.
  • For LEA do forward aOut to dr.

*) DrMux is an abbreviation for Destination Register Multiplexor.

To continue this CPU Design for LC-3 instruction set, read about its implementation on the next page.

## Instruction set

Introduces a simplified LC-3 instruction set, that we later will design a CPU for and implement in Verilog HDL.

## Instruction Set

The Instruction Set Architecture (ISA) specifies all the information needed to write a program in machine language.
It contains:
• Memory organization, specifies the address maps; how many bits per location;
• Register set, specifies the size of the internal registers; how many registers; and how they can be used;
• Instruction set, specifies the opcodes; operands; data types; and addressing modes.

### Simplicity rules

The book Introduction to Computing Systems by Patt and Patel introduces a hypothetical microprocessor called LC-3. For this text we push the simplicity of this little computer (LC-3) even further by:
• not supporting subroutine calls, JSR JSRR RET
• not supporting interrupt handling, RTI TRAP
• not supporting overflow detection in arithmetic operations
• not validating the instruction encoding
• replacing the TRAP 0 with a simple HALT instruction.

Implementing this very basic Instruction Set helps us understand the inner workings of a microprocessor. With the exception of these simplifications, the Instruction Set Architecture (ISA) is specified in the book “Introduction to Computing Systems”. The following sections summarize this ISA. For more details, refer to Appendix A.3 of the book.

### Overview

• Memory organization:
  • 16-bit addresses; word addressable only,
  • 16-bit memory words.
• Memory map
  • User programs start at memory location 3000 hex, and may extend to FDFF.
• Bit numbering
  • Bits are numbered from right (least significant bit) to left (most significant bit), starting with bit 0.
• Registers
  • A 16-bit program counter (PC), contains the address of the next instruction.
  • Eight 16-bit general purpose registers, numbered 000 .. 111 binary, for register R0 .. R7.
  • A 3-bit processor status register (PSR), that is updated when an instruction writes to a register.
    • psr[2]==1, when the 2’s complement value is negative (n).
    • psr[1]==1, when the 2’s complement value is zero (z).
    • psr[0]==1, when the 2’s complement value is positive (p).
• Instructions
  • 16-bit instructions, RISC (all instructions the same size).
  • the opcode is encoded in the 4 most significant bits of the instruction (bit 15..12).
  • the operands are encoded in the remaining 12 bits of the instruction.
  • ALU performs ADD, AND and NOT operations on 16-bit words.

### Instructions

#### Operand conventions

As mentioned above, of the 16-bit instruction, only 12 bits are available for the operands. This implies that 16-bit data values or memory addresses have to be specified indirectly, for instance by referring to a value in a register.

Addressing modes:
• PC relative, the address is calculated by adding an offset to the incremented program counter, pc.
• Register relative, the address is read from a register.
• Indirect, the address is read from a memory location whose address is calculated by adding an offset to the incremented program counter.
• Load effective address, the address is calculated by adding an offset to the incremented program counter. The address itself (not its value) is stored in a register.

The table below shows the conventions used in describing the instructions.

| Operand | Description |
|---|---|
| srID, sr1ID, sr2ID | Source Register Identifiers (000..111 for R0..R7) |
| drID | Destination Register Identifier (000..111 for R0..R7) |
| baseRID | Base Register Identifier (000..111 for R0..R7) |
| sr, sr1, sr2 | 16-bit Source Register value |
| dr | 16-bit Destination Register value |
| baseR | Base Register value, used together with 2’s complement offset to calculate memory address. |
| imm5 | 5-bit immediate value as 2’s complement integer |
| mem[address] | Contents of memory at the given address |
| offset6 | 6-bit value as 2’s complement integer |
| offset9 | 9-bit value as 2’s complement integer |
| SX | Sign-extend, by replicating the most significant bit as many times as necessary to extend to the word size of 16 bits. |

#### ALU instructions

There are two variations of the ADD and AND instructions. The difference is in bit 5 of the instruction word. One takes the second argument from sr2, the other takes it from the immediate value imm5.
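
The bit 5 selection can be sketched in a few lines of Verilog. This is a minimal illustration under stated assumptions, not the ALU module of this series: the module name `AluOperandB`, the port `opB` and the small test bench are hypothetical and only show how `ir[5]` chooses between `sr2` and the sign-extended `imm5`.

```verilog
// Hypothetical helper, for illustration only (not one of the design files).
// Selects the second ALU operand for ADD / AND based on bit 5 of the instruction.
module AluOperandB(
   input  [15:0] ir,      // instruction register
   input  [15:0] sr2,     // value of source register 2
   output [15:0] opB );   // second ALU operand

   wire [15:0] imm5 = { {11{ir[4]}}, ir[4:0] };  // SX(imm5), sign extend to 16 bits
   assign opB = ir[5] ? imm5 : sr2;              // bit 5 == 1 selects the immediate
endmodule

// Hypothetical test bench that exercises both variations.
module AluOperandB_tb;
   reg  [15:0] ir;
   reg  [15:0] sr2;
   wire [15:0] opB;

   AluOperandB dut( .ir(ir), .sr2(sr2), .opB(opB) );

   initial begin
      sr2 = 16'h4321;
      ir  = 16'h1801;            // ADD R4, R0, R1  (bit 5 = 0)
      #1;
      $display("%h", opB);       // register variation: prints 4321
      ir  = 16'h183F;            // ADD R4, R0, #-1 (bit 5 = 1, imm5 = 11111)
      #1;
      $display("%h", opB);       // immediate variation: prints ffff
      $finish;
   end
endmodule
```

The first instruction word, 1801, is the ADD R4, R0, R1 example used in the Design article; the second word, 183F, is an assumed encoding of the immediate variation with imm5 = -1.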
##### Instruction types

| Opcode | Name | Assembly | Operation |
|---|---|---|---|
| ADD | Addition | ADD DR, SR1, SR2 | dr = sr1 + sr2 |
| | | ADD DR, SR1, imm5 | dr = sr1 + SX(imm5) |
| AND | Logical AND | AND DR, SR1, SR2 | dr = sr1 & sr2 |
| | | AND DR, SR1, imm5 | dr = sr1 & SX(imm5) |
| NOT | Logical NOT | NOT DR, SR | dr = ~sr |

##### Instruction encoding

| Opcode | 15..12 | 11..9 | 8..6 | 5 | 4..0 |
|---|---|---|---|---|---|
| ADD | 0 0 0 1 | drID | sr1ID | 0 | 0 0 sr2ID |
| | 0 0 0 1 | drID | sr1ID | 1 | imm5 |
| AND | 0 1 0 1 | drID | sr1ID | 0 | 0 0 sr2ID |
| | 0 1 0 1 | drID | sr1ID | 1 | imm5 |
| NOT | 1 0 0 1 | drID | srID | 1 | 1 1 1 1 1 |

#### Memory instructions

##### Instruction types

| Opcode | Name | Assembly | Operation |
|---|---|---|---|
| LD | Load | LD DR, label | dr = mem[pc + SX(offset9)] |
| LDR | Load Register | LDR DR, BaseR, offset6 | dr = mem[baseR + SX(offset6)] |
| LDI | Load Indirect | LDI DR, label | dr = mem[mem[pc + SX(offset9)]] |
| LEA | Load Eff. Addr. | LEA DR, target | dr = pc + SX(offset9) |
| ST | Store | ST SR, label | mem[pc + SX(offset9)] = sr |
| STR | Store Register | STR SR, BaseR, offset6 | mem[baseR + SX(offset6)] = sr |
| STI | Store Indirect | STI SR, label | mem[mem[pc + SX(offset9)]] = sr |

##### Instruction encoding

| Opcode | 15..12 | 11..9 | 8..0 |
|---|---|---|---|
| LD | 0 0 1 0 | drID | offset9 |
| LDR | 0 1 1 0 | drID | baseRID (8..6), offset6 (5..0) |
| LDI | 1 0 1 0 | drID | offset9 |
| LEA | 1 1 1 0 | drID | offset9 |
| ST | 0 0 1 1 | srID | offset9 |
| STR | 0 1 1 1 | srID | baseRID (8..6), offset6 (5..0) |
| STI | 1 0 1 1 | srID | offset9 |

#### Control instructions

##### Instruction types

| Opcode | Name | Assembly | Operation |
|---|---|---|---|
| BR* | Branch | BR* label | if (condition*) pc = pc + SX(offset9) |
| JMP | Jump | JMP BaseR | pc = baseR |
| HALT | Halt | HALT | stop program execution (simplified TRAP 0) |

*) The assembler instruction for BR* can be either
• BRn label, test for state bit n
• BRz label, test for state bit z
• BRp label, test for state bit p
• BRzp label, test for state bits z and p
• BRnp label, test for state bits n and p
• BRnz label, test for state bits n and z
• BRnzp label, test for state bits n, z and p

##### Instruction encoding

| Opcode | 15..12 | 11..9 | 8..0 |
|---|---|---|---|
| BR* | 0 0 0 0 | n z p | offset9 |
| JMP | 1 1 0 0 | 0 0 0 | baseRID (8..6), 0 0 0 0 0 0 (5..0) |
| HALT | 1 1 1 1 | 0 0 0 | 0 0 0 1 0 0 1 0 1 |

This article
“Simplified LC-3 Instruction set” continues with Design on the next page.

## Architecture

Explains how a CPU works by implementing the LC-3 instruction set. Includes an in-depth look at the instruction cycle phases.

The inquiry “How do computers do math” introduced the components needed to build a microprocessor. This article series continues by introducing the microprocessor. We use a top-down approach to help break up the complex microprocessor into simple, more manageable parts. Starting from the architecture, it dives down to an instruction set. In the second half we build a microprocessor using a field programmable gate array. We assume a familiarity with assembly code.

## Credits

This text leans on chapter 4 from the excellent book “Introduction to Computing Systems” by Patt and Patel. The implementation borrows from Davis’ Project#1 description at NC State University.

## Architecture

World War II brought widespread destruction, but also spurred a flurry of computer innovations. The first electric computer was built using electro-mechanical relays in Nazi Germany (1941). Two years later the US army built ENIAC using 18,000 vacuum tubes for calculating artillery firing tables and simulating the H bomb. This computer could perform complex sequences of operations, including loops, branches and subroutines. The program in this computer was hardwired using switches and dials. It was a thousand times faster than the electro-mechanical machines, but it took great effort to change the program. In these early computers the programming was hard-wired into the design, meaning that “reprogramming” a computer simply wasn’t possible. Instead, computers would have to be physically disassembled and redesigned.

To explain how a CPU works, we take a look at the leading architecture.

### von Neumann Architecture

In 1945, the mathematician John von Neumann formalized processor methods developed at the University of Pennsylvania.
His computer architecture design consists of a Control Unit, Arithmetic and Logic Unit (ALU), Memory Unit, Registers and Inputs/Outputs. These methods became known as the von Neumann architecture and still form the foundation for today’s computers. Using the von Neumann architecture, computers could be modified and programmed by entering instructions in computer code. This way, the functionality could simply be rewritten using a programming language.

The von Neumann architecture is based on the principle of:
1. Fetch an instruction from memory
2. Decode the instruction
3. Execute the instruction

This process is repeated indefinitely, and is known as the fetch-decode-execute cycle.

The Central Processing Unit (CPU) is the electronic circuit responsible for executing the instructions of a computer program. It is also referred to as the microprocessor. The CPU contains the Control Unit, ALU and Registers. The Control Unit interprets program instructions and orchestrates the execution of these instructions. Registers store data before it can be processed. The ALU carries out arithmetic and logical operations.

### Instructions

An instruction is a fundamental unit of work that is executed completely, or not at all. In memory, the instructions look just like data: a collection of bits. They are just interpreted differently. An instruction includes:
• an opcode, the operation to be performed;
• operands, the data (locations) to be used in the operation.

There are three instruction types:
• arithmetic and logical instructions, such as addition and subtraction, or logical operations such as AND, OR and NOT;
• memory access instructions, such as load and store;
• control instructions, that may change the address in the program counter, permitting loops or conditional branches.

This article “How a CPU works” continues with Instruction Set on the next page.

## Programmable logic

Complexity – CAD – Simulation ….
Logic devices can be classified into two broad categories:
• Fixed devices, where the circuits are permanent. Their function cannot be changed. Examples are:
  • gates (NAND, NOR, XOR),
  • binary counters,
  • multiplexers, and
  • adders.
• Application-Specific Integrated Circuit (ASIC)
  • The manufacturer defines an integrated circuit containing transistors, but does not connect them together.
  • The user specifies the metal mask that connects the transistors.
  • The manufacturer uses this mask to finish the ASIC.
  • Introduced by Fairchild in 1967. ASICs have since grown to contain over 100 million gates.

### Programmable logic devices

Programmable logic devices (PLD) can be changed at any time to perform any number of functions. Prominent flavors of PLDs are:
• Programmable array logic (PAL)
  • based on sum-of-products, with programmable “fuses”,
  • used for simple combinational logic (a few hundred gates),
  • introduced by MMI (Birkner and Chua, 1978).
  • The figure on the right shows an example of an AND function with programmable fuses.
• Complex programmable logic device (CPLD)
  • based on sum-of-products,
  • for medium size combinational logic (tens of thousands of gates).
• Field-programmable gate array (FPGA)
  • based on blocks containing a look-up table, full adder and D flip-flop,
  • used for complex combinational or sequential logic such as state machines (millions of gates),
  • introduced by Xilinx (Freeman, Vonderschmitt) in 1985.

A CPLD would be sufficient to implement the combinational circuits discussed so far; however, our ultimate goal is to create a modest microprocessor circuit. As we will see later, a microprocessor circuit requires a state machine, for which we need an FPGA. As a result, the remainder of this text will focus on an FPGA implementation.

### Interconnected cells

The core of an FPGA contains a vast array of interconnected logic cells. The exact logic cell architecture depends on the vendor (refer to FPGA logic cells for typical cell architectures).
The main vendors are:

• Xilinx for leading edge products, and
• Altera (Intel) for lean and efficient devices.

Each logic cell consists of:

• a look-up table (LUT), to implement any 4-input Boolean function,
• a full adder with an additional AND gate, to implement multiplication,
• a D flip-flop, to implement sequential logic, and
• a 2-to-1 multiplexer, to bypass the flip-flop if desired.

Each IO cell consists of:

• a D flip-flop, to implement sequential logic, and
• a 2-to-1 multiplexer, to bypass the flip-flop if desired.

Programmable interconnects

• Reconfigurable interconnects allow the logic cells to be “wired together”.
• The functionality of an FPGA can be changed by downloading a different configuration.
• The circuits are often much faster than with discrete components, because the signals stay within the silicon die of the FPGA.

The figure below shows a typical matrix organization of the logic cells that are interconnected using programmable interconnects.

### Lab environment (thanks Dylon)

• Altera (now Intel), much better tools.
• Boards
  • Xilinx, development boards are easy to find. E.g. Spartan6 ($89 at Avnet) that has a USB-to-UART chip on it so you can plug it right into your computer to download new FPGA code as well as use it as a UART.
  • Alternatively, the Xilinx Spartan3E development board is an old standby that works well.
• Simulator
  • Icarus Verilog (free simulator, yum install iverilog) and GTKWave (free waveform viewer, yum install gtkwave) work great. They are just as good as most of the bundled simulators that you’ll find with the tools.
  • A web copy of ModelSim bundled with Xilinx or Altera wouldn’t be bad either.
• Cliff Cummings posted papers about Verilog, and book recommendations.
• OpenCores has lots of Verilog and VHDL code for most any kind of core you can imagine.
• Scripts: SimShop and Tizzy for simulation and state machines! SimShop provides an easy scriptable way to set up a simulation environment, and Tizzy allows you to write state machines in .dot and will do a conversion to Verilog for you.

### The typical workflow:

• The desired logic is specified using traditional schematics or a hardware description language.
• The logic function is compiled into a binary file that can be downloaded into the FPGA.
• The design is verified by applying test vectors and checking the output.

The application-specific integrated circuit (ASIC) is similar to the FPGA, except that it is not reprogrammable. The advantage is higher speed and smaller footprint.

Hardware description language (HDL)

1. Verilog/VHDL
2. netlist
3. synthesis optimizes the functions
4. mapping to hardware

Built-in components are called macros (counters, RAM, multiplexers, adders, LUT)

1. See “Introduction to Verilog”.
2. In order to obtain reasonable speeds (wires are not ideal), the utilization is typically limited to about 50%.

### What’s next?

The logical next step is the Arithmetic Logical Unit that forms the heart of today’s computers.

### Arithmetic Logical Unit (ALU)

1. Arithmetic Logical Unit (ALU)
• http://ecen3233.okstate.edu/Fall%202009/labs/Lab05.pdf
• soft cores for Xilinx, http://www.1-core.com/library/digital/soft-cpu-cores/
2. Add Simple picture showing different functions feeding into a multiplexor where the operation is the selector.
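
The idea in the note above, several functions feeding a multiplexer with the operation as the selector, could be sketched in Verilog roughly as follows. This is only an illustration and not the design from the referenced lab: the module name `alu_sketch`, the 4-bit width and the 2-bit `op` encoding are assumptions made for this example.

```verilog
// Hypothetical 4-bit ALU: every function is computed in parallel and
// the operation code selects which result reaches the output.
module alu_sketch(
    input      [3:0] a,
    input      [3:0] b,
    input      [1:0] op,   // assumed encoding: 00 add, 01 subtract, 10 AND, 11 OR
    output reg [3:0] y
);
    wire [3:0] sum  = a + b;
    wire [3:0] diff = a - b;
    wire [3:0] conj = a & b;
    wire [3:0] disj = a | b;

    // 4-to-1 multiplexer with op as the selector
    always @(*)
        case (op)
            2'b00: y = sum;
            2'b01: y = diff;
            2'b10: y = conj;
            2'b11: y = disj;
        endcase
endmodule
```

Because all four `op` values are covered, the `always` block describes purely combinational logic and no latch is inferred.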

Now let us build something with Gate-Level Verilog! I also published the companion article that implements the functionality using an FPGA.

The inquiry “How do microprocessors work?” picks up from here.

## Synchronous sequential

The logic circuits that we have seen so far are referred to as combinatorial circuits. While these circuits can be used quite successfully for math operations, their simplicity comes at a price:

• The input values need to remain constant during the calculation.
• The output can have multiple logical transitions before settling to the correct value. The figure below shows that even adding two numbers without carry may cause multiple transitions.
• There is no indication when the output has settled to the correct value.

This chapter addresses solutions to some of these issues by introducing sequential circuits, in which the output depends not only on the current inputs, but also on the past sequence of input values. That is, sequential logic has state (memory). [wiki]

In general, such a memory element can be made using positive feedback.

Digital sequential logic circuits are divided into asynchronous and synchronous circuits.

### Asynchronous

The advantage of asynchronous circuits is that the speed is only limited by the propagation delays of the gates, because the circuit does not have to wait for a clock signal to process inputs.

The state of asynchronous circuits can change at any time in response to changing inputs. As a result, similar to combinatorial circuits, the output can have multiple logical transitions before settling to the correct value.

Another disadvantage arises from the fact that memory elements are sensitive to the order in which their input signals arrive. If two signals arrive at a logic gate at almost the same time, which state the circuit goes into can depend on which signal gets to the gate first. This may cause small manufacturing differences to lead to different behavior.

These disadvantages make designing asynchronous circuits very challenging and limited to critical parts where speed is at a premium. [wiki]

A very basic memory element can be made using only two inverters. This circuit will maintain its state, but lacking any input that state cannot be changed.

We continue with some common asynchronous circuits.

#### Set-Reset (SR) Latch (async, level sensitive)

The SR-latch builds on the idea of the inverter latch by introducing two inputs. Set ($$S$$) forces the next value of the output ($$Q_{n+1}$$) to $$1$$. Reset ($$R$$) forces the next value of the output ($$Q_{n+1}$$) to $$0$$.

The state transition diagram provides a visual abstraction. It uses circles for the output states, and arrows for the transition conditions.

The state transition table shows the relationship between the inputs, the current value of the output ($$Q_n$$), and the next value of the output ($$Q_{n+1}$$). The ‘×’ represents a “don’t care” condition.

In Boolean algebra, this function can be expressed as: \begin{align*} Q_{n+1} &= S+\overline{R}\cdot{Q_n} \\ &= \overline{\overline{S+Q_n}+R} \end{align*}
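
The step between the two forms above is not spelled out. As a worked sketch, applying De Morgan’s law to the second form gives: \begin{align*} \overline{\overline{S+Q_n}+R} &= (S+Q_n)\cdot\overline{R} \\ &= S\cdot\overline{R}+\overline{R}\cdot Q_n \end{align*} For the allowed input combinations, $$S=1$$ implies $$R=0$$, so $$S\cdot\overline{R}=S$$ and the expression reduces to $$S+\overline{R}\cdot Q_n$$. The two forms differ only for $$S=R=1$$, the input combination that the latch should avoid.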

The function can be built with two NOR gates as shown below.

With the circuit in hand, let us take a closer look at its operation:

• S=1 while R=0 drives output Q to 1.
• R=1 while S=0 drives output Q to 0.
• When S=R=0, the latch latches and maintains its previous state.
• When S=R=1, both NOR outputs are forced to 0, so $$Q$$ and $$\overline Q$$ are no longer complements. If both inputs then return to 0 sufficiently close in time, there is a problem: whichever gate is first will win the race. In the real world, it is impossible to predict which gate that would be, since it depends on minute manufacturing differences. A similar problem occurs when the device powers up and both $$Q$$ and $$\overline Q$$ are $$0$$.

In Verilog HDL we could model this as

    module SR_latch(
        input S,
        input R,
        output Q
    );
        wire Qbar;
        // cross-coupled NOR gates
        nor( Q, R, Qbar );
        nor( Qbar, S, Q );
    endmodule
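
To watch the set, reset and hold behaviour described above, the `SR_latch` module can be driven from a small testbench. The following is a sketch only: the testbench name, the stimulus sequence and the waveform file name are assumptions made for this example. It starts with a reset so that `Q` has a known value, and it deliberately avoids the S=R=1 case.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the SR_latch module shown above.
module SR_latch_tb;
    reg  S, R;
    wire Q;

    SR_latch dut( .S(S), .R(R), .Q(Q) );

    initial begin
        $dumpfile("sr_latch.vcd");     // waveform file name is an assumption
        $dumpvars(0, SR_latch_tb);
        $monitor("t=%0t S=%b R=%b Q=%b", $time, S, R, Q);
        S = 0; R = 1;   // reset first so Q starts from a known value
        #10 R = 0;      // hold: Q keeps its value
        #10 S = 1;      // set
        #10 S = 0;      // hold
        #10 R = 1;      // reset
        #10 R = 0;      // hold
        #10 $finish;
    end
endmodule
```

With Icarus Verilog, something like `iverilog -o sr_tb sr_latch.v sr_latch_tb.v` followed by `vvp sr_tb` should run it, where the file names are placeholders. The resulting `.vcd` file can then be opened in GTKWave.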

#### D-latch (async, level sensitive)

Here a $$D$$ input makes the $$R$$ and $$S$$ complements of each other, thereby removing the possibility of invalid input states (metastability) as we saw in the SR-latch. $$D=1$$ sets the latch to 1, and $$D=0$$ resets the latch to 0.

Note that in the circuit below the SR-latch is drawn as cross-coupled NOR gates.

An enable signal controls when the values of the inputs $$R$$ and $$S$$ matter. The output reflects the input only when enable is active ($$EN=1$$). In other words, the enable signal serves as a level triggered clock input. Level triggered means that the input is passed to the output for as long as the clock is active.

D-latches cannot be chained, because changes will just race through the chain. One could prevent this by inverting the ENABLE signal going to the 2nd D-latch.

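In Verilog HDL we would model this as

    module D_latch(
        input D,
        input Enable,
        output reg Q   // assigned in an always block, so declared reg
    );
        // transparent while Enable is 1, holds its value otherwise
        always @(D or Enable)
            if (Enable)
                Q <= D;
    endmodule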

### Synchronous circuits

In synchronous circuits, a clock signal synchronizes state transitions. Inputs are only sampled during the active edge of the clock cycle. Outputs are “held” until the next state is computed, thereby preventing multiple logical transitions. Changes to the logic signals throughout the circuit all begin at the same time, synchronized by the clock.

The figure below shows an example of a synchronous sequential circuit. In it, a D flip-flop serves as a clocked memory element. The following section will examine the various memory elements.
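
As a concrete illustration of this structure, combinational logic computing the next state and flip-flops holding the current state, here is a minimal sketch of a counter. It is not the circuit from the figure: the module name `counter_sketch`, the 4-bit width and the synchronous reset are assumptions made for this example.

```verilog
// Hypothetical 4-bit counter: combinational logic (the +1) computes the
// next state, and a bank of D flip-flops holds the current state.
module counter_sketch(
    input            clk,
    input            reset,   // synchronous, active high (assumed)
    output reg [3:0] count
);
    wire [3:0] next = count + 4'd1;   // combinational next-state logic

    // state changes only on the rising clock edge
    always @(posedge clk)
        if (reset)
            count <= 4'd0;
        else
            count <= next;
endmodule
```

Between clock edges `count` is held, so any intermediate transitions on `next` are never seen at the output.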

The main advantage of synchronous logic is its simplicity. The logic gates which perform the operations on the data require a finite amount of time to respond to changes to their inputs. This is called propagation delay. The interval between clock pulses must be long enough so that all the logic gates have time to respond to the changes and their outputs “settle” to stable logic values, before the next clock pulse occurs. As long as this condition is met (ignoring certain other details) the circuit is guaranteed to be stable and reliable. This determines the maximum operating speed of a synchronous circuit.

Synchronous circuits also have disadvantages. The maximum clock frequency is determined by the slowest (critical) path in the circuit, because every operation must complete in one clock cycle. The best work-around is to make all the paths take roughly the same time, by splitting complex operations into several simple operations that are performed over multiple clock cycles (pipelining).

In addition, synchronous circuits require the usually high clock frequency to be distributed throughout the circuit, causing power dissipation.

#### D flip-flop (edge triggered)

The classic D-flip-flop is similar to a D-latch, except for the important fact that it only samples the input on a clock transition; in this case that is the rising edge of the clock.

In other words, while $$clk=0$$ the value of $$D$$ is copied in the first latch. The moment that $$clk$$ becomes $$1$$, that value remains stable and is copied to output $$Q$$.
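
The two-latch behaviour described above can be sketched in Verilog by chaining two level-sensitive latches with opposite enables. This is an illustrative model only. The module and signal names (`d_latch_sketch`, `d_flipflop_ms`, `mid`) are assumptions, and the latch is redefined here so that the example is self-contained.

```verilog
// Level-sensitive latch: transparent while en=1, holds while en=0.
module d_latch_sketch(
    input      d,
    input      en,
    output reg q
);
    always @(d or en)
        if (en) q <= d;
endmodule

// Hypothetical master-slave arrangement of two latches.
module d_flipflop_ms(
    input  d,
    input  clk,
    output q
);
    wire mid;   // output of the first (master) latch

    d_latch_sketch master( .d(d),   .en(~clk), .q(mid) );  // follows d while clk=0
    d_latch_sketch slave ( .d(mid), .en(clk),  .q(q)   );  // follows mid while clk=1
endmodule
```

Because the two enables are never active at the same time, a change on `d` cannot race through both latches. The value seen at `q` is the one that `d` had just before the rising edge of `clk`.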

The use of the clock signal implies that the flip-flop cannot simply hold its previous value: it samples the input on every rising clock edge. Note that the ‘>’ symbol marks the clock input and indicates that the input is sampled on the rising clock edge.

The advantage of triggering on the clock edge, is that the input signal only needs to remain stable while it is being copied to the second latch. The so-called timing window:

In this timing window, the setup time $$t_{su}$$ is the minimum time before the rising clock edge by which the input must be stable. The hold time $$t_h$$ is the minimum time after the clock event during which the input must remain stable. The clock-to-output (propagation) delay $$t_{cq}$$ is the maximum time after the clock event for the output to change.
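
These parameters tie back to the maximum operating speed of a synchronous circuit. As a sketch, let $$t_{pd}$$ denote the worst-case propagation delay of the combinational logic between two flip-flops (a symbol introduced here for illustration) and ignore clock skew. The clock period $$T_{clk}$$ must then satisfy: \begin{align*} T_{clk} &\geq t_{cq} + t_{pd} + t_{su} \\ f_{max} &= \frac{1}{t_{cq} + t_{pd} + t_{su}} \end{align*} For example, with made-up values $$t_{cq}=1\,ns$$, $$t_{pd}=7\,ns$$ and $$t_{su}=2\,ns$$, the period must be at least $$10\,ns$$, which corresponds to $$f_{max}=100\,MHz$$.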

A setup or hold violation causes metastability where the output goes to intermediate voltage values which are eventually resolved to an unknown state.

In Verilog HDL we would model this as

    module D_flipflop(
        input D,
        input clock,
        output reg Q   // assigned in an always block, so declared reg
    );
        // sample D on the rising edge of the clock
        always @(posedge clock)
            Q <= D;
    endmodule

The example above is provided for general understanding of the principle. In practice, one would use the more efficient solution that requires only 6 NOR gates. This solution avoids the inverter on the enable input of the first latch.

#### register

A register is a memory element that expands on the D flip-flop. A load signal $$LD$$ limits when new data is loaded into the register (only if $$LD=1$$ during the active edge of the clock). In the circuit below, this is implemented using a 2:1 multiplexer.

Multiple data bits are stored together. An $$n$$-bit register consists of $$n$$ blocks that share the $$LD$$ and $$clk$$ signals.

A sequence of bits is commonly written as $$D[15:0]$$ referring to bits $$D_{15}\dots D_0$$.
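
A register with a load signal could be modelled in Verilog along the following lines. This is a minimal sketch: the module name `register_sketch` and the parameter `N` are assumptions, while `LD`, `clk`, `D` and `Q` follow the names used in the text.

```verilog
// Hypothetical n-bit register with load enable.
module register_sketch #(
    parameter N = 16            // width is an assumption; any n works
)(
    input              clk,
    input              LD,
    input      [N-1:0] D,
    output reg [N-1:0] Q
);
    // 2:1 multiplexer: take new data when LD=1, else feed back Q
    wire [N-1:0] next = LD ? D : Q;

    // all N flip-flops share the same clock
    always @(posedge clk)
        Q <= next;
endmodule
```

Feeding `Q` back through the multiplexer keeps the clock itself untouched, so the stored value is simply rewritten on clock edges where `LD=0`.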

#### Large memories

Memory consists of a large number of locations that each can store a value. Each location in a memory is given a number, called an address.

Memory locations are identified by a $$k$$-bit wide address. Each memory location can store an $$n$$-bit wide value.

The figure below gives an example of 16-bit addresses storing 16-bit values. To save space in this figure, hexadecimal (base-16) notation is used to represent the address and value.
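
For simulation purposes, such a memory can be described behaviourally as an array of words. The sketch below is an assumption-laden illustration rather than a description of any particular memory chip: the module name, the write-enable signal `we` and the asynchronous read are choices made for this example.

```verilog
// Hypothetical memory with 2^K locations of N bits each.
module memory_sketch #(
    parameter K = 16,           // address width
    parameter N = 16            // value width
)(
    input              clk,
    input              we,      // write enable (assumed control signal)
    input      [K-1:0] addr,
    input      [N-1:0] din,
    output     [N-1:0] dout
);
    reg [N-1:0] mem [0:(1<<K)-1];   // one N-bit word per address

    // write on the rising clock edge when enabled
    always @(posedge clk)
        if (we)
            mem[addr] <= din;

    assign dout = mem[addr];        // asynchronous read, for simplicity
endmodule
```

With `K = 16` and `N = 16` this matches the 16-bit addresses and 16-bit values of the example, giving 65536 locations.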

Bit density is key in building large memories. Instead of D flip-flops, large memories use more efficient methods such as:

• Static Random Access Memory (SRAM) that uses six transistors per memory bit. As we have seen this relies on a feedback between two gates.
• Dynamic Random Access Memory (DRAM) that uses only one transistor per memory bit. The mechanism relies on an electrical charge stored in the capacitor of a MOSFET gate. The drawback is that the charge has to be refreshed periodically.

Let’s take a closer look at DRAM: a single bit (cell) can be implemented as shown below. In this, the capacitor saves the state. The transistor limits access to the capacitor.

To read, select is raised; the charge in the capacitor then appears on pin D. To write, select is raised for long enough to charge or drain the capacitor to the value of D.

These cells can be combined to form a large memory. The cells are organized in a matrix structure, to keep the size of the address demultiplexer practical. Otherwise, to implement $$k$$ address lines, a demux with $$2^k$$ outputs would be needed. The figure below shows a simplified implementation using a 4-bit address (and 1 bit wide).

To make an n-bit wide memory, n of these 1-bit wide DRAM chips can be combined.

For more info, refer to the slides from the MIT lectures “Sequential building blocks” and “Memory Elements”, and the web site lwn.net.
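
To see why the matrix organization helps, compare the number of select lines in a rough sketch. It assumes that the $$k$$ address bits are split evenly into a row half and a column half, which is an assumption made here for illustration: \begin{align*} \text{single demux:}\quad & 2^k \text{ outputs} \\ \text{row and column selection:}\quad & 2^{k/2} + 2^{k/2} = 2\cdot 2^{k/2} \text{ outputs} \end{align*} For a 4-bit address this is $$16$$ versus $$4+4=8$$ outputs. For a 16-bit address it is $$65536$$ versus $$256+256=512$$.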

### Good design practices

[src]

• Use a single clock, single edge synchronous design wherever possible.
• Asynchronous interfaces lead to metastability. Minimize the asynchronous interface and double-clock the data (register it twice) to reduce the chance of metastability.
• Avoid asynchronous presets & clears on FFs. Use synchronous presets & clears whenever possible.
• Do not gate clocks! Instead, create clock enabled FFs via a MUX to feed back current data.
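
The advice on asynchronous interfaces can be illustrated with a two-flip-flop synchronizer. This sketch reads “double clock the data” as registering the asynchronous signal twice, which is an interpretation and not something the source spells out. The module and signal names are assumptions.

```verilog
// Hypothetical two-flip-flop synchronizer for one asynchronous input bit.
module sync_2ff(
    input      clk,
    input      async_in,   // may change at any time relative to clk
    output reg sync_out
);
    reg meta;   // first stage, may go metastable

    always @(posedge clk) begin
        meta     <= async_in;   // first sample
        sync_out <= meta;       // second sample, one clock later
    end
endmodule
```

The first flip-flop may violate its setup or hold time. The second one gives it a full clock period to settle before the rest of the design sees the value, at the cost of one extra cycle of latency.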

### Clock

In sequential circuits, a clock signal orchestrates the state transitions. This clock is generally a square wave generated by an astable multivibrator.

Delayed negative feedback causes it to oscillate between 0 and 1.

An implementation using inverters and a RC circuit is shown below.

The functionality can be explained as follows:

• Suppose initially:
  • output U3=0V (a logical 0), and
  • the capacitor is not charged .·. U1=U3
• 0→1:
  • The capacitor charges through resistor R .·. U1 increases towards 5V.
  • Once U1≥2V (the 1 threshold) .·. U2 becomes 0V .·. output U3 becomes 5V (a logical 1)
• 1→0:
  • The capacitor charge reverses through the resistor R .·. U1 decreases towards 0V.
  • Once U1≤0.7V (the 0 threshold) .·. U2 becomes 5V .·. output U3 becomes 0V (a logical 0), and the cycle repeats itself.

### Hands On

• D latch, Yenka Technology, Digital Electronics, build d-latch using gates, use models for d-type flip-flop, binary counter
• Build or simulate a set-reset latch using NOR gates. (see Digital logic projects, page 27)
• Build or simulate a D-latch using NAND gates. (see Digital logic projects, page 6)

The following chapter introduces programmable logic that allows us to build denser and more flexible hardware systems.