Starting with Altera

A short introduction to compiling, simulating and uploading using the Altera Quartus development environment for the Cyclone IV on a DE0-Nano board.

A (relatively) short introduction to compiling, simulating and uploading using the Altera Quartus development environment for the Terasic Altera Cyclone IV DE0-Nano under Windows 10. An equivalent tutorial is available for the reader who prefers Xilinx based boards.

Install the FPGA Design Suite

Start by installing the free Quartus Prime Lite, this includes the IDE and required tool chain to create configuration files for the Altera FPGA.

  1. Install Quartus Prime Lite 21.1 (>16.1)
    • Download the Quartus Prime Lite Edition from altera.com
    • Unpack Quartus-lite-21.1.0.842-windows.tar; run the setup.bat and install to a path without spaces in the name (e.g. C:\intelFPGA_lite\21.1)
    • Include device support for Cyclone IV and Questa – Intel FPGA Starter Edition (was: ModelSim-Altera Starter Edition).
    • Select USB Blaster II driver (JTAG) installation
    • Run the Quartus Prime software
  2. Run the Quartus Prime 21.1 Device Installer
    • install Cyclone IV and ModelSim-Altera Starter support

The USB Blaster driver needs some finishing up

  1. Use a USB cable to connect your computer to the NE0-Nano board
  2. Go in Window’s Device Manager
    • Right-click Other devices » USB-Blaster » Update Driver Software
    • Browse my computer for driver software at C:\intelFPGA_lite\21.1

Install Board Support for DE0-Nano

The Terasic board support for DE0-Nano includes examples, user manual and the Terasic System Builder tool.

  • Download DE0-Nano CD-ROM from terasic.com.tw and unzip to a directory of your liking (e.g. C:\intelFPGA_lite\DE0-Nano)

Note that its Control Panel fails with Load DLL (TERASIC_JTAG_DRIVE.dll) under 64-bit Windows. No biggie, we do not need it.

Before you start

The simulator doesn’t play well with UNC paths (e.g. network shares). It triggers numerous errors and may cause timing simulations to disregard delays. Keeping the files on the server is fine, for as long as you access them through a symbolic link. To create a symbolic link, change to your home directory and create a symbolic link to your file server (in an elevated Power-Shell window do cd ~ ; New-Item -ItemType SymbolicLink -Path "Hardware.sym" -Target "\\server\path\to\files").

If you add the symbolic link to Explorer’s Quick Access, it will resolve the link first. To work around this, first create a regular directory and add that to the Quick Access. Then replace that directory with the symbolic link.

A few last tip before you take off:

Stay clear of absolute path names. Not only do they cause havoc when moving a project to a new directory, but they also confuse the simulator. If the ellipsis () file selector returns an absolute path, simply edit it to make the file relative to the project directory. E.g. change \C:\Users\you\path\project.v to project.v. If you want to quickly change all absolute paths, I suggest closing Quartus and pulling .qsf into an text editor (e.g. emacs).

Lastly, store the files on a local drive. It makes the process of compiling much faster.

Build a circuit

Terasic advises to start with their System Builder to reduce the risk of damaging the board by incorrect I/O settings. I will throw this caution to the wind and use an example Quartus Setting file instead

Start Quartus Prime and create a new project

  • File » New Project Wizard
    • Choose a working directory (.c:\users\you\whatever\example1); project name = example1
    • Project type = Empty Project
    • do not add files
    • Family = Cyclone IV E; name filter = EP4CE22F17C6; click the only device that matches
    • Finish, accepting the default EDA Tools Settings

Start with a Verilog HDL module that tests if inputs are equal (File » New » Design file > Verilog HDL, and later save it as eq1.v)

`timescale 1ns / 1ps
module eq1( input i0, input i1, output eq ); 
    wire p0, p1; // internal signal declaration
    assign eq = p0 | p1; 
    assign p0 = ~i0 & ~i1;
    assign p1 = i0 & i1; 
endmodule

Add a Verilog HDL module (File » New » Design file > Verilog HDL, and later save it as eq2.v)

`timescale 1ns / 1ps
module eq2( input [1:0] a, input [1:0] b, output aeqb ); 
    wire e0, e1; // internal signal declaration
    eq1 eq_bit0_unit(.i0(a[0]), .i1(b[0]), .eq(e0));
    eq1 eq_bit1_unit(.i0(a[1]), .i1(b[1]), .eq(e1));
    assign aeqb = e0 & e1; // a and b are equal if individual bits are equal
endmodule

For the top level, we will use a schematic. Create a symbol file (eq2.sym) for the eq2 module so we can reference it in the schematic.

  • File » Create/Update » Create Symbol Files for Current File

Create the top level schematic

  • File » New » Design file » Block Diagram/Schematic
  • File » Save as » example1.bdf
  • Make sure that this top-level module has the same name as its source file and is the same name as your project name.
  • Update the block diagram (.bdf)
    • select the block diagram tab
    • place the new symbol
      • double-click the canvas
      • expand the project directory and select the eq2.sym file that we just created
      • place the symbol in the block diagram
    • add inputs
      • double-click the canvas
      • expand the libraries directory and select “primitive » pin » input”
      • place the pin so that it touches input a
      • change the pin name to SWITCH[1..0]
      • copy and paste the input pin, and place it so that it touches input b
      • change the pin name to SWITCH[3..2]
    • add output
      • double-click the canvas
      • expand the libraries directory and select “primitive » pin » output”
      • place it so it touches output aeqb (a equals b)
      • change the pin name to LED[0]
  • Last, but not least: Mark this file as the top-level module
    • Right-click the file name, and select “Set as Top-Level Entry”

Compile

  • Processing » Start » Start Analysis & Elaboration

Implementation Constraints

Time to constrain the implementation by specifying the input and output pins along with timing requirements.

External pins assignments

  • Assignments » Pin Planner
    • This will show the five I/O pins
    • Don’t worry about the Fitter Location, it is just whatever the fitter chose last time
    • Double-click the Location field next each of the pin names to add the pin numbers (based on Table 3-2 and 3-3 in the Terasic board’s user manual)
      LED[0] PIN_A15
      SWITCH[3] PIN_M15
      SWITCH[2] PIN_B9
      SWITCH[1] PIN_T8
      SWITCH[0] PIN_M1
    • Change the I/O standard to 3.3V_LVTTL based on the same user manual (change the first one, then copy and paste)

You should end up with something like

altera-intro-pins
Pin assignments

If you plan to branch out, I suggest downloading and importing the settings file (.qsf) with pin assignments from university.altera.com.

Timing requirements

For most design you will want to specify timing requirements. For our simple example however we will not do this.

For the record, to request specific timing requirements (default time quest) you would create Synopsys Design Constraints (File » New SCD File) and save it with the same base name as the top level file (e.g. example1.sdc). For our example it just contains some comments as a reminder.

#create_clock -period 20.000 -name CLOCK_50
#derive_pll_clocks
#derive_clock_uncertainty

For more info, refer to Altera’s class Constraining Source Synchronous Interfaces.

Synthesize and upload to FPGA

Before moving ahead, I like to shorten the path names in the Quartus Settings file (.qsf). This prevents problems when you move the project and especially when you have it stored on a file server. As we will see later, the ModelSim simulator doesn’t support UNC path names, but will honor relative paths even when the projects is on a file server.

  • Assignments » Settings
    • Remove the files (.bdf, .v, .sdc)
    • Add the files back in. After selecting each file using the ellipsis () file selector, edit the resulting path name to exclude the path and press Add.
  • Project Navigator
    • Right-click example1.v and choose Set as Top-Level Entry
  • Alternatively, you can close Quartus and pull the .qsf file in a text editor (e.g. Emacs) and shorten the path names (*_FILE) at the end of the file.

Let us move ahead and generate the binary SRAM Object File (.sof). This is the configuration file to be uploaded to the FPGA

  • Click Compile Design in the Tasks window, or press the Play button in the toolbar.
    • The Compilation Report tab shows the result
    • Correct any critical warnings or errors

This would be the moment to do a timing analysis and/or simulation. However, at this point I’m simply too curious to see if it works just out of the box, so let’s give it a shot.

Connect the DE0-Nano board with a USB cable and upload the configuration file to the FPGA SRAM. (Alternately upload to Flash)

  • Quartus Prime Lite » Tools » Programmer (or double-click Program Device in the task list)
    • Click Hardware Setup, and select USB-Blaster [USB-0]
    • Click Add File, and select your SRAM object file (.sof) in the output_files directory
    • Click Start
    • Save as example1.cdf, so it opens with these settings next time
    • Great! We did it, the design is now on the FPGA (until we remove the power)

Give it a spin

With the FPGA configured, time has come to test it

  • Input is through the DIP switches (SWITCH) on the bottom-left of the board.
  • Output is the right LED (LED[0]) located at the top of the board.
  • We expect the LED to be “on” when switch positions 0 and 1 are identical to positions 2 and 3.

If you prefer bigger switches, you can wire up a breadboard to the GPIO-0 connector.

  • VCC3p3 and Ground are available on the 40-pin expansion header at respectively pin 29 and 12 (or 30) as specified in the board manual.
  • welldoneGPIO_00, 01, 02, 03 are on FPGA BGA pins at respectively D3, C3, A2, A3 and on the 40-pin expansion header at respectively pin 2, 4, 5 and 6.
  • Modify the user constraints file accordingly.

Give yourself a pat on the back; you have achieved something new today!

Timing Analysis

With the first hurdle cleared, we are going to take a closer look at the circuit. The first step will analyze the delays in the circuit to determine the conditions under which the circuit operates reliably. We will follow up with a functional and timing simulation.

If you just want to the timing information on the port-to-port path, you can use

  • Task » Reports » Custom Reports » Report Timing
    • From » click ...
      • Collection = get_ports; List; select SWITCH[0] through SWITCH[3]; click ‘>
    • To » click ...
      • Collection = get_ports; List; select LED[0]; click ‘>

To place timing constraints on the port-to-port paths:

  • Tools » TimeQuest Timing Analyzer
    • Tasks » Update Timing Netlist ; double-click
      • Constraints » Set Maximum Delay
      • From » click ...
        • Collection = get_ports; List; select SWITCH[0] through SWITCH[3]; click ‘>
      • To » click ...
        • Collection = get_ports; List; select LED[0]; click ‘>
      • Delay value = 100 ns (more or less random value)
      • Run
    • Constraints » Write SDC File to example1.sdc
    • Task » Reports » Custom Reports » Report Timing
      • no need to specify the path, just click Report Timing
      • this reveals a Data Delay of about 6.7 ns.

If you want Quartus to meet more stringent restrains, you need to specify these and recompile. This will direct Quartus to seek an implementation that meets these constraints. However, in this case we only specified restraints because we’re interested in the values. [TimeQuest User Guide]

Functional Simulation

functionalThe Verilog HDL (.v) instructions are compiled (Analysis and Synthesis) into Register Transfer Logic (RTL) netlists, called Compiler Database Files (.cdb). As the name implies, data is moved through gates and register, typically subject to some clocking condition. A functional simulation tests the circuit at this RTL level. As such it will simulate the functionality, but not the propagation delays.

Altera Quartus ships bundled with their flavor of ModelSim that lets you draw waveforms for the signals going to the module under test. You can then save these waveforms as a wave.do file or HDL test bench. For this example we to skip the step of drawing waveforms, and jump straight to using test benches.

Start by creating a test bench (eq2_tb.v) that instantiate module under test and drive its inputs. As before, you should remove the path from the file name (Assignments » Settings » Files).

`timescale 1ns / 100ps
`default_nettype none

module eq2_tb;
  reg [1:0] a;  // inputs
  reg [1:0] b; 
  wire aeqb;  // output
 
  // Instantiate the Device Under Test (UUT)
  eq2 dut ( .a(a), 
            .b(b), 
            .aeqb(aeqb) );
  initial begin
    a = 0;  // initialize inputs
    b = 0;
 
    #100;  // wait 100 ns for global reset to finish (Xilinx only?)
         
    // stimulus starts here
    a = 2'b00; b = 2'b00; #10 $display("%b", aeqb);
    a = 2'b01; b = 2'b00; #10 $display("%b", aeqb);
    a = 2'b01; b = 2'b11; #10 $display("%b", aeqb);
    a = 2'b10; b = 2'b10; #10 $display("%b", aeqb);
    a = 2'b10; b = 2'b00; #10 $display("%b", aeqb);
    a = 2'b11; b = 2'b11; #10 $display("%b", aeqb);
    a = 2'b11; b = 2'b01; #10 $display("%b", aeqb);
    #200 $stop;
   end
endmodule

Configure Quartus to create a script that compiles the test bench and modules for ModelSim

  • Assignments » Settings » EDA Tool Settings » Simulation » Compile test bench
    • New
    • Test bench name = Functional test bench
    • Top level module in test bench = eq2_tb
    • File name = eq2_tb.v, eq1.v, eq2.v (name the test bench and all the modules that ModelSim needs to compile). After selecting each file using the ellipsis () file selector, edit the resulting path name to exclude the project path, then press Add.

If things don’t go your way, please check:

  • I strongly suggest putting the test bench in a separate file, when doing timing simulations. This keeps ModelSim from using the .v file instead of the .vo file what causes propagation delays not to show in timing simulations.
  • (sdfcomp-7) Failed to open SDF file “whatever.sdo” in read mode is most likely caused by the files residing on a UNC path. Refer to the quote at the beginning of this article.
  • (sdfcomp-14) Failed to open “modelsim.ini” specified by the MODELSIM environment variable is also most likely caused by the files residing on a UNC path. Refer to the quote at the beginning of this article.
  • When you get error deleting “msim_transcript” permission denied, close ModelSim first before starting it.
  • Errors like Instantiation of ‘cycloneive_io_obuf’ failed. The design unit was not found indicate that the global libraries altera_ver or cycloneive_ver were not included in Assignments » Settings » Libraries.
  • Keep in mind that the Windows filename limit (260) may be exceeded.
  • To prevent Warning: (vsim-WLF-5000) WLF file currently in use: vsim.wlf, quit ModelSim and delete simulation/modelsim/sim.wlf and simulation/modelsim/wlft*.

Compile the circuit to RTL netlists

  • Processing » Start » Start Analysis & Elaboration

Start the simulation

  • Tools » Simulation Tool » RTL Simulation
  • Select signals of interest
    • Objects » select the signals of interest » Add Wave (or the green circle with the + sign)
    • In the text field in the toolbar, change the simulation time from 1 ps to 200 ns and click the run button to the right (or type run 1 us at the command prompt)
    • Undock the Wave Window (icon on the far top right of the window). In case the Wave Window is hidden, use Windows » Wave.
    • Click “zoom full” and observe the simulated waveforms.

You should see something like

own work
Snip

If you make a change to either your DUT or your test bench

  • right-click the test bench and select recompile.
  • make sure that you click the Restart button directly to the right of the simulation time window (or simply type restart at the command prompt).

For more information, refer to Quartus II Handbook 3: Verification

Timing Simulation

timingBefore you start a timing simulation, first close ModelSim.

After compiling your circuit into RTL netlists (.cdb), it is fitted to the physical device (.qsf) trying to satisfy resource assignments and constraints. It then attempts to optimize the remaining logic. The Netlist Writer can export this information for ModelSim in the form of Verilog Output Files (.vo), Standard Delay Format Output Files (.sdo) and scripts to prepare ModelSim through compilation and initialization.

For this exercise, we will do a timing simulation for the whole system. With our top-level entry being a schematic (.bdf), we first need to convert this to Verilog HDL

  • Open the schematic
  • File » Create/Update » Create HDL Design File from Current File
    • File type = Verilog HDL
  • Assignments » Settings » Files
    • replace the example1.bdf file with example1.v
  • Make it the top-level module (so much for using a schematic … eh)

Create a system test bench (example1_tb.v). As before, you should remove the path from the file name (Assignments » Settings » Files).

`timescale 1ns / 100ps
`default_nettype none

module example1_tb;
  reg [3:0] SWITCH;  // inputs
  wire LED;          // output
 
  example1 uut ( .SWITCH(SWITCH), 
                 .LED(LED) );
  initial begin
    SWITCH[3:0] = 2'b0000; // initialize inputs
 
    #100;  // wait 100 ns for global reset to finish (Xlinx only?)
         
    // stimulus starts here
    SWITCH[1:0] = 2'b00; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b01; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b01; SWITCH[3:2] = 2'b11; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b10; SWITCH[3:2] = 2'b10; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b10; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b11; SWITCH[3:2] = 2'b11; #10 $display("%b", LED);
    SWITCH[1:0] = 2'b11; SWITCH[3:2] = 2'b01; #10 $display("%b", LED);
    #200 $stop;
  end
endmodule

Create the verilog output files

  • Processing » Start » Start EDA Netlist Writer

Configure Quartus to create a script that compiles the test bench and modules for ModelSim

  • Assignments » Settings » Libraries » Global
    • Add
      • altera_ver
      • cycloneive_ver
  • Assignments » Settings » EDA Tool Settings » Simulation
    • Compile test bench
      • Delete the old Functional test bench
      • New
      • Test bench name = Timing test bench
      • Top-level module in test bench = example1_tb
      • Add the files listed below. After selecting each file using the ellipsis () file selector, edit the resulting path name to exclude the project path, then press Add.
        • example_tb.v
        • simulation/modelsim/example1.vo
          • it appears that eq2 and eq1 were included inside simulation/modelsim/example1.vo. If you have larger modules, make sure you include al the .vo files here.
        • if the simulation/modelsim directory doesn’t exist: recompile, and then add the .vo module.
      • If things don’t go your way, refer to the section “Some hints ..” under “Test bench for functional simulation”.

Start a full synthesis for the circuit

  • Processing » Start » Start Compilation

Start ModelSim

  • Tools » Simulation Tool » Gate Level Simulation
    • Timing model = Slow -6 1.2V 85 Model (default), this simulates a slow -6 speed grade model at 1.2 V and 85 °C.

Select the signals

  • Objects » select the signals of interest » Add Wave
  • Increase simulation time to 1 µs and click the run button on the right
  • Undock the Wave Window (so you can expand it)
  • Click “zoom full” and observe the simulated waveforms. You should see something like

Expect something like

own work
Timing

For more hands on experience, refer to Altera’s excellent University Program lessons, or the Terasic CD-ROM files (C:\altera_lite\DE0-Nano) for examples and documentation. Altera also has a more generic Become a FPGA Designer video-based class.

Our first implementation is the SPI interface Math Talk. This is then used to build a demonstration of math operations in FPGA as eluded to from the inquiry How do Computer do Math?.

c’est tout

Square root circuit

(c) Copyright 2016 Coert Vonk

Implements a math square root using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Square root using logic gates

(c) Copyright 2016 Coert Vonk The square root method implemented here is a simplification of Samavi and Sutikno improvements of the non-restoring digit recurrence square root algorithm. For details about this method, refer to Chapter 7 of the inquiry “How do Computers do Math?“. \(\)

Simplified Samovi Square Root

The square root of an integer can be calculated using a circuit of Controlled Subtract-Multiplex (csm) blocks. The blocks were introduced as part of the divider implementation. The square root circuit for an 8-bits value is given in shown below.

Each row performs one “attempt subtraction” cycle. For a start, the top row attempts to subtracts binary 01. If the answer would be negative, the most significant bit of the output will be ‘1’. This msb drives the drives the Output Select (\(os\)) inputs that effectively cancels the subtraction if the result would be negative.

8-bit simplified-Samovi square-root

Similar to the divider, using Verilog HDL we can generate instances of csm blocks based on the word length of the radicand (xWIDTH) . Once more, to describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference \(d\) and borrow-out \(b\). E.g. \(d_{13}\) identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names \(d\) and \(b\) and do the same for the quotient itself as shown in the table below.

2BD

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.

generate genvar ii, jj;
: 2BD
endgenerate

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

As usual, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when each subtraction needs to be cancelled.

Post-map Timing Analysis reveals the worst-case propagation delays. The values in the table below, assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits
8-bits
16-bits
27-bits
32-bits
Propagation delay for simplified-Samovi square-root

Conclusion

Judging from the number of research papers popping up each year, we can deduce that this is a still active field.

Add elementary functions such as sin, cos, tan, exponential, and logarithm.

Suggestion for further reading

Divider circuit

(c) Copyright 2016 Coert Vonk

Implements a math divider using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Divider using logic gates

(c) Copyright 2016 Coert Vonk The attempt-subtraction divider was introduced in the inquiry How do Computer do Math.\(\) This most basic divider consists of interconnected Controlled Subtract-Multiplex (csm) blocks. Each blocks contains a 1-bit Full Subtractor (fs) with the usual inputs a, b and bi and outputs d and bo. The output select signal, os, signal selects between input x and and the difference x-y.

$$ \begin{align*} d^\prime &=x \oplus y \oplus b_i\\ d&=os \cdot x + \overline{os} \cdot d^\prime\\ b_o&=\overline{x} \cdot y + b_i \cdot (\overline{x \oplus y}) \end{align*} $$
1-bit Controlled Subtract-Multiplex

Attempt-subtraction divider

A complete 8:4-bit divider can therefore be implemented by a matrix of csm modules connected on rows and columns as shown in figure below. Each row performs one “attempt subtraction” cycle. Note that the most significant bit is used to drive the Output Select os inputs. (For more details see “Combinational arithmetic“.)

8:4-bit Attempt-Subtraction

Similar to the multipliers, using Verilog HDL we can generate instances of csm blocks based on the word length of the dividend (xWIDTH) and divisor (yWIDTH). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference (\(d\)) and borrow-out (\(b\)). E.g. \(d_{13}\) identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names (\(d\) and \(b\)) and do the same for the quotient itself as shown in the table below.

(c) Copyright 2016 Coert Vonk
Signals do, bo, x, y, bi and os

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions. generate genvar ii, jj; for ( ii = 0; ii <; xWIDTH; ii = ii + 1) begin: gen_ii for ( jj = 0; jj <; yWIDTH + 1; jj = jj + 1) begin: gen_jj math_divider_csm_block csm( .a ( jj <; 1 ? x[xWIDTH-1-ii] : ii > 0 ? d[ii-1][jj-1] : 1’b0 ), .b ( jj <; yWIDTH ? y[jj] : 1'b0 ), .bi ( jj > 0 ? b[ii][jj-1] : 1’b0 ), .os ( b[ii][yWIDTH] ), .d ( d[ii][jj] ), .bo ( b[ii][jj] ) ); end end for ( ii = 0; ii <; xWIDTH; ii = ii + 1) begin: gen_p assign q[xWIDTH-1-ii] = ~b[ii][yWIDTH]; end for ( jj = 0; jj <;= yWIDTH; jj = jj + 1) begin: gen_r assign r[jj] = d[xWIDTH-1][jj]; end endgenerate[/code]

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

As usual, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when each subtraction needs to be cancelled.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The values in the table below, assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 11.2 ns 10.0 ns 6.8 ns
8-bits 37.1 ns 33.2 ns 22.0 ns
16-bits 148 ns 132 ns 84.8 ns
27-bits 408 ns 365 ns 236 ns
32-bits 485 ns 434 ns 279 ns
Propagation delay in attempt-subtraction divider

Continuing from “Divider using logic gates”, the next chapter shows an implementation of the square root algorithm introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

A faster multiplier circuit

(c) Copyright 2016 Coert Vonk

Implements a carry-save array multiplier using a circuit of logic gates. Written in Verilog HDL for Altera and Xilinx FPGA’s.

Carry-save array multiplier

(c) Copyright 2016 Coert Vonk \(\) The propagating carry limits the performance of the previous algorithm. Here we investigates methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick]

The speed of the multiplier is directly related to this execution time of these Digital Signal Processing (DSP) applications.

Since multiplication dominates the execution time of most DSP algorithms, so there is a need of high-speed multiplier. Examples are convolution, Fast Fourier Transform (FFT), filtering and in ALU of microprocessors.

Carry-save Array Multiplier

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSA). Even though the building block is still the multiplying adder (ma), the topology of prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways.

The illustration below gives an example of this multiplication process.

(c) Copyright 2016 Coert Vonk
Inside a carry-save array multiplier

Again, the building block is the multiplying adder (ma) as describe on the previous page. However, the topology is so that the carry-out from one adder is not connected to the carry-in of the next adder. Hence preventing a ripple carry. The circuit diagram below shows the connections between these blocks.

4-bit carry-save array multiplier

The observant reader might notice that ma0x can be replaced with simple AND gates, ma4x can be replaced by adders. Also the block ma43 is not needed. More interesting, the the ripple adder in the last row, can be replace with the faster carry look ahead adder.

Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of ma blocks based on the word length of the multiplicand and multiplier (N). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (s) and carry-out (c). E.g. c_13 identifies the carry-out signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names s and c and do the same for the product itself as shown in the table below.

Function for output signals ‘so’ and ‘co’ and output signals ‘x’, ‘y’, ‘si’, ‘ci’ and ‘p’

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.

generate genvar ii, jj;
    for ( ii = 0; ii <;= N; ii = ii + 1) begin: gen_ii
        for ( jj = 0; jj <; N; jj = jj + 1) begin: gen_jj
            math_multiplier_ma_block ma(
                .x ( ii <; N ? a&#91;jj&#93; : (jj > 0) ? c[N][jj-1] : 1'b0 ),
                .y ( ii <; N ? b&#91;ii&#93; : 1'b1 ),
                .si ( ii > 0  jj <; N - 1 ? s&#91;ii-1&#93;&#91;jj+1&#93; : 1'b0 ),
                .ci ( ii > 0 ? c[ii-1][jj] : 1'b0 ),
                .so ( s[ii][jj] ),
                .co ( c[ii][jj] ) );
            if ( ii == N ) assign p[N+jj] = s[N][jj];
        end
        assign p[ii] = s[ii][0];
    end
endgenerate

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit cause a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(2N\), this makes this carry-save multiplier is about 33% faster as the ripple-carry multiplier. Note that the average propagation delay is about half of this.

The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.0 ns 8.0 ns 5.6 ns
8-bits 18.7 ns 16.8 ns 11.4 ns
16-bits 30.9 ns 27.6 ns 18.3 ns
27-bits 46.8 ns 41.9 ns 27.7 ns
32-bits 57.9 ns 51.6 ns 34.3 ns
Propagation delay in carry-save array multiplier

Other multipliers

The Wallace Multiplier decreases the latency by reorganizing the additions. Wikipedia has a good description of the algorithm. Due to the irregular routing, they can be difficult to route on a FPGA. As a consequence, additional wire delays may cause it to perform slower than carry-safe array multipliers.

The Wallace Multiplier can be combined with Booth Coding. The Booth Multiplier (alt) uses, say, 2 bits of the multiplier in generating each partial product thereby using only half the number of rows. Booth Multipliers with more fancy VLSI technique such as 0.6μ BiCMOS process using emitter coupled logic makes 53×53 multipliers possible with a latency of less than 2.6 nanoseconds [ref].

Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by reordering the values to be multiplied. It is the standard technique used in chip design.

Vedic arithmetic is the ancient system of Indian mathematics which has a unique technique of calculations based on 16 Sutras (Formulae)

Another algorithm is Karatsuba. Overview.

Following this “Carry-save array multiplier using logic gates”, the next chapter shows an implementation of the divider introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

Multiplier circuit

(c) Copyright 2016 Coert Vonk

Implements a math multiplier using circuits of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Multiplier using logic gates

multiplier icon We introduced the carry-propagate array multiplier in the inquiry “How do Computers do Math?“.\(\)

This multiplier is build around Multiplier Adder (ma) blocks. These ma blocks are themselves build around the Full Adder (fa) blocks introduced in the adder section. These fa blocks have the usual inputs \(a\) and \(b\), \(c_i\) and outputs \(s\) and \(c_o\). The special thing is that the internal signal \(b\) is an AND function of the inputs \(x\) and \(y\) as depicted below.

$$ \begin{align*} a &= s_i\\ b &= x \cdot y\\ s_o &= s_i\oplus b\oplus c_i\\ c_o &= s_i \cdot b + c_i \cdot(s_i \oplus b) \end{align*} $$
1-bit multiplying-adder

Carry-propagate Array Multiplier

As shown in the inquiry “How do Computers do Math?“, a carry-propagate array multiplier can be built by combining many of these ma blocks. The circuit diagram below shows the connections between these blocks for a 4-bit multiplier.

4-bit carry-propagate array multiplier

For an implementation in Verilog HDL, we can instantiate ma blocks based on the word length of the multiplicand and multiplier (\(N\)). If you are new to Verilog HDL, remember that the generate code segment expands during compilation time. In other words, it is just a short hand for writing out the long list of ma block instances.

generate genvar ii, jj;
    for ( ii = 0; ii <; N; ii = ii + 1) begin: gen_ii
        for ( jj = 0; jj <; N; jj = jj + 1) begin: gen_jj
            math_multiplier_ma_block ma( 
                .x(?), .y(?), .si(?), .ci(?),
                .so(?), .co(?) );
        end
    end
endgenerate&#91;/code&#93;
</p>
<p>
    As you might notice, the input and output ports are not described. For this, we need to derive the rules that govern these interconnects. Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals <em>sum</em> (\(s\)) and <em>carry-out</em> (\(c\)). E.g. \(c_{13}\) identifies the carry-out signal for the block in row <code>1</code> and column <code>3</code>. Note that the circuit description depicts the matrix in a slanted fashion.

    <div class="flex-container">
        <figure>
            <a class="hide-anchor fancybox" href="/wp-content/uploads/math-multiplier-ripple-carry-tbl-ma-output.png">
                <img class="wp-image-16504" src="https://coertvonk.com/wp-content/uploads/math-multiplier-ripple-carry-tbl-ma-output.png" alt="own work" width="420" />
            </a>
            <figcaption>
                Output signals 'so' and 'co'
            </figcaption>
        </figure>
    </div>
</p>
<p>
    Knowing this, we can enter the output signals in the Verilog HDL code
    math_multiplier_ma_block ma( 
    .x(?), .y(?), .si(?), .ci(?),
    .so ( s[ii][jj] ),
    .co ( c[ii][jj] ) );

Next, we express the input signals as a function of the output signal names \(s\) and \(c\) as shown in the table below.

own work
Input signals 'x', 'y', 'si' and 'ci'

Based on this table, we can express the input assignments for each ma using "c ? a : b" expressions. Note that Verilog 2001 does not allow these programming statements for the output pins. This is why we expressed the input ports as a function of the output ports instead of visa versa.

math_multiplier_ma_block ma( 
    .x ( a[jj] ),
    .y ( b[ii]),
    .si ( ii == 0 ? 1'b0 : jj <; N - 1 ? s&#91;ii-1&#93;&#91;jj+1&#93; : c&#91;ii-1&#93;&#91;N-1&#93; ),
    .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ),
    .so ( s[ii][jj] ),
    .co ( c[ii][jj] ) );

All that is left to do is to express the inputs of the module as a function of the output signals

Output signal 'p'

Putting it all together, we get the following snippet generate genvar ii, jj; for (ii = 0; ii <; N; ii = ii + 1) begin: gen_ii for (jj = 0; jj <; N; jj = jj + 1) begin: gen_jj math_multiplier_ma_block ma( .x ( a[jj] ), .y ( b[ii]), .si ( ii == 0 ? 1'b0 : jj <; N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ), .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ), .so ( s[ii][jj] ), .co ( c[ii][jj] ) ); end assign p[ii] = s[ii][0]; end for (jj = 1; jj <; N; jj = jj + 1) begin: gen_jj2 assign p[jj+N-1] = s[N-1][jj]; end assign p[N*2-1] = c[N-1][N-1]; endgenerate[/code]

The ma block compiles into the RTL netlist shown below

own work
1-bit multiplying-adder in RTL

As shown in the figure below, the for loops unroll into 16 interconnected ma blocks.

own work
4-bit carry-propagate array multiplier in RTL

The complete Verilog HDL source code is available at:

Results

The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit because a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(3N\). Note that the average propagation delay is about half of this.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.9 ns 8.9 ns 6.1 ns
8-bits 20.8 ns 18.6 ns 12.4 ns
16-bits 41.3 ns 36.9 ns 24.2 ns
27-bits 69.6 ns 62.1 ns 40.9 ns 55 ns
32-bits 83.4 ns 74.5 ns 49.0 ns
Propagation delay in carry-propagate array multiplier

The timing analysis for \(N=27\), reveals that the worst-case propagation delay path goes through \(c_0\) and \(s_o\) as shown below on the left. When measuring the worst-case propagation delay on the actual device, we use input values that cause the maximum number ripple carries and sums propagating. For a 27-bit multiplier that where the input also has a maximum value of 99,999,999, the propagation path is simulated in a spreadsheet as shown below on the right.

Worst case path
Worst case input

Brute force using the FPGA to find all combinations of operands that cause long propagation delays revealed 27'h2FA3A92 * 27h'55D4A77, 27'h60A308B * 27'd99999999 (50ns), 27'h775A668 * 27'd89999999 (55 ns), 27'h56F5D8F * 27'h3AAAB7B (55 ns).

Following this "Math multiplier using logic gates", the next chapter explores methods of making the multiplication operation faster.

A faster adder circuit

(c) Copyright 2016 Coert Vonk

Implements carry-lookahead adders using circuits of logic gates Written in Verilog HDL for Altera and Xilinx FPGA’s.

Carry-lookahead adders

(c) Copyright 2016 Coert Vonk This chapter introduces algorithms to reduce the delay when adding numbers. We will look at two carry-lookahead adders using logic gates. \(\)

Carry-lookahead adder

Adding a carry-lookahead circuit can make the carry generation much faster. The individual bit adders no longer calculate outgoing carries, but instead generate propagate and generate signals.

We define a Partial Full Adder (pfa) module with the usual ports a and b for the summands, carry input ci , the sum s, and two new signals propagate p and generate g. The propagate (p) indicates that the bit would not generate the carry bit itself, but will pass through a carry from a lower bit. The generate (g) signal indicates that the bit generates a carry independent of the incoming carry. The functionality of the pfa module is expression the circuit and Boolean equations shown below.

(c) Copyright 2016 Coert Vonk
$$ \begin{align*} g &=a \cdot b \\ p &=a \oplus b \\ s &=p \oplus c_i \end{align*} $$
1-bit partial full adder

For bit position n, the outgoing carry cn is a function of pn, gn and the incoming carry ci,n. Except for bit position 0, the incoming carry equals the outgoing carry of the previous pfa, \(c_{i,n}=c_{o,n-1}\) $$ \begin{align*} c_n &= g_n + c_{in_{n}} \cdot p_n \\ &= g_n + c_{n-1} \cdot p_n \end{align*} $$

For a 4-bit cla this results in the following equations for the carryout signals: $$ \begin{align*} c_0& = g_0 + c_i \cdot p_0 \\ c_1& = g_1 + c_0 \cdot p_1 \\ c_2& = g_2 + c_1 \cdot p_2 \\ c_3& = g_3 + c_2 \cdot p_3 \\ \end{align*} $$

Substituting the cn-1

$$ \begin{align*} c_0 &= g_0 + c_{i} \cdot p_0 \\ c_1 &= g_1 + (g_0 + c_{i} \cdot p_0) \cdot p_1 \\ &= g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1 \\ c_2 &= g_2 + (g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1) \cdot p_2 \\ &= g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \\ c_3 &= g_3 + (g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2) \cdot p_3 \\ &= g_3 + g_2 \cdot p_3 + g_1 \cdot p_2 \cdot p_3 + g_0 \cdot p_1 \cdot p_2 \cdot p_3 +c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \cdot p_3 \end{align*} $$

The outgoing carries c0…3 no longer depend on each other, thereby eliminating the “ripple effect”. The outgoing carries can now be implemented with only 3 gate delays (1 for p/g generation, 1 for the ANDs and 1 for the final OR assuming gates with 5 inputs).

The circuit below gives an example of a 4-bit carry look-ahead adder.

(c) Copyright 2016 Coert Vonk
4-bit carry-lookahead adder

The complexity of the carry look-ahead increases dramatically with the bit number. Instead of calculating higher bit carries, one may daisy chaining the carry logic as shown for the 12-bit adder below.

(c) Copyright 2016 Coert Vonk
12-bit carry-lookahead adder

An implementation can be found at GitHub

Results

The propagation delay tpd depends on size n and the value of operands. For a given size n, adding the value 1 to an operand that contains all zeroes causes the longest propagation delay. The post-map Timing Analysis tool reveals the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 5.3 ns
8-bits 9.8 ns 8.7 ns 6.2 ns
16-bits 10.0 ns 8.9 ns 6.2 ns
27-bits 15.2 ns 13.5 ns 9.5 ns
32-bits 21.8 ns 19.7 ns 13.6 ns
42-bits 24.4 ns 21.7 ns 14.9 ns
Propagation delay in 1-level carry-lookahead adder

Multi-level carry-lookahead adder

To improve speed for larger word sizes, we can add a second level of carry look ahead. To facilitate this, we extend the cla circuit by adding \(p_{i,j}\) and \(g_{i,j}\) outputs. The propagate signal \(p_{i,j}\) indicates that an incoming carry propagates from bit position \(i\) to \(j\). The generate signal \(g_{i,j}\) indicates that a carry is generated at bit position \(j\), or if a carry out is generated at a lower bit position and propagates to position \(j\).

For a 4-bit block the equations are $$ \begin{align*} p_{0,3} &= p_3 \cdot p_2 \cdot p_1 \cdot p_0 \\ g_{0,3} &= g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 \\ c_o &= g_{3,0} + p_{3,0} \cdot c_i \end{align*} $$

The circuit for a 16-bit two-level carry-lookahead adder is shown below

(c) Copyright 2016 Coert Vonk
16-bit carry-save adder using two levels of ‘cla’

An implementation can be found at GitHub

Results

Once more, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), adding the value 1 to an operand that contains all zeroes causes the longest propagation delay.

Once more, the post-map Timing Analysis predicts the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. As usual, the exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 4.3 ns
8-bits 9.3 ns 8.3 ns 5.8 ns
16-bits 11.4 ns 10.2 ns 7.1 ns
27-bits 15.3 ns 13.6 ns 9.6 ns
32-bits 18.6 ns 16.7 ns 11.6 ns
42-bits 18.0 ns 16.1 ns 11.2 ns
Propagation delay in 2-level carry-lookahead adder

Others

Other adder designs are carry-skip, carry-select and prefix adders.

Following this “Carry-lookahead adders using logic gates”, the next chapter shows an implementation of a multiplier introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

Adder and subtractor circuits

(c) Copyright 2016 Coert Vonk

Implements an adder and subtractor using circuits of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Adder and subtractor using logic gates

(c) Copyright 2016-2022 Coert Vonk The inquiry “How do Computers do Math?” introduced the carry-propagate adder and the borrow-propagate subtractor. Here we will recapitulate and implement these using Verilog HDL.\(\)

Carry-propagate adder

The full adder (fa) forms the basic building block. This full adder adds two 1-bit values a and b to the incoming carry (ci), and outputs a 1-bit sum (s) and a 1-bit outgoing carry (co). The circuit and Boolean equations, shown below, give the relations between these inputs and outputs.

$$ \begin{aligned} s &= a \oplus b \oplus c_i\\ c_o &= a \cdot b + c_i \cdot (a \oplus b) \end{aligned} \nonumber $$
1-bit full adder

To build a n-bit propagate-adder, we combine n fa blocks. The circuits adds the least significant bits and passes the carry on to the next bit, and so on. Combining the output bits forms the sum s. The circuit shown below gives an example of a 4-bit carry-propagate adder.

4-bit carry-propagate adder

We describe the circuit using array instantiation. In this a[] and b[] are the summands, ci[] is the in-coming carry, co[] is the out-going carry and s[] is the sum.

math_adder_fa_block fa [N-1:0] ( .a  ( a ),
    .b  ( b ),
    .ci ( {c[N-2:0], 1'b0} ),
    .s  ( s[N-1:0] ),
    .co ( {s[N], c[N-2:0]}) );

The compiler will optimize the fa blocks with forced inputs, but most fa blocks will compile to a RTL netlist as shown below.

(c) Copyright 2016-2022 Coert Vonk
1-bit full adder (fa) block in RTL

The 4-bit adder compiles into the daisy chained fa blocks as shown.

(c) Copyright 2016-2022 Coert Vonk
4-bit carry-propagate adder in RTL

The complete Verilog HDL code along with the test bench and constraint files are available through GitHub for Xilinx (Spartan-6 LX9) and Altera (Cyclone IV DE0-Nano) boards.

Results

The propagation delay \(t_{pd}\) depends on size N and the value of operands. For a given size N, adding the value 1 to an operand that contains all zeroes causes the longest propagation delay.

The worst-case propagation delays for the Altera Cyclone IV on the DE0-Nano are found using the post-map Timing Analysis tool. The exact values depend on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 6.9 ns 6.2 ns 4.3 ns
8-bits 8.6 ns 7.6 ns 5.4 ns
16-bits 15.7 ns 13.9 ns 9.7 ns
27-bits 20.2 ns 18.0 ns 12.3 ns
32-bits 26.0 ns 23.2 ns 15.7 ns
42-bits 30.2 ns 26.9 ns 18.1 ns
Propagation delay in propagate-carry adder

Borrow-propagate subtractor

Similar to addition, the simplest subtraction method is borrow-propagate, as introduced in Chapter 7 of the inquiry “How do Computers do Math?“. Again, we will start by building a 1-bit subtractor (fs). The inputs a and b represent the 1-bit binary numbers being added. Output d, the difference. li and lo are the incoming and outgoing borrow/loan signals.

The outputs d and lo can be expressed as a function of the inputs. The difference d is an Exclusive-OR function, just as the sum was for addition.

$$ \begin{align*} d&=a \oplus b \oplus l_i \\ l_o&=\overline{a} \cdot b + l_i \cdot (\overline{a \oplus b}) \end{align*} $$
1-bit full subtractor

To build a 4-bit subtractor we combine four of these building blocks.

4-bit borrow-propagate subtractor

The implementation of the borrow-propagate subtractor is very similar to the adder can be found at GitHub.

Combined adder and subtractor

Another method to subtract is based around the fact that \(a – b = a + (-b)\). This allows us to build a circuit that can add or subtract. When the operation input op equals 1, it subtracts b from a, otherwise it adds the values.

Under two’s complement, subtracting b is the same as adding the bit-wise complement of b and adding 1. The inputs b is negated by inverting its bits (using an XOR with signal op), and 1 is added by setting the least significant carry input to 1. We can build this a 4-bit adder/subtractor using fa blocks as shown below, where r is the result.

4-bit combined adder and subtractor

Note that the circuit also includes a overflow detection for two’s complement. Overflow occurs when c2 differs from the final carry-out c3.

In all these ripple carry adder and subtractors, the carry propagates from the lowest to the highest bit position. This propagation causes a delay that is linear with the number bits.

Moving on from this math adder and subtractor using logic gates, the next chapter explores faster circuits.

Introduction

derived work

This article series describes implementations of math operations using circuits of logic gates. Written in Verilog HDL for Altera and Xilinx FPGA’s.

Primary school teaches our students methods for addition, subtraction, multiplication and division. Computer hardware implements similar methods and performs them with astounding speed. This enables applications such as computer vision, process control, encryption, hearing aids, video compression to dubious practices as high-frequency trading.

Introduction

This article describes how to build such hardware. In such, it is a sequel to the inquiry “How do Computers do Math?” that in the chapter “Math Operations Using Gates” introduced conceptual circuits using logic gates. We will model the various math operations using digital gates. Here we combine the gates into circuits and describe them using the Verilog Hardware Description Language (HDL). These Verilog HDL descriptions are compiled and mapped to a Field Programmable Gate Array (FPGA).

Working knowledge of the Verilog HDL is assumed. To learn more about Verilog HDL, I recommend the book FPGA Prototyping with Verilog Examples, an online class or lecture slides. To help you get up to speed with the development boards, wrote Getting Started documents for two popular Altera and Xilinx boards.

We aim to study algorithms to implement the algorithms in generic VLSI, and as such do will not use the highly optimized carry chain connections present on many FPGAs.

Hail to the FPGA

Let’s take this moment to honor the virtues of the FPGA. A FPGA can do many things at the same time and still respond immediately to input events. It is generally more complicated to create and debug the same logic in a FPGA compared to a microprocessor. Because of the challenges that FPGA development poses, many systems combine FPGAs and microprocessors to get the best of both worlds. For example, to recognize objects in a video stream, one would implement the pre-processing (noise removal, normalization, edge detection) in an FPGA, but the higher-level logic on a CPU or DSP.

With a FPGA, you can put massive parallel structures in place. For example, an high-speed data stream can be distributed across the whole FPGA chip to be processed in parallel, instead of having a microprocessor deal with it sequentially.

FPGAs can accelerate machine learning algorithms, video encoding, custom algorithms, compression, indexing and cryptography. By implements a soft microprocessor as part of the FPGA it can also handle high level protocols such as handle board management, protocol bridging and security tasks. [IEEExplore]

Tools

The code examples were tested on an Altera “Cyclone IV E” FPGA using Quartus Prime 16.1. Earlier code iterations used a Xilinx “Spartan-6” with their ISE Design Suite. The Verilog descriptions should work equally well on other boards or environments.

Continue reading:
  1. Demonstration
  2. Adder (and subtractor)
  3. Faster adder (and subtractor)
  4. Multiplier
  5. Faster multiplier
  6. Divider
  7. Square root (and conclusion)

Starting with Xilinx

This describes how to install the development environment for the Avnet’s Xilinx Spartan-6 FPGA LX9 MicroBoard under Windows 10 x64. For Altera boards, refer to Getting started with FPGA design using Altera Quartus Prime. Another interesting Xilinx based board is the XuLA (XC3S200A).

Note 1: you need to be logged into em.avnet.com to access the links. This might be the only occasion where I had to use the Edge browser to access the Support & Download area.

Note 2: I can’t describe the install for the ChipScope Pro, because my license expired.

Install the Xilinx ISE Design Suite

  1. ISE Design Suite
    • download version 14.7 from xilinx.com
    • extract the tar file, and run xsetup.exe
    • select ISE WebPACK
    • have some patience.
      :
  2. Xlinx License Manager
    • will start automatically
    • get Free Vivado/ISE WebPack License » next
    • sign in (if needed create an account first)
    • generate a node locked license for ISE WebPACK
    • the Xilinx.lic will arrive as an email attachment
    • click Load License button, and install the license from the email
      :
  3. Running on Windows 10
    • ISE is in maintenance mode, and doesn’t support Windows 10 (or 8.x)
    • According eevblog, crashes with file dialogs in ISE and iMPACT can be prevented by turning off Smart Heap. To do so:
      • rename libPortability.dll to libPortability.dll.orig
      • copy libPortabilityNOSH.dll to libPortability.dll in
    • in
      • C:\Xilinx\14.7\ISE_DS\ISE\lib\nt64
      • C:\Xilinx\14.7\ISE_DS\common\lib\nt64 (copy dll from first location)
        :
  4. See if it starts
    • Double-click the ISE Design Suite icon on your desktop
      :

Install Avnet Board Support

  1. Install the board description (XDB)
    • Install the Spartan-6 FPGA LX9 MicroBoard XDB file from em.avnet.com
    • unzip, then again
    • unzip avnet_edk14_3_xbd_files.zip file to the \board folder.
      :
  2. Install USB-to-UART driver
    • Download the CP210x virtual com port driver from silabs.com.
    • Extract, and install by running CP210xVCPInstaller_x64.exe
    • Verify that Serial Port “Silicon Lapbs CP2010x USB to UART Bridge (COMx)” appears in Device Manager when the micro-USB cable is plugged in. For details see the CP210x_setup_guide.
    • Walk through the examples in the board’s Getting Started Guide. Note that instead of HyperTerm, you can use PuTTY.
      :
  3. Install USB-to-JTAG driver
    • Download the Digilent Adept 2.16.1 System.
    • Run the executable to install
    • Verify that the “Digilent USB Controller” appears in Device Manager when the USB Type A plug is plugged in.
      :
  4. Install JTAG programming utility
    • Download the Digilent Plugin for Xlinx Tools.
    • Extract, and follow instructions in the enclosed Digilent_Plug-in_Xilinx_v14.pdf
    • Copy the files from the nt64 folder to C:\Xilinx\14.7\ISE_DS\ISE\lib\nt64\plugins\Digilent\libCseDigilent\
      :
  5. Later, to program using the JTAG interface
    • Xilinx ISE » Tools » iMPACT
    • Double-click boundary scan
    • Output -> Cable setup
      • select “Digilent USB JTAG cable”
      • the port will show your port(s)
      • speed = select speed
      • click Ok. The speed field will become empty. click Ok once more.
    • Right-click boundry window, and select Initialize chain (with the microboard connected)
      • set the configuration file (.bit)
      • select as target device
    • Save the prj.
      :

A first circuit

I walked through the examples in the (old) book FPGA Prototyping with Verilog Examples. It targets Spartan 3, but still seems useful. In particular see chapter 2.6.1.

  1. Create a new design
    • Double-click the ISE icon on the desktop
    • File » New Project
      • location, working directory = ..
      • name = eq2
      • top level src type = HDL
      • Evaluation Dev Board = Avnet Spartan-6 LX9 MicroBoard
      • Synthesis tool = XST
      • Simulator = ISim
    • Project » New Source » Verilog Module
      • Enter port names
        • a input bus 1 0
        • b input bus 1 0
        • aeqb output
      • Use the text editor to enter the code eq2.v as shown below
        `timescale 1ns / 1ps
        
        module eq2(
             input [1:0] a,
             input [1:0] b,
             output aeqb
         );
        
          wire e0, e1; // internal signal declaration
        
          eq1 eq_bit0_unit(.i0(a[0]), .i1(b[0]), .eq(e0));
          eq1 eq_bit1_unit(.i0(a[1]), .i1(b[1]), .eq(e1));
        
          assign aeqb = e0 & e1; // a and b are equal if individual bits are equal
        endmodule
    • Project » New Source » Verilog Module
      • Enter port names
        • i0 input
        • i1 input
        • eq output
      • Use the text editor to enter the code eq1.v as shown below
        `timescale 1ns / 1ps
        module eq1(
            input i0,
            input i1,
            output eq
            );
        
          wire p0, p1; // internal signal declaration
        
          assign eq = p0 | p1 ;
          assign p0 = ~i0 & ~i1 ;
          assign p1 = i0 & i1 ;
        
        endmodule
    • Project » New Source » Implementation Constraints File
      • Enter the physical I/O pin assignments are user constraints.
      • Refer to the schematics, or hardware guide for details.
      • Use the text editor to enter the code eq2.ucf as shown below.
        CONFIG VCCAUX=3.3;
        NET a<0> LOC = B3 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-1
        NET a<1> LOC = A3 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-2
        NET b<0> LOC = B4 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-3
        NET b<1> LOC = A4 | IOSTANDARD = LVCMOS33 | PULLDOWN; #DIP switch-4
        NET aeqb LOC = P4 | IOSTANDARD = LVCMOS18;            #LED D2
    • Verify
      • Select the desired source file
      • In the process window (below), click the ‘+’ before “Synthesize – XST”
      • Double-click “Check Syntax”
      • The results will be shown in the transcript at the bottom
        :
  2. Synthesis
    • Generates a .bit file to be uploaded to the FPGA later
    • Select the top-level verilog file (has a little green square in the icon)
      • In the Process Window, double click “Generate Programming File”
      • The transcript at the bottom will show the results
      • Correct problems if needed
      • Check the design summary (Process Windows » Design Summary)
        :
  3. Create a test bench
    • Project » New Source » Verilog Test Fixture
    • name = eq2_test
    • associate with eq2
    • add the stimulus as shown in eq2_test.v below
      `timescale 1ns / 1ps
      
      module eq2_test;
      
        reg [1:0] a;  // inputs
        reg [1:0] b;
      
        wire aeqb;  // output
      
        // Instantiate the Unit Under Test (UUT)
        eq2 uut ( .a(a), 
                  .b(b), 
                  .aeqb(aeqb) );
      
        initial begin
          a = 0;  // initialize inputs
          b = 0;
      
          #100;  // wait 100 ns for global reset to finish
              
          // stimulus starts here
          a = 2'b00; b = 2'b00; #100 $display("%b", aeqb);
          a = 2'b01; b = 2'b00; #100 $display("%b", aeqb);
          a = 2'b01; b = 2'b11; #100 $display("%b", aeqb);
          a = 2'b10; b = 2'b10; #100 $display("%b", aeqb);
          a = 2'b10; b = 2'b00; #100 $display("%b", aeqb);
          a = 2'b11; b = 2'b11; #100 $display("%b", aeqb);
          a = 2'b11; b = 2'b01; #100 $display("%b", aeqb);
        end
      endmodule

      :

  4. Behavior Simulation
    • Xilinx ISE comes packages with the ISim simulator. It is straightforward to use and fine for basic test benches. Other choices are ModelSim (hard to install under Windows 10, in my case the install suddenly continued after >24 hours), Active-HDL, and the online tool edaplayground.com.
    • Design Window (top left) » Simulation radio-button. In the drop-down list below it, select “Behavioral” view.
      • Select the eq2_test.v file
      • In the Process Window, double-click the Simulate Behavior Model
        • Will give a Windows Security Alert for isimgui.exe. Allow it access.
      • Navigate the ISim window to verify functionality. Use the F7 to zoom out. We expect an output like:
        lx9-eq2-isim:
  5. Timing Simulation
    • In the Design Window (top left), select “Simulation”. In the drop down list below it, select “Post-Route”.
    • Select the eq2_test.v file
    • In the Process Window, double-click “Simulate Post-Place & Route Model”. This will reveal the timing delays as shown below
      lx9-eq2-isim-timing:
  6. Configure FPGA
    • Plug-in the USB type B connector from the LX9 microboard
    • Process Window » double click “Configure Target Device”
      • Before starting iMPACT, it will warn you that “No iMPACT project file exists”. Click OK to proceed.
      • Double-click “Boundary Scan”
      • Right-click in the right window, and select “Cable Setup”
        • Communication Mode = “Digilent USB JTAG cable”
        • Verify that port will show up
        • speed = select speed
        • click OK twice
      • Right-click in the right window, and select “Initialize chain”
        • assign the configuration file (.bit) created earlier
        • do not attached SPI or BPI PROM
        • click OK
        • right-click the Xilinx block, and select as “Set Target Device”
      • File » Save Project as “eq2” in the same directory as the source files.
        • Will tell you to “Set the new project file from the Configure Target Device process properties”. Don’t worry, it seems to do this automatically. Click OK to proceed.
      • Right-click the Xlilinx block, and select program.
        • This should report “Program Succeeded”
      • Close iMPACT, and save it once more on the way out.
        :
  7. Give it a spin
    • It is finally time to try the real FPGA board
      • Input is through the DIP switches (SW1) on the left of the FPGA.
      • Output is the red LED (D2) located just below the FPGA.
      • We expect the LED to be “on” when switch position 1 and 2 are identical to position 3 and 4.
    • If you prefer bigger switches, I suggest wiring up a breadboard to PMOD1 (J5) connectors.
      • Vcc and Ground are available on respectively pin 5 (or 11) and 6 (or 12).
      • Remember to modify the user constraints file accordingly. For reference, I attached a fairly complete user constraints file spartan6-lx9.ucf.

See Xilinx’ student area for more info

c’est tout

Implementation

Shows an implementation of the LC-3 instruction set in Verilog HDL. Includes test benches and simulation results.

Implementation

One dark Oregon winter afternoon, I said “Let’s build a micro processor”. What started as a noble thought became a rather intense but fun project.

This section describes the implementation of the LC-3 using a Field Programmable Logic Array. An FPGA is an array blocks with basic functionality such as Lookup table, a full adder and a flip-flop. For more information on FPGAs refer to the section Programmable Logic in the inquiry “How do computers do math?“.

The FPGA used to implement the LC-3 microprocessor is a Xilinx Spartan6, but others will fit equally well. My choice was inspired by the pricing of the development board and the fairly good free development tools. Other choices would be Altera for the FPGA, their IDE or Icarus Verilog for the synthesizer and simulator and GTKWave for the waveform viewer. Refer to the end of this article for links and references to introductory Verilog books.

Schematic

The top level schematic is shown below. The modules are defined using Verilog, an hardware description language (HDL) used to model digital logic.

LC3 schematic

This is my first Verilog implementation, please bear with me ..

State

State.v

Implementation of the LC-3 instruction set in Verilog, source file State.v:

 cCtrl,      // controller control signal
                input eREADY,           // external memory ready signal
                output wire pEn,        // update PC enable
                output wire fEn,        // fetch output enable
                output wire dEn,        // decode enable
                output wire [2:0] mOp,  // memory operation selector
                output wire rWe );      // register write enable

    `include "UpdatePC.vh"
    `include "Fetch.vh"
    `include "Decode.vh"
    `include "Registers.vh"
    `include "MemoryIF.vh"

    parameter [3:0] STATE_UPDATEPC = 4'd0,   // update program counter
                    STATE_FETCH    = 4'd1,   // fetch instruction
                    STATE_DECODE   = 4'd2,   // decode
                    STATE_ALU      = 4'd3,   // ALU
                    STATE_ADDRNPC  = 4'd4,   // calc tPC address
                    STATE_ADDRMEM  = 4'd5,   // calc memory address
                    STATE_INDMEM   = 4'd6,   // indirect memory address
                    STATE_RDMEM    = 4'd7,   // read memory
                    STATE_WRMEM    = 4'd8,   // write memory
                    STATE_WRREG    = 4'd9,   // write register
                    STATE_ILLEGAL  = 4'd15;  // illegal state

    parameter       EREADY_INA     = 1'b0,   // external memory not ready
                    EREADY_ACT     = 1'b1,   // external memory ready
                    EREADY_X       = 1'bx;

    wire [1:0] iType   = cCtrl[4:3];  // instruction type (00=alu, 01=ctrl, 10=mem)
    wire [1:0] maType  = cCtrl[2:1];  // memory access type (00=indaddr, 01=read, 02=write, 03=updreg)
    wire       indType = cCtrl[0];    // indirect memory access type

    reg [3:0] state;   // current state
    reg [3:0] nState;  // next state
    reg [6:0] out;     // current output signals
    reg [6:0] nOut;    // next output signals

    assign pEn = out[6];
    assign fEn = out[5];
    assign dEn = out[4];
    assign mOp = out[3:1];
    assign rWe = out[0];

        // the combinational logic

    always @(state, eREADY, iType, maType, indType, state, out)
        casex ({state, eREADY, iType, maType, indType})
        {STATE_UPDATEPC, EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_FETCH;    nOut = {PEN_0, FEN_1, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_FETCH,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_DECODE;   nOut = {PEN_0, FEN_0, DEN_1, MOP_NONE, RWE_0}; end
            {STATE_DECODE,   EREADY_X,   ITYPE_ALU, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ALU;      nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_DECODE,   EREADY_X,   ITYPE_CTL, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ADDRNPC;  nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_DECODE,   EREADY_X,   ITYPE_MEM, MATYPE_X,   INDTYPE_X}  : begin nState = STATE_ADDRMEM;  nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_IND, INDTYPE_X}  : begin nState = STATE_INDMEM;   nOut = {PEN_0, FEN_0, DEN_0, MOP_RD,   RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_RD,  INDTYPE_X}  : begin nState = STATE_RDMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_RD,   RWE_0}; end
        {STATE_INDMEM,   EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_RD} : begin nState = STATE_RDMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_RDI,  RWE_0}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_WR,  INDTYPE_X}  : begin nState = STATE_WRMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_WR,   RWE_0}; end
        {STATE_INDMEM,   EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_WR} : begin nState = STATE_WRMEM;    nOut = {PEN_0, FEN_0, DEN_0, MOP_WR,   RWE_0}; end
        {STATE_ALU,      EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_ADDRMEM,  EREADY_X,   ITYPE_X,   MATYPE_REG, INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_RDMEM,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_WRREG;    nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_1}; end
        {STATE_WRMEM,    EREADY_ACT, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_WRREG,    EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_ADDRNPC,  EREADY_X,   ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = STATE_UPDATEPC; nOut = {PEN_1, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        {STATE_FETCH,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state;          nOut = out;                                    end
        {STATE_INDMEM,   EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state;          nOut = out;                                    end
        {STATE_RDMEM,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state;          nOut = out;                                    end
        {STATE_WRMEM,    EREADY_INA, ITYPE_X,   MATYPE_X,   INDTYPE_X}  : begin nState = state;          nOut = out;                                    end
        default                                                         : begin nState = STATE_ILLEGAL;  nOut = {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0}; end
        endcase

        // the sequential logic

    always @(negedge clock, posedge reset)
    if (reset)
        begin
            state <= STATE_UPDATEPC;
            out <= {PEN_0, FEN_0, DEN_0, MOP_NONE, RWE_0};
        end
    else
    begin
        state <= nState;
        out <= nOut;
        end;
endmodule&#91;/code&#93;
</p>

<h3>
    Decode
</h3>

<h4>
    Decode.vh
</h4>

<p>
    Implementation of the LC-3 instruction set in Verilog, header file <code>Decode.vh</code>:
    parameter       DEN_0          = 1'b0,   // Decode enable
                DEN_1          = 1'b1;

parameter [1:0] ITYPE_ALU      = 2'b00,  // generalized instruction type
                ITYPE_CTL      = 2'b01,
                ITYPE_MEM      = 2'b10,
                ITYPE_HLT      = 2'b11,
                ITYPE_X        = 2'bxx;

parameter [1:0] MATYPE_IND     = 2'b00,  // generalized memory access type
                MATYPE_RD      = 2'b01,
                MATYPE_WR      = 2'b10,
                MATYPE_REG     = 2'b11,
                MATYPE_X       = 2'bxx;

parameter       INDTYPE_WR     = 1'b0,   // generalized memory indirection type
                INDTYPE_RD     = 1'b1,
                INDTYPE_X      = 1'bx;

Decode.v

Implementation of the LC-3 instruction set in Verilog, source file Decode.v:

module Decode( input clock,
                input reset,
                input en,                   // input enable
                input [15:0] eDIN,          // external memory data input
                input [2:0] psr,            // processor status register
                output [4:0] cCtrl,         // various control signals
                output [1:0] drSrc,         // selects what to write to DR
                output [2:0] uOp,           // selecta ALU operation
                output aOp,                 // selects Address operation
                output pNext,               // selects if PC should branch
                output [2:0] sr1ID,         // source register 1 ID
                output [2:0] sr2ID,         // source register 2 ID
                output [2:0] drID,          // destination register ID
                output wire [4:0] imm,      // lower 5 bits from IR value
                output wire [8:0] offset ); // lower 9 bits from IR value

    `include "ALU.vh"
    `include "Address.vh"
    `include "MemoryIF.vh"
    `include "DrMux.vh"
    `include "UpdatePC.vh"
    `include "Decode.vh"

    parameter [2:0] ID_X = 3'bxxx;

        // Instruction Register (ir)
        // read instruction from external memory bus (after Fetch initiated the bus cycle)

    reg [15:0] ir;
    assign imm    = ir[4:0];  // output the lower 5 bits
    assign offset = ir[8:0];  // output the lower 9 bits

    always @(posedge clock, posedge reset)
    if (reset)
        ir = 16'hffff;
        else
        if (en == DEN_1)
            ir = eDIN;

    parameter [3:0]           // opcodes for the instructions
        I_BR  = 4'b0000,
        I_ADD = 4'b0001,
        I_LD  = 4'b0010,
        I_ST  = 4'b0011,
        I_AND = 4'b0101,
        I_LDR = 4'b0110,
        I_STR = 4'b0111,
        I_NOT = 4'b1001,
        I_LDI = 4'b1010,
        I_STI = 4'b1011,
        I_JMP = 4'b1100,
        I_LEA = 4'b1110,
        I_HLT = 4'b1111;

    reg [20:0] ctl;           // current control signal bundle

        // untangle control signal bundle

    assign cCtrl = ctl[ 20:16 ];  // { iType, maType, indRd }
    assign uOp   = ctl[ 15:13 ];
    assign aOp   = ctl[    12 ];
    assign drSrc = ctl[ 11:10 ];
    assign pNext = ctl[     9 ];
    assign drID  = ctl[  8: 6 ];
    assign sr1ID = ctl[  5: 3 ];
    assign sr2ID = ctl[  2: 0 ];

        // combinational logic to determine control signals

    wire [2:0] uOpAddC = (ir[5]) ? UOP_ADDIMM : UOP_ADDREG;         // candidate for uOp in case of ADD instruction
    wire [2:0] uOpAndC = (ir[5]) ? UOP_ANDIMM : UOP_ANDREG;         // candidate for uOp in case of AND instruction
    wire pNextC        = |(ir[11:9] & psr) ? PNEXT_TPC : PNEXT_NPC; // candidate for pNext in case of BR instruction

    always @(ir[15:12], uOpAddC, uOpAndC, pNextC)
                    // State      State       State       ALU      Address  DrSource    UpdatePC   Registers RegistersRegisters
    case (ir[15:12])// iType      maType      indType     uOp      aOp      drSrc       pNext      drID      sr1ID    sr2ID
        I_ADD   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  uOpAddC, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6], ir[2:0] };
        I_AND   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  uOpAndC, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6], ir[2:0] };
        I_NOT   : ctl = {ITYPE_ALU, MATYPE_X,   INDTYPE_X,  UOP_NOT, AOP_X,   DRSRC_ALU,  PNEXT_NPC, ir[11:9], ir[8:6], ID_X    };
        I_BR    : ctl = {ITYPE_CTL, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_X,    pNextC,    ID_X,     ID_X,    ID_X    };
        I_JMP   : ctl = {ITYPE_CTL, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_X,    PNEXT_TPC, ID_X,     ir[8:6], ID_X    };
        I_LD    : ctl = {ITYPE_MEM, MATYPE_RD,  INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ID_X,    ID_X    };
        I_LDR   : ctl = {ITYPE_MEM, MATYPE_RD,  INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ir[8:6], ID_X    };
        I_LDI   : ctl = {ITYPE_MEM, MATYPE_IND, INDTYPE_RD, UOP_X,   AOP_NPC, DRSRC_MEM,  PNEXT_NPC, ir[11:9], ID_X,    ID_X    };
        I_LEA   : ctl = {ITYPE_MEM, MATYPE_REG, INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_ADDR, PNEXT_NPC, ir[11:9], ID_X,    ID_X    };
        I_ST    : ctl = {ITYPE_MEM, MATYPE_WR,  INDTYPE_X,  UOP_X,   AOP_NPC, DRSRC_X,    PNEXT_NPC, ID_X,     ID_X,    ir[11:9]};
        I_STR   : ctl = {ITYPE_MEM, MATYPE_WR,  INDTYPE_X,  UOP_X,   AOP_SR1, DRSRC_X,    PNEXT_NPC, ID_X,     ir[8:6], ir[11:9]};
        I_STI   : ctl = {ITYPE_MEM, MATYPE_IND, INDTYPE_WR, UOP_X,   AOP_NPC, DRSRC_X,    PNEXT_NPC, ID_X,     ID_X,    ir[11:9]};
        default : ctl = {ITYPE_HLT, MATYPE_X,   INDTYPE_X,  UOP_X,   AOP_X,   DRSRC_X,    PNEXT_X,   ID_X,     ID_X,    ID_X    };
    endcase 

endmodule

UpdatePC

UpdatePC.vh

Implementation of the LC-3 instruction set in Verilog, header file UpdatePC.vh:

parameter PNEXT_NPC  = 1'b0,  // UpdatePC branch signal
            PNEXT_TPC  = 1'b1,
            PNEXT_X    = 1'bx;

parameter PEN_0      = 1'b0,  // UpdatePC enable
            PEN_1      = 1'b1;

UpdatePC.v

Implementation of the LC-3 instruction set in Verilog, source file UpdatePC.v:

module UpdatePC( input clock,
                    input reset,
                    input en,                 // enable signal
                    input [15:0] tPC,         // target program counter
                    input pNext,              // if 1 then branch to tPC
                    output reg [15:0] pc,     // program counter
                    output reg [15:0] nPC );  // next program counter (pc+1)

    `include "UpdatePC.vh"

    wire [15:0] a = (pNext) ? tPC : nPC;     // if pNext==1, then jump to tPC
    wire [15:0] b = (en == PEN_1) ? a : pc;  // change PC only in "Update PC" state
    wire [15:0] c = b + 1'b1;                // use carry input

    always @(posedge clock, posedge reset)
    if (reset)
        begin
            pc  <= 16'h3000;
            nPC <= 16'h3001;
        end
    else
        begin
        pc  <= b;
        nPC <= c;
        end;

endmodule&#91;/code&#93;
</p>

<h3>
    Fetch
</h3>

<h4>
    Fetch.vh
</h4>

<p>
    Implementation of the LC-3 instruction set in Verilog, header file <code>Fetch.vh</code>:
    parameter FEN_0 = 1'b0,  // fetch enable
            FEN_1 = 1'b1;

Fetch.v

Implementation of the LC-3 instruction set in Verilog, source file Fetch.v:

module Fetch( input en,                 // output enable
                input [15:0] pc,          // program counter
                    output reg iBR,           // internal memory address lines
                    output reg [15:0] iADDR,  // internal memory address lines
                output reg iWEA );        // internal memory write enable

    `include "Fetch.vh"

    always @(en, pc)
    begin
        iBR   <= ( en == FEN_1 ) ? 1'b1 : 1'b0;
        iADDR <= ( en == FEN_1 ) ? pc : 16'hxxxx;
        iWEA  <= ( en == FEN_1 ) ? 1'b0 : 1'bx;
    end

endmodule&#91;/code&#93;
</p>

<h3>
    Registers
</h3>

<h4>
    Registers.vh
</h4>

<p>
    Implementation of the LC-3 instruction set in Verilog, header file <code>Registers.vh</code>:
    parameter       RWE_0 = 1'b0,           // register write enable
                RWE_1 = 1'b1;

parameter [2:0] PSR_POSITIVE = 3'b001,  // processor status register bits
                PSR_ZERO     = 3'b010,  //   should match BR instruction
                PSR_NEGATIVE = 3'b100;

Registers.v

Implementation of the LC-3 instruction set in Verilog, source file Registers.v:

module Registers( input clock,
                  input reset,
                  input we,                // write enable
                  input [2:0] sr1ID,       // source register 1 ID
                  input [2:0] sr2ID,       // source register 2 ID
                  input [2:0] drID,        // destination register ID
                  input [15:0] dr,         // destination register value
                  output reg [15:0] sr1,   // source register 1 value
                  output reg [15:0] sr2,   // source register 2 value
                  output reg [2:0] psr );  // processor status register

    `include "Registers.vh"

    reg [3:0] id;
    reg [15:0] gpr [0:7];     // general purpose registers

        // write the destination register value, and update Process Status Register (psr)

    always @(posedge clock, posedge reset)
    if (reset)
        for (id = 0; id < 7; id = id + 1)  // initial all registers to 0
        gpr&#91; id &#93; <= 16'h0000;
    else
        if (we == RWE_1)          // when enabled by the FSM
            begin
            if (dr&#91; 15 &#93;)         // update processor status register (neg,zero,pos)
            psr <= PSR_NEGATIVE;
            else if (|dr)
            psr <= PSR_POSITIVE;
            else
            psr <= PSR_ZERO;

                gpr&#91; drID &#93; <= dr;     // write the value dr to the register identified by drID
        end

        // output the value of the register identified by "sr1ID" on output "sr1"
        // output the value of the register identified by "sr2ID" on output "sr2"

    always @(sr1ID, sr2ID, gpr&#91; sr1ID &#93;, gpr&#91; sr2ID &#93;)
    begin
        sr1 = gpr&#91; sr1ID &#93;;
        sr2 = gpr&#91; sr2ID &#93;;
    end

endmodule&#91;/code&#93;
</p>

<h3>
    ALU
</h3>
                    
<h4>
    ALU.vh
</h4>

<p>
    parameter [2:0] UOP_ADDREG = 3'b000,  // ALU operation
                UOP_ADDIMM = 3'b001,
                UOP_ANDREG = 3'b010,
                UOP_ANDIMM = 3'b011,
                UOP_NOT    = 3'b100,
                UOP_X      = 3'bxxx;

ALU.v

module ALU( input [2:0] uOp,          // operation selector
            input [15:0] sr1,         // source register 1 value (SR1)
            input [15:0] sr2,         // source register 2 value (SR2)
            input [4:0] imm,          // lower 5 bits from instruction register
                output reg [15:0] uOut ); // result of ALU operation

    `include "ALU.vh"

    wire [15:0] imm5 = ({ {11{imm[4]}}, imm[4:0] });  // sign extend to 16 bits

    always @(uOp or sr1 or sr2 or imm5)
    casex (uOp)
        3'b000: uOut = sr1 + sr2;   // ADD Mode 0
        3'b001: uOut = sr1 + imm5;  // ADD Mode 1
        3'b010: uOut = sr1 & sr2;   // AND Mode 0
        3'b011: uOut = sr1 & imm5;  // AND Mode 1
        3'b1xx: uOut = ~(sr1);      // NOT
    endcase
endmodule

Address

Address.vh

parameter AOP_SR1 = 1'b0,  // address operation
          AOP_NPC = 1'b1,
          AOP_X   = 1'bx;

Address.v

module Address( input aOp,                // operation selector
                input [15:0] sr1,         // value source register 1
                input [15:0] nPC,         // next program counter (PC), always PC+1
                input [8:0] offset,       // lower 9 bits from instruction register
                output reg [15:0] aOut ); // target program counter

    `include "Address.vh"

    wire [15:0] offset6 = ({{10{offset[5]}}, offset[5:0]});  // sign extended the 6-bit offset
    wire [15:0] offset9 = ({{ 7{offset[8]}}, offset[8:0]});  // sign extended the 9-bit offset

    always @(aOp or sr1 or nPC or offset6 or offset9)
    case (aOp)
        AOP_SR1 : aOut = sr1 + offset6;  // register + offset
        AOP_NPC : aOut = nPC + offset9;  // next PC  + offset
    endcase

endmodule

MemoryIF

MemoryIF.vh

parameter [2:0] MOP_NONE = 3'b000,  // MemoryIF operation
                MOP_RD   = 3'b100,
                MOP_RDI  = 3'b101,
                MOP_WR   = 3'b110,
                MOP_WRI  = 3'b111;

MemoryIF.v

module MemoryIF( input [2:0] mOp,         // memory operation selector
                 input [15:0] sr2,        // source register 2 value
                 input [15:0] addr,       // address for read or write
                 input [15:0] eDIN,       // external memory data input
                 output reg iBR,          // internal bus request
                 output reg [15:0] iADDR, // internal memory address lines
                 output tri [15:0] eDOUT, // internal memory data output
                 output reg iWEA );       // internal memory write enable

    `include "MemoryIF.vh"

    reg [15:0] eDOUTr;
    assign eDOUT = eDOUTr;

    always @(mOp, sr2, addr, eDIN)
    case (mOp)
        MOP_RD  : begin iBR=1; iWEA = 1'b0; iADDR = addr;     eDOUTr = 16'hzzzz; end
        MOP_RDI : begin iBR=1; iWEA = 1'b0; iADDR = eDIN;     eDOUTr = 16'hzzzz; end
        MOP_WR  : begin iBR=1; iWEA = 1'b1; iADDR = addr;     eDOUTr = sr2;      end
        MOP_WRI : begin iBR=1; iWEA = 1'b1; iADDR = eDIN;     eDOUTr = sr2;      end
        default : begin iBR=0; iWEA = 1'bx; iADDR = 16'hxxxx; eDOUTr = 16'hzzzz; end
    endcase

endmodule

DrMux

DrMux.vh

parameter [1:0] DRSRC_ALU  = 2'b00,  // destination register source selector
                DRSRC_MEM  = 2'b01,
                DRSRC_ADDR = 2'b10,
                DRSRC_X    = 2'bxx;

DrMux.v

module DrMux( input [1:0] drSrc,       // multiplexor selector
              input [15:0] eDIN,       // external memory data input
              input [15:0] addr,       // effective memory address
              input [15:0] uOut,       // result from ALU
              output reg [15:0] dr );  // data that will be stored in DR

    `include "DrMux.vh"

    always @(drSrc or uOut or eDIN or addr)
    case (drSrc)
        DRSRC_ALU  : dr = uOut;
        DRSRC_MEM  : dr = eDIN;
        DRSRC_ADDR : dr = addr;
        default    : dr = 16'hxxxx;
    endcase

endmodule:

BusDriver

BusDriver.v

module BusDriver( input br0,                // input 0, bus request
                    input [15:0] iADDR0,      // input 0, internal memory address lines
                        input iWEA0,              // input 0, internal memory write enable
                    input br1,                // input 1, bus request
                    input [15:0] iADDR1,      // input 1, internal memory address lines
                        input iWEA1,              // input 1, internal memory write enable
                    output tri [15:0] eADDR,  // external memory address lines
                        output tri  eWEA );       // external memory write enable

    assign eWEA  = br1 ? iWEA1 :
                    br0 ? iWEA0 : 1'bz;

    assign eADDR = br1 ? iADDR1 :
                    br0 ? iADDR0 : 16'hzzzz;

endmodule

Functional simulation

The functionality of our microprocessor can be tested by building a test bench. The bench will supply the clock signal and reset pulse and simulate a random access memory (RAM) containing the test program. The program is written using in a assembly language and compiled using LC3Edit.

Test program

Exercises a variety of instructions:

  • memory read
  • alu
  • memory write
  • control instructions

Written in assembly language

own work
LC3 memory asm

LC3Edit compiles this into the object file:

own work
LC3 memory obj

As part of the compilation, LC3Edit also creates a .hex file. The contents of this file can be tweaked into a .coe file to be preloaded in the test bench memory.

own work
LC3 Memory coe

Memory

The random access memory (RAM) is created using Xilinx IP's Block Memory Generator 6.2. The following parameters are used:

  • native i/f, single port ram, no byte write enable, minimum area algorithm, width 16, depth 4096, write first, use ENA pin, no output registers, no RSTA pin, enable all warnings. Initialize memory from the .coe file.

Test bench and clock/reset signals

Generate a 50 MHz symmetric clock.

Integrate the parts into a test bench using Verilog. `timescale 1ns / 1ps module SimpleLC3_SimpleLC3_sch_tb(); reg clock; // clock (generated by test fixture) reg reset; // reset (generated by test fixture) wire [15:0] eADDR; // external address (from LC3 to memory) wire [15:0] eDIN; // external data (from memory to LC3) wire [15:0] eDOUT; // external data (from LC3 to memory) wire eWEA; // external write(~read) enable (from LC3 to memory) // Instantiate the Unit Under Test SimpleLC3 UUT ( .eDOUT(eDOUT), .eWEA(eWEA), .clock(clock), .eREADY(1'b1), // ready (always 1, for now) .eDIN(eDIN), .reset(reset), .eADDR(eADDR) ); // Instantiate the Memory, created using Xilinx IP's Block Memory Generator 6.2: // Initialize from memory.coe, created from compiling memory.asm using LC3Edit. memory RAM( .clka(clock), .ena(eENA), .wea(eWEA), .addra(eADDR[11:0]), .dina(eDOUT), .douta(eDIN) ); wire eENA = |(eADDR[15:12] == 4'h3); // memory is at h3xxx initial begin clock = 0; reset = 0; #15 reset = 1; // wait for global reset to finish #22 reset = 0; end always #10 clock <= ~clock; // 20 ns clock period (50 MHz) endmodule;[/code]

Simulation results

For the functional simulation we use ISim that comes bundled with the Xilinx IDE.

The simulation needs to be ran for 1600 ns.

Waveform diagrams are shown below (click to enlarge)

LC3 Memory load
LC3 Memory load
LC3 ALU operations
LC3 ALU operations
LC3 Memory store
LC3 Memory store
LC3 Control instructions
LC3 Control instructions

Timing simulation

The free Xilinx IDE doesn't support timing simulations. Instead we will use Icarus Verilog for the synthesis and simulation, GTKWave for viewing the generated waveforms, and Emacs verilog-mode for editing. We will run them natively under Linux. For those interested, Windows binaries are available from bleyer.org.

This concludes the "Implementation of the LC-3 instruction set in Verilog".

Design

Presents a CPU Design for LC-3 instruction set, that we later implement using Verilog HDL. The illustrations help visualize the design. The instruction set is based on the book Introduction to Computer Systems by Patt and Partel. For this text we push the simplicity of this little microprocessor (LC-3) even further as described in Instruction Set.

Design

The microprocessor consists of a Data Path and a Control Unit. Together they implement the various instruction phases.

This section describes an architecture for the LC-3. It aims at staying true to the von Neumann architecture and instruction cycle names. However, here we assume the program counter and instruction register are in the data path.

Data Path

The schematic below shows the Data Path.

own work
LC3 data path

We use the following conventions

  • The shaded blocks are modules that implement various functionality. The module names have been chosen to reflect the instruction phases.
  • Signals connect the blocks. A signal can be a single wire, or a collection of wires such as the 16 bits that represent the value of the program counter. Signal names are chosen to overlap with operand names where possible.
  • The microprocessor connects to an external memory through the external interface.

Modules

Module Description
UpdatePC Maintains the program counter, pc.
Fetch Initiates the bus cycle, to read the instruction pointed to by pc.
Decode Reads the instruction from the memory bus and extracts its operands.
Registers Maintains the register values and processor status register.
ALU Performs arithmetic and logical operations.
Address Calculates memory address for memory or control instructions.
MemoryIF Initiates the external memory bus cycle to read or write data.
DrMux Destination register multiplexor, selects the value that will be written to the destination register.
BusDriver Simple arbiter for memory read requests from Fetch and MemoryIF.

Signals

Group Signal Description
Program counters pc Program Counter
nPC Next program counter (always has the value pc+1)
tPC Target program counter, for JMP / BR*.
Operands sr1ID Source register 1 identifier.
Also used as baseRID for JMP / LDR / STR
sr2ID Source register 2 identifier.
Also used as srID for ST / STI / STR.
imm Immediate value
offset Memory address offset
Register values sr1 Value of the register identified by signal sr1ID
sr2 Value of the register identified by sr2ID
dr Value written to the register identified by drID
psr Value of the processor status register
Intermediate values uOut Result of the ALU operation
aOut=addr=tPC Result of the address calculation
External bus eADDR Memory address
eDIN Instruction/data being read from memory
eDOUT Data being written from memory
eWEA Write enable signal going to memory.
Value 0 for read, 1 for write.
Internal bus iBR0, IBR1 Internal bus request signals
iADDR0, iADDR1 Internal memory addresses
iWEA0, iWEA1 Internal write enable signals

Examples

Read memory

Assume: the instruction at address 3000 is 201F.

Assigning the label LDv to memory location 3020, this instruction decodes to

Address Value Label Mnemonic
x3000 x201F LD r0, LDv
x3020 x1234 LDv

Issuing a reset, triggers the following sequence of events:

# Module Action Signals
1. UpdatePC Resets the program counter to its initial value pc=3000, nPC=3001
2. Fetch Starts a read cycle for the instruction br0=1, iADDR0=3000, iWEA0=0
3. BusDriver Forwards the read cycle to the external memory bus eADDR=3000, eWEA=0
4. ExtMemory Responds with the instruction eDIN=201f
5. Decode Extracts the operands offset=1f, drID=0
6. Address Adds the offset to nPC addr=3020
7. MemoryIF Starts a read cycle for the data iBR1=1, iADDR1=3020, iWEA1=0
8. BusDriver Forwards the read cycle to the external memory bus eADDR=3020, eWEA=0
9. ExtMemory Responds with the data eDIN=1234
10. DrMux Selects the eDIN input dr=1234
11. Registers Writes the value dr to the register identified by drID

ALU operation

Assume: pc=3003, the register R0=1234 and R1=4321. The instruction at the next address 3004 is 1801.

This instruction decodes to

Address Value Label Mnemonic
x3004 x1801 ADD R4, R0, R1

The following sequence of events will happen:

.
# Module Action Signals
1. UpdatePC Increments the program counter pc=3004
2. Fetch Starts a read cycle for the instruction iBR0=1, iADDR0=3004, iWEA0=0
3. BusDriver Forwards the read cycle to the external memory bus eADDR=3004, eWEA=0
4. ExtMemory Responds with the instruction eDIN=1801
5. Decode Extracts the operands sr1ID=0, sr2ID=1, drID=4
6. Registers Supplies the values for the registers identified by sr1ID and sr2ID sr1=1234, sr2=4321
7. ALU Calculates the sum of sr1 and sr2 uOut=5555
8. DrMux Selects the uOut input. dr=5555
9. Registers Writes the value dr to the register identified by drID

Write memory

Assume: pc=3007, register R4=AAAA and the label STIa refers to data address 3024 containing the value 3028. The instruction at the next address 3008 is B81D.

This instruction decodes to

Address Value Label Mnemonic
x3008 xB81D STI R4, STIa
x3024 x3028 STIa
x3028 xBAD0

The following sequence of events will happen:

.
# Module Action Signals
1. UpdatePC Increments the program counter pc=3008, nPC=3009
2. Fetch Starts a read cycle for the instruction. iBR0=1, iADDR0=3008, iWEA0=0
3. Bus driver Forwards the read cycle to the external memory bus. eADDR=3008, eWEA=0
4. ExtMemory Responds with the instruction. eDIN=b81f
5. Decode Extracts the operands (sr2ID represents the SR operand) sr2ID=4, offset=1d
6. Registers Supplies the value for the register identified by sr2ID sr2=aaaa
7. Address Adds the offset to nPC addr=3024
8. MemoryIF Starts a read cycle to retrieve the address where to store the value. iBR1=1, iADDR1=3024, iWEA1=0
9. BusDriver Forwards the read cycle to the external memory bus. eADDR=3024, eWEA=0
10. ExtMemory Responds with the value eDIN=3028
11. MemoryIF Starts a write cycle to write the value of register R4 to address 3028 iBR1=1, iADDR1=3028, iWEA1=1
12. BusDriver Forwards the write cycle to the external memory bus. eADDR=3028, eWEA=1

Control unit

Instructions can be broken up into micro instructions. These can be implemented using a finite state machine (FSM), where each state corresponds to one micro instruction.

The finite state machine can be visualized as shown in the figure below.

  • circles, represent the states identified by a unique number and name.
  • double circle, represents the initial state.
  • arrows, represent state transitions. Labels represent the condition that must be met for the transition to occur.
  • shading, is used to identify the implementation modules.
  • eREADY, indicates that the external memory finished a read or write operation.
  • iType, maType, indType refer to the generalized instruction types generated by Decoder.

State diagram

own work
LC3 finite state machine

Details

  • Policies:
    • State transitions, are only possible during the falling edge of the clock signal (from 1 to 0);
    • Outputs, to the external memory interface, are driven in response to state transitions;
    • Inputs, from the external memory interface, are sampled on the rising edge of the clock signal (from 0 to 1);
    • Control signals, change only during the falling edge of the clock signal to minimize glitches.
  • Each state:
    • depends on both input signals and the previous state’
    • generates control signals control signals for the data path (with the help of the Decode module).
  • The control unit consists of two modules:
    • State, implements the state machine, and generates state specific control signals.
    • Decode, generalizes the instruction for the state machine, and generates state independent control signals.

Schematic for the control unit

own work
LC3 control unit

The next section describes the signals for the control unit in the CPU Design for LC-3 instruction set.

Signals for the control unit

Group Signal Description
External interface eREADY==1 Indicates that the external memory finished a read or write operation.
clock External supplied clock
Internal to the State module state Current state
nState Next state as determined by the combinational logic
Generalized instruction types (bundled into cCtrl) iType Instruction type
maType Memory access type
indType Indirect memory access type
Data path control pNext Signals UpdatePC to change the program counter to tPC
pEn Enables UpdatePC to change the program counter.
fEn Enable Fetch to start external memory bus cycle to read the instruction.
dEn Enables Decode to read the instruction from the external memory bus.
rWe Enables Registers to store the value of dr in the register identified by drID.
uOp Chooses the operation and inputs of the ALU
aOp Chooses the operation and inputs of the Address calculation.
mOp Chooses the memory operation to be performed by MemoryIF.
drSrc Selects the destination register source input on DrMux

The next section gives a detailed description of the modules for the CPU Design for LC-3 instruction set.

Modules (detailed description)

Module Description
State Generates the state specific control signals for each micro instruction being executed. Refer to the signals described above for details.
UpdatePC Updates the program counter, pc, at the end of each instruction cycle. The new value is:
  • 3000 if reset is asserted, or
  • the value of the tPC input, when a JMP or BR* instruction was executed (and its condition was met), or
  • otherwise, the previous value of pc+1.
Fetch
  • Initiates the external bus cycle (iBR0, iADDR0, iWEA0) to read the instruction from the memory location pointed to by pc.
  • The control unit will maintain this state until the external memory reports that the data is available (eREADY).
Decode
  • Finishes the external bus cycle by reading the instruction from the external memory bus (eDIN).
  • Decodes the instruction:
    • Based on the opcode (ir[15:11], it generalizes the instruction type for the State module (cCtrl).
    • Based on the operands (ir[10:0]), it configures the data path using state independent control signals:
      • For ALU instructions, uOp, sr1ID, sr2ID, drSrc, drID
      • For memory instructions, aOp, sr1ID (BaseR for LDR / STR), sr2ID (sr for ST / STI / STR), offset, drID
      • For control instructions, pNext, sr1 (BaseR for JMP).
Registers
  • Maintains the general purpose register (R0..R7).
  • Supplies the values for the registers identified by sr1ID, sr2ID.
  • Updates the register specified by drID to the value dr when rWe is asserted.
ALU
  • The input uOp selects both the operation type and inputs.
    • For ADD do if ir[5]==1 then uOut=sr1+imm5, else uOut=sr1+sr2.
    • For AND do if ir[5]==1 then uOut=sr1&imm5, else uOut=sr1&sr2.
    • For NOT do uOut=~sr1.
Address
  • Input aOp selects both the calculation type and inputs.
    • For BR* do aOut=nPC+offset9.
    • For LD / LDI / LEA / ST / STI, do aOut=nPC+offset9.
    • For JMP / LDR / STR, do aOut=sr1+offset6.
  • Note that aOut is connected to the addr input on MemoryIF, and the tPC input on FetchPC.
MemoryIF
  • Input mOp selects the memory access mode and inputs.
    • For LDI / STI, under the direction of the Control Unit, it first initiates a memory read cycle for addr (aOut). The Control Unit will maintain this state until the external memory reports that the data is available (eREADY). It then takes the value read from memory (eDIN), and
      • for LDI, it initiates a read cycle for address eDIN;
      • for STI, it initiates a write cycle to write the value sr2 to address eDIN.
    • For the other instructions, it takes the address, and
      • for LD / LDR, read the value from addr (aOut);
      • for LEA, do nothing;
      • for ST / STR, write the value sr2 to addr (aOut).
    • The control unit will maintain this state until the external memory reports that the data is available (eREADY).
DrMux*
  • Input drSrc selects the value that will be written to the destination register.
    • For ADD / AND / NOT do forward uOut to dr.
    • For LD / LDR / LDI do forward deign to dr.
    • For LEA do forward aOut to dr.

*) DrMux is an abbreviation for Destination Register Multiplexor.

To continue this CPU Design for LC-3 instruction set, read about its implementation on the next page.

Instruction set

Introduces a simplified LC-3 instruction set, that we later will design a CPU for and implement in Verilog HDL.

Instruction Set

The Instruction Set Architecture (ISA) specifies all the information to write a program in machine language. It contains:

  • Memory organization, specifies the address maps; how many bits per location;
  • Register set, specifies the size of the internal registers; how many registers; and how they can be used;
  • Instruction set, specifies the opcodes; operands; data types; and addressing modes

Simplicity rules

The book Introduction to Computer Systems by Patt and Partel, introduces an hypothetical microprocessor called LC-3. For this text we push the simplicity of this little computer (LC-3) even further by:

  • not supporting subroutine calls, JSR JSRR RET
  • not supporting interrupt handling, RTI TRAP
  • not supporting overflow detection in arithmetic operations
  • not validating the Instruction encoding
  • replacing the TRAP 0, with a simple HALT instruction.

Implementing this very basic Instruction Set helps us understand the inner workings of a microprocessor.

With the exception of these simplifications, the Instruction Set Architecture (ISA) is specified in the book “Introduction to Computer Systems“. The following sections summarize this ISA. For more details, refer to Appendix A.3 of the book.

Overview

  • Memory organization:
    • 16-bit addresses; word addressable only,
    • 16-bit memory words.
  • Memory map
    • User programs start at memory location 3000 hex, and may extend to FDFF.
  • Bit numbering
    • Bits are numbered from right (least significant bit) to left (most significant bit), starting with bit 0.
  • Registers
    • A 16-bit program counter (PC), contains the address of the next instruction.
    • Eight 16-bit general purpose registers, numbered 000 .. 111 binary, for register R0 .. R7.
    • A 3-bit processor status register (PSR), that is updated when an instructions writes to a register.
      • psr[2]==1, when the 2’s complement value is negative (n).
      • psr[1]==1, when the 2’s complement value is zero (z).
      • psr[0]==1, when the 2’s complement value is positive (p).
  • Instructions
    • 16-bit instructions, RISC (all instructions the same size).
      • the opcode, is encoded in the the 4 most significant bits of the instruction (bit 15..12).
      • the operands, are encoded in the remaining 12 bits of the instruction.
    • ALU performs ADD AND and NOT operations on 16-bit words.

Instructions

Operand conventions

As mentioned above, from the 16 bit instruction, only 12 bits are available for the operands. This implies that 16-bit data values or memory addresses have to be specified indirectly. For instance by referring to a value in a register.

Addressing modes:

  • PC relative, the address is calculated by adding an offset to the incremented program counter, pc.
  • Register relative, address is read from a register.
  • Indirect, address is read from a memory location who”s address is calculated by adding an offset to the incremented program counter.
  • Load effective address, address is calculated by adding an offset to the incremented program counter. The address itself (not its value) is stored in a register.

The table below shows the conventions used in describing the instructions.

Operand Description
srID, sr1ID, sr2ID Source Register Identifiers (000..111 for R0..R7)
drID Destination Register Identifier (000..111 for R0..R7)
baseRID Base Register Identifier (000..111 for R0..R7)
sr, sr1, sr2 16-bit Source Register value
dr 16-bit Destination Register value
baseR Base Register value, used together with 2’s complement offset to calculate memory address.
imm5 5-bit immediate value as 2’s complement integer
mem[address] Contents of memory at the given address
offset6 6-bit value as 2’s complement integer
offset9 9-bit value as 2’s complement integer
SX Sign-extend, by replicating the most significant bit as many times as necessary to extend to the word size of 16 bits.
Conventions

ALU instructions

There are two variations of the ADD and AND instructions. The difference is in bit 5 of the instruction word. One takes the second argument from sr2, the other takes it from the immediate value imm5.

Instruction types

Opcode Name Assembly Operation
ADD Addition ADD DR, SR1, SR2 dr = sr1 + sr2
ADD DR, SR1, imm5 dr = sr1 + SX(imm5)
AND Logical AND AND DR, SR1, SR2 dr = sr1 & sr2
AND DR, SR1, imm5 dr = sr1 & SX(imm5)
NOT Logical NOT NOT DR, SR dr = ~sr

Instruction encoding

Opcode 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
ADD 0 0 0 1 drID sr1ID 0 0 0 sr2ID
0 0 0 1 drID sr1ID 1 imm5
AND 0 1 0 1 drID sr1ID 0 0 0 sr2ID
0 1 0 1 drID sr1ID 1 imm5
NOT 1 0 0 1 drID srID 1 1 1 1 1 1

Memory instructions

Instruction types

Opcode Name Assembly Operation
LD Load LD DR, label dr = mem[pc + SX(offset9)]
LDR Load Register LDR DR, BaseR, offset6 dr = mem[baseR + SX(offset6)]
LDI Load Indirect LDI DR, label dr = mem[mem[pc + SX(offset9)]]
LEA Load Eff. Addr. LEA DR, target dr = pc + SX(offset9)
ST Store ST SR, label mem[pc + SX(offset9)] = sr
STR Store Register STR SR, BaseR, offset6 mem[baseR + SX(offset6)] = sr
STI Store Indirect STI SR, label mem[mem[pc + SX(offset9)]] = sr

Instruction encoding

opcode 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
LD 0 0 1 0 drID offset9
LDR 0 1 1 0 drID baseRID offset6
LDI 1 0 1 0 drID offset9
LEA 1 1 1 0 drID offset9
ST 0 0 1 1 srID offset9
STR 0 1 1 1 srID baseRID offset6
STI 1 0 1 1 srID offset9

Control instructions

Instruction types

Opcode Name Assembly Operation
BR* Branch BR* label if (condition*) pc = pc + SX(offset9)
JMP Jump JMP BaseR pc = baseR
HALT Halt HALT stop program execution (simplified TRAP 0)

*) The assembler instruction for BR* can be either

  • BRn label, test for state bit n
  • BRz label, test for state bit z
  • BRn label, test for state bit p
  • BRzp label, test for state bits z and p
  • BRnp label, test for state bits n and p
  • BRnz label, test for state bits n and z
  • BRnzp label, test for state bits n, z and p

Instruction encoding

opcode 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
BR* 0 0 0 0 n z p offset9
JMP 1 1 0 0 0 0 0 baseRID 0 0 0 0 0 0
HALT 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 1

This article “Simplified LC-3 Instruction set” continues with Design on the next page.

Copyright © 1996-2022 Coert Vonk, All Rights Reserved