Message protocol on FPGA

This continues the third part of Math Talk. This page shows a master implementation of the message protocol described earlier.

Messages Exchange with FPGA as Slave

The implementation builds onto the Byte Module code shown earlier. We will start by explaining how to pass multidimensional arrays through ports.

Registers

On Altera, we can the multidimensional arrays available in system verilog HDL

wire [nrRWregs+nrROregs-1:0] [31:0] registers;

The implementation is slightly more complicated on Xilinx, because Verilog 2001 doesn’t allow multidimensional arrays to be used as inputs or outputs. Instead, we work around this by flatten the 2D registers array into two vectors. One for input, and one for the output ports as shown in the code fragments below.

Flatten the 2-dimensional array, registers, into vectors rwRegs1D and roRegs1D genvar nn; wire [31:0] roRegs2D[0:nrROregs-1]; for ( nn = 0; nn < nrRWregs; nn = nn + 1) begin :nnRW assign rwRegs1D[32*nn+31:32*nn] = registers[nn]; // flatten end for ( nn = 0; nn < nrROregs; nn = nn + 1) begin :nnRO assign roRegs2D[nn] = roRegs1D[32*nn+31:32*nn]; // inflate end[/code]

Inflate the vectors, rwRegs1D and roRegs1D, into a 2-dimensional array registers.

wire [0:31] registers[0:nrRWregs+nrROregs-1]; genvar nn; for ( nn = 0; nn < nrRWregs; nn = nn + 1 ) begin :nnRW assign registers[nn] = rwRegs1D[32*nn+31:32*nn]; end for ( nn = 0; nn < nrROregs; nn = nn + 1 ) begin :nnRO assign roRegs1D[32*nn+31:32*nn] = registers[nn+nrRWregs]; end[/code]

Timing

The timing diagram below shows the relation between the different signals at the message level on Xilinx.

Timing for Xilinx implementation

The Altera implementation is more optimized, as it needs to run at 200 MHz. The gate level simulation is shown below;

Timing for Altera implementation

Finite State Machine

This message module converts the bytes into messages and visa versa. The protocol is implemented using a state machine with 4 states:

  • Idle (0)
  • Transmit status (1), transmits 8-bit status to master
  • Transmit register value (2) , transmits a 32-bit register value to the master
  • Receive register value (3), receives a 32-bit register value from the master

An additional state-like variable, byteId, is used to keeps track of what byte to transmit or receive.

SPI Finite State Machine

Sources

The complete project including constraints and test bench is available through

Verification

To verify the implementation, run the test bench (spi_msg_tb.v) using gate level simulation. This test bench will monitor the communication and report errors when found. In the real world, we connect the Arduino SPI Master that acts just like the test bench.

This article introduced SPI as a protocol and expanded it to exchange messages.

Message protocol on Arduino

Implements the SPI byte protocol on Arduino to exchange bytes with a FPGA. Written in C for Intel Arduino 101. This page continues the protocol description of the Math Talk series. This short section shows an Arduino implementation of a SPI master. The Arduino can be either an Arduino 101 or an Arduino UNO R3 that has been modified to run on 3.3V as described in the hardware section.

Arduino as Master

The Arduino is blessed with a support library for the serial peripheral interface. This greatly aids the implementation. The code shown below was tested with both types of Arduino. For the slave we used an Altera or Xilinx based FPGA implementation as described on the next page. Refer to the first part of this article for details about the physical connection. In particular, once more, please read the part about 3.3V versus 5V when using an Arduino UNO R3.

Sources

The SPI library makes it very straightforward to implement a SPI master. You can clone this project from:

The Arduino sends an alternating pattern of 0xAA and 0x55 to the FPGA. On the FPGA, LED[0] will be on when it receives 0xAA. Consequentially it will blink with 10% duty cycle. The FPGA always returns 0x55, what is displayed on the serial port.

Traces

Logic analyzer traces of the Arduino communicating with the FPGA. We see the SS going active, and the FPGA sending a byte over MISO to the Arduino. The data is sampled on the rising edge of SCLK.

SPI exchange of a byte

Zoomed out, we see the Arduino receiving three bytes from the FPGA.

SPI exchange of three bytes, zoomed out

Following this “SPI byte protocol on Arduino”, up next is: Byte Exchange with a FPGA as Slave.

Message protocol

Specifies the SPI byte protocol. We use this to exchange bytes between the Arduino microcontroller and a FPGA. This is the third part of Math Talk. In this part we describe the protocol used to transfer bytes between the microcontroller and FPGA.

Bytes Exchange Protocol

With the two devices physically connected, we need a protocol to transfer data. We chose the Serial Peripheral Interface (SPI), a lightweight protocol to connect one master to one or more slaves.

Master/slave

The SPI bus is controlled by a master device (typically a microcontroller) that orchestrates the bus access. The master generates the control signals and regulates the data flow. The illustration below shows a master with three slaves. The pinout for SCLK, MOSI, MISO and SS can be found on the previous page. The master uses the Slave Select (SS) signal to select the slave.

SPI master with three slaves

Parameters

SPI is also a protocol with many degrees of freedom. It is important that the master and slave agree on the voltage levels and maximum clock frequency. The SPI clock polarity (CPOL) and clock phase (CPHA) introduce four more degrees of freedom as shown in the table below.

SPI parameters
Mode CPOL CPHA clock idle data driven data latched
0 0 0 low falling edge rising edge
1 0 1 low rising edge falling edge
2 1 0 high rising edge falling edge
3 1 1 high falling edge rising edge

For this article we assume mode 3, where the clock is high when idle; data is driving following the falling edge of the clock and latched on the rising edge.

Operation

The protocol is easiest explained with shift registers as shown in the illustration below. The master generates the SPI Clock (SCLK) to initiate the information exchange. Data is shifted on one edge of this clock and is sampled on the opposite edge when the data is stable.

SPI Master and Slave as shift registers

In mode 3, at the falling edge of SCLK, both devices drive their most significant bit (b7) on their outgoing data line. On the rising edge, both devices clock in this bit into the least significant bit position (b0). After eight SCLK cycles, the master and slave have exchanged their values and each device processes the data received (e.g. writing it to memory). In case there is more data to be exchanged, the registers are loaded with new data and the process repeats itself. Once all data is transmitted, the master stops the SCLK clock.

Slave select

For a more complete picture, we need to include the effect of the slave select (SS*) signal that is used to address the slave devices.

SPI bus timing

Slaves may only drive their output (MISO) line when SS* is active, otherwise they should tri-stated the output. The protocol can be broken down into the following steps:

  1. The master initiates the communication by activating SS*.
    • The slave responds by starting to drive its MISO output.
    • Meanwhile the master drives its MOSI output.
  2. The master makes SCLK low.
    • On this falling edge, the master and slave drive their most significant bit position (b7) on respectively their MOSI and MISO outputs.
  3. The master makes SCLK high.
    • On this rising edge, the master and slave clock the input from their respectively MISO and MOSI inputs into the least significant bit position (b0).
  4. Go back to step 2. until the least significant bit position (b0) has been sent.
  5. When all bits are transmitted, the master deactivates SS*.

Following this definition of the SPI byte protocol, the following pages describe an implementation of this protocol, where an Arduino is the master and a FPGA is the slave.

Byte protocol on Arduino

Implements the SPI byte protocol on Arduino to exchange bytes with a FPGA. Written in C for Intel Arduino 101. This page continues the protocol description of the Math Talk series. This short section shows an Arduino implementation of a SPI master. The Arduino can be either an Arduino 101 or an Arduino UNO R3 that has been modified to run on 3.3V as described in the hardware section.

Arduino as Master

The Arduino is blessed with a support library for the serial peripheral interface. This greatly aids the implementation. The code shown below was tested with both types of Arduino. For the slave we used an Altera or Xilinx based FPGA implementation as described on the next page. Refer to the first part of this article for details about the physical connection. In particular, once more, please read the part about 3.3V versus 5V when using an Arduino UNO R3.

Sources

The SPI library makes it very straightforward to implement a SPI master. You can clone this project from:

The Arduino sends an alternating pattern of 0xAA and 0x55 to the FPGA. On the FPGA, LED[0] will be on when it receives 0xAA. Consequentially it will blink with 10% duty cycle. The FPGA always returns 0x55, what is displayed on the serial port.

Traces

Logic analyzer traces of the Arduino communicating with the FPGA. We see the SS going active, and the FPGA sending a byte over MISO to the Arduino. The data is sampled on the rising edge of SCLK.

SPI exchange of a byte

Zoomed out, we see the Arduino receiving three bytes from the FPGA.

SPI exchange of three bytes, zoomed out

Following this “SPI byte protocol on Arduino”, up next is: Byte Exchange with a FPGA as Slave.

Byte protocol

Specifies the SPI byte protocol. We use this to exchange bytes between the Arduino microcontroller and a FPGA. This is the third part of Math Talk. In this part we describe the protocol used to transfer bytes between the microcontroller and FPGA.

Bytes Exchange Protocol

With the two devices physically connected, we need a protocol to transfer data. We chose the Serial Peripheral Interface (SPI), a lightweight protocol to connect one master to one or more slaves.

Master/slave

The SPI bus is controlled by a master device (typically a microcontroller) that orchestrates the bus access. The master generates the control signals and regulates the data flow. The illustration below shows a master with three slaves. The pinout for SCLK, MOSI, MISO and SS can be found on the previous page. The master uses the Slave Select (SS) signal to select the slave.

SPI master with three slaves

Parameters

SPI is also a protocol with many degrees of freedom. It is important that the master and slave agree on the voltage levels and maximum clock frequency. The SPI clock polarity (CPOL) and clock phase (CPHA) introduce four more degrees of freedom as shown in the table below.

SPI parameters
Mode CPOL CPHA clock idle data driven data latched
0 0 0 low falling edge rising edge
1 0 1 low rising edge falling edge
2 1 0 high rising edge falling edge
3 1 1 high falling edge rising edge

For this article we assume mode 3, where the clock is high when idle; data is driving following the falling edge of the clock and latched on the rising edge.

Operation

The protocol is easiest explained with shift registers as shown in the illustration below. The master generates the SPI Clock (SCLK) to initiate the information exchange. Data is shifted on one edge of this clock and is sampled on the opposite edge when the data is stable.

own work
SPI Master and Slave as shift registers

In mode 3, at the falling edge of SCLK, both devices drive their most significant bit (b7) on their outgoing data line. On the rising edge, both devices clock in this bit into the least significant bit position (b0). After eight SCLK cycles, the master and slave have exchanged their values and each device processes the data received (e.g. writing it to memory). In case there is more data to be exchanged, the registers are loaded with new data and the process repeats itself. Once all data is transmitted, the master stops the SCLK clock.

Slave select

For a more complete picture, we need to include the effect of the slave select (SS*) signal that is used to address the slave devices.

SPI bus timing

Slaves may only drive their output (MISO) line when SS* is active, otherwise they should tri-stated the output. The protocol can be broken down into the following steps:

  1. The master initiates the communication by activating SS*.
    • The slave responds by starting to drive its MISO output.
    • Meanwhile the master drives its MOSI output.
  2. The master makes SCLK low.
    • On this falling edge, the master and slave drive their most significant bit position (b7) on respectively their MOSI and MISO outputs.
  3. The master makes SCLK high.
    • On this rising edge, the master and slave clock the input from their respectively MISO and MOSI inputs into the least significant bit position (b0).
  4. Go back to step 2. until the least significant bit position (b0) has been sent.
  5. When all bits are transmitted, the master deactivates SS*.

Following this definition of the SPI byte protocol, the following pages describe an implementation of this protocol, where an Arduino is the master and a FPGA is the slave.

Hardware

Implementation of the hardware SPI connection between an Intel Arduino 101 and a FPGA. Includes pinouts and schematic. This is the second part of Math Talk. We will describe the hardware components and the physical interconnect to communicate with the FPGA.

SPI connection between Arduino and FPGA

SPI is a protocol, in which one device (the master) controls one or more other devices (the slaves). For the master we use an open-source microcontroller prototyping platform, such as the Arduino 101 or a modified Arduino UNO R3. In this document we use Arduino to refer to either platform.

The slave can be a low-cost FPGA prototyping platforms, such as the Xilinx Spartan-6 Avnet LX9 or the Altera Cyclone-IV Terasic DE0-Nano.

The repository includes project files and pin assignments for both these boards. The code is written in HDL Verilog and should work equally well on more powerful boards.

Voltage levels

It is very important that the I/O voltage levels of the devices match. Both FPGA boards support 3.3V levels, and are a good match for the Arduino 101. However, the Arduino UNO uses the traditional 5 Volt TTL levels. Instead of using a level shifter, such as the 74LVC245, we opt for converting the Arduino to 3.3V according to Adafruit’s instructions. Running a 16 MHz clock at 3.3V is out of spec. Is said to work, but should really program the fuses to get the frequency down to abt. 13 MHz.

Signals

The SPI interface is a 4 wire interface. The bus consists of 3 signals plus a slave select signal for each device.

  • SCLK, clock signal sent from the master to all slaves;
  • MOSI, serial data from the master to the slaves (Master Out-Slave In);
  • MISO, serial data from a slave to the master (Master In-Slave Out);
  • SSn, slave select signal for each slave.

Once the Arduino runs at 3.3V, connecting the two devices becomes trivial.

Pinouts

Avnet LX9 (Xilinx Spartan-6)
Terasic DE0-Nano (Altera Cyclone-IV)

Arduino SPI pins
Intel Arduino 101

Connect

List of physical connections

Connections
signal Arduino Xlinx FPGA Altera FPGA
SS Digital I/O 10 PMOD J4 pin1 GPIO0 J1 pin4
MOSI Digital I/O 11 PMOD J4 pin2 GPIO0 J1 pin6
MISO Digital I/O 12 PMOD J4 pin3 GPIO0 J1 pin8
SCK Digital I/O 13 PMOD J4 pin4 GPIO0 J1 pin10
GND GND PMOD J4 pin5 GPIO0 J1 pin12

Schematic

Interconnect between Arduino 101 and NE0-nano

Following this “SPI connection between Arduino and FPGA”, the next page describes how to exchange bytes over this physical interface.

Introduction

This series “Connecting Arduino to FPGA” describes how the Arduino can access custom registers on a FPGA. Hardware schematic and protocol implementation to transfer messages. Written in Verilog HDL and C.

Building Math Circuits implemented a math compute device on a Field Programmable Gate Array (FPGA). This sequel describes a protocol and its implementation that enables a FPGA and microcontroller to communicate with each other. The idea is to generate operands on an microcontroller; the Math Hardware then performs the operations and returns the results. The communication between the devices is the focus of this article.

Connecting Arduino to FPGA

The protocol that we will use is called Serial Peripheral Interface (SPI). It is a synchronous full-duplex serial interface [1], and is commonly used to communicate with on-board peripherals such as EEPROM, FLASH memory, A/D converters, temperature sensors, or in our case a Field Programmable Gate Array (FPGA).

We assume a working knowledge of the Verilog hardware description language. To learn more about Verilog refer a book such as “FPGA Prototyping with Verilog Examples” by Chu, do the free online class at verilog.com, or read through the slides Intro to Verilog from MIT. Instructions on installing the toolchain for the Verilog IDEs can be found at Getting Started with FPGA programming on Altera or on Xilinx.

Contents

The series “Connecting Arduino to FPGA” starts with describing the physical connections. We then look into exchanging bytes between a microcontroller and an FPGA. The last part implements an layer that allows message passing to custom registers.

Square root circuit

(c) Copyright 2016 Coert Vonk

Implements a math square root using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Square root using logic gates

(c) Copyright 2016 Coert Vonk The square root method implemented here is a simplification of Samavi and Sutikno improvements of the non-restoring digit recurrence square root algorithm. For details about this method, refer to Chapter 7 of the inquiry “How do Computers do Math?“. \(\)

Simplified Samovi Square Root

The square root of an integer can be calculated using a circuit of Controlled Subtract-Multiplex (csm) blocks. The blocks were introduced as part of the divider implementation. The square root circuit for an 8-bits value is given in shown below.

Each row performs one “attempt subtraction” cycle. For a start, the top row attempts to subtracts binary 01. If the answer would be negative, the most significant bit of the output will be ‘1’. This msb drives the drives the Output Select (\(os\)) inputs that effectively cancels the subtraction if the result would be negative.

8-bit simplified-Samovi square-root

Similar to the divider, using Verilog HDL we can generate instances of csm blocks based on the word length of the radicand (xWIDTH) . Once more, to describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference \(d\) and borrow-out \(b\). E.g. \(d_{13}\) identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names \(d\) and \(b\) and do the same for the quotient itself as shown in the table below.

2BD

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.

generate genvar ii, jj;
: 2BD
endgenerate

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

As usual, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when each subtraction needs to be cancelled.

Post-map Timing Analysis reveals the worst-case propagation delays. The values in the table below, assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits
8-bits
16-bits
27-bits
32-bits
Propagation delay for simplified-Samovi square-root

Conclusion

Judging from the number of research papers popping up each year, we can deduce that this is a still active field.

Add elementary functions such as sin, cos, tan, exponential, and logarithm.

Suggestion for further reading

Divider circuit

(c) Copyright 2016 Coert Vonk

Implements a math divider using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Divider using logic gates

(c) Copyright 2016 Coert Vonk The attempt-subtraction divider was introduced in the inquiry How do Computer do Math.\(\) This most basic divider consists of interconnected Controlled Subtract-Multiplex (csm) blocks. Each blocks contains a 1-bit Full Subtractor (fs) with the usual inputs a, b and bi and outputs d and bo. The output select signal, os, signal selects between input x and and the difference x-y.

$$ \begin{align*} d^\prime &=x \oplus y \oplus b_i\\ d&=os \cdot x + \overline{os} \cdot d^\prime\\ b_o&=\overline{x} \cdot y + b_i \cdot (\overline{x \oplus y}) \end{align*} $$
1-bit Controlled Subtract-Multiplex

Attempt-subtraction divider

A complete 8:4-bit divider can therefore be implemented by a matrix of csm modules connected on rows and columns as shown in figure below. Each row performs one “attempt subtraction” cycle. Note that the most significant bit is used to drive the Output Select os inputs. (For more details see “Combinational arithmetic“.)

8:4-bit Attempt-Subtraction

Similar to the multipliers, using Verilog HDL we can generate instances of csm blocks based on the word length of the dividend (xWIDTH) and divisor (yWIDTH). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference (\(d\)) and borrow-out (\(b\)). E.g. \(d_{13}\) identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names (\(d\) and \(b\)) and do the same for the quotient itself as shown in the table below.

(c) Copyright 2016 Coert Vonk
Signals do, bo, x, y, bi and os

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions. generate genvar ii, jj; for ( ii = 0; ii <; xWIDTH; ii = ii + 1) begin: gen_ii for ( jj = 0; jj <; yWIDTH + 1; jj = jj + 1) begin: gen_jj math_divider_csm_block csm( .a ( jj <; 1 ? x[xWIDTH-1-ii] : ii > 0 ? d[ii-1][jj-1] : 1’b0 ), .b ( jj <; yWIDTH ? y[jj] : 1'b0 ), .bi ( jj > 0 ? b[ii][jj-1] : 1’b0 ), .os ( b[ii][yWIDTH] ), .d ( d[ii][jj] ), .bo ( b[ii][jj] ) ); end end for ( ii = 0; ii <; xWIDTH; ii = ii + 1) begin: gen_p assign q[xWIDTH-1-ii] = ~b[ii][yWIDTH]; end for ( jj = 0; jj <;= yWIDTH; jj = jj + 1) begin: gen_r assign r[jj] = d[xWIDTH-1][jj]; end endgenerate[/code]

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

As usual, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when each subtraction needs to be cancelled.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The values in the table below, assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 11.2 ns 10.0 ns 6.8 ns
8-bits 37.1 ns 33.2 ns 22.0 ns
16-bits 148 ns 132 ns 84.8 ns
27-bits 408 ns 365 ns 236 ns
32-bits 485 ns 434 ns 279 ns
Propagation delay in attempt-subtraction divider

Continuing from “Divider using logic gates”, the next chapter shows an implementation of the square root algorithm introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

A faster multiplier circuit

(c) Copyright 2016 Coert Vonk

Implements a carry-save array multiplier using a circuit of logic gates. Written in Verilog HDL for Altera and Xilinx FPGA’s.

Carry-save array multiplier

(c) Copyright 2016 Coert Vonk \(\) The propagating carry limits the performance of the previous algorithm. Here we investigates methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick]

The speed of the multiplier is directly related to this execution time of these Digital Signal Processing (DSP) applications.

Since multiplication dominates the execution time of most DSP algorithms, so there is a need of high-speed multiplier. Examples are convolution, Fast Fourier Transform (FFT), filtering and in ALU of microprocessors.

Carry-save Array Multiplier

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSA). Even though the building block is still the multiplying adder (ma), the topology of prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways.

The illustration below gives an example of this multiplication process.

(c) Copyright 2016 Coert Vonk
Inside a carry-save array multiplier

Again, the building block is the multiplying adder (ma) as describe on the previous page. However, the topology is so that the carry-out from one adder is not connected to the carry-in of the next adder. Hence preventing a ripple carry. The circuit diagram below shows the connections between these blocks.

4-bit carry-save array multiplier

The observant reader might notice that ma0x can be replaced with simple AND gates, ma4x can be replaced by adders. Also the block ma43 is not needed. More interesting, the the ripple adder in the last row, can be replace with the faster carry look ahead adder.

Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of ma blocks based on the word length of the multiplicand and multiplier (N). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (s) and carry-out (c). E.g. c_13 identifies the carry-out signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names s and c and do the same for the product itself as shown in the table below.

Function for output signals ‘so’ and ‘co’ and output signals ‘x’, ‘y’, ‘si’, ‘ci’ and ‘p’

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.

generate genvar ii, jj;
    for ( ii = 0; ii <;= N; ii = ii + 1) begin: gen_ii
        for ( jj = 0; jj <; N; jj = jj + 1) begin: gen_jj
            math_multiplier_ma_block ma(
                .x ( ii <; N ? a[jj] : (jj > 0) ? c[N][jj-1] : 1'b0 ),
                .y ( ii <; N ? b[ii] : 1'b1 ),
                .si ( ii > 0  jj <; N - 1 ? s[ii-1][jj+1] : 1'b0 ),
                .ci ( ii > 0 ? c[ii-1][jj] : 1'b0 ),
                .so ( s[ii][jj] ),
                .co ( c[ii][jj] ) );
            if ( ii == N ) assign p[N+jj] = s[N][jj];
        end
        assign p[ii] = s[ii][0];
    end
endgenerate

The complete Verilog HDL source code along with the test bench and constraints is available at:

Results

The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit cause a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(2N\), this makes this carry-save multiplier is about 33% faster as the ripple-carry multiplier. Note that the average propagation delay is about half of this.

The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.0 ns 8.0 ns 5.6 ns
8-bits 18.7 ns 16.8 ns 11.4 ns
16-bits 30.9 ns 27.6 ns 18.3 ns
27-bits 46.8 ns 41.9 ns 27.7 ns
32-bits 57.9 ns 51.6 ns 34.3 ns
Propagation delay in carry-save array multiplier

Other multipliers

The Wallace Multiplier decreases the latency by reorganizing the additions. Wikipedia has a good description of the algorithm. Due to the irregular routing, they can be difficult to route on a FPGA. As a consequence, additional wire delays may cause it to perform slower than carry-safe array multipliers.

The Wallace Multiplier can be combined with Booth Coding. The Booth Multiplier (alt) uses, say, 2 bits of the multiplier in generating each partial product thereby using only half the number of rows. Booth Multipliers with more fancy VLSI technique such as 0.6μ BiCMOS process using emitter coupled logic makes 53×53 multipliers possible with a latency of less than 2.6 nanoseconds [ref].

Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by reordering the values to be multiplied. It is the standard technique used in chip design.

Vedic arithmetic is the ancient system of Indian mathematics which has a unique technique of calculations based on 16 Sutras (Formulae)

Another algorithm is Karatsuba. Overview.

Following this “Carry-save array multiplier using logic gates”, the next chapter shows an implementation of the divider introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

Multiplier circuit

(c) Copyright 2016 Coert Vonk

Implements a math multiplier using circuits of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGA’s.

Multiplier using logic gates

multiplier icon We introduced the carry-propagate array multiplier in the inquiry “How do Computers do Math?“.\(\)

This multiplier is build around Multiplier Adder (ma) blocks. These ma blocks are themselves build around the Full Adder (fa) blocks introduced in the adder section. These fa blocks have the usual inputs \(a\) and \(b\), \(c_i\) and outputs \(s\) and \(c_o\). The special thing is that the internal signal \(b\) is an AND function of the inputs \(x\) and \(y\) as depicted below.

$$ \begin{align*} a &= s_i\\ b &= x \cdot y\\ s_o &= s_i\oplus b\oplus c_i\\ c_o &= s_i \cdot b + c_i \cdot(s_i \oplus b) \end{align*} $$
1-bit multiplying-adder

Carry-propagate Array Multiplier

As shown in the inquiry “How do Computers do Math?“, a carry-propagate array multiplier can be built by combining many of these ma blocks. The circuit diagram below shows the connections between these blocks for a 4-bit multiplier.

4-bit carry-propagate array multiplier

For an implementation in Verilog HDL, we can instantiate ma blocks based on the word length of the multiplicand and multiplier (\(N\)). If you are new to Verilog HDL, remember that the generate code segment expands during compilation time. In other words, it is just a short hand for writing out the long list of ma block instances.

generate genvar ii, jj;
    for ( ii = 0; ii <; N; ii = ii + 1) begin: gen_ii
        for ( jj = 0; jj <; N; jj = jj + 1) begin: gen_jj
            math_multiplier_ma_block ma( 
                .x(?), .y(?), .si(?), .ci(?),
                .so(?), .co(?) );
        end
    end
endgenerate[/code]

As you might notice, the input and output ports are not described. For this, we need to derive the rules that govern these interconnects. Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (\(s\)) and carry-out (\(c\)). E.g. \(c_{13}\) identifies the carry-out signal for the block in row 1 and column 3. Note that the circuit description depicts the matrix in a slanted fashion.

own work
Output signals 'so' and 'co'

Knowing this, we can enter the output signals in the Verilog HDL code math_multiplier_ma_block ma( .x(?), .y(?), .si(?), .ci(?), .so ( s[ii][jj] ), .co ( c[ii][jj] ) );

Next, we express the input signals as a function of the output signal names \(s\) and \(c\) as shown in the table below.

own work
Input signals 'x', 'y', 'si' and 'ci'

Based on this table, we can express the input assignments for each ma using "c ? a : b" expressions. Note that Verilog 2001 does not allow these programming statements for the output pins. This is why we expressed the input ports as a function of the output ports instead of visa versa.

math_multiplier_ma_block ma( 
    .x ( a[jj] ),
    .y ( b[ii]),
    .si ( ii == 0 ? 1'b0 : jj <; N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ),
    .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ),
    .so ( s[ii][jj] ),
    .co ( c[ii][jj] ) );

All that is left to do is to express the inputs of the module as a function of the output signals

Output signal 'p'

Putting it all together, we get the following snippet generate genvar ii, jj; for (ii = 0; ii <; N; ii = ii + 1) begin: gen_ii for (jj = 0; jj <; N; jj = jj + 1) begin: gen_jj math_multiplier_ma_block ma( .x ( a[jj] ), .y ( b[ii]), .si ( ii == 0 ? 1'b0 : jj <; N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ), .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ), .so ( s[ii][jj] ), .co ( c[ii][jj] ) ); end assign p[ii] = s[ii][0]; end for (jj = 1; jj <; N; jj = jj + 1) begin: gen_jj2 assign p[jj+N-1] = s[N-1][jj]; end assign p[N*2-1] = c[N-1][N-1]; endgenerate[/code]

The ma block compiles into the RTL netlist shown below

own work
1-bit multiplying-adder in RTL

As shown in the figure below, the for loops unroll into 16 interconnected ma blocks.

own work
4-bit carry-propagate array multiplier in RTL

The complete Verilog HDL source code is available at:

Results

The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit because a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(3N\). Note that the average propagation delay is about half of this.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.9 ns 8.9 ns 6.1 ns
8-bits 20.8 ns 18.6 ns 12.4 ns
16-bits 41.3 ns 36.9 ns 24.2 ns
27-bits 69.6 ns 62.1 ns 40.9 ns 55 ns
32-bits 83.4 ns 74.5 ns 49.0 ns
Propagation delay in carry-propagate array multiplier

The timing analysis for \(N=27\), reveals that the worst-case propagation delay path goes through \(c_0\) and \(s_o\) as shown below on the left. When measuring the worst-case propagation delay on the actual device, we use input values that cause the maximum number ripple carries and sums propagating. For a 27-bit multiplier that where the input also has a maximum value of 99,999,999, the propagation path is simulated in a spreadsheet as shown below on the right.

Worst case path
Worst case input

Brute force using the FPGA to find all combinations of operands that cause long propagation delays revealed 27'h2FA3A92 * 27h'55D4A77, 27'h60A308B * 27'd99999999 (50ns), 27'h775A668 * 27'd89999999 (55 ns), 27'h56F5D8F * 27'h3AAAB7B (55 ns).

Following this "Math multiplier using logic gates", the next chapter explores methods of making the multiplication operation faster.

A faster adder circuit

(c) Copyright 2016 Coert Vonk

Implements carry-lookahead adders using circuits of logic gates Written in Verilog HDL for Altera and Xilinx FPGA’s.

Carry-lookahead adders

(c) Copyright 2016 Coert Vonk This chapter introduces algorithms to reduce the delay when adding numbers. We will look at two carry-lookahead adders using logic gates. \(\)

Carry-lookahead adder

Adding a carry-lookahead circuit can make the carry generation much faster. The individual bit adders no longer calculate outgoing carries, but instead generate propagate and generate signals.

We define a Partial Full Adder (pfa) module with the usual ports a and b for the summands, carry input ci , the sum s, and two new signals propagate p and generate g. The propagate (p) indicates that the bit would not generate the carry bit itself, but will pass through a carry from a lower bit. The generate (g) signal indicates that the bit generates a carry independent of the incoming carry. The functionality of the pfa module is expression the circuit and Boolean equations shown below.

(c) Copyright 2016 Coert Vonk
$$ \begin{align*} g &=a \cdot b \\ p &=a \oplus b \\ s &=p \oplus c_i \end{align*} $$
1-bit partial full adder

For bit position n, the outgoing carry cn is a function of pn, gn and the incoming carry ci,n. Except for bit position 0, the incoming carry equals the outgoing carry of the previous pfa, \(c_{i,n}=c_{o,n-1}\) $$ \begin{align*} c_n &= g_n + c_{in_{n}} \cdot p_n \\ &= g_n + c_{n-1} \cdot p_n \end{align*} $$

For a 4-bit cla this results in the following equations for the carryout signals: $$ \begin{align*} c_0& = g_0 + c_i \cdot p_0 \\ c_1& = g_1 + c_0 \cdot p_1 \\ c_2& = g_2 + c_1 \cdot p_2 \\ c_3& = g_3 + c_2 \cdot p_3 \\ \end{align*} $$

Substituting the cn-1

$$ \begin{align*} c_0 &= g_0 + c_{i} \cdot p_0 \\ c_1 &= g_1 + (g_0 + c_{i} \cdot p_0) \cdot p_1 \\ &= g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1 \\ c_2 &= g_2 + (g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1) \cdot p_2 \\ &= g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \\ c_3 &= g_3 + (g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2) \cdot p_3 \\ &= g_3 + g_2 \cdot p_3 + g_1 \cdot p_2 \cdot p_3 + g_0 \cdot p_1 \cdot p_2 \cdot p_3 +c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \cdot p_3 \end{align*} $$

The outgoing carries c0…3 no longer depend on each other, thereby eliminating the “ripple effect”. The outgoing carries can now be implemented with only 3 gate delays (1 for p/g generation, 1 for the ANDs and 1 for the final OR assuming gates with 5 inputs).

The circuit below gives an example of a 4-bit carry look-ahead adder.

(c) Copyright 2016 Coert Vonk
4-bit carry-lookahead adder

The complexity of the carry look-ahead increases dramatically with the bit number. Instead of calculating higher bit carries, one may daisy chaining the carry logic as shown for the 12-bit adder below.

(c) Copyright 2016 Coert Vonk
12-bit carry-lookahead adder

An implementation can be found at GitHub

Results

The propagation delay tpd depends on size n and the value of operands. For a given size n, adding the value 1 to an operand that contains all zeroes causes the longest propagation delay. The post-map Timing Analysis tool reveals the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 5.3 ns
8-bits 9.8 ns 8.7 ns 6.2 ns
16-bits 10.0 ns 8.9 ns 6.2 ns
27-bits 15.2 ns 13.5 ns 9.5 ns
32-bits 21.8 ns 19.7 ns 13.6 ns
42-bits 24.4 ns 21.7 ns 14.9 ns
Propagation delay in 1-level carry-lookahead adder

Multi-level carry-lookahead adder

To improve speed for larger word sizes, we can add a second level of carry look ahead. To facilitate this, we extend the cla circuit by adding \(p_{i,j}\) and \(g_{i,j}\) outputs. The propagate signal \(p_{i,j}\) indicates that an incoming carry propagates from bit position \(i\) to \(j\). The generate signal \(g_{i,j}\) indicates that a carry is generated at bit position \(j\), or if a carry out is generated at a lower bit position and propagates to position \(j\).

For a 4-bit block the equations are $$ \begin{align*} p_{0,3} &= p_3 \cdot p_2 \cdot p_1 \cdot p_0 \\ g_{0,3} &= g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0 \\ c_o &= g_{3,0} + p_{3,0} \cdot c_i \end{align*} $$

The circuit for a 16-bit two-level carry-lookahead adder is shown below

(c) Copyright 2016 Coert Vonk
16-bit carry-save adder using two levels of ‘cla’

An implementation can be found at GitHub

Results

Once more, the propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), adding the value 1 to an operand that contains all zeroes causes the longest propagation delay.

Once more, the post-map Timing Analysis predicts the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. As usual, the exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 4.3 ns
8-bits 9.3 ns 8.3 ns 5.8 ns
16-bits 11.4 ns 10.2 ns 7.1 ns
27-bits 15.3 ns 13.6 ns 9.6 ns
32-bits 18.6 ns 16.7 ns 11.6 ns
42-bits 18.0 ns 16.1 ns 11.2 ns
Propagation delay in 2-level carry-lookahead adder

Others

Other adder designs are carry-skip, carry-select and prefix adders.

Following this “Carry-lookahead adders using logic gates”, the next chapter shows an implementation of a multiplier introduced in Chapter 7 of the inquiry “How do Computers do Math?“.

Copyright © 1996-2022 Coert Vonk, All Rights Reserved