A faster multiplier circuit

Implements a carry-save array multiplier using a circuit of logic gates. Written in Verilog HDL for Altera and Xilinx FPGA’s.

Carry-save array multiplier

(c) Copyright 2016 Coert Vonk \(\) The propagating carry limits the performance of the previous algorithm. Here we investigates methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick]

The speed of the multiplier is directly related to this execution time of these Digital Signal Processing (DSP) applications.

Since multiplication dominates the execution time of most DSP algorithms, so there is a need of high-speed multiplier. Examples are convolution, Fast Fourier Transform (FFT), filtering and in ALU of microprocessors.

Carry-save Array Multiplier

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSA). Even though the building block is still the multiplying adder (ma), the topology of prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways.

The illustration below gives an example of this multiplication process.

(c) Copyright 2016 Coert Vonk
Inside a carry-save array multiplier

Again, the building block is the multiplying adder (ma) as describe on the previous page. However, the topology is so that the carry-out from one adder is not connected to the carry-in of the next adder. Hence preventing a ripple carry. The circuit diagram below shows the connections between these blocks.

4-bit carry-save array multiplier

The observant reader might notice that ma0x can be replaced with simple AND gates, ma4x can be replaced by adders. Also the block ma43 is not needed. More interesting, the the ripple adder in the last row, can be replace with the faster carry look ahead adder.

Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of ma blocks based on the word length of the multiplicand and multiplier (N). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (s) and carry-out (c). E.g. c_13 identifies the carry-out signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names s and c and do the same for the product itself as shown in the table below.

Function for output signals ‘so’ and ‘co’ and output signals ‘x’, ‘y’, ‘si’, ‘ci’ and ‘p’

Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.

generate genvar ii, jj;
    for ( ii = 0; ii <;= N; ii = ii + 1) begin: gen_ii
        for ( jj = 0; jj <; N; jj = jj + 1) begin: gen_jj
            math_multiplier_ma_block ma(
                .x ( ii <; N ? a[jj] : (jj > 0) ? c[N][jj-1] : 1'b0 ),
                .y ( ii <; N ? b[ii] : 1'b1 ),
                .si ( ii > 0  jj <; N - 1 ? s[ii-1][jj+1] : 1'b0 ),
                .ci ( ii > 0 ? c[ii-1][jj] : 1'b0 ),
                .so ( s[ii][jj] ),
                .co ( c[ii][jj] ) );
            if ( ii == N ) assign p[N+jj] = s[N][jj];
        assign p[ii] = s[ii][0];

The complete Verilog HDL source code along with the test bench and constraints is available at:


The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit cause a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(2N\), this makes this carry-save multiplier is about 33% faster as the ripple-carry multiplier. Note that the average propagation delay is about half of this.

The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.0 ns 8.0 ns 5.6 ns
8-bits 18.7 ns 16.8 ns 11.4 ns
16-bits 30.9 ns 27.6 ns 18.3 ns
27-bits 46.8 ns 41.9 ns 27.7 ns
32-bits 57.9 ns 51.6 ns 34.3 ns
Propagation delay in carry-save array multiplier

Other multipliers

The Wallace Multiplier decreases the latency by reorganizing the additions. Wikipedia has a good description of the algorithm. Due to the irregular routing, they can be difficult to route on a FPGA. As a consequence, additional wire delays may cause it to perform slower than carry-safe array multipliers.

The Wallace Multiplier can be combined with Booth Coding. The Booth Multiplier (alt) uses, say, 2 bits of the multiplier in generating each partial product thereby using only half the number of rows. Booth Multipliers with more fancy VLSI technique such as 0.6μ BiCMOS process using emitter coupled logic makes 53×53 multipliers possible with a latency of less than 2.6 nanoseconds [ref].

Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by reordering the values to be multiplied. It is the standard technique used in chip design.

Vedic arithmetic is the ancient system of Indian mathematics which has a unique technique of calculations based on 16 Sutras (Formulae)

Another algorithm is Karatsuba. Overview.

Following this “Carry-save array multiplier using logic gates”, the next chapter shows an implementation of the divider introduced in Chapter 7 of the inquiry “How do Computers do Math?“.