# Building Math Hardware



## Faster Parameterized Multiplier in Verilog The propagating carry limits the performance of the previous algorithm. Here we investigates methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick] The speed of the multiplier is directly related to this execution time of these Digital Signal Processing (DSP) applications. Since multiplication dominates the execution time of most DSP algorithms, so there is a need of high-speed multiplier. Examples are convolution, Fast Fourier Transform (FFT), filtering and in ALU of microprocessors.

### Carry-save Array Multiplier

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSA). Even though the building block is still the multiplying adder (ma), the topology of prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways. The illustration below gives an example of this multiplication process.
Again, the building block is the multiplying adder (ma) as describe on the previous page. However, the topology is so that the carry-out from one adder is not connected to the carry-in of the next adder. Hence preventing a ripple carry. The circuit diagram below shows the connections between these blocks.
The observant reader might notice that ma0x can be replaced with simple AND gates, ma4x can be replaced by adders. Also the block ma43 is not needed. More interesting, the the ripple adder in the last row, can be replace with the faster carry look ahead adder. Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of ma blocks based on the word length of the multiplicand and multiplier (N). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks. Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (s) and carry-out (c). E.g. c_13 identifies the carry-out signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names s and c and do the same for the product itself as shown in the table below. Function for output signals ‘so’ and ‘co’ and output signals ‘x’, ‘y’, ‘si’, ‘ci’ and ‘p’
Based on this table, we can now express the interconnects using Verilog HDL using ?: expressions.
generate genvar ii, jj;
for ( ii = 0; ii <= N; ii = ii + 1) begin: gen_ii
for ( jj = 0; jj < N; jj = jj + 1) begin: gen_jj
math_multiplier_ma_block ma(
.x ( ii < N ? a[jj] : (jj &gt; 0) ? c[N][jj-1] : 1'b0 ),
.y ( ii < N ? b[ii] : 1'b1 ),
.si ( ii &gt; 0 && jj < N - 1 ? s[ii-1][jj+1] : 1'b0 ),
.ci ( ii &gt; 0 ? c[ii-1][jj] : 1'b0 ),
.so ( s[ii][jj] ),
.co ( c[ii][jj] ) );
if ( ii == N ) assign p[N+jj] = s[N][jj];
end
assign p[ii] = s[ii];
end
endgenerate
The complete Verilog HDL source code along with the test bench and constraints is available through GitHub.

#### Results

The propagation delay $$t_{pd}$$ depends size $$N$$ and the value of operands. For a given size $$N$$, the maximum propagation delay occurs when the low order bit cause a carry/sum that propagate to the highest order bit. This worst-case propagation delay is lineair with $$2N$$, this makes this carry-save multiplier is about 33% faster as the ripple-carry multiplier. Note that the average propagation delay is about half of this. The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.
Propagation delay in carry-save array multiplier
$$N$$ Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 9.0 ns 8.0 ns 5.6 ns
8-bits 18.7 ns 16.8 ns 11.4 ns
16-bits 30.9 ns 27.6 ns 18.3 ns
27-bits 46.8 ns 41.9 ns 27.7 ns
32-bits 57.9 ns 51.6 ns 34.3 ns