## Faster Parameterized Multiplier in Verilog

The propagating carry limits the performance of the previous algorithm. Here we investigates methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick] The speed of the multiplier is directly related to this execution time of these*Digital Signal Processing*(DSP) applications. Since multiplication dominates the execution time of most DSP algorithms, so there is a need of high-speed multiplier. Examples are convolution, Fast Fourier Transform (FFT), filtering and in ALU of microprocessors.

**Carry-save Array Multiplier**

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSA). Even though the building block is still the multiplying adder (`ma`

), the topology of prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways.
The illustration below gives an example of this multiplication process.
`ma`

) as describe on the previous page. However, the topology is so that the carry-out from one adder is not connected to the carry-in of the next adder. Hence preventing a ripple carry. The circuit diagram below shows the connections between these blocks.
`ma`_{0x}

can be replaced with simple AND gates, `ma`_{4x}

can be replaced by adders. Also the block `ma`_{43}

is not needed. More interesting, the the ripple adder in the last row, can be replace with the faster carry look ahead adder.
Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of `ma`

blocks based on the word length of the multiplicand and multiplier (`N`

). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.
Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals *sum*(

`s)`

and *carry-out*(

`c`

). E.g. c_13 identifies the carry-out signal for the block in row `1`

and column `3`

. Next, we express the input signals as a function of the output signal names `s`

and `c`

and do the same for the product itself as shown in the table below.
`?:`

expressions.
generate genvar ii, jj; for ( ii = 0; ii <= N; ii = ii + 1) begin: gen_ii for ( jj = 0; jj < N; jj = jj + 1) begin: gen_jj math_multiplier_ma_block ma( .x ( ii < N ? a[jj] : (jj > 0) ? c[N][jj-1] : 1'b0 ), .y ( ii < N ? b[ii] : 1'b1 ), .si ( ii > 0 && jj < N - 1 ? s[ii-1][jj+1] : 1'b0 ), .ci ( ii > 0 ? c[ii-1][jj] : 1'b0 ), .so ( s[ii][jj] ), .co ( c[ii][jj] ) ); if ( ii == N ) assign p[N+jj] = s[N][jj]; end assign p[ii] = s[ii][0]; end endgenerateThe complete Verilog HDL source code along with the test bench and constraints is available through GitHub.

#### Results

The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit cause a carry/sum that propagate to the highest order bit. This worst-case propagation delay is lineair with \(2N\), this makes this carry-save multiplier is about*33% faster*as the ripple-carry multiplier. Note that the average propagation delay is about half of this. The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

\(N\) | Timing Analysis | Measured | ||
---|---|---|---|---|

slow 85°C | slow 0°C | fast 0°C | actual | |

4-bits | 9.0 ns | 8.0 ns | 5.6 ns | |

8-bits | 18.7 ns | 16.8 ns | 11.4 ns | |

16-bits | 30.9 ns | 27.6 ns | 18.3 ns | |

27-bits | 46.8 ns | 41.9 ns | 27.7 ns | |

32-bits | 57.9 ns | 51.6 ns | 34.3 ns |

### Other multipliers

The*Wallace Multiplier*decreases the latency by reorganizing the additions. Wikipedia has a good description of the algorithm. Due to the irregular routing, they can be difficult to route on a FPGA. As a consequence, additional wire delays may cause it to perform slower than carry-safe array multipliers. The Wallace Multiplier can be combined with Booth Coding. The

*Booth Multiplier*(alt) uses, say, 2 bits of the multiplier in generating each partial product thereby using only half the number of rows. Booth Multipliers with more fancy VLSI technique such as 0.6μ BiCMOS process using emitter coupled logic makes 53×53 multipliers possible with a latency of less than 2.6 nanoseconds [ref]. Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by reordering the values to be multiplied. It is the standard technique used in chip design. Vedic arithmetic is the ancient system of Indian mathematics which has a unique technique of calculations based on 16 Sutras (Formulae) Another algorithm is Karatsuba. Overview. : The next chapter shows an implementation of the divider introduced in Chapter 7 of the inquiry “How do computers do Math?”.