# Building Math Hardware

## Faster Parameterized Adder in Verilog This chapter introduces algorithms to reduce the delay when adding numbers. We will look at two carry-lookahead circuits 

Adding a carry-lookahead circuit can make the carry generation much faster. The individual bit adders no longer calculate outgoing carries, but instead generate propagate and generate signals. We define a Partial Full Adder (pfa) module with the usual ports a and b for the summands, carry input ci , the sum s, and two new signals propagate p and generate g. The propagate (p) indicates that the bit would not generate the carry bit itself, but will pass through a carry from a lower bit. The generate (g) signal indicates that the bit generates a carry independent of the incoming carry. The functionality of the pfa module is expression the circuit and Boolean equations shown below. \begin{aligned} g &=a \cdot b\\ p &=a \oplus b\\ s &=p \oplus c_i \end{aligned}
For bit position n, the outgoing carry cn is a function of pn, gn and the incoming carry ci,n. Except for bit position 0, the incoming carry equals the outgoing carry of the previous pfa, $$c_{i,n}=c_{o,n-1}$$ \begin{aligned} c_n &=g_n + c_{in_{n}} \cdot p_n \\ &=g_n + c_{n-1} \cdot p_n \end{aligned} For a 4-bit cla this results in the following equations for the carryout signals: \begin{aligned} c_0& = g_0 + c_i \cdot p_0 \\ c_1& = g_1 + c_0 \cdot p_1 \\ c_2& = g_2 + c_1 \cdot p_2 \\ c_3& = g_3 + c_2 \cdot p_3 \\ \end{aligned} Substituting the cn-1

\begin{aligned} c_0 &= g_0 + c_{i} \cdot p_0\\ c_1 &= g_1 + (g_0 + c_{i} \cdot p_0) \cdot p_1\\ &= g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1\\ c_2 &= g_2 + (g_1 + g_0 \cdot p_1 + c_{i} \cdot p_0 \cdot p_1) \cdot p_2\\ &= g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \\ c_3 &= g_3 + (g_2 + g_1 \cdot p_2 + g_0 \cdot p_1 \cdot p_2 + c_{i} \cdot p_0 \cdot p_1 \cdot p_2) \cdot p_3\\ &= g_3 + g_2 \cdot p_3 + g_1 \cdot p_2 \cdot p_3 + g_0 \cdot p_1 \cdot p_2 \cdot p_3 +c_{i} \cdot p_0 \cdot p_1 \cdot p_2 \cdot p_3 \end{aligned}

The outgoing carries c0…3 no longer depend on each other, thereby eliminating the “ripple effect”. The outgoing carries can now be implemented with only 3 gate delays (1 for p/g generation, 1 for the ANDs and 1 for the final OR assuming gates with 5 inputs). The circuit below gives an example of a 4-bit carry look-ahead adder.
The complexity of the carry look-ahead increases dramatically with the bit number. Instead of calculating higher bit carries, one may daisy chaining the carry logic as shown for the 12-bit adder below.
The source code is available through GitHub.

#### Results

The propagation delay tpd depends on size n and the value of operands. For a given size n, adding the value 1 to an operand that contains all zeroes causes the longest propagation delay. The post-map Timing Analysis tool reveals the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.
$$N$$ Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 5.3 ns
8-bits 9.8 ns 8.7 ns 6.2 ns
16-bits 10.0 ns 8.9 ns 6.2 ns
27-bits 15.2 ns 13.5 ns 9.5 ns
32-bits 21.8 ns 19.7 ns 13.6 ns
42-bits 24.4 ns 21.7 ns 14.9 ns

To improve speed for larger word sizes, we can add a second level of carry look ahead. To facilitate this, we extend the cla circuit by adding $$p_{i,j}$$ and $$g_{i,j}$$ outputs. The propagate signal $$p_{i,j}$$ indicates that an incoming carry propagates from bit position $$i$$ to $$j$$. The generate signal $$g_{i,j}$$ indicates that a carry is generated at bit position $$j$$, or if a carry out is generated at a lower bit position and propagates to position $$j$$. For a 4-bit block the equations are \begin{aligned} p_{0,3} &= p_3 \cdot p_2 \cdot p_1 \cdot p_0\\ g_{0,3} &= g_3 + p_3 \cdot g_2 + p_3 \cdot p_2 \cdot g_1 + p_3 \cdot p_2 \cdot p_1 \cdot g_0\\ c_o &= g_{3,0} + p_{3,0} \cdot c_i \end{aligned} The circuit for a 16-bit two-level carry-lookahead adder is shown below
The source code is available through GitHub.

#### Results

Once more, the propagation delay $$t_{pd}$$ depends size $$N$$ and the value of operands. For a given size $$N$$, adding the value (1\) to an operand that contains all zeroes causes the longest propagation delay. Once more, the post-map Timing Analysis predicts the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. As usual, the exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.
$$N$$ Timing Analysis Measured
slow 85°C slow 0°C fast 0°C actual
4-bits 8.1 ns 7.2 ns 4.3 ns
8-bits 9.3 ns 8.3 ns 5.8 ns
16-bits 11.4 ns 10.2 ns 7.1 ns
27-bits 15.3 ns 13.6 ns 9.6 ns
32-bits 18.6 ns 16.7 ns 11.6 ns
42-bits 18.0 ns 16.1 ns 11.2 ns