We introduced the carry-propegate array multiplier in the inquiry “How do computers do Math?“.\(\)
This multiplier is build around Multiplier Adder (ma) blocks. These ma blocks are themselves build around the Full Adder (fa) blocks introduced in the adder section. These fa blocks have the usual inputs \(a\) and \(b\), \(c_i\) and outputs \(s\) and \(c_o\). The special thing is that the internal signal \(b\) is an AND function of the inputs \(x\) and \(y\) as depicted below.
1-bit multiplying-adder
$$\begin{aligned}
a &= s_i\\
b &= x \cdot y\\
s_o &= s_i\oplus b\oplus c_i\\
c_o &= s_i \cdot b + c_i \cdot(s_i \oplus b)
\end{aligned}$$
Carry-propagate Array Multiplier
As shown in the inquiry “How do Computers do Math?“, a carry-propagate array multiplier can be built by combining many of these ma blocks. The circuit diagram below shows the connections between these blocks for a 4-bit multiplier.
For an implementation in Verilog HDL, we can instantiate ma blocks based on the word length of the multiplicand and multiplier (\(N\)). If you are new to Verilog HDL, remember that the generate code segment expands during compilation time. In other words, it is just a short hand for writing out the long list of ma block instances.
generate genvar ii, jj;
for ( ii = 0; ii < N; ii = ii + 1) begin: gen_ii
for ( jj = 0; jj < N; jj = jj + 1) begin: gen_jj
math_multiplier_ma_block ma(
.x(?), .y(?), .si(?), .ci(?),
.so(?), .co(?) );
end
end
endgenerate
As you might notice, the input and output ports are not described. For this, we need to derive the rules that govern these interconnects. Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum (\(s\)) and carry-out (\(c\)). E.g. \(c_{13}\) identifies the carry-out signal for the block in row 1 and column 3. Note that the circuit description depicts the matrix in a slanted fashion.
Knowing this, we can enter the output signals in the Verilog HDL code
Next, we express the input signals as a function of the output signal names \(s\) and \(c\) as shown in the table below.
Based on this table, we can express the input assignments for each ma using “c ? a : b” expressions. Note that Verilog 2001 does not allow these programming statements for the output pins. This is why we expressed the input ports as a function of the output ports instead of visa versa.
All that is left to do is to express the inputs of the module as a function of the output signals
Putting it all together, we get the following snippet
generate genvar ii, jj;
for (ii = 0; ii < N; ii = ii + 1) begin: gen_ii
for (jj = 0; jj < N; jj = jj + 1) begin: gen_jj
math_multiplier_ma_block ma(
.x ( a[jj] ),
.y ( b[ii]),
.si ( ii == 0 ? 1'b0 : jj < N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ),
.ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ),
.so ( s[ii][jj] ),
.co ( c[ii][jj] ) );
end
assign p[ii] = s[ii][0];
end
for (jj = 1; jj < N; jj = jj + 1) begin: gen_jj2
assign p[jj+N-1] = s[N-1][jj];
end
assign p[N*2-1] = c[N-1][N-1];
endgenerate
The ma block compiles into the RTL netlist shown below
As shown in the figure below, the for loops unroll into 16 interconnected ma blocks.
The complete Verilog HDL source code is available through GitHub.
Results
The propagation delay \(t_{pd}\) depends size \(N\) and the value of operands. For a given size \(N\), the maximum propagation delay occurs when the low order bit because a carry/sum that propagate to the highest order bit. This worst-case propagation delay is linear with \(3N\). Note that the average propagation delay is about half of this.
The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.
Propagation delay in carry-propagate array multiplier
\(N\)
Timing Analysis
Measured
slow 85°C
slow 0°C
fast 0°C
actual
4-bits
9.9 ns
8.9 ns
6.1 ns
8-bits
20.8 ns
18.6 ns
12.4 ns
16-bits
41.3 ns
36.9 ns
24.2 ns
27-bits
69.6 ns
62.1 ns
40.9 ns
55 ns
32-bits
83.4 ns
74.5 ns
49.0 ns
The timing analysis for \(N=27\), reveals that the worst-case propagation delay path goes through \(c_0\) and \(s_o\) as shown below on the left. When measuring the worst-case propagation delay on the actual device, we use input values that cause the maximum number ripple carries and sums propagating. For a 27-bit multiplier that where the input also has a maximum value of 99,999,999, the propagation path is simulated in a spreadsheet as shown below on the right.
Brute force using the FPGA to find all combinations of operands that cause long propagation delays revealed 27’h2FA3A92 * 27h’55D4A77, 27’h60A308B * 27’d99999999 (50ns), 27’h775A668 * 27’d89999999 (55 ns), 27’h56F5D8F * 27’h3AAAB7B (55 ns).
The next chapter explores methods of making the multiplication operation faster.