## Starting with Altera

A (relatively) short introduction to compiling, simulating and uploading using the Altera Quartus development environment for the Terasic Altera Cyclone IV DE0-Nano under Windows 10. An equivalent tutorial is available for the reader who prefers Xilinx based boards.

## Install the FPGA Design Suite

Start by installing the free Quartus Prime Lite. It includes the IDE and the tool chain required to create configuration files for the Altera FPGA.

1. Install Quartus Prime Lite 21.1 (>16.1)
• Unpack Quartus-lite-21.1.0.842-windows.tar; run the setup.bat and install to a path without spaces in the name (e.g. C:\intelFPGA_lite\21.1)
• Include device support for Cyclone IV and Questa – Intel FPGA Starter Edition (was: ModelSim-Altera Starter Edition).
• Select USB Blaster II driver (JTAG) installation
• Run the Quartus Prime software
2. Run the Quartus Prime 21.1 Device Installer
• install Cyclone IV and ModelSim-Altera Starter support

The USB Blaster driver needs some finishing up:

1. Use a USB cable to connect your computer to the DE0-Nano board
2. Open the Windows Device Manager
• Right-click Other devices » USB-Blaster » Update Driver Software
• Browse my computer for driver software at C:\intelFPGA_lite\21.1

## Install Board Support for DE0-Nano

The Terasic board support for DE0-Nano includes examples, user manual and the Terasic System Builder tool.

Note that its Control Panel fails with Load DLL (TERASIC_JTAG_DRIVE.dll) under 64-bit Windows. No biggie, we do not need it.

## Before you start

The simulator doesn’t play well with UNC paths (e.g. network shares). They trigger numerous errors and may cause timing simulations to disregard delays. Keeping the files on the server is fine, as long as you access them through a symbolic link. To create one, open an elevated PowerShell window, change to your home directory and link to your file server: cd ~ ; New-Item -ItemType SymbolicLink -Path "Hardware.sym" -Target "\\server\path\to\files"

If you add the symbolic link to Explorer’s Quick Access, it will resolve the link first. To work around this, first create a regular directory and add that to the Quick Access. Then replace that directory with the symbolic link.

A few last tips before you take off:

Stay clear of absolute path names. Not only do they cause havoc when moving a project to a new directory, but they also confuse the simulator. If the ellipsis (…) file selector returns an absolute path, simply edit it to make the file relative to the project directory. E.g. change C:\Users\you\path\project.v to project.v. If you want to quickly change all absolute paths, I suggest closing Quartus and pulling the .qsf file into a text editor (e.g. Emacs).

Lastly, store the files on a local drive. It makes the process of compiling much faster.

## Build a circuit

Terasic advises starting with their System Builder to reduce the risk of damaging the board through incorrect I/O settings. I will throw this caution to the wind and use an example Quartus Settings file (.qsf) instead.

Start Quartus Prime and create a new project:

• File » New Project Wizard
• Choose a working directory (e.g. C:\users\you\whatever\example1); project name = example1
• Project type = Empty Project
• Family = Cyclone IV E; name filter = EP4CE22F17C6; click the only device that matches
• Finish, accepting the default EDA Tools Settings

Start with a Verilog HDL module that tests if two 1-bit inputs are equal (File » New » Design file » Verilog HDL, and later save it as eq1.v)

`timescale 1ns / 1ps
module eq1( input i0, input i1, output eq );
wire p0, p1; // internal signal declaration
assign eq = p0 | p1;   // equal when both inputs are 0 or both are 1
assign p0 = ~i0 & ~i1; // both inputs are 0
assign p1 = i0 & i1;   // both inputs are 1
endmodule
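The sum-of-products form of eq1 is the XNOR function. As a quick sanity check, the sketch below compares eq1 against a one-line XNOR for all four input combinations. The test bench name eq1_check_tb is made up for this illustration, and eq1 is repeated so that the block compiles on its own (outside the Quartus project).

```verilog
`timescale 1ns / 1ps

// eq1 as defined above, repeated so this sketch compiles on its own
module eq1( input i0, input i1, output eq );
    wire p0, p1;
    assign eq = p0 | p1;
    assign p0 = ~i0 & ~i1;
    assign p1 = i0 & i1;
endmodule

// hypothetical test bench: compares eq1 with XNOR for every input combination
module eq1_check_tb;
    reg  i0, i1;
    wire eq;
    integer n, errors;

    eq1 dut ( .i0(i0), .i1(i1), .eq(eq) );

    initial begin
        errors = 0;
        for (n = 0; n < 4; n = n + 1) begin
            {i1, i0} = n[1:0];           // walk through 00, 01, 10, 11
            #10;                         // give the outputs time to settle
            if (eq !== ~(i0 ^ i1)) begin
                errors = errors + 1;
                $display("mismatch: i0=%b i1=%b eq=%b", i0, i1, eq);
            end
        end
        if (errors == 0) $display("eq1 matches XNOR for all inputs");
        $finish;
    end
endmodule
```

Any Verilog simulator should do for this check; it is not part of the example1 project.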

Add a second Verilog HDL module that compares two 2-bit inputs (File » New » Design file » Verilog HDL, and later save it as eq2.v)

`timescale 1ns / 1ps
module eq2( input [1:0] a, input [1:0] b, output aeqb );
wire e0, e1; // internal signal declaration
eq1 eq_bit0_unit(.i0(a[0]), .i1(b[0]), .eq(e0));
eq1 eq_bit1_unit(.i0(a[1]), .i1(b[1]), .eq(e1));
assign aeqb = e0 & e1; // a and b are equal if individual bits are equal
endmodule

For the top level, we will use a schematic. Create a symbol file (eq2.sym) for the eq2 module so we can reference it in the schematic.

• File » Create/Update » Create Symbol Files for Current File

Create the top level schematic

• File » New » Design file » Block Diagram/Schematic
• File » Save as » example1.bdf
• Make sure that this top-level module has the same name as its source file and is the same name as your project name.
• Update the block diagram (.bdf)
• select the block diagram tab
• place the new symbol
• double-click the canvas
• expand the project directory and select the eq2.sym file that we just created
• place the symbol in the block diagram
• double-click the canvas
• expand the libraries directory and select “primitive » pin » input”
• place the pin so that it touches input a
• change the pin name to SWITCH[1..0]
• copy and paste the input pin, and place it so that it touches input b
• change the pin name to SWITCH[3..2]
• double-click the canvas
• expand the libraries directory and select “primitive » pin » output”
• place it so it touches output aeqb (a equals b)
• change the pin name to LED[0]
• Last, but not least: Mark this file as the top-level module
• Right-click the file name, and select “Set as Top-Level Entry”

Compile

• Processing » Start » Start Analysis & Elaboration

## Implementation Constraints

Time to constrain the implementation by specifying the input and output pins along with timing requirements.

### External pins assignments

• Assignments » Pin Planner
• This will show the five I/O pins
• Don’t worry about the Fitter Location, it is just whatever the fitter chose last time
• Double-click the Location field next to each of the pin names to add the pin numbers (based on Table 3-2 and 3-3 in the Terasic board’s user manual)
LED[0] PIN_A15
SWITCH[3] PIN_M15
SWITCH[2] PIN_B9
SWITCH[1] PIN_T8
SWITCH[0] PIN_M1
• Change the I/O standard to 3.3V_LVTTL based on the same user manual (change the first one, then copy and paste)

You should end up with something like

If you plan to branch out, I suggest downloading and importing the settings file (.qsf) with pin assignments from university.altera.com.

### Timing requirements

For most designs you will want to specify timing requirements. For our simple example, however, we will not do this.

For the record, to specify timing requirements (TimeQuest is the default timing analyzer) you would create a Synopsys Design Constraints file (File » New SDC File) and save it with the same base name as the top level file (e.g. example1.sdc). For our example it just contains some comments as a reminder.

#create_clock -period 20.000 -name CLOCK_50
#derive_pll_clocks
#derive_clock_uncertainty

## Synthesize and upload to FPGA

Before moving ahead, I like to shorten the path names in the Quartus Settings file (.qsf). This prevents problems when you move the project, especially when it is stored on a file server. As we will see later, the ModelSim simulator doesn’t support UNC path names, but will honor relative paths even when the project is on a file server.

• Assignments » Settings
• Remove the files (.bdf, .v, .sdc)
• Add the files back in. After selecting each file using the ellipsis (…) file selector, edit the resulting path name to exclude the path and press Add.
• Project Navigator
• Right-click example1.bdf and choose Set as Top-Level Entry
• Alternatively, you can close Quartus, open the .qsf file in a text editor (e.g. Emacs) and shorten the path names (*_FILE) at the end of the file.

Let us move ahead and generate the binary SRAM Object File (.sof). This is the configuration file to be uploaded to the FPGA.

• Processing » Start » Start Compilation
• The Compilation Report tab shows the result
• Correct any critical warnings or errors

This would be the moment to do a timing analysis and/or simulation. However, at this point I’m simply too curious to see if it works just out of the box, so let’s give it a shot.

Connect the DE0-Nano board with a USB cable and upload the configuration file to the FPGA SRAM. (Alternately upload to Flash)

• Quartus Prime Lite » Tools » Programmer (or double-click Program Device in the task list)
• Click Hardware Setup, and select USB-Blaster [USB-0]
• Click Add File, and select your SRAM object file (.sof) in the output_files directory
• Click Start
• Save as example1.cdf, so it opens with these settings next time
• Great! We did it, the design is now on the FPGA (until we remove the power)

## Give it a spin

With the FPGA configured, the time has come to test it:

• Input is through the DIP switches (SWITCH) on the bottom-left of the board.
• Output is the right LED (LED[0]) located at the top of the board.
• We expect the LED to be “on” when switch positions 0 and 1 are identical to positions 2 and 3.

If you prefer bigger switches, you can wire up a breadboard to the GPIO-0 connector.

• VCC3p3 and Ground are available on the 40-pin expansion header at pin 29 and pin 12 (or 30), respectively, as specified in the board manual.
• GPIO_00, 01, 02, 03 are on FPGA BGA pins D3, C3, A2, A3, and on the 40-pin expansion header at pins 2, 4, 5 and 6, respectively.
• Modify the pin assignments accordingly.

Give yourself a pat on the back; you have achieved something new today!

## Timing Analysis

With the first hurdle cleared, we are going to take a closer look at the circuit. The first step will analyze the delays in the circuit to determine the conditions under which the circuit operates reliably. We will follow up with a functional and timing simulation.

If you just want the timing information on the port-to-port path, you can use

• Tasks » Reports » Custom Reports » Report Timing
• From » click ...
• Collection = get_ports; List; select SWITCH[0] through SWITCH[3]; click ‘>’
• To » click ...
• Collection = get_ports; List; select LED[0]; click ‘>’

To place timing constraints on the port-to-port paths:

• Tools » TimeQuest Timing Analyzer
• Tasks » Update Timing Netlist ; double-click
• Constraints » Set Maximum Delay
• From » click ...
• Collection = get_ports; List; select SWITCH[0] through SWITCH[3]; click ‘>’
• To » click ...
• Collection = get_ports; List; select LED[0]; click ‘>’
• Delay value = 100 ns (more or less random value)
• Run
• Constraints » Write SDC File to example1.sdc
• Tasks » Reports » Custom Reports » Report Timing
• no need to specify the path, just click Report Timing
• this reveals a Data Delay of about 6.7 ns.

If you want Quartus to meet more stringent constraints, you need to specify these and recompile. This will direct Quartus to seek an implementation that meets these constraints. However, in this case we only specified constraints because we’re interested in the values. [TimeQuest User Guide]

## Functional Simulation

The Verilog HDL (.v) instructions are compiled (Analysis and Synthesis) into Register Transfer Level (RTL) netlists, called Compiler Database Files (.cdb). As the name implies, data is moved through gates and registers, typically subject to some clocking condition. A functional simulation tests the circuit at this RTL level. As such it will simulate the functionality, but not the propagation delays.

Altera Quartus ships bundled with their flavor of ModelSim that lets you draw waveforms for the signals going to the module under test. You can then save these waveforms as a wave.do file or HDL test bench. For this example we skip the step of drawing waveforms, and jump straight to using test benches.

Start by creating a test bench (eq2_tb.v) that instantiates the module under test and drives its inputs. As before, you should remove the path from the file name (Assignments » Settings » Files).

`timescale 1ns / 100ps
`default_nettype none

module eq2_tb;
reg [1:0] a;  // inputs
reg [1:0] b;
wire aeqb;  // output

// Instantiate the Device Under Test (DUT)
eq2 dut ( .a(a),
.b(b),
.aeqb(aeqb) );
initial begin
a = 0;  // initialize inputs
b = 0;

#100;  // wait 100 ns for global reset to finish (Xilinx only?)

// stimulus starts here; the output is displayed 10 ns after each input change
a = 2'b00; b = 2'b00; #10 $display("%b", aeqb);
a = 2'b01; b = 2'b00; #10 $display("%b", aeqb);
a = 2'b01; b = 2'b11; #10 $display("%b", aeqb);
a = 2'b10; b = 2'b10; #10 $display("%b", aeqb);
a = 2'b10; b = 2'b00; #10 $display("%b", aeqb);
a = 2'b11; b = 2'b11; #10 $display("%b", aeqb);
a = 2'b11; b = 2'b01; #10 $display("%b", aeqb);
#200 $stop;
end
endmodule
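The test bench above prints the output and leaves the judging to the reader. A self-checking variant, sketched below, walks through all 16 input combinations and compares aeqb against the expression (a == b). The name eq2_selfcheck_tb and the error counter are made up for this illustration; eq1 (condensed to one line) and eq2 are repeated so that the block compiles on its own.

```verilog
`timescale 1ns / 100ps

// eq1 and eq2 repeated (eq1 condensed) so this sketch compiles on its own
module eq1( input i0, input i1, output eq );
    assign eq = (~i0 & ~i1) | (i0 & i1);
endmodule

module eq2( input [1:0] a, input [1:0] b, output aeqb );
    wire e0, e1;
    eq1 eq_bit0_unit(.i0(a[0]), .i1(b[0]), .eq(e0));
    eq1 eq_bit1_unit(.i0(a[1]), .i1(b[1]), .eq(e1));
    assign aeqb = e0 & e1;
endmodule

// hypothetical self-checking test bench
module eq2_selfcheck_tb;
    reg  [1:0] a, b;
    wire aeqb;
    integer n, errors;

    eq2 dut ( .a(a), .b(b), .aeqb(aeqb) );

    initial begin
        errors = 0;
        for (n = 0; n < 16; n = n + 1) begin
            {a, b} = n[3:0];             // all combinations of a and b
            #10;                         // let the output settle
            if (aeqb !== (a == b)) begin // compare with the reference expression
                errors = errors + 1;
                $display("FAIL a=%b b=%b aeqb=%b", a, b, aeqb);
            end
        end
        $display("%0d error(s)", errors);
        $stop;
    end
endmodule
```

With a test bench like this, a non-zero error count in the transcript is enough to spot a regression without inspecting the waveforms.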

Configure Quartus to create a script that compiles the test bench and modules for ModelSim:

• Assignments » Settings » EDA Tool Settings » Simulation » Compile test bench
• New
• Test bench name = Functional test bench
• Top level module in test bench = eq2_tb
• File name = eq2_tb.v, eq1.v, eq2.v (name the test bench and all the modules that ModelSim needs to compile). After selecting each file using the ellipsis (…) file selector, edit the resulting path name to exclude the project path, then press Add.

Some hints, in case things don’t go your way:

• I strongly suggest putting the test bench in a separate file when doing timing simulations. This keeps ModelSim from using the .v file instead of the .vo file, which would cause propagation delays not to show in timing simulations.
• (sdfcomp-7) Failed to open SDF file “whatever.sdo” in read mode is most likely caused by the files residing on a UNC path. Refer to the section “Before you start”.
• (sdfcomp-14) Failed to open “modelsim.ini” specified by the MODELSIM environment variable is also most likely caused by the files residing on a UNC path. Refer to the section “Before you start”.
• When you get error deleting “msim_transcript” permission denied, close ModelSim before starting it again.
• Errors like Instantiation of ‘cycloneive_io_obuf’ failed. The design unit was not found indicate that the global libraries altera_ver or cycloneive_ver were not included in Assignments » Settings » Libraries.
• Keep in mind that the Windows path length limit (260 characters) may be exceeded.
• To prevent Warning: (vsim-WLF-5000) WLF file currently in use: vsim.wlf, quit ModelSim and delete simulation/modelsim/vsim.wlf and simulation/modelsim/wlft*.

Compile the circuit to RTL netlists

• Processing » Start » Start Analysis & Elaboration

Start the simulation

• Tools » Simulation Tool » RTL Simulation
• Select signals of interest
• Objects » select the signals of interest » Add Wave (or the green circle with the + sign)
• In the text field in the toolbar, change the simulation time from 1 ps to 200 ns and click the run button to the right (or type run 200 ns at the command prompt)
• Undock the Wave Window (icon on the far top right of the window). In case the Wave Window is hidden, use Windows » Wave.
• Click “zoom full” and observe the simulated waveforms.

You should see something like

If you make a change to either your DUT or your test bench

• right-click the test bench and select recompile.
• make sure that you click the Restart button directly to the right of the simulation time window (or simply type restart at the command prompt).

## Timing Simulation

Before you start a timing simulation, first close ModelSim.

After compiling your circuit into RTL netlists (.cdb), it is fitted to the physical device (.qsf) trying to satisfy resource assignments and constraints. It then attempts to optimize the remaining logic. The Netlist Writer can export this information for ModelSim in the form of Verilog Output Files (.vo), Standard Delay Format Output Files (.sdo) and scripts to prepare ModelSim through compilation and initialization.

For this exercise, we will do a timing simulation for the whole system. With our top-level entry being a schematic (.bdf), we first need to convert this to Verilog HDL

• Open the schematic
• File » Create/Update » Create HDL Design File from Current File
• File type = Verilog HDL
• Assignments » Settings » Files
• replace the example1.bdf file with example1.v
• Make it the top-level module (so much for using a schematic … eh)

Create a system test bench (example1_tb.v). As before, you should remove the path from the file name (Assignments » Settings » Files).

`timescale 1ns / 100ps
`default_nettype none

module example1_tb;
reg [3:0] SWITCH;  // inputs
wire LED;          // output

example1 uut ( .SWITCH(SWITCH),
.LED(LED) );
initial begin
SWITCH[3:0] = 4'b0000; // initialize inputs

#100;  // wait 100 ns for global reset to finish (Xilinx only?)

// stimulus starts here; the output is displayed 10 ns after each input change
SWITCH[1:0] = 2'b00; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
SWITCH[1:0] = 2'b01; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
SWITCH[1:0] = 2'b01; SWITCH[3:2] = 2'b11; #10 $display("%b", LED);
SWITCH[1:0] = 2'b10; SWITCH[3:2] = 2'b10; #10 $display("%b", LED);
SWITCH[1:0] = 2'b10; SWITCH[3:2] = 2'b00; #10 $display("%b", LED);
SWITCH[1:0] = 2'b11; SWITCH[3:2] = 2'b11; #10 $display("%b", LED);
SWITCH[1:0] = 2'b11; SWITCH[3:2] = 2'b01; #10 $display("%b", LED);
#200 $stop;
end
endmodule

Create the Verilog output files

• Processing » Start » Start EDA Netlist Writer

Configure Quartus to create a script that compiles the test bench and modules for ModelSim

• Assignments » Settings » Libraries » Global
• altera_ver
• cycloneive_ver
• Assignments » Settings » EDA Tool Settings » Simulation
• Compile test bench
• Delete the old Functional test bench
• New
• Test bench name = Timing test bench
• Top-level module in test bench = example1_tb
• Add the files listed below. After selecting each file using the ellipsis (…) file selector, edit the resulting path name to exclude the project path, then press Add.
• example1_tb.v
• simulation/modelsim/example1.vo
• it appears that eq2 and eq1 are included inside simulation/modelsim/example1.vo. If you have larger modules, make sure you include all the .vo files here.
• if the simulation/modelsim directory doesn’t exist: recompile, and then add the .vo module.
• If things don’t go your way, refer to the hints under “Functional Simulation”.

Start a full synthesis for the circuit

• Processing » Start » Start Compilation

Start ModelSim

• Tools » Simulation Tool » Gate Level Simulation
• Timing model = Slow -6 1.2V 85 Model (default), this simulates a slow -6 speed grade model at 1.2 V and 85 °C.

Select the signals

• Objects » select the signals of interest » Add Wave
• Increase simulation time to 1 µs and click the run button on the right
• Undock the Wave Window (so you can expand it)
• Click “zoom full” and observe the simulated waveforms.

Expect something like

For more hands-on experience, refer to Altera’s excellent University Program lessons, or the Terasic CD-ROM files (C:\altera_lite\DE0-Nano) for examples and documentation. Altera also has a more generic Become a FPGA Designer video-based class.

Our first implementation is the SPI interface Math Talk. This is then used to build a demonstration of math operations in FPGA, as alluded to in the inquiry How do Computer do Math?.

c’est tout

## Message protocol on FPGA

This continues the third part of Math Talk. This page shows a master implementation of the message protocol described earlier.

## Messages Exchange with FPGA as Slave

The implementation builds on the Byte Module code shown earlier. We will start by explaining how to pass multidimensional arrays through ports.

### Registers

On Altera, we can use the multidimensional arrays available in SystemVerilog

wire [nrRWregs+nrROregs-1:0] [31:0] registers;

The implementation is slightly more complicated on Xilinx, because Verilog 2001 doesn’t allow multidimensional arrays to be used as inputs or outputs. Instead, we work around this by flattening the 2D registers array into two vectors: one for the input ports and one for the output ports, as shown in the code fragments below.

Flatten the read/write part of the 2-dimensional array, registers, into the vector rwRegs1D, and inflate the vector roRegs1D into the array roRegs2D.

genvar nn;
wire [31:0] roRegs2D[0:nrROregs-1];
generate
for ( nn = 0; nn < nrRWregs; nn = nn + 1) begin :nnRW
assign rwRegs1D[32*nn+31:32*nn] = registers[nn]; // flatten
end
for ( nn = 0; nn < nrROregs; nn = nn + 1) begin :nnRO
assign roRegs2D[nn] = roRegs1D[32*nn+31:32*nn]; // inflate
end
endgenerate

Inflate the vector rwRegs1D into the 2-dimensional array registers, and flatten the read-only registers into the vector roRegs1D.

wire [0:31] registers[0:nrRWregs+nrROregs-1];
genvar nn;
generate
for ( nn = 0; nn < nrRWregs; nn = nn + 1 ) begin :nnRW
assign registers[nn] = rwRegs1D[32*nn+31:32*nn]; // inflate
end
for ( nn = 0; nn < nrROregs; nn = nn + 1 ) begin :nnRO
assign roRegs1D[32*nn+31:32*nn] = registers[nn+nrRWregs]; // flatten
end
endgenerate
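To see the flatten/inflate idea work end to end, here is a minimal, self-contained sketch. It is not the project's code: the module regs_consumer, its port last, the parameter nrRegs and the test values are all made up for this illustration. One side flattens an array of 32-bit registers into a single vector, the other side inflates that vector back into an array.

```verilog
`timescale 1ns / 1ps

// hypothetical module with a flattened port carrying nrRegs 32-bit registers
module regs_consumer #( parameter nrRegs = 3 ) (
    input  wire [32*nrRegs-1:0] regs1D,   // flattened registers
    output wire [31:0]          last      // highest-numbered register
);
    wire [31:0] regs2D [0:nrRegs-1];
    genvar nn;
    generate
        for ( nn = 0; nn < nrRegs; nn = nn + 1 ) begin :inflate
            assign regs2D[nn] = regs1D[32*nn+31:32*nn];   // inflate
        end
    endgenerate
    assign last = regs2D[nrRegs-1];
endmodule

// hypothetical test bench: flattens an array and passes it through the port
module flatten_tb;
    localparam nrRegs = 3;
    reg  [31:0]          registers [0:nrRegs-1];
    wire [32*nrRegs-1:0] regs1D;
    wire [31:0]          last;
    genvar nn;
    generate
        for ( nn = 0; nn < nrRegs; nn = nn + 1 ) begin :flatten
            assign regs1D[32*nn+31:32*nn] = registers[nn]; // flatten
        end
    endgenerate

    regs_consumer #( .nrRegs(nrRegs) ) dut ( .regs1D(regs1D), .last(last) );

    initial begin
        registers[0] = 32'h11111111;   // arbitrary test values
        registers[1] = 32'h22222222;
        registers[2] = 32'hDEADBEEF;
        #1;
        if (last === 32'hDEADBEEF) $display("round trip ok");
        else                       $display("mismatch: %h", last);
        $finish;
    end
endmodule
```

Register nn always occupies bits 32*nn+31 down to 32*nn of the vector, so both sides only need to agree on the register count.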

### Timing

The timing diagram below shows the relation between the different signals at the message level on Xilinx.

The Altera implementation is more optimized, as it needs to run at 200 MHz. The gate level simulation is shown below.

### Finite State Machine

This message module converts the bytes into messages and vice versa. The protocol is implemented using a state machine with 4 states:

• Idle (0)
• Transmit status (1), transmits 8-bit status to master
• Transmit register value (2), transmits a 32-bit register value to the master
• Receive register value (3), receives a 32-bit register value from the master

An additional state-like variable, byteId, is used to keep track of what byte to transmit or receive.
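The transitions between the states are not spelled out here, so the sketch below only illustrates the state encoding and the role of byteId. It is not the project's implementation: the request and byte_done inputs, the synchronous reset and the rule that a request simply names the state to enter are all assumptions made for this illustration.

```verilog
`timescale 1ns / 1ps

// hypothetical skeleton of a 4-state message FSM (all ports are assumptions)
module msg_fsm_sketch (
    input  wire       clk,
    input  wire       rst,        // synchronous reset
    input  wire [1:0] request,    // assumed: 0 = none, otherwise the state to enter
    input  wire       byte_done,  // assumed: high for one clk per exchanged byte
    output reg  [1:0] state,
    output reg  [1:0] byteId      // which of the 4 bytes of a 32-bit value
);
    localparam IDLE = 2'd0, TX_STATUS = 2'd1, TX_REG = 2'd2, RX_REG = 2'd3;

    always @(posedge clk) begin
        if (rst) begin
            state  <= IDLE;
            byteId <= 2'd0;
        end else case (state)
            IDLE: begin
                byteId <= 2'd0;
                state  <= request;            // request uses the state encoding
            end
            TX_STATUS:                        // the status is a single byte
                if (byte_done) state <= IDLE;
            default:                          // TX_REG and RX_REG move 4 bytes
                if (byte_done) begin
                    byteId <= byteId + 2'd1;
                    if (byteId == 2'd3) state <= IDLE;
                end
        endcase
    end
endmodule

// hypothetical test bench: one "receive register value" transfer of 4 bytes
module msg_fsm_sketch_tb;
    reg        clk, rst, byte_done;
    reg  [1:0] request;
    wire [1:0] state, byteId;

    msg_fsm_sketch dut ( .clk(clk), .rst(rst), .request(request),
                         .byte_done(byte_done), .state(state), .byteId(byteId) );

    always #5 clk = ~clk;

    initial begin
        clk = 0; rst = 1; byte_done = 0; request = 2'd0;
        #12 rst = 0; request = 2'd3;          // ask for "receive register value"
        @(negedge clk) request = 2'd0;
        byte_done = 1;                        // one byte per clock, for brevity
        repeat (4) @(negedge clk);            // four bytes make one 32-bit value
        byte_done = 0;
        if (state === 2'd0) $display("back in Idle after 4 bytes");
        else                $display("unexpected state %0d", state);
        $finish;
    end
endmodule
```

Keeping byteId separate from state keeps the state count at four, instead of needing one state per byte of a 32-bit value.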

### Sources

The complete project including constraints and test bench is available through

### Verification

To verify the implementation, run the test bench (spi_msg_tb.v) using gate level simulation. This test bench will monitor the communication and report errors when found. In the real world, we connect the Arduino SPI Master that acts just like the test bench.

This article introduced SPI as a protocol and expanded it to exchange messages.

## Byte protocol on Arduino

Implements the SPI byte protocol on Arduino to exchange bytes with a FPGA. Written in C for Intel Arduino 101. This page continues the protocol description of the Math Talk series. This short section shows an Arduino implementation of a SPI master. The Arduino can be either an Arduino 101 or an Arduino UNO R3 that has been modified to run on 3.3V as described in the hardware section.

## Arduino as Master

The Arduino is blessed with a support library for the serial peripheral interface. This greatly aids the implementation. The code shown below was tested with both types of Arduino. For the slave we used an Altera or Xilinx based FPGA implementation as described on the next page. Refer to the first part of this article for details about the physical connection. In particular, once more, please read the part about 3.3V versus 5V when using an Arduino UNO R3.

### Sources

The SPI library makes it very straightforward to implement a SPI master. You can clone this project from:

The Arduino sends an alternating pattern of 0xAA and 0x55 to the FPGA. On the FPGA, LED[0] will be on when it receives 0xAA. Consequently it will blink with 10% duty cycle. The FPGA always returns 0x55, which is displayed on the serial port.

## Traces

Logic analyzer traces of the Arduino communicating with the FPGA. We see the SS going active, and the FPGA sending a byte over MISO to the Arduino. The data is sampled on the rising edge of SCLK.

Zoomed out, we see the Arduino receiving three bytes from the FPGA.

Following this “SPI byte protocol on Arduino”, up next is: Byte Exchange with a FPGA as Slave.

## Byte protocol

Specifies the SPI byte protocol. We use this to exchange bytes between the Arduino microcontroller and a FPGA. This is the third part of Math Talk. In this part we describe the protocol used to transfer bytes between the microcontroller and FPGA.

## Bytes Exchange Protocol

With the two devices physically connected, we need a protocol to transfer data. We chose the Serial Peripheral Interface (SPI), a lightweight protocol to connect one master to one or more slaves.

### Master/slave

The SPI bus is controlled by a master device (typically a microcontroller) that orchestrates the bus access. The master generates the control signals and regulates the data flow. The illustration below shows a master with three slaves. The pinout for SCLK, MOSI, MISO and SS can be found on the previous page. The master uses the Slave Select (SS) signal to select the slave.

### Parameters

SPI is also a protocol with many degrees of freedom. It is important that the master and slave agree on the voltage levels and maximum clock frequency. The SPI clock polarity (CPOL) and clock phase (CPHA) introduce four more degrees of freedom as shown in the table below.

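SPI parameters

| Mode | CPOL | CPHA | Clock idle | Data driven | Data latched |
|------|------|------|------------|--------------|--------------|
| 0 | 0 | 0 | low | falling edge | rising edge |
| 1 | 0 | 1 | low | rising edge | falling edge |
| 2 | 1 | 0 | high | rising edge | falling edge |
| 3 | 1 | 1 | high | falling edge | rising edge |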
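One way to read the SPI parameters table: CPOL is the idle level of the clock, and data is latched on the rising edge exactly when CPOL equals CPHA. The small sketch below encodes that rule and prints one line per mode; the module and signal names are made up for this illustration.

```verilog
// hypothetical helper that derives the table columns from the mode number
module spi_mode_table;
    integer mode;
    reg cpol, cpha, latch_on_rising;

    initial begin
        for (mode = 0; mode < 4; mode = mode + 1) begin
            cpol = mode[1];                    // clock idle level (1 = high)
            cpha = mode[0];
            latch_on_rising = ~(cpol ^ cpha);  // 1 when CPOL equals CPHA
            $display("mode %0d: CPOL=%b CPHA=%b latch_on_rising=%b",
                     mode, cpol, cpha, latch_on_rising);
        end
        $finish;
    end
endmodule
```

Data is driven on the opposite edge from the one on which it is latched, so one bit per mode is enough to describe both columns.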

For this article we assume mode 3, where the clock is high when idle; data is driven on the falling edge of the clock and latched on the rising edge.

### Operation

The protocol is most easily explained with shift registers as shown in the illustration below. The master generates the SPI Clock (SCLK) to initiate the information exchange. Data is shifted on one edge of this clock and is sampled on the opposite edge when the data is stable.

In mode 3, at the falling edge of SCLK, both devices drive their most significant bit (b7) on their outgoing data line. On the rising edge, both devices clock in this bit into the least significant bit position (b0). After eight SCLK cycles, the master and slave have exchanged their values and each device processes the data received (e.g. writing it to memory). In case there is more data to be exchanged, the registers are loaded with new data and the process repeats itself. Once all data is transmitted, the master stops the SCLK clock.
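The exchange described above can be tried out in a simulator. The sketch below models the two shift registers and the two data lines in mode 3, and checks that the bytes have swapped places after eight SCLK cycles. It is a simulation-only illustration: the module name, the register names and the example values 0xAA and 0x55 are assumptions, not the project's code.

```verilog
`timescale 1ns / 1ps

// Sketch of the shift-register exchange in mode 3; all names are illustrative.
module spi_exchange_sketch;
    reg       sclk;
    reg [7:0] master_sr;   // master shift register
    reg [7:0] slave_sr;    // slave shift register
    reg       mosi, miso;  // the two data lines
    integer   i;

    // falling edge: both devices drive their most significant bit (b7)
    always @(negedge sclk) begin
        mosi <= master_sr[7];
        miso <= slave_sr[7];
    end

    // rising edge: both devices shift left and clock the incoming bit into b0
    always @(posedge sclk) begin
        master_sr <= {master_sr[6:0], miso};
        slave_sr  <= {slave_sr[6:0],  mosi};
    end

    initial begin
        sclk = 1'b1;            // mode 3: the clock idles high
        #10;
        master_sr = 8'hAA;      // example values, loaded while the clock is idle
        slave_sr  = 8'h55;
        for (i = 0; i < 8; i = i + 1) begin
            #50 sclk = 1'b0;    // falling edge
            #50 sclk = 1'b1;    // rising edge
        end
        #10;
        if (master_sr === 8'h55 && slave_sr === 8'hAA)
            $display("bytes exchanged after 8 SCLK cycles");
        else
            $display("unexpected: master=%h slave=%h", master_sr, slave_sr);
        $finish;
    end
endmodule
```

The registers are loaded only after the clock has settled at its idle level, so that the initial transition of sclk from unknown to high cannot disturb them.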

### Slave select

For a more complete picture, we need to include the effect of the slave select (SS*) signal that is used to address the slave devices.

Slaves may only drive their output (MISO) line when SS* is active; otherwise they should tri-state the output. The protocol can be broken down into the following steps:

1. The master initiates the communication by activating SS*.
• The slave responds by starting to drive its MISO output.
• Meanwhile the master drives its MOSI output.
2. The master makes SCLK low.
• On this falling edge, the master and slave drive their most significant bit position (b7) on their MOSI and MISO outputs, respectively.
3. The master makes SCLK high.
• On this rising edge, the master and slave clock the input from their MISO and MOSI inputs, respectively, into the least significant bit position (b0).
4. Go back to step 2 until the least significant bit position (b0) has been sent.
5. When all bits are transmitted, the master deactivates SS*.
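The five steps above translate almost line by line into a behavioral simulation model. The sketch below is not the FPGA implementation described on the following pages and is not meant for synthesis; the module spi_slave_sketch, its ports tx_byte and rx_byte, and the test values are assumptions made for this illustration.

```verilog
`timescale 1ns / 1ps

// hypothetical behavioral model of a mode 3 slave (simulation only)
module spi_slave_sketch (
    input  wire       ss_n,     // SS*, active low
    input  wire       sclk,
    input  wire       mosi,
    output wire       miso,
    input  wire [7:0] tx_byte,  // byte to send to the master
    output reg  [7:0] rx_byte   // byte received from the master
);
    reg [7:0] sr;       // shift register
    reg       out_bit;  // bit currently driven on MISO

    assign miso = ss_n ? 1'bz : out_bit;   // tri-state unless selected

    always begin
        @(negedge ss_n);                          // step 1: selected
        sr = tx_byte;
        repeat (8) begin
            @(negedge sclk) out_bit = sr[7];      // step 2: drive b7
            @(posedge sclk) sr = {sr[6:0], mosi}; // step 3: clock into b0
        end                                       // step 4: repeat for 8 bits
        @(posedge ss_n);                          // step 5: deselected
        rx_byte = sr;
    end
endmodule

// hypothetical test bench acting as the master
module spi_steps_tb;
    reg        ss_n, sclk, mosi;
    reg  [7:0] master_sr;
    wire       miso;
    wire [7:0] slave_rx;
    integer    i;

    spi_slave_sketch slave ( .ss_n(ss_n), .sclk(sclk), .mosi(mosi), .miso(miso),
                             .tx_byte(8'h55), .rx_byte(slave_rx) );

    initial begin
        ss_n = 1'b1; sclk = 1'b1; mosi = 1'b0;    // idle; mode 3 clock is high
        master_sr = 8'hAA;
        #100 ss_n = 1'b0;                          // step 1
        for (i = 0; i < 8; i = i + 1) begin
            #50 sclk = 1'b0;                       // step 2
            mosi = master_sr[7];
            #50 sclk = 1'b1;                       // step 3
            master_sr = {master_sr[6:0], miso};
        end
        #50 ss_n = 1'b1;                           // step 5
        #10;
        if (master_sr === 8'h55 && slave_rx === 8'hAA) $display("PASS");
        else $display("FAIL master=%h slave=%h", master_sr, slave_rx);
        $finish;
    end
endmodule
```

Writing the slave as a single sequential process keeps the model readable next to the numbered steps; a synthesizable slave would use edge-triggered registers instead.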

Following this definition of the SPI byte protocol, the following pages describe an implementation of this protocol, where an Arduino is the master and a FPGA is the slave.

## Byte protocol on Arduino

Implements the SPI byte protocol on Arduino to exchange bytes with a FPGA. Written in C for Intel Arduino 101. This page continues the protocol description of the Math Talk series. This short section shows an Arduino implementation of a SPI master. The Arduino can be either an Arduino 101 or an Arduino UNO R3 that has been modified to run on 3.3V as described in the hardware section.

## Arduino as Master

The Arduino is blessed with a support library for the serial peripheral interface. This greatly aids the implementation. The code shown below was tested with both types of Arduino. For the slave we used an Altera or Xilinx based FPGA implementation as described on the next page. Refer to the first part of this article for details about the physical connection. In particular, once more, please read the part about 3.3V versus 5V when using an Arduino UNO R3.

### Sources

The SPI library makes it very straightforward to implement a SPI master. You can clone this project from:

The Arduino sends an alternating pattern of 0xAA and 0x55 to the FPGA. On the FPGA, LED[0] will be on when it receives 0xAA. Consequentially it will blink with 10% duty cycle. The FPGA always returns 0x55, what is displayed on the serial port.

## Traces

Logic analyzer traces of the Arduino communicating with the FPGA. We see the SS going active, and the FPGA sending a byte over MISO to the Arduino. The data is sampled on the rising edge of SCLK.

Zoomed out, we see the Arduino receiving three bytes from the FPGA.

Following this “SPI byte protocol on Arduino”, up next is: Byte Exchange with a FPGA as Slave.

## Byte protocol

Specifies the SPI byte protocol. We use this to exchange bytes between the Arduino microcontroller and a FPGA. This is the third part of Math Talk. In this part we describe the protocol used to transfer bytes between the microcontroller and FPGA.

## Bytes Exchange Protocol

With the two devices physically connected, we need a protocol to transfer data. We chose the Serial Peripheral Interface (SPI), a lightweight protocol to connect one master to one or more slaves.

### Master/slave

The SPI bus is controlled by a master device (typically a microcontroller) that orchestrates the bus access. The master generates the control signals and regulates the data flow. The illustration below shows a master with three slaves. The pinout for SCLK, MOSI, MISO and SS can be found on the previous page. The master uses the Slave Select (SS) signal to select the slave.

### Parameters

SPI is also a protocol with many degrees of freedom. It is important that the master and slave agree on the voltage levels and maximum clock frequency. The SPI clock polarity (CPOL) and clock phase (CPHA) introduce four more degrees of freedom as shown in the table below.

SPI parameters
Mode CPOL CPHA clock idle data driven data latched
0 0 0 low falling edge rising edge
1 0 1 low rising edge falling edge
2 1 0 high rising edge falling edge
3 1 1 high falling edge rising edge

For this article we assume mode 3, where the clock is high when idle; data is driving following the falling edge of the clock and latched on the rising edge.

### Operation

The protocol is easiest explained with shift registers as shown in the illustration below. The master generates the SPI Clock (SCLK) to initiate the information exchange. Data is shifted on one edge of this clock and is sampled on the opposite edge when the data is stable.

In mode 3, at the falling edge of SCLK, both devices drive their most significant bit (b7) on their outgoing data line. On the rising edge, both devices clock in this bit into the least significant bit position (b0). After eight SCLK cycles, the master and slave have exchanged their values and each device processes the data received (e.g. writing it to memory). In case there is more data to be exchanged, the registers are loaded with new data and the process repeats itself. Once all data is transmitted, the master stops the SCLK clock.

### Slave select

For a more complete picture, we need to include the effect of the slave select (SS*) signal that is used to address the slave devices.

Slaves may only drive their output (MISO) line when SS* is active, otherwise they should tri-stated the output. The protocol can be broken down into the following steps:

1. The master initiates the communication by activating SS*.
• The slave responds by starting to drive its MISO output.
• Meanwhile the master drives its MOSI output.
2. The master makes SCLK low.
• On this falling edge, the master and slave drive their most significant bit position (b7) on respectively their MOSI and MISO outputs.
3. The master makes SCLK high.
• On this rising edge, the master and slave clock the input from their respectively MISO and MOSI inputs into the least significant bit position (b0).
4. Go back to step 2. until the least significant bit position (b0) has been sent.
5. When all bits are transmitted, the master deactivates SS*.

Following this definition of the SPI byte protocol, the next pages describe an implementation in which an Arduino is the master and an FPGA is the slave.

## Hardware

Implementation of the hardware SPI connection between an Intel Arduino 101 and an FPGA, including pinouts and schematic. This is the second part of Math Talk. We describe the hardware components and the physical interconnect used to communicate with the FPGA.

## SPI connection between Arduino and FPGA

SPI is a protocol, in which one device (the master) controls one or more other devices (the slaves). For the master we use an open-source microcontroller prototyping platform, such as the Arduino 101 or a modified Arduino UNO R3. In this document we use Arduino to refer to either platform.

The slave can be a low-cost FPGA prototyping platform, such as the Xilinx Spartan-6 Avnet LX9 or the Altera Cyclone-IV Terasic DE0-Nano.

The repository includes project files and pin assignments for both these boards. The code is written in Verilog HDL and should work equally well on more powerful boards.

### Voltage levels

It is very important that the I/O voltage levels of the devices match. Both FPGA boards support 3.3V levels, and are a good match for the Arduino 101. However, the Arduino UNO uses the traditional 5 Volt TTL levels. Instead of using a level shifter, such as the 74LVC245, we opt for converting the Arduino to 3.3V according to Adafruit’s instructions. Running a 16 MHz clock at 3.3V is out of spec. It is said to work, but you should really program the fuses to bring the frequency down to about 13 MHz.

### Signals

The SPI interface is a 4-wire interface. The bus consists of three shared signals plus a slave select signal for each device.

• SCLK, clock signal sent from the master to all slaves;
• MOSI, serial data from the master to the slaves (Master Out-Slave In);
• MISO, serial data from a slave to the master (Master In-Slave Out);
• SSn, slave select signal for each slave.

Once the Arduino runs at 3.3V, connecting the two devices becomes trivial.

#### Connect

List of physical connections

Connections

| Signal | Arduino | Xilinx FPGA | Altera FPGA |
|--------|---------|-------------|-------------|
| SS | Digital I/O 10 | PMOD J4 pin1 | GPIO0 J1 pin4 |
| MOSI | Digital I/O 11 | PMOD J4 pin2 | GPIO0 J1 pin6 |
| MISO | Digital I/O 12 | PMOD J4 pin3 | GPIO0 J1 pin8 |
| SCK | Digital I/O 13 | PMOD J4 pin4 | GPIO0 J1 pin10 |
| GND | GND | PMOD J4 pin5 | GPIO0 J1 pin12 |

Schematic

Following this “SPI connection between Arduino and FPGA”, the next page describes how to exchange bytes over this physical interface.

## Introduction

This series “Connecting Arduino to FPGA” describes how the Arduino can access custom registers on an FPGA. It covers the hardware schematic and the protocol implementation to transfer messages. Written in Verilog HDL and C.

Building Math Circuits implemented a math compute device on a Field Programmable Gate Array (FPGA). This sequel describes a protocol and its implementation that enables an FPGA and a microcontroller to communicate with each other. The idea is to generate operands on a microcontroller; the Math Hardware then performs the operations and returns the results. The communication between the devices is the focus of this article.

## Connecting Arduino to FPGA

The protocol that we will use is called Serial Peripheral Interface (SPI). It is a synchronous full-duplex serial interface [1], and is commonly used to communicate with on-board peripherals such as EEPROM, FLASH memory, A/D converters, temperature sensors, or in our case a Field Programmable Gate Array (FPGA).

We assume a working knowledge of the Verilog hardware description language. To learn more about Verilog, refer to a book such as “FPGA Prototyping with Verilog Examples” by Chu, do the free online class at verilog.com, or read through the slides Intro to Verilog from MIT. Instructions on installing the toolchain for the Verilog IDEs can be found at Getting Started with FPGA programming on Altera or on Xilinx.

### Contents

The series “Connecting Arduino to FPGA” starts with describing the physical connections. We then look into exchanging bytes between a microcontroller and an FPGA. The last part implements a layer that allows message passing to custom registers.

Continue at:

## Square root circuit

Implements a math square root using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGAs.

## Square root using logic gates

The square root method implemented here is a simplification of the improvements by Samavi and Sutikno to the non-restoring digit recurrence square root algorithm. For details about this method, refer to Chapter 7 of the inquiry “How do Computers do Math?”.

### Simplified Samavi Square Root

The square root of an integer can be calculated using a circuit of Controlled Subtract-Multiplex (csm) blocks. The blocks were introduced as part of the divider implementation. The square root circuit for an 8-bit value is shown below.

Each row performs one “attempt subtraction” cycle. For a start, the top row attempts to subtract binary 01. If the answer would be negative, the most significant bit of the output will be ‘1’. This msb drives the Output Select ($$os$$) inputs, which effectively cancel the subtraction if the result would be negative.
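The row-by-row behavior can be sketched in C. Note the assumptions: this is the generic restoring (attempt-subtraction) digit recurrence, not the article's Verilog circuit, and the function name `isqrt8` is hypothetical. Each loop iteration plays the role of one row: it brings down the next two radicand bits, attempts to subtract the current root with binary 01 appended, and cancels the subtraction if the result would be negative.

```c
#include <stdint.h>

/* Integer square root of an 8-bit radicand, one "attempt subtraction"
 * per row. Optionally returns the remainder x - root*root. */
uint8_t isqrt8(uint8_t x, uint8_t *remainder)
{
    uint16_t rem = 0;  /* partial remainder */
    uint8_t root = 0;  /* root bits found so far */

    for (int row = 3; row >= 0; row--) {
        /* bring down the next two bits of the radicand */
        rem = (uint16_t)((rem << 2) | ((x >> (2 * row)) & 3u));
        /* value to subtract: root so far with binary 01 appended */
        uint16_t trial = (uint16_t)((root << 2) | 1u);

        root = (uint8_t)(root << 1);
        if (rem >= trial) {                 /* result is not negative */
            rem = (uint16_t)(rem - trial);  /* keep the difference */
            root |= 1u;                     /* this root bit is 1 */
        }                                   /* else: subtraction cancelled */
    }
    if (remainder)
        *remainder = (uint8_t)rem;
    return root;
}
```

In the first iteration the root is still zero, so the value subtracted is binary 01, matching the top row of the circuit.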

Similar to the divider, using Verilog HDL we can generate instances of csm blocks based on the word length of the radicand (xWIDTH). Once more, to describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference $$d$$ and borrow-out $$b$$. E.g. $$d_{13}$$ identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names $$d$$ and $$b$$ and do the same for the root itself as shown in the table below.

2BD

Based on this table, we can now express the interconnects in Verilog HDL using ?: expressions.

generate genvar ii, jj;
: 2BD
endgenerate

The complete Verilog HDL source code along with the test bench and constraints is available at:

#### Results

As usual, the propagation delay $$t_{pd}$$ depends on the size $$N$$ and the value of the operands. For a given size $$N$$, the maximum propagation delay occurs when each subtraction needs to be cancelled.

Post-map Timing Analysis reveals the worst-case propagation delays. The values in the table below assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

## Conclusion

Judging from the number of research papers popping up each year, we can deduce that this is still an active field.

Future work could add elementary functions such as sin, cos, tan, exponential, and logarithm.

## Divider circuit

Implements a math divider using a circuit of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGAs.

## Divider using logic gates

The attempt-subtraction divider was introduced in the inquiry “How do Computers do Math?”. This most basic divider consists of interconnected Controlled Subtract-Multiplex (csm) blocks. Each block contains a 1-bit Full Subtractor (fs) with the usual inputs a, b and bi and outputs d and bo. The output select signal, os, selects between input x and the difference x-y.
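A single csm block can be modelled in C as a minimal sketch. The names `csm_out_t` and `csm_block` are hypothetical, and the polarity of os (1 passes x through, 0 selects the difference) is an assumption inferred from the way the divider drives os with the borrow of the most significant bit.

```c
#include <stdbool.h>

/* Outputs of one Controlled Subtract-Multiplex block (hypothetical type). */
typedef struct {
    bool d;   /* selected output: x or the difference */
    bool bo;  /* borrow-out of the full subtractor */
} csm_out_t;

/* 1-bit full subtractor (x - y - bi) followed by a 2:1 multiplexer. */
csm_out_t csm_block(bool x, bool y, bool bi, bool os)
{
    bool diff = x ^ y ^ bi;                       /* difference bit */
    bool bo = (!x && (y || bi)) || (y && bi);     /* borrow-out */
    csm_out_t out = { os ? x : diff, bo };        /* os=1 cancels subtraction */
    return out;
}
```

For example, with x=0, y=1 and no borrow-in the block raises its borrow-out; with os=0 it outputs the difference bit 1, with os=1 it passes x=0 through unchanged.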

### Attempt-subtraction divider

A complete 8:4-bit divider can therefore be implemented by a matrix of csm modules connected on rows and columns as shown in the figure below. Each row performs one “attempt subtraction” cycle. Note that the most significant bit is used to drive the Output Select os inputs. (For more details see “Combinational arithmetic”.)

Similar to the multipliers, using Verilog HDL we can generate instances of csm blocks based on the word length of the dividend (xWIDTH) and divisor (yWIDTH). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals difference ($$d$$) and borrow-out ($$b$$). E.g. $$d_{13}$$ identifies the difference signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names ($$d$$ and $$b$$) and do the same for the quotient itself as shown in the table below.

Based on this table, we can now express the interconnects in Verilog HDL using ?: expressions.

    generate genvar ii, jj;
        for ( ii = 0; ii < xWIDTH; ii = ii + 1) begin: gen_ii
            for ( jj = 0; jj < yWIDTH + 1; jj = jj + 1) begin: gen_jj
                math_divider_csm_block csm(
                    .a  ( jj < 1 ? x[xWIDTH-1-ii] : ii > 0 ? d[ii-1][jj-1] : 1'b0 ),
                    .b  ( jj < yWIDTH ? y[jj] : 1'b0 ),
                    .bi ( jj > 0 ? b[ii][jj-1] : 1'b0 ),
                    .os ( b[ii][yWIDTH] ),
                    .d  ( d[ii][jj] ),
                    .bo ( b[ii][jj] ) );
            end
        end
        for ( ii = 0; ii < xWIDTH; ii = ii + 1) begin: gen_p
            assign q[xWIDTH-1-ii] = ~b[ii][yWIDTH];
        end
        for ( jj = 0; jj <= yWIDTH; jj = jj + 1) begin: gen_r
            assign r[jj] = d[xWIDTH-1][jj];
        end
    endgenerate
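To check the interconnect rules without a simulator, the matrix can be mirrored bit by bit in C. This is a behavioral sketch under assumptions: the widths are fixed at 8 and 4 bits, the names `divide_csm` and `div_result_t` are hypothetical, and os=1 is taken to pass the a input through. Each row is evaluated in two passes because os depends on the borrow-out of the last column.

```c
#include <stdbool.h>
#include <stdint.h>

#define X_WIDTH 8  /* dividend width (xWIDTH) */
#define Y_WIDTH 4  /* divisor width (yWIDTH); only the low 4 bits of y are used */

typedef struct {
    uint8_t quotient;
    uint8_t remainder;
} div_result_t;

/* Bit-level model of the csm matrix; the divisor y should be non-zero. */
div_result_t divide_csm(uint8_t x, uint8_t y)
{
    bool d[X_WIDTH][Y_WIDTH + 1];  /* difference outputs */
    bool b[X_WIDTH][Y_WIDTH + 1];  /* borrow outputs */
    div_result_t result = { 0, 0 };

    for (int ii = 0; ii < X_WIDTH; ii++) {
        bool a_in[Y_WIDTH + 1], diff[Y_WIDTH + 1];

        /* pass 1: the full subtractors ripple the borrow through the row */
        for (int jj = 0; jj <= Y_WIDTH; jj++) {
            bool a  = (jj < 1) ? ((x >> (X_WIDTH - 1 - ii)) & 1)
                    : (ii > 0) ? d[ii - 1][jj - 1] : false;
            bool bb = (jj < Y_WIDTH) ? ((y >> jj) & 1) : false;
            bool bi = (jj > 0) ? b[ii][jj - 1] : false;
            a_in[jj]  = a;
            diff[jj]  = a ^ bb ^ bi;
            b[ii][jj] = (!a && (bb || bi)) || (bb && bi);
        }

        /* pass 2: the last borrow cancels the subtraction if negative */
        bool os = b[ii][Y_WIDTH];
        for (int jj = 0; jj <= Y_WIDTH; jj++)
            d[ii][jj] = os ? a_in[jj] : diff[jj];

        if (!os)  /* q[xWIDTH-1-ii] = ~b[ii][yWIDTH] */
            result.quotient |= (uint8_t)(1u << (X_WIDTH - 1 - ii));
    }

    for (int jj = 0; jj <= Y_WIDTH; jj++)  /* r[jj] = d[xWIDTH-1][jj] */
        if (d[X_WIDTH - 1][jj])
            result.remainder |= (uint8_t)(1u << jj);
    return result;
}
```

Dividing 200 by 7 with this model gives quotient 28 and remainder 4.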

The complete Verilog HDL source code along with the test bench and constraints is available at:

### Results

As usual, the propagation delay $$t_{pd}$$ depends on the size $$N$$ and the value of the operands. For a given size $$N$$, the maximum propagation delay occurs when each subtraction needs to be cancelled.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The values in the table below assume that the size of both operands is the same. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

Continuing from “Divider using logic gates”, the next chapter shows an implementation of the square root algorithm introduced in Chapter 7 of the inquiry “How do Computers do Math?”.

## A faster multiplier circuit

Implements a carry-save array multiplier using a circuit of logic gates. Written in Verilog HDL for Altera and Xilinx FPGAs.

## Carry-save array multiplier

The propagating carry limits the performance of the previous algorithm. Here we investigate methods of implementing binary multiplication with a smaller latency. Low latency demands an efficient algorithm and high-performance circuitry to limit propagation delays. Crucial to the performance of multipliers are high-speed adders. [Bewick]

Multiplication dominates the execution time of most Digital Signal Processing (DSP) algorithms, so the speed of the multiplier directly determines the execution time of these applications. Examples are convolution, Fast Fourier Transform (FFT), filtering and the ALUs of microprocessors.

### Carry-save Array Multiplier

An important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry-save adders (CSA). Even though the building block is still the multiplying adder (ma), the topology prevents a ripple carry by ensuring that, wherever possible, the carry-out signal propagates downward and not sideways.

The illustration below gives an example of this multiplication process.

Again, the building block is the multiplying adder (ma) as described on the previous page. However, the topology is such that the carry-out from one adder is not connected to the carry-in of the next adder, which prevents a ripple carry. The circuit diagram below shows the connections between these blocks.

The observant reader might notice that ma0x can be replaced with simple AND gates and that ma4x can be replaced by adders. Also, the block ma43 is not needed. More interestingly, the ripple adder in the last row can be replaced with the faster carry-lookahead adder.

Similar to the carry-propagate array multiplier, using Verilog HDL we can generate instances of ma blocks based on the word length of the multiplicand and multiplier (N). To describe the circuit in Verilog HDL, we need to derive the rules that govern the connections between the blocks.

Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum ($$s$$) and carry-out ($$c$$). E.g. $$c_{13}$$ identifies the carry-out signal for the block in row 1 and column 3. Next, we express the input signals as a function of the output signal names $$s$$ and $$c$$ and do the same for the product itself as shown in the table below.

Based on this table, we can now express the interconnects in Verilog HDL using ?: expressions.

    generate genvar ii, jj;
        for ( ii = 0; ii <= N; ii = ii + 1) begin: gen_ii
            for ( jj = 0; jj < N; jj = jj + 1) begin: gen_jj
                math_multiplier_ma_block ma(
                    .x  ( ii < N ? a[jj] : (jj > 0) ? c[N][jj-1] : 1'b0 ),
                    .y  ( ii < N ? b[ii] : 1'b1 ),
                    .si ( ii > 0 && jj < N - 1 ? s[ii-1][jj+1] : 1'b0 ),
                    .ci ( ii > 0 ? c[ii-1][jj] : 1'b0 ),
                    .so ( s[ii][jj] ),
                    .co ( c[ii][jj] ) );
                if ( ii == N ) assign p[N+jj] = s[N][jj];
            end
            assign p[ii] = s[ii][0];
        end
    endgenerate
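The same interconnect rules can be exercised with a bit-level model in C. This is a sketch under assumptions: the width is fixed at 4 bits, the names `multiply_carry_save` and `CSA_WIDTH` are hypothetical, and the ma block is taken to be a full adder whose b input is x AND y. Rows 0 to N-1 pass their carries downward; row N acts as the final ripple adder.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSA_WIDTH 4  /* operand width N; the product is 2N = 8 bits */

/* Bit-level model of the carry-save array; a and b must be below 16. */
uint8_t multiply_carry_save(uint8_t a, uint8_t b)
{
    bool s[CSA_WIDTH + 1][CSA_WIDTH];  /* sum outputs */
    bool c[CSA_WIDTH + 1][CSA_WIDTH];  /* carry outputs */
    uint8_t p = 0;

    for (int ii = 0; ii <= CSA_WIDTH; ii++) {
        for (int jj = 0; jj < CSA_WIDTH; jj++) {
            /* input rules, one line per port of the ma block */
            bool x  = (ii < CSA_WIDTH) ? ((a >> jj) & 1)
                    : (jj > 0) ? c[CSA_WIDTH][jj - 1] : false;
            bool y  = (ii < CSA_WIDTH) ? ((b >> ii) & 1) : true;
            bool si = (ii > 0 && jj < CSA_WIDTH - 1) ? s[ii - 1][jj + 1] : false;
            bool ci = (ii > 0) ? c[ii - 1][jj] : false;

            /* ma block: full adder on si, (x AND y) and ci */
            bool xy = x && y;
            s[ii][jj] = si ^ xy ^ ci;
            c[ii][jj] = (si && xy) || (si && ci) || (xy && ci);

            if (ii == CSA_WIDTH && s[ii][jj])  /* p[N+jj] = s[N][jj] */
                p |= (uint8_t)(1u << (CSA_WIDTH + jj));
        }
        if (s[ii][0])                          /* p[ii] = s[ii][0] */
            p |= (uint8_t)(1u << ii);
    }
    return p;
}
```

With this model, 13 times 11 yields 143 and 15 times 15 yields 225, the largest 8-bit product of two 4-bit operands.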

The complete Verilog HDL source code along with the test bench and constraints is available at:

#### Results

The propagation delay $$t_{pd}$$ depends on the size $$N$$ and the value of the operands. For a given size $$N$$, the maximum propagation delay occurs when the low-order bits cause a carry/sum that propagates to the highest-order bit. This worst-case propagation delay is linear with $$2N$$, which makes this carry-save multiplier about 33% faster than the ripple-carry multiplier. Note that the average propagation delay is about half of this.
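As a rough check of the 33% figure, assume that the worst-case delay is proportional to $$3N$$ for the carry-propagate array multiplier and to $$2N$$ for the carry-save version, with the same delay per block. This is an idealization that ignores routing delays, so measured values will differ.

$$\frac{t_{pd,\text{ripple}} - t_{pd,\text{save}}}{t_{pd,\text{ripple}}} = \frac{3N - 2N}{3N} = \frac{1}{3} \approx 33\%$$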

The post-map Timing Analysis tool shows the worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

### Other multipliers

The Wallace Multiplier decreases the latency by reorganizing the additions. Wikipedia has a good description of the algorithm. Due to its irregular structure, it can be difficult to route on an FPGA. As a consequence, additional wire delays may cause it to perform slower than carry-save array multipliers.

The Wallace Multiplier can be combined with Booth Coding. The Booth Multiplier uses, say, 2 bits of the multiplier in generating each partial product, thereby using only half the number of rows. Combining Booth Multipliers with more advanced VLSI techniques, such as a 0.6 μm BiCMOS process using emitter-coupled logic, makes 53×53 multipliers possible with a latency of less than 2.6 nanoseconds [ref].

Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by reordering the values to be multiplied. It is the standard technique used in chip design.

Vedic arithmetic is the ancient system of Indian mathematics, which has a unique calculation technique based on 16 Sutras (formulae).

Another algorithm is Karatsuba.

Following this “Carry-save array multiplier using logic gates”, the next chapter shows an implementation of the divider introduced in Chapter 7 of the inquiry “How do Computers do Math?”.

## Multiplier circuit

Implements a math multiplier using circuits of logic gates. Written in parameterized Verilog HDL for Altera and Xilinx FPGAs.

## Multiplier using logic gates

We introduced the carry-propagate array multiplier in the inquiry “How do Computers do Math?”.

This multiplier is built around Multiplier Adder (ma) blocks. These ma blocks are themselves built around the Full Adder (fa) blocks introduced in the adder section. These fa blocks have the usual inputs $$a$$, $$b$$ and $$c_i$$ and outputs $$s$$ and $$c_o$$. The special thing is that the internal signal $$b$$ is an AND function of the inputs $$x$$ and $$y$$ as depicted below.
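As a minimal sketch, one ma block can be written in C as a full adder whose b input is the AND of x and y. The names `ma_out_t` and `ma_block` are hypothetical, and mapping the full adder's a input to the sum-in port si is an assumption based on the port names used later on this page.

```c
#include <stdbool.h>

/* Outputs of one Multiplier Adder block (hypothetical type). */
typedef struct {
    bool so;  /* sum-out */
    bool co;  /* carry-out */
} ma_out_t;

/* Full adder computing si + (x AND y) + ci. */
ma_out_t ma_block(bool x, bool y, bool si, bool ci)
{
    bool b = x && y;  /* partial product bit */
    ma_out_t out;
    out.so = si ^ b ^ ci;
    out.co = (si && b) || (si && ci) || (b && ci);
    return out;
}
```

With all four inputs high the block adds three ones, so both the sum-out and the carry-out are set.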

### Carry-propagate Array Multiplier

As shown in the inquiry “How do Computers do Math?”, a carry-propagate array multiplier can be built by combining many of these ma blocks. The circuit diagram below shows the connections between these blocks for a 4-bit multiplier.

For an implementation in Verilog HDL, we can instantiate ma blocks based on the word length of the multiplicand and multiplier ($$N$$). If you are new to Verilog HDL, remember that the generate code segment expands at compile time. In other words, it is just shorthand for writing out the long list of ma block instances.

    generate genvar ii, jj;
        for ( ii = 0; ii < N; ii = ii + 1) begin: gen_ii
            for ( jj = 0; jj < N; jj = jj + 1) begin: gen_jj
                math_multiplier_ma_block ma(
                    .x(?), .y(?), .si(?), .ci(?),
                    .so(?), .co(?) );
            end
        end
    endgenerate

As you might notice, the input and output ports are not described. For this, we need to derive the rules that govern these interconnects. Start by numbering the output ports based on their location in the matrix. For this circuit, we have the output signals sum ($$s$$) and carry-out ($$c$$). E.g. $$c_{13}$$ identifies the carry-out signal for the block in row 1 and column 3. Note that the circuit description depicts the matrix in a slanted fashion.

![Output signals 'so' and 'co'](https://coertvonk.com/wp-content/uploads/math-multiplier-ripple-carry-tbl-ma-output.png)

Knowing this, we can enter the output signals in the Verilog HDL code:

    math_multiplier_ma_block ma(
        .x(?), .y(?), .si(?), .ci(?),
        .so ( s[ii][jj] ),
        .co ( c[ii][jj] ) );

Next, we express the input signals as a function of the output signal names $$s$$ and $$c$$ as shown in the table below.

Based on this table, we can express the input assignments for each ma using "c ? a : b" expressions. Note that Verilog 2001 does not allow such expressions on the output ports. This is why we expressed the input ports as a function of the output ports instead of vice versa.

    math_multiplier_ma_block ma(
        .x  ( a[jj] ),
        .y  ( b[ii] ),
        .si ( ii == 0 ? 1'b0 : jj < N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ),
        .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ),
        .so ( s[ii][jj] ),
        .co ( c[ii][jj] ) );

All that is left to do is to express the outputs of the module as a function of these sum and carry signals.

Putting it all together, we get the following snippet:

    generate genvar ii, jj;
        for (ii = 0; ii < N; ii = ii + 1) begin: gen_ii
            for (jj = 0; jj < N; jj = jj + 1) begin: gen_jj
                math_multiplier_ma_block ma(
                    .x  ( a[jj] ),
                    .y  ( b[ii] ),
                    .si ( ii == 0 ? 1'b0 : jj < N - 1 ? s[ii-1][jj+1] : c[ii-1][N-1] ),
                    .ci ( jj > 0 ? c[ii][jj-1] : 1'b0 ),
                    .so ( s[ii][jj] ),
                    .co ( c[ii][jj] ) );
            end
            assign p[ii] = s[ii][0];
        end
        for (jj = 1; jj < N; jj = jj + 1) begin: gen_jj2
            assign p[jj+N-1] = s[N-1][jj];
        end
        assign p[N*2-1] = c[N-1][N-1];
    endgenerate
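The unrolled loops can be checked with a bit-level model in C. This is a sketch under assumptions: the width is fixed at 4 bits, the names `multiply_carry_propagate` and `MUL_WIDTH` are hypothetical, and each ma block is modelled as a full adder on si, x AND y, and ci. The expressions for si and ci follow the ?: expressions of the snippet one to one.

```c
#include <stdbool.h>
#include <stdint.h>

#define MUL_WIDTH 4  /* operand width N; the product is 2N = 8 bits */

/* Bit-level model of the carry-propagate array; a and b must be below 16. */
uint8_t multiply_carry_propagate(uint8_t a, uint8_t b)
{
    bool s[MUL_WIDTH][MUL_WIDTH];  /* sum outputs */
    bool c[MUL_WIDTH][MUL_WIDTH];  /* carry outputs */
    uint8_t p = 0;

    for (int ii = 0; ii < MUL_WIDTH; ii++) {
        for (int jj = 0; jj < MUL_WIDTH; jj++) {
            bool x  = (a >> jj) & 1;
            bool y  = (b >> ii) & 1;
            bool si = (ii == 0) ? false
                    : (jj < MUL_WIDTH - 1) ? s[ii - 1][jj + 1]
                                           : c[ii - 1][MUL_WIDTH - 1];
            bool ci = (jj > 0) ? c[ii][jj - 1] : false;  /* ripples sideways */

            bool xy = x && y;
            s[ii][jj] = si ^ xy ^ ci;
            c[ii][jj] = (si && xy) || (si && ci) || (xy && ci);
        }
        if (s[ii][0])                       /* p[ii] = s[ii][0] */
            p |= (uint8_t)(1u << ii);
    }
    for (int jj = 1; jj < MUL_WIDTH; jj++)  /* p[jj+N-1] = s[N-1][jj] */
        if (s[MUL_WIDTH - 1][jj])
            p |= (uint8_t)(1u << (jj + MUL_WIDTH - 1));
    if (c[MUL_WIDTH - 1][MUL_WIDTH - 1])    /* p[2N-1] = c[N-1][N-1] */
        p |= (uint8_t)(1u << (2 * MUL_WIDTH - 1));
    return p;
}
```

For instance, 3 times 3 gives 9 and 15 times 15 gives 225.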

The ma block compiles into the RTL netlist shown below.

As shown in the figure below, the for loops unroll into 16 interconnected ma blocks.

The complete Verilog HDL source code is available at:

### Results

The propagation delay $$t_{pd}$$ depends on the size $$N$$ and the value of the operands. For a given size $$N$$, the maximum propagation delay occurs when the low-order bits cause a carry/sum that propagates to the highest-order bit. This worst-case propagation delay is linear with $$3N$$. Note that the average propagation delay is about half of this.

The worst-case propagation delays for the Terasic Altera Cyclone IV DE0-Nano are found using the post-map Timing Analysis tool. The exact value depends on the model and speed grade of the FPGA, the silicon itself, voltage and the die temperature.

The timing analysis for $$N=27$$ reveals that the worst-case propagation delay path goes through $$c_o$$ and $$s_o$$ as shown below on the left. When measuring the worst-case propagation delay on the actual device, we use input values that cause the maximum number of ripple carries and sums to propagate. For a 27-bit multiplier where the inputs also have a maximum value of 99,999,999, the propagation path is simulated in a spreadsheet as shown below on the right.

A brute-force search on the FPGA for operand combinations that cause long propagation delays revealed 27'h2FA3A92 * 27'h55D4A77, 27'h60A308B * 27'd99999999 (50 ns), 27'h775A668 * 27'd89999999 (55 ns) and 27'h56F5D8F * 27'h3AAAB7B (55 ns).

Following this "Math multiplier using logic gates", the next chapter explores methods of making the multiplication operation faster.