This release focuses on improving the capacity and capabilities of the Nexus hardware, in preparation for a complete rewrite of nxcompile. The changes in this release all result from attempting to scale the v0.2 hardware beyond a 36-node mesh with more than 8 inputs and outputs per node. When the v0.2 design was scaled up, complexity exploded and each node more than trebled in size, completely exhausting the resources of the FPGA.
Using a Xilinx XC7A200T FPGA, timing has been achieved at 200 MHz for the mesh using a 10x10 configuration with 32 inputs, 32 outputs, and 16 working registers per node. Over a PCIe link to a host system, Nexus has been observed to run at 10 million cycles per second with output messages disabled and around 2.4 million cycles per second with output messages enabled (a data rate of approximately 1 Gb/s). Both figures were measured with the minimum amount of simulation work happening per cycle, so they are 'ideal' figures. This is far above the performance achieved in release v0.2.
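To put the quoted figures in proportion, the short calculation below derives the number of mesh clock cycles available per simulated cycle and the approximate output volume per simulated cycle. It is only a back-of-envelope sketch: it assumes "1 Gb/s" means 10^9 bits per second, and it does not say whether the mesh or the PCIe link is the limiting factor, which the figures above do not state. All variable names are illustrative.

```python
# Figures quoted in the release notes.
MESH_CLOCK_HZ = 200e6       # mesh timing achieved on the XC7A200T
RATE_NO_OUTPUT = 10e6       # simulated cycles/s, output messages disabled
RATE_WITH_OUTPUT = 2.4e6    # simulated cycles/s, output messages enabled
OUTPUT_BITS_PER_S = 1e9     # assumption: "1 Gb/s" taken as 10**9 bits/s

# Mesh clock cycles that elapse per simulated cycle, as seen from the host.
clocks_per_cycle_no_output = MESH_CLOCK_HZ / RATE_NO_OUTPUT
clocks_per_cycle_with_output = MESH_CLOCK_HZ / RATE_WITH_OUTPUT

# Approximate output traffic generated per simulated cycle.
bits_per_cycle = OUTPUT_BITS_PER_S / RATE_WITH_OUTPUT

print(f"clocks per simulated cycle (no output):   {clocks_per_cycle_no_output:.1f}")
print(f"clocks per simulated cycle (with output): {clocks_per_cycle_with_output:.1f}")
print(f"output bits per simulated cycle:          {bits_per_cycle:.0f}")
```

Under these assumptions, roughly 20 mesh clock cycles elapse per simulated cycle with output disabled and roughly 83 with output enabled, and each simulated cycle produces a little over 400 bits of output traffic.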
The headline changes in v0.3 are:
- nxmodel previously used SimPy, but is now written in C++. This change yielded a huge speed increase (simulations now run at tens of kilohertz), and it still integrates well with the cocotb verification environment using the awesome pybind11 framework.
- Constants, enumerations, structs, and unions are now defined using Packtype, which allows the definition to be mastered in Python and then used from generated code in the RTL, Python-based verification environment, and the C++ model and driver.
- The implementation of the node RTL has changed dramatically:
  - Output message start and end positions are now held in a lookup table in the node's RAM, which resulted in a substantial reduction in both flop and LUT usage.
  - Loopback of output messages has been replaced by a mask, which directly drives inputs from output values of the same node in matching positions.
  - The decoder now directly loads the node's RAM for all entries, rather than output mappings being fed through the controller.
  - The controller now supports output value trace generation, which is off by default and can be selectively enabled.
  - If configured with support, node inputs can be fed directly from external sources; this is part of the support for simulated memory access.
  - The logical core now evaluates three-input truth table operations rather than fixed functions such as AND, OR, etc.
- Message routing now prioritises horizontal dispatch (across columns) over vertical dispatch (across rows); this results in a better balance of traffic across the mesh and helps to support the column aggregators.
- New aggregator components now sit at the bottom of each column of nodes, collecting all signal messages and exposing a wide output bus. All other message types are passed through the aggregators and forwarded to the host.
- The top-level controller has substantially changed:
  - Host-facing interfaces to the controller have been widened to 128-bit, and encoding has been changed to achieve better transfer efficiency.
  - The host-facing interfaces to the mesh have been removed, and instead specific controller request and response types have been introduced which allow messages to be forwarded into and out of the mesh.
  - The controller is now responsible for generating a summary of the wide output bus from the mesh and sending it to the host. Sections of the summary can be suppressed if not required, to reduce the traffic.
  - Multiple on-device memories exist in the controller which can be accessed by the mesh once per simulated cycle. The host can also read and write the contents of these memories at any point during the simulation.
- nxlink has been rewritten to support the new controller interface, and the protobuf based gRPC framework has been dropped in favour of integrating with tools as a library rather than via socket connections.
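As an illustration of the move from fixed functions to three-input truth tables in the logical core, the sketch below evaluates an 8-entry table in Python. It is not taken from the Nexus RTL or from nxmodel: the function name, the constant names, and the bit ordering (input `a` as the most significant index bit) are assumptions made for this example only.

```python
def eval_truth_table(table: int, a: int, b: int, c: int) -> int:
    """Return the output bit of an 8-entry truth table for inputs a, b, c.

    Assumed encoding (hypothetical): bit i of `table` holds the output for
    the input combination whose index is (a << 2) | (b << 1) | c.
    """
    if not 0 <= table <= 0xFF:
        raise ValueError("a three-input truth table has exactly 8 entries")
    index = ((a & 1) << 2) | ((b & 1) << 1) | (c & 1)
    return (table >> index) & 1


# Fixed functions become table constants under the assumed encoding.
TABLE_AND_AB = 0xC0  # a AND b, input c ignored
TABLE_OR_AB = 0xFC   # a OR b, input c ignored
TABLE_XOR_AB = 0x3C  # a XOR b, input c ignored
TABLE_MUX = 0xE4     # a if c else b, a function that uses all three inputs

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c,
                      eval_truth_table(TABLE_AND_AB, a, b, c),
                      eval_truth_table(TABLE_MUX, a, b, c))
```

The point of the table form is that any function of three inputs, including ones like the multiplexer that a two-input gate cannot express, costs a single operation and is selected purely by an 8-bit constant.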
The next focus will be on improving the compiler to take full advantage of the new capabilities and capacity of the Nexus hardware platform.