Building Large Systems with Formic

Formic boards are designed as cost-efficient “bricks” to build large systems. Specifically:

  • Each board is small, just 10 cm x 10 cm
  • There are eight GTP links per board, which enable multiple 2D or 3D topologies to be built
  • Half the links operate in “Host” and the other half in “Device” SATA mode, so boards can be interconnected without the need for hard-to-find, more expensive crossover cables
  • All voltages are generated on board from a 12V unregulated input
  • Power connectors are duplicated on the left and right sides of the PCB, so the power supply can be daisy-chained from board to board
  • JTAG connectors are likewise duplicated and buffered on the left and right sides of the PCB, so board programming and debug can also be daisy-chained
  • Large, passive coolers keep audible noise to a minimum: instead of per-board fans, a few standard bulk fans cool multiple boards at once

We have successfully built a 64-board Formic system, organized as a 4x4x4 cube made of plexiglas. The boards are interconnected in a 3D-mesh topology, where each board uses two GTP links per dimension to connect to its neighbors.
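
To make this wiring concrete, the sketch below enumerates the mesh neighbors of a single board. The coordinate scheme and the printout are our own illustration, assuming one link per direction; they are not a Formic firmware interface.

    /* Sketch: enumerate the 3D-mesh neighbors of the board at (x, y, z)
     * in the 4x4x4 cube. An interior board has six neighbors, one per
     * link direction; boards on a face have fewer (mesh, not torus). */
    #include <stdio.h>

    #define DIM 4   /* boards per dimension */

    static void print_neighbors(int x, int y, int z) {
        int pos[3] = { x, y, z };
        const char axis[3] = { 'X', 'Y', 'Z' };

        for (int d = 0; d < 3; d++) {
            for (int dir = -1; dir <= 1; dir += 2) {
                int n[3] = { pos[0], pos[1], pos[2] };
                n[d] += dir;
                if (n[d] >= 0 && n[d] < DIM)   /* stay inside the cube */
                    printf("%c%c link -> board (%d,%d,%d)\n",
                           axis[d], dir < 0 ? '-' : '+', n[0], n[1], n[2]);
            }
        }
    }

    int main(void) {
        print_neighbors(1, 2, 2);   /* interior board: six neighbors */
        return 0;
    }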

A Scalable, Non-coherent Hardware Architecture

The first hardware architecture that uses the Formic boards has been developed by FORTH-ICS and is available for download. It is a realistic model of a non-coherent, manycore architecture that can scale to hundreds of cores. Each Formic FPGA contains eight Xilinx MicroBlaze CPUs, each with its own private L1 and L2 caches. A crossbar-based network-on-chip interconnects the eight CPUs and neighboring boards. Adding more boards scales the system to more total cores. Our 4x4x4 cube of 64 boards faithfully models a 512-core processor.
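
The core count thus grows with the cube of the mesh dimension. A quick back-of-the-envelope check, using only the figures quoted above:

    /* Total core count for an N x N x N Formic mesh, with eight
     * MicroBlaze CPUs per board (as described above). */
    #include <stdio.h>

    int main(void) {
        const int cores_per_board = 8;
        for (int n = 2; n <= 4; n++) {
            int boards = n * n * n;
            printf("%dx%dx%d mesh: %2d boards, %3d cores\n",
                   n, n, n, boards, boards * cores_per_board);
        }
        return 0;   /* prints: 4x4x4 mesh: 64 boards, 512 cores */
    }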

In a nutshell, our non-coherent hardware architecture features the following per board:

  • Eight Xilinx MicroBlaze 32-bit RISC CPUs. We use the 3-stage, area-optimized version.
  • Each CPU (shown as “MBS” in the figure, short for “MicroBlaze Slice”) has a private two-way 4-KB L1 instruction cache, a private two-way 8-KB L1 data cache and a private eight-way 256-KB unified L2 cache. The L1 caches are implemented in FPGA BRAMs; the L2 data are stored in the board SRAMs, while their tags are kept in BRAMs. Each CPU core is accompanied by its own DMA engine, network-on-chip interface, 4-KB mailbox, interrupt controller and various other architectural enhancements.
  • Eight GTP blocks that connect to the network-on-chip on the inside and to the board SATA connectors on the outside. They implement CRC checking of packets and credit-based flow control for board-to-board traffic.
  • A 22-port crossbar with combined input/output queues. The crossbar uses three virtual channels (VCs) and dimension-ordered routing for deadlock avoidance.
  • A board-wide virtual-to-physical translation table (TLB). Our architecture uses global virtual addresses, so translation happens just before the access to the 128-MB board DDR2 DRAM (see the sketch after this list).
  • A board controller that manages board peripherals, such as a UART controller, an I2C controller and the board LEDs.
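
As a rough illustration of the translation step mentioned in the TLB bullet above, the sketch below splits a global virtual address into a page number and an offset before indexing a lookup table. The 4-KB page size, the table size and the table layout are our own assumptions for illustration, not the actual Formic TLB format.

    /* Sketch: translate a global virtual address to a physical address
     * inside the board's 128-MB DDR2 DRAM. Page size and table layout
     * are assumed for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS   12                        /* 4-KB pages (assumed) */
    #define PAGE_MASK   ((1u << PAGE_BITS) - 1)
    #define TLB_ENTRIES 4096                      /* illustrative size */

    static uint32_t frame_of[TLB_ENTRIES];        /* virtual page -> DRAM frame */

    static uint32_t translate(uint32_t vaddr) {
        uint32_t vpage  = (vaddr >> PAGE_BITS) % TLB_ENTRIES;
        uint32_t offset = vaddr & PAGE_MASK;
        return (frame_of[vpage] << PAGE_BITS) | offset;
    }

    int main(void) {
        frame_of[3] = 42;                         /* map virtual page 3 to frame 42 */
        printf("0x%08x -> 0x%08x\n", (3u << PAGE_BITS) | 0x10,
               translate((3u << PAGE_BITS) | 0x10));
        return 0;
    }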

Our hardware prototype uses a large set of edge-aligned clocks to map high-bandwidth parts (such as the network-on-chip) at high clock rates with narrow datapaths. This technique dramatically increases mapping efficiency -- many more features fit in the FPGA -- but sacrifices the CPU clock rate. To keep the relative timing realistic, we clock the MicroBlaze CPUs at 10 MHz. This allows the network-on-chip to run at 160 MHz with 16-bit datapaths and still faithfully model the CPU-to-network bandwidth relationships of modern multicore chips.
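
In concrete numbers, using only the figures above and assuming, on the CPU side, a peak of one 32-bit access per cycle:

    /* The clock/datapath ratios described above. The per-CPU figure
     * assumes one 32-bit access per cycle, an illustrative peak. */
    #include <stdio.h>

    int main(void) {
        double noc_mb_s = 160.0 * 16.0 / 8.0;   /* 160 MHz x 16 bits = 320 MB/s */
        double cpu_mb_s =  10.0 * 32.0 / 8.0;   /*  10 MHz x 32 bits =  40 MB/s */

        printf("NoC link: %.0f MB/s, CPU peak: %.0f MB/s, ratio %.0f:1\n",
               noc_mb_s, cpu_mb_s, noc_mb_s / cpu_mb_s);
        return 0;
    }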

The 520-core Heterogeneous Platform

FORTH-ICS is a partner in the European Union FP7 ENCORE project. We have successfully implemented features of our non-coherent hardware architecture on the Xilinx Virtex-5 FPGA of the ARM Versatile Express platform.

We have connected two Versatile Express boxes to our 512-core Formic cube. Each box contains a quad-core ARM Cortex-A9 processor. The resulting system is a 520-core heterogeneous platform, with 8 “large” ARM cores and 512 “small” MicroBlaze cores. Each core has its own DMA engine and can initiate cache-to-cache transfers with any other core in the system.
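
As a hedged sketch of what initiating such a transfer might look like from bare-metal software: the descriptor fields, the doorbell address and the kick-off semantics below are entirely hypothetical and stand in for the actual Formic DMA engine interface.

    /* Hypothetical core-to-core DMA request. All names and the
     * memory-mapped doorbell address are illustrative only. */
    #include <stdint.h>

    struct dma_request {
        uint16_t src_core, dst_core;    /* global core IDs, 0..519 */
        uint32_t src_addr, dst_addr;    /* global virtual addresses */
        uint32_t num_bytes;
    };

    /* Assumed memory-mapped doorbell of the local DMA engine. */
    static volatile struct dma_request *const DMA_DOORBELL =
        (volatile struct dma_request *)0xF0000000u;

    static void dma_cache_to_cache(uint16_t src, uint16_t dst,
                                   uint32_t saddr, uint32_t daddr,
                                   uint32_t len)
    {
        DMA_DOORBELL->src_core  = src;
        DMA_DOORBELL->dst_core  = dst;
        DMA_DOORBELL->src_addr  = saddr;
        DMA_DOORBELL->dst_addr  = daddr;
        DMA_DOORBELL->num_bytes = len;  /* writing the length starts the transfer */
    }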

We are actively using our heterogeneous architecture for research in multicore hardware architecture, runtime systems and compilers. One prominent use case of the 520-core FPGA prototype is Myrmics, a scalable, bare-metal runtime system that implements a task-based parallel programming model. More information about our research can be found here.