Field-Programmable Gate Arrays (FPGAs) are integrated circuits that can
be programmed and reprogrammed to emulate a hardware design that exactly
matches the requirements of an application, and hence can potentially
deliver much better performance per watt than alternatives. This has led
to growing commercial interest, including from IBM, in using FPGAs for
general-purpose and scientific computing. To
ease the programming task for non-hardware-expert programmers, systems
are emerging that can map high-level languages such as C and OpenCL
to FPGAs, targeting compiler-generated circuits, overlay processing
engines, and combinations of the two.

In this talk we describe the synergy between compiler and hardware when
compiling highly threaded, parallel code for a novel wide-issue
overlay architecture with deeply pipelined datapaths. Our goal in this
work is to use the compiler to schedule instructions both within and
across threads, to maximize the utilization of the hardware and to
influence its design. We have developed a highly parameterized engine
comprising (i) deeply pipelined floating-point units of widely varying
latency (e.g., addition/subtraction, multiplication, division, and
exponentiation), (ii) an operand network, and (iii) operand storage,
where the network and storage are both configurable in the number of
ports and the level of sharing across them. Using our LLVM-based
compiler infrastructure and neuroscience simulation as an initial
application area, we explore the scheduling algorithms and operand
storage/networks that result in the highest utilization and performance.
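
To illustrate why scheduling across threads matters for deeply pipelined
units, the following is a minimal sketch of a toy model, not the
scheduler or engine described in this talk. It assumes one fully
pipelined unit per operation type that accepts one operation per cycle,
a simple greedy policy that favors lower-numbered threads, and threads
that each run the same dependent chain of operations. The latency values
and the names LATENCY and schedule are hypothetical and chosen only for
illustration.

```python
# Hypothetical latencies in cycles; the talk abstract gives no figures.
LATENCY = {"add": 4, "mul": 6, "div": 20, "exp": 30}


def schedule(num_threads, chain, latency=LATENCY):
    """Greedily schedule identical dependency chains from several threads.

    Each thread executes the operations in `chain` in order, and every
    operation depends on the result of the previous one in that thread.
    Each operation type has one fully pipelined unit that can accept one
    new operation per cycle.

    Returns (cycles, utilization), where `cycles` is the cycle at which
    the last result is available and `utilization` is the fraction of
    issue slots (cycles x number of units) that were actually used.
    """
    next_op = [0] * num_threads    # index of each thread's next operation
    ready_at = [0] * num_threads   # cycle when that operation's input is ready
    num_units = len(set(chain))
    total_ops = num_threads * len(chain)
    issued = 0
    cycle = 0
    finish = 0

    while issued < total_ops:
        busy = set()               # units that already issued this cycle
        for t in range(num_threads):
            i = next_op[t]
            if i == len(chain):
                continue           # this thread has issued everything
            op = chain[i]
            if ready_at[t] <= cycle and op not in busy:
                busy.add(op)
                next_op[t] += 1
                ready_at[t] = cycle + latency[op]
                finish = max(finish, ready_at[t])
                issued += 1
        cycle += 1

    return finish, issued / (finish * num_units)


if __name__ == "__main__":
    chain = ["mul", "add", "div", "exp"]
    for n in (1, 4, 16):
        cycles, util = schedule(n, chain)
        print(f"threads={n:2d} cycles={cycles:3d} utilization={util:.3f}")
```

In this model a single thread leaves every unit idle while it waits on
its own dependent results, so its run time is simply the sum of the
latencies in the chain. Interleaving operations from additional threads
fills those idle issue slots, which is the effect a cross-thread
scheduler aims to exploit.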