Last time I pointed out that although a number of applications run faster on FPGAs, not all of them do. This time I’ll be looking at which characteristics allow certain applications to run faster on FPGAs than CPUs.
To do this we first need to understand some of the principles of optimising designs on an FPGA. The standard design concept on an FPGA is the same as that used to design any other synchronous logic circuit: pipelining. To get the best performance from an FPGA, designers usually aim to make their design what’s called “fully pipelined”. A fully pipelined design has every component of the system contributing to the calculation on every clock cycle, which makes it highly efficient.
Pipelining
Pipelining’s approach is similar to a factory assembly line, with the processing broken down into sections. Each section of processing is connected together in series, with the output of one section connected to the input of the next. Data moves between each section of
processing synchronously at regular intervals. Although each section of processing is conducted serially on an individual datum, the processing elements are all running in parallel, processing new data as it enters the system. Have a look at the image below of a
fighter jet assembly line to see what I mean:
Raw materials, the data, enter the assembly line. Each stage of assembly may be seen as a stage of a calculation, with data being combined using operators. In this metaphor, the output of the assembly line, the finished planes, is the result of the calculation. The clock rate of a pipeline refers to the speed at which parts move from one assembly stage to the next. For the pipeline to work without buffering parts/data, this rate should be the same for every pipeline stage.
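To make the metaphor concrete, here is a minimal sketch, in plain C++ rather than a hardware description language, of a three-stage pipeline being ticked along by a clock. The stages and operations are invented purely for illustration; the point is that once the pipeline is full, every stage is doing useful work on a different datum at every tick.

```cpp
#include <array>
#include <cstdio>
#include <iterator>
#include <optional>

// Toy software model of a three-stage pipeline. On each "clock tick" every
// stage works on a different datum, so once the pipeline is full all stages
// are busy at the same time. Stages and operations are illustrative only.
int main() {
    std::array<std::optional<int>, 3> regs{};   // pipeline registers

    const int inputs[] = {1, 2, 3, 4, 5};       // incoming data stream
    const int n = static_cast<int>(std::size(inputs));

    for (int tick = 0; tick < n + 3; ++tick) {
        // A datum leaving the last stage is a finished result.
        if (regs[2]) std::printf("tick %d: result %d\n", tick, *regs[2]);

        // Shift data one stage forward, applying each stage's operation.
        regs[2] = regs[1] ? std::optional<int>(*regs[1] * *regs[1])  // stage 3: square
                          : std::nullopt;
        regs[1] = regs[0] ? std::optional<int>(*regs[0] + 10)        // stage 2: add offset
                          : std::nullopt;
        regs[0] = (tick < n) ? std::optional<int>(inputs[tick])      // stage 1: fetch input
                             : std::nullopt;
    }
}
```

The first input enters at tick 0 and its result appears at tick 3; after that a new result appears on every tick, just as parts roll off the end of a full assembly line.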
Pipelining in itself doesn’t decrease processing time; if anything, it can increase it. However, pipelining increases the throughput of a system when processing a stream of data. In terms of processing network data, a pipeline clocked at the same rate as incoming traffic can guarantee to process every byte of network data, even at 100% utilisation.
For example, in a 100-stage 1G network processing pipeline clocked at 125MHz, the first byte of network data is processed after 800ns, with each subsequent processed byte arriving every 8ns. This is the same rate at which data enters the system, irrespective of network utilisation.
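Those figures fall straight out of the clock rate and the pipeline depth; a quick back-of-the-envelope calculation using the numbers from the example above:

```cpp
#include <cstdio>

// Latency and throughput of the 1G example above: a 100-stage pipeline
// clocked at 125MHz, accepting one byte per cycle.
int main() {
    const double clock_hz = 125e6;                 // pipeline clock rate
    const int    stages   = 100;                   // pipeline depth

    const double cycle_ns   = 1e9 / clock_hz;      // 8 ns per clock cycle
    const double latency_ns = stages * cycle_ns;   // 800 ns until the first byte emerges

    std::printf("cycle time      : %.0f ns\n", cycle_ns);
    std::printf("first result in : %.0f ns\n", latency_ns);
    std::printf("then one result every %.0f ns, matching the input rate\n", cycle_ns);
}
```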
A CPU uses a pipeline for its processing too, and it usually runs at a much faster clock rate than an FPGA. However, the pipeline on a CPU is not used to its full capacity: not all stages are in use at once. It is as if some staff on the production line are having a coffee break, so CPU pipelines are not operating at peak efficiency. This is why optimised FPGA designs are able to run certain types of calculation faster than CPUs, despite a lower clock rate: the FPGA pipeline can be designed so that every stage is always in use.
To come back to the original premise for this entry: which characteristics allow certain applications to run faster on FPGAs than CPUs? As you may recognise by now, applications which can be mapped into a fully pipelined FPGA implementation, or something close to one, work best.
Some points to think about to achieve full pipelining:
- Do similar operations need to be carried out on each piece of data?
- The pipeline’s behaviour should change with external state as little as possible.
- In the factory metaphor, this means minimising the number of optional extras so that there are as few branches in the assembly line/pipeline as possible. In other words, it’s best if the data flow is completely serial.
- Accesses to external and internal resources, such as memory, should be pipelineable, so that they can be integrated into the assembly process.
- Parts of the algorithm may have to be refactored to pipeline them. Is this possible? (There is a sketch of this kind of refactoring after this list.)
- Does data arrive at a constant rate?
- Accuracy – can precision be sacrificed to improve the rate at which data moves between stages of the pipeline?
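Coming back to the refactoring point above, here is an illustrative sketch in plain C++ of the sort of change that is often needed: a serial accumulation carries a dependency from one iteration to the next, which stops a new datum entering every cycle, whereas splitting it into independent partial sums removes that dependency. The functions and numbers are invented for illustration, not taken from any real design.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Serial form: each iteration depends on the previous result, so a new input
// cannot enter the adder until the previous addition has finished.
double sum_serial(const std::vector<double>& x) {
    double acc = 0.0;
    for (double v : x) acc += v;
    return acc;
}

// Refactored form: four independent accumulators break the dependency chain,
// so a new input can be issued every cycle; the partial results are only
// combined once at the end.
double sum_refactored(const std::vector<double>& x) {
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < x.size(); ++i) acc[i % 4] += x[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

int main() {
    const std::vector<double> data(1000, 0.5);
    std::printf("serial = %f, refactored = %f\n",
                sum_serial(data), sum_refactored(data));
}
```

Note that reordering floating-point additions can change the result slightly, which is exactly the precision trade-off raised in the last bullet.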
Some applications in the financial space which can be pipelined to work well on FPGAs are Monte Carlo simulation and network processing.
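To show why Monte Carlo fits so well, here is a toy single-step Monte Carlo pricer for a European call; the parameters and structure are invented for illustration and are not drawn from the papers discussed below. Every path is independent, so a fully pipelined FPGA core can start a new path on every clock cycle rather than waiting for the previous one to finish.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

// Toy single-step Monte Carlo pricer for a European call under geometric
// Brownian motion. Every path is independent, which is what lets a fully
// pipelined core accept a fresh random number on each clock cycle.
int main() {
    // Illustrative parameters only.
    const double S0 = 100.0, K = 100.0, r = 0.05, sigma = 0.2, T = 1.0;
    const int    paths = 1000000;

    std::mt19937_64 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);

    double payoff_sum = 0.0;
    for (int i = 0; i < paths; ++i) {
        // Terminal price: S_T = S0 * exp((r - sigma^2/2) * T + sigma * sqrt(T) * Z)
        const double z  = gauss(rng);
        const double st = S0 * std::exp((r - 0.5 * sigma * sigma) * T
                                        + sigma * std::sqrt(T) * z);
        payoff_sum += std::fmax(st - K, 0.0);      // call payoff, floored at zero
    }

    const double price = std::exp(-r * T) * payoff_sum / paths;
    std::printf("estimated call price: %f\n", price);
}
```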
Hard numbers
Next time I’ll most likely go into detail with case studies, but I promised hard numbers this time, so I’ll finish up with some.
“FPGA accelerated low-latency market data feed processing” was a research paper I co-wrote with David Thomas and Wayne Luk of Imperial College. It showed that an FPGA-based parser for the high-bandwidth OPRA feed was able to process all data on the feed, right up to the point of Ethernet saturation, without dropping packets due to slow processing. However, the software consuming the data delivered by the FPGA began to drop packets at around half line rate, because it could not process the market data at the same rate.
An investigation into implementations of a European option pricing benchmark, which I conducted with Matt Aubury (“Design Space Exploration Of The European Option Benchmark Using Hyperstreams”), compares an FPGA implementation with alternatives, including a software implementation on a conventional CPU. Depending on the accuracy criteria of the benchmark, the FPGA was shown to be between 15 and 146 times faster than the software implementation.
The numbers don’t lie – see you next time!