HLS Mini Series - II
Handling Variable Loop Depths
Level: Beginner
Reading time: XX minutes
Catching up
With the first blog post in this series, the full High-Level Synthesis (HLS) project flow can be carried through to the output generation. So far we have coarsely covered the idea of the baseline approach, but we have not yet introduced any means of scaling the generated IP for performance or resources. We also assumed that we start from well-behaved C code that already satisfies the requirements stated in the first post. Let me create some awareness here that the C/C++ program may not always be in good shape from the start. I will focus on one aspect of the C code that may seem obvious but is still encountered quite often. It can become involved when adopting a larger, grown code base for HLS-based hardware acceleration.
Today: Variable Iteration Depth
For the simple task of calculating the average of a variable-length field, the depth of the sample set may be given as a parameter to a C function:
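As a minimal sketch (the function and variable names are illustrative, not taken from the original post), such a function might look like this:

int average(const int samples[], int width) {
    int sum = 0;
    for (int i = 0; i < width; i++) {
        sum += samples[i];   // loop depth depends on the runtime value of width
    }
    return sum / width;
}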
Any CPU-based programmer would readily use such a function to handle an almost arbitrary field size, perhaps even moving on to more complex loop and iterator implementations in C++. However, already in the baseline implementation the HLS compiler cannot determine the loop latency, because the width argument is not statically known. It will create an HDL state machine that reads the width value from a control register but cannot assume any actual value. This yields question marks in the result reports, i.e., an unclear performance metric. Vitis HLS helps in such situations with a compiler directive that defines an estimated execution depth: the LOOP_TRIPCOUNT directive mitigates this issue. In our simple case this would yield:
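A sketch of the annotated loop, with assumed trip-count bounds of 1 to 1024 iterations (the actual values depend on the application):

int average(const int samples[], int width) {
    int sum = 0;
    for (int i = 0; i < width; i++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1024 avg=512
        sum += samples[i];   // latency report now uses the estimated trip count
    }
    return sum / width;
}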
However, this directive does not alter the code generation; it only affects the reporting. The HLS compiler will still create a sequential implementation, and each iteration will add to the overall latency. Without further guidance, the accumulation will not yet show the FPGA advantage. To make the most efficient use of the generated logic in the FPGA fabric, the HLS designer needs to achieve parallel execution of the loop body across the full width.
Formally this requires reducing the initiation interval (II), which describes the efficiency of the hardware: an II = 1 is achieved when the generated HDL design can accept a new data input in every cycle. As hinted earlier, a loop unrolling directive is needed, and the HLS compiler offers one with the simple UNROLL pragma:
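Applied to the same loop, the sketch becomes (again with illustrative names):

    for (int i = 0; i < width; i++) {
#pragma HLS UNROLL
        sum += samples[i];   // request a fully parallel implementation of the loop body
    }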
Beware that this is a very greedy approach to unrolling, but it serves the purpose of the discussion: by default, UNROLL drives a fully parallel implementation, limited only by the device resources. Still, with this source code the HLS compiler will throw a WARNING message claiming that it »cannot completely unroll a loop with a variable trip count«. This is again an effect of the non-deterministic loop depth, but this time it actually prevents our directive from working properly. Just as in the latency case before, we need to provide a compile-time static value for the loop depth width. The solution becomes obvious if we use a constant loop bound and let the code break out of, or skip, the remaining iterations once the expected iteration depth is reached:
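One way to express this is sketched below, assuming a compile-time upper bound N on the field size (the value 1024 is only an example). Instead of breaking out of the loop, the body is guarded by a condition here, which amounts to the same thing once the loop is fully unrolled:

#define N 1024   // assumed compile-time upper bound on width

int average(const int samples[], int width) {
    int sum = 0;
    for (int i = 0; i < N; i++) {   // constant trip count, can be fully unrolled
#pragma HLS UNROLL
        if (i < width) {            // only the first width samples contribute
            sum += samples[i];
        }
    }
    return sum / width;
}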
This obviously requires that the condition width ≤ N holds for all applicable widths.
Summary
This short derivation has demonstrated that the HLS compiler really requires support to digest C-code constructs that are not fixed at compile time. The example should guide the way code is inspected and, where necessary, adapted to yield a promising result. In the following blog posts we will revisit such situations, specifically when it comes to procedural or block parallelism. Again, be aware that in a more complex code base such variants can call for more intricate solutions, even splitting data blocks into static and non-static loops.