HLS Mini Series - II

Handling Variable Loop Depths

Level: Beginner
Reading time: 6 minutes

Catching up

 

Together with the first blog post in this series, the full High-Level Synthesis (HLS) project flow can now be followed up to output generation. So far, we have only coarsely covered the baseline approach and do not yet have any means of scaling the generated IP for performance or resources. We also assumed a well-behaved C code that already satisfies the requirements from the first chapter. In practice, a C/C++ program may not be in good shape from the start. This post focuses on one aspect of C code that may seem obvious but is still encountered quite often, and that gets more involved when a larger, grown code base is adopted for HLS-based hardware acceleration.

Today: Variable Iteration Depth

 

For the simple task of averaging a variable-length array, the number of samples may be passed as a parameter to a C function:

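#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;    // input sample
typedef ap_int<13> dout_t;  // accumulator, wide enough for 32 samples of 8 bits
typedef ap_uint<5> dsel_t;  // loop counter and width, range 0..31

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    // The trip count depends on the run-time argument width
LP_X: for (x = 0; x < width; x++) {
        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}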

Any CPU programmer would readily use such a function to handle an almost arbitrary array size, and might go on to more complex loop and iterator implementations in C++.

However, even in the baseline iteration, where the implementation sticks closely to the original C/C++ code, the HLS compiler cannot determine the loop latency, because the width argument is not known at compile time (HLS synthesis time). It creates an HDL state machine that reads the width value from a control register but cannot assume any actual value. The result reports therefore show question marks instead of latency figures, which leaves the performance metric unclear. Vitis HLS helps in such situations with the LOOP_TRIPCOUNT directive, which declares an estimated range for the number of iterations. In our simple case this yields:

#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

LOOP_X: for (x = 0; x < width; x++) {
#pragma HLS LOOP_TRIPCOUNT min = N/2 max = N

        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}

This directive does not alter the code generation, however; it only affects the reporting. The HLS compiler still creates a sequential implementation, and each iteration adds to the overall latency. Without further guidance, the accumulation does not yet show the FPGA advantage. To make the most efficient use of the generated logic in the FPGA fabric, the HLS designer needs to achieve a parallel execution of the loop body over the full width.
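To put the reporting effect into numbers, here is a minimal sketch of how a minimum and maximum trip count turn into a latency range for a sequential loop. The names `LatencyRange` and `sequential_latency` are hypothetical, the cycles per iteration are an assumed input, and loop entry and exit overhead is ignored, so the figures will not match a real Vitis HLS report exactly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical first-order model, not a Vitis HLS API.
struct LatencyRange {
    std::uint32_t min_cycles;
    std::uint32_t max_cycles;
};

// A loop that is neither pipelined nor unrolled needs roughly
// trip_count * cycles_per_iteration clock cycles.
inline LatencyRange sequential_latency(std::uint32_t trip_min,
                                       std::uint32_t trip_max,
                                       std::uint32_t cycles_per_iteration) {
    assert(trip_min <= trip_max);
    return LatencyRange{trip_min * cycles_per_iteration,
                        trip_max * cycles_per_iteration};
}
```

With the bounds N/2 = 16 and N = 32 from the listing and an assumed two cycles per iteration, `sequential_latency(16, 32, 2)` gives a range of 32 to 64 cycles. The range only describes the expected behaviour; the generated hardware is the same with or without it.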

Formally, this requires reducing the Initiation Interval (II), the number of clock cycles between two successive accepted inputs, which describes the efficiency of the hardware. An II of 1 is achieved when the generated HDL design can accept a new data input every clock cycle. As hinted earlier, a directive that unrolls the loop is required, and the HLS compiler offers one with the simple UNROLL pragma:

#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

LOOP_X: for (x = 0; x < width; x++) {
#pragma HLS unroll

        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}
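The Initiation Interval introduced before this listing can be illustrated with a commonly used first-order estimate for a loop that accepts a new iteration every II cycles. This is a sketch only: the function name `loop_cycles_with_ii` is hypothetical, and the iteration latency is an assumed input rather than a value from a synthesis report.

```cpp
#include <cassert>
#include <cstdint>

// First-order estimate: the first result appears after iteration_latency
// cycles, and every further iteration starts II cycles after the previous one.
inline std::uint32_t loop_cycles_with_ii(std::uint32_t trip_count,
                                         std::uint32_t ii,
                                         std::uint32_t iteration_latency) {
    assert(trip_count >= 1 && ii >= 1);
    return iteration_latency + ii * (trip_count - 1);
}
```

For 32 iterations with an assumed iteration latency of 3 cycles, an II of 1 gives 34 cycles, while an II of 3 gives 96 cycles. The smaller the II, the closer the total comes to one result per clock cycle.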

Beware that this is a very greedy approach, but it serves the purpose of this discussion: by default, UNROLL drives a fully parallel implementation that is limited only by the device resources. Still, with this source code the HLS compiler issues a WARNING stating that it "cannot completely unroll a loop with a variable trip count". This is again an effect of the loop depth being unknown at compile time, but this time it actually prevents our directive from working properly. Just as in the latency case before, we need a compile-time static value for the loop depth. The solution is to give the loop a constant bound and let a condition inside the loop body skip the accumulation once the requested depth is reached:

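#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t loop_max_bounds(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    // Constant loop bound; N - 1 because the 5-bit counter x cannot hold N
LOOP_X: for (x = 0; x < N - 1; x++) {
#pragma HLS unroll

        // width only guards the loop body, it no longer sets the trip count
        if (x < width) {
            out_accum += A[x];
        }
    }

    /* actual division */
    return out_accum;
}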

This requires that width never exceeds the constant loop bound for any width the design must support. Here, the 5-bit dsel_t type already limits width to at most 31.
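A quick way to gain confidence in such a rewrite is to compare both loop forms in plain software before synthesis. The following is a minimal sketch under stated assumptions: it uses standard integer types instead of ap_int so that it compiles without the HLS headers, it uses a plain int counter so that the bound N itself is representable, and the function names `sum_variable_bound` and `sum_fixed_bound` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 32;

// Reference form: the trip count depends on the run-time argument width.
inline std::int32_t sum_variable_bound(const std::int8_t A[N], unsigned width) {
    assert(width <= static_cast<unsigned>(N));
    std::int32_t acc = 0;
    for (unsigned x = 0; x < width; ++x) {
        acc += A[x];
    }
    return acc;
}

// Rewritten form: the trip count is the constant N, and width only
// decides whether an iteration contributes to the sum.
inline std::int32_t sum_fixed_bound(const std::int8_t A[N], unsigned width) {
    assert(width <= static_cast<unsigned>(N));
    std::int32_t acc = 0;
    for (int x = 0; x < N; ++x) {
        if (static_cast<unsigned>(x) < width) {
            acc += A[x];
        }
    }
    return acc;
}
```

Calling both functions with the same array for every width from 0 to N should give identical sums. The same comparison can be carried over to the C simulation test bench of the HLS project.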

Summary

 

This short derivation has demonstrated that the HLS compiler needs support to digest C constructs whose values are not fixed at compile time. The example should give a feel for how C/C++ code is interpreted and how it may need supportive changes to yield a promising result. In the following blog posts, we will revisit such situations, specifically when it comes to procedural or block parallelism. Again, be aware that in a more complex code base such cases can call for more intricate solutions, such as splitting data blocks into static and non-static loops.