
High-Level Synthesis (HLS):

Handling Variable Loop Depths

Catching up

 
The first article, Fast Track Project Starter, carried the full HLS project flow through to output generation. So far we have covered the baseline approach only coarsely, and not yet any means of scaling the generated IP for performance or resources. We also assumed a well-behaved C code to start with, one that already satisfies the requirements set out in the first chapter.
 
Be aware, though, that a C/C++ program is not always in good shape from the start. I will focus on one aspect of C code that may seem obvious but is still encountered quite often. It can get involved when a larger, grown code base is adopted for HLS-based hardware acceleration.

Variable Iteration Depth

 
Consider the simple task of calculating the average of a variable-length field, where the depth of the sample set is passed as a parameter to a C function:

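#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;    // 8-bit signed input sample
typedef ap_int<13> dout_t;  // wide enough for the sum of 32 samples
typedef ap_uint<5> dsel_t;  // number of samples to accumulate, 0..31

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    // The trip count depends on the run-time argument width
    LP_X: for (x = 0; x < width; x++) {
        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}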

Any CPU programmer would readily use such a function to handle almost arbitrary field sizes, and might go further into more complex loop and iterator implementations in C++.
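For comparison, here is a minimal host-side sketch of the same accumulation in plain C++. It assumes `std::int8_t` and `int` as stand-ins for `ap_int<8>` and `ap_int<13>`, so that it compiles without the Vitis HLS headers. The function name `accumulate_ref` is hypothetical and not part of the original example.

```cpp
#include <cstdint>

constexpr int N = 32;

// Hypothetical software reference: sums the first `width` samples of A.
// std::int8_t and int stand in for ap_int<8> and ap_int<13> (assumption).
int accumulate_ref(const std::int8_t A[N], unsigned width) {
    int out_accum = 0;
    // The loop bound is only known at run time, which is no problem on a CPU.
    for (unsigned x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}
```

A model like this can serve as the golden reference in a C testbench, as long as the caller keeps `width` within the array depth `N`.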

The HLS compiler, however, cannot determine the loop latency even in the baseline implementation, because the width argument is not statically known. It creates an HDL state machine that reads the width value from a control register but cannot assume any actual value. This yields question marks in the result reports, i.e., an undefined performance metric. Vitis HLS helps in such situations with a compiler directive, LOOP_TRIPCOUNT, that declares the estimated iteration count. In our simple case this would yield:

#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    LOOP_X: for (x = 0; x < width; x++) {
        // Reporting hint only: does not change the generated hardware
#pragma HLS LOOP_TRIPCOUNT min = N/2 max = N

        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}
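Because the trip-count bounds only feed the report, their effect can be pictured as simple arithmetic. The sketch below assumes a sequential, non-pipelined loop in which every iteration costs a fixed number of cycles. Both the helper `estimate_latency` and the value of 2 cycles per iteration are hypothetical; the real tool derives the per-iteration cost from its schedule, and this is not how Vitis HLS computes its reports internally.

```cpp
#include <cstdint>

constexpr int N = 32;

// Latency range that a report could show once trip-count bounds are known.
struct LatencyRange {
    std::uint32_t min_cycles;
    std::uint32_t max_cycles;
};

// Hypothetical helper: sequential loop, fixed cost per iteration (assumption).
constexpr LatencyRange estimate_latency(std::uint32_t trip_min,
                                        std::uint32_t trip_max,
                                        std::uint32_t cycles_per_iter) {
    return LatencyRange{trip_min * cycles_per_iter, trip_max * cycles_per_iter};
}

// Bounds taken from the pragma above (min = N/2, max = N),
// with an assumed cost of 2 cycles per iteration.
constexpr LatencyRange kRange = estimate_latency(N / 2, N, 2);
```

Without the bounds, neither product can be evaluated, which is what the question marks in the report express.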

However, this directive does not alter code generation; it only affects reporting. The HLS compiler still creates a sequential implementation, and each iteration adds to the overall latency. Without further guidance, the accumulation does not yet show the FPGA advantage. To make the most efficient use of the generated logic in the FPGA fabric, the HLS designer needs to achieve parallel execution of the loop body across the full width.

Formally, this requires reducing the initiation interval (II), which describes the efficiency of the hardware: II = 1 is achieved when the generated HDL design can accept a new data input every cycle. As hinted earlier, a directive that flattens the loop is required, and the HLS compiler offers one with the simple UNROLL pragma:

#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t code028(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    LOOP_X: for (x = 0; x < width; x++) {
        // Requests full unrolling, but the bound width is still variable
#pragma HLS unroll

        out_accum += A[x];
    }

    /* actual division */
    return out_accum;
}

Beware that this is a very greedy approach to flattening, but it serves the purpose of this discussion: by default, UNROLL drives a fully parallel implementation, limited only by the device resources. Even so, with this source code the HLS compiler issues a warning that it »cannot completely unroll a loop with a variable trip count«. This is again an effect of the non-deterministic loop depth, but this time it actually prevents our directive from working properly. Just as in the latency case before, we need a compile-time static loop depth in place of width.
 
The solution becomes obvious once we let the loop count to a constant bound and guard the loop body so that it only accumulates up to the expected iteration depth:

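#include "ap_int.h"
#define N 32

typedef ap_int<8> din_t;
typedef ap_int<13> dout_t;
typedef ap_uint<5> dsel_t;

dout_t loop_max_bounds(din_t A[N], dsel_t width) {

    dout_t out_accum = 0;
    dsel_t x;

    // Constant bound N - 1: the 5-bit counter x cannot reach N without
    // wrapping, and width is at most N - 1, so indices 0..N-2 cover every case
    LOOP_X: for (x = 0; x < N - 1; x++) {
#pragma HLS unroll

        // The guard replaces the variable loop bound
        if (x < width) {
            out_accum += A[x];
        }
    }

    /* actual division */
    return out_accum;
}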

This requires that the condition width ≤ N holds for every width passed to the function. With the 5-bit dsel_t used here it holds by construction, since width cannot exceed 31.
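That the guarded, constant-bound loop computes the same result as the variable-bound loop can be checked in software before going into synthesis. The following is a minimal sketch in plain C++, assuming `std::int8_t` and `int` in place of the `ap_int` types and a plain `int` loop counter, so the constant bound `N` cannot wrap. The function names are hypothetical.

```cpp
#include <cstdint>

constexpr int N = 32;

// Variable trip count: the loop bound depends on a run-time argument.
int sum_variable_bound(const std::int8_t A[N], int width) {
    int acc = 0;
    for (int x = 0; x < width; x++) acc += A[x];
    return acc;
}

// Constant trip count: always N iterations, the guard selects the active ones.
int sum_fixed_bound(const std::int8_t A[N], int width) {
    int acc = 0;
    for (int x = 0; x < N; x++) {
        if (x < width) acc += A[x];
    }
    return acc;
}

// Returns true if both variants agree for every width in 0..N.
bool variants_agree(const std::int8_t A[N]) {
    for (int width = 0; width <= N; width++) {
        if (sum_variable_bound(A, width) != sum_fixed_bound(A, width)) return false;
    }
    return true;
}
```

In this model a width larger than `N` would make the variable-bound version read past the array, while the fixed-bound version silently stops at `N`. The two only agree while the precondition holds.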

Summary

 
This short derivation has demonstrated that the HLS compiler needs support to digest C-code constructs that cannot easily be fixed at compile time. The example should sharpen the eye for how code is examined and where it may need supportive changes to yield a promising result. In the following blog posts we will revisit such situations, specifically when it comes to procedural or block parallelism.
 
Again, be aware that in a more complex code base such cases can call for more intricate solutions, even splitting data blocks into static and non-static loops.
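One possible reading of such a split, sketched in plain C++ under stated assumptions: the block size `B`, the function name `sum_blocked` and the use of `std::int8_t` and `int` in place of the `ap_int` types are all hypothetical. The outer loop keeps a variable trip count, while the inner loop has the constant trip count `B` and is where an UNROLL directive would be placed in an HLS source.

```cpp
#include <cstdint>

constexpr int N = 32;  // array depth, as in the listings above
constexpr int B = 8;   // assumed block size; must divide N

// Hypothetical restructuring; the caller must ensure width <= N.
int sum_blocked(const std::int8_t A[N], int width) {
    int acc = 0;
    // Non-static part: the number of blocks touched depends on width.
    for (int b = 0; b * B < width; b++) {
        int partial = 0;
        // Static part: constant trip count B, a candidate for full unrolling.
        for (int i = 0; i < B; i++) {
            int x = b * B + i;
            if (x < width) partial += A[x];  // guard handles the last, partial block
        }
        acc += partial;
    }
    return acc;
}
```

The design choice here is a trade-off: only `B` adders are replicated instead of `N`, at the cost of a sequential outer loop whose latency still depends on `width`.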

If you would like to dive deep into the world of HLS right now, sign up for PLC2’s training course Compact Vitis HLS and benefit from 2 hours of free coaching support with every training session you book.