  • HLS Coding Style: Functions and Loops

    Unsupported C Constructs

    Functions

    The top-level function becomes the top level of the RTL design after synthesis.
    Sub-functions are synthesized into blocks in the RTL design.

    Coding style affects functions mainly through their arguments and
    interface.


    If the arguments to a function are sized accurately, Vivado HLS can propagate this
    information through the design. There is no need to create arbitrary precision types for
    every variable.

    #include "ap_cint.h"
    /* Inputs are plain 32-bit ints; only the return type is 24-bit. */
    int24 foo(int x, int y) {
        int tmp;
        tmp = (x * y);
        return tmp;
    }

    When this code is synthesized, the result is a 32-bit multiplier with its output truncated to
    24 bits.
    If the inputs are instead sized as 12-bit types (int12), as shown in the following code
    example, the final RTL uses a 24-bit multiplier.

    #include "ap_cint.h"
    typedef int12 din_t;   /* 12-bit inputs */
    typedef int24 dout_t;  /* 24-bit output */
    dout_t func_sized(din_t x, din_t y) {
        int tmp;
        tmp = (x * y);
        return tmp;
    }
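The width argument above can be checked with ordinary C arithmetic. The following is only a sketch in standard C, not Vivado HLS code: it does not use ap_cint.h, and the helpers fits_signed, wrap_to_int24 and max_product_int12 are hypothetical names introduced for illustration. It shows that the largest product of two signed 12-bit values fits in 24 bits, and what truncating a wider value to 24 bits does.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if v is representable as a signed
   two's-complement value of the given width (1..32 bits), else 0. */
int fits_signed(int64_t v, int bits) {
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Hypothetical helper: keep the low 24 bits of a 32-bit value and
   sign-extend them, modelling truncation to a signed 24-bit type. */
int32_t wrap_to_int24(int32_t v) {
    uint32_t low = (uint32_t)v & 0xFFFFFFu;
    if (low & 0x800000u) {
        /* Bit 23 is set, so the 24-bit value is negative. */
        return (int32_t)low - 0x1000000;
    }
    return (int32_t)low;
}

/* Largest-magnitude product of two signed 12-bit values:
   (-2048) * (-2048) = 2^22. */
int32_t max_product_int12(void) {
    return (-2048) * (-2048);
}
```

Since 2^22 is below the signed 24-bit maximum of 2^23 - 1, no product of two int12 values is lost when it is returned as int24, whereas a product of two full 32-bit inputs can lose its upper bits.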

    Loops

    • RECOMMENDED: Avoid use of global variables for loop index variables, as this can inhibit some optimizations.
    • IMPORTANT: When a loop or function is pipelined, Vivado HLS unrolls all loops in the hierarchy below the function or loop. If there is a loop with variable bounds in this hierarchy, it prevents pipelining.
    • TIP: When a loop or function is pipelined, any loop in the hierarchy below the loop or function being pipelined must be unrolled.
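A commonly used way around a variable loop bound is to iterate up to a fixed maximum and guard the loop body with a condition. The sketch below is plain C written for illustration: MAX_N, sum_variable and sum_fixed are hypothetical names, and it assumes the caller guarantees 0 <= n <= MAX_N.

```c
#define MAX_N 32   /* hypothetical compile-time upper limit on n */

/* Variable bound: the trip count depends on the run-time value of n,
   so the loop cannot be fully unrolled. Assumes 0 <= n <= MAX_N. */
int sum_variable(const int A[MAX_N], int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += A[i];
    }
    return acc;
}

/* Fixed bound: the loop always runs MAX_N times and the guard decides
   whether an iteration contributes, so the trip count is known at
   compile time. Assumes 0 <= n <= MAX_N. */
int sum_fixed(const int A[MAX_N], int n) {
    int acc = 0;
    for (int i = 0; i < MAX_N; i++) {
        if (i < n) {
            acc += A[i];
        }
    }
    return acc;
}
```

Both functions return the same result for any valid n. The fixed-bound version trades extra iterations for a trip count that a tool can unroll.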

    Variable Loop Bounds

    Loop Pipelining

    When pipelining loops, the best balance between area and performance is
    typically found by pipelining the innermost loop. This also results in the fastest run time.
    The following code example demonstrates the trade-offs when pipelining loops and
    functions.

    #include "loop_pipeline.h"
    dout_t loop_pipeline(din_t A[N]) {
        int i, j;
        static dout_t acc;
        LOOP_I: for (i = 0; i < 20; i++) {
            LOOP_J: for (j = 0; j < 20; j++) {
                acc += A[i] * j;
            }
        }
        return acc;
    }

    If the innermost loop (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a single
    multiplier). Vivado HLS automatically flattens the loops when possible, as in this case, and
    effectively creates a single new loop of 20*20 iterations. Only one multiplier operation and one
    array access need to be scheduled, and the loop iterations can then be scheduled as a single
    loop-body entity (20x20 loop iterations).
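To make the flattening concrete, the sketch below writes out by hand what a single 400-iteration loop equivalent to the nested pair could look like. Vivado HLS performs this transformation itself, so this is only an illustration in plain C. It assumes N is 20 and that din_t and dout_t are plain int (loop_pipeline.h is not shown in these notes), it uses a local accumulator instead of the static one so that repeated calls give the same result, and loop_nested and loop_flattened are hypothetical names.

```c
#define N 20            /* assumption: loop_pipeline.h is not shown */
typedef int din_t;      /* assumption */
typedef int dout_t;     /* assumption */

/* The original nested form, with a local accumulator. */
dout_t loop_nested(const din_t A[N]) {
    dout_t acc = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            acc += A[i] * j;
        }
    }
    return acc;
}

/* Hand-flattened form: one loop of N*N iterations. The indices i and j
   are recovered from the single counter k; there is still only one
   multiply and one array read per iteration. */
dout_t loop_flattened(const din_t A[N]) {
    dout_t acc = 0;
    for (int k = 0; k < N * N; k++) {
        int i = k / N;
        int j = k % N;
        acc += A[i] * j;
    }
    return acc;
}
```

With every element of A set to 1, both versions return 20 * (0 + 1 + ... + 19) = 3800.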

    If the outer loop (LOOP_I) is pipelined, the inner loop (LOOP_J) is unrolled, creating 20 copies
    of the loop body: 20 multipliers and 20 array accesses must now be scheduled. Each
    iteration of LOOP_I can then be scheduled as a single entity.


    If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400
    array accesses must now be scheduled. It is very unlikely that Vivado HLS will produce a
    design with 400 multipliers, because in most designs data dependencies prevent
    maximal parallelism. In this case, for example, even if a dual-port RAM is used for A[N], the
    design can access only two values of A[N] in any clock cycle.


    When choosing which level of the hierarchy to pipeline, the key point is that
    pipelining the innermost loop gives the smallest hardware with generally
    acceptable throughput for most applications. Pipelining the upper levels of the hierarchy
    unrolls all sub-loops and can create many more operations to schedule (which could impact
    run time and memory capacity), but typically gives the highest-performance design in terms
    of throughput and latency.


    To summarize the above options:
    • Pipeline LOOP_J
    Latency is approximately 400 cycles (20x20), and the design requires fewer than 100 LUTs and
    registers (the I/O control and FSM are always present).
    • Pipeline LOOP_I
    Latency is approximately 20 cycles, but the design requires a few hundred LUTs and registers: about
    20 times the logic of the first option, minus any logic optimizations that can be made.
    • Pipeline function loop_pipeline
    Latency is approximately 10 cycles (20 reads through a dual-port memory, two per cycle), but the
    design requires thousands of LUTs and registers: about 400 times the logic of the first option,
    minus any optimizations that can be made.
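The three latency figures follow from simple arithmetic, which can be sketched as a rough model. It is only an approximation for illustration: it ignores pipeline depth and start-up cycles, the function names are hypothetical, and it assumes that the function-level case is limited only by how many array reads the memory can serve per cycle.

```c
/* Round-up integer division. */
unsigned ceil_div(unsigned a, unsigned b) {
    return (a + b - 1) / b;
}

/* Innermost loop pipelined: one loop body per cycle over the
   flattened outer * inner iterations. */
unsigned latency_pipeline_inner(unsigned outer, unsigned inner) {
    return outer * inner;
}

/* Outer loop pipelined: the inner loop is unrolled, so one outer
   iteration is issued per cycle. */
unsigned latency_pipeline_outer(unsigned outer) {
    return outer;
}

/* Function pipelined: everything is unrolled, so the limit is the
   number of distinct array elements to read divided by the number
   of memory ports. */
unsigned latency_pipeline_function(unsigned reads, unsigned ports) {
    return ceil_div(reads, ports);
}
```

For the example above, latency_pipeline_inner(20, 20) gives 400, latency_pipeline_outer(20) gives 20, and latency_pipeline_function(20, 2) gives 10, matching the approximate cycle counts in the summary.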

    Loop Parallelism

    Loop Dependencies

    Unrolling Loops in C++ Classes

    Reference:

    1. Xilinx UG902

  • Original article: https://www.cnblogs.com/wordchao/p/10944844.html