  • HLS Coding Style: Hardware Efficient C Code

    Hardware Efficient C Code

    When C code is compiled for a CPU, the compiler transforms and optimizes the C code into
    a set of CPU machine instructions. In many cases, the developer's work is done at this stage.
    If, however, more performance is required, the developer will seek to perform some or all
    of the following:
    • Understand if any additional optimizations can be performed by the compiler.
    • Seek to better understand the processor architecture and modify the code to take
    advantage of any architecture-specific behaviors (for example, reducing conditional
    branching to improve instruction pipelining).
    • Modify the C code to use CPU-specific intrinsics to perform key operations in parallel
    (for example, ARM NEON intrinsics).

    The same methodology applies to code written for a DSP or a GPU, and when using an
    FPGA: an FPGA device is simply another target.
    C code synthesized by Vivado HLS will execute on an FPGA and provide the same
    functionality as the C simulation. In some cases, the developer's work is done at this stage.

    Typical C Code for a Convolution Function

    The algorithm structure can be summarized as follows:

     template<typename T, int K>
     static void convolution_orig(
         int width,
         int height,
         const T *src,
         T *dst,
         const T *hcoeff,
         const T *vcoeff) {
       T local[MAX_IMG_ROWS*MAX_IMG_COLS];
       // Horizontal convolution
       HconvH:for(int col = 0; col < height; col++){
         HconvW:for(int row = border_width; row < width - border_width; row++){
           Hconv:for(int i = -border_width; i <= border_width; i++){
           }
         }
       }
       // Vertical convolution
       VconvH:for(int col = border_width; col < height - border_width; col++){
         VconvW:for(int row = 0; row < width; row++){
           Vconv:for(int i = -border_width; i <= border_width; i++){
           }
         }
       }
       // Border pixels
       Top_Border:for(int col = 0; col < border_width; col++){
       }
       Side_Border:for(int col = border_width; col < height - border_width; col++){
       }
       Bottom_Border:for(int col = height - border_width; col < height; col++){
       }
     }

     const int conv_size = K;
     const int border_width = int(conv_size / 2);
     #ifndef __SYNTHESIS__
       T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
     #else // Static storage allocation for HLS, dynamic otherwise
       T local[MAX_IMG_ROWS*MAX_IMG_COLS];
     #endif
     Clear_Local:for(int i = 0; i < height * width; i++){
       local[i] = 0;
     }
     // Horizontal convolution
     HconvH:for(int col = 0; col < height; col++){
       HconvW:for(int row = border_width; row < width - border_width; row++){
         int pixel = col * width + row;
         Hconv:for(int i = -border_width; i <= border_width; i++){
           local[pixel] += src[pixel + i] * hcoeff[i + border_width];
         }
       }
     }

    The first issue for the quality of the FPGA implementation is the array local. Because it is
    an array, it will be implemented using internal FPGA block RAM, and at
    MAX_IMG_ROWS*MAX_IMG_COLS elements it is a very large memory to implement inside the
    FPGA. It may require a larger and more costly FPGA device. The use
    of block RAM can be minimized by using the DATAFLOW optimization and streaming the
    data through small efficient FIFOs, but this will require the data to be used in a streaming
    manner.


    The next issue is the initialization of array local. The loop Clear_Local is used to set
    the values in array local to zero. Even if this loop is pipelined, this operation requires
    approximately 2 million clock cycles (HEIGHT*WIDTH) to complete. The same initialization
    of the data could instead be performed using a temporary variable inside loop Hconv to
    initialize the accumulation before the write, as sketched below.
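
    A minimal sketch of that alternative, reusing the variable names from the code above:
    accumulate into a local variable and write each result exactly once, so Clear_Local is no
    longer needed.

     // Hedged sketch: accumulate in a temporary instead of pre-clearing local[]
     HconvH:for(int col = 0; col < height; col++){
       HconvW:for(int row = border_width; row < width - border_width; row++){
         T accum = 0;                          // per-pixel initialization
         Hconv:for(int i = -border_width; i <= border_width; i++){
           accum += src[col * width + row + i] * hcoeff[i + border_width];
         }
         local[col * width + row] = accum;     // single write per output pixel
       }
     }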


    Finally, the throughput of the data is limited by the data access pattern.
    • For the first output, the first K values are read from the input.
    • To calculate the second output, the same K-1 values are re-read through the data input
    port.
    • This process of re-reading the data is repeated for the entire image.

     Clear_Dst:for(int i = 0; i < height * width; i++){
       dst[i] = 0;
     }
     // Vertical convolution
     VconvH:for(int col = border_width; col < height - border_width; col++){
       VconvW:for(int row = 0; row < width; row++){
         int pixel = col * width + row;
         Vconv:for(int i = -border_width; i <= border_width; i++){
           int offset = i * width;
           dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
         }
       }
     }


    This code highlights similar issues to those already discussed with the horizontal
    convolution code.
    • Many clock cycles are spent to set the values in the output image dst to zero. In this
    case, approximately another 2 million cycles for a 1920*1080 image size.
    • There are multiple accesses per pixel to re-read data stored in array local.
    • There are multiple writes per pixel to the output array/port dst.


    Another issue with the code above is the access pattern into array local. The algorithm
    requires the data on row K to be available to perform the first calculation. Processing data
    down the rows before proceeding to the next column requires the entire image to be stored
    locally. In addition, because the data is not streamed out of array local, a FIFO cannot be
    used to implement the memory channels created by DATAFLOW optimization. If DATAFLOW
    optimization is used on this design, this memory channel requires a ping-pong buffer: this
    doubles the memory requirements for the implementation to approximately 4 million data
    samples all stored locally on the FPGA.

     int border_width_offset = border_width * width;
     int border_height_offset = (height - border_width - 1) * width;
     // Border pixels
     Top_Border:for(int col = 0; col < border_width; col++){
       int offset = col * width;
       for(int row = 0; row < border_width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_width_offset + border_width];
       }
       for(int row = border_width; row < width - border_width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_width_offset + row];
       }
       for(int row = width - border_width; row < width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_width_offset + width - border_width - 1];
       }
     }
     Side_Border:for(int col = border_width; col < height - border_width; col++){
       int offset = col * width;
       for(int row = 0; row < border_width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[offset + border_width];
       }
       for(int row = width - border_width; row < width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[offset + width - border_width - 1];
       }
     }
     Bottom_Border:for(int col = height - border_width; col < height; col++){
       int offset = col * width;
       for(int row = 0; row < border_width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_height_offset + border_width];
       }
       for(int row = border_width; row < width - border_width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_height_offset + row];
       }
       for(int row = width - border_width; row < width; row++){
         int pixel = offset + row;
         dst[pixel] = dst[border_height_offset + width - border_width - 1];
       }
     }


    The code suffers from the same repeated accesses to data. The data stored outside the FPGA
    in array dst must now be read back in as input data, and it is re-read multiple times. Even in
    the first loop, dst[border_width_offset + border_width] is read multiple times, although the
    values of border_width_offset and border_width do not change.
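
    One way to remove most of those re-reads is to cache the repeated value in a local
    variable before the inner loop. A minimal sketch for the first sub-loop of Top_Border,
    reusing the names from the code above:

     // Hedged sketch: read the replicated corner pixel once, then reuse it
     T corner = dst[border_width_offset + border_width];  // one read instead of one per pixel
     for(int row = 0; row < border_width; row++){
       dst[offset + row] = corner;
     }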

    The final aspect where this coding style negatively impacts the performance and quality of
    the FPGA implementation is the structure of how the different conditions are addressed. A
    for-loop processes the operations for each condition: top-left, top-row, and so on. The
    optimization choice here is between pipelining the top-level loops and pipelining the sub-loops.
    Pipelining the top-level loops (Top_Border, Side_Border, Bottom_Border) is not
    possible in this case because some of the sub-loops have variable bounds (based on the
    value of input width). In this case you must pipeline the sub-loops and execute each set of
    pipelined loops serially.


    The question of whether to pipeline the top-level loop and unroll the sub-loops or pipeline
    the sub-loops individually is determined by the loop limits and how many resources are
    available on the FPGA device. If the top-level loop limit is small, unroll the loops to replicate
    the hardware and meet performance. If the top-level loop limit is large, pipeline the lower
    level loops and lose some performance by executing them sequentially in a loop
    (Top_Border, Side_Border, Bottom_Border).
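
    A hedged sketch of the two pragma placements, using Bottom_Border as the example (the
    sub-loop bodies are elided; option 1 only works when every nested sub-loop has fixed
    bounds, because pipelining a loop forces HLS to unroll all loops nested inside it):

     // Option 1: pipeline the top-level loop; all fixed-bound sub-loops are unrolled
     Bottom_Border:for(int col = height - border_width; col < height; col++){
     #pragma HLS PIPELINE
       for(int row = 0; row < border_width; row++){ /* ... */ }
     }

     // Option 2: pipeline each sub-loop; the sub-loops then execute serially
     Bottom_Border:for(int col = height - border_width; col < height; col++){
       for(int row = 0; row < border_width; row++){
     #pragma HLS PIPELINE
         /* ... */
       }
     }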

    As shown in this review of a standard convolution algorithm, the following coding styles
    negatively impact the performance and size of the FPGA implementation:
    • Setting default values in arrays costs clock cycles and performance.
    • Multiple accesses to read and then re-read data costs clock cycles and performance.
    • Accessing data in an arbitrary or random access manner requires the data to be stored
    locally in arrays and costs resources.

    Ensuring the Continuous Flow of Data and Data Reuse

    The ideal behavior is to have the data
    samples constantly flow through the FPGA.
    • Maximize the flow of data through the system. Refrain from using any coding
    techniques or algorithm behavior which limits the flow of data.
    • Maximize the reuse of data. Use local caches to ensure there are no requirements to
    re-read data and the incoming data can keep flowing.

    If the data is transferred from the CPU or system memory to the FPGA, it will typically be
    transferred in this streaming manner. Data transferred from the FPGA back to the
    system should also be sent in a streaming manner.

    Using HLS Streams for Streaming Data
    One of the first enhancements which can be made to the earlier code is to use the HLS
    stream construct, typically referred to as an hls::stream. An hls::stream object can be used to
    store data samples in the same manner as an array. The data in an hls::stream can only be
    accessed sequentially. In the C code, the hls::stream behaves like a FIFO of infinite depth.
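
    A minimal sketch of those access semantics (assuming the hls_stream.h header and an int
    payload; the function and variable names are illustrative only):

     #include <hls_stream.h>

     // Hypothetical task: consume one sample, produce one sample
     void pass_through(hls::stream<int> &in, hls::stream<int> &out) {
       int v = in.read();   // blocking read: the sample is consumed and cannot be re-read
       out.write(v + 1);    // blocking write: push the result to the next task
       // The operator forms in >> v; and out << v; are equivalent.
     }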

    Code written using hls::streams will generally create designs in an FPGA which have
    high-performance and use few resources because an hls::stream enforces a coding style
    which is ideal for implementation in an FPGA.

    Multiple reads of the same data from an hls::stream are impossible. Once the data has been
    read from an hls::stream, it no longer exists in the stream. This helps eliminate the coding
    practice of re-reading data.

    If the data from an hls::stream is required again, it must be cached. This is another good
    practice when writing code to be synthesized on an FPGA.

    The hls::stream construct forces the C code to be developed in a manner which is ideal for an FPGA
    implementation.
    When an hls::stream is synthesized it is automatically implemented as a FIFO channel which
    is 1 element deep. This is the ideal hardware for connecting pipelined tasks.
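
    If a deeper buffer is needed, for example to absorb rate differences between producer and
    consumer tasks in a DATAFLOW region, the depth can be set explicitly with the STREAM
    directive. A hedged sketch (the depth value 32 is arbitrary):

     hls::stream<T> hconv("hconv");
     #pragma HLS STREAM variable=hconv depth=32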
    There is no requirement to use hls::streams and the same implementation can be performed
    using arrays in the C code. The hls::stream construct does help enforce good coding
    practices. With an hls::stream construct the outline of the new optimized code is as follows:

     template<typename T, int K>
     static void convolution_strm(
         int width,
         int height,
         hls::stream<T> &src,
         hls::stream<T> &dst,
         const T *hcoeff,
         const T *vcoeff)
     {
       hls::stream<T> hconv("hconv");
       hls::stream<T> vconv("vconv");
       // Samples per line produced by the horizontal pass (definition added for completeness)
       const int vconv_xlim = width - (K - 1);
       // These assertions let HLS know the upper bounds of loops
       assert(height < MAX_IMG_ROWS);
       assert(width < MAX_IMG_COLS);
       assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
       // Horizontal convolution
       HConvH:for(int col = 0; col < height; col++) {
         HConvW:for(int row = 0; row < width; row++) {
           HConv:for(int i = 0; i < K; i++) {
           }
         }
       }
       // Vertical convolution
       VConvH:for(int col = 0; col < height; col++) {
         VConvW:for(int row = 0; row < vconv_xlim; row++) {
           VConv:for(int i = 0; i < K; i++) {
           }
         }
       }
       // Border pixels
       Border:for (int i = 0; i < height; i++) {
         for (int j = 0; j < width; j++) {
         }
       }
     }
     
    Some noticeable differences compared to the earlier code are:
    • The input and output data are now modelled as hls::streams.
    • Instead of a single local array of size HEIGHT*WIDTH there are two internal hls::streams
    used to save the output of the horizontal and vertical convolutions.

     // Horizontal convolution
     HConvH:for(int col = 0; col < height; col++) {
       HConvW:for(int row = 0; row < width; row++) {
         T in_val = src.read();
         T out_val = 0;
         HConv:for(int i = 0; i < K; i++) {
           hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
           out_val += hwin[i] * hcoeff[i];
         }
         if (row >= K - 1)
           hconv << out_val;
       }
     }

    The algorithm keeps reading input samples and caching them into hwin. Each time it reads a
    new sample, it pushes an unneeded sample out of hwin. The first time an output value can
    be written is after the Kth input has been read.
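
    The snippet above assumes a small window array hwin holding the K most recent samples,
    declared earlier in the function. A minimal sketch of that declaration (partitioning it into
    registers avoids block RAM port limits; the pragma placement is an assumption, not taken
    from the original):

     T hwin[K];   // horizontal window: the K most recent input samples
     #pragma HLS ARRAY_PARTITION variable=hwin complete dim=1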

    Throughout the entire process, the samples in the src input are processed in a
    raster-streaming manner. Every sample is read in turn. The outputs from the task are either
    discarded or used, but the task keeps constantly computing. This represents a difference
    from code written to run on a CPU.


    In a CPU architecture, conditional or branch operations are often avoided. When the
    program needs to branch it loses any instructions stored in the CPU fetch pipeline. In an
    FPGA architecture, a separate path already exists in the hardware for each conditional
    branch and there is no performance penalty associated with branching inside a pipelined
    task. It is simply a case of selecting which branch to use.

    The outputs are stored in the hls::stream hconv for use by the vertical convolution loop.

     // Vertical convolution
     VConvH:for(int col = 0; col < height; col++) {
       VConvW:for(int row = 0; row < vconv_xlim; row++) {
     #pragma HLS DEPENDENCE variable=linebuf inter false
     #pragma HLS PIPELINE
         T in_val = hconv.read();
         T out_val = 0;
         VConv:for(int i = 0; i < K; i++) {
           T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
           out_val += vwin_val * vcoeff[i];
           if (i > 0)
             linebuf[i - 1][row] = vwin_val;
         }
         if (col >= K - 1)
           vconv << out_val;
       }
     }

    A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another
    sample is pushed out of the line buffer. An interesting point to note here is that the newest
    sample is used in the calculation first, and only then is it stored into the line buffer while the
    oldest sample is ejected. This ensures that only K-1 lines need to be cached, rather than K
    lines. Although a line buffer does require multiple lines to be stored locally, the convolution
    kernel size K is always much less than the 1080 lines in a full video image.
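
    The line buffer itself is assumed to be declared near the top of the function; a minimal
    sketch consistent with the ARRAY_PARTITION directive shown later (the exact dimensions
    are an assumption):

     T linebuf[K - 1][MAX_IMG_COLS];   // caches the previous K-1 image lines
     #pragma HLS ARRAY_PARTITION variable=linebuf dim=1 complete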

     Border:for (int i = 0; i < height; i++) {
       for (int j = 0; j < width; j++) {
         T pix_in, l_edge_pix, r_edge_pix, pix_out;
     #pragma HLS PIPELINE
         if (i == 0 || (i > border_width && i < height - border_width)) {
           if (j < width - (K - 1)) {
             pix_in = vconv.read();
             borderbuf[j] = pix_in;
           }
           if (j == 0) {
             l_edge_pix = pix_in;
           }
           if (j == width - K) {
             r_edge_pix = pix_in;
           }
         }
         if (j <= border_width) {
           pix_out = l_edge_pix;
         } else if (j >= width - border_width - 1) {
           pix_out = r_edge_pix;
         } else {
           pix_out = borderbuf[j - border_width];
         }
         dst << pix_out;
       }
     }
     }

    The final code for this FPGA-friendly algorithm uses the following optimization
    directives.

     template<typename T, int K>
     static void convolution_strm(
         int width,
         int height,
         hls::stream<T> &src,
         hls::stream<T> &dst,
         const T *hcoeff,
         const T *vcoeff)
     {
     #pragma HLS DATAFLOW
     #pragma HLS ARRAY_PARTITION variable=linebuf dim=1 complete
       hls::stream<T> hconv("hconv");
       hls::stream<T> vconv("vconv");
       // Samples per line produced by the horizontal pass (definition added for completeness)
       const int vconv_xlim = width - (K - 1);
       // These assertions let HLS know the upper bounds of loops
       assert(height < MAX_IMG_ROWS);
       assert(width < MAX_IMG_COLS);
       assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
       // Horizontal convolution
       HConvH:for(int col = 0; col < height; col++) {
         HConvW:for(int row = 0; row < width; row++) {
     #pragma HLS PIPELINE
           HConv:for(int i = 0; i < K; i++) {
           }
         }
       }
       // Vertical convolution
       VConvH:for(int col = 0; col < height; col++) {
         VConvW:for(int row = 0; row < vconv_xlim; row++) {
     #pragma HLS PIPELINE
     #pragma HLS DEPENDENCE variable=linebuf inter false
           VConv:for(int i = 0; i < K; i++) {
           }
         }
       }
       // Border pixels
       Border:for (int i = 0; i < height; i++) {
         for (int j = 0; j < width; j++) {
     #pragma HLS PIPELINE
         }
       }
     }
    Each of the tasks is pipelined at the sample level. The line buffer is fully partitioned into
    registers to ensure there are no read or write limitations due to insufficient block RAM
    ports. The line buffer also requires a dependence directive. All of the tasks execute in a
    dataflow region which ensures the tasks run concurrently. The hls::streams are
    automatically implemented as FIFOs that are one element deep.
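
    A hedged sketch of how such a streaming top level might be driven from a C simulation
    test bench, assuming the template is visible to the test bench; the element type,
    coefficients, kernel size, and image dimensions below are arbitrary assumptions, not values
    from the original:

     #include <hls_stream.h>

     typedef short data_t;                        // assumed sample type
     const int TK = 5, TW = 64, TH = 64;          // assumed kernel size and test image size

     void test_convolution() {
       data_t hcoeff[TK] = {1, 2, 3, 2, 1};
       data_t vcoeff[TK] = {1, 2, 3, 2, 1};
       hls::stream<data_t> src("src"), dst("dst");
       // Fill the input stream with a raster-order test pattern
       for (int i = 0; i < TH * TW; i++)
         src << (data_t)(i & 0xFF);
       // One call processes a whole frame; data flows through the internal streams
       convolution_strm<data_t, TK>(TW, TH, src, dst, hcoeff, vcoeff);
       // Drain the output stream: exactly one sample per output pixel
       for (int i = 0; i < TH * TW; i++) {
         data_t pix = dst.read();
         // ... compare pix against a software reference model here
       }
     }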

    Summary of C for Efficient Hardware


    Minimize data input reads. Once data has been read into the block it can easily feed many
    parallel paths but the input ports can be bottlenecks to performance. Read data once and
    use a local cache if the data must be reused.


    Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM
    which, like I/O ports, only has a limited number of ports and can become a bottleneck to
    performance. Arrays can be partitioned into smaller arrays and even individual registers, but
    partitioning large arrays will result in many registers being used. Use small localized caches
    to hold results such as accumulations and then write the final result to the array.


    Seek to perform conditional branching inside pipelined tasks rather than conditionally
    executing tasks, even pipelined tasks. Conditionals are implemented as separate paths in
    the pipeline. Allowing the data from one task to flow into the next task, with the conditional
    performed inside that next task, will result in a higher performing system.


    Minimize output writes for the same reason as input reads: ports are bottlenecks.
    Replicating additional ports simply pushes the issue further out into the system.


    For C code which processes data in a streaming manner, consider using hls::streams, as
    these will enforce good coding practices. It is much more productive to design an algorithm
    in C which will result in a high-performance FPGA implementation than to debug why the
    FPGA is not operating at the required performance.

    Sequential I/O Access of Matrix Multiplier

    Original Code for the Matrix Multiplier:

     1 #include "matrixmul.h"
     2 
     3 void matrixmul(
     4       mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
     5       mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
     6       result_t res[MAT_A_ROWS][MAT_B_COLS])
     7 {
     8   // Iterate over the rows of the A matrix
     9    Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    10       // Iterate over the columns of the B matrix
    11       Col: for(int j = 0; j < MAT_B_COLS; j++) {
    12          res[i][j] = 0;
    13          // Do the inner product of a row of A and col of B
    14          Product: for(int k = 0; k < MAT_B_ROWS; k++) {
    15             res[i][j] += a[i][k] * b[k][j];
    16          }
    17       }
    18    }
    19 
    20 }

    Array res performs writes in the following sequence:

    • Write to [0][0] by the statement res[i][j] = 0.
    • Then a write to [0][0] by the accumulation res[i][j] += a[i][k] * b[k][j].
    • Then a write to [0][0] by the accumulation.
    • Then a write to [0][0] by the accumulation.
    • Write to [0][1] by res[i][j] = 0 (after index j increments).
    • Then a write to [0][1] by the accumulation.
    • Etc.
    Four consecutive writes to address [0][0] do not constitute a streaming access pattern;
    this is random access.
    There are similar issues reading arrays a and b.

    Modified Code for the Matrix Multiplier:

     1 #include "matrixmul.h"
     2 
     3 void matrixmul(
     4       mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
     5       mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
     6       result_t res[MAT_A_ROWS][MAT_B_COLS])
     7 {
     8 #pragma HLS ARRAY_RESHAPE variable=b complete dim=1
     9 #pragma HLS ARRAY_RESHAPE variable=a complete dim=2
    10 #pragma HLS INTERFACE ap_fifo port=a
    11 #pragma HLS INTERFACE ap_fifo port=b
    12 #pragma HLS INTERFACE ap_fifo port=res
    13   mat_a_t a_row[MAT_A_ROWS];
    14   mat_b_t b_copy[MAT_B_ROWS][MAT_B_COLS];
    15   int tmp = 0;
    16 
    17   // Iterate over the rowa of the A matrix
    18   Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    19     // Iterate over the columns of the B matrix
    20     Col: for(int j = 0; j < MAT_B_COLS; j++) {
    21 #pragma HLS PIPELINE rewind
    22       // Do the inner product of a row of A and col of B
    23       tmp=0;
    24       // Cache each row (so it's only read once per function)
    25       if (j == 0)
    26         Cache_Row: for(int k = 0; k < MAT_A_ROWS; k++)
    27           a_row[k] = a[i][k];
    28 
    29        // Cache all cols (so they are only read once per function)
    30      if (i == 0)
    31             Cache_Col: for(int k = 0; k < MAT_B_ROWS; k++)
    32                b_copy[k][j] = b[k][j];
    33 
    34       Product: for(int k = 0; k < MAT_B_ROWS; k++) {
    35         tmp += a_row[k] * b_copy[k][j];
    36       }
    37       res[i][j] = tmp;
    38     }
    39   }
    40 }

    To have a hardware design with sequential streaming accesses, the port accesses can only
    be the ones shown in the modified code above: each element of a, b, and res is accessed
    exactly once, in order. For the read ports, the data must be cached internally to
    ensure the design does not have to re-read the port. For the write port res, the data must
    be saved into a temporary variable and only written to the port once the final value is known.
    • In a streaming interface, the values must be accessed in sequential order.
    • In the original code, the array accesses were also port accesses, which High-Level Synthesis
    is unable to move around and optimize. The C code specified writing the value zero to port res
    at the start of every Product loop. This may be part of the intended behavior; HLS
    cannot simply decide to change the specification of the algorithm.

    Review the code and confirm the following:
    • The directives, including the FIFO interfaces, are specified in the code as
    pragmas.
    • For-loops have been added to cache the row and column reads.
    • A temporary variable is used for the accumulation and port res is only written to when
    the final result is computed for each value.
    • Because the for-loops to cache the row and column would require multiple cycles to
    perform the reads, the pipeline directive has been applied to the Col for-loop, ensuring
    these cache for-loops are automatically unrolled.
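
    For reference, a hedged sketch of a C test bench that could drive either version of
    matrixmul (the types, the matrix sizes, and the test pattern are assumptions based on the
    3x3 example above, not part of the original):

     #include <cstdio>
     #include "matrixmul.h"

     int main() {
       mat_a_t a[MAT_A_ROWS][MAT_A_COLS];
       mat_b_t b[MAT_B_ROWS][MAT_B_COLS];
       result_t hw_res[MAT_A_ROWS][MAT_B_COLS];
       // Fill the operands with a simple known pattern
       for (int i = 0; i < MAT_A_ROWS; i++)
         for (int j = 0; j < MAT_A_COLS; j++)
           a[i][j] = (mat_a_t)(i + j);
       for (int i = 0; i < MAT_B_ROWS; i++)
         for (int j = 0; j < MAT_B_COLS; j++)
           b[i][j] = (mat_b_t)(i * j + 1);
       // The call is identical for the original and the modified function
       matrixmul(a, b, hw_res);
       // Compare against a straightforward software reference
       int errors = 0;
       for (int i = 0; i < MAT_A_ROWS; i++)
         for (int j = 0; j < MAT_B_COLS; j++) {
           result_t ref = 0;
           for (int k = 0; k < MAT_B_ROWS; k++)
             ref += a[i][k] * b[k][j];
           if (hw_res[i][j] != ref) errors++;
         }
       printf("%s: %d mismatches\n", errors ? "FAIL" : "PASS", errors);
       return errors;
     }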

    Reference:

    1. Xilinx UG902, Vivado Design Suite User Guide: High-Level Synthesis

    2. Xilinx UG871, Vivado Design Suite Tutorial: High-Level Synthesis

  • Original article: https://www.cnblogs.com/wordchao/p/10950131.html