  • OpenCL Study Notes (3): OpenCL Installation, Programming Overview, and Hello World

    Reposting is welcome; please credit the source: this article is from Bin's column at blog.csdn.net/xbinworld. Technical discussion QQ group: 433250724; anyone interested in algorithms, technology, and applications is welcome to join.

    OpenCL Installation

    I am not going to spend much space on installation, because OpenCL can be set up on far too many platform and environment combinations, covering both GPUs and FPGAs and many different devices. Instead, I list some of the better guides I found online for convenience. My focus is on FPGAs; for GPUs, please search the web yourself:

    Altera OpenCL SDK download:
    https://www.altera.com.cn/products/design-software/embedded-software-developers/opencl/overview.html

    Altera's installation guide: the "Altera SDK for OpenCL Getting Started Guide".

    In theory, the two items above are all you need. What you have to do: download the OpenCL SDK, or the Quartus II software (which includes the SDK); download the support package for your development board (Altera provides a few, but for other boards you may have to get it from the board vendor); and obtain an OpenCL license, without which you cannot compile.

    Some Chinese-language write-ups worth reading:
    "Altera OpenCL入门(beta版)" (Getting Started with Altera OpenCL, beta): http://wenku.baidu.com/link?url=bkIyo01jXeWfdGsrA_M0J1zomx6f0lYk0NPf-9-MNaC0OkWRmukDwY5yFz0I3Wrctqi5qD3jC8BhQQzjoqw1HXpUgIM68_blz5Cr3vxpaZC

    "【Altera SoC体验之旅】+ 正式开启OpenCL模式" (Altera SoC hands-on: getting started with OpenCL mode):
    http://home.eeworld.com.cn/my/space-uid-169743-blogid-247647.html

    OpenCL Programming Overview

    The figure below outlines the OpenCL programming framework. It uses a GPU as the device, but other device types work the same way.
    [Figure: OpenCL programming framework, with a GPU as the example device]
    From the figure (see also "OpenCL 编程入门"):
    1. The compute device is heterogeneous: it can be a CPU or a GPU, and nowadays there are also FPGAs and Xeon Phi coprocessors (MIC) that support OpenCL.
    2. The OpenCL API objects are tied together by a Context.
    3. Running a program on the device goes through the steps compile -> set arguments -> run.

    Key terms:
    Platform: the host plus the devices managed by the OpenCL framework make up a platform; through it an application can share resources with the devices and execute kernels on them. In practice one vendor usually corresponds to one platform; Intel and AMD, for example, each provide their own.

    Device: officially, a collection of Compute Units. A GPU is the typical device; Intel's and AMD's multi-core CPUs also expose an OpenCL interface, so they can act as devices too.

    Context: the environment in which resources on a platform are shared and used, covering kernels, devices, memory objects, command queues, and so on. In practice one platform usually corresponds to one context.

    Program: an OpenCL program, made up of kernel functions plus other functions and declarations.
    Kernel: a function that can be called from the host and runs on the device.

    Memory Object: an object used to pass data between host and device, normally mapped to global memory in the OpenCL program. There are two concrete kinds: Buffer Objects and Image Objects.

    Command Queue: manages the commands issued to a given device. Commands in a queue may execute in order or out of order, and one device can have several command queues.

    NDRange: the main interface through which the host launches a device-side kernel. There are other launch mechanisms, but NDRange is by far the most common; it partitions the work into groups. The differences will become clear once we use them.

    Seen from the host, the main OpenCL execution flow looks like this:
    [Figure: host-side OpenCL execution flow]
    In fact, the host part of most simple programs is nearly identical and rarely needs many changes; the example below makes that concrete.
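
    Before the full Altera example in the next section, here is a minimal generic host-side sketch using the standard OpenCL 1.x C API (error checking omitted). The kernel source string kSource and the kernel name "vadd" are placeholders invented for illustration; an FPGA flow would instead load a precompiled binary, as the Altera example does. The numbered comments follow the objects and the compile -> set arguments -> run flow described above.

    #include <CL/cl.h>
    #include <stddef.h>

    void run_vadd(const char *kSource, const float *a, const float *b,
                  float *c, size_t n) {
      cl_platform_id platform;                        // 1. Platform
      clGetPlatformIDs(1, &platform, NULL);

      cl_device_id device;                            // 2. Device
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

      cl_context ctx =                                // 3. Context ties everything together
          clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
      cl_command_queue q =                            // 4. Command queue on the device
          clCreateCommandQueue(ctx, device, 0, NULL);

      cl_program prog =                               // 5. Program: build the kernel source
          clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
      clBuildProgram(prog, 1, &device, "", NULL, NULL);
      cl_kernel k = clCreateKernel(prog, "vadd", NULL);

      size_t bytes = n * sizeof(float);               // 6. Memory objects live in global memory
      cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 bytes, (void *)a, NULL);
      cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 bytes, (void *)b, NULL);
      cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

      clSetKernelArg(k, 0, sizeof(cl_mem), &da);      // 7. Set kernel arguments
      clSetKernelArg(k, 1, sizeof(cl_mem), &db);
      clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

      // 8. NDRange launch: one work-item per element, local size left to the runtime
      clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
      // 9. Blocking read of the result back to the host
      clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

      clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
      clReleaseKernel(k); clReleaseProgram(prog);
      clReleaseCommandQueue(q); clReleaseContext(ctx);
    }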

    The First Program
    Here is Altera's official vector-add example, essentially at hello-world level, except that its host code is written very generically (it is prepared to program multiple devices uniformly). Walk through it and check that it matches the figures above; once you have read this one, most other examples will look much the same.

    Host part (the kernel comes at the end):

    // Copyright (C) 2013-2014 Altera Corporation, San Jose, California, USA. All rights reserved. 
    // Permission is hereby granted, free of charge, to any person obtaining a copy of this 
    // software and associated documentation files (the "Software"), to deal in the Software 
    // without restriction, including without limitation the rights to use, copy, modify, merge, 
    // publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to 
    // whom the Software is furnished to do so, subject to the following conditions: 
    // The above copyright notice and this permission notice shall be included in all copies or 
    // substantial portions of the Software. 
    //  
    // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
    // EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 
    // OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
    // NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 
    // HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 
    // WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
    // FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 
    // OTHER DEALINGS IN THE SOFTWARE. 
    //  
    // This agreement shall be governed in all respects by the laws of the State of California and 
    // by the laws of the United States of America. 
    
    ///////////////////////////////////////////////////////////////////////////////////
    // This host program executes a vector addition kernel to perform:
    //  C = A + B
    // where A, B and C are vectors with N elements.
    //
    // This host program supports partitioning the problem across multiple OpenCL
    // devices if available. If there are M available devices, the problem is
    // divided so that each device operates on N/M points. The host program
    // assumes that all devices are of the same type (that is, the same binary can
    // be used), but the code can be generalized to support different device types
    // easily.
    //
    // Verification is performed against the same computation on the host CPU.
    ///////////////////////////////////////////////////////////////////////////////////
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "CL/opencl.h"
    #include "AOCL_Utils.h"
    
    using namespace aocl_utils;
    
    // OpenCL runtime configuration
    cl_platform_id platform = NULL;
    unsigned num_devices = 0;
    scoped_array<cl_device_id> device; // num_devices elements
    cl_context context = NULL;
    scoped_array<cl_command_queue> queue; // num_devices elements
    cl_program program = NULL;
    scoped_array<cl_kernel> kernel; // num_devices elements
    scoped_array<cl_mem> input_a_buf; // num_devices elements
    scoped_array<cl_mem> input_b_buf; // num_devices elements
    scoped_array<cl_mem> output_buf; // num_devices elements
    
    // Problem data.
    const unsigned N = 1000000; // problem size
    scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
    scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
    scoped_array<scoped_array<float> > ref_output; // num_devices elements
    scoped_array<unsigned> n_per_device; // num_devices elements
    
    // Function prototypes
    float rand_float();
    bool init_opencl();
    void init_problem();
    void run();
    void cleanup();
    
    // Entry point.
    int main() {
      // Initialize OpenCL.
      if(!init_opencl()) {
        return -1;
      }
    
      // Initialize the problem data.
      // Requires the number of devices to be known.
      init_problem();
    
      // Run the kernel.
      run();
    
      // Free the resources allocated
      cleanup();
    
      return 0;
    }
    
    /////// HELPER FUNCTIONS ///////
    
    // Randomly generate a floating-point number between -10 and 10.
    float rand_float() {
      return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
    }
    
    // Initializes the OpenCL objects.
    bool init_opencl() {
      cl_int status;
    
      printf("Initializing OpenCL
    ");
    
      if(!setCwdToExeDir()) {
        return false;
      }
    
      // Get the OpenCL platform.
      platform = findPlatform("Altera");
      if(platform == NULL) {
        printf("ERROR: Unable to find Altera OpenCL platform.
    ");
        return false;
      }
    
      // Query the available OpenCL device.
      device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
      printf("Platform: %s
    ", getPlatformName(platform).c_str());
      printf("Using %d device(s)
    ", num_devices);
      for(unsigned i = 0; i < num_devices; ++i) {
        printf("  %s
    ", getDeviceName(device[i]).c_str());
      }
    
      // Create the context.
      context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
      checkError(status, "Failed to create context");
    
      // Create the program for all devices. Use the first device as the
      // representative device (assuming all devices are of the same type).
      std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
      printf("Using AOCX: %s
    ", binary_file.c_str());
      program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);
    
      // Build the program that was just created.
      status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
      checkError(status, "Failed to build program");
    
      // Create per-device objects.
      queue.reset(num_devices);
      kernel.reset(num_devices);
      n_per_device.reset(num_devices);
      input_a_buf.reset(num_devices);
      input_b_buf.reset(num_devices);
      output_buf.reset(num_devices);
    
      for(unsigned i = 0; i < num_devices; ++i) {
        // Command queue.
        queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
        checkError(status, "Failed to create command queue");
    
        // Kernel.
        const char *kernel_name = "vectorAdd";
        kernel[i] = clCreateKernel(program, kernel_name, &status);
        checkError(status, "Failed to create kernel");
    
        // Determine the number of elements processed by this device.
        n_per_device[i] = N / num_devices; // number of elements handled by this device
    
        // Spread out the remainder of the elements over the first
        // N % num_devices.
        if(i < (N % num_devices)) {
          n_per_device[i]++;
        }
    
        // Input buffers.
        input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            n_per_device[i] * sizeof(float), NULL, &status);
        checkError(status, "Failed to create buffer for input A");
    
        input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            n_per_device[i] * sizeof(float), NULL, &status);
        checkError(status, "Failed to create buffer for input B");
    
        // Output buffer.
        output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 
            n_per_device[i] * sizeof(float), NULL, &status);
        checkError(status, "Failed to create buffer for output");
      }
    
      return true;
    }
    
    // Initialize the data for the problem. Requires num_devices to be known.
    void init_problem() {
      if(num_devices == 0) {
        checkError(-1, "No devices");
      }
    
      input_a.reset(num_devices);
      input_b.reset(num_devices);
      output.reset(num_devices);
      ref_output.reset(num_devices);
    
      // Generate input vectors A and B and the reference output consisting
      // of a total of N elements.
      // We create separate arrays for each device so that each device has an
      // aligned buffer. 
      for(unsigned i = 0; i < num_devices; ++i) {
        input_a[i].reset(n_per_device[i]);
        input_b[i].reset(n_per_device[i]);
        output[i].reset(n_per_device[i]);
        ref_output[i].reset(n_per_device[i]);
    
        for(unsigned j = 0; j < n_per_device[i]; ++j) {
          input_a[i][j] = rand_float();
          input_b[i][j] = rand_float();
          ref_output[i][j] = input_a[i][j] + input_b[i][j];
        }
      }
    }
    
    void run() {
      cl_int status;
    
      const double start_time = getCurrentTimestamp();
    
      // Launch the problem for each device.
      scoped_array<cl_event> kernel_event(num_devices);
      scoped_array<cl_event> finish_event(num_devices);
    
      for(unsigned i = 0; i < num_devices; ++i) {
    
        // Transfer inputs to each device. Each of the host buffers supplied to
        // clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
        // for the host-to-device transfer.
        cl_event write_event[2];
        status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
            0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
        checkError(status, "Failed to transfer input A");
    
        status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
            0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
        checkError(status, "Failed to transfer input B");
    
        // Set kernel arguments.
        unsigned argi = 0;
    
        status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
        checkError(status, "Failed to set argument %d", argi - 1);
    
        status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
        checkError(status, "Failed to set argument %d", argi - 1);
    
        status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
        checkError(status, "Failed to set argument %d", argi - 1);
    
        // Enqueue kernel.
        // Use a global work size corresponding to the number of elements to add
        // for this device.
        // 
        // We don't specify a local work size and let the runtime choose
        // (it'll choose to use one work-group with the same size as the global
        // work-size).
        //
        // Events are used to ensure that the kernel is not launched until
        // the writes to the input buffers have completed.
        const size_t global_work_size = n_per_device[i];
        printf("Launching for device %d (%d elements)
    ", i, global_work_size);
    
        status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
            &global_work_size, NULL, 2, write_event, &kernel_event[i]);
        checkError(status, "Failed to launch kernel");
    
        // Read the result. This is the final operation.
        status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
            0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);
    
        // Release local events.
        clReleaseEvent(write_event[0]);
        clReleaseEvent(write_event[1]);
      }
    
      // Wait for all devices to finish.
      clWaitForEvents(num_devices, finish_event);
    
      const double end_time = getCurrentTimestamp();
    
      // Wall-clock time taken.
      printf("
    Time: %0.3f ms
    ", (end_time - start_time) * 1e3);
    
      // Get kernel times using the OpenCL event profiling API.
      for(unsigned i = 0; i < num_devices; ++i) {
        cl_ulong time_ns = getStartEndTime(kernel_event[i]);
        printf("Kernel time (device %d): %0.3f ms
    ", i, double(time_ns) * 1e-6);
      }
    
      // Release all events.
      for(unsigned i = 0; i < num_devices; ++i) {
        clReleaseEvent(kernel_event[i]);
        clReleaseEvent(finish_event[i]);
      }
    
      // Verify results.
      bool pass = true;
      for(unsigned i = 0; i < num_devices && pass; ++i) {
        for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
          if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
            printf("Failed verification @ device %d, index %d
    Output: %f
    Reference: %f
    ",
                i, j, output[i][j], ref_output[i][j]);
            pass = false;
          }
        }
      }
    
      printf("
    Verification: %s
    ", pass ? "PASS" : "FAIL");
    }
    
    // Free the resources allocated during initialization
    void cleanup() {
      for(unsigned i = 0; i < num_devices; ++i) {
        if(kernel && kernel[i]) {
          clReleaseKernel(kernel[i]);
        }
        if(queue && queue[i]) {
          clReleaseCommandQueue(queue[i]);
        }
        if(input_a_buf && input_a_buf[i]) {
          clReleaseMemObject(input_a_buf[i]);
        }
        if(input_b_buf && input_b_buf[i]) {
          clReleaseMemObject(input_b_buf[i]);
        }
        if(output_buf && output_buf[i]) {
          clReleaseMemObject(output_buf[i]);
        }
      }
    
      if(program) {
        clReleaseProgram(program);
      }
      if(context) {
        clReleaseContext(context);
      }
    }
    
    

    Kernel part:

    // ACL kernel for adding two input vectors
    __kernel void vectorAdd(__global const float *x, 
                            __global const float *y, 
                            __global float *restrict z)
    {
        // get index of the work item
        int index = get_global_id(0);
    
        // add the vector elements
        z[index] = x[index] + y[index];
    }

    The kernel is only these few lines. __global is an address-space qualifier indicating that the data is stored in external memory (for example DDR); the rest of the syntax is identical to standard C, so there is not much more to say.
    The most important call is get_global_id, the standard idiom in the multi-work-item execution model: each work item uses its id to pick the element it operates on, and because every item runs the same code, the add kernel no longer needs the familiar for() loop. The kernel configuration can also be customized, including the number of compute units, SIMD mode, and so on, to control how parallel the design is; more parallelism usually gives higher performance but also consumes more resources. A sketch of what that looks like follows below.
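
    As a rough illustration (not taken from the example above), the Altera SDK for OpenCL lets you annotate a kernel with attributes that control replication and vectorization; the attribute names below come from Altera's programming guide, and the specific values are only illustrative:

    // Illustrative values only: replicate the kernel into 2 compute units and
    // vectorize each work-group 4-wide with SIMD. num_simd_work_items requires
    // a fixed work-group size, declared with reqd_work_group_size.
    __attribute__((num_compute_units(2)))
    __attribute__((num_simd_work_items(4)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void vectorAdd(__global const float *x,
                            __global const float *y,
                            __global float *restrict z)
    {
        int index = get_global_id(0);
        z[index] = x[index] + y[index];
    }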

    For the details of kernel functions, see "The OpenCL Specification 1.0" and Altera's OpenCL programming guide; I will go into them in later notes.

    Memory Model
    Finally, a few words on OpenCL's memory model; see the diagram below:
    [Figure: OpenCL memory model]

    Using the memory variables of a kernel to explain it simply: data created with clCreateBuffer and passed with clSetKernelArg sits in global memory and constant memory; register variables inside the kernel sit in private memory; the kernel's internal arrays and scratch buffers sit in local memory. The diagram also shows that the device does not access global memory directly but goes through a cache. You can imagine that when the concurrently running work items all touch memory that lands in the same cache, memory throughput is at its best. Mapped onto work groups, this means the program should be designed so that work items in the same work group access contiguous memory, which improves memory-access efficiency.
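
    As a small sketch of the four address spaces (this kernel is made up for illustration and is not part of the example above):

    // Each address-space qualifier marks where the data lives.
    __kernel void scale(__global float *data,      // global memory: buffers visible to the host
                        __constant float *coeff,   // constant memory: read-only, cached
                        __local float *tile)       // local memory: shared within a work-group
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float tmp = data[gid];                     // private memory: per-work-item registers
        tile[lid] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);              // make the work-group's writes visible
        data[gid] = tile[lid] * coeff[0];
    }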

    That is all for this post.
