
    An Easy Introduction to CUDA C and C++

    This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow functions to be executed on the GPU by many threads in parallel.

    CUDA Programming Model Basics

    Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.

    The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
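
    As a minimal sketch of these ideas (the kernel name and launch configuration here are illustrative, not taken from the SAXPY example below), a kernel is declared with the __global__ specifier and launched from host code using the <<<...>>> execution configuration syntax:

      __global__ void my_kernel(void)   // executes on the device
      {
        // each launched GPU thread runs this function body
      }

      int main(void)
      {
        my_kernel<<<1, 64>>>();     // host launches 64 threads in 1 thread block
        cudaDeviceSynchronize();    // wait for the device to finish
        return 0;
      }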

    Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:

    1. Declare and allocate host and device memory.
    2. Initialize host data.
    3. Transfer data from the host to the device.
    4. Execute one or more kernels.
    5. Transfer results from the device to the host.

    Keeping this sequence of operations in mind, let’s look at a CUDA C example.

    A First CUDA C Program

    In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:

    #include <stdio.h>
    #include <math.h>
    
    __global__
    void saxpy(int n, float a, float *x, float *y)
    {
      // Compute a unique global index for this thread from its
      // block index, the block size, and its thread index within the block
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];  // skip threads beyond the array bounds
    }
    
    int main(void)
    {
      int N = 1<<20;
      float *x, *y, *d_x, *d_y;
      x = (float*)malloc(N*sizeof(float));
      y = (float*)malloc(N*sizeof(float));
    
      cudaMalloc(&d_x, N*sizeof(float)); 
      cudaMalloc(&d_y, N*sizeof(float));
    
      for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
      }
    
      cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    
      // Perform SAXPY on 1M elements, using 256 threads per block and
      // rounding the number of blocks up so every element is covered
      saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    
      cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    
      float maxError = 0.0f;
      for (int i = 0; i < N; i++)
        maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
      printf("Max error: %f\n", maxError);
    
      cudaFree(d_x);
      cudaFree(d_y);
      free(x);
      free(y);
    }
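
    Assuming the listing is saved with the .cu extension that NVIDIA's nvcc compiler expects (say, saxpy.cu; the file name is my choice, not the post's), it can be compiled and run directly. Since each element is computed as 2.0f*1.0f + 2.0f = 4.0f, which is exact in single precision, the program should report a maximum error of zero:

      nvcc -o saxpy saxpy.cu
      ./saxpy
      Max error: 0.000000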
    

    The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.

    Host Code

    The main function declares two pairs of arrays: x and y point to host arrays allocated with the standard malloc, while d_x and d_y point to device arrays allocated with cudaMalloc (the d_ prefix is a common convention for device pointers).

      float *x, *y, *d_x, *d_y;
      x = (float*)malloc(N*sizeof(float));
      y = (float*)malloc(N*sizeof(float));
    
      cudaMalloc(&d_x, N*sizeof(float)); 
      cudaMalloc(&d_y, N*sizeof(float));
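
    One detail this introductory example glosses over is error handling: every CUDA runtime call (cudaMalloc, cudaMemcpy, cudaFree) returns a cudaError_t, which the code above ignores for brevity. A minimal sketch of how real code might check these return values (the CHECK macro is my own illustration, not part of the original example):

      #include <stdio.h>
      #include <stdlib.h>

      // Abort with a readable message if a CUDA runtime call fails
      #define CHECK(call)                                           \
        do {                                                        \
          cudaError_t err_ = (call);                                \
          if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
          }                                                         \
        } while (0)

      // Usage, for example:
      //   CHECK(cudaMalloc(&d_x, N*sizeof(float)));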
    