  • CUDA by Example, Chapter 3 (Partial Translation and Practice): Extracting GPU Device Parameters

    This book covers a lot of material, and much of it overlaps with other CUDA books, so I will only translate the key points. Time is money; let's learn CUDA together. Corrections are welcome.


    Since I haven't had time to read Chapters 1 and 2 carefully, we start from Chapter 3.

    I don't like depending on the book's header file, so I won't use it; I will rewrite every program myself. Some programs are too trivial to bother with, so I skip them.

    //hello.cu
    #include <stdio.h>
    #include <cuda.h>

    int main( void ) {
        printf( "Hello, World!\n" );
        return 0;
    }

    Strictly speaking, this first CUDA program is not really a CUDA program; it merely includes the CUDA header. Compile with: nvcc hello.cu -o hello

    Run with: ./hello

    It performs no work on the GPU at all.


    The second program:


    #include <stdio.h>
    #include <cuda.h>

    __global__ void kernel( void ) {}

    int main( void ) {
        kernel<<<1,1>>>();
        printf( "Hello, World!\n" );
        return 0;
    }

    This program calls a kernel function. __global__ means the function is called from the CPU (host) but executed on the GPU (device).

    As for what the parameters inside the triple angle brackets mean, see the next chapter.
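    As a quick preview (the book explains the details in the next chapter), the two values in the triple angle brackets are the launch configuration: the number of blocks in the grid and the number of threads per block. A minimal sketch launching the same empty kernel with a different configuration:

    ```cuda
    #include <stdio.h>
    #include <cuda.h>

    __global__ void kernel( void ) {}

    int main( void ) {
        // <<<blocks, threadsPerBlock>>>: here 2 blocks of 4 threads each
        kernel<<<2,4>>>();
        // Kernel launches are asynchronous; wait for the GPU to finish
        cudaDeviceSynchronize();
        printf( "Launched 2 blocks of 4 threads each\n" );
        return 0;
    }
    ```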


    #include <stdio.h>
    #include <cuda.h>

    __global__ void add( int a, int b, int *c ) {
        *c = a + b;
    }

    int main( void )
    {
        int c;
        int *dev_c;
        cudaMalloc( (void**)&dev_c, sizeof(int) );
        add<<<1,1>>>( 2, 7, dev_c );
        cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
        printf( "2 + 7 = %d\n", c );
        cudaFree( dev_c );
        return 0;
    }

    cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies results from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU.

    cudaFree() releases GPU memory; it is the GPU counterpart of free() on the CPU, just operating on device memory instead.
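    All of these runtime calls return a cudaError_t status, which the program above silently ignores. A sketch of the same program with basic error checking (the CHECK macro name is my own; the book's header defines a similar HANDLE_ERROR helper, which this post deliberately avoids):

    ```cuda
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>

    // CHECK is a hypothetical helper name, not a CUDA API
    #define CHECK(call) do {                                        \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf( stderr, "CUDA error %s at %s:%d\n",            \
                     cudaGetErrorString(err), __FILE__, __LINE__ ); \
            exit( EXIT_FAILURE );                                   \
        }                                                           \
    } while (0)

    __global__ void add( int a, int b, int *c ) {
        *c = a + b;
    }

    int main( void ) {
        int c;
        int *dev_c;
        CHECK( cudaMalloc( (void**)&dev_c, sizeof(int) ) );
        add<<<1,1>>>( 2, 7, dev_c );
        CHECK( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) );
        printf( "2 + 7 = %d\n", c );
        CHECK( cudaFree( dev_c ) );
        return 0;
    }
    ```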


    The key part of this chapter (for me) is 3.3, querying the GPU (device).


    The idea of this section: if you don't have the spec sheet for your GPU, or don't want to pull the card out to look at it, or want your program to adapt to a wider range of hardware, you can query the GPU's parameters programmatically.


    I'll skip the book's lengthy filler; here are just the useful parts.


    Many machines today contain more than one GPU, especially systems built for GPU computing, so we first obtain the number of CUDA devices with:

    int count;
    cudaGetDeviceCount(&count);


    Then, through the cudaDeviceProp struct, we can obtain each device's properties.

    The definition below is from CUDA 3.0 as an example.

    This struct is already defined by the CUDA runtime, so you can use it in your own program directly without defining it yourself.


    struct cudaDeviceProp {
        char name[256];           // device name
        size_t totalGlobalMem;    // total global memory, in bytes
        size_t sharedMemPerBlock; // maximum shared memory available to a thread block, in bytes; shared by all blocks resident on a multiprocessor
        int regsPerBlock;         // maximum number of 32-bit registers available to a thread block; shared by all blocks resident on a multiprocessor
        int warpSize;             // warp size, in threads
        size_t memPitch;          // maximum pitch, in bytes, allowed by memory-copy functions for regions allocated with cudaMallocPitch()
        int maxThreadsPerBlock;   // maximum number of threads per block
        int maxThreadsDim[3];     // maximum size of each dimension of a block
        int maxGridSize[3];       // maximum size of each dimension of a grid
        size_t totalConstMem;     // total constant memory, in bytes
        int major;                // major compute capability number
        int minor;                // minor compute capability number
        int clockRate;            // clock frequency, in kilohertz
        size_t textureAlignment;  // texture alignment requirement
        int deviceOverlap;        // whether the device can execute a cudaMemcpy() and a kernel concurrently
        int multiProcessorCount;  // number of multiprocessors on the device
        int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
        int integrated;           // whether the GPU is integrated
        int canMapHostMemory;     // whether the device can map host memory into the CUDA device address space
        int computeMode;          // compute mode
        int maxTexture1D;         // maximum size of 1D textures
        int maxTexture2D[2];      // maximum dimensions of 2D textures
        int maxTexture3D[3];      // maximum dimensions of 3D textures
        int maxTexture2DArray[3]; // maximum dimensions of 2D texture arrays
        int concurrentKernels;    // whether the GPU supports executing multiple kernels concurrently
    };
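    One practical use of these fields: fill in only the properties you need and let the runtime pick a matching device via cudaChooseDevice() and cudaSetDevice(). A sketch (the compute capability 1.3 requirement here is just an illustrative choice):

    ```cuda
    #include <stdio.h>
    #include <string.h>
    #include <cuda.h>

    int main( void ) {
        cudaDeviceProp prop;
        int dev;

        // Start from an empty property set, then request a minimum
        // compute capability (1.3 is just an example requirement)
        memset( &prop, 0, sizeof(cudaDeviceProp) );
        prop.major = 1;
        prop.minor = 3;

        // Ask the runtime for the closest matching device, then select it
        cudaChooseDevice( &dev, &prop );
        printf( "Closest matching device ID: %d\n", dev );
        cudaSetDevice( dev );
        return 0;
    }
    ```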


    Example program:


    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>

    int main()
    {
        int i;
        int count;
        cudaGetDeviceCount(&count);
        printf("The count of CUDA devices:%d\n",count);

        cudaDeviceProp prop;
        for(i=0;i<count;i++)
        {
            cudaGetDeviceProperties(&prop,i);
            printf("\n---General Information for device %d---\n",i);
            printf("Name of the cuda device: %s\n",prop.name);
            printf("Compute capability: %d.%d\n",prop.major,prop.minor);
            printf("Clock rate: %d\n",prop.clockRate);
            printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  ");
            if(prop.deviceOverlap)
                printf("Enabled\n");
            else
                printf("Disabled\n");
            printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
            if(prop.kernelExecTimeoutEnabled)
                printf("Enabled\n");
            else
                printf("Disabled\n");

            printf("\n---Memory Information for device %d ---\n",i);
            printf("Total global mem in bytes: %ld\n",prop.totalGlobalMem);
            printf("Total constant Mem: %ld\n",prop.totalConstMem);
            printf("Max mem pitch for memory copies in bytes: %ld\n",prop.memPitch);
            printf("Texture Alignment: %ld\n",prop.textureAlignment);

            printf("\n---MP Information for device %d---\n",i);
            printf("Multiprocessor count: %d\n",prop.multiProcessorCount);
            printf("Shared mem per mp(block): %ld\n",prop.sharedMemPerBlock);
            printf("Registers per mp(block):%d\n",prop.regsPerBlock);
            printf("Threads in warp:%d\n",prop.warpSize);
            printf("Max threads per block: %d\n",prop.maxThreadsPerBlock);
            printf("Max thread dimensions in a block:(%d,%d,%d)\n",prop.maxThreadsDim[0],prop.maxThreadsDim[1],prop.maxThreadsDim[2]);
            printf("Max blocks dimensions in a grid:(%d,%d,%d)\n",prop.maxGridSize[0],prop.maxGridSize[1],prop.maxGridSize[2]);
            printf("\n");

            printf("\nIs the device an integrated GPU:");
            if(prop.integrated)
                printf("Yes!\n");
            else
                printf("No!\n");

            printf("Whether the device can map host memory into CUDA device address space:");
            if(prop.canMapHostMemory)
                printf("Yes!\n");
            else
                printf("No!\n");

            printf("Device's computing mode:%d\n",prop.computeMode);

            printf("\nThe maximum size for 1D textures:%d\n",prop.maxTexture1D);
            printf("The maximum dimensions for 2D textures:(%d,%d)\n",prop.maxTexture2D[0],prop.maxTexture2D[1]);
            printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n",prop.maxTexture3D[0],prop.maxTexture3D[1],prop.maxTexture3D[2]);
    //      printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n",prop.maxTexture2DArray[0],prop.maxTexture2DArray[1],prop.maxTexture2DArray[2]);

            printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
            if(prop.concurrentKernels)
                printf("Yes!\n");
            else
                printf("No!\n");
        }
        return 0;
    }

    Run result:

    The count of CUDA devices:1

    ---General Information for device 0---
    Name of the cuda device: GeForce GTX 470
    Compute capability: 2.0
    Clock rate: 1215000
    Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
    Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

    ---Memory Information for device 0 ---
    Total global mem in bytes: 1341325312
    Total constant Mem: 65536
    Max mem pitch for memory copies in bytes: 2147483647
    Texture Alignment: 512

    ---MP Information for device 0---
    Multiprocessor count: 14
    Shared mem per mp(block): 49152
    Registers per mp(block):32768
    Threads in warp:32
    Max threads per block: 1024
    Max thread dimensions in a block:(1024,1024,64)
    Max blocks dimensions in a grid:(65535,65535,65535)


    Is the device an integrated GPU:No!
    Whether the device can map host memory into CUDA device address space:Yes!
    Device's computing mode:0

     The maximum size for 1D textures:65536
    The maximum dimensions for 2D textures:(65536,65535)
    The maximum dimensions for 3D textures:(2048,2048,2048)
    Whether the device supports executing multiple kernels within the same context simultaneously:Yes!




    Reference book: CUDA by Example

  • Original post: https://www.cnblogs.com/javawebsoa/p/3001493.html