GPU
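GPU stands for Graphics Processing Unit. The term is defined relative to the CPU: because graphics processing has become increasingly important in modern computers (especially in home systems and for gaming enthusiasts), a dedicated processor core for graphics is needed.
Like CPUs, GPUs are produced by many vendors, but only three are widely known, so many people assume that AMD, NVIDIA, and Intel are the only GPU manufacturers.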
nVidia GPU | AMD GPU | Intel MIC coprocessor | nVidia Tegra 4 | AMD ARM server |
CUDA C/C++, CUDA Fortran |
OpenCL | MIC OpenMP | CUDA |
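GPU Parallel Computing
- Works cooperatively with the CPU (the host)
- Has its own memory
- Can run 1,000 threads at the same time
- Single precision: 4.58 TFLOPS; double precision: 1.31 TFLOPS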
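The figures above depend on the specific card. As a rough sketch, the CUDA runtime can report what the installed GPU actually offers; the helper name print_device_info is made up for this illustration, and the "max resident threads" line is only an upper bound computed from the reported properties.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: prints basic properties of every visible CUDA device.
// Returns the number of devices found, or -1 if the runtime reports an error.
int print_device_info(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return -1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            return -1;
        }
        printf("Device %d: %s\n", dev, prop.name);
        // The GPU has its own memory, separate from host RAM.
        printf("  global memory:         %lu MiB\n",
               (unsigned long)(prop.totalGlobalMem >> 20));
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        // Upper bound on threads resident on the whole device at once.
        printf("  max resident threads:  %d\n",
               prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    }
    return count;
}

int main(void) {
    return print_device_info() < 0 ? 1 : 0;
}
```

Compile with nvcc like the other examples in this post; the output varies from machine to machine.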
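The main approaches to GPU programming are those listed in the table above: CUDA C/C++, CUDA Fortran, OpenCL, and OpenMP on MIC.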
When computing on the GPU, the main interactions with the CPU are:
- Data exchange between the CPU and the GPU
- Data exchange within the GPU
GPU Programming: CUDA
http://developer.nvidia.com/cuda/cuda-downloads
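Choose the version that suits your system. I downloaded the 5.0 notebook version.
For installation steps, see http://blog.csdn.net/diyoosjtu/article/details/8454253
After installation, open VS and go to New; you will find an nVidia entry that contains a CUDA item.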
Outline:
- Hello World
  - Basic syntax, compile & run
- GPU memory management
  - Malloc/free
  - memcpy
- Writing parallel kernels
  - Threads & blocks
  - Memory hierarchy
    // hello_world.c
    #include <stdio.h>

    void hello_world_kernel() {
        printf("Hello World\n");
    }

    int main() {
        hello_world_kernel();
        return 0;
    }

Compile & Run:

    gcc hello_world.c
    ./a.out
CUDA:
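    // hello_world.cu
    #include <stdio.h>

    // __global__ marks a function that runs on the GPU and is launched from the host.
    // Device-side printf requires compute capability 2.0 or higher.
    __global__ void hello_world_kernel() {
        printf("Hello World\n");
    }

    int main() {
        hello_world_kernel<<<1,1>>>();
        // Kernel launches are asynchronous: wait for the kernel to finish
        // so that its output is flushed before the program exits.
        cudaDeviceSynchronize();
        return 0;
    }

Compile & Run:

    nvcc hello_world.cu
    ./a.out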
Main steps of a GPU computation:
- Allocate CPU memory for n integers
- Allocate GPU memory for n integers
- Initialize GPU memory to 0s
- Copy from CPU to GPU
- Call the __global__ function to compute (__global__ is the keyword that marks a CUDA kernel)
- Copy from GPU to CPU
- Print the values
- Free the memory
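The steps above can be sketched as one small program. This is only an illustration under stated assumptions: the kernel add_delta, the helper run_pipeline, the block size of 256, and the value 100 are all made up for the example and do not come from the original post.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds `delta` to one element.
__global__ void add_delta(int *data, int n, int delta) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {              // guard: the last block may have spare threads
        data[idx] += delta;
    }
}

// Runs the steps for n integers.
// Returns the number of wrong results (0 = success), or -1 on an error.
int run_pipeline(int n) {
    size_t nbytes = (size_t)n * sizeof(int);

    int *h_a = (int *)malloc(nbytes);                       // allocate CPU memory
    if (h_a == NULL) return -1;
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a = NULL;
    if (cudaMalloc((void **)&d_a, nbytes) != cudaSuccess) { // allocate GPU memory
        free(h_a);
        return -1;
    }
    cudaMemset(d_a, 0, nbytes);                             // initialize GPU memory to 0s
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // copy CPU -> GPU

    int threads = 256;                                      // assumed block size
    int blocks = (n + threads - 1) / threads;               // enough blocks to cover n
    add_delta<<<blocks, threads>>>(d_a, n, 100);            // call the __global__ function

    cudaError_t err = cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost); // copy GPU -> CPU

    int wrong = (err == cudaSuccess) ? 0 : -1;
    for (int i = 0; err == cudaSuccess && i < n; ++i) {     // print the values
        printf("%d ", h_a[i]);
        if (h_a[i] != i + 100) ++wrong;
    }
    printf("\n");

    cudaFree(d_a);                                          // free GPU memory
    free(h_a);                                              // free CPU memory
    return wrong;
}

int main(void) {
    return run_pipeline(16) == 0 ? 0 : 1;
}
```

The cudaMemset call is kept only to mirror the list; the host-to-device copy that follows overwrites the zeros.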
Main functions:

    // Host (CPU) manages device (GPU) memory:
    cudaMalloc(void **pointer, size_t nbytes)
    cudaMemset(void *pointer, int value, size_t count)
    cudaFree(void *pointer)

    // Example: allocate, zero, and free 1024 ints on the device
    int nbytes = 1024 * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void **)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

    cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
    // Returns after the copy is complete:
    //   - blocks the CPU thread until all bytes have been copied
    //   - doesn't start copying until previous CUDA calls complete
    // enum cudaMemcpyKind values:
    //   cudaMemcpyHostToDevice
    //   cudaMemcpyDeviceToHost
    //   cudaMemcpyDeviceToDevice
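To make the three cudaMemcpyKind directions concrete, here is a minimal sketch that sends a small array host to device, device to device, and back to the host, checking each return code. The helper name round_trip and the buffer names are assumptions made for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: copies src through two device buffers and back into dst.
// Returns 0 if dst matches src, 1 on a mismatch, -1 if any CUDA call fails.
int round_trip(const int *src, int *dst, int n) {
    size_t nbytes = (size_t)n * sizeof(int);
    int *d_a = NULL;
    int *d_b = NULL;
    int status = -1;

    // Every runtime call returns a cudaError_t; stop at the first failure.
    if (cudaMalloc((void **)&d_a, nbytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, nbytes) == cudaSuccess &&
        cudaMemcpy(d_a, src, nbytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice) == cudaSuccess &&
        cudaMemcpy(dst, d_b, nbytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        status = 0;
        for (int i = 0; i < n; ++i) {
            if (dst[i] != src[i]) status = 1;
        }
    }

    cudaFree(d_a);   // cudaFree on a null pointer performs no operation
    cudaFree(d_b);
    return status;
}

int main(void) {
    int src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int dst[8] = {0};
    int status = round_trip(src, dst, 8);
    printf("round trip status: %d\n", status);
    return status == 0 ? 0 : 1;
}
```

The device-to-device copy never touches host memory, which corresponds to the "data exchange within the GPU" case mentioned earlier.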
Here, <<<grid,block>>> specifies the launch configuration:
- 2-level hierarchy: blocks and grid
- Block = a group of up to 1024 threads
- Grid = all blocks for a given kernel launch
- E.g. total 72 threads
  - blockDim=12, gridDim=6
- Threads in a block can:
  - Synchronize their execution
  - Communicate via shared memory
- Sizes of the grid and blocks are specified at kernel launch
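As a sketch of what synchronizing and communicating via shared memory look like in practice, the kernel below uses the 6 x 12 configuration from the list: each block stages its 12 elements in shared memory, waits at a barrier, then writes them back in reverse order. The names reverse_in_block and reverse_blocks are made up for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK 12  // threads per block (blockDim.x)
#define GRID  6   // blocks per grid  (gridDim.x); 6 * 12 = 72 threads in total

// Hypothetical kernel: reverses the 12 elements owned by each block.
__global__ void reverse_in_block(int *data) {
    __shared__ int tile[BLOCK];            // visible to all threads of one block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;    // first element owned by this block

    tile[t] = data[base + t];              // each thread stages one element
    __syncthreads();                       // wait until the whole tile is written
    data[base + t] = tile[blockDim.x - 1 - t];  // read a neighbour's element
}

// Reverses each 12-element chunk of h_data (72 ints) on the GPU.
// Returns 0 on success, -1 if a CUDA call fails.
int reverse_blocks(int *h_data) {
    size_t nbytes = GRID * BLOCK * sizeof(int);
    int *d_data = NULL;
    if (cudaMalloc((void **)&d_data, nbytes) != cudaSuccess) return -1;

    cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);
    reverse_in_block<<<GRID, BLOCK>>>(d_data);
    cudaError_t err = cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    int h_data[GRID * BLOCK];
    for (int i = 0; i < GRID * BLOCK; ++i) h_data[i] = i;

    if (reverse_blocks(h_data) != 0) return 1;

    for (int i = 0; i < BLOCK; ++i) printf("%d ", h_data[i]);  // first block only
    printf("\n");
    return 0;
}
```

Without the __syncthreads() barrier a thread could read a tile entry that its neighbour has not written yet. Threads in different blocks cannot cooperate this way, which is why each block reverses only its own chunk.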
Example:

    #include <stdio.h>

    // Kernel: adds *b into *a on the device.
    __global__ void add(int *a, int *b) {
        *a = *a + *b;
    }

    int main() {
        int c = 0;
        int a = 1, b = 2;
        int *d_a, *d_b;  // device pointers

        cudaMalloc(&d_a, sizeof(a));
        cudaMalloc(&d_b, sizeof(b));
        cudaMemset(d_a, 0, sizeof(a));
        cudaMemset(d_b, 0, sizeof(b));
        cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

        add<<<1,1>>>(d_a, d_b);

        // cudaMemcpy waits for the kernel to finish before copying the result back.
        cudaMemcpy(&c, d_a, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d\n", c);  // prints 3

        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }
Thread index computation:
idx = blockIdx.x*blockDim.x + threadIdx.x
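A small sketch can show how this formula gives every thread a unique global index. It reuses the earlier 72-thread configuration (gridDim=6, blockDim=12) and has each thread record its own block and thread number at position idx; the names record_ids and collect_ids are assumptions made for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: thread idx stores its (blockIdx.x, threadIdx.x) pair.
__global__ void record_ids(int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the grid
    out[2 * idx]     = blockIdx.x;
    out[2 * idx + 1] = threadIdx.x;
}

// Launches blocks x threads threads; h_out must hold 2 * blocks * threads ints.
// Returns 0 on success, -1 if a CUDA call fails.
int collect_ids(int *h_out, int blocks, int threads) {
    size_t nbytes = 2 * (size_t)blocks * threads * sizeof(int);
    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, nbytes) != cudaSuccess) return -1;

    record_ids<<<blocks, threads>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}

int main(void) {
    enum { BLOCKS = 6, THREADS = 12 };
    int h_out[2 * BLOCKS * THREADS];
    if (collect_ids(h_out, BLOCKS, THREADS) != 0) return 1;

    int idx = 13;  // 1 * 12 + 1, so this should be block 1, thread 1
    printf("idx %d -> block %d, thread %d\n",
           idx, h_out[2 * idx], h_out[2 * idx + 1]);
    return 0;
}
```

With blockDim=12, index 13 decomposes as 1*12 + 1, so that slot is written by thread 1 of block 1.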
Applications
High performance math routines for your applications:
- cuFFT – Fast Fourier Transforms Library
- cuBLAS – Complete BLAS Library
- cuSPARSE – Sparse Matrix Library
- cuRAND – Random Number Generation (RNG) Library
- NPP – Performance Primitives for Image & Video Processing
- Thrust – Templated C++ Parallel Algorithms & Data Structures
- math.h - C99 floating-point Library
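As one example of using these libraries instead of writing kernels by hand, here is a minimal Thrust sketch that sums the integers 0 to n-1 on the GPU. The function name sum_first_n is made up for this illustration; the Thrust calls themselves (device_vector, sequence, reduce) ship with the CUDA toolkit.

```cuda
#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

// Hypothetical helper: computes 0 + 1 + ... + (n - 1) on the GPU.
int sum_first_n(int n) {
    thrust::device_vector<int> d_vec(n);                   // storage lives in GPU memory
    thrust::sequence(d_vec.begin(), d_vec.end());          // fills 0, 1, ..., n-1
    return thrust::reduce(d_vec.begin(), d_vec.end(), 0);  // parallel sum, initial value 0
}

int main(void) {
    printf("%d\n", sum_first_n(100));  // 0 + 1 + ... + 99 = 4950
    return 0;
}
```

Thrust takes care of allocation, kernel launches, and cleanup here, so no explicit cudaMalloc, <<<grid,block>>>, or cudaFree appears in the code.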