zoukankan html css js c++ java

CUDA C Programming Guide 在线教程学习笔记 Part 9

▶ 协作组，要求 cuda ≥ 9.0，一个简单的例子见 http://www.cnblogs.com/cuancuancuanhao/p/7881093.html

● 灵活调节需要进行通讯的线程组合（不一定是线程块或是线程束）的尺寸，在更多粒度上进行线程协作。

● 协作组功能支持 CUDA 的各种并行模式，包括生产者 - 消费者并行（producer-consumer parallelism），机会并行（opportunistic parallelism），全网个同步（global synchronization）。

● 构成要素：① 参与协作的线程组合（即协作组整体）的数据类型；② 从 CUDA lauch API 中创建协作组（intrinsic groups？）的操作；③ 将现有协作组划分为新的协作组的操作；④ 协作组内的栅栏同步函数；⑤ 检查组内属性和执行组内特定命令的操作（如线程表决函数）。

● 块内协作组（Intra-block Group）使用方法。

 1 # include <cooperative_groups.h>        // 使用的头文件
 2 
 3 using namespace cooperative_groups;     // 命名空间
 4 
 5 thread_block g = this_thread_block();   // 将当前线程块打包为一个协作组，命名为 g
 6 
 7 thread_group gTile = tiled_partition(g, SIZE);
 8 // 将之前的协作组分割成大小为 SIZE 的协作组（SIZE 可以取 1，2，4，8，16，32），但组内不能使用线程束表决函数和统筹函数
 9 
10 thread_block_tile<SIZE> gTile = tiled_partition<SIZE>(g);
11 // 同样的分割函数，使用模板函数，编译时处理，比函数 tiled_partition() 更高效，且组内可以使用线程束表决函数和统筹函数
12 
13 // 协作组的一些方法
14 void sync();                            // 协作组同步（协作组内的线程栅栏同步）
15 unsigned size();                        // 获得协作组的大小（线程个数）
16 unsigned thread_rank();                 // 获得当前线程在协作组内的编号
17 bool is_valid();                        // 协作组是否有效（符合 API 约束）
18 dim3 group_index();                     // 指出当前线程块在线程格中的编号
19 dim3 thread_index();                    // 指出当前线程在线程块中的编号
20 
21 // 协作组内也可以使用的表决函数和统筹函数（成员函数）
22 int shfl();
23 int shfl_down();
24 int shfl_up();
25 int shfl_xor();
26 int any();
27 int all();
28 int ballot();
29 int match_any();
30 int match_all();

● 线程束发生分支的时候设备将会串行执行每个分支，在同道中保持活跃的所有线程称为合并的，协作组有能力发现并为合并的线程创建一个组。

1 coalesced_group active = coalesced_threads();// 在分支中，将当前活跃的线程创建为一个协作组

● 发现模式。两个示例代码段等价，但没看懂在干什么。

 1 {
 2     unsigned int writemask = __activemask();
 3     unsigned int total = __popc(writemask);
 4     unsigned int prefix = __popc(writemask & __lanemask_lt());
 5     // Find the lowest-numbered active lane
 6     int elected_lane = __ffs(writemask) - 1;
 7     int base_offset = 0;
 8     if (prefix == 0)
 9         base_offset = atomicAdd(p, total);
10     base_offset = __shfl_sync(writemask, base_offset, elected_lane);
11     int thread_offset = prefix + base_offset;
12     return thread_offset;
13 }
14 {
15     cg::coalesced_group g = cg::coalesced_threads();
16     int prev;
17     if (g.thread_rank() == 0)
18         prev = atomicAdd(p, g.size());
19     prev = g.thread_rank() + g.shfl(prev, 0);
20     return prev;
21 }

● 线程格同步，需要额外的一些步骤。

 1 // 通过 CUDA Driver API 的函数 cuDeviceGetAttribute() 来检查设备是否支持 cooperative launch 属性
 2 int pi = 0;
 3 cuDevice dev;
 4 cuDeviceGet(&dev, 0)
 5 cuDeviceGetAttribute(&pi, CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH, dev);// 如果支持，则 pi 被置 1
 6 
 7 // 使用函数 cudaLaunchCooperativeKernel() 或 CUDA Driver API 中的几种调用方法来启动内核，不能使用 <<< >>>
 8 cudaLaunchCooperativeKernel(const T *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem = 0, cudaStream_t stream = 0);
 9 
10 // 建议精心优化线程格尺寸和线程块尺寸（下面两例分别是使用最大线程块数和自动优化线程块数）
11 {
12     cudaDeviceProp deviceProp;
13     cudaGetDeviceProperties(&deviceProp, dev);
14     cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);
15 }
16 {
17     cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, my_kernel, numThreads, 0));
18     cudaLaunchCooperativeKernel((void*)my_kernel, numBlocksPerSm, numThreads, args);
19 }
20 
21 // 使用函数 this_grid() 来获得当前线程格，以及使用线程格同步函数
22 grid_group grid = this_grid();
23 grid.sync();
24 
25 // 编译命令，打开 Relocatable Device Code（允许分离编译）
26 nvcc - arch = sm_61 - rdc = true mytestfile.cu - o mytest

● 多设备同步，需要额外的一些步骤。

 1 // 通过 CUDA Driver API 的函数 cuDeviceGetAttribute() 来检查设备是否支持 cooperative multi-device launch 属性
 2 int pi = 0;
 3 cuDevice dev;
 4 cuDeviceGet(&dev, 0)
 5 cuDeviceGetAttribute(&pi, CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH, dev);// 如果支持，则 pi 被置 1
 6 
 7 // 使用结构 CUDA_LAUNCH_PARAMS_st 来存储需要调用的内核的相关参数
 8 typedef struct CUDA_LAUNCH_PARAMS_st
 9 {
10     CUfunction function;
11     unsigned int gridDimX;
12     unsigned int gridDimY;
13     unsigned int gridDimZ;
14     unsigned int blockDimX;
15     unsigned int blockDimY;
16     unsigned int blockDimZ;
17     unsigned int sharedMemBytes;
18     CUstream hStream;
19     void **kernelParams;
20 }
21 CUDA_LAUNCH_PARAMS; 
22 
23 // 使用函数 cudaLaunchCooperativeKernelMultiDevice() 来启动内核，该函数允许主机线程创建一个跨设备的内核，以提供多设备同步功能
24 cudaLaunchCooperativeKernelMultiDevice(CUDA_LAUNCH_PARAMS *launchParamsList, unsigned int numDevices);
25 
26 // 使用函数 this_multi_grid() 来获得当前线程格，以及使用多设备同步函数
27 multi_grid_group multi_grid = this_multi_grid();
28 multi_grid.sync();
29 
30 // 编译命令，与线程格同步相同

■ 其他要点：

① 该 API 保证了操作的原子性，保证各主机线程在所有指定设备上独立的启动内核；不能将两个 launchParamsList 映射到同一个设备上

② 使用的所有设备必须具有相同的计算能力 major 和 minor 号；所有设备上使用的线程格尺寸、线程块尺寸和共享内存大小必须相同；通过该 API 启动的函数应该是相同的，API 内并没有内置相关检查。

③ 内核中使用的所有 __device__，__constant__，__managed__ 变量在各设备中相互独立，应该在启动内存钱分别初始化完成。

查看全文

相关阅读:
1.两数之和
 [Udemy] ES 7 and Elastic Stack
[Udemy] ES 7 and Elastic Stack
Common Linux Commands 日常工作常用Linux命令
 ELK 学习
 web 3d 技术预研及数据可视化技术
 AWS Cloud Practioner 官方课程笔记
 怎么用 pytorch 查看 GPU 信息
 ECG 项目预研
 怎么查看keras 或者 tensorflow 正在使用的GPU

原文地址：https://www.cnblogs.com/cuancuancuanhao/p/7847081.html