zoukankan html css js c++ java

最远点采样介绍及CUDA实现分析

最远点采样（Farthest Point sampling/FPS）是一个基本的点云采样算法，在各类点云处理算法中都有使用，如PointNet++，以及各类三维物体检测算法。

本文从以下几个方面对FPS算法进行介绍和分析

FPS逻辑描述
FPS算法串行实现与分析
FPS算法并行实现与分析
串行实现与并行实现的性能比较

1. FPS逻辑描述

假设有(n)个点，要从中按照FPS算法，采样出(k)个点（(k<=n)）。那么，可以将所有点归类到两个集合(A,B)中，(A)集合表示选中的点形成的集合，(B)集合表示未选中的点构成的集合。FPS所做的事情就是每次从集合(B)中选取一个到集合(A)里的点距离最大的点。

初始情况：集合(A)为空，集合(B)包含所有点

选第一个点：对所有点进行shuffle后，选取第一个点，将其加入集合(A)中。此时(size(A)=1,size(B)=n-1)

选第二个点：分别计算出集合(B)里面每个点到集合(A)中一个点的距离，选取距离最大的点，加入集合(A)。此时，(size(A)=2, size(B)=n-1)。

选第三个点及后续的点：设此时集合A中有(m)个点（(m>=2)），集合(B)中有(n-m)个点。此时需要定义集合(B)中的点(p_B)到集合(A)中点的距离，注意到集合(A)中不止一个点，这是FPS的核心。对(p_B)到集合(A)的距离(d_B)定义如下

[egin{aligned} d_B &= d(p_B,A) = min(d(p_B, p^j_A)) quad (p_B in B, p^j_A in A) end{aligned} ]

分别计算出(p_B)到集合(A)中每个点的距离。当集合(A)中有(m)个点时，可以计算出(m)个距离值
从计算出来的(k)个距离值中，取最小的距离值，作为点(p_B)到集合(A)的距离

对于集合(B)中的(n-m)个点，每个点都可以计算出一个距离值：({d^1_B,d^2_B,...,d^{n-m}_B})。选出最大距离值对应的点，记为第(m+1)个点，将其加入集合(A)。此时集合(A)包含(m+1)个点，集合(B)包含(n-(m+1))个点。

重复上述步骤3，直至选出(k)个点为止。

2. FPS算法串行实现与分析

当集合(A)中有(m)个点时，集合(B)中有(n-m)个点，此时选取下一个点的时间复杂度为((n-m)*m)

显然该步骤可以优化，分析如下：

选第(m+1)个点时，对于集合(B)中的一个点(p_B)，需要计算出到集合(A)中(m)个点的距离。假设(o^1_B)表示点(p_B)到集合(A)中第一个点的距离，那么需要计算的距离为：({o^1_B, o^2_B, ..., o^m_B})，然后取最小值(min({o^1_B, o^2_B, ..., o^m_B}))，作为(p_B)到集合(A)的距离。
考虑选第m个点的过程，对于集合中的一个点(p_B)，需要计算出到集合(A)中(m-1)个点的距离，需要计算的距离为({o^1_B, o^2_B,...,o^{m-1}_B})。

可以看到，({o^1_B, o^2_B,...,o^{m-1}_B})这些值被重复计算了，所以可以进行如下的优化。

假设在选第(m)个点时，对于点(p_Bin B)：

(t^{m-1}_B)表示点(p_B)到集合(A)中的(m-1)个点的距离的最小值：

[t^{m-1}_B=min({o^1_B, o^2_B,...,o^{m-1}_B}) ]

选出第(m)个点之后，对于点(p_B)，继续选第(m+1)个点时，只需要计算点(p_B)到选出的第(m)个点的距离(o^m_B)，因为以下公式：

[min({o^1_B, o^2_B, ..., o^m_B}) = min(min{o^1_B, o^2_B,...,o^{m-1}_B},o^m_B) ]

上述公式还可以写为(t^m_B=min(t^{m-1}_B,o^m_B))，这是一个简单的动态规划问题。在实现上，只需要一个长度为n的数组，分别保存每个点到集合(A)的距离，每当将一个新点加入集合(A)后，更新该数组即可。

串行实现代码如下

/*
dataset: (b, n, 3)
temp: (b, n)
idxs: (b, m)
*/
void farthest_point_sampling_cpu(int b, int n, int m, const float *dataset, float *temp, int *idxs){
    const float * const dataset_start = dataset;
    float * const temp_start = temp;
    int * const idxs_start = idxs;
    for(int i=0; i<b; ++i){
        dataset = dataset_start + i*n*3;
        temp = temp_start + i*n;
        idxs = idxs_start + i*m;
        int old = 0;
        idxs[0] = old;
        for(int j=1; j<m; ++j){
            int besti = 0;
            float best = -1;
            float x1 = dataset[old * 3 + 0];
            float y1 = dataset[old * 3 + 1];
            float z1 = dataset[old * 3 + 2];
            for(int k=0; k<n; ++k){
                float x2, y2, z2;
                x2 = dataset[k*3+0];
                y2 = dataset[k*3+1];
                z2 = dataset[k*3+2];

                float d = (x2 - x1)*(x2 - x1)+(y2 - y1)*(y2 - y1)+(z2 - z1)*(z2 - z1);
                float d2 = min(d, temp[k]);
                temp[k] = d2;
                besti = d2 > best ? k : besti;
                best = d2 > best ? d2 : best;
            }
            old = besti;
            idxs[j] = old;
        }
    }
}

hints
- temp数组即保存了(t_B)的值，该数组在函数调用前，需要将所有值初始化为INF/infinity
- 该实现是串行的batch version的FPS实现
时间复杂度分析
- 每步选一个点，需要计算(n)个距离，总共选(k)个点，时间复杂度为(mathcal{O}(kn))
- (k)和(n)一般是常数比例关系，所以可以认为是(mathcal{O}(n^2))
- 考虑到batch size为(b), 最终的时间复杂度为(mathcal{O}(b*n^2))

3. FPS算法并行实现与分析

在实际使用中，面对大规模点云数据的下采样需求，串行实现的FPS算法在性能和效率上不能满足要求，因此常常使用并行化技术加速FPS算法，这里给出一个CUDA上的FPS算法实现例子和分析。

__device__ void __update(float *__restrict__ dists, int *__restrict__ dists_i, int idx1, int idx2){
    const float v1 = dists[idx1], v2 = dists[idx2];
    const int i1 = dists_i[idx1], i2 = dists_i[idx2];
    dists[idx1] = max(v1, v2);
    dists_i[idx1] = v2 > v1 ? i2 : i1;
}

template <unsigned int block_size>__global__ void farthest_point_sampling_kernel(int b, int n, int m, 
    const float *__restrict__ dataset, float *__restrict__ temp, int *__restrict__ idxs) {
    // dataset: (B, N, 3)
    // temp: (B, N)
    // output:
    //      idxs : (B, M)

    if (m <= 0) return;
    __shared__ float dists[block_size];
    __shared__ int dists_i[block_size];

    int batch_index = blockIdx.x;
    dataset += batch_index * n * 3;
    temp += batch_index * n;
    idxs += batch_index * m;

    int tid = threadIdx.x;
    const int stride = block_size;

    int old = 0;
    if (threadIdx.x == 0)
        idxs[0] = old;

    __syncthreads();
    for (int j = 1; j < m; j++) {
        int besti = 0;
        float best = -1;
        float x1 = dataset[old * 3 + 0];
        float y1 = dataset[old * 3 + 1];
        float z1 = dataset[old * 3 + 2];
        for (int k = tid; k < n; k += stride) {
            float x2, y2, z2;
            x2 = dataset[k * 3 + 0];
            y2 = dataset[k * 3 + 1];
            z2 = dataset[k * 3 + 2];

            float d = (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1);
            float d2 = min(d, temp[k]);
            temp[k] = d2;
            besti = d2 > best ? k : besti;
            best = d2 > best ? d2 : best;
        }
        dists[tid] = best;
        dists_i[tid] = besti;
        __syncthreads();
        
        //Reduce programming pattern
        //Loop unrolling
        if (block_size >= 1024) {
            if (tid < 512) {
                __update(dists, dists_i, tid, tid + 512);
            }
            __syncthreads();
        }

        if (block_size >= 512) {
            if (tid < 256) {
                __update(dists, dists_i, tid, tid + 256);
            }
            __syncthreads();
        }
        if (block_size >= 256) {
            if (tid < 128) {
                __update(dists, dists_i, tid, tid + 128);
            }
            __syncthreads();
        }
        if (block_size >= 128) {
            if (tid < 64) {
                __update(dists, dists_i, tid, tid + 64);
            }
            __syncthreads();
        }
        if (block_size >= 64) {
            if (tid < 32) {
                __update(dists, dists_i, tid, tid + 32);
            }
            __syncthreads();
        }
        if (block_size >= 32) {
            if (tid < 16) {
                __update(dists, dists_i, tid, tid + 16);
            }
            __syncthreads();
        }
        if (block_size >= 16) {
            if (tid < 8) {
                __update(dists, dists_i, tid, tid + 8);
            }
            __syncthreads();
        }
        if (block_size >= 8) {
            if (tid < 4) {
                __update(dists, dists_i, tid, tid + 4);
            }
            __syncthreads();
        }
        if (block_size >= 4) {
            if (tid < 2) {
                __update(dists, dists_i, tid, tid + 2);
            }
            __syncthreads();
        }
        if (block_size >= 2) {
            if (tid < 1) {
                __update(dists, dists_i, tid, tid + 1);
            }
            __syncthreads();
        }

        old = dists_i[0];
        if (tid == 0) 
            idxs[j] = old;
    }
}

void farthest_point_sampling_kernel_launcher(int b, int n, int m,
    const float *dataset, float *temp, int *idxs) {
    // dataset: (B, N, 3)
    // temp: (B, N)
    // output:
    //      idx: (B, M)

    cudaError_t err;
    unsigned int n_threads = opt_n_threads(n);
    
    printf("n_threads:%d
",n_threads);
    switch (n_threads) {
        case 1024:
        farthest_point_sampling_kernel<1024><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 512:
        farthest_point_sampling_kernel<512><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 256:
        farthest_point_sampling_kernel<256><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 128:
        farthest_point_sampling_kernel<128><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 64:
        farthest_point_sampling_kernel<64><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 32:
        farthest_point_sampling_kernel<32><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 16:
        farthest_point_sampling_kernel<16><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 8:
        farthest_point_sampling_kernel<8><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 4:
        farthest_point_sampling_kernel<4><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 2:
        farthest_point_sampling_kernel<2><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        case 1:
        farthest_point_sampling_kernel<1><<<b, n_threads>>>(b, n, m, dataset, temp, idxs); break;
        default:
        farthest_point_sampling_kernel<512><<<b, n_threads>>>(b, n, m, dataset, temp, idxs);
    }

    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA kernel failed : %s
", cudaGetErrorString(err));
        exit(-1);
    }
}

CUDA入门材料可以参考[4]，此处不再赘述。此部分代码分析可见参考[3]，写的十分详细，此处只强调几个地方。

for (int k = tid; k < n; k += stride) {
    float x2, y2, z2;
    x2 = dataset[k * 3 + 0];
    y2 = dataset[k * 3 + 1];
    z2 = dataset[k * 3 + 2];

    float d = (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1);
    float d2 = min(d, temp[k]);
    temp[k] = d2;
    besti = d2 > best ? k : besti;
    best = d2 > best ? d2 : best;
}

这部分代码是用来计算(n)个点到old点的距离的，这里主要分析线程块内线程和数据的映射关系（假设(blockDim.x=1024)）：

一个thread block对应的数据为dataset中的一个batch，点数为n，很显然常常有(n >> blockDim.x)成立
这里的循环就是用1024个线程处理(n)个点的pattern
- 每个线程访问[tid, tid+stride, ..., tid+i*stride,...]处对应的数据
- 这里的线程warp在开始时无控制分歧出现，只在最后会出现一部分控制分歧
- 同时，线程的数据访问是接合的模式，即连续线程所访问的数据可以由一次DRAM burst获得，这样可以提高DRAM的访问效率

//Reduce programming pattern
//Loop unrolling
if (block_size >= 1024) {
    if (tid < 512) {
        __update(dists, dists_i, tid, tid + 512);
    }
    __syncthreads();
}

if (block_size >= 512) {
    if (tid < 256) {
        __update(dists, dists_i, tid, tid + 256);
    }
    __syncthreads();
}
if (block_size >= 256) {
    if (tid < 128) {
        __update(dists, dists_i, tid, tid + 128);
    }
    __syncthreads();
}
if (block_size >= 128) {
    if (tid < 64) {
        __update(dists, dists_i, tid, tid + 64);
    }
    __syncthreads();
}
if (block_size >= 64) {
    if (tid < 32) {
        __update(dists, dists_i, tid, tid + 32);
    }
    __syncthreads();
}
if (block_size >= 32) {
    if (tid < 16) {
        __update(dists, dists_i, tid, tid + 16);
    }
    __syncthreads();
}
if (block_size >= 16) {
    if (tid < 8) {
        __update(dists, dists_i, tid, tid + 8);
    }
    __syncthreads();
}
if (block_size >= 8) {
    if (tid < 4) {
        __update(dists, dists_i, tid, tid + 4);
    }
    __syncthreads();
}
if (block_size >= 4) {
    if (tid < 2) {
        __update(dists, dists_i, tid, tid + 2);
    }
    __syncthreads();
}
if (block_size >= 2) {
    if (tid < 1) {
        __update(dists, dists_i, tid, tid + 1);
    }
    __syncthreads();
}

上述代码主要有两点需要注意

循环展开
- 消除了for循环的边界判断分支指令以及loop计数器更新
- 实现性能提升
规约（Reduce）模式
- 这里就是用来求dists和dists_i数组的最大值，是一个标准的max规约操作

4. 串行实现与并行实现的性能比较

在如下的硬件环境下进行了性能比较实验

CPU: Intel i7-10710U
GPU: Nvidia MX350
CUDA version: 11.2

Batch Size	#InputPts	#SampledPts	串行实现用时	并行实现用时
1	10000	1000	76ms	8ms
2	10000	1000	157ms	8ms
3	10000	1000	234ms	8ms
4	10000	1000	311ms	15ms
5	10000	1000	381ms	20ms
6	10000	1000	459ms	26ms
6	10000	10000	4561ms	250ms

可以看到并行实现相比于串行实现在最困难的设置下加速比为4561ms/250ms=18.244，这表明并行实现相比于串行实现的效率和速度提升是十分明显的。

参考资料

[1] Farthest Point sampling (FPs)算法核心思想解析

[2] OpenPCDet源码中的fps-cuda-version实现

[3] Farthest Point Sampling in 3D Object Detection

[4] 大规模并行处理器编程实战（第2版）

查看全文

相关阅读:
STM32 F4 DAC DMA Waveform Generator
STM32 F4 General-purpose Timers for Periodic Interrupts
Python第十四天序列化 pickle模块 cPickle模块 JSON模块 API的两种格式
 Python第十三天 django 1.6 导入模板定义数据模型访问数据库 GET和POST方法 SimpleCMDB项目 urllib模块 urllib2模块 httplib模块 django和web服务器整合 wsgi模块 gunicorn模块
 查看SQL Server服务运行帐户和SQL Server的所有注册表项
 Pycharm使用技巧（转载）
SQL Server 2014内存优化表的使用场景
 Python第十天 print >> f,和fd.write()的区别 stdout的buffer 标准输入标准输出从控制台重定向到文件标准错误重定向输出流和输入流捕获sys.exit()调用 optparse argparse
Python第七天函数函数参数函数里的变量函数返回值多类型传值函数递归调用匿名函数内置函数
 Python第六天类型转换

原文地址：https://www.cnblogs.com/cbw052/p/14672280.html