Please note that cv::cuda::GpuMat and cv::Mat using different memory allocation method. cv::cuda::GpuMat the data in is Nvidia Gpu Ram, but cv::Mat store in normal Ram.
The cv::Mat allocated memory normally is continuous, but cv::cuda::GpuMat may have gap between row and row data. Because cv::cuda::GpuMat is using cuda function cudaMallocPitch, which make the step size different from COLS.
So when passing the row data of cv::cuda::GpuMat into a CUDA kernel function, should also pass in the step size into it, so the function can access the row data correctly. If using COLS instead of step, it will easily get wrong, and it is a headache to debug the problem.
For example:
__global__ void kernel_select_cmp_point( float* dMap, float* dPhase, uint8_t* matResult, uint32_t step, const int ROWS, const int COLS, const int span) { int start = blockIdx.x * blockDim.x + threadIdx.x; int stride = blockDim.x * gridDim.x; for (int row = start; row < ROWS; row += stride) { int offsetOfInput = row * step; int offsetOfResult = row * step; } }