  • Converting between half and float in OpenCL

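    Using the half type in a kernel can speed up computation at the cost of some precision. Inside a kernel, computing with half data is fairly convenient (direct arithmetic on half values requires the cl_khr_fp16 extension), but on the host half is much less convenient to work with. Looking at the definition of cl_half, typedef uint16_t cl_half __attribute__((aligned(2)));, we can see that it is really just a uint16_t. So if you read the value stored in a cl_half directly, the system interprets it as a uint16_t, not as a floating-point number. The most common situation is probably that data is stored as half in a kernel, the memory is copied back to the host, and the values are then used on the host. For conversions between half and float, the following points are worth noting.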
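
    To make the "it is really a uint16_t" point concrete, here is a minimal sketch in host-side C++. The names HalfFields, split_half_bits and naive_read are hypothetical helpers made up for illustration; the only facts assumed are the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits biased by 15, 10 fraction bits).

```cpp
#include <cassert>
#include <cstdint>

// Field layout of a half (IEEE 754 binary16) stored in a 16-bit integer.
struct HalfFields {
    unsigned sign;      // 1 bit
    unsigned exponent;  // 5 bits, biased by 15
    unsigned fraction;  // 10 bits
};

// Hypothetical helper: split the raw 16 bits into their three fields.
inline HalfFields split_half_bits(std::uint16_t bits) {
    return HalfFields{ (bits >> 15) & 0x1u, (bits >> 10) & 0x1Fu, bits & 0x3FFu };
}

// What you get if the raw bits are treated as an ordinary integer value.
inline float naive_read(std::uint16_t bits) {
    return static_cast<float>(bits);
}
```

    The bit pattern 0x3C00 encodes the half value 1.0 (sign 0, biased exponent 15, fraction 0), but naive_read(0x3C00) yields 15360.0, which is why a real conversion routine is needed on the host.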

    Using vstore_half and vload_half

    The OpenCL 1.1 specification says:

    Loads from a pointer to a half and stores to a pointer to a half can be performed using the
    vload_half, vload_halfn, vloada_halfn and vstore_half, vstore_halfn, vstorea_halfn functions
    respectively as described in section 6.11.7. The load functions read scalar or vector half values
    from memory and convert them to a scalar or vector float value. The store functions take a
    scalar or vector float value as input, convert it to a half scalar or vector value (with appropriate
    rounding mode) and write the half scalar or vector value to memory.

    The function declarations are as follows (the vector variants are similar):

    float vload_half (size_t offset, const __global half *p);
    void vstore_half (float data, size_t offset, __global half *p);
    

    In other words, vload_half takes half data in memory and automatically converts it to float, while vstore_half takes a float and automatically converts it to half before writing it to memory.
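
    As a rough illustration of the two calls, here is a sketch of a kernel that reads halfs, scales them in float, and writes halfs back. The kernel name scale_half, its arguments, and the variable kScaleHalfSource are made up for this example; the OpenCL C source is kept in a C++ raw string, which is one common way to hand it to clCreateProgramWithSource on the host.

```cpp
#include <cassert>
#include <string>

// OpenCL C source for a hypothetical kernel, stored as a C++ raw string.
static const std::string kScaleHalfSource = R"CLC(
__kernel void scale_half(__global const half *src,
                         __global half *dst,
                         const float factor)
{
    const size_t i = get_global_id(0);

    // vload_half reads the half at src[i] and returns it as a float.
    const float v = vload_half(i, src);

    // vstore_half converts the float result to half and writes it to dst[i].
    vstore_half(v * factor, i, dst);
}
)CLC";
```

    All arithmetic here happens in float, so the kernel only needs pointers to half and does not declare any half variables.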

    Using read_imageh and write_imageh

    Their declarations (these built-ins are provided by the cl_khr_fp16 extension):

    half4 read_imageh (image2d_t image, sampler_t sampler, int2 coord);
    half4 read_imageh (image2d_t image, sampler_t sampler, float2 coord);
    
    void write_imageh (image2d_t image, int2 coord, half4 color);
    

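    For read_imageh, the return value depends on the image_channel_data_type of the image2d_t:

    image_channel_data_type              Return value
    CL_UNORM_INT8 or CL_UNORM_INT16      half values in the range [0.0, 1.0]
    CL_SNORM_INT8 or CL_SNORM_INT16      half values in the range [-1.0, 1.0]
    CL_HALF_FLOAT                        half-precision values

    If the image was created with a channel data type that is not listed in the table, the value returned by read_imageh is undefined. Likewise, write_imageh can only write to image objects created with one of the listed types.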
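
    A minimal sketch of how the two functions might be used together, assuming a device that supports cl_khr_fp16. The kernel name copy_half_image, the sampler name kSampler and the variable kCopyHalfImageSource are hypothetical; as before, the OpenCL C source is held in a C++ raw string.

```cpp
#include <cassert>
#include <string>

// OpenCL C source for a hypothetical kernel that copies one image to another
// through half4 values.
static const std::string kCopyHalfImageSource = R"CLC(
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__constant sampler_t kSampler = CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;

__kernel void copy_half_image(__read_only image2d_t src,
                              __write_only image2d_t dst)
{
    const int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));

    // Returns a half4; the value range depends on the channel data type of src.
    const half4 px = read_imageh(src, kSampler, pos);

    // dst must have been created with one of the channel data types listed above.
    write_imageh(dst, pos, px);
}
)CLC";
```

    Separate source and destination images are used because an OpenCL 1.x kernel cannot read from and write to the same image object.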

    Converting between float and half on the host

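    As mentioned above, on the host a half is actually stored as a uint16_t, so we need an algorithm or rule for interpreting its bits to obtain the half-float value we want. Fortunately, I found conversion routines in Qualcomm's OpenCL SDK (which you can download yourself). The code is reproduced below: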

    //--------------------------------------------------------------------------------------
    // File: half_float.cpp
    // Desc:
    //
    // Author:      QUALCOMM
    //
    //               Copyright (c) 2018 QUALCOMM Technologies, Inc.
    //                         All Rights Reserved.
    //                      QUALCOMM Proprietary/GTDR
    //--------------------------------------------------------------------------------------
    
    #include "half_float.h"
    #include <cmath>
    #include <limits>
    
    cl_half to_half(float f)
    {
        static const struct
        {
            unsigned int bit_size       = 16;                                                 // total number of bits in the representation
            unsigned int num_frac_bits  = 10;                                                 // number of fractional (mantissa) bits
            unsigned int num_exp_bits   = 5;                                                  // number of (biased) exponent bits
            unsigned int sign_bit       = 15;                                                 // position of the sign bit
            unsigned int sign_mask      = 1 << 15;                                            // mask to extract sign bit
            unsigned int frac_mask      = (1 << 10) - 1;                                      // mask to extract the fractional (mantissa) bits
            unsigned int exp_mask       = ((1 << 5) - 1) << 10;                               // mask to extract the exponent bits
            unsigned int e_max          = (1 << (5 - 1)) - 1;                                 // max value for the exponent
            int          e_min          = -((1 << (5 - 1)) - 1) + 1;                          // min value for the exponent
            unsigned int max_normal     = ((((1 << (5 - 1)) - 1) + 127) << 23) | 0x7FE000;    // max value that can be represented by the 16 bit float
            unsigned int min_normal     = ((-((1 << (5 - 1)) - 1) + 1) + 127) << 23;          // min value that can be represented by the 16 bit float
            unsigned int bias_diff      = ((unsigned int)(((1 << (5 - 1)) - 1) - 127) << 23); // difference in bias between the float16 and float32 exponent
            unsigned int frac_bits_diff = 23 - 10;                                            // difference in number of fractional bits between float16/float32
        } float16_params;
    
        static const struct
        {
            unsigned int abs_value_mask    = 0x7FFFFFFF; // ANDing with this value gives the abs value
            unsigned int sign_bit_mask     = 0x80000000; // ANDing with this value gives the sign
            unsigned int e_max             = 127;        // max value for the exponent
            unsigned int num_mantissa_bits = 23;         // 23 bit mantissa on single precision floats
            unsigned int mantissa_mask     = 0x007FFFFF; // 23 bit mantissa on single precision floats
        } float32_params;
    
        const union
        {
            float f;
            unsigned int bits;
        } value = {f};
    
        const unsigned int f_abs_bits = value.bits & float32_params.abs_value_mask;
        const bool         is_neg     = value.bits & float32_params.sign_bit_mask;
        const unsigned int sign       = (value.bits & float32_params.sign_bit_mask) >> (float16_params.num_frac_bits + float16_params.num_exp_bits + 1);
        cl_half            half       = 0;
    
        if (std::isnan(value.f))
        {
            half = float16_params.exp_mask | float16_params.frac_mask;
        }
        else if (std::isinf(value.f))
        {
            half = is_neg ? float16_params.sign_mask | float16_params.exp_mask : float16_params.exp_mask;
        }
        else if (f_abs_bits > float16_params.max_normal)
        {
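            // Clamp to the largest finite half (0x7BFF): exponent field 30 with all fraction bits set.
            // An all-ones exponent field (31) would encode infinity or NaN instead of a finite value.
            half = sign | (((1 << float16_params.num_exp_bits) - 2) << float16_params.num_frac_bits) | float16_params.frac_mask;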
        }
        else if (f_abs_bits < float16_params.min_normal)
        {
            const unsigned int frac_bits    = (f_abs_bits & float32_params.mantissa_mask) | (1 << float32_params.num_mantissa_bits);
            const int          nshift       = float16_params.e_min + float32_params.e_max - (f_abs_bits >> float32_params.num_mantissa_bits);
            const unsigned int shifted_bits = nshift < 24 ? frac_bits >> nshift : 0;
            half                            = sign | (shifted_bits >> float16_params.frac_bits_diff);
        }
        else
        {
            half = sign | ((f_abs_bits + float16_params.bias_diff) >> float16_params.frac_bits_diff);
        }
        return half;
    }
    
    cl_float to_float(cl_half f)
    {
        static const struct {
            uint16_t sign_mask                   = 0x8000;
            uint16_t exp_mask                    = 0x7C00;
            int      exp_bias                    = 15;
            int      exp_offset                  = 10;
            uint16_t biased_exp_max              = (1 << 5) - 1;
            uint16_t frac_mask                   = 0x03FF;
            float    smallest_subnormal_as_float = 5.96046448e-8f;
        } float16_params;
    
        static const struct {
            int sign_offset = 31;
            int exp_bias    = 127;
            int exp_offset  = 23;
        } float32_params;
    
        const bool     is_pos          = (f & float16_params.sign_mask) == 0;
        const uint32_t biased_exponent = (f & float16_params.exp_mask) >> float16_params.exp_offset;
        const uint32_t frac            = (f & float16_params.frac_mask);
        const bool     is_inf          = biased_exponent == float16_params.biased_exp_max
                                         && (frac == 0);
    
        if (is_inf)
        {
            return is_pos ? std::numeric_limits<float>::infinity() : -std::numeric_limits<float>::infinity();
        }
    
        const bool is_nan = biased_exponent == float16_params.biased_exp_max
                            && (frac != 0);
        if (is_nan)
        {
            return std::numeric_limits<float>::quiet_NaN();
        }
    
        const bool is_subnormal = biased_exponent == 0;
        if (is_subnormal)
        {
            return static_cast<float>(frac) * float16_params.smallest_subnormal_as_float * (is_pos ? 1.f : -1.f);
        }
    
        const int      unbiased_exp        = static_cast<int>(biased_exponent) - float16_params.exp_bias;
        const uint32_t biased_f32_exponent = static_cast<uint32_t>(unbiased_exp + float32_params.exp_bias);
    
        union
        {
            cl_float f;
            uint32_t ui;
        } res = {0};
    
        res.ui = (is_pos ? 0u : 1u << float32_params.sign_offset) // unsigned literals avoid shifting into the sign bit of an int
                 | (biased_f32_exponent << float32_params.exp_offset)
                 | (frac << (float32_params.exp_offset - float16_params.exp_offset));
    
        return res.f;
    }
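
    For cross-checking the result of to_float, or for decoding a whole buffer copied back from the device, a compact decoder can also be written with std::ldexp. This is a separate sketch, not part of the Qualcomm SDK code; decode_half and decode_half_buffer are hypothetical names, and the raw half values are assumed to already be in a std::vector<std::uint16_t> on the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Decode one half (IEEE 754 binary16 bit pattern) into a float.
inline float decode_half(std::uint16_t h) {
    const bool negative = (h & 0x8000u) != 0;
    const int  exponent = (h >> 10) & 0x1F;  // biased by 15
    const int  fraction = h & 0x3FF;         // 10 stored fraction bits

    float magnitude;
    if (exponent == 0x1F) {
        // All-ones exponent: infinity if the fraction is zero, NaN otherwise.
        magnitude = fraction ? std::numeric_limits<float>::quiet_NaN()
                             : std::numeric_limits<float>::infinity();
    } else if (exponent == 0) {
        // Zero or subnormal: fraction * 2^-24.
        magnitude = std::ldexp(static_cast<float>(fraction), -24);
    } else {
        // Normal: (1024 + fraction) * 2^(exponent - 15 - 10).
        magnitude = std::ldexp(static_cast<float>(fraction | 0x400), exponent - 25);
    }
    return negative ? -magnitude : magnitude;
}

// Decode a whole buffer of raw half values read back from the device.
inline std::vector<float> decode_half_buffer(const std::vector<std::uint16_t>& raw) {
    std::vector<float> out;
    out.reserve(raw.size());
    for (std::uint16_t h : raw) {
        out.push_back(decode_half(h));
    }
    return out;
}
```

    For example, the bit pattern 0x4994 decodes to 11.15625, and 0x7BFF decodes to 65504, the largest finite half value.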
    
    

    About conversion precision

    Here is a set of values to give a feel for it:

    // Original float data
    11.15780   -128.9570    6.154780  0.9487320  -1327.1247    256.0       0.0       -127.597     917.0       -1.0047
    // After to_half followed by to_float
    11.156250  -128.875000  6.152344  0.948730   -1327.000000  256.000000  0.000000  -127.562500  917.000000  -1.003906
    

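    According to the documentation, integers in the range 0 to 2048 can be represented exactly. For other values, the precision is best described as relative: a half has 10 stored fraction bits, so adjacent normal values are spaced 2^-10 (about 0.1%) of the leading power of two apart. If the magnitude of a number is large, the absolute error is large too, and if the magnitude is small, the absolute error is small. Note also that the to_half routine above truncates toward zero rather than rounding to nearest, which is why -128.9570 becomes -128.875 rather than -129.0. For typical image processing, this precision is generally sufficient.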
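
    The relative nature of the error can be checked with a small sketch that computes the gap between adjacent half values around a given number. half_spacing is a hypothetical helper, and it assumes a finite input whose magnitude does not exceed 65504, the largest finite half.

```cpp
#include <cassert>
#include <cmath>

// Gap between adjacent representable half values around x.
// Assumes x is finite and |x| <= 65504.
inline float half_spacing(float x) {
    const float a = std::fabs(x);
    if (a < std::ldexp(1.0f, -14)) {
        // Below the smallest normal half (2^-14), all values are 2^-24 apart.
        return std::ldexp(1.0f, -24);
    }
    int e = 0;
    std::frexp(a, &e);  // a = m * 2^e with m in [0.5, 1), so floor(log2(a)) = e - 1
    // 10 fraction bits: the gap is 2^(floor(log2(a)) - 10).
    return std::ldexp(1.0f, e - 1 - 10);
}
```

    With truncation, the round-trip error is always smaller than this gap. For 11.15780 the gap is 0.0078125, for -128.9570 it is 0.125, and for -1327.1247 it is 1.0, which matches the sample data above. From 2048 upward the gap is 2.0 or more, so not every integer is representable any longer.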

  • Original article: https://www.cnblogs.com/willhua/p/9915534.html