  • PaddlePaddle inference source code analysis (4)

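    This part walks through the prediction flow, which has three stages: preparing the input data, running the predictor, and fetching the output data.

    I. Feeding the input data

    A simple usage example: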

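    // input_shape (std::vector<int>) and input (e.g. std::vector<float>) are prepared by the caller
    std::vector<std::string> input_names = predictor->GetInputNames();
    std::unique_ptr<paddle_infer::Tensor> input_t = predictor->GetInputHandle(input_names[0]);
    input_t->Reshape(input_shape);
    input_t->CopyFromCpu(input.data());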

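    Let's go through this flow step by step.

    1. GetInputNames

    The call is slightly indirect. The public header exposes the paddle_infer namespace, so the call first goes through a function in paddle_infer, which then forwards to AnalysisPredictor::GetInputNames on the predictor object that was actually created.

    This step returns the names of the input nodes. idx2feeds_ is a std::map<size_t, std::string> that stores the names taken from the ops in the model file whose op->Type is feed.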

    // Interface class implementation
    namespace paddle_infer {
    std::vector<std::string> Predictor::GetInputNames() {
      return predictor_->GetInputNames();
    }
    }
    
    // Actual implementation
    std::vector<std::string> AnalysisPredictor::GetInputNames() {
      std::vector<std::string> input_names;
      for (auto &item : idx2feeds_) {
        input_names.push_back(item.second);
      }
      return input_names;
    }
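    To make the forwarding concrete, here is a minimal self-contained sketch of the same two-layer pattern. SketchPredictor and SketchPredictorImpl are hypothetical stand-ins for paddle_infer::Predictor and AnalysisPredictor, not Paddle classes; only the idx2feeds_ member mirrors the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AnalysisPredictor: only the part that matters here.
class SketchPredictorImpl {
 public:
  explicit SketchPredictorImpl(std::map<std::size_t, std::string> idx2feeds)
      : idx2feeds_(std::move(idx2feeds)) {}

  // std::map iterates in ascending key order, so names come out by feed index.
  std::vector<std::string> GetInputNames() const {
    std::vector<std::string> names;
    names.reserve(idx2feeds_.size());
    for (const auto &item : idx2feeds_) names.push_back(item.second);
    return names;
  }

 private:
  std::map<std::size_t, std::string> idx2feeds_;
};

// Hypothetical stand-in for the public facade in the paddle_infer namespace.
class SketchPredictor {
 public:
  explicit SketchPredictor(std::unique_ptr<SketchPredictorImpl> impl)
      : impl_(std::move(impl)) {
    assert(impl_ != nullptr);
  }

  // Pure forwarding; the facade has no logic of its own.
  std::vector<std::string> GetInputNames() const {
    return impl_->GetInputNames();
  }

 private:
  std::unique_ptr<SketchPredictorImpl> impl_;
};
```

    Because std::map iterates in ascending key order, the returned names follow the feed index regardless of the order in which the entries were inserted.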

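    2. GetInputHandle

    This returns a handle to the memory region that belongs to the given node name. As described in an earlier part, the Scope stores information about every node; here we obtain the input node's memory region inside the Scope. Note that the scope held by the executor is the predictor's sub_scope.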

    namespace paddle_infer {
    std::unique_ptr<Tensor> Predictor::GetInputHandle(const std::string &name) {
      return predictor_->GetInputTensor(name);
    }
    }
    std::unique_ptr<ZeroCopyTensor> AnalysisPredictor::GetInputTensor(
        const std::string &name) {
      PADDLE_ENFORCE_NOT_NULL(
          executor_->scope()->FindVar(name),
          platform::errors::PreconditionNotMet(
              "The variable named %s is not found in the scope of the exector.",
              name));
      // Get the scope
      std::unique_ptr<ZeroCopyTensor> res(
          new ZeroCopyTensor(static_cast<void *>(executor_->scope())));
      res->input_or_output_ = true;
      res->SetName(name);
      // Set the place according to the device
      if (platform::is_cpu_place(place_)) {
        res->SetPlace(PaddlePlace::kCPU);
      } else if (platform::is_xpu_place(place_)) {
        if (config_.lite_engine_enabled()) {
          // Currently, Paddle-Lite's XPU user interface only supports the transfer
          // of host data pointers. If it is currently used as a subgraph, execution
          // efficiency will be sacrificed, so it is temporarily set to cpu place.
          // And, the current lite engine of xpu must execute all parts of the
          // model.
          res->SetPlace(PaddlePlace::kCPU);
        } else {
          auto xpu_place = BOOST_GET_CONST(platform::XPUPlace, place_);
          res->SetPlace(PaddlePlace::kXPU, xpu_place.GetDeviceId());
        }
      } else if (platform::is_npu_place(place_)) {
        auto npu_place = BOOST_GET_CONST(platform::NPUPlace, place_);
        res->SetPlace(PaddlePlace::kNPU, npu_place.GetDeviceId());
      } else {
        auto gpu_place = BOOST_GET_CONST(platform::CUDAPlace, place_);
        res->SetPlace(PaddlePlace::kGPU, gpu_place.GetDeviceId());
      }
      return res;
    }

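    3. ZeroCopyTensor::Reshape

    This step operates on the input tensor and sets its new shape (dimension information). We use it as an opportunity to look at tensor operations in detail.

    3.1 The base class and interface is paddle_infer::Tensor (paddle_tensor.h / inference/api/details/zero_copy_tensor.cc). ZeroCopyTensor (paddle_api.h) is a subclass of paddle_infer::Tensor that mainly overrides the copy-related functions, which are covered in the next subsection.

    3.2 The actual reshape happens in Tensor::Reshape: it looks up the Variable (framework/variable.h) with the given name in sub_scope and operates on it.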

    void Tensor::Reshape(const std::vector<int> &shape) {
      // Check that a name has been set
      PADDLE_ENFORCE_EQ(
          name_.empty(), false,
          paddle::platform::errors::PreconditionNotMet(
              "Need to SetName first, so that the corresponding tensor can "
              "be retrieved."));
      // Check that this is an input; only inputs can be reshaped
      PADDLE_ENFORCE_EQ(input_or_output_, true,
                        paddle::platform::errors::PermissionDenied(
                            "Can't reshape the output tensor, it is readonly"));
      // Get the scope, then look up the variable of the named node and resize it.
      // The scope used here is sub_scope, which holds only the non-persistent nodes
      auto *scope = static_cast<paddle::framework::Scope *>(scope_);
      auto *var = scope->FindVar(name_);
      PADDLE_ENFORCE_NOT_NULL(
          var, paddle::platform::errors::PreconditionNotMet(
                   "No tensor called [%s] in the runtime scope", name_));
      auto *tensor = var->GetMutable<paddle::framework::LoDTensor>();
      tensor->Resize(paddle::framework::make_ddim(shape));
    }
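    The logic of Tensor::Reshape can be mimicked without Paddle. In the sketch below, SketchScope, SketchTensor and SketchHandle are hypothetical names, the scope is reduced to a name-to-tensor map, and plain exceptions replace the PADDLE_ENFORCE macros. It illustrates that Reshape only rewrites shape metadata on the variable found in the scope and does not allocate memory.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a tensor that only records its shape, and a scope
// reduced to a name -> tensor map.
struct SketchTensor {
  std::vector<int64_t> dims;
};
using SketchScope = std::unordered_map<std::string, SketchTensor>;

class SketchHandle {
 public:
  SketchHandle(SketchScope *scope, std::string name, bool is_input)
      : scope_(scope), name_(std::move(name)), is_input_(is_input) {
    assert(scope_ != nullptr);
  }

  void Reshape(const std::vector<int> &shape) {
    if (name_.empty()) throw std::logic_error("set a name first");
    if (!is_input_) throw std::logic_error("output tensors are read-only");
    auto it = scope_->find(name_);
    if (it == scope_->end()) {
      throw std::runtime_error("no tensor called " + name_);
    }
    // Only the shape metadata changes; no memory is allocated here.
    it->second.dims.assign(shape.begin(), shape.end());
  }

 private:
  SketchScope *scope_;
  std::string name_;
  bool is_input_;
};
```

    Reshaping a handle created with is_input set to false throws, which corresponds to the PermissionDenied check in the real function.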

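    3.3 var->GetMutable: this is where the Variable actually creates storage of the requested type. The storage type is LoDTensor (lod_tensor.h); an LoDTensor object is created inside a PlaceholderImpl and assigned to holder_.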

      template <typename T>
      T* GetMutable() {
        if (!holder_) {
          holder_.reset(new PlaceholderImpl<T>());
        } else {
          PADDLE_ENFORCE_EQ(
              holder_->Type(), VarTypeTrait<T>::kId,
              platform::errors::InvalidArgument(
                  "The Variable type must be %s, but the type it holds is %s.",
                  ToTypeName(VarTypeTrait<T>::kId), ToTypeName(holder_->Type())));
        }
        return static_cast<T*>(holder_->Ptr());
      }

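    PlaceholderImpl is a class template that wraps T, so Variable itself does not need to be a template; it only keeps a Placeholder pointer as a member: std::shared_ptr<Placeholder> holder_;. On construction, PlaceholderImpl stores a pointer to obj_ together with the type id of the object. The ids are defined in proto::VarType, and the type-to-id mapping is registered in advance.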

      // Placeholder hides type T, so it doesn't appear as a template
      // parameter of Variable.
      template <typename T>
      struct PlaceholderImpl : public Placeholder {
        static_assert(
            IsRegisteredVarType<T>(),
            "Not registered type. Please register T inside var_type_traits.h");
        PlaceholderImpl() { this->Init(&obj_, VarTypeTrait<T>::kId); }
    
       private:
        T obj_;
      };

    The static_assert checks that T is a registered type. The registration list is in framework/var_type_traits.h:

    REG_PROTO_VAR_TYPE_TRAIT(LoDTensor, proto::VarType::LOD_TENSOR);
    REG_PROTO_VAR_TYPE_TRAIT(SelectedRows, proto::VarType::SELECTED_ROWS);
    REG_PROTO_VAR_TYPE_TRAIT(std::vector<Scope *>, proto::VarType::STEP_SCOPES);
    REG_PROTO_VAR_TYPE_TRAIT(LoDRankTable, proto::VarType::LOD_RANK_TABLE);
    REG_PROTO_VAR_TYPE_TRAIT(LoDTensorArray, proto::VarType::LOD_TENSOR_ARRAY);
    REG_PROTO_VAR_TYPE_TRAIT(platform::PlaceList, proto::VarType::PLACE_LIST);
    REG_PROTO_VAR_TYPE_TRAIT(ReaderHolder, proto::VarType::READER);
    REG_PROTO_VAR_TYPE_TRAIT(FeedList, proto::VarType::FEED_LIST);
    REG_PROTO_VAR_TYPE_TRAIT(FetchList, proto::VarType::FETCH_LIST);
    REG_PROTO_VAR_TYPE_TRAIT(int, proto::VarType::INT32);
    REG_PROTO_VAR_TYPE_TRAIT(float, proto::VarType::FP32);
    REG_PROTO_VAR_TYPE_TRAIT(Vocab, proto::VarType::VOCAB);
    REG_PROTO_VAR_TYPE_TRAIT(String, proto::VarType::STRING);
    REG_PROTO_VAR_TYPE_TRAIT(Strings, proto::VarType::STRINGS);

    3.4 LoDTensor lives in the paddle::framework namespace; do not confuse it with the paddle_infer::Tensor discussed earlier. Its parent class is paddle::framework::Tensor (framework/tensor.h), and Resize simply uses the parent's implementation.

    Tensor& Tensor::Resize(const DDim& dims) {
      dims_ = dims;
      return *this;
    }

    paddle::framework::make_ddim(shape) builds a DDim (ddim.h): rank_ stores the number of dimensions and dim_ stores the size of each dimension, with a maximum of 9 dimensions.

    4. ZeroCopyTensor::CopyFromCpu 

    This is the step that actually copies memory. We go through it in four parts.

    template <typename T>
    void Tensor::CopyFromCpu(const T *data) {
      // 1
      EAGER_GET_TENSOR(paddle::framework::LoDTensor);
      PADDLE_ENFORCE_GE(tensor->numel(), 0,
                        paddle::platform::errors::PreconditionNotMet(
                            "You should call Tensor::Reshape(const "
                            "std::vector<int> &shape)"
                            "function before copying data from cpu."));
      // 2
      size_t ele_size = tensor->numel() * sizeof(T);
    
      // 3
      if (place_ == PlaceType::kCPU) {
        auto *t_data = tensor->mutable_data<T>(paddle::platform::CPUPlace());
        std::memcpy(static_cast<void *>(t_data), data, ele_size);
      } else if (place_ == PlaceType::kGPU) {
      // 4
    #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
        paddle::platform::DeviceContextPool &pool =
            paddle::platform::DeviceContextPool::Instance();
        paddle::platform::CUDAPlace gpu_place(device_);
        auto *t_data = tensor->mutable_data<T>(gpu_place);
        auto *dev_ctx = static_cast<const paddle::platform::CUDADeviceContext *>(
            pool.Get(gpu_place));
    
        paddle::memory::Copy(gpu_place, static_cast<void *>(t_data),
                             paddle::platform::CPUPlace(), data, ele_size,
                             dev_ctx->stream());
    #else
        PADDLE_THROW(paddle::platform::errors::Unavailable(
            "Can not create tensor with CUDA place because paddle is not compiled "
            "with CUDA."));
    #endif
      } else if (place_ == PlaceType::kXPU) {
       ... // Kunlun XPU specific
      } else if (place_ == PlaceType::kNPU) {
       ... // Huawei Ascend (NPU) specific
      } else {
        PADDLE_THROW(paddle::platform::errors::InvalidArgument(
            "The analysis predictor supports CPU, GPU, NPU and XPU now."));
      }
    }

    4.1 Fetch the LoDTensor pointer that was created in the scope's variable and assign it to tensor_

    // 1 Call site
    EAGER_GET_TENSOR(paddle::framework::LoDTensor);
    // 2 The macro calls FindTensor to get the pointer
    #define EAGER_GET_TENSOR(tensor_type)    \
      if (!tensor_) {                        \
        tensor_ = FindTensor<tensor_type>(); \
      }                                      \
      auto *tensor = static_cast<tensor_type *>(tensor_);
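    // 3 The actual logic: call GetMutable on the variable found in the scope.
    // Reshape already called GetMutable and created the object, and the type
    // requested here matches the type created then, so this returns the pointer
    // to the LoDTensor object created earlier.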
    template <typename T>
    void *Tensor::FindTensor() const {
      PADDLE_ENFORCE_EQ(
          name_.empty(), false,
          paddle::platform::errors::PreconditionNotMet(
              "Need to SetName first, so that the corresponding tensor can "
              "be retrieved."));
      auto *scope = static_cast<paddle::framework::Scope *>(scope_);
      auto *var = scope->FindVar(name_);
      PADDLE_ENFORCE_NOT_NULL(
          var, paddle::platform::errors::PreconditionNotMet(
                   "No tensor called [%s] in the runtime scope", name_));
      auto *tensor = var->GetMutable<T>();
      return tensor;
    }

    4.2 Compute the number of bytes needed. This depends on the shape set by the earlier Reshape and on the type T; the result is dim0*dim1*...*sizeof(T)

    int64_t Tensor::numel() const { return product(dims_); }

    The product of all dimensions is ultimately computed by this recursive template:

    template <size_t kStart, size_t kEnd, bool kStop>
    struct UnrollProduct {
      template <typename T>
      HOSTDEVICE inline static T Run(const T *d) {
        return d[kStart] *
               UnrollProduct<kStart + 1, kEnd, kStart + 1 == kEnd>::Run(d);
      }
    };
    
    template <size_t kStart, size_t kEnd>
    struct UnrollProduct<kStart, kEnd, true> {
      template <typename T>
      HOSTDEVICE inline constexpr static T Run(const T *d) {
        return 1;
      }
    };
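    The recursion is easier to follow with a concrete rank. The sketch below repeats the two templates without the HOSTDEVICE macro and adds ProductOf, a hypothetical helper (not a Paddle function) that starts the recursion for a rank known at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// General case: multiply d[kStart] by the product of the remaining entries.
template <std::size_t kStart, std::size_t kEnd, bool kStop>
struct UnrollProduct {
  template <typename T>
  static T Run(const T *d) {
    return d[kStart] *
           UnrollProduct<kStart + 1, kEnd, kStart + 1 == kEnd>::Run(d);
  }
};

// Terminal case (kStop == true): the empty product is 1.
template <std::size_t kStart, std::size_t kEnd>
struct UnrollProduct<kStart, kEnd, true> {
  template <typename T>
  static T Run(const T *) {
    return 1;
  }
};

// Hypothetical helper: product of the first kRank entries of dims.
template <std::size_t kRank>
int64_t ProductOf(const int64_t *dims) {
  assert(dims != nullptr || kRank == 0);
  return UnrollProduct<0, kRank, kRank == 0>::Run(dims);
}
```

    For dims = {2, 3, 4}, ProductOf<3>(dims) expands to 2 * (3 * (4 * 1)) = 24, so a float tensor of that shape needs 24 * sizeof(float) bytes.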

    4.3 Copying data for the CPU place. The CPU case is simple: obtain the memory from the tensor, then copy the data into it.

    4.3.1 Getting the memory pointer

    // Note: tensor here is still a paddle::framework::Tensor
    // 1 Get the memory
    auto *t_data = tensor->mutable_data<T>(paddle::platform::CPUPlace());

    4.3.2 Getting the data type of T

    // 2.1 tensor_impl.h: first check that T is POD, then get the data type of T
    template <typename T>
    inline T* Tensor::mutable_data(const platform::Place& place,
                                   size_t requested_size) {
      static_assert(std::is_pod<T>::value, "T must be POD");
      return reinterpret_cast<T*>(
          mutable_data(place, DataTypeTrait<T>::DataType(), requested_size));
    }
    // 2.2 How the type is obtained (data_type.h)
      DataTypeTrait<T>::DataType()
    // Helper macro used to iterate over the types
    #define _ForEachDataType_(callback)                                      \
      _ForEachDataTypeHelper_(callback, float, FP32);                        \
      _ForEachDataTypeHelper_(callback, ::paddle::platform::float16, FP16);  \
      _ForEachDataTypeHelper_(callback, ::paddle::platform::bfloat16, BF16); \
      _ForEachDataTypeHelper_(callback, double, FP64);                       \
      _ForEachDataTypeHelper_(callback, int, INT32);                         \
      _ForEachDataTypeHelper_(callback, int64_t, INT64);                     \
      _ForEachDataTypeHelper_(callback, bool, BOOL);                         \
      _ForEachDataTypeHelper_(callback, uint8_t, UINT8);                     \
      _ForEachDataTypeHelper_(callback, int16_t, INT16);                     \
      _ForEachDataTypeHelper_(callback, int8_t, INT8);                       \
      _ForEachDataTypeHelper_(callback, ::paddle::platform::complex<float>,  \
                              COMPLEX64);                                    \
      _ForEachDataTypeHelper_(callback, ::paddle::platform::complex<double>, \
                              COMPLEX128);
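    // At initialization a map is built that associates each C++ data type with its VarType::Type in framework.proto
    // Initialization function: uses the iteration macro to call RegisterType, which writes each type-to-proto_type pair into the map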
    static DataTypeMap* InitDataTypeMap() {
      auto retv = new DataTypeMap();
    
    #define RegType(cc_type, proto_type) \
      RegisterType<cc_type>(retv, proto_type, #cc_type)
    
      _ForEachDataType_(RegType);
    
    #undef RegType
      return retv;
    }
    // Then a macro defines the corresponding DataTypeTrait specializations
    // Macro that creates one specialization
    #define DefineDataTypeTrait(cpp_type, proto_type)                           \
      template <>                                                               \
      struct DataTypeTrait<cpp_type> {                                          \
        constexpr static proto::VarType::Type DataType() { return proto_type; } \
      }
    // Invoke the creation macro through the iteration macro
    _ForEachDataType_(DefineDataTypeTrait);
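    The same "one list of types, many generated definitions" trick (often called an X-macro) can be reproduced on a small scale. Everything named Sketch* or SKETCH_* below is hypothetical, and the enum stands in for proto::VarType::Type; the point is only to show how passing a macro as the callback generates one trait specialization per listed type.

```cpp
// Hypothetical stand-in for proto::VarType::Type.
enum class SketchType { FP32, FP64, INT32, INT64 };

// The single list of (C++ type, tag) pairs. Each use of the list applies
// `callback` to every pair.
#define SKETCH_FOR_EACH_TYPE(callback) \
  callback(float, FP32);               \
  callback(double, FP64);              \
  callback(int, INT32);                \
  callback(long long, INT64)

// Primary template is left undefined, so unlisted types fail to compile.
template <typename T>
struct SketchTypeTrait;

// Generates one explicit specialization per pair.
#define SKETCH_DEFINE_TRAIT(cpp_type, tag)                             \
  template <>                                                          \
  struct SketchTypeTrait<cpp_type> {                                   \
    constexpr static SketchType DataType() { return SketchType::tag; } \
  }

SKETCH_FOR_EACH_TYPE(SKETCH_DEFINE_TRAIT);

#undef SKETCH_DEFINE_TRAIT
```

    After expansion, SketchTypeTrait<float>::DataType() is a compile-time constant equal to SketchType::FP32, while SketchTypeTrait<char> does not compile because no specialization was generated for it.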

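    4.3.3 Allocating memory

    Once the concrete data type is known, framework::Tensor::mutable_data(place, type, requested_size = 0) is called. It again computes the size as numel() * SizeOfType(type), obtains memory on the given place through memory::AllocShared, stores it in the Tensor's holder (a memory::Allocation), and returns the memory pointer.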
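    A CPU-only sketch of this allocate-on-demand behaviour follows. SketchTensor is a hypothetical class, operator new[] stands in for memory::AllocShared, and the rule "keep the old buffer when it is already large enough" is an assumption of the sketch rather than something shown in the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical tensor: Resize only records the shape, mutable_data allocates.
class SketchTensor {
 public:
  void Resize(std::vector<int64_t> dims) { dims_ = std::move(dims); }

  // Product of all dimensions (1 for an empty shape in this sketch).
  int64_t numel() const {
    return std::accumulate(dims_.begin(), dims_.end(), int64_t{1},
                           std::multiplies<int64_t>());
  }

  template <typename T>
  T *mutable_data() {
    assert(numel() >= 0);
    const std::size_t bytes = static_cast<std::size_t>(numel()) * sizeof(T);
    // Assumption: allocate only when there is no buffer or it is too small.
    if (!holder_ || capacity_ < bytes) {
      holder_.reset(new unsigned char[bytes],
                    std::default_delete<unsigned char[]>());
      capacity_ = bytes;
    }
    return reinterpret_cast<T *>(holder_.get());
  }

  std::size_t capacity() const { return capacity_; }

 private:
  std::vector<int64_t> dims_;
  std::shared_ptr<unsigned char> holder_;  // shared, like the Tensor's holder
  std::size_t capacity_ = 0;
};
```

    This separation is why the usage example calls Reshape before CopyFromCpu: the shape must be known before the byte count can be computed and the buffer obtained.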

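    4.4 Copying to GPU memory

    Compared with the CPU copy, the GPU path differs in two ways. First, it uses the global DeviceContextPool, which stores the mapping from each place to its DeviceContext, to obtain the CUDADeviceContext for the device. Second, after the memory pointer has been obtained in the same way as before, the data is copied into GPU memory with paddle::memory::Copy on the context's stream instead of std::memcpy.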
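    The pool itself is a process-wide lookup table from device to context. The sketch below shows that shape only; SketchContext and SketchContextPool are hypothetical, the key is a plain device id instead of a Place, and creating the context lazily on first use is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical context; a real CUDA context would also own a stream.
struct SketchContext {
  int device_id;
};

class SketchContextPool {
 public:
  // Single process-wide instance.
  static SketchContextPool &Instance() {
    static SketchContextPool pool;
    return pool;
  }

  // Returns the context for a device. Assumption: it is created on first use;
  // later calls return the same object.
  SketchContext *Get(int device_id) {
    assert(device_id >= 0);
    std::unique_ptr<SketchContext> &slot = contexts_[device_id];
    if (!slot) slot.reset(new SketchContext{device_id});
    return slot.get();
  }

 private:
  SketchContextPool() = default;
  std::map<int, std::unique_ptr<SketchContext>> contexts_;
};
```

    Every caller that asks for the same device receives the same context, so copies issued through it share one stream.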

    II. Running the prediction

    This executes Predictor::Run, which in practice runs AnalysisPredictor::ZeroCopyRun.

    bool AnalysisPredictor::ZeroCopyRun() {
      paddle::platform::SetNumThreads(config_.cpu_math_library_num_threads());
    #ifdef PADDLE_WITH_MKLDNN
      ...
    #endif
    
      executor_->Run();
    
      if (config_.shape_range_info_collected()) {
        CollectShapeRangeInfo();
      }
    
      // Fix TensorArray reuse not cleaned bug.
      tensor_array_batch_cleaner_.CollectTensorArrays(sub_scope_);
      tensor_array_batch_cleaner_.ResetTensorArray();
    
      // recover the cpu_math_library_num_threads to 1, in order to avoid thread
      // conflict when integrating it into deployment service.
      paddle::platform::SetNumThreads(1);
    #ifdef PADDLE_WITH_MKLDNN
      ...
    #endif
    #if defined(PADDLE_WITH_MKLML)
      ...
    #endif
      return true;
    }

    1. NaiveExecutor::Run

    The basic logic is to call op->Run on each op in order.

    void NaiveExecutor::Run() {
    #ifdef PADDLE_WITH_MKLDNN
      platform::AttachPointerHashToMKLDNNKey(this, place_);
    #endif
      platform::ScopedFlushDenormal flush;
      for (auto &op : ops_) {
        VLOG(4) << std::this_thread::get_id() << " run "
                << op->DebugStringEx(scope_) << " on scope " << scope_;
        op->SetIsCalledByExecutor(false);
        op->Run(*scope_, place_);
      }
    }
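    Stripped of logging and platform details, the executor is a loop over a fixed op list that shares one scope. In the sketch below, SketchScope, SketchOp and SketchExecutor are hypothetical: the scope is a name-to-float map and each op is a callable that reads and writes it.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: variables are single floats, ops are callables.
using SketchScope = std::map<std::string, float>;
using SketchOp = std::function<void(SketchScope &)>;

class SketchExecutor {
 public:
  SketchExecutor(SketchScope *scope, std::vector<SketchOp> ops)
      : scope_(scope), ops_(std::move(ops)) {
    assert(scope_ != nullptr);
  }

  // Runs every op once, in the order in which the ops were given.
  void Run() {
    for (auto &op : ops_) op(*scope_);
  }

 private:
  SketchScope *scope_;
  std::vector<SketchOp> ops_;
};
```

    Ops communicate only through the scope: an op that writes "y" must appear in the list before an op that reads "y", because the executor does no scheduling of its own.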

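    2. OP Run

    Running an op calls OperatorBase::Run directly. It first gets the device id from the place, then calls the RunImpl of the concrete op subclass. The resources it needs are the scope and the place.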

    void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
      try {
        VLOG(4) << place << " " << DebugStringEx(&scope);
        if (platform::is_gpu_place(place)) {
    #if !defined(PADDLE_WITH_CUDA) && !defined(PADDLE_WITH_HIP)
          PADDLE_THROW(platform::errors::Unavailable(
              "Cannot run operator on place %s, please recompile paddle or "
              "reinstall Paddle with CUDA support.",
              place));
    #else
          auto dev_id = BOOST_GET_CONST(platform::CUDAPlace, place).device;
          platform::SetDeviceId(dev_id);
    #endif
        } else if (platform::is_xpu_place(place)) {
        ... // XPU specific: get the device id
        } else if (platform::is_npu_place(place)) {
        ... // NPU (Ascend) specific: get the device id
        }
    
        {
          // TODO(wangchaochaohu) : refine code to use only one RecordEvent)
          // in order to record different op type cost time
          // and different op name cost time,we set two event.
          platform::RecordEvent op_type_record_event(Type());
          auto op_name = platform::OpName(outputs_, Type());
          platform::RecordEvent op_name_record_event(
              op_name, platform::EventRole::kUniqueOp);
          // The actual logic: call the RunImpl of the concrete op subclass
          RunImpl(scope, place);
        }
    
        VLOG(3) << GetExecutionPlace(place) << " " << DebugStringEx(&scope);
      } catch (platform::EnforceNotMet& exception) {
        ... // Error information
      }
    }

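    Ops fall into two kinds; see "PaddlePaddle inference source code analysis (3)".

    One kind inherits directly from OperatorBase, the other inherits from OperatorWithKernel.

    The difference is that an op inheriting from OperatorBase implements RunImpl itself and runs directly on the CPU. An op inheriting from OperatorWithKernel runs OperatorWithKernel::RunImpl, which then runs the matching kernel; the kernel is selected according to the place and the op_device attribute.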

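    2.1 RunImpl of OperatorBase-type ops

    Ops of this kind are usually utility ops.

    Take assert_op as an example. It looks up the input LoDTensor in the scope by name and reads the condition value from it. If the condition is true it returns immediately; otherwise it prints the data tensors as the error information and throws.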

    void RunImpl(const framework::Scope &scope,
                   const platform::Place &dev_place) const override {
        // Get the tensor from the scope
        const framework::Variable *cond_var_ptr = scope.FindVar(Input(kCond));
        PADDLE_ENFORCE_NOT_NULL(cond_var_ptr,
                                platform::errors::NotFound(
                                    "Input(Condition) of AssertOp is not found."));
        const LoDTensor &cond = cond_var_ptr->Get<LoDTensor>();
        PADDLE_ENFORCE_EQ(
            cond.dims(), paddle::framework::make_ddim({1}),
            platform::errors::InvalidArgument(
                "The numel of Input(Condition) of AssertOp must be 1. But now "
                "the Condition's shape is %s.",
                cond.dims().to_str()));
    
        // Read the boolean condition stored in the tensor
        bool cond_data = GetCondData(cond);
        if (cond_data) {
          return;
        }
    
        TensorFormatter formatter;
        formatter.SetSummarize(Attr<int64_t>(kSummarize));
        // Process the data, i.e. print it as the error information
        const std::vector<std::string> &x_names = Inputs(kData);
        for (const std::string &name : x_names) {
          const framework::Variable *x_var_ptr = scope.FindVar(name);
          const framework::LoDTensor &x_tensor = x_var_ptr->Get<LoDTensor>();
          formatter.Print(x_tensor, name);
        }
    
        PADDLE_THROW(platform::errors::InvalidArgument(
            "The condition variable '%s' of AssertOp must be "
            "true, but received false",
            Input(kCond)));
      }

    2.2 OperatorWithKernel RunImpl

    The vast majority of compute ops are of the OperatorWithKernel type.

    The call sequence is: OperatorWithKernel::RunImpl(scope, place) -> OperatorWithKernel::RunImpl(scope, place, runtime_ctx) -> OperatorWithKernel::ChooseKernel -> the registered kernel_func_

    2.2.1 RunImpl(scope,place)

    Obtains the RuntimeContext. There are two cases: if caching the RuntimeContext is allowed, the runtime_ctx_ stored in the op is reused; otherwise a new one is created.

    2.2.2 RunImpl(scope, place,runtime_ctx)

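    Gets the DeviceContext for the place from the singleton DeviceContextPool.

    If the kernel_type_ or kernel_func_ cached in the op object is empty, calls ChooseKernel to obtain the kernel_func.

    Then calls PrepareData to obtain the transfer_scope.

    Calls the InferShape implemented by the subclass to set the input and output dimension information.

    Calls kernel_func to perform the actual computation.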

    2.2.3 ChooseKernel(runtime_ctx,scope,place)

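    AllOpKernels returns a global map of the form <key = op name, value = OpKernelMap>; all kernel implementations of an op are stored in its OpKernelMap. An OpKernelMap has the form <OpKernelType, OpKernelFunc>, where OpKernelType contains the place, the data type and related information.

    First the OpKernelMap is looked up by op type. Then the place is obtained from the DeviceContext and the data type from the op and the RuntimeContext, and the corresponding OpKernelType is built.

    Next it checks whether the op has an op_device attribute; if so, the place is replaced by the value of that attribute. The matching func is then looked up.

    Finally the OpKernelType and the OpKernelFunc that was found are saved in the op's kernel_type_ and kernel_func_.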
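    The lookup and caching described in 2.2.2 and 2.2.3 can be sketched with two nested maps. All names below (Place, DType, KernelKey, AllKernels, SketchOp) are hypothetical simplifications: the key holds only a place and a data type, and kernels return a string so the choice is observable.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class Place { kCPU, kGPU };
enum class DType { kFP32, kINT32 };
using KernelKey = std::pair<Place, DType>;  // stand-in for OpKernelType
using KernelFunc = std::function<std::string()>;
using KernelMap = std::map<KernelKey, KernelFunc>;

// Global registry: op type -> (kernel key -> kernel function).
std::map<std::string, KernelMap> &AllKernels() {
  static std::map<std::string, KernelMap> kernels;
  return kernels;
}

class SketchOp {
 public:
  SketchOp(std::string type, DType dtype)
      : type_(std::move(type)), dtype_(dtype) {}

  // Models the op_device attribute: forces the place used for the lookup.
  void SetOpDevice(Place place) {
    has_op_device_ = true;
    op_device_ = place;
  }

  std::string Run(Place place) {
    if (!kernel_func_) ChooseKernel(place);  // chosen once, then cached
    return kernel_func_();
  }

 private:
  void ChooseKernel(Place place) {
    auto op_it = AllKernels().find(type_);
    if (op_it == AllKernels().end()) {
      throw std::runtime_error("no kernels registered for op " + type_);
    }
    if (has_op_device_) place = op_device_;  // attribute overrides the place
    auto kernel_it = op_it->second.find(KernelKey(place, dtype_));
    if (kernel_it == op_it->second.end()) {
      throw std::runtime_error("no matching kernel for op " + type_);
    }
    kernel_func_ = kernel_it->second;
    assert(kernel_func_);
  }

  std::string type_;
  DType dtype_;
  bool has_op_device_ = false;
  Place op_device_ = Place::kCPU;
  KernelFunc kernel_func_;
};
```

    Because the chosen function is cached in the op, a second Run on the same op skips the lookup even if it is given a different place; in this sketch that is a direct consequence of the "only choose when the cache is empty" rule.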

    2.2.4 PrepareData

    Prepares the input data. This creates a thread-local scope and passes it back to the caller.

    2.2.5 InferShape

    Calls the op's own InferShape implementation, which sets the dimension information of the outputs.

    2.2.6 kernel_func

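    Calls the kernel function that the op registered for the given type and place. Kernels are also registered as class templates that implement a Compute function; at execution time it is Compute that runs.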
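    One way to picture this registration is a static registrar object that wraps a kernel class's Compute in a type-erased function and stores it under a key. The sketch below is an illustration under that assumption, not Paddle's registration macros; SketchRegistry, SketchRegistrar, SketchReluKernel and the key string are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

using SketchBuffer = std::vector<float>;
using SketchKernelFunc =
    std::function<void(const SketchBuffer &, SketchBuffer *)>;

// Function-local static avoids static-initialization-order problems.
std::map<std::string, SketchKernelFunc> &SketchRegistry() {
  static std::map<std::string, SketchKernelFunc> registry;
  return registry;
}

// Constructing a registrar stores a wrapper that calls KernelT::Compute.
template <typename KernelT>
struct SketchRegistrar {
  explicit SketchRegistrar(const std::string &key) {
    SketchRegistry()[key] = [](const SketchBuffer &in, SketchBuffer *out) {
      KernelT().Compute(in, out);
    };
  }
};

// Kernel written as a class template over the element type.
template <typename T>
class SketchReluKernel {
 public:
  void Compute(const std::vector<T> &in, std::vector<T> *out) const {
    assert(out != nullptr);
    out->resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      (*out)[i] = in[i] > T(0) ? in[i] : T(0);
    }
  }
};

// Registration runs during static initialization, before main().
static SketchRegistrar<SketchReluKernel<float>> relu_fp32_registrar(
    "relu/cpu/fp32");
```

    At run time the caller only sees the stored function; the kernel class and its element type are hidden behind the key used for the lookup.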

    Contact: emhhbmdfbGlhbmcxOTkxQDEyNi5jb20=
  • Original article: https://www.cnblogs.com/zl1991/p/15728406.html