
The "I Can't Figure Out Triton" series: Using the Python Backend

Python Backend

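Triton offers a pipeline feature, but it can only wire the outputs of one model to the inputs of the next. That is too simple and static: it supports no control flow such as loops or conditionals, and the data passed between models is inflexible because it can only be tensors. Is there a way to do something more flexible? Yes: use the Python Backend, or write your own C++ backend.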

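The advantage of the Python Backend is fast development combined with the flexibility of Python. Take the face detection model MTCNN as an example. MTCNN consists of three models, each a separate neural network. Apart from the forward passes of these three networks, the remaining steps are so free-form that it is nearly impossible to assemble them into a single model with a pipeline so that one request completes the whole flow. With the Python Backend, all of this logic can live in one place, and the client gets the final result from a single request.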

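Because NVIDIA does not provide documentation for the Python interfaces of the Python Backend, I have collected them at the end of this article for easy lookup. Some of these interfaces come from Python files and some are exported through pybind11.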

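There is a relatively complex and complete example available: a face recognition model. The client sends one image and gets back the image annotated with face information. The whole flow has several steps: face detection, image cropping, feature extraction, feature matching, and so on. This article does not cover that example. It uses the simple add_sub example from NVIDIA's repository instead, because the goal is to show how to use the Python Backend.

Repository: https://github.com/zzk0/triton/tree/master/face/face_ensemble_python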

Example

This simple example comes from NVIDIA's repository.

Directory structure:

(transformers) percent1@ubuntu:~/triton/triton/models/example_python$ tree
.
├── 1
│   ├── model.py         # script that implements the model
│   └── __pycache__
├── client.py            # client script; it does not have to live here
└── config.pbtxt         # model configuration file

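Server configuration

This example is called add_sub (it is registered under the model name example_python). It takes two inputs and produces two outputs: the sum and the difference of the two inputs.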

name: "example_python"
backend: "python"
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
input [
  {
    name: "INPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT1"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

instance_group [
  {
    kind: KIND_CPU
  }
]
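
Inside model.py this configuration arrives as a JSON string under args['model_config']. The sketch below shows roughly the kind of name lookup that pb_utils.get_output_config_by_name performs on the parsed config. The helper find_output_config and the hand-written JSON are stand-ins for illustration only; the string that Triton actually generates contains more fields.

```python
import json

# Hand-written stand-in for the JSON string passed as args['model_config'].
# Only the fields used below are included (an assumption for brevity).
model_config_json = json.dumps({
    "name": "example_python",
    "backend": "python",
    "output": [
        {"name": "OUTPUT0", "data_type": "TYPE_FP32"},
        {"name": "OUTPUT1", "data_type": "TYPE_FP32"},
    ],
})


def find_output_config(model_config, name):
    """Hypothetical helper: return the output entry called `name`, or None."""
    for output in model_config.get("output", []):
        if output["name"] == name:
            return output
    return None


model_config = json.loads(model_config_json)
output0_config = find_output_config(model_config, "OUTPUT0")
print(output0_config["data_type"])  # TYPE_FP32
```

Looking the output up by name rather than by position keeps the model code independent of the order in which outputs are declared in config.pbtxt.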

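Server

model.py must implement three functions: initialize, execute, and finalize. initialize is called when a model instance is created and finalize when it is cleaned up. With n model instances, each of these two functions is called n times.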

import json

# Only importable inside the Triton Python backend runtime.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):
        # args['model_config'] holds the model configuration as a JSON string.
        self.model_config = model_config = json.loads(args['model_config'])
        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1")
        # Convert the Triton type strings (for example TYPE_FP32) to numpy dtypes.
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(output1_config['data_type'])

    def execute(self, requests):
        output0_dtype = self.output0_dtype
        output1_dtype = self.output1_dtype
        responses = []
        # Build one response per request, in the same order.
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, 'INPUT0')
            in_1 = pb_utils.get_input_tensor_by_name(request, 'INPUT1')
            out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(),
                            in_0.as_numpy() - in_1.as_numpy())
            out_tensor_0 = pb_utils.Tensor('OUTPUT0', out_0.astype(output0_dtype))
            out_tensor_1 = pb_utils.Tensor('OUTPUT1', out_1.astype(output1_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0, out_tensor_1])
            responses.append(inference_response)
        return responses

    def finalize(self):
        # Called once per model instance when it is unloaded.
        print('Cleaning up...')
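
To make the lifecycle concrete, here is a small simulation that runs without Triton. CountingModel and run_instances are hypothetical names, and handing batches to instances in round-robin order is an assumption made for the demo, not a description of Triton's scheduler. The sketch only shows that initialize and finalize run once per instance, while execute runs once per batch and returns one response per request.

```python
class CountingModel:
    """Toy model that records lifecycle calls in a shared log."""

    def __init__(self, log):
        self.log = log

    def initialize(self, args):
        self.log.append("initialize")

    def execute(self, requests):
        self.log.append("execute")
        # One response per request, in the same order.
        return [f"response to {request}" for request in requests]

    def finalize(self):
        self.log.append("finalize")


def run_instances(instance_count, batches):
    """Hypothetical driver: start instances, dispatch batches, then clean up."""
    log = []
    instances = [CountingModel(log) for _ in range(instance_count)]
    for instance in instances:
        instance.initialize({"model_config": "{}"})
    for index, batch in enumerate(batches):
        # Round-robin dispatch is an assumption made for this demo.
        instance = instances[index % instance_count]
        responses = instance.execute(batch)
        assert len(responses) == len(batch)
    for instance in instances:
        instance.finalize()
    return log


log = run_instances(instance_count=2, batches=[["r1", "r2"], ["r3"], ["r4"]])
print(log.count("initialize"), log.count("execute"), log.count("finalize"))  # 2 3 2
```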

Client

Next, write a script that calls the service.

import numpy as np
import tritonclient.http as httpclient


if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')

    inputs = []
    inputs.append(httpclient.InferInput('INPUT0', [4], "FP32"))
    inputs.append(httpclient.InferInput('INPUT1', [4], "FP32"))
    input_data0 = np.random.randn(4).astype(np.float32)
    input_data1 = np.random.randn(4).astype(np.float32)
    inputs[0].set_data_from_numpy(input_data0, binary_data=False)
    inputs[1].set_data_from_numpy(input_data1, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT0', binary_data=False))
    outputs.append(httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))

    results = triton_client.infer('example_python', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT0')
    output_data1 = results.as_numpy('OUTPUT1')

    print(input_data0)
    print(input_data1)
    print(output_data0)
    print(output_data1)

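We can check the result: the first two lines are the inputs and the last two are the outputs. The first two lines add up to the third, and their difference is the fourth. That completes a simple Python backend model.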

[ 0.81060416 -0.18330468 -1.6702048  -0.28633776]
[ 0.06001309  0.801739   -0.9306069   0.18076313]
[ 0.8706173   0.6184343  -2.6008117  -0.10557464]
[ 0.75059104 -0.98504364 -0.73959786 -0.4671009 ]
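
The same check can be done programmatically instead of by eye. The sketch below copies the four printed lines above into plain Python lists and compares them with a small tolerance, because the values were computed in float32 and printed with rounding. The helper all_close is a hypothetical name; in a real client script np.allclose on the returned arrays would do the same job.

```python
import math

# Values copied from the printed output above.
input0 = [0.81060416, -0.18330468, -1.6702048, -0.28633776]
input1 = [0.06001309, 0.801739, -0.9306069, 0.18076313]
output0 = [0.8706173, 0.6184343, -2.6008117, -0.10557464]
output1 = [0.75059104, -0.98504364, -0.73959786, -0.4671009]


def all_close(actual, expected, abs_tol=1e-6):
    """Hypothetical helper: element-wise comparison with an absolute tolerance."""
    return all(math.isclose(a, e, abs_tol=abs_tol) for a, e in zip(actual, expected))


sums = [a + b for a, b in zip(input0, input1)]
differences = [a - b for a, b in zip(input0, input1)]
print(all_close(output0, sums), all_close(output1, differences))  # True True
```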

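Exporting the Python environment

In practice we need all sorts of packages that are missing from the default Python environment, so we have to pack our own environment and ship it with the model. The Python backend uses Python 3.8 by default. If you need a different Python version, you have to rebuild the Python Backend yourself. If you do not need to change the version, exporting the environment is enough.

Taking OpenCV as an example, we create a new conda environment from scratch: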

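export PYTHONNOUSERSITE=True
conda create -n triton python=3.8
conda activate triton  # activate the new environment so the installs below go into it
pip install opencv-python  # could not be installed with conda, so use pip
conda install conda-pack
conda-pack  # run the packer; the archive is written to the current directory
apt update
apt install ffmpeg libsm6 libxext6 -y --fix-missing  # system dependencies of OpenCV; -y answers yes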

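Calling other models

Inside a Python backend model we can call other models. This gives pipeline-like behavior and avoids extra round trips to the client.

The steps are:

1. Build an InferenceRequest object and set the model name, the names of the outputs you need, and the input tensors.
2. Call the exec method to run the forward pass.
3. Read the output with get_output_tensor_by_name.

The following example calls facenet: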

inference_request = pb_utils.InferenceRequest(
    model_name='facenet',
    requested_output_names=[self.Facenet_outputs[0]],
    inputs=[pb_utils.Tensor(self.Facenet_inputs[0], face_img.astype(np.float32))]
)
inference_response = inference_request.exec()
pre = utils.pb_tensor_to_numpy(pb_utils.get_output_tensor_by_name(inference_response, self.Facenet_outputs[0]))
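
A nested call can fail, so it is usually worth checking the response before reading its outputs. The sketch below uses only has_error() and error().message(), both of which appear in the pybind11 definitions in the appendix. FakeError, FakeResponse, and raise_if_failed are hypothetical stand-ins so that the snippet runs outside Triton; inside a real model you would pass the actual inference_response and would probably raise pb_utils.TritonModelException rather than RuntimeError.

```python
class FakeError:
    """Stand-in for pb_utils.TritonError; only message() is modeled."""

    def __init__(self, message):
        self._message = message

    def message(self):
        return self._message


class FakeResponse:
    """Stand-in for pb_utils.InferenceResponse; only has_error() and error()."""

    def __init__(self, error=None):
        self._error = error

    def has_error(self):
        return self._error is not None

    def error(self):
        return self._error


def raise_if_failed(inference_response):
    """Return the response unchanged, or raise if the nested call failed."""
    if inference_response.has_error():
        # RuntimeError stands in for pb_utils.TritonModelException here.
        raise RuntimeError(inference_response.error().message())
    return inference_response


ok_response = raise_if_failed(FakeResponse())  # passes through unchanged
try:
    raise_if_failed(FakeResponse(FakeError("facenet is not ready")))
except RuntimeError as exc:
    failure_message = str(exc)
print(failure_message)  # facenet is not ready
```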

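Problems encountered and how to solve them

Tensor is stored in GPU and cannot be converted to NumPy.

The tensor is stored on the GPU and cannot be converted to NumPy, and Triton offers no other interface for reading the data. There is no good solution for now, so I use a clumsy workaround that costs some performance, though not much. A proper fix has to wait until Triton provides the interface. I first convert the Tensor to a PyTorch tensor through DLPack, then copy it to the CPU and call its numpy method.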

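from torch.utils.dlpack import from_dlpack  # requires PyTorch in the model's environment


def pb_tensor_to_numpy(pb_tensor):
    if pb_tensor.is_cpu():
        return pb_tensor.as_numpy()
    else:
        # Wrap the GPU buffer as a PyTorch tensor via DLPack, then copy it to the host.
        pytorch_tensor = from_dlpack(pb_tensor.to_dlpack())
        return pytorch_tensor.cpu().numpy()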

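Appendix: Python Backend interface

Because the Python Backend repository does not document its interface, this section lists it so you can see which functions are available.

utils

The functions exposed by triton_python_backend_utils are: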

serialize_byte_tensor(input_tensor)
deserialize_bytes_tensor(encoded_tensor)
get_input_tensor_by_name(inference_request, name)
get_output_tensor_by_name(inference_response, name)
get_input_config_by_name(model_config, name)
get_output_config_by_name(model_config, name)
triton_to_numpy_type(data_type)
numpy_to_triton_type(data_type)
triton_string_to_numpy(triton_type_string)

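Classes exported through pybind11

The Python Backend uses pybind11 to export some Python classes. The definitions of Tensor, InferenceRequest, InferenceResponse, and TritonError are in the pybind11 module definition, reproduced here.

Tensor

The most important method of Tensor is as_numpy, which converts the Tensor into a numpy array.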

  py::class_<PbTensor, std::shared_ptr<PbTensor>>(module, "Tensor")
      .def(py::init(&PbTensor::FromNumpy))
      .def("name", &PbTensor::Name)
      .def("as_numpy", &PbTensor::AsNumpy)
      .def("triton_dtype", &PbTensor::TritonDtype)
      .def("to_dlpack", &PbTensor::ToDLPack)
      .def("is_cpu", &PbTensor::IsCPU)
      .def("from_dlpack", &PbTensor::FromDLPack);

InferenceRequest

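An InferenceRequest can be used to build a call to a model served by another backend. As the pybind11 definition shows, the arguments model_name, model_version, requested_output_names, and inputs select the model, version, outputs, and inputs of the request, and the exec method runs it.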

  py::class_<InferRequest, std::shared_ptr<InferRequest>>(
      module, "InferenceRequest")
      .def(
          py::init<
              const std::string&, uint64_t,
              const std::vector<std::shared_ptr<PbTensor>>&,
              const std::vector<std::string>&, const std::string&,
              const int64_t>(),
          py::arg("request_id") = "", py::arg("correlation_id") = 0,
          py::arg("inputs"), py::arg("requested_output_names"),
          py::arg("model_name"), py::arg("model_version") = -1)
      .def(
          "inputs", &InferRequest::Inputs,
          py::return_value_policy::reference_internal)
      .def("request_id", &InferRequest::RequestId)
      .def("correlation_id", &InferRequest::CorrelationId)
      .def("exec", &InferRequest::Exec)
      .def(
          "async_exec",
          [](std::shared_ptr<InferRequest>& infer_request) {
            py::object loop =
                py::module_::import("asyncio").attr("get_running_loop")();
            py::cpp_function callback = [infer_request]() {
              auto response = infer_request->Exec();
              return response;
            };
            py::object future =
                loop.attr("run_in_executor")(py::none(), callback);
            return future;
          })
      .def(
          "requested_output_names", &InferRequest::RequestedOutputNames,
          py::return_value_policy::reference_internal);
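
The async_exec binding above follows a common asyncio pattern: get the running event loop and hand the blocking call to run_in_executor, which returns an awaitable future. The sketch below reproduces that pattern in plain Python so it can run outside Triton. blocking_exec stands in for InferenceRequest.exec(), and the model names are made up for the demo.

```python
import asyncio
import time


def blocking_exec(model_name):
    """Stand-in for InferenceRequest.exec(): blocks, then returns a result."""
    time.sleep(0.05)
    return f"{model_name}: done"


def async_exec(model_name):
    """Same pattern as the binding: run the blocking call in the default executor."""
    loop = asyncio.get_running_loop()
    return loop.run_in_executor(None, blocking_exec, model_name)


async def call_two_models():
    # Both blocking calls run in worker threads, so they can overlap.
    return await asyncio.gather(async_exec("model_a"), async_exec("model_b"))


results = asyncio.run(call_two_models())
print(results)  # ['model_a: done', 'model_b: done']
```

Because get_running_loop only works while an event loop is running, a call like this has to be made from inside a coroutine.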

InferenceResponse

  py::class_<InferResponse>(module, "InferenceResponse")
      .def(
          py::init<
              const std::vector<std::shared_ptr<PbTensor>>&,
              std::shared_ptr<PbError>>(),
          py::arg("output_tensors"), py::arg("error") = nullptr)
      .def(
          "output_tensors", &InferResponse::OutputTensors,
          py::return_value_policy::reference)
      .def("has_error", &InferResponse::HasError)
      .def("error", &InferResponse::Error);

TritonError

  py::class_<PbError, std::shared_ptr<PbError>>(module, "TritonError")
      .def(py::init<std::string>())
      .def("message", &PbError::Message);

TritonModelException

  py::register_exception<PythonBackendException>(
      module, "TritonModelException");


Original post: https://www.cnblogs.com/zzk0/p/15535828.html