
    Import From ONNX

    ONNX versions change quickly. The ONNX parser shipped with TensorRT 5.1.x supports ONNX IR (intermediate representation) version 0.0.3 and opset version 9. For ONNX version incompatibilities, see the ONNX Model Opset Version Converter.
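
    As a minimal sketch (the file names are placeholders, not from the original post), a model exported with a newer opset can be down-converted with the onnx version converter before handing it to the TensorRT parser:

    import onnx
    from onnx import version_converter

    # Load the exported model and convert it to opset 9, which the
    # TensorRT 5.1.x ONNX parser understands.
    model = onnx.load("model.onnx")                    # placeholder input path
    converted = version_converter.convert_version(model, 9)
    onnx.save(converted, "model_opset9.onnx")          # placeholder output path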

    Create the builder, network, and parser

    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(model_path, 'rb') as model:
            parser.parse(model.read())
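
    parser.parse() returns False when the model cannot be imported, and the parser keeps a list of errors explaining why. A minimal sketch of checking them (reusing TRT_LOGGER and model_path from the snippet above):

    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(model_path, 'rb') as model:
            # parse() returns False on failure; print the recorded errors.
            if not parser.parse(model.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))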

    Building the engine

    The builder object has many attributes, which can be used to control quantization precision, batch size, and so on.

    Two builder attributes are especially important:

    1. maximum batch size: the largest batch size TensorRT will optimize for; at run time, the chosen batch size must be less than or equal to this value.

    2. maximum workspace size: scratch space for layer algorithms. This value caps the temporary memory any layer in the network may use; if it is set too small, TensorRT may not be able to find an implementation for a given layer.

    build the engine:

    with trt.Builder(TRT_LOGGER) as builder:
        builder.max_batch_size = max_batch_size
        # This determines the amount of memory available to the builder when building
        # an optimized engine and should generally be set as high as possible.
        builder.max_workspace_size = 1 << 20
        with builder.build_cuda_engine(network) as engine:
            # Do inference here.

    The weights are copied by TensorRT while the engine is being built.

    Serializing a Model

    Serializing means converting the engine into a format that can be stored and reused later for inference.

    At inference time, you only need to deserialize the stored engine and use it.

    The reason for doing this is that building an engine is time-consuming; if an already built engine can be stored and reloaded later, the preparation time for inference is greatly reduced.

    Note: a saved engine cannot be used across platforms.

    Serialize the model to a modelstream:

    serialized_engine = engine.serialize()

    Deserialize modelstream to perform inference. Deserializing requires creation of a runtime object:

    with trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(serialized_engine)

    If you instead save the engine to a file:

    Serialize the engine and write to a file:

    with open("sample.engine", "wb") as f:
        f.write(engine.serialize())

    Read the engine from the file and deserialize:

    with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
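
    Putting the two together, a common pattern is to build the engine only when no serialized copy exists yet. A minimal sketch (ENGINE_PATH and build_engine_from_onnx are assumed names, not from the original post):

    import os
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    ENGINE_PATH = "sample.engine"

    def get_engine():
        if os.path.exists(ENGINE_PATH):
            # Deserializing a saved engine is much faster than rebuilding it.
            with open(ENGINE_PATH, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
                return runtime.deserialize_cuda_engine(f.read())
        # First run: build from the ONNX model, then cache the result.
        engine = build_engine_from_onnx("model.onnx")  # assumed helper wrapping the build steps above
        with open(ENGINE_PATH, "wb") as f:
            f.write(engine.serialize())
        return engine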

    Performing Inference

    Allocate some host and device buffers for the inputs and outputs:

    # The buffer-allocation snippet assumes pycuda and numpy are imported:
    import pycuda.driver as cuda
    import pycuda.autoinit  # initializes the CUDA context
    import numpy as np

    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
    h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
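
    The snippet above hard-codes binding indices 0 and 1. For networks with several inputs or outputs, a sketch that loops over all bindings (the same idea as NVIDIA's sample helpers; host_bufs/dev_bufs are illustrative names):

    host_bufs, dev_bufs, bindings = [], [], []
    for binding in engine:  # iterates over binding names
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        h_buf = cuda.pagelocked_empty(size, dtype)   # page-locked host buffer
        d_buf = cuda.mem_alloc(h_buf.nbytes)         # matching device buffer
        bindings.append(int(d_buf))
        host_bufs.append(h_buf)
        dev_bufs.append(d_buf)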

    Space is also needed to hold intermediate activation values. The engine contains the network definition and the trained weights, so this scratch space must be provided separately. These are held in an execution context:

    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        return h_output
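
    A brief usage sketch, assuming the allocation and execution snippets above are wrapped in helpers; do_inference and preprocess_image are placeholder names, not part of the original post:

    # Fill the page-locked input buffer with preprocessed data, run the
    # engine, and interpret the output.
    img = preprocess_image("test.jpg")                  # placeholder: resize/normalize to the input shape
    np.copyto(h_input, img.ravel())                     # copy into the page-locked host buffer
    output = do_inference(engine, h_input, d_input, h_output, d_output, stream)
    print("Predicted class:", int(np.argmax(output)))
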
Original post: https://www.cnblogs.com/yanxingang/p/10858757.html