  • Auto-scheduling a Neural Network for x86 CPU

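    Auto-tuning for a specific device and workload is critical for getting the best performance. This tutorial shows how to tune a whole neural network for x86 CPU with the auto-scheduler.

    To auto-tune a neural network, the network is partitioned into small subgraphs that are tuned independently. Each subgraph is treated as one search task. A task scheduler slices the time and dynamically allocates time resources to these tasks. The task scheduler predicts the impact of each task on the end-to-end execution time and prioritizes the task that can reduce the execution time the most.

    For each subgraph, the compute declarations in tvm/python/topi are used to get the computational DAG in tensor expression form. The auto-scheduler then constructs a search space for this DAG and searches for good schedules (low-level optimizations).

    Unlike the template-based autotvm, which relies on manual templates to define the search space, the auto-scheduler does not require any schedule templates. In other words, the auto-scheduler only uses the compute declarations in tvm/python/topi and does not use the existing schedule templates.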

    Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you need to wrap the body of this tutorial in an if __name__ == "__main__": block.
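A minimal sketch of that wrapping, assuming the tutorial steps are collected into a single function. The name run_tutorial is hypothetical and its body is only a placeholder for the real code that follows in this article.

```python
def run_tutorial():
    """Placeholder for the tutorial body (hypothetical function name)."""
    # In a real script, the network definition, task extraction,
    # tuning, and evaluation code would go here.
    steps = ["define network", "extract tasks", "tune", "compile and evaluate"]
    for step in steps:
        print(step)
    return steps


# The guard keeps the body from running again when the file is imported
# rather than executed, for example by child processes that Python starts
# with the "spawn" method.
if __name__ == "__main__":
    run_tutorial()
```

The guard only changes when the body runs, not what it does, so the code in the rest of the article can be moved into such a function unchanged.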

    import numpy as np
     
    import tvm
    from tvm import relay, auto_scheduler
    import tvm.relay.testing
    from tvm.contrib import graph_runtime

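    Define a Network

    First, we need to define the network with the Relay frontend API. We can load some pre-defined networks from tvm.relay.testing. We can also load models from MXNet, ONNX, PyTorch, and TensorFlow.

    For convolutional neural networks, the auto-scheduler works correctly with any layout, but the best performance is typically achieved with the NHWC layout. More optimizations are also implemented for the NHWC layout in the auto-scheduler. It is therefore recommended to convert your models to NHWC layout before using the auto-scheduler. You can use the ConvertLayout pass to do the layout conversion in TVM.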
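A minimal sketch of such a conversion, assuming a Relay module whose convolutions are in NCHW layout. The helper name convert_to_nhwc is hypothetical, and the format of the layout dictionary follows the ConvertLayout documentation for TVM versions contemporary with this tutorial, so it may differ in other releases. TVM is imported inside the function so that the snippet can be loaded even where TVM is not installed.

```python
# Target layouts per operator: NHWC for the data, "default" for the kernel.
DESIRED_LAYOUTS = {"nn.conv2d": ["NHWC", "default"]}


def convert_to_nhwc(mod):
    """Return the Relay module `mod` with conv2d operators converted to NHWC."""
    import tvm
    from tvm import relay

    seq = tvm.transform.Sequential(
        [
            relay.transform.RemoveUnusedFunctions(),
            relay.transform.ConvertLayout(DESIRED_LAYOUTS),
        ]
    )
    # Run the two passes under a pass context, as other Relay transforms are.
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```

A model imported from a frontend that produces NCHW would be passed through this helper once, before the search tasks are extracted.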

    def get_network(name, batch_size, layout="NHWC", dtype="float32"):
        """Get the symbol definition and random weight of a network"""
     
        # auto-scheduler prefers NHWC layout
        if layout == "NHWC":
            image_shape = (224, 224, 3)
        elif layout == "NCHW":
            image_shape = (3, 224, 224)
        else:
            raise ValueError("Invalid layout: " + layout)
     
        input_shape = (batch_size,) + image_shape
        output_shape = (batch_size, 1000)
     
        if name.startswith("resnet-"):
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer,
                batch_size=batch_size,
                layout=layout,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name.startswith("resnet3d-"):
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer,
                batch_size=batch_size,
                layout=layout,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name == "mobilenet":
            mod, params = relay.testing.mobilenet.get_workload(
                batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
            )
        elif name == "squeezenet_v1.1":
            assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
            mod, params = relay.testing.squeezenet.get_workload(
                version="1.1",
                batch_size=batch_size,
                dtype=dtype,
                image_shape=image_shape,
            )
        elif name == "inception_v3":
            input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
            mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "mxnet":
            # an example for mxnet model
            from mxnet.gluon.model_zoo.vision import get_model
     
            assert layout == "NCHW"
     
            block = get_model("resnet50_v1", pretrained=True)
            mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
            net = mod["main"]
            net = relay.Function(
                net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
            )
            mod = tvm.IRModule.from_expr(net)
     
        return mod, params, input_shape, output_shape
     
     
    # Define the neural network and compilation target.
    # If the target machine supports avx512 instructions, replace the
    # "llvm -mcpu=core-avx2" with "llvm -mcpu=skylake-avx512"
    network = "resnet-50"
    batch_size = 1
    layout = "NHWC"
    target = tvm.target.Target("llvm -mcpu=core-avx2")
    dtype = "float32"
    log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name)

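    Extract Search Tasks

    Next, we extract the search tasks and their weights from the network. The weight of a task is the number of times the task's subgraph appears in the whole network. Using the weights, the end-to-end latency of the network can be approximated as sum(latency[t] * weight[t]), where latency[t] is the latency of a task and weight[t] is its weight. The task scheduler optimizes only this objective.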
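The objective above can be written out as a short sketch. The function name and the three-task numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements.

```python
def estimated_latency(latency, weight):
    """Return sum(latency[t] * weight[t]) over all tasks t."""
    if len(latency) != len(weight):
        raise ValueError("latency and weight must have the same length")
    return sum(l * w for l, w in zip(latency, weight))


# Hypothetical values for three tasks (not measured).
latency_ms = [0.5, 0.25, 1.0]  # per-task latency in milliseconds
weight = [1, 4, 2]             # how often each subgraph appears in the network

# 0.5 * 1 + 0.25 * 4 + 1.0 * 2 = 3.5
print(estimated_latency(latency_ms, weight))
```

A task that appears often contributes to the estimate in proportion to its weight, which is why the scheduler may spend more trials on a moderately fast but frequently repeated subgraph.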

    # Extract tasks from the network
    print("Extract tasks...")
    mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
     
    for idx, task in enumerate(tasks):
        print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
        print(task.compute_dag)

    Out:

    Extract tasks...
    ========== Task 0  (workload key: ["b32ed43fb351136894c322ee49097a1a"]) ==========
    placeholder = PLACEHOLDER [1, 1000]
    T_softmax_maxelem(i0) max= placeholder[i0, k]
    T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
    T_softmax_expsum(i0) += T_softmax_exp[i0, k]
    T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])
     
    ========== Task 1  (workload key: ["6129df1a3d5f6326c8393a8d17160199"]) ==========
    placeholder = PLACEHOLDER [1, 2048]
    placeholder = PLACEHOLDER [1000, 2048]
    compute(z, y, x) += (placeholder[z, ((k*16) + x)]*placeholder[y, ((k*16) + x)])
    compute(y, x) += compute[y, x, kk]
    placeholder = PLACEHOLDER [1000]
    T_add(ax0, ax1) = (compute[ax0, ax1] + placeholder[ax1])
     
    ========== Task 2  (workload key: ["36ee2798ed60bae3bcd1bb89a0285fe8"]) ==========
    placeholder = PLACEHOLDER [1, 7, 7, 2048]
    tensor(ax0, ax1, ax2, ax3) += placeholder[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
    tensor(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))
     
    ========== Task 3  (workload key: ["dcf6fcf5f56fa614bf9aef0c82382caf"]) ==========
    placeholder = PLACEHOLDER [1, 7, 7, 512]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 512, 2048]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 7, 7, 2048]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
    placeholder = PLACEHOLDER [1, 1, 1, 2048]
    T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*placeholder[ax0, 0, 0, ax3])
    placeholder = PLACEHOLDER [1, 1, 1, 2048]
    T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 4  (workload key: ["7e3f0cf5a6dd80d36dab1a3dad92674a"]) ==========
    placeholder = PLACEHOLDER [1, 7, 7, 512]
    PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
    placeholder = PLACEHOLDER [3, 3, 512, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 5  (workload key: ["e0a9eb3795b531085e0ebb772e7e800c"]) ==========
    placeholder = PLACEHOLDER [1, 7, 7, 2048]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 2048, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 6  (workload key: ["03614e726dc588d11887eb0953a77e53"]) ==========
    placeholder = PLACEHOLDER [1, 7, 7, 512]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 512, 2048]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 7, 7, 2048]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
     
    ========== Task 7  (workload key: ["7657f886f5e9d8b5f19a5fd2c5b90d8d"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 1024]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 1024, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 8  (workload key: ["7e09b626cf077cd419190fee02091dd6"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 256]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 256, 1024]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 14, 14, 1024]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
    placeholder = PLACEHOLDER [1, 1, 1, 1024]
    T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 9  (workload key: ["95bf49cc8cf7a351e974b2359702aac0"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 256]
    PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
    placeholder = PLACEHOLDER [3, 3, 256, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 10  (workload key: ["e043f834cc7f19597227e09dc7f59503"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 1024]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 1024, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 11  (workload key: ["cd7c4a374fb2bbc0d075c8cae638ad14"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 256]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 256, 1024]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 14, 14, 1024]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
     
    ========== Task 12  (workload key: ["1dce2c5e4269b8a12dfc50cd4dd23ff1"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 512]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 512, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 13  (workload key: ["d3b36ce001dc24d693facfbdae1979b4"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 128]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 128, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 28, 28, 512]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
    placeholder = PLACEHOLDER [1, 1, 1, 512]
    T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 14  (workload key: ["0fb1dfcdb5b755e2dab290ed0129dcf2"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 128]
    PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
    placeholder = PLACEHOLDER [3, 3, 128, 128]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 15  (workload key: ["45acfc473c772458684f36a34549d8aa"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 512]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 512, 128]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 16  (workload key: ["5e3ceb6e23ae8c351d5a1770d5fc6c7c"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 128]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 128, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 28, 28, 512]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
     
    ========== Task 17  (workload key: ["a085717fb3dcb046e5c4c2c04d3dc541"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 256]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 256, 128]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 128]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 18  (workload key: ["691feef049c8693bbe91bd5e7c9cdf34"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 64]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 64, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 56, 56, 256]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
    placeholder = PLACEHOLDER [1, 1, 1, 256]
    T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 19  (workload key: ["a9e632e5167afb60fbe29e7aeef1d152"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 64]
    PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
    placeholder = PLACEHOLDER [3, 3, 64, 64]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 20  (workload key: ["b51e06c1131d4cded40d1b215f722a4e"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 256]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 256, 64]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 21  (workload key: ["8fcee68a4342c38248a827f1c6c69177"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 64]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 64, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 56, 56, 256]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])
     
    ========== Task 22  (workload key: ["8dd7d81db440763f622f03fdc99e6d46"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 64]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 64, 64]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 23  (workload key: ["ba2026d923536b75e9b4faed89287d5f"]) ==========
    placeholder = PLACEHOLDER [1, 112, 112, 64]
    pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), placeholder[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)
    tensor(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + dh), ((ax2*2) + dw), ax3]
    placeholder = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (tensor[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 24  (workload key: ["a0eb8d6048282a4a0986cc2ccf14eaa2"]) ==========
    placeholder = PLACEHOLDER [1, 224, 224, 3]
    PaddedInput(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), placeholder[i0, (i1 - 3), (i2 - 3), i3], 0f)
    placeholder = PLACEHOLDER [7, 7, 3, 64]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
    placeholder = PLACEHOLDER [1, 1, 1, 64]
    T_add(ax0, ax1, ax2, ax3) = (Conv2dOutput[ax0, ax1, ax2, ax3] + placeholder[ax0, 0, 0, ax3])
    T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)
     
    ========== Task 25  (workload key: ["45b4de07687dee43ee1cbde9f516b2bf"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 64]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 64, 256]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])
     
    ========== Task 26  (workload key: ["b2010aa63c95dedf1f58f3fe8bc78634"]) ==========
    placeholder = PLACEHOLDER [1, 56, 56, 256]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 256, 512]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
     
    ========== Task 27  (workload key: ["4d7e646d99bfa3cea8245bd7100369cb"]) ==========
    placeholder = PLACEHOLDER [1, 28, 28, 512]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 512, 1024]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])
     
    ========== Task 28  (workload key: ["537c8642716948c33a6eaaabc86b159d"]) ==========
    placeholder = PLACEHOLDER [1, 14, 14, 1024]
    PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
    placeholder = PLACEHOLDER [1, 1, 1024, 2048]
    Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

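    Begin Tuning

    Now, we set some options for tuning and launch the search tasks.

    • num_measure_trials is the number of measurement trials that can be used during tuning. You can set it to a small number (e.g., 200) for a fast demonstration. In practice, it is recommended to set it to around 800 * len(tasks), which is typically enough for the search to converge. For example, there are 29 tasks in resnet-50, so it can be set to 20000. You can adjust this parameter according to your time budget.
    • In addition, RecordToFile is used to dump the measurement records into a log file. The measurement records can be used to query the history best, resume the search, and do more analysis later.
    • See auto_scheduler.TuningOptions and auto_scheduler.LocalRunner for more parameters.
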
    def run_tuning():
        print("Begin tuning...")
        tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=200,  # change this to 20000 to achieve the best performance
            runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
     
        tuner.tune(tune_option)
     
     
    # We do not run the tuning in our webpage server since it takes too long.
    # Uncomment the following line to run it by yourself.
     
    # run_tuning()
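The rule of thumb for num_measure_trials can be worked out as a quick check. The helper name suggested_trials is hypothetical, and the default of 800 trials per task is the figure quoted in the options list before run_tuning.

```python
def suggested_trials(num_tasks, trials_per_task=800):
    """Return a rough measurement budget for a network with `num_tasks` tasks."""
    return trials_per_task * num_tasks


# resnet-50 yields 29 tasks in the extraction output of this article.
print(suggested_trials(29))
```

For 29 tasks this gives 23200, which is in the same range as the round figure of 20000 suggested in the comment inside run_tuning.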

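    Note

    Explanation of the information printed during tuning

    During tuning, a lot of information is printed on the console. It is intended for debugging purposes. The most important information is the output of the task scheduler. The following table is a sample output.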

    ----------------------------------------------------------------------
    ------------------------------  [ Task Scheduler ]
    ----------------------------------------------------------------------
    |  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
    -------------------------------------------------
    |    0 |        0.010 |           0.40 |     64 |
    |    1 |        0.087 |          47.19 |     64 |
    |    2 |        0.008 |          -0.00 |     64 |
    |    3 |        0.177 |         582.07 |     64 |
    |    4 |        0.268 |         862.37 |    256 |
    |    5 |        0.166 |         621.13 |    128 |
    |    6 |        0.170 |         605.10 |    128 |
    |    7 |        0.128 |         403.20 |     64 |
    |    8 |        0.189 |         545.71 |     64 |
    |    9 |        0.231 |        1001.01 |    448 |
    |   10 |        0.155 |         664.80 |    256 |
    |   11 |        0.155 |         662.86 |    256 |
    |   12 |        0.119 |         434.08 |     64 |
    |   13 |        0.199 |         522.13 |     64 |
    |   14 |        0.235 |         986.56 |    320 |
    |   15 |        0.149 |         689.13 |    128 |
    |   16 |        0.155 |         664.80 |    192 |
    |   17 |        0.151 |         340.64 |     64 |
    |   18 |        0.176 |         597.55 |    128 |
    |   19 |        0.220 |        1054.37 |    192 |
    |   20 |        0.150 |         686.01 |    128 |
    |   21 |        0.159 |         650.88 |    128 |
    |   22 |        0.073 |         358.19 |     64 |
    |   23 |        0.031 |          70.63 |     64 |
    |   24 |        0.251 |         947.73 |    128 |
    |   25 |        0.157 |         652.47 |    128 |
    |   26 |        0.215 |         954.84 |    128 |
    |   27 |        0.237 |         868.92 |    128 |
    |   28 |        0.266 |         774.06 |    128 |
    -------------------------------------------------
    Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15

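    This table lists the latency and (estimated) speed of all tasks. It also lists the allocation of measurement trials across the tasks. The last line prints the total weighted latency of these tasks, which can serve as a rough estimate of the end-to-end execution time of the network. The last line also prints the total number of measurement trials, the total time spent on auto-tuning, and the ID of the next task to tune.

    There will also be some "dmlc::Error" messages, because the auto-scheduler will try some invalid schedules. You can safely ignore them as long as the tuning continues, because these errors are isolated from the main process.

    Note

    Terminating the tuning early

    You can terminate the tuning early by forcibly killing this process. As long as the log file contains at least one valid schedule for each task, you should be able to do the compilation (the section below).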
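When tuning runs for a long time, it can be handy to pull the numbers out of the summary line with a script. The following is a minimal sketch based only on the sample line in this article; the helper name parse_summary is hypothetical, and the exact format printed by other TVM versions may differ.

```python
import re

# Summary line copied from the sample task scheduler output.
SUMMARY = (
    "Estimated total latency: 10.016 ms      Trials: 3992    "
    "Used time : 1131 s      Next ID: 15"
)

PATTERN = re.compile(
    r"Estimated total latency:\s*(?P<latency>[\d.]+) ms\s+"
    r"Trials:\s*(?P<trials>\d+)\s+"
    r"Used time\s*:\s*(?P<seconds>\d+) s\s+"
    r"Next ID:\s*(?P<next_id>-?\d+)"
)


def parse_summary(line):
    """Return the fields of a summary line as a dict, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "latency_ms": float(match.group("latency")),
        "trials": int(match.group("trials")),
        "used_time_s": int(match.group("seconds")),
        "next_id": int(match.group("next_id")),
    }


info = parse_summary(SUMMARY)
print(info)
```

Returning None for lines that do not match lets the helper be applied to every console line without first filtering out the rest of the log.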

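    Compile and Evaluate

    After auto-tuning, we can compile the network with the best schedules found. All measurement records are dumped into the log file during auto-tuning, so we can read the log file and load the best schedules.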

    # Compile with the history best
    print("Compile...")
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
            lib = relay.build(mod, target=target, params=params)
     
    # Create graph runtime
    ctx = tvm.context(str(target), 0)
    module = graph_runtime.GraphModule(lib["default"](ctx))
    data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
    module.set_input("data", data_tvm)
     
    # Evaluate
    print("Evaluate inference time cost...")
    ftimer = module.module.time_evaluator("run", ctx, repeat=3, min_repeat_ms=500)
    prof_res = np.array(ftimer().results) * 1e3  # convert to millisecond
    print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))

    Out:

    Compile...
    Evaluate inference time cost...
    Mean inference time (std dev): 30.72 ms (0.09 ms)

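    Other Tips

    • During tuning, the auto-scheduler needs to compile many programs and extract features from them. This part is CPU-intensive, so a high-performance CPU with many cores is recommended for a faster search.
    • You can use python3 -m tvm.auto_scheduler.measure_record --mode distill --i log.json to distill a large log file and save only the most useful records.
    • You can resume a search from a previous log file. You only need to add a new argument, load_log_file, when creating the task scheduler in the function run_tuning, that is, tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
    • If you have multiple target CPUs, you can use all of them to run the measurements in parallel. See the TVM documentation on the RPC Tracker and RPC Server for how to set them up. To use the RPC Tracker in the auto-scheduler, replace the runner in TuningOptions with auto_scheduler.RPCRunner.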
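The last two tips can be combined into one variant of run_tuning. This is a sketch under stated assumptions: the function name resume_tuning_over_rpc and the values of RPC_KEY, RPC_HOST, and RPC_PORT are hypothetical, and an RPC tracker with at least one server registered under that key must already be running. TVM is imported inside the function so that the snippet can be loaded without it.

```python
# Hypothetical tracker settings; replace them with those of your own setup.
RPC_KEY = "x86-box"
RPC_HOST = "127.0.0.1"
RPC_PORT = 9190


def resume_tuning_over_rpc(tasks, task_weights, log_file, num_measure_trials=200):
    """Resume tuning from `log_file` and measure on devices behind an RPC tracker."""
    from tvm import auto_scheduler

    # load_log_file lets the task scheduler pick up the earlier records.
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=num_measure_trials,
        # RPCRunner takes the place of LocalRunner.
        runner=auto_scheduler.RPCRunner(
            key=RPC_KEY, host=RPC_HOST, port=RPC_PORT, repeat=10
        ),
        # New records are appended to the same log file.
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    tuner.tune(tune_option)
```

It would be called with the tasks, task_weights, and log_file defined earlier in the article, in place of run_tuning().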
  • Original article: https://www.cnblogs.com/wujianming-110117/p/14182391.html