  • Auto-tuning a Convolutional Network for NVIDIA GPU


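    Auto-tuning for a specific device and workload is critical for getting the best performance. This tutorial shows how to tune a whole convolutional network for an NVIDIA GPU.

    The operator implementations for NVIDIA GPU in TVM are written as templates. Each template has many tunable knobs (tile factor, unrolling, etc.). We will tune all convolution and depthwise convolution operators in the neural network. After tuning, we produce a log file that stores the best knob values for all required operators. When the TVM compiler compiles these operators, it queries this log file to get the best knob values.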
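
    To give a feel for what a knob space looks like, here is a minimal sketch in plain Python. It is not TVM's template API: the knob names tile_x, tile_y and unroll, the axis length 64, and the helper factor_pairs are all hypothetical, chosen only to show how a handful of knobs multiplies into a search space.

```python
from itertools import product


def factor_pairs(n):
    """Return every (outer, inner) split of an axis of length n."""
    return [(a, n // a) for a in range(1, n + 1) if n % a == 0]


# Hypothetical knobs for one operator template (illustrative only).
knobs = {
    "tile_x": factor_pairs(64),  # ways to tile a 64-wide axis
    "tile_y": factor_pairs(64),
    "unroll": [False, True],     # whether to unroll the inner loop
}

# Every combination of knob values is one candidate configuration.
space = [dict(zip(knobs, values)) for values in product(*knobs.values())]

if __name__ == "__main__":
    print(len(space), "candidate configurations")
    print("first candidate:", space[0])
```

    A tuner's job is to find a good point in such a space without measuring every candidate.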

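    Pre-tuned parameters for some NVIDIA GPUs have also been released. See the NVIDIA GPU benchmark for the results.

    Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, wrap the body of the script in an if __name__ == "__main__": block.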
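
    A minimal sketch of that wrapping follows. It assumes the restriction exists because the tuner starts worker processes and these platforms re-import the main module in each worker; that is our assumption, as the tutorial does not state the reason. The function name main and its return value are placeholders; in the real script the body would call tune_and_evaluate(tuning_option).

```python
def main():
    # Placeholder body: the real script would call
    # tune_and_evaluate(tuning_option) here.
    return "tuning would start here"


# Module-level code above this line should only define things.
# Work that launches tuning goes behind the guard so that a re-import
# of this file in a child process does not start tuning again.
if __name__ == "__main__":
    print(main())
```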

    Install dependencies

    To use the autotvm package in TVM, we need to install some extra dependencies (change "3" to "2" if you use Python 2):

    pip3 install --user psutil xgboost tornado

    To make TVM run faster during tuning, it is recommended to use Cython as the FFI of TVM. In the root directory of TVM, execute:

    pip3 install --user cython

    sudo make cython3
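
    Before starting a long tuning run, it can help to confirm that the packages are importable. The sketch below uses only the standard library; the helper name missing_packages is ours, and the package list simply mirrors the pip command above.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    required = ["psutil", "xgboost", "tornado"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All tuning dependencies are importable.")
```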

    Now return to the Python code. Import the packages.

    import os

    import numpy as np

    import tvm
    from tvm import relay, autotvm
    import tvm.relay.testing
    from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
    import tvm.contrib.graph_runtime as runtime

    Define Network

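    First we need to define the network in the Relay frontend API. We can load some predefined networks from tvm.relay.testing. We can also load models from MXNet, ONNX, and TensorFlow.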

    def get_network(name, batch_size):
        """Get the symbol definition and random weight of a network"""
        input_shape = (batch_size, 3, 224, 224)
        output_shape = (batch_size, 1000)

        if "resnet" in name:
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.resnet.get_workload(
                num_layers=n_layer, batch_size=batch_size, dtype=dtype
            )
        elif "vgg" in name:
            n_layer = int(name.split("-")[1])
            mod, params = relay.testing.vgg.get_workload(
                num_layers=n_layer, batch_size=batch_size, dtype=dtype
            )
        elif name == "mobilenet":
            mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "squeezenet_v1.1":
            mod, params = relay.testing.squeezenet.get_workload(
                batch_size=batch_size, version="1.1", dtype=dtype
            )
        elif name == "inception_v3":
            input_shape = (batch_size, 3, 299, 299)
            mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
        elif name == "mxnet":
            # an example for mxnet model
            from mxnet.gluon.model_zoo.vision import get_model

            block = get_model("resnet18_v1", pretrained=True)
            mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
            net = mod["main"]
            net = relay.Function(
                net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
            )
            mod = tvm.IRModule.from_expr(net)
        else:
            raise ValueError("Unsupported network: " + name)

        return mod, params, input_shape, output_shape
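
    get_network reads the layer count from the part of the name after the hyphen. The stand-alone sketch below isolates that naming convention so it can be tried without TVM installed; the helper name parse_network_name is hypothetical and is not part of the tutorial.

```python
def parse_network_name(name):
    """Split a name such as "resnet-18" into (family, layer_count).

    Names without a hyphen, such as "mobilenet", have no layer count.
    """
    family, sep, depth = name.partition("-")
    return family, (int(depth) if sep else None)


if __name__ == "__main__":
    for name in ["resnet-18", "vgg-16", "mobilenet", "squeezenet_v1.1"]:
        print(name, "->", parse_network_name(name))
```

    Note that get_network itself indexes name.split("-")[1] directly, so a bare "resnet" or "vgg" without a layer count raises IndexError there; pass the full form such as "resnet-18".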

    Set Tuning Options

    Before tuning, we apply some configurations.

    #### DEVICE CONFIG ####
    target = tvm.target.cuda()

    #### TUNING OPTION ####
    network = "resnet-18"
    log_file = "%s.log" % network
    dtype = "float32"

    tuning_option = {
        "log_filename": log_file,
        "tuner": "xgb",
        "n_trial": 2000,
        "early_stopping": 600,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
        ),
    }

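    Note

    How to set tuning options

    In general, the default values provided here work well.

    If you have a large time budget, you can set n_trial and early_stopping to larger values, which makes the tuning run longer.

    If you have multiple devices, you can use all of them for measurement to accelerate the tuning process (see the "Scale up measurement by using multiple devices" section below).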
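
    The sketch below is a simplified model of how n_trial and early_stopping bound the work spent on one task. It assumes early stopping means "stop once this many consecutive trials fail to beat the best result so far"; this illustrates the idea and is not TVM's implementation. The function name trials_run and the score list are made up.

```python
def trials_run(scores, n_trial, early_stopping=None):
    """Return how many trials are measured before tuning stops.

    scores: performance of each candidate in measurement order (higher is better).
    n_trial: hard cap on the number of trials.
    early_stopping: stop after this many trials without a new best, or None.
    """
    best = float("-inf")
    best_at = 0
    for i, score in enumerate(scores[:n_trial], start=1):
        if score > best:
            best, best_at = score, i
        if early_stopping is not None and i - best_at >= early_stopping:
            return i  # no improvement for `early_stopping` trials
    return min(len(scores), n_trial)


if __name__ == "__main__":
    # Invented scores: three improvements, then a long plateau.
    scores = [1.0, 2.0, 3.0] + [0.5] * 10
    print(trials_run(scores, n_trial=100, early_stopping=4))
    print(trials_run(scores, n_trial=100))
    print(trials_run(scores, n_trial=5))
```

    Raising either value lets the search continue longer on tasks that are still improving, at the cost of more measurement time.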

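    Begin Tuning

    Now we can extract tuning tasks from the network and begin tuning. Here, we provide a simple utility function to tune a list of tasks. This function is just an initial implementation that tunes them in sequential order. A more sophisticated tuning scheduler will be introduced in the future.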

    # You can skip the implementation of this function for this tutorial.
    def tune_tasks(
        tasks,
        measure_option,
        tuner="xgb",
        n_trial=1000,
        early_stopping=None,
        log_filename="tuning.log",
        use_transfer_learning=True,
    ):
        # create tmp log file
        tmp_log_file = log_filename + ".tmp"
        if os.path.exists(tmp_log_file):
            os.remove(tmp_log_file)

        for i, tsk in enumerate(reversed(tasks)):
            prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

            # create tuner
            if tuner == "xgb" or tuner == "xgb-rank":
                tuner_obj = XGBTuner(tsk, loss_type="rank")
            elif tuner == "ga":
                tuner_obj = GATuner(tsk, pop_size=100)
            elif tuner == "random":
                tuner_obj = RandomTuner(tsk)
            elif tuner == "gridsearch":
                tuner_obj = GridSearchTuner(tsk)
            else:
                raise ValueError("Invalid tuner: " + tuner)

            if use_transfer_learning:
                if os.path.isfile(tmp_log_file):
                    tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

            # do tuning
            tsk_trial = min(n_trial, len(tsk.config_space))
            tuner_obj.tune(
                n_trial=tsk_trial,
                early_stopping=early_stopping,
                measure_option=measure_option,
                callbacks=[
                    autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                    autotvm.callback.log_to_file(tmp_log_file),
                ],
            )

        # pick best records to a cache file
        autotvm.record.pick_best(tmp_log_file, log_filename)
        os.remove(tmp_log_file)
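
    The last step of tune_tasks keeps only the best record for each task. The sketch below shows that idea on plain tuples. The record layout (workload, config, cost) and all values are invented for illustration; they do not reflect the real autotvm log format, and this pick_best is a local stand-in, not autotvm.record.pick_best.

```python
def pick_best(records):
    """Keep the lowest-cost (config, cost) pair for each workload."""
    best = {}
    for workload, config, cost in records:
        if workload not in best or cost < best[workload][1]:
            best[workload] = (config, cost)
    return best


# Invented measurements: (workload, config, cost in seconds).
records = [
    ("conv_a", {"tile": 8}, 0.0031),
    ("conv_a", {"tile": 16}, 0.0019),
    ("conv_b", {"tile": 4}, 0.0102),
    ("conv_a", {"tile": 32}, 0.0024),
]

if __name__ == "__main__":
    for workload, (config, cost) in pick_best(records).items():
        print(workload, config, cost)
```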

    Finally, we launch the tuning jobs and evaluate the end-to-end performance.

    def tune_and_evaluate(tuning_opt):
        # extract workloads from relay program
        print("Extract tasks...")
        mod, params, input_shape, out_shape = get_network(network, batch_size=1)
        tasks = autotvm.task.extract_from_program(
            mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
        )

        # run tuning tasks
        print("Tuning...")
        tune_tasks(tasks, **tuning_opt)

        # compile kernels with history best records
        with autotvm.apply_history_best(log_file):
            print("Compile...")
            with tvm.transform.PassContext(opt_level=3):
                lib = relay.build_module.build(mod, target=target, params=params)

            # load parameters
            ctx = tvm.context(str(target), 0)
            module = runtime.GraphModule(lib["default"](ctx))
            data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
            module.set_input("data", data_tvm)

            # evaluate
            print("Evaluate inference time cost...")
            ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=600)
            prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
            print(
                "Mean inference time (std dev): %.2f ms (%.2f ms)"
                % (np.mean(prof_res), np.std(prof_res))
            )


    # We do not run the tuning in our webpage server since it takes too long.
    # Uncomment the following line to run it by yourself.

    # tune_and_evaluate(tuning_option)

    Sample Output

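    Tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. One sample output is listed below. It took about 4 hours to get this output on a 32-thread AMD Ryzen Threadripper. The tuning target is an NVIDIA 1080 Ti. (You may see some errors during compilation. If the tuning is not stuck, it is okay.)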

    Extract tasks...
    Tuning...
    [Task  1/12]  Current/Best:  541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.
    [Task  2/12]  Current/Best:    0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.
    [Task  3/12]  Current/Best:  103.69/1141.25 GFLOPS | Progress: (768/2000) | 702.13 s Done.
    [Task  4/12]  Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 sterminate called without an active exception
    [Task  4/12]  Current/Best: 2789.36/3925.15 GFLOPS | Progress: (1056/2000) | 929.40 s Done.
    [Task  5/12]  Current/Best:   89.06/1076.24 GFLOPS | Progress: (704/2000) | 601.73 s Done.
    [Task  6/12]  Current/Best:   40.39/2129.02 GFLOPS | Progress: (1088/2000) | 1125.76 s Done.
    [Task  7/12]  Current/Best: 4090.53/5007.02 GFLOPS | Progress: (800/2000) | 903.90 s Done.
    [Task  8/12]  Current/Best:    4.78/1272.28 GFLOPS | Progress: (768/2000) | 749.14 s Done.
    [Task  9/12]  Current/Best: 1391.45/2325.08 GFLOPS | Progress: (992/2000) | 1084.87 s Done.
    [Task 10/12]  Current/Best: 1995.44/2383.59 GFLOPS | Progress: (864/2000) | 862.60 s Done.
    [Task 11/12]  Current/Best: 4093.94/4899.80 GFLOPS | Progress: (224/2000) | 240.92 sterminate called without an active exception
    [Task 11/12]  Current/Best: 3487.98/4909.91 GFLOPS | Progress: (480/2000) | 534.96 sterminate called without an active exception
    [Task 11/12]  Current/Best: 4636.84/4912.17 GFLOPS | Progress: (1184/2000) | 1381.16 sterminate called without an active exception
    [Task 11/12]  Current/Best:   50.12/4912.17 GFLOPS | Progress: (1344/2000) | 1602.81 s Done.
    [Task 12/12]  Current/Best: 3581.31/4286.30 GFLOPS | Progress: (736/2000) | 943.52 s Done.
    Compile...
    Evaluate inference time cost...
    Mean inference time (std dev): 1.07 ms (0.05 ms)
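
    A log like this can be summarized with a few lines of standard-library Python. The sketch below assumes the progress-line layout shown in the sample output, which is not a documented format and may differ between TVM versions; the names PROGRESS and summarize are ours. Only lines ending in "Done." are counted, so interrupted progress lines for the same task are ignored.

```python
import re

# Matches a finished progress line from the sample output above.
PROGRESS = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\]\s+Current/Best:\s*[\d.]+/\s*([\d.]+) GFLOPS"
    r" \| Progress: \((\d+)/(\d+)\) \| ([\d.]+) s Done\."
)


def summarize(lines):
    """Return (total trials, total seconds, best GFLOPS per task) for finished tasks."""
    trials, seconds, best = 0, 0.0, {}
    for line in lines:
        m = PROGRESS.search(line)
        if m is None:
            continue  # skip unfinished progress lines and other output
        task = int(m.group(1))
        best[task] = float(m.group(3))
        trials += int(m.group(4))
        seconds += float(m.group(6))
    return trials, seconds, best


# Three lines copied from the sample output; the last one is unfinished.
sample = [
    "[Task  1/12]  Current/Best:  541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.",
    "[Task  2/12]  Current/Best:    0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.",
    "[Task  4/12]  Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 sterminate called without an active exception",
]

if __name__ == "__main__":
    trials, seconds, best = summarize(sample)
    print("trials:", trials, "seconds: %.2f" % seconds, "best:", best)
```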

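    As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30 ms, so the tuned TVM build is a little faster.

    Note

    Experiencing difficulties?

    The auto-tuning module is error-prone. If you always see "0.00/0.00 GFLOPS", then something must be wrong.

    First, make sure you set the correct device configuration. Then, you can print debug information by adding these lines at the beginning of the script. They print every measurement result, where you can find useful error messages.

    import logging

    logging.getLogger('autotvm').setLevel(logging.DEBUG)

    Finally, always feel free to ask the community for help at https://discuss.tvm.apache.org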

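    Scale up measurement by using multiple devices

    If you have multiple devices, you can use all of them for measurement. TVM uses the RPC Tracker to manage distributed devices. The RPC Tracker is a centralized controller node. We can register all devices to the tracker. For example, if we have 10 GPU cards, we can register all of them to the tracker and run 10 measurements in parallel, which accelerates the tuning process.

    To start an RPC tracker, run this command on the host machine. The tracker is required during the whole tuning process, so we need to open a new terminal for this command: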

    python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190

    The expected output is

    INFO:RPCTracker:bind to 0.0.0.0:9190

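    Then open another new terminal for the RPC server. We need to start one dedicated server for each device. We use a string key to distinguish the types of devices. You can pick a name you like. (Note: for the rocm backend, there are some internal errors with the compiler, and we need to add --no-fork to the argument list.)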

    python -m tvm.exec.rpc_server --tracker=0.0.0.0:9190 --key=1080ti

    After registering devices, we can confirm it by querying rpc_tracker

    python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190

    For example, if we have four 1080ti, two titanx and one gfx900, the output can be

    Queue Status
    ----------------------------------
    key          total  free  pending
    ----------------------------------
    1080ti       4      4     0
    titanx       2      2     0
    gfx900       1      1     0
    ----------------------------------
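
    If you want to check device availability from a script, the table can be parsed with plain string handling. The sketch below assumes the output keeps the column layout shown above, which is not a documented contract; the function name parse_queue_status is ours, and the table text is copied from the example.

```python
def parse_queue_status(text):
    """Parse the 'Queue Status' table into {key: {"total", "free", "pending"}}."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly four fields, the last three being integers.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            key, total, free, pending = parts
            status[key] = {
                "total": int(total),
                "free": int(free),
                "pending": int(pending),
            }
    return status


table = """\
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
1080ti       4      4     0
titanx       2      2     0
gfx900       1      1     0
----------------------------------
"""

if __name__ == "__main__":
    status = parse_queue_status(table)
    print(status)
    print("free devices:", sum(row["free"] for row in status.values()))
```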

    Finally, we need to change the tuning option to use RPCRunner. Use the code below to replace the corresponding part above.

    tuning_option = {
        "log_filename": log_file,
        "tuner": "xgb",
        "n_trial": 2000,
        "early_stopping": 600,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.RPCRunner(
                "1080ti",  # change the device key to your key
                "0.0.0.0",
                9190,
                number=20,
                repeat=3,
                timeout=4,
                min_repeat_ms=150,
            ),
        ),
    }

    https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_cuda.html


  • Original post: https://www.cnblogs.com/wujianming-110117/p/14131379.html