  • The PyTorch profiler and layer-by-layer analysis of PyTorch models




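    torch.autograd.profiler.profile is a context manager that manages autograd profiler state and holds a summary of the results. Under the hood, it records only the events of functions executed in C++ and exposes those events to Python. You can wrap any code in it, and it will report only the runtime of PyTorch functions.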


    enabled (bool, optional) – Setting this to False makes this context manager a no-op. Default: True.

    use_cuda (bool, optional) – Enables timing of CUDA events using the cudaEvent API. Adds approximately 4us of overhead to each tensor operation. Default: False.

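    record_shapes (bool, optional) – If shape recording is enabled, information about input dimensions is collected. This lets you see which dimensions were used under the hood and group events by them with prof.key_averages(group_by_input_shape=True). Note that shape recording can skew the profiling data. For the bottom-most events (in the case of nested function calls) the effect is most likely negligible, but for higher-level functions the total self CPU time may be artificially inflated by the shape collection.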
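A minimal sketch of shape recording, assuming a CPU-only run. The tensor sizes and the matrix multiply are illustrative choices, not taken from the original post:

```python
import torch

# Illustrative inputs: an 8x16 matrix multiplied by a 16x4 matrix.
x = torch.randn(8, 16)
w = torch.randn(16, 4)

# record_shapes=True asks the profiler to store the input dimensions of each op.
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(3):
        y = x.mm(w)

# Group the recorded events by operator name and input shapes.
averages = prof.key_averages(group_by_input_shape=True)
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))
```

With grouping enabled, calls to the same operator with different input shapes show up as separate rows, which makes it easier to spot which shapes dominate the runtime.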


    import torch

    x = torch.randn((1, 1), requires_grad=True)
    with torch.autograd.profiler.profile() as prof:
        for _ in range(100):  # any normal python code, really!
            y = x ** 2
            y.backward()
    # NOTE: some columns were removed for brevity
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    pow                                         64.76%           3.096ms          64.76%           3.096ms          3.096ms          1                []                                   
    struct torch::autograd::GraphRoot           0.37%            17.700us         0.37%            17.700us         17.700us         1                []                                   
    PowBackward0                                23.10%           1.104ms          23.10%           1.104ms          1.104ms          1                []                                   
    pow                                         1.37%            65.700us         1.37%            65.700us         65.700us         1                []                                   
    mul                                         10.11%           483.100us        10.11%           483.100us        483.100us        1                []                                   
    mul                                         0.13%            6.200us          0.13%            6.200us          6.200us          1                []                                   
    struct torch::autograd::AccumulateGrad      0.14%            6.500us          0.14%            6.500us          6.500us          1                []                                   
    detach                                      0.03%            1.500us          0.03%            1.500us          1.500us          1                []                                   
    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    Self CPU time total: 4.780ms
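The percentage columns can be reproduced from the time columns: each row's "Self CPU total %" is its "Self CPU total" divided by the "Self CPU time total" footer. The sketch below redoes that arithmetic for the PowBackward0 row; `to_us` and `self_percent` are hypothetical helpers written for this illustration, not part of PyTorch.

```python
def to_us(text):
    """Convert a profiler time string such as '1.104ms' to microseconds."""
    # Check the two-letter suffixes before the bare 's'.
    for suffix, factor in (("us", 1.0), ("ms", 1e3), ("s", 1e6)):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised time string: {text!r}")


def self_percent(self_time, total_time):
    """Share of the total self CPU time taken by one row, in percent."""
    return 100.0 * to_us(self_time) / to_us(total_time)


# PowBackward0 row: 1.104ms out of a 4.780ms total.
print(f"{self_percent('1.104ms', '4.780ms'):.2f}%")
```

Because the table prints rounded times, a value recomputed this way can differ from the printed percentage in the last digit.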


    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    pow                                         29.13%           3.246ms          29.13%           3.246ms          3.246ms          31.62%           2.866ms          2.866ms          1                []                                   
    struct torch::autograd::GraphRoot           0.09%            9.600us          0.09%            9.600us          9.600us          0.02%            2.048us          2.048us          1                []                                   
    PowBackward0                                34.12%           3.803ms          34.12%           3.803ms          3.803ms          32.89%           2.982ms          2.982ms          1                []                                   
    pow                                         8.53%            950.500us        8.53%            950.500us        950.500us        2.63%            238.592us        238.592us        1                []                                   
    mul                                         16.06%           1.789ms          16.06%           1.789ms          1.789ms          19.44%           1.762ms          1.762ms          1                []                                   
    mul                                         8.94%            996.700us        8.94%            996.700us        996.700us        10.73%           972.864us        972.864us        1                []                                   
    struct torch::autograd::CopyBackwards       1.47%            163.900us        1.47%            163.900us        163.900us        1.31%            118.688us        118.688us        1                []                                   
    to                                          1.40%            155.900us        1.40%            155.900us        155.900us        1.27%            114.944us        114.944us        1                []                                   
    empty_strided                               0.09%            10.300us         0.09%            10.300us         10.300us         0.01%            1.023us          1.023us          1                []                                   
    struct torch::autograd::AccumulateGrad      0.13%            15.000us         0.13%            15.000us         15.000us         0.06%            5.281us          5.281us          1                []                                   
    detach                                      0.04%            4.700us          0.04%            4.700us          4.700us          0.02%            1.760us          1.760us          1                []                                   
    ------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
    Self CPU time total: 11.144ms
    CUDA time total: 9.066ms


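    torch.autograd.profiler.record_function is a context manager/function decorator that adds a label to a block of Python code (or a function) while the autograd profiler is running. It is useful when tracing the code profile.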

    >>> x = torch.randn((1, 1), requires_grad=True)
    >>> with torch.autograd.profiler.profile() as prof:
    ...     y = x ** 2
    ...     with torch.autograd.profiler.record_function("label-z"): # label the block
    ...         z = y ** 3
    ...     y.backward()
    >>> # NOTE: some columns were removed for brevity
    >>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
    -----------------------------------  ---------------  ---------------  ---------------
    Name                                 Self CPU total %  CPU time avg     Number of Calls
    -----------------------------------  ---------------  ---------------  ---------------
    pow                                  60.77%           47.470us         3
    mul                                  21.73%           25.465us         2
    PowBackward0                         12.03%           121.891us        1
    torch::autograd::AccumulateGrad      2.70%            6.324us          1
    label-z                              2.13%            12.421us         1
    torch::autograd::GraphRoot           0.64%            1.503us          1
    -----------------------------------  ---------------  ---------------  ---------------
    Self CPU time total: 234.344us
    CUDA time total: 0.000us
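The example above uses the context-manager form. A minimal sketch of the decorator form follows; the function `cube` and the label "cube-step" are made up for this illustration.

```python
import torch
from torch.autograd import profiler


@profiler.record_function("cube-step")  # every call to cube() is recorded under this label
def cube(t):
    return t ** 3


x = torch.randn((1, 1), requires_grad=True)
with profiler.profile() as prof:
    y = x ** 2
    z = cube(y)
    z.backward()

# Collect the row names of the averaged profile; the label is expected among them.
labels = [event.key for event in prof.key_averages()]
print("cube-step" in labels)
```

The decorator form is convenient when the same function is called from many places and should always appear under a single label.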




    nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

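    Unfortunately, there is no way to force nvprof to flush the data it has collected to disk, so for CUDA profiling you have to use this context manager (torch.autograd.profiler.emit_nvtx) to annotate the nvprof traces and wait for the process to exit before inspecting them. You can then visualize the timeline with the NVIDIA Visual Profiler (nvvp), or load the results with torch.autograd.profiler.load_nvprof() for inspection, e.g. in a Python REPL.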

    >>> with torch.cuda.profiler.profile():
    ...     model(x) # Warmup CUDA memory allocator and profiler
    ...     with torch.autograd.profiler.emit_nvtx():
    ...         model(x)





    pip install torchprof
    import torch
    import torchvision
    import torchprof

    model = torchvision.models.alexnet(pretrained=False).cuda()
    x = torch.rand([1, 3, 224, 224]).cuda()

    with torchprof.Profile(model, use_cuda=True) as prof:
        model(x)

    print(prof.display(show_events=False)) # equivalent to `print(prof)` and `print(prof.display())`
    Module         | Self CPU total | CPU total | CUDA total | Occurrences
    AlexNet        |                |           |            |
    ├── features   |                |           |            |
    │├── 0         |        1.671ms |   6.589ms |    6.701ms |           1
    │├── 1         |       62.430us |  62.430us |   63.264us |           1
    │├── 2         |       62.909us | 109.948us |  112.640us |           1
    │├── 3         |      225.389us | 858.376us |    1.814ms |           1
    │├── 4         |       18.999us |  18.999us |   19.456us |           1
    │├── 5         |       29.560us |  52.720us |   54.272us |           1
    │├── 6         |      136.959us | 511.216us |  707.360us |           1
    │├── 7         |       18.480us |  18.480us |   18.624us |           1
    │├── 8         |       84.380us | 300.700us |  590.688us |           1
    │├── 9         |       18.249us |  18.249us |   17.632us |           1
    │├── 10        |       81.289us | 289.946us |  470.016us |           1
    │├── 11        |       17.850us |  17.850us |   18.432us |           1
    │└── 12        |       29.350us |  52.260us |   52.288us |           1
    ├── avgpool    |       41.840us |  70.840us |   76.832us |           1
    └── classifier |                |           |            |
     ├── 0         |       66.400us | 122.110us |  125.920us |           1
     ├── 1         |      293.658us | 293.658us |  664.704us |           1
     ├── 2         |       17.600us |  17.600us |   18.432us |           1
     ├── 3         |       27.920us |  49.030us |   51.168us |           1
     ├── 4         |       40.590us |  40.590us |  208.672us |           1
     ├── 5         |       17.570us |  17.570us |   18.432us |           1
     └── 6         |       40.489us |  40.489us |   81.920us |           1


    Module                        | Self CPU total | CPU total | CUDA total | Occurrences
    AlexNet                       |                |           |            |
    ├── features                  |                |           |            |
    │├── 0                        |                |           |            |
    ││├── conv2d                  |       13.370us |   1.671ms |    1.698ms |           1
    ││├── convolution             |       12.730us |   1.658ms |    1.685ms |           1
    ││├── _convolution            |       30.660us |   1.645ms |    1.673ms |           1
    ││├── contiguous              |        6.970us |   6.970us |    7.136us |           1
    ││└── cudnn_convolution       |        1.608ms |   1.608ms |    1.638ms |           1
    │├── 1                        |                |           |            |
    ││└── relu_                   |       62.430us |  62.430us |   63.264us |           1
    │├── 2                        |                |           |            |
    ││├── max_pool2d              |       15.870us |  62.909us |   63.488us |           1
    ││└── max_pool2d_with_indices |       47.039us |  47.039us |   49.152us |           1


    trace, event_lists_dict = prof.raw()
    print(trace[2])
    # Trace(path=('AlexNet', 'features', '0'), leaf=True, module=Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)))

    print(event_lists_dict[trace[2].path][0])
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
    Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
    conv2d                 0.80%            13.370us         100.00%          1.671ms          1.671ms          25.34%           1.698ms          1.698ms          1                []
    convolution            0.76%            12.730us         99.20%           1.658ms          1.658ms          25.15%           1.685ms          1.685ms          1                []
    _convolution           1.83%            30.660us         98.44%           1.645ms          1.645ms          24.97%           1.673ms          1.673ms          1                []
    contiguous             0.42%            6.970us          0.42%            6.970us          6.970us          0.11%            7.136us          7.136us          1                []
    cudnn_convolution      96.19%           1.608ms          96.19%           1.608ms          1.608ms          24.44%           1.638ms          1.638ms          1                []
    ---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
    Self CPU time total: 1.671ms
    CUDA time total: 6.701ms


    model = torchvision.models.alexnet(pretrained=False)
    x = torch.rand([1, 3, 224, 224])

    # Layer does not have to be a leaf layer
    paths = [("AlexNet", "features", "3"), ("AlexNet", "classifier")]

    with torchprof.Profile(model, paths=paths) as prof:
        model(x)

    print(prof)
    Module         | Self CPU total | CPU total | CUDA total | Occurrences
    AlexNet        |                |           |            |
    ├── features   |                |           |            |
    │├── 0         |                |           |            |
    │├── 1         |                |           |            |
    │├── 2         |                |           |            |
    │├── 3         |        3.189ms |  12.717ms |    0.000us |           1
    │├── 4         |                |           |            |
    │├── 5         |                |           |            |
    │├── 6         |                |           |            |
    │├── 7         |                |           |            |
    │├── 8         |                |           |            |
    │├── 9         |                |           |            |
    │├── 10        |                |           |            |
    │├── 11        |                |           |            |
    │└── 12        |                |           |            |
    ├── avgpool    |                |           |            |
    └── classifier |       13.403ms |  14.011ms |    0.000us |           1
     ├── 0         |                |           |            |
     ├── 1         |                |           |            |
     ├── 2         |                |           |            |
     ├── 3         |                |           |            |
     ├── 4         |                |           |            |
     ├── 5         |                |           |            |
     └── 6         |                |           |            |
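To give a feel for how per-layer timings like these can be collected, here is a rough sketch that uses only the autograd profiler: it wraps each leaf module's forward in a record_function block. This is an illustration under stated assumptions, not torchprof's actual implementation; the helper `label_forward`, the "layer:" prefix and the small Sequential model are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler


def label_forward(module, label):
    """Wrap module.forward so that each call is recorded under `label`."""
    original_forward = module.forward

    def labelled_forward(*args, **kwargs):
        with profiler.record_function(label):
            return original_forward(*args, **kwargs)

    module.forward = labelled_forward


# A small stand-in model; the submodule names are "0", "1" and "2".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
for name, module in model.named_modules():
    is_leaf = next(module.children(), None) is None
    if name and is_leaf:  # skip the root container
        label_forward(module, f"layer:{name}")

with profiler.profile() as prof:
    model(torch.randn(4, 8))

# Keep only the rows created by the labels above.
layer_rows = [e for e in prof.key_averages() if e.key.startswith("layer:")]
for row in sorted(layer_rows, key=lambda e: e.key):
    print(f"{row.key:10s} calls={row.count} cpu_total={row.cpu_time_total:.1f}us")
```

Each labelled row covers all operators that ran inside that layer's forward call, which is the same kind of roll-up the per-module tables above present.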




