TVM Quantization Roadmap

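INT8 Quantization Scheme

This article gives an overview of the principles behind quantization and proposes how to implement quantization in TVM.

- Background on quantization

- INT8 quantization: backend code generation

- This thread only ...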

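Quantization Development

Search-based automatic quantization

A new quantization framework is proposed that brings hardware and training methods together.

Borrowing ideas from several existing quantization frameworks, it adopts a three-stage design: annotation, calibration, and realization.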

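- Annotation:

The annotation pass rewrites the graph according to each operator's rewrite function and inserts simulated quantize operations.

A simulated quantize operation models the rounding error and saturation error introduced when quantizing from floating point to integer.

- Calibration:

The calibration pass adjusts the thresholds of the simulated quantize operations to reduce the accuracy loss.

- Realization:

The realization pass converts the simulated graph, which still computes in float32, into a real low-precision integer graph.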
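
The three stages can be illustrated with a minimal plain-Python sketch. It is not TVM code: the function names `simulated_quantize`, `calibrate`, and `realize`, the symmetric int8 range, the candidate threshold list, and the squared-error criterion are all assumptions chosen for illustration.

```python
# Illustrative stand-ins for the three stages; none of these are TVM APIs.

def simulated_quantize(x, threshold, bits=8):
    """Annotation stage: quantize then dequantize, staying in float."""
    qmax = 2 ** (bits - 1) - 1             # 127 for int8 (symmetric range assumed)
    scale = threshold / qmax
    q = round(x / scale)                   # source of rounding error
    q = max(-qmax, min(qmax, q))           # source of saturation error
    return q * scale

def calibrate(samples, candidates):
    """Calibration stage: pick the threshold with the lowest squared error."""
    def error(t):
        return sum((x - simulated_quantize(x, t)) ** 2 for x in samples)
    return min(candidates, key=error)

def realize(x, threshold, bits=8):
    """Realization stage: emit the actual integer value."""
    qmax = 2 ** (bits - 1) - 1
    scale = threshold / qmax
    return max(-qmax, min(qmax, round(x / scale)))

# Hypothetical activation samples; 6.0 is a deliberate outlier.
samples = [-0.9, -0.3, 0.0, 0.2, 0.5, 0.8, 6.0]
threshold = calibrate(samples, candidates=[1.0, 2.0, 4.0, 8.0])
quantized = [realize(x, threshold) for x in samples]
print(threshold, quantized)
```

A small threshold gives finer resolution but clips the outlier, while a large one avoids clipping at the cost of coarser steps. Calibration is the search for a reasonable point between the two.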

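Quantization Frameworks Supported by TVM

TF QUANTIZATION RELATED

TVM supports all pre-quantized TFLite hosted models.

- Performance was evaluated on a C5.12xlarge Cascade Lake machine with Intel VNNI support.

- The models have not yet been auto-tuned.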

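PYTORCH QUANTIZATION RELATED

How to convert a model to a quantized model through Relay?

How to set the qconfig with torch.quantization.get_default_qconfig('fbgemm')

Quantized model accuracy benchmark: PyTorch vs TVM

How to convert a quantized PyTorch model to a TVM model

Compares the accuracy and speed of resnet18, resnet50, mobilenet-v2, mobilenet-v3, inception_v3, and googlenet.

Includes static quantization with eager mode in PyTorch: PyTorch's quantization tutorial.

- GAP quantization

- GAP8 export and quantization module for PyTorch

- Includes quantization files for squeezenet-v1.1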

MXNET RELATED

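Model Quantization for Production-Level Neural Network Inference

- The CPU performance figures below come from an AWS EC2 C5.24xlarge instance with custom 2nd-generation Intel Xeon Scalable Processors (Cascade Lake).

- Model quantization delivers a stable speedup across all models, for example 3.66x for ResNet 50 v1, 3.82x for ResNet 101 v1, and 3.77x for SSD-VGG16, which is very close to the theoretical 4x speedup of INT8.

- The accuracy of the Apache MXNet quantization solution is very close to that of the FP32 models, with no retraining required. In Figure 8, MXNet keeps the accuracy drop small, below 0.5%.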
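
As a quick check on the "theoretical 4x" figure, the short sketch below assumes the ceiling comes purely from int8 values being a quarter of the width of fp32 values, so four times as many fit in the same vector register or cache line. It then expresses each reported speedup as a fraction of that ceiling. The variable names are illustrative only.

```python
# Assumption: the 4x ceiling is the ratio of element widths (fp32 vs int8).
FP32_BITS, INT8_BITS = 32, 8
theoretical = FP32_BITS / INT8_BITS        # 4.0

# Speedups as quoted in the list above.
reported = {"ResNet 50 v1": 3.66, "ResNet 101 v1": 3.82, "SSD-VGG16": 3.77}

efficiency = {name: s / theoretical for name, s in reported.items()}
for name, e in efficiency.items():
    print(f"{name}: {e:.1%} of the theoretical {theoretical:.0f}x")
```

All three models land above 90% of the ceiling, which is what "very close to the theoretical 4x" amounts to.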

TENSOR CORE RELATED
- [RFC][Tensor Core] Optimization of CNNs on Tensor Core
- [Perf] Enhance cudnn and cublas backend and enable TensorCore
RELATED COMMIT
- [OPT] Low-bit Quantization #2116
- [RFC][Quantization] Support quantized models from TensorflowLite #2351
- [TFLite] Support TFLite FP32 Relay frontend. #2365
- [Strategy] Support for Int8 schedules - CUDA/x86 #5031
- [Torch, QNN] Add support for quantized models via QNN #4977
- [QNN][Legalize] Specialize for Platforms w/o fast Int8 support #4307
  
  QNN - Conv2D/Dense Legalize for platforms with no fast Int8 units
- The inference time is longer after int8 quantization
  
  TVM-relay.quantize vs quantization of other Framework
  
  TVM FP32, TVM int8, TVM int8 quantization + AutoTVM, MXNet
SPEED UP

COMPARISON

AUTOMATIC INTEGER QUANTIZATION

Quantization int8 slower than int16 on skylake CPU
- The int8 is always slower than int16 before and after the auto-tuning
- Target: llvm -mcpu=skylake-avx512
- Problem is solved by creating the int8 task explicitly
  
  create the task topi_x86_conv2d_NCHWc_int8
  
  set output dtype to int32, input dtype=uint8, weight dtype=int8
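
The dtype combination in that fix (uint8 input, int8 weight, int32 output) matches the general shape of x86 int8 multiply-accumulate instructions, which take one unsigned and one signed byte operand and accumulate into wider lanes. The sketch below is a plain-Python emulation of that arithmetic, not TVM or intrinsic code, and `dot_u8_s8_acc_i32` is a hypothetical helper name. It shows why the output needs to be wider than 16 bits.

```python
def dot_u8_s8_acc_i32(data_u8, weight_s8, acc=0):
    """Accumulate a uint8 x int8 dot product into a 32-bit signed accumulator."""
    assert len(data_u8) == len(weight_s8)
    for d, w in zip(data_u8, weight_s8):
        assert 0 <= d <= 255, "input must fit in uint8"
        assert -128 <= w <= 127, "weight must fit in int8"
        acc += d * w
    assert -2**31 <= acc < 2**31, "accumulator overflowed int32"
    return acc

# Worst case for a group of four byte pairs: 4 * 255 * 127.
worst_case = dot_u8_s8_acc_i32([255] * 4, [127] * 4)
fits_int16 = -2**15 <= worst_case < 2**15
print(worst_case)   # 129540
print(fits_int16)   # False
```

Even four products can exceed the int16 range, so a real convolution that reduces over many more elements needs the int32 accumulator.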
TVM study notes: model quantization (int8) and its test data
- TVM FP32, TVM int8, TVM int8 quantization, MXNet, TF1.13
- Includes test code
8bit@Cuda: AutoTVM vs TensorRT vs MXNet
- In this post, we show how to use TVM to automatically optimize quantized deep learning models on CUDA.
ACCEPTING PRE-QUANTIZED INTEGER MODELS
- Is there any speed comparison of quantization on cpu
  
  Extensive discussion of speed comparisons among torch-fp32, torch-int8, tvm-fp32, tvm-int16, and tvm-int8
SPEED PROFILE TOOLS
- How to profile speed in each layer with RPC?
  
  the debug runtime will give you some profiling information from the embedded device, e.g.:
```
Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
```
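
One way to read such a profile is to rank the nodes by time and see where the budget goes. The sketch below copies the node names and Time(us) values from the output above into a plain Python list and recomputes the shares. It is post-processing of the printed numbers only and does not call the debug runtime.

```python
# (node name, time in microseconds) copied from the profile output above
rows = [
    ("1_NCHW1c", 56.52),
    ("_contrib_conv2d_nchwc0", 12436.11),
    ("relu0_NCHW8c", 4375.43),
    ("_contrib_conv2d_nchwc1", 213108.6),
    ("relu1_NCHW8c", 2265.57),
    ("_contrib_conv2d_nchwc2", 104623.15),
    ("relu2_NCHW2c", 2004.77),
    ("_contrib_conv2d_nchwc3", 25218.4),
    ("reshape1", 1554.25),
]

total_us = sum(t for _, t in rows)
ranked = sorted(rows, key=lambda r: r[1], reverse=True)

# Print the three most expensive nodes with their share of total time.
for name, t in ranked[:3]:
    print(f"{name:24s} {t:12.2f} us  {100 * t / total_us:6.2f}%")

# Fraction of total time spent in the convolution nodes.
conv_share = sum(t for n, t in rows if "conv2d" in n) / total_us
print(f"conv2d share: {conv_share:.1%}")
```

In this profile the four convolution nodes account for nearly all of the runtime, so they are the layers worth tuning or quantizing first.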
Reference:

https://www.freesion.com/article/3155559638/
Original article: https://www.cnblogs.com/wujianming-110117/p/15028580.html
