MCUNet
2020-NeurIPS-MCUNet: Tiny Deep Learning on IoT Devices
Source: ChenBong, 博客园 (cnblogs)
- Institute: MIT, NTU, MIT-IBM Watson AI Lab
- Author: Ji Lin, Song Han
- GitHub: /
- Citation: 1
Introduction
- Neural networks on MCUs (microcontroller units)
- Extremely limited memory (SRAM) and storage (Flash, read-only)
- No operating system
- Existing lightweight networks are designed mainly for mobile devices such as smartphones. An MCU costs orders of magnitude less than a smartphone (~$5 vs. ~$500) and has a far broader range of applications, but its performance is also orders of magnitude lower, so deploying neural networks on MCUs is a major challenge.
- We propose MCUNet, a framework designed specifically for MCUs that jointly designs the model (TinyNAS) and the inference library (TinyEngine), enabling ImageNet-scale inference on microcontrollers.
- First to reach 70.2% top-1 accuracy on ImageNet with an MCU
DL on MCUs
Existing frameworks:
- TF Lite Micro
- CMSIS-NN
- CMix-NN
- MicroTVM
Drawbacks:
- The network graph is interpreted at runtime, which consumes a large amount of SRAM and Flash
- Optimization is layer-level only, without using whole-network information to further reduce memory usage (e.g., even if a network never uses 5×5 convolutions, the library still keeps that kernel to remain general-purpose)
Efficient Neural Network Design
- Model Compression
- Pruning
- Quantization
- Tensor decomposition
- Efficient Network Design
- MobileNet, EfficientNet
- NAS (dominant)
Method
TinyNAS: Two-Stage NAS for Tiny Memory Constraints
- first optimizes the search space
- then performs neural architecture search within the optimized space
Optimize Search Space
R = {48, 64, 80, ..., 192, 208, 224}
W = {0.2, 0.3, 0.4, ..., 1.0}
This gives |S| = |R| × |W| = 12 × 9 = 108 possible search-space configurations.
Each configuration contains about 3.3 × 10^25 possible sub-networks.
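A quick sketch of this grid (a hypothetical enumeration, just to make the counting concrete):

```python
# Enumerate the |R| x |W| = 108 search-space configurations.
R = list(range(48, 225, 16))                     # 12 resolutions: 48, 64, ..., 224
W = [round(0.2 + 0.1 * i, 1) for i in range(9)]  # 9 width multipliers: 0.2, ..., 1.0
configs = [(r, w) for r in R for w in W]
assert len(configs) == 108
```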
Our goal is to find the best search space configuration S* that contains the model with the highest accuracy while satisfying the resource constraints.
How do we find S*?
- Perform NAS on each of the search spaces and compare the final results
- Search space ==> (under memory constraint) searching ==> compare best acc? (too expensive: one full NAS run per configuration)
- Evaluate the quality of a search space by randomly sampling m networks from it and comparing the accuracy distribution of the satisfying networks
- Search space ==> (under memory constraint) sample ==> training ==> compare acc? (cheaper, but still requires training)
- (As in RegNet: sampling 500 models per search space and taking the EDF of their accuracy after 10 training epochs is enough to characterize the quality of the search space)
- Our evaluation strategy: search space ==> (under memory constraint) sample ==> compare FLOPs (no training!)
Assumption: A model with larger computation has a larger capacity, which is more likely to achieve higher accuracy.
We only collect the CDF of FLOPs of the satisfying models:
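Below is a minimal sketch of this "sample, then compare FLOPs" evaluation. The sub-network sampler, the FLOPs proxy, and the SRAM check are hypothetical stand-ins, not the paper's actual cost model:

```python
import random

def sample_subnet(resolution, width, n_blocks=12):
    """Randomly pick per-block kernel size and expansion ratio
    (a simplified MobileNetV2-like space)."""
    blocks = [(random.choice([3, 5, 7]), random.choice([3, 4, 6]))
              for _ in range(n_blocks)]
    return {"resolution": resolution, "width": width, "blocks": blocks}

def flops_proxy(net):
    """Crude FLOPs stand-in: kernel area x expansion per block,
    scaled by width^2 and resolution^2 (illustrative only)."""
    r, w = net["resolution"], net["width"]
    return sum(k * k * e for k, e in net["blocks"]) * w ** 2 * r ** 2

def fits_sram(net, limit_bytes=320 * 1024):
    """Stand-in memory check: peak activation grows with resolution,
    width, and the largest per-block expansion ratio."""
    peak = (net["resolution"] ** 2 * net["width"]
            * max(e for _, e in net["blocks"]) * 2)
    return peak <= limit_bytes

def flops_cdf(resolution, width, m=1000):
    """Sorted FLOPs of sampled sub-networks that satisfy the memory
    constraint -- an empirical CDF, with no training involved."""
    nets = (sample_subnet(resolution, width) for _ in range(m))
    return sorted(flops_proxy(n) for n in nets if fits_sram(n))

def mean_flops(cfg):
    cdf = flops_cdf(*cfg)
    return sum(cdf) / len(cdf) if cdf else 0.0

# Pick the configuration whose satisfying models reach the largest FLOPs,
# following the assumption that more computation => higher accuracy.
R = range(48, 225, 16)
W = [round(0.2 + 0.1 * i, 1) for i in range(9)]
best = max(((r, w) for r in R for w in W), key=mean_flops)
print("best (resolution, width):", best)
```

The paper compares whole distributions (e.g., the probability that FLOPs exceed a threshold); the mean used here is just the simplest ranking statistic.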
TinyEngine: A Memory-Efficient Inference Library
compilation vs. interpretation (TinyEngine generates model-specific code at compile time instead of interpreting the graph at runtime)
memory scheduling
layer-wise vs. model-wise
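A toy sketch of why model-level planning helps; the activation sizes are invented, and real interpreters do reuse some memory, so this only illustrates the direction of the gain:

```python
# Activation size of each intermediate tensor, in bytes (made up).
acts = [50_000, 80_000, 120_000, 60_000, 30_000]

# Layer-level view: without a whole-model plan, budget each tensor its
# own buffer.
no_reuse = sum(acts)                               # 340,000 bytes

# Model-level view: only the current layer's input and output are live
# at once, so one arena sized to the true peak can be reused throughout.
peak = max(a + b for a, b in zip(acts, acts[1:]))  # 200,000 bytes

print(no_reuse, peak)
```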
kernel specialization
the inner loop unrolling is also specialized for different kernel sizes (e.g., 9 repeated code segments for a 3×3 kernel, 25 for a 5×5) to eliminate the branch instruction overhead
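A sketch of how such specialized code could be generated; the generator and its C-like output are hypothetical, not TinyEngine's actual emitter:

```python
def emit_unrolled_macs(k):
    """Emit k*k straight-line multiply-accumulate statements for a
    k x k kernel (9 for 3x3, 25 for 5x5): no inner loop, no branches."""
    return "\n".join(
        f"acc += inp[(y + {dy}) * in_w + (x + {dx})] * w[{dy * k + dx}];"
        for dy in range(k) for dx in range(k))

print(emit_unrolled_macs(3))  # 9 statements; emit_unrolled_macs(5) gives 25
```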
Operation fusion is performed for Conv+Padding+ReLU+BN layers.
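As one concrete instance of such fusion, BatchNorm can be folded into the preceding convolution ahead of time; this is a standard transform (the sketch below assumes an (out_ch, in_ch, kh, kw) weight layout):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta
    into new conv weights and bias, removing BN at inference time."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    w_folded = w * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```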
Experiments
Setup
- Datasets
- ImageNet
- Visual Wake Words (VWW)
- Speech Commands (V2), an audio wake-word dataset
- (did not use CIFAR)
- Deployment
- 320kB SRAM / 1MB Flash
- 512kB SRAM / 2MB Flash
Large-Scale Image Recognition
Co-design
Lower bit precision
Under the same memory constraints, 4-bit MCUNet outperforms 8-bit by 2.2%, since the lower precision lets a larger model fit in memory
Visual & Audio Wake Words
https://www.youtube.com/watch?v=YvioBgtec4U&feature=youtu.be
Analysis
Search space optimization
Sensitivity analysis on search space optimization
- x-axis: Flash (storage; holds the model weights), 512 kB to 2048 kB
- y-axis: SRAM (working memory; holds the feature maps during inference), 192 kB to 512 kB
1→2: with more SRAM, the input resolution increases; but since Flash is unchanged, the parameter count cannot grow, so the width stays the same
1→3: with more Flash, the parameter count can grow, so the width increases; yet the input resolution actually decreases (a wider model has more convolution filters, so every layer's feature maps have more channels; with SRAM unchanged, the feature-map resolution must shrink to compensate; see the sketch below)
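The trade-off can be made concrete with a toy constraint model (the proportionality constants are invented purely for illustration): Flash bounds the parameter count, which grows roughly with width², while SRAM bounds the peak activation, which grows roughly with resolution² × width.

```python
def fits(resolution, width, sram_kb, flash_kb, c_act=32, c_param=4_000_000):
    """Toy feasibility check: activations must fit in SRAM and
    parameters in Flash (constants are illustrative)."""
    act_bytes = resolution ** 2 * width * c_act    # peak feature map
    param_bytes = width ** 2 * c_param             # model weights
    return act_bytes <= sram_kb * 1024 and param_bytes <= flash_kb * 1024

print(fits(112, 0.5, sram_kb=256, flash_kb=1024))  # True
print(fits(176, 0.5, sram_kb=256, flash_kb=1024))  # False: SRAM-bound
print(fits(80, 0.8, sram_kb=256, flash_kb=1024))   # False: Flash-bound
```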