    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
               Microsoft Research

    {kahe, v-xiangz, v-shren, jiansun}@microsoft.com

    Abstract

    Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.


    The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.


    1. Introduction

    Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.


    Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

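    To make the two remedies concrete, below is a minimal sketch of a plain stacked block combining a normalized (He-style) initialization with an intermediate normalization (batch normalization) layer. The framework (PyTorch), channel count, depth, and optimizer settings are illustrative assumptions, not a transcription of the cited works.

```python
# Hedged sketch: "normalized initialization" plus an intermediate normalization layer,
# the two fixes the text credits with letting plain nets of tens of layers start converging.
# Framework (PyTorch), channel count, depth, and optimizer settings are illustrative.
import torch
import torch.nn as nn

def plain_block(channels: int) -> nn.Sequential:
    conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
    nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")  # normalized (He-style) initialization
    return nn.Sequential(conv, nn.BatchNorm2d(channels), nn.ReLU(inplace=True))  # BN = intermediate normalization

net = nn.Sequential(*[plain_block(64) for _ in range(20)])     # a few tens of stacked layers
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)  # trained end-to-end by SGD with backprop
```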

    When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.


    The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

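    The constructed solution can be checked mechanically: copy the learned shallower model, append layers set to the identity, and the deeper network computes exactly the same function, hence no higher training error. A hedged PyTorch sketch follows; the module sizes are made up and the "shallower model" is a random stand-in rather than a trained network.

```python
# Sketch of the "solution by construction": shallower model + identity layers = deeper model
# with identical outputs. The shallower model here is a random stand-in, not a trained net.
import torch
import torch.nn as nn

shallow = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

added = nn.Linear(32, 32)
with torch.no_grad():
    added.weight.copy_(torch.eye(32))  # the added layer is an identity mapping
    added.bias.zero_()

deeper = nn.Sequential(*shallow, added)  # copied layers followed by the identity layer

x = torch.randn(4, 32)
assert torch.allclose(shallow(x), deeper(x))  # same outputs, so training error cannot be higher
```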

    In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

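    Written out, the reformulation above is the following; the building-block form with explicit layer weights W_i is the one the paper introduces later for its residual units.

```latex
% Residual reformulation restated from the paragraph above.
% H(x): desired underlying mapping; F(x): residual fitted by the stacked nonlinear layers.
\[
  \mathcal{F}(x) := \mathcal{H}(x) - x
  \quad\Longleftrightarrow\quad
  \mathcal{H}(x) = \mathcal{F}(x) + x
\]
% Building-block form with the layer weights made explicit (used later in the paper):
\[
  y = \mathcal{F}(x, \{W_i\}) + x
\]
```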

    The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
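
    As an illustration of such a block, here is a hedged PyTorch sketch of a two-layer residual unit with a parameter-free identity shortcut (the paper's own experiments used Caffe; the channel sizes and the conv-BN-ReLU ordering here are illustrative choices, not a transcription of the paper's networks):

```python
# Minimal residual building block: two stacked layers learn F(x), and the identity
# shortcut adds x back, so the block outputs F(x) + x.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first stacked layer
        out = self.bn2(self.conv2(out))         # second stacked layer: together they model F(x)
        return F.relu(out + x)                  # identity shortcut: no extra parameters, negligible extra computation

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))           # output shape equals input shape: (1, 64, 56, 56)
```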

    We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

    Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

    On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

    2. Related Work

    Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

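    A toy sketch of the residual-encoding idea behind VLAD and residual vector quantization follows; the codebook and descriptor below are random stand-ins, and this is a simplified single-descriptor version rather than the cited formulations.

```python
# Encode the residual of a descriptor w.r.t. its nearest codeword instead of the
# descriptor itself; VLAD-style representations accumulate these residuals per codeword.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 128))                  # the "dictionary" of 16 codewords
x = rng.standard_normal(128)                               # one local descriptor

k = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))   # nearest codeword
residual = x - codebook[k]                                 # residual vector that gets encoded

vlad = np.zeros_like(codebook)
vlad[k] += residual                                        # accumulate residuals per codeword
vlad = vlad.ravel() / (np.linalg.norm(vlad) + 1e-12)       # L2-normalized VLAD-style descriptor
```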

    In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.
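
    The residual idea in these solvers can be illustrated with a toy two-grid cycle for a 1-D Poisson system, where the coarse level solves only for the residual correction. This is a simplified sketch, not the methods of [3, 45, 46]; the grid sizes and smoother settings are arbitrary.

```python
# Toy two-grid cycle: smooth on the fine grid, restrict the residual, solve the coarse
# problem for the correction, prolong it back, and correct. The coarse subproblem is
# "responsible for the residual" between the two scales.
import numpy as np

def laplacian(n, h):
    return (np.diag(2.0 * np.ones(n))
            - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def jacobi(A, u, f, sweeps=3, w=2.0 / 3.0):
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + w * (f - A @ u) / d                      # damped Jacobi: removes high-frequency error only
    return u

nc = 31                                                  # coarse interior points
n = 2 * nc + 1                                           # fine interior points
A, Ac = laplacian(n, 1.0 / (n + 1)), laplacian(nc, 1.0 / (nc + 1))
f, u = np.ones(n), np.zeros(n)

for cycle in range(10):
    u = jacobi(A, u, f)                                  # pre-smooth
    r = f - A @ u                                        # fine-grid residual
    rc = 0.25 * (r[0:-2:2] + 2 * r[1:-1:2] + r[2::2])    # restrict residual (full weighting)
    ec = np.linalg.solve(Ac, rc)                         # coarse solve FOR THE RESIDUAL
    e = np.zeros(n)
    e[1::2] = ec                                         # prolong correction by linear interpolation
    e[0::2] = 0.5 * (np.r_[0.0, ec] + np.r_[ec, 0.0])
    u = jacobi(A, u + e, f)                              # correct, then post-smooth
    print(cycle, np.linalg.norm(f - A @ u))              # residual norm shrinks each cycle
```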

    Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.

    Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).
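
    To make the contrast concrete, here is a hedged sketch of the two shortcut styles; the gating form follows the highway-network formulation only schematically, and the module shapes are illustrative.

```python
# Gated ("highway") shortcut vs. parameter-free identity shortcut. When the gate t -> 0
# the highway layer suppresses its transform branch, whereas the residual layer always
# passes x through and always learns a residual on top of it.
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.gate = nn.Linear(dim, dim)          # data-dependent gate with its own parameters

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * self.transform(x) + (1.0 - t) * x

class ResidualLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        return self.residual(x) + x              # identity shortcut: never closed, no gate parameters
```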
