Fully Convolutional Networks for Semantic Segmentation (Translation)
Abstract
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
卷积网络在特征分层领域是非常强大的视觉模型。我们证明了经过端到端、像素到像素训练的卷积网络并且超过目前语义分割中最先进的技术。我们的核心观点是建立“全卷积”网络,输入任意尺寸,经过有效的推理和学习产生相应尺寸的输出。我们定义并指定全卷积网络的空间,解释它们在空间范围内dense prediction任务(dense prediction:预测每个像素所属的类别)和获取与以前模型联系。我们改编当前的分类网络(AlexNet,the VGG net , and GoogLeNet)到完全卷积网络和通过微调(fine-tune) 传递它们的学习表现到分割任务中。然后我们定义了一个跳跃式的架构(skip layers),结合来自深、粗层的语义信息(深层次的存储图片的全局信息,相对来说比较注重粗糙,即整体部分)和来自浅、细层的表征信息(浅层次的存储图片的局部信息,相对来说比较注重细节,即边缘部分)来产生准确和精细的分割。我们的完全卷积网络成为了在PASCAL VOC最出色的分割方式(在2012年相对62.2%的平均IU提高了20%),并且对NYUDv2和SIFT Flow数据集的一个典型图像推理只需要花费不到0.2秒的时间。 (PASCAL VOC、NYUDv2和SIFT Flow均为数据集)
1. Introduction
Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and key-point prediction [39, 24], and local correspondence [24, 9].
卷积网络在识别领域前进势头很猛。卷积网不仅在整个图片的分类上有所提高 ,也在结构化输出的局部任务上取得了进步。包括在目标检测边界框、部分和关键点预测和局部通信的进步。
The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.
在从粗糙到精细推理的进展中下一步自然是对每一个像素进行预测。早前的方法已经将卷积网络用于语义分割,其中每个像素被标记为其封闭对象或区域的类别,但是有个缺点就是这项工作的定位问题。
We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.
我们证明了经过端到端、像素到像素训练的的全卷积网络,它超过语义分割中在没有更多机械的情况下超过了最先进的水平 。我们认为,这是第一次训练端到端(1)的FCN在像素级别的预测,而且来自监督式预处理(2)。全卷积在现有的网络基础上从任意尺寸的输入到预测密集输出。学习和推理能在整个图片通过密集的前馈计算和反向传播一次执行。在神经网络中上采样层能在像素级别预测和通过下采样池化学习。
This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].
这种方法非常有效,无论是渐进地还是完全地,消除了在其他方法中的并发问题。Patchwise训练(可以理解传入神经网络的数据并非是整个图片,而是对图片感兴趣的局部,这样做的目的是避免完整图像训练的冗余 )是常见的,但是缺少了全卷积训练的有效性。我们的方法不是利用预处理或者后期处理解决并发问题,包括超像素,proposals(需要看下面的引用),或者对通过随机域事后细化或者局部分类。我们的模型通过重新解释分类网到全卷积网络和微调它们的学习表现将最近在分类上的成功移植到dense prediction。与此相反,先前的工作应用的是小规模、没有超像素预处理的卷积网。
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).
语义分割面临在语义和位置的内在张力问题:全局信息解决的“是什么”,而局部信息解决的是“在哪里”。深层特征通过非线性的局部到全局金字塔结构来编码了位置和语义信息。我们在4.2节(见图3)定义了一种利用集合了深、粗层的语义信息和浅、细层的表征信息的特征谱的跨层架构(即skip layers)。
In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.
在下一节,我们回顾深层分类网、FCNs和最近一些利用卷积网解决语义分割的相关工作。接下来的章节将解释FCN设计和密集预测(dense prediction)权衡,介绍我们的网内上采样和多层结合架构,描述我们的实验框架。最后,我们展示了最先进技术在PASCAL VOC 2011-2, NYUDv2, 和SIFT Flow上的实验结果。
2. Related work
Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.
我们的方法是基于最近深层网络在图像分类上的成功和迁移学习。迁移第一次被证明在各种视觉识别任务,然后是检测,不仅在实例还有融合proposal-classification模型的语义分割 。我们现在重新构建和微调直接的、dense prediction语义分割的分类网。在这个框架里我们绘制FCNs的空间并将过去的或是最近的先验模型置于其中。
Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.
全卷积网络 据我们所知,第一次将卷积网扩展到任意尺寸的输入的是Matan等人,它将经典的LeNet扩展到识别数字串 。因为他们的网络结构限制在一维的输入串,Matan等人利用译码器译码获得输出。Wolf和Platt [40] 将卷积网输出扩展到来检测邮政地址块的四角得分的二维图。这些先前工作做的是推理和用于检测的全卷积式学习。Ning等人定义了一种基于全卷积推理的卷积网络用于秀丽线虫组织的粗糙的、多分类分割。
Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.
全卷积计算也被用在现在的一些多层次的网络结构中。Sermanet等人的滑动窗口检测,Pinherio 和Collobert的语义分割,Eigen等人的图像修复都做了全卷积式推理。全卷积训练很少,但是被Tompson等人用来学习一种端到端的局部检测和姿态估计的空间模型非常有效,尽管他们没有解释或者分析这种方法。
Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.
此外,He等人在特征提取时丢弃了分类网的无卷积部分。他们结合proposals和空间金字塔池来产生一个局部的、固定长度的特征用于分类。尽管快速且有效,但是这种混合模型不能进行端到端的学习。
Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include
- small models restricting capacity and receptive fields;
- patchwise training [27, 2, 8, 28, 11];
- post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
- input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];
- multi-scale pyramid processing [8, 28, 11];
- saturating tanh nonlinearities [8, 5, 28]; and
- ensembles [2, 11],
whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and “shift-and-stitch” dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.
基于卷积网的dense prediction 近期的一些工作已经将卷积网应用于dense prediction问题,包括Ning等人的语义分割,Farabet等人以及Pinheiro和Collobert;Ciresan等人的电子显微镜边界预测以及Ganin和Lempitsky的通过混合卷积网和最邻近模型的处理自然场景图像;还有Eigen等人的图像修复和深度估计。这些方法的相同点包括如下:
- 限制容量和接收域的小模型
- patchwise训练
- 超像素投影的预处理,随机场正则化、滤波或局部分类
- 输入移位和dense输出的隔行交错输出
- 多尺度金字塔处理
- 饱和双曲线正切非线性
- 集成
然而我们的方法确实没有这种机制。但是我们研究了patchwise训练 (3.4节)和从FCNs的角度出发的“shift-and-stitch”dense输出(3.2节)。我们也讨论了神经网络内上采样(3.3节),其中Eigen等人[7]的全连接预测是一个特例。
Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.
和这些现有的方法不同的是,我们改编和扩展了深度分类架构,使用图像分类作为监督预处理,和从全部图像的输入和ground truths(用于有监督训练的训练集的分类准确性,即已经标注好的分割图片)通过全卷积微调进行简单且高效的学习。
Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.
Hariharan等人和Gupta等人也改编深度分类网到语义分割,但是也在混合proposal-classifier模型中这么做了。这些方法通过采样边界框和region proposal进行微调了R-CNN系统,用于检测、语义分割和实例分割。这两种办法都不能进行端到端的学习。他们分别在PASCAL VOC和NYUDv2实现了最好的分割效果,所以在第5节中我们直接将我们的独立的、端到端的FCN和他们的语义分割结果进行比较。
They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.
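We fuse features across layers to define a nonlinear local-to-global representation that we tune end-to-end. In contemporary work, Hariharan et al. also use multiple layers in their hybrid model for semantic segmentation.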
3. Fully convolutional networks
Each layer of data in a convnet is a three-dimensional array of size h×w×d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h×w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
卷积网的每层数据是一个h×w×d的三维数组,其中h和w是空间维度,d是特征或通道维数。第一层是像素尺寸为h×w、颜色通道数为d的图像。高层中的位置信息和图像中它们连通的位置信息相对应,被称为感受野。
Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by
$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\; sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$
where $k$ is called the kernel size, $s$ is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.
卷积网是以平移不变形作为基础的。其基本组成部分(卷积,池化和激励函数)作用在局部输入域,只依赖相对空间坐标。在特定层记 $ x_i^j$ 为在坐标(i,j)的数据向量,在下一层有 $ y_{ij}$, $ y_{ij}$的计算公式如下:
其中k为卷积核尺寸,s是步长或下采样因素,(f_{ks})决定了层的类型:一个卷积的矩阵乘或者是平均池化,用于最大池的最大空间值或者是一个激励函数的一个非线性元素,亦或是层的其他种类等等 。
This functional form is maintained under composition, with kernel size and stride obeying the transformation rule
$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$
While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.
当卷积核尺寸和步长遵从转换规则,这个函数形式被表述为如下形式:
当一个普通深度的网络计算一个普通的非线性函数,一个网络只有这种形式的层计算非线性滤波,我们称之为深度滤波或全卷积网络。FCN理应可以计算任意尺寸的输入并产生相应(或许重采样)空间维度的输出。
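The transformation rule for composed layers, $f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\, ss'}$, can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `compose` and `stack` and the example layer list are assumptions, not part of the paper.

```python
def compose(f_layer, g_layer):
    """Effective (kernel, stride) of f applied after g.

    Each layer is a (kernel_size, stride) pair; g is the earlier layer.
    """
    k, s = f_layer
    k_prime, s_prime = g_layer
    return (k_prime + (k - 1) * s_prime, s * s_prime)


def stack(layers):
    """Fold a list of layers, given in input-to-output order."""
    total = (1, 1)  # identity: 1x1 kernel, stride 1
    for layer in layers:
        total = compose(layer, total)
    return total


# Hypothetical example: two 3x3 stride-1 convolutions, then 2x2 stride-2 pooling.
effective_kernel, effective_stride = stack([(3, 1), (3, 1), (2, 2)])
```

The effective kernel size is the receptive field of one output unit, and the effective stride is the factor by which the output is coarser than the input.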
A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(x; \theta) = \sum_{ij} \ell'(x_{ij}; \theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.
一个实值损失函数有FCN定义了task。如果损失函数是一个最后一层的空间维度总和, $ iota (X; heta)=Sigma_{ij}iota'(X_{ij}; heta)$ ,它的梯度将是它的每层空间组成梯度总和。所以在全部图像上的基于l的随机梯度下降计算将和基于l'的梯度下降结果一样,将最后一层的所有接收域作为minibatch(分批处理)。
When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.
在这些接收域重叠很大的情况下,前反馈计算和反向传播计算整图的叠层都比独立的patch-by-patch有效的多。
We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick that OverFeat [29] introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.
我们接下来将解释怎么将分类网络转换到能产生粗输出图的全卷积网络。对于像素级预测,我们需要连接这些粗略的输出结果到像素。3.2节描述了一种技巧,快速扫描因此被引入。我们通过将它解释为一个等价网络修正而获得了关于这个技巧的一些领悟。作为一个高效的替换,我们引入了去卷积层用于上采样见3.3节。在3.4节,我们考虑通过patchwise取样训练,便在4.3节证明我们的全图式训练更快且同样有效。
3.1. Adapting classifiers for dense prediction
Typical recognition nets, including LeNet [21], AlexNet [19], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce nonspatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)
Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning.
典型的识别网络,包括LeNet, AlexNet, 和一些后继者,表面上采用的是固定尺寸的输入产生了非空间的输出。这些网络的全连接层有确定的位数并丢弃空间坐标。然而,这些全连接层也被看做是覆盖全部输入域的核卷积。需要将它们加入到可以采用任何尺寸输入并输出分类图的全卷积网络中。这种转换如图2所示。
图2 将全连接层转化到卷积层能使一个分类网络输出heatmap(热图)。添加层和一个空间损失(如图一所示)产生了一个高效的端到端的dense学习机制
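To make the reinterpretation concrete, the following NumPy sketch treats the weight matrix of a fully connected layer as a convolution kernel that covers the layer's whole input region, and slides it over a larger input to obtain a map of outputs. The function names and the tiny shapes are assumptions chosen for illustration; real classification nets are far larger.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def fully_connected(patch, weights, bias):
    """Classic fully connected layer on a fixed-size (k, k, d) patch."""
    return patch.reshape(-1) @ weights + bias


def convolutionalized(image, weights, bias, k):
    """Same weights, viewed as a k x k convolution over an (h, w, d) input."""
    d = image.shape[2]
    kernel = weights.reshape(k, k, d, -1)  # (k, k, d, n_out)
    # windows has shape (h - k + 1, w - k + 1, d, k, k)
    windows = sliding_window_view(image, (k, k), axis=(0, 1))
    return np.einsum("ijdab,abdo->ijo", windows, kernel) + bias


rng = np.random.default_rng(0)
k, d, n_out = 4, 3, 5
weights = rng.standard_normal((k * k * d, n_out))
bias = rng.standard_normal(n_out)
image = rng.standard_normal((6, 6, d))      # larger than the k x k training size
score_map = convolutionalized(image, weights, bias, k)  # shape (3, 3, n_out)
```

Each location of `score_map` equals the original layer evaluated on the corresponding patch, while the computation is shared across overlapping patches.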
Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to produce the classification scores of a 227 × 227 image, the fully convolutional version takes 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, which is more than 5 times faster than the naive approach.
此外,当作为结果的图在特殊的输入patches上等同于原始网络的估计,计算是高度摊销的在那些patches的重叠域上。例如,当AlexNet花费了1.2ms(在标准的GPU上)推算一个227227图像的分类得分,全卷积网络花费22ms从一张500500的图像上产生一个10*10的输出网格,比朴素法快了5倍多。
The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.
这些卷积化模式的空间输出图可以作为一个很自然的选择对于dense问题,比如语义分割。每个输出单元ground truth可用,正推法和逆推法都是直截了当的,都利用了卷积的固有的计算效率(和可极大优化性)。
The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.
对于AlexNet例子相应的逆推法的时间为单张图像时间2.4ms,全卷积的10*10输出图为37ms,结果是相对于顺推法速度加快了。 这种密集的反向传播如图1所示。
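The quoted speedups follow from simple arithmetic on the timings given in the text, assuming the naive approach evaluates the original net once per output location (100 evaluations for a 10 × 10 grid). The variable names below are illustrative.

```python
# Timings quoted in the text (milliseconds).
patch_forward_ms, dense_forward_ms = 1.2, 22.0
patch_backward_ms, dense_backward_ms = 2.4, 37.0

n_outputs = 10 * 10  # one patch evaluation per cell of the 10 x 10 output grid

naive_forward_ms = patch_forward_ms * n_outputs    # 120 ms
naive_backward_ms = patch_backward_ms * n_outputs  # 240 ms

forward_speedup = naive_forward_ms / dense_forward_ms
backward_speedup = naive_backward_ms / dense_backward_ms
print(f"forward: {forward_speedup:.2f}x, backward: {backward_speedup:.2f}x")
```

Both ratios come out above 5, consistent with the statement that the backward pass speeds up by a similar factor to the forward pass.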
While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.
当我们将分类网络重新解释为任意输出尺寸的全卷积域输出图,输出维数也通过下采样显著的减少了。分类网络下采样使filter保持小规模同时计算要求合理。这使全卷积式网络的输出结果变得粗糙,通过输入尺寸因为一个和输出单元的接收域的像素步长等同的因素来降低它。
3.2. Shift-and-stitch is filter rarefaction
Input shifting and output interlacing is a trick that yields dense predictions from coarse outputs without interpolation, introduced by OverFeat [29]. If the outputs are downsampled by a factor of $f$, the input is shifted (by left and top padding) $x$ pixels to the right and $y$ pixels down, once for every value of $(x, y) \in \{0, \ldots, f-1\} \times \{0, \ldots, f-1\}$. These $f^2$ inputs are each run through the convnet, and the outputs are interlaced so that the predictions correspond to the pixels at the centers of their receptive fields.
dense prediction能从粗糙输出中通过从输入的平移版本中将输出拼接起来获得。如果输出是因为一个因子f降低采样,平移输入的x像素到左边,y像素到下面,一旦对于每个(x,y)满足0<=x,y<=f.处理f^2个输入,并将输出交错以便预测和它们接收域的中心像素一致。
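A small NumPy sketch of shift-and-stitch follows. As a stand-in for a convnet it uses a single max-pooling "net" with kernel `k` and stride `f`, and it shifts the input by cropping rather than by padding; both choices are simplifying assumptions for illustration. The $f^2$ shifted inputs are run through the net and the outputs interlaced, which reproduces what the same net would compute at stride 1.

```python
import numpy as np


def window_net(x, k, stride):
    """Toy 'net': k x k max over windows taken at the given stride."""
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * stride:i * stride + k, j * stride:j * stride + k].max()
    return out


def shift_and_stitch(x, k, f):
    """Run the stride-f net on f*f shifted inputs and interlace the outputs."""
    dense = np.empty((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for dy in range(f):
        for dx in range(f):
            # Shift by cropping dy rows and dx columns, then interlace.
            dense[dy::f, dx::f] = window_net(x[dy:, dx:], k, f)
    return dense


x = np.random.default_rng(0).standard_normal((9, 8))
stitched = shift_and_stitch(x, k=3, f=2)
```

The cost is $f^2$ forward passes, which is the tradeoff the section goes on to discuss.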
Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride $s$, and a following convolution layer with filter weights $f_{ij}$ (eliding the feature dimensions, irrelevant here). Setting the lower layer’s input stride to 1 upsamples its output by a factor of $s$, just like shift-and-stitch. However, convolving the original filter with the upsampled output does not produce the same result as the trick, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as
$$f'_{ij} = \begin{cases} f_{i/s,\; j/s} & \text{if } s \text{ divides both } i \text{ and } j; \\ 0 & \text{otherwise,} \end{cases}$$
(with $i$ and $j$ zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed.
尽管单纯地执行这种转换增加了f^2的这个因素的代价,有一个非常有名的技巧用来高效的产生完全相同的结果,这个在小波领域被称为多孔算法。考虑一个层(卷积或者池化)中的输入步长s,和后面的滤波权重为f_ij的卷积层(忽略不相关的特征维数)。设置更低层的输入步长到l上采样它的输出影响因子为s。然而,将原始的滤波和上采样的输出卷积并没有产生和shift-and-stitch相同的结果,因为原始的滤波只看得到(已经上采样)输入的简化的部分。为了重现这种技巧,通过扩大来稀疏滤波,如下:
如果s能除以i和j,除非i和j都是0。重现该技巧的全网输出需要重复一层一层放大这个filter知道所有的下采样被移除。(在练习中,处理上采样输入的下采样版本可能会更高效。)
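Filter rarefaction as described here (the enlarged filter keeps the original weight at positions where $s$ divides both zero-based indices and is zero elsewhere) can be sketched in NumPy. The function name `rarefy` is an assumption for illustration.

```python
import numpy as np


def rarefy(filt, s):
    """Enlarge a 2-D filter by inserting s - 1 zeros between its taps.

    out[i, j] = filt[i // s, j // s] if s divides both i and j, else 0
    (zero-based indices).
    """
    k_h, k_w = filt.shape
    out = np.zeros(((k_h - 1) * s + 1, (k_w - 1) * s + 1), dtype=filt.dtype)
    out[::s, ::s] = filt
    return out


original = np.array([[1, 2],
                     [3, 4]])
enlarged = rarefy(original, 2)
# enlarged is
# [[1 0 2]
#  [0 0 0]
#  [3 0 4]]
```

The enlarged filter has the same number of nonzero weights as the original, so its receptive field grows without it seeing any finer detail.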
Simply decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. We have seen that the shift-and-stitch trick is another kind of tradeoff: the output is made denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.
在网内减少二次采样是一种折衷的做法:filter能看到更细节的信息,但是接受域更小而且需要花费很长时间计算。Shift-and -stitch技巧是另外一种折衷做法:输出更加密集且没有减小filter的接受域范围,但是相对于原始的设计filter不能感受更精细的信息。
Although we have done preliminary experiments with shift-and-stitch, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.
尽管我们已经利用这个技巧做了初步的实验,但是我们没有在我们的模型中使用它。正如在下一节中描述的,我们发现从上采样中学习更有效和高效,特别是接下来要描述的结合了跨层融合。
3.3. Upsampling is backwards strided convolution
Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output $y_{ij}$ from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
另一种连接粗糙输出到dense像素的方法就是插值法。比如,简单的双线性插值计算每个输出(y_{ij})来自只依赖输入和输出单元的相对位置的线性图最近的四个输入。
In a sense, upsampling with factor $f$ is convolution with a fractional input stride of $1/f$. So long as $f$ is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of $f$. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
从某种意义上,伴随因子f的上采样是对步长为1/f的分数式输入的卷积操作。只要f是整数,一种自然的方法进行上采样就是向后卷积(有时称为去卷积)伴随输出步长为f。这样的操作实现是不重要的,因为它只是简单的调换了卷积的顺推法和逆推法。所以上采样在网内通过计算像素级别的损失的反向传播用于端到端的学习
Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
需要注意的是去卷积滤波在这种层面上不需要被固定不变(比如双线性上采样)但是可以被学习。一堆反褶积层和激励函数甚至能学习一种非线性上采样。
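As a rough illustration of backwards (transposed) convolution with output stride $f$, the NumPy sketch below scatters a scaled copy of a kernel into the output for every input cell. The kernel is set to a bilinear interpolation filter, one common way to initialize such a layer before learning; the exact kernel-size formula and the function names are assumptions, not taken from the text.

```python
import numpy as np


def bilinear_kernel(f):
    """2-D bilinear interpolation kernel for integer upsampling factor f."""
    size = 2 * f - f % 2
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)


def deconvolve(x, kernel, f):
    """Transposed convolution of a 2-D map with output stride f (no cropping)."""
    size = kernel.shape[0]
    h, w = x.shape
    out = np.zeros(((h - 1) * f + size, (w - 1) * f + size))
    for i in range(h):
        for j in range(w):
            # Each input cell adds a scaled copy of the kernel to the output.
            out[i * f:i * f + size, j * f:j * f + size] += x[i, j] * kernel
    return out


coarse = np.ones((3, 3))
upsampled = deconvolve(coarse, bilinear_kernel(2), f=2)
```

Away from the border, a constant input stays constant after upsampling, as expected of interpolation. In a network the kernel entries would be parameters updated by backpropagation.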
In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.
在我们的实验中,我们发现在网内的上采样对于学习dense prediction是快速且有效的。我们最好的分割架构利用了这些层来学习上采样用以微调预测,见4.2节。
3.4. Patchwise training is loss sampling
In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently, applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.
在随机优化中,梯度计算是由训练分布支配的。patchwise 训练和全卷积训练能被用来产生任意分布,尽管他们相对的计算效率依赖于重叠域和minibatch的大小。在每一个由所有的单元接受域组成的批次在图像的损失之下(或图像的集合)整张图像的全卷积训练等同于patchwise训练。当这种方式比patches的均匀取样更加高效的同时,它减少了可能的批次数量。然而在一张图片中随机选择patches可能更容易被重新找到。限制基于它的空间位置随机取样子集产生的损失(或者可以说应用输入和输出之间的DropConnect mask [39] )排除来自梯度计算的patches。
If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.
如果保存下来的patches依然有重要的重叠,全卷积计算依然将加速训练。如果梯度在多重逆推法中被积累,batches能包含几张图的patches。
Sampling in patchwise training can correct class imbalance [27, 8, 2] and mitigate the spatial correlation of dense patches [28, 16]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.
patcheswise训练中的采样能纠正分类失调和减轻密集空间相关性的影响。在全卷积训练中,分类平衡也能通过给损失赋权重实现,对损失采样能被用来标识空间相关。
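Loss sampling and class weighting can both be expressed as a per-pixel weight on the spatial loss terms. The NumPy sketch below is a minimal illustration under assumed names (`sampled_weighted_loss`, `keep_prob`, `class_weights`); it operates on an already computed per-pixel loss map rather than on a real network.

```python
import numpy as np


def sampled_weighted_loss(per_pixel_loss, labels, class_weights, keep_prob, rng):
    """Sum a random subset of spatial loss terms, weighted by class.

    Returns the scalar loss and the per-pixel weights, which are also the
    gradient of the scalar loss with respect to each per-pixel term.
    """
    mask = rng.random(per_pixel_loss.shape) < keep_prob  # loss sampling
    weights = class_weights[labels] * mask               # class balancing
    return (weights * per_pixel_loss).sum(), weights


rng = np.random.default_rng(0)
per_pixel_loss = rng.random((4, 5))
labels = rng.integers(0, 3, size=(4, 5))
class_weights = np.array([0.5, 1.0, 2.0])  # hypothetical weights for 3 classes

loss, weights = sampled_weighted_loss(per_pixel_loss, labels, class_weights, 0.5, rng)
```

Pixels with zero weight contribute nothing to the gradient, which is how sampling excludes their patches from the update.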
We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.
我们研究了4.3节中的伴有采样的训练,没有发现对于dense prediction它有更快或是更好的收敛效果。全图式训练是有效且高效的。
4. Segmentation Architecture
We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction.
我们将ILSVRC分类应用到FCNs增大它们用于dense prediction结合网内上采样和像素级损失。我们通过微调为分割进行训练。接下来我们增加了跨层来融合粗的、语义的和局部的表征信息。这种跨层式架构能学习端到端来改善输出的语义和空间预测。
For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [7]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.
为此,我们训练和在PASCAL VOC 2011分割挑战赛中验证。我们训练逐像素的多项式逻辑损失和验证标准度量的在集合中平均像素交集还有基于所有分类上的平均接收,包括背景。这个训练忽略了那些在groud truth中被遮盖的像素(模糊不清或者很难辨认)。
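The validation metric can be computed from a confusion matrix. In the sketch below, the ignore value 255 and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration; the function name `mean_iu` is also hypothetical.

```python
import numpy as np


def mean_iu(pred, truth, n_classes, ignore_label=255):
    """Mean intersection over union across classes, skipping masked pixels."""
    keep = truth != ignore_label
    # confusion[i, j] = number of pixels of true class i predicted as class j
    idx = n_classes * truth[keep].astype(int) + pred[keep].astype(int)
    confusion = np.bincount(idx, minlength=n_classes ** 2)
    confusion = confusion.reshape(n_classes, n_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # assumption: skip classes absent from both maps
    return (intersection[present] / union[present]).mean()


truth = np.array([[0, 0],
                  [1, 255]])  # 255 marks an ignored pixel
pred = np.array([[0, 1],
                 [1, 1]])
score = mean_iu(pred, truth, n_classes=2)
```

Here each class has one correctly labeled pixel and a union of two pixels, so both per-class scores and their mean are 0.5.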
4.1. From classifier to dense FCN
We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet architecture [19] that won ILSVRC12, as well as the VGG nets [31] and the GoogLeNet [32] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net, which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).
我们在第3节中以卷积证明分类架构的。我们认为拿下了ILSVRC12的AlexNet3架构和VGG nets、GoogLeNet4一样在ILSVRC14上表现的格外好。我们选择VGG 16层的网络5,发现它和19层的网络在这个任务(分类)上相当。对于GoogLeNet,我们仅仅使用的最后的损失层,通过丢弃了最后的平均池化层提高了表现能力。我们通过丢弃最后的分类切去每层网络头,然后将全连接层转化成卷积层。我们附加了一个1*1的、通道维数为21的卷积来预测每个PASCAL分类(包括背景)的得分在每个粗糙的输出位置,后面紧跟一个去卷积层用来双线性上采样粗糙输出到像素密集输出如3.3.节中描述。表1将初步验证结果和每层的基础特性比较。我们发现最好的结果在以一个固定的学习速率得到(最少175个epochs)。
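(Table 1. We adapt and extend three classification convnets. We compare performance by mean intersection over union on the validation set of PASCAL VOC 2011 and by inference time, averaged over 20 trials for a 500 × 500 input on an NVIDIA Tesla K40c. We detail the architecture of the adapted nets with regard to dense prediction: number of parameter layers, receptive field size of output units, and the coarsest stride within the net. These numbers give the best performance obtained at a fixed learning rate, not the best performance possible.)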
Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼75% of state-of-the-art performance. The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [16]. Training on extra data raises performance to 59.4 mean IU on a subset of val. Training details are given in Section 4.3.
从分类到分割的微调对每层网络有一个合理的预测。甚至最坏的模型也能达到大约75%的良好表现。内设分割的VGG网络(FCN-VGG16)已经在val上平均IU 达到了56.0取得了最好的成绩,相比于52.6 [17] 。在额外数据上的训练将FCN-VGG16提高到59.4,将FCN-AlexNet提高到48.0。
Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.
尽管相同的分类准确率,我们的用GoogLeNet并不能和VGG16的分割结果相比较。
4.2. Combining what and where
We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.
我们定义了一个新的全卷积网用于结合了特征层级的分割并提高了输出的空间精度,见图3。
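(Figure 3. Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Second row (FCN-16s): combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details while retaining high-level semantic information. Third row (FCN-8s): additional predictions from pool3, at stride 8, provide further precision.)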
While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in Section 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.
当全卷积分类能被微调用于分割如4.1节所示,甚至在标准度量上得分更高,它们的输出不是很粗糙(见图4)。最后预测层的32像素步长限制了上采样输入的细节的尺寸。
We address this by adding links that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the multiscale local jet of Florack et al. [10], we call our nonlinear local feature hierarchy the deep jet.
我们提出增加结合了最后预测层和有更细小步长的更低层的跨层信息,将一个线划拓扑结构转变成DAG(有向无环图),并且边界将从更底层向前跳跃到更高(图3)。因为它们只能获取更少的像素点,更精细的尺寸预测应该需要更少的层,所以从更浅的网中将它们输出是有道理的。结合了精细层和粗糙层让模型能做出遵从全局结构的局部预测。与Koenderick 和an Doorn [21]的jet类似,我们把这种非线性特征层称之为deep jet。
We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing both predictions (see Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.
我们首先将输出步长分为一半,通过一个16像素步长层预测。我们增加了一个1*1的卷积层在pool4的顶部来产生附加的类别预测。我们将输出和预测融合在conv7(fc7的卷积化)的顶部以步长32计算,通过增加一个2×的上采样层和预测求和(见图3)。我们初始化这个2×上采样到双线性插值,但是允许参数能被学习,如3.3节所描述、最后,步长为16的预测被上采样回图像,我们把这种网结构称为FCN-16s。FCN-16s用来学习端到端,能被最后的参数初始化。这种新的、在pool4上生效的参数是初始化为0 的,所以这种网结构是以未变性的预测开始的。这种学习速率是以100倍的下降的。
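The FCN-16s fusion step can be sketched in NumPy as a 1 × 1 convolution on pool4 (a per-location matrix multiply) summed with 2× upsampled conv7 scores. For brevity the sketch uses nearest-neighbor repetition as a stand-in for the learned, bilinearly initialized upsampling layer, and assumes the two maps line up exactly with no cropping; the function names and shapes are illustrative only.

```python
import numpy as np


def upsample2x(scores):
    """Stand-in for the 2x upsampling layer (nearest neighbor, an assumption)."""
    return scores.repeat(2, axis=0).repeat(2, axis=1)


def fuse_fcn16s(conv7_scores, pool4_features, w_pool4, b_pool4):
    """Sum upsampled stride-32 scores with stride-16 scores from pool4.

    conv7_scores:   (h, w, n_classes) class scores at stride 32
    pool4_features: (2h, 2w, c) features at stride 16
    w_pool4, b_pool4: parameters of the 1 x 1 scoring convolution
    """
    pool4_scores = pool4_features @ w_pool4 + b_pool4  # 1 x 1 convolution
    return upsample2x(conv7_scores) + pool4_scores


rng = np.random.default_rng(0)
n_classes, c = 21, 8
conv7_scores = rng.standard_normal((3, 4, n_classes))
pool4_features = rng.standard_normal((6, 8, c))

# Zero-initialized pool4 scoring layer: fusion starts from unmodified predictions.
fused = fuse_fcn16s(conv7_scores, pool4_features,
                    np.zeros((c, n_classes)), np.zeros(n_classes))
```

With the zero initialization, `fused` equals the upsampled conv7 scores, which is why training starts from the FCN-32s predictions.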
Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer (which resulted in poor performance), and simply decreasing the learning rate without adding the extra link (which results in an insignificant performance improvement, without improving the quality of the output).
学习这种跨层网络能在3.0平均IU的有效集合上提高到62.4。图4展示了在精细结构输出上的提高。我们将这种融合学习和仅仅从pool4层上学习进行比较,结果表现糟糕,而且仅仅降低了学习速率而没有增加跨层,导致了没有提高输出质量的没有显著提高表现。
We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s. We obtain a minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers.
我们继续融合pool3和一个融合了pool4和conv7的2×上采样预测,建立了FCN-8s的网络结构。在平均IU上我们获得了一个较小的附加提升到62.7,然后发现了一个在平滑度和输出细节上的轻微提高。这时我们的融合提高已经得到了一个衰减回馈,既在强调了大规模正确的IU度量的层面上,也在提升显著度上得到反映,如图4所示,所以即使是更低层我们也不需要继续融合。
Refinement by other means Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14×14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important.
其他方式精炼化 减少池层的步长是最直接的一种得到精细预测的方法。然而这么做对我们的基于VGG16的网络带来问题。设置pool5的步长到1,要求我们的卷积fc6核大小为14*14来维持它的接收域大小。另外它们的计算代价,通过如此大的滤波器学习非常困难。我们尝试用更小的滤波器重建pool5之上的层,但是并没有得到有可比性的结果;一个可能的解释是ILSVRC在更上层的初始化时非常重要的。
Another way to obtain finer predictions is to use the shift-and-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion.
另一种获得精细预测的方法就是利用3.2节中描述的shift-and-stitch技巧。在有限的实验中,我们发现从这种方法的提升速率比融合层的方法花费的代价更高。
4.3. Experimental framework
Optimization We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of $10^{-3}$, $10^{-4}$, and $5^{-5}$ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of $5^{-4}$ or $2^{-4}$, and doubled the learning rate for biases, although we found training to be insensitive to these parameters (but sensitive to the learning rate). We zero-initialize the class scoring convolution layer, finding random initialization to yield neither better performance nor faster convergence. Dropout was included where used in the original classifier nets.
**优化** 我们使用带动量(momentum)的SGD进行训练。我们使用大小为20张图像的minibatch,并对FCN-AlexNet、FCN-VGG16和FCN-GoogLeNet分别采用固定学习率$10^{-3}$、$10^{-4}$和$5^{-5}$,这些值通过线性搜索选定。我们使用0.9的动量、$5^{-4}$或$2^{-4}$的权值衰减,并将偏置的学习率加倍,尽管我们发现训练对这些参数并不敏感(但对学习率敏感)。我们将类别得分卷积层初始化为零,因为随机初始化既没有带来更好的表现,也没有带来更快的收敛。凡是原始分类网络中使用了Dropout的地方,我们都予以保留。
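To make the update rule concrete, here is a minimal sketch of one SGD-with-momentum step with L2 weight decay and a per-parameter learning-rate multiplier (used to double the rate for biases). The function name `sgd_momentum_step` and its argument names are hypothetical, and the exact update form is an assumption: it is the common "velocity" formulation, not necessarily the one the authors' solver implements.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr,
                      momentum=0.9, weight_decay=0.0, lr_mult=1.0):
    """One SGD step with momentum and L2 weight decay.

    lr_mult scales the base learning rate for this parameter,
    e.g. lr_mult=2.0 for biases (assumed formulation).
    """
    step = lr * lr_mult * (grad + weight_decay * param)
    velocity = momentum * velocity - step
    return param + velocity, velocity

# Illustrative values only: a weight tensor and a bias sharing one base rate.
weights, v_w = np.ones(3), np.zeros(3)
bias, v_b = np.ones(1), np.zeros(1)
base_lr = 1e-4  # the FCN-VGG16 rate quoted in the text

weights, v_w = sgd_momentum_step(weights, np.full(3, 0.5), v_w, base_lr)
bias, v_b = sgd_momentum_step(bias, np.full(1, 0.5), v_b, base_lr, lr_mult=2.0)
```

With identical gradients, the bias moves twice as far as the weights on the first step, which is all that "doubling the learning rate for biases" amounts to here.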
Fine-tuning We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full finetuning performance as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions.
微调 我们通过在整个网络上反向传播来微调所有层。如表2的比较所示,仅微调输出分类器只能达到全微调表现的70%。考虑到学习基础分类网络所需的时间,从头开始训练是不可行的。(注意VGG网络是分阶段训练的,而我们是从完整的16层版本初始化的。)对于粗糙的FCN-32s版本,在单GPU上微调需要三天时间,升级到FCN-16s和FCN-8s版本各需约一天。
Patch Sampling As explained in Section 3.4, our full image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 8, 28, 11], potentially resulting in higher variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability 1−p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.
patch取样 正如3.4节中解释的,我们的全图有效地训练每张图片batches到常规的、大的、重叠的patches网格。相反的,先前工作随机样本patches在一整个数据集,可能导致更高的方差batches,可能加速收敛。我们通过空间采样之前方式描述的损失研究这种折中,以1-p的概率做出独立选择来忽略每个最后层单元。为了避免改变有效的批次尺寸,我们同时以因子1/p增加每批次图像的数量。注意的是因为卷积的效率,在足够大的p值下,这种拒绝采样的形式依旧比patchwose训练要快(比如,根据3.1节的数量,最起码p>0.2)图5展示了这种收敛的采样的效果。我们发现采样在收敛速率上没有很显著的效果相对于全图式训练,但是由于每个每个批次都需要大量的图像,很明显的需要花费更多的时间。
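The loss sampling described above can be sketched as a random mask over the final-layer cells. This is a minimal illustration, assuming the loss is simply averaged over the kept cells (the text does not specify the normalization); `sampled_loss` and `images_per_batch` are hypothetical helper names.

```python
import numpy as np

def sampled_loss(per_cell_loss, p, rng):
    """Keep each final-layer cell with probability p (ignore with 1 - p)
    and average the loss over the kept cells (assumed normalization)."""
    keep = rng.random(per_cell_loss.shape) < p
    if not keep.any():
        return 0.0, keep
    return float(per_cell_loss[keep].mean()), keep

def images_per_batch(base_batch, p):
    """Scale the batch by 1/p so the expected number of kept cells
    per batch stays roughly constant."""
    return int(np.ceil(base_batch / p))

rng = np.random.default_rng(0)
loss_map = rng.random((16, 16))          # stand-in for a per-cell loss map
loss, mask = sampled_loss(loss_map, p=0.5, rng=rng)
batch = images_per_batch(20, p=0.5)      # 20 images is the batch size in the text
```

The extra images per batch are what make sampling slower in wall-clock time, even though each image contributes fewer loss terms.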
Class Balancing Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.
分类平衡 全卷积训练能通过按权重或对损失采样平衡类别。尽管我们的标签有轻微的不平衡(大约3/4是背景),我们发现类别平衡不是必要的。
Dense Prediction The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned. Shift-and-stitch (Section 3.2), or the filter rarefaction equivalent, are not used.
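As an illustration of what initializing a deconvolution filter to bilinear interpolation can look like, the sketch below builds a 2D bilinear kernel and places it on the diagonal of a class-to-class weight tensor. The construction is a commonly used one and is an assumption here, not something the text specifies; the names `bilinear_kernel` and `bilinear_deconv_weights`, and the choice of a 4×4 kernel for 2× upsampling, are likewise illustrative.

```python
import numpy as np

def bilinear_kernel(size):
    """Return a (size, size) bilinear interpolation kernel."""
    factor = (size + 1) // 2
    # The kernel centre sits on a pixel for odd sizes, between pixels for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - abs(rows - center) / factor) *
            (1 - abs(cols - center) / factor))

def bilinear_deconv_weights(n_classes, size):
    """Weights of shape (n_classes, n_classes, size, size): each class
    is upsampled independently by the same bilinear kernel."""
    weights = np.zeros((n_classes, n_classes, size, size))
    weights[range(n_classes), range(n_classes)] = bilinear_kernel(size)
    return weights

kernel = bilinear_kernel(4)              # assumed kernel size for 2x upsampling
weights = bilinear_deconv_weights(21, 4)
```

Keeping the off-diagonal filters at zero means the layer starts out as pure per-class interpolation; for the intermediate layers, learning is then free to move the weights away from this starting point.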
Augmentation We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.
数据增强 我们尝试通过随机镜像和“抖动”图像来增加训练数据,方法是在每个方向将它们转换为32个像素(最粗略的预测尺度)。这没有明显的改善。
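A minimal sketch of this kind of augmentation follows: a random horizontal mirror, then a random translation of up to 32 pixels in each direction, applied identically to image and label. Several details are assumptions not stated in the text: the vacated border is filled with zeros in the image and with an ignore label of 255 in the label map, and the helper names `shift` and `mirror_and_jitter` are hypothetical.

```python
import numpy as np

def shift(arr, dy, dx, fill):
    """Translate arr by (dy, dx) pixels, filling vacated cells with `fill`."""
    h, w = arr.shape[:2]
    out = np.full_like(arr, fill)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = arr[src_y, src_x]
    return out

def mirror_and_jitter(image, label, rng, max_shift=32, ignore_label=255):
    """Randomly mirror and translate an (H, W, C) image and its (H, W) label."""
    if rng.random() < 0.5:
        # Horizontal mirror applied to both image and label.
        image, label = image[:, ::-1], label[:, ::-1]
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    # Border fill values (0 and ignore_label) are assumptions.
    return shift(image, dy, dx, 0), shift(label, dy, dx, ignore_label)

rng = np.random.default_rng(0)
image = np.ones((64, 64, 3), dtype=np.float32)
label = np.zeros((64, 64), dtype=np.uint8)
aug_image, aug_label = mirror_and_jitter(image, label, rng)
```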
More Training Data The PASCAL VOC 2011 segmentation challenge training set, which we used for Table 1, labels 1112 images. Hariharan et al. [15] have collected labels for a much larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS. This training data improves the FCN-VGG16 validation score by 3.4 points to 59.4 mean IU.
**更多训练数据** 我们用于表1的PASCAL VOC 2011分割挑战训练集标注了1112张图像。Hariharan等人 [15] 为一组大得多的8498张PASCAL训练图像收集了标签,这些数据曾用于训练先前最先进的系统SDS。这些训练数据将FCN-VGG16的验证分数提高了3.4点,达到59.4平均IU。
Implementation All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. The models and code will be released open-source on publication.
实施所有的模型都是在一个nvidia tesla k40c上用caffe进行训练和测试的,模型和代码将在发布时开源发布。
5. Results
We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.
我们在语义分割和场景解析上测试我们的FCN,研究了PASCAL VOC、NYUDv2和SIFT Flow。尽管这些任务在历史上区分了物体和区域,我们将两者统一视为像素预测。我们在这些数据集上分别评估我们的FCN跨层式架构,然后针对NYUDv2将它扩展为多模态输入,针对SIFT Flow的语义和几何标签将它扩展为多任务预测。
Metrics We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let $n_{ij}$ be the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{cl}$ different classes, and let $t_i = \sum_j n_{ij}$ be the total number of pixels of class $i$. We compute:
- pixel accuracy: $\sum_i n_{ii} / \sum_i t_i$
- mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$
- mean IU: $(1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
- frequency weighted IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
度量 我们报告常见语义分割和场景解析评估中的四种度量,它们是像素准确率和区域交并比(IU)的变体。令$n_{ij}$为类别$i$中被预测为类别$j$的像素数量,共有$n_{cl}$个不同的类别,令$t_i = \sum_j n_{ij}$为类别$i$的像素总数。我们计算:
- 像素准确率: $\sum_i n_{ii} / \sum_i t_i$
- 平均准确率: $(1/n_{cl}) \sum_i n_{ii} / t_i$
- 平均 IU: $(1/n_{cl}) \sum_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
- 频率加权 IU: $(\sum_k t_k)^{-1} \sum_i t_i n_{ii} / (t_i + \sum_j n_{ji} - n_{ii})$
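The four metrics above can be computed directly from a confusion matrix whose entry `n[i, j]` counts pixels of class $i$ predicted as class $j$. The sketch below is a minimal NumPy version; the function name `segmentation_metrics` is hypothetical, and it assumes every class occurs at least once in the ground truth (so that no $t_i$ is zero).

```python
import numpy as np

def segmentation_metrics(n):
    """Compute the four metrics from a confusion matrix n, where
    n[i, j] is the number of pixels of class i predicted as class j.
    Assumes every class has at least one ground-truth pixel."""
    n = np.asarray(n, dtype=float)
    t = n.sum(axis=1)              # t_i: ground-truth pixels of class i
    predicted = n.sum(axis=0)      # sum_j n_ji: pixels predicted as class i
    correct = np.diag(n)           # n_ii
    iu = correct / (t + predicted - correct)
    return {
        "pixel_accuracy": correct.sum() / t.sum(),
        "mean_accuracy": np.mean(correct / t),
        "mean_iu": np.mean(iu),
        "frequency_weighted_iu": (t * iu).sum() / t.sum(),
    }

# Toy two-class confusion matrix (illustrative numbers only).
metrics = segmentation_metrics([[3, 1],
                                [1, 5]])
```

For the toy matrix, 8 of the 10 pixels are on the diagonal, so the pixel accuracy is 0.8, while the per-class IUs are 3/5 and 5/7.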
PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [16], and the well-known R-CNN [12]. We achieve the best results on mean IU by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).
PASCAL VOC 表3给出了我们的FCN-8s在PASCAL VOC 2011和2012测试集上的表现,并将它与之前最先进的方法SDS [16] 和著名的R-CNN [12] 进行比较。我们在平均IU上取得了最好的结果,相对提升了20%。推理时间降低了114×(仅卷积网络,忽略proposals和refinement)或286×(整体)。
NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [13]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [14], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.
NYUDv2 是一个使用Microsoft Kinect采集的RGB-D数据集。它包含1449张RGB-D图像,其像素级标签已由Gupta等人合并为一个40类的语义分割任务。我们在795张训练图像和654张测试图像的标准划分上报告结果。(注意:所有的模型选择都是在PASCAL 2011 val上进行的。)表4给出了我们的模型在几种变体下的表现。首先我们在RGB图像上训练未经修改的粗糙模型(FCN-32s)。为了加入深度信息,我们训练了一个升级后可接受四通道RGB-D输入的模型(早期融合)。这几乎没有带来好处,也许是因为难以让有意义的梯度一直传播通过整个模型。继Gupta等人的成功之后,我们尝试了深度的三维HHA编码,仅在该信息上训练网络,同时也尝试了RGB与HHA的“后期融合”:两个网络的预测在最后一层求和,得到的双流网络进行端到端的学习。最后我们将这个后期融合网络升级为16步长的版本。
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images, show state-of-the-art performance on both tasks.
SIFT Flow是一个带有33语义范畴(“桥”、“山”、“太阳”)的像素标签的2688张图片的数据集和3个几何分类(“水平”、“垂直”和“sky")一样。一个FCN能自然学习共同代表权,即能同时预测标签的两种类别。我们学习FCN-16s的一种双向版本结合语义和几何预测层和损失。这种学习模型在这两种任务上作为独立的训练模型表现很好,同时它的学习和推理基本上和每个独立的模型一样快。表5的结果显示,计算在标准分离的2488张训练图片和200张测试图片上计算,在这两个任务上都表现的极好。
6. Conclusion
Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.
全卷积网络是模型非常重要的部分,是现代化分类网络中一个特殊的例子。认识到这个,将这些分类网络扩展到分割并通过多分辨率的层结合显著提高先进的技术,同时简化和加速学习和推理。
Acknowledgements This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.
鸣谢 这项工作有以下部分支持DARPA's MSEE和SMISC项目,NSF awards IIS-1427425, IIS-1212798, IIS-1116411, 还有NSF GRFP,Toyota, 还有 Berkeley Vision和Learning Center。我们非常感谢NVIDIA捐赠的GPU。我们感谢Bharath Hariharan 和Saurabh Gupta的建议和数据集工具;我们感谢Sergio Guadarrama 重构了Caffe里的GoogLeNet;我们感谢Jitendra Malik的有帮助性评论;感谢Wei Liu指出了我们SIFT Flow平均IU计算上的一个问题和频率权重平均IU公式的错误。
A. Upper Bounds on IU
In this paper, we have achieved good performance on the mean IU segmentation metric even with coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various scales. We do this by downsampling ground truth images and then upsampling them again to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.
在这篇论文中,我们已经在平均IU分割度量上取到了很好的效果,即使是粗糙的语义预测。为了更好的理解这种度量还有关于这种方法的限制,我们在计算不同的规模上预测的表现的大致上界。我们通过下采样ground truth图像,然后再次对它们进行上采样,来模拟可以获得最好的结果,其伴随着特定的下采样因子。下表给出了不同下采样因子在PASCAL2011 val的一个子集上的平均IU。pixel-perfect预测很显然在取得最最好效果上不是必须的,而且,相反的,平均IU不是一个好的精细准确度的测量标准。
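The upper-bound experiment can be mimicked on a synthetic label map: downsample the ground truth, upsample it back, and score the result against the original. The sketch below is an assumption-laden illustration: it uses strided subsampling and nearest-neighbour upsampling (the text does not say which resampling the authors used), a synthetic disk instead of PASCAL images, and the hypothetical helper names `down_up` and `mean_iu`.

```python
import numpy as np

def down_up(label, factor):
    """Subsample a label map by `factor`, then nearest-neighbour
    upsample it back to the original size (assumed resampling)."""
    h, w = label.shape
    coarse = label[::factor, ::factor]
    restored = coarse.repeat(factor, axis=0).repeat(factor, axis=1)
    return restored[:h, :w]

def mean_iu(gt, pred, n_classes):
    """Mean IU over the classes that occur in gt or pred."""
    conf = np.bincount(n_classes * gt.ravel() + pred.ravel(),
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    correct = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - correct
    present = union > 0
    return float(np.mean(correct[present] / union[present]))

# Synthetic ground truth: a disk (class 1) on background (class 0).
yy, xx = np.mgrid[:64, :64]
gt = ((yy - 30) ** 2 + (xx - 30) ** 2 < 15 ** 2).astype(int)

bounds = {f: mean_iu(gt, down_up(gt, f), n_classes=2) for f in (1, 4, 16)}
```

A factor of 1 reproduces the ground truth exactly, so its bound is 1.0; larger factors lose boundary detail and lower the best achievable mean IU for this shape.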
B. More Results
We further evaluate our FCN for semantic segmentation.
PASCAL-Context [26] provides whole scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59 class task defined by [26] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 6, we compare to the joint object + stuff variation of Convolutional Feature Masking [3] which is the previous state-of-the-art on this task. FCN-8s scores 35.1 mean IU for an 11% relative improvement.
我们将我们的FCN用于语义分割进行了更进一步的评估。
PASCAL-Context [26] 提供了PASCAL VOC 2010的全场景标注。虽然有超过400种不同的类别,我们遵循 [26] 定义的59类任务,该任务选取了出现最频繁的类别。我们分别在训练集和val集上进行训练和评估。在表6中,我们与Convolutional Feature Masking [3] 的joint object + stuff变体进行比较,后者是此前这项任务上最先进的方法。FCN-8s的平均IU得分为35.1,相对提高了11%。
Changelog
The arXiv version of this paper is kept up-to-date with corrections and additional relevant material. The following gives a brief history of changes. v2: Add Appendix A giving upper bounds on mean IU and Appendix B with PASCAL-Context results. Correct PASCAL validation numbers (previously, some val images were included in train), SIFT Flow mean IU (which used an inappropriately strict metric), and an error in the frequency weighted mean IU formula. Add link to models and update timing numbers to reflect improved implementation (which is publicly available).
论文的arXiv版本保持着最新的修正和其他的相关材料,接下来给出一份简短的变更历史。v2 添加了附录A和附录B。修正了PASCAL的有效数量(之前一些val图像被包含在训练中),SIFT Flow平均IU(用的不是很规范的度量),还有频率权重平均IU公式的一个错误。添加了模型和更新时间数字来反映改进的实现的链接(公开可用的)。
References
略