Abstract—We present a simple and effective architecture for fine-grained visual recognition called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs belong to the class of orderless texture representations but, unlike prior work, they can be trained in an end-to-end manner. Our most accurate model obtains 84.1%, 79.4%, 86.9%, and 91.3% per-image accuracy on the Caltech-UCSD birds [67], NABirds [64], FGVC aircraft [42], and Stanford cars [33] datasets, respectively, and runs at 30 frames per second on an NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) the features are also effective for other image classification tasks such as texture and scene recognition, and (3) the networks can be trained from scratch on the ImageNet dataset, offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn
Fine-grained recognition involves classification of instances
within a subordinate category. Examples include
recognition of species of birds, models of cars, or
breeds of dogs. These tasks often require recognition of
highly localized attributes of objects while being invariant
to their pose and location in the image.
For example, distinguishing a “California gull” from a “Ring-billed gull”
requires recognizing patterns on their bills or subtle color differences in their feathers [1].
There are two broad classes of techniques that are effective for these tasks.
Part-based models construct representations by localizing parts
and extracting features conditioned on their detected locations.
This makes subsequent reasoning about appearance
easier since the variations due to location, pose, and
viewpoint changes are factored out. Holistic models, on the
other hand, construct a representation of the entire image
directly. These include classical image representations, such
as Bag-of-Visual-Words [12] and their variants popularized
for texture analysis. Most modern approaches are based
on representations extracted using Convolutional Neural Networks (CNNs) pre-trained on the ImageNet dataset [54].
While part-based models based on CNNs are more accurate,
they require part annotations during training. This makes
them less applicable in domains where such annotations are
difficult or expensive to obtain, including categories without
a clearly defined set of parts such as textures and scenes.
In this paper we argue that the effectiveness of part-based
reasoning is due to its invariance to the position and
pose of the object. Texture representations are translationally
invariant by design as they are based on aggregation of
local image features in an orderless manner. While classical
texture representations based on SIFT [40] and their recent
extensions based on CNNs [11], [24] have been shown to be
effective at fine-grained recognition, they have not matched
the performance of part-based approaches. A potential reason
for this gap is that the underlying features in texture
representations are not learned in an end-to-end manner
and are likely to be suboptimal for the recognition task.
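The invariance argument above can be illustrated with a small NumPy sketch. It is not taken from the paper: the function name `orderless_pool`, the array sizes, and the use of sum pooling are assumptions made for illustration, and the circular shift is an idealized stand-in for translation (a real image shift also changes what appears at the borders).

```python
import numpy as np


def orderless_pool(features):
    """Aggregate local features of shape (H, W, D) into a (D,) descriptor.

    Summing over all spatial locations discards where each feature
    occurred, which is what makes the representation orderless.
    """
    return features.sum(axis=(0, 1))


rng = np.random.default_rng(0)
features = rng.standard_normal((6, 6, 8))  # hypothetical 6x6 grid of 8-D local features

# Idealized translation: circularly shift the feature grid by (2, 3).
shifted = np.roll(features, shift=(2, 3), axis=(0, 1))

# Stronger test: shuffle the 36 locations into an arbitrary order.
shuffled = rng.permutation(features.reshape(-1, 8))

same_after_shift = np.allclose(orderless_pool(features), orderless_pool(shifted))
same_after_shuffle = np.allclose(orderless_pool(features), shuffled.sum(axis=0))
print(same_after_shift, same_after_shuffle)
```

Both comparisons hold because addition is commutative: the pooled descriptor depends only on which local features are present, not on where they occur.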
We present Bilinear CNNs (B-CNNs) that address several
drawbacks of existing deep texture representations. Our key insight is that several widely-used texture representations can be written as a pooled outer product of two suitably designed features.
When these features are based on CNNs, the resulting architecture consists of standard CNN units for feature extraction, followed by a specially designed
bilinear layer and a pooling layer.
The output is a fixed high-dimensional representation which can be combined
with a fully-connected layer to predict class labels.
The simplest bilinear layer is one where two identical features
are combined with an outer product.
This is closely related to the Second-Order Pooling approach of Carreira et
al. [8] popularized for semantic image segmentation.
We also show that other texture representations can be written as B-CNNs once suitable non-linearities are applied to the underlying features.
This results in a family of layers which can be plugged into existing CNNs for end-to-end training on large datasets, or domain-specific fine-tuning for transfer learning.
B-CNNs outperform existing models, including
those trained with part-level supervision, on a variety of
fine-grained recognition datasets.
Moreover, these models are fairly efficient. Our most accurate model, implemented in MatConvNet [66], runs at 30 frames per second on an
NVIDIA Titan X GPU and obtains 84.1%, 79.4%, 86.9%,
and 91.3% per-image accuracy on the Caltech-UCSD birds [67],
NABirds [64], FGVC aircraft [42], and Stanford cars [33]
datasets, respectively.
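The bilinear layer and pooling layer described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' MatConvNet implementation: the function name `bilinear_pool`, the feature-map sizes, the choice of sum pooling, and the random linear classifier are all hypothetical, and any normalization the full model may apply is omitted.

```python
import numpy as np


def bilinear_pool(feat_a, feat_b):
    """Pooled outer product of two feature maps.

    feat_a: (H, W, M) features from the first network (assumed shape).
    feat_b: (H, W, N) features from the second network (assumed shape).
    Returns a fixed-length descriptor of size M * N, independent of H and W.
    """
    h, w, m = feat_a.shape
    n = feat_b.shape[-1]
    a = feat_a.reshape(h * w, m)
    b = feat_b.reshape(h * w, n)
    # a.T @ b equals the sum over locations of the outer products a_l b_l^T.
    return (a.T @ b).reshape(m * n)


rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 4, 3))  # stand-in for CNN A activations
feat_b = rng.standard_normal((4, 4, 5))  # stand-in for CNN B activations

desc = bilinear_pool(feat_a, feat_b)

# Hypothetical fully-connected layer mapping the descriptor to 10 class scores.
weights = rng.standard_normal((desc.size, 10))
bias = np.zeros(10)
logits = desc @ weights + bias
print(desc.shape, logits.shape)
```

Passing the same array as both arguments gives the simplest case mentioned in the text, where two identical features are combined with an outer product.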