SSD: Single Shot MultiBox Detector
The notes below are just my personal take on the paper!
1. Abstract
The paper is remarkably to the point. The first paragraph wastes no words: the opening sentence announces a new detection model, the second half takes a swipe at Faster R-CNN and YOLO (more accurate than YOLO, faster than Faster R-CNN), and the last sentence simply drops the link to the Caffe code. The method is evaluated on PASCAL VOC, COCO and ILSVRC, and with a few extra tricks it reaches state of the art.
2. Introduction
The introduction keeps comparing against Faster R-CNN and YOLO and states the headline numbers: SSD runs at 59 FPS with 74.3% mAP on the VOC2007 test set, vs. Faster R-CNN at 7 FPS with 73.2% mAP and YOLO at 45 FPS with 63.4% mAP, i.e. state-of-the-art speed together with state-of-the-art mAP at the time. This part of the paper is organised entirely around accuracy and speed.
Contributions:
- SSD is both fast and accurate.
- The core of SSD is to apply small convolutional filters to feature maps and, for a fixed set of default boxes, predict class scores and box offsets.
- To reach high accuracy, default boxes of different scales and aspect ratios are placed on feature maps of different resolutions: small boxes catch small objects, large boxes catch large objects.
- The design remains reliable even on low-resolution feature maps, giving a good trade-off between accuracy and speed.
- SSD reaches state-of-the-art results on the public benchmarks.
3. SSD Overview
Main network architecture (figure taken from the paper):
The input is a 300 × 300, 3-channel image fed through the VGG-16 backbone. The feature map of the Conv4_3 layer is taken out, and several extra convolutional layers are appended. Note the 1×1 convolutions here: a 1×1 convolution has a receptive field of 1, so it leaves the spatial information untouched and only recombines channel information; in the extra layers the author also uses it to reduce the channel dimension. At the first classifier layer, a 3×3 convolution with 4×(Classes+4) output channels is applied (Classes is the number of object categories, 4 stands for the x, y, w, h coordinates, and the leading 4 is the number of default boxes at this layer; this count differs per feature map). The same kind of classifier convolution is attached to Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2, so six feature maps are extracted in total, and everything that follows operates on these six feature maps.
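To make the channel arithmetic concrete, here is a minimal sketch of such a classifier convolution (variable names are mine, not from the repo; note that the reference implementation actually splits this into a separate localization head with k*4 channels and a confidence head with k*num_classes channels, as the multibox() function shown later does):

```python
import torch
import torch.nn as nn

num_classes = 21   # 20 VOC classes + background
k = 4              # default boxes per location on conv4_3

# a single 3x3 conv predicting, at every location, k boxes with (num_classes + 4) values each
classifier = nn.Conv2d(512, k * (num_classes + 4), kernel_size=3, padding=1)

feat = torch.randn(1, 512, 38, 38)   # a conv4_3-like feature map
out = classifier(feat)
print(out.shape)                     # torch.Size([1, 100, 38, 38])
```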
As the figure shows, every pixel of each feature map corresponds to a set of default boxes with different aspect ratios (very similar to the anchors in Faster R-CNN, which will be analysed later). For example, the Conv4_3 layer yields 38 × 38 × 4 default boxes; the number of boxes k per location on the six maps is 4, 6, 6, 6, 4 and 4 respectively, so taking one feature map per layer gives 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732 default boxes in total. Each feature map has its own scale and its own set of aspect ratios, computed with the formulas given below (don't ask why — it is a hand-designed choice).
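A quick sanity check of the 8732 figure:

```python
feature_maps = [38, 19, 10, 5, 3, 1]   # spatial sizes of the 6 feature maps
boxes_per_loc = [4, 6, 6, 6, 4, 4]     # k for each feature map

total = sum(f * f * k for f, k in zip(feature_maps, boxes_per_loc))
print(total)  # 8732
```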
Everything that follows revolves around processing this pile of default boxes, and it is best understood by reading the paper and the code side by side.
4. Training
For the training part we follow the paper together with a reference implementation: SSD source code https://github.com/amdegroot/ssd.pytorch.git, whose code layout is very clean.
Main directory structure:
├── ckpt
│ └── ssd300_mAP_77.43_v2.pth
├── data
│ ├── coco_labels.txt
│ ...
│ ├── __init__.py
│ ├── scripts
│ │ ├── COCO2014.sh
│ │ ├── VOC2007.sh
│ │ └── VOC2012.sh
│ └── voc0712.py
├── demo
│ ├── demo.ipynb
│ ├── __init__.py
│ ├── live.py
│ ├── ...
│ ├── test.py
├── doc
│ ├── detection_example2.png
│ ├── detection_example.png
│ ├── detection_examples.png
│ ├── SSD.jpg
│ └── ssd.png
├── eval.py
├── layers
│ ├── box_utils.py
│ ├── functions
│ │ ├── detection.py
│ │ ├── __init__.py
│ │ ├── prior_box.py
│ │ └── __pycache__
│ │ ├── detection.cpython-35.pyc
│ │ ├── __init__.cpython-35.pyc
│ │ └── prior_box.cpython-35.pyc
│ ├── __init__.py
│ ├── modules
│ │ ├── __init__.py
│ │ ├── l2norm.py
│ │ ├── multibox_loss.py
│ │ └── __pycache__
│ │ ├── __init__.cpython-35.pyc
│ │ ├── l2norm.cpython-35.pyc
│ │ └── multibox_loss.cpython-35.pyc
│ └── __pycache__
│ ├── box_utils.cpython-35.pyc
│ └── __init__.cpython-35.pyc
├── LICENSE
├── __pycache__
│ └── ssd.cpython-35.pyc
├── README.md
├── ssd.py
├── tags
├── test.py
├── train.py
└── utils
├── augmentations.py
└── __init__.py
4.1 Data
The three scripts in data/scripts/ download the datasets, VOC and COCO. COCO is considerably larger, so the experiments below use VOC. The VOC and COCO formats themselves have been described in detail elsewhere, so only a short outline is given here. Taking VOC2007 as an example, the main structure is:
├── Annotations            labels for the detection task, one .xml file per image; file names match the image names
├── ImageSets              contains three subfolders: Layout, Main and Segmentation; Main holds the split files for classification and detection
│   └── Main
│       ├── train.txt      names of the images used for training, 2501 in total
│       ├── val.txt        names of the images used for validation, 2510 in total
│       ├── trainval.txt   the union of train and val, 5011 in total
│       └── test.txt       names of the images used for testing, 4952 in total
├── JPEGImages             the .jpg image files
├── SegmentationClass      images segmented by class
└── SegmentationObject     images segmented by object instance
For the detection task we mainly care about the XML annotation files.
<annotation>
    <folder>VOC2007</folder>
    <filename>000001.jpg</filename>           <!-- file name -->
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>341012865</flickrid>
    </source>
    <owner>
        <flickrid>Fried Camels</flickrid>
        <name>Jinky the Fruit Bat</name>
    </owner>
    <size>                                    <!-- image size, used to normalise the top-left / bottom-right bbox coordinates -->
        <width>353</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>                  <!-- whether the image is used for segmentation -->
    <object>
        <name>dog</name>                      <!-- object class -->
        <pose>Left</pose>                     <!-- viewpoint: front, rear, left, right, unspecified -->
        <truncated>1</truncated>              <!-- whether the object is truncated (partly outside the image) or occluded (>15%) -->
        <difficult>0</difficult>              <!-- detection difficulty, judged from object size, lighting and image quality -->
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>8</xmin>
            <ymin>12</ymin>
            <xmax>352</xmax>
            <ymax>498</ymax>
        </bndbox>
    </object>
</annotation>
Before the data-loading code, look at config.py first, since several of its fields are used below:
data/config.py (a simplified counterpart of ssd_pascal.py):
# dedicated configuration file
# config.py
import os.path

# get the home directory in a cross-platform way
HOME = os.path.expanduser("~")

# colors used for plotting
COLORS = ((255, 0, 0, 128), (0, 255, 0, 128), (0, 0, 255, 128),
          (0, 255, 255, 128), (255, 0, 255, 128), (255, 255, 0, 128))

# per-channel means, subtracted for normalisation
MEANS = (104, 117, 123)

# SSD300 CONFIGS
voc = {
    'num_classes': 21,                          # number of classes (20 + background)
    'lr_steps': (80000, 100000, 120000),        # iterations at which the learning rate is decayed
    'max_iter': 120000,                         # maximum number of iterations
    'feature_maps': [38, 19, 10, 5, 3, 1],      # sizes of the 6 feature maps, as in the SSD figure above
    'min_dim': 300,                             # the paper uses 300x300 and 512x512 inputs
    'steps': [8, 16, 32, 64, 100, 300],         # math.ceil(300 / 38) = 8, etc.; precomputed to map feature-map cells back to the image
    'min_sizes': [30, 60, 111, 162, 213, 264],  # computed from the formula below (ssd_pascal.py)
    'max_sizes': [60, 111, 162, 213, 264, 315], # same as above
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'VOC',
}

coco = {
    'num_classes': 201,
    'lr_steps': (280000, 360000, 400000),
    'max_iter': 400000,
    'feature_maps': [38, 19, 10, 5, 3, 1],
    'min_dim': 300,
    'steps': [8, 16, 32, 64, 100, 300],
    'min_sizes': [21, 45, 99, 153, 207, 261],
    'max_sizes': [45, 99, 153, 207, 261, 315],
    'aspect_ratios': [[2], [2, 3], [2, 3], [2, 3], [2], [2]],
    'variance': [0.1, 0.2],
    'clip': True,
    'name': 'COCO',
}
ssd_pascal.py (cf. https://blog.csdn.net/xunan003/article/details/79186162):
# parameters used to generate the priors (Python 2 code from the Caffe SSD repo)
# minimum dimension of the input image
min_dim = 300

# feature map sizes:
# conv4_3 ==> 38 x 38
# fc7     ==> 19 x 19
# conv6_2 ==> 10 x 10
# conv7_2 ==> 5 x 5
# conv8_2 ==> 3 x 3
# conv9_2 ==> 1 x 1
# layers that produce prior boxes; many SSD variants start by changing this list
mbox_source_layers = ['conv4_3', 'fc7', 'conv6_2', 'conv7_2', 'conv8_2', 'conv9_2']

# in percent %
min_ratio = 20   # this is s_min = 0.2 and s_max = 0.9 from the paper; the loop below turns them into min_sizes / max_sizes
max_ratio = 90
# math.floor() rounds a float down to the nearest integer
step = int(math.floor((max_ratio - min_ratio) / (len(mbox_source_layers) - 2)))  # spacing between consecutive ratios in the loop below; here it equals 17

min_sizes = []   # min_sizes and max_sizes are filled by the loop below
max_sizes = []
for ratio in xrange(min_ratio, max_ratio + 1, step):  # ratio runs from min_ratio up to max_ratio with stride step = 17 (note the xrange semantics)
    min_sizes.append(min_dim * ratio / 100.)
    max_sizes.append(min_dim * (ratio + step) / 100.)
min_sizes = [min_dim * 10 / 100.] + min_sizes
max_sizes = [min_dim * 20 / 100.] + max_sizes

# step from a feature-map location back to the original image: conv4_3 outputs 38x38 for a 300x300 input,
# and 38*8 is roughly 300, so the mapping step is 8 (these values are for 300x300 training images)
steps = [8, 16, 32, 64, 100, 300]
# aspect ratios; six entries for the six prior-box source layers
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
# so the final result is 'min_sizes': [30, 60, 111, 162, 213, 264]
#                    and 'max_sizes': [60, 111, 162, 213, 264, 315]
Final computed values:
| layer   | min_size | max_size |
| ------- | -------- | -------- |
| conv4_3 | 30       | 60       |
| fc7     | 60       | 111      |
| conv6_2 | 111      | 162      |
| conv7_2 | 162      | 213      |
| conv8_2 | 213      | 264      |
| conv9_2 | 264      | 315      |
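The same computation, rewritten as a small Python 3 sketch (the Caffe script above is Python 2), so the table can be reproduced directly:

```python
import math

min_dim = 300
num_source_layers = 6
min_ratio, max_ratio = 20, 90   # s_min = 0.2, s_max = 0.9 from the paper

step = int(math.floor((max_ratio - min_ratio) / (num_source_layers - 2)))  # 17
min_sizes, max_sizes = [], []
for ratio in range(min_ratio, max_ratio + 1, step):
    min_sizes.append(min_dim * ratio / 100.0)
    max_sizes.append(min_dim * (ratio + step) / 100.0)
min_sizes = [min_dim * 10 / 100.0] + min_sizes   # prepend the special conv4_3 scale
max_sizes = [min_dim * 20 / 100.0] + max_sizes

print(min_sizes)  # [30.0, 60.0, 111.0, 162.0, 213.0, 264.0]
print(max_sizes)  # [60.0, 111.0, 162.0, 213.0, 264.0, 315.0]
```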
Aspect ratios (ar):
In the paper, $$ar = \{1, 2, 3, 1/2, 1/3\}$$; in the code this becomes aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]], which means the same thing: an entry 2 means ar takes the values 1, 2 and 1/2, and an entry 3 means ar takes 1, 2, 3, 1/2 and 1/3. Each layer has its own min_size and max_size from the table above. With ar = 1 alone, every location on that feature map gets two square default boxes, one of side $$min\_size$$ and one of side $$\sqrt{min\_size \cdot max\_size}$$; every additional aspect ratio then adds two rectangular default boxes.
These rectangles have width $$min\_size / \sqrt{ar}$$ and height $$min\_size \cdot \sqrt{ar}$$; rotating such a box by 90° (i.e. swapping width and height) gives the second rectangular default box. A sketch is shown below:
The figure makes this intuitive: these are the 4 default boxes with ar = 1, 2, 1/2; the 6-box case works the same way.
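To make the box generation explicit, here is a small sketch (my own helper, not from the repo) that lists the relative (w, h) pairs produced at one feature-map location, mirroring what PriorBox does below:

```python
from math import sqrt

image_size = 300

def box_shapes(min_size, max_size, ars):
    """Relative (w, h) of every default box at one feature-map location."""
    s_k = min_size / image_size
    s_k_prime = sqrt(s_k * (max_size / image_size))
    shapes = [(s_k, s_k), (s_k_prime, s_k_prime)]        # the two ar = 1 squares
    for ar in ars:                                       # each extra ratio adds two rectangles
        shapes += [(s_k * sqrt(ar), s_k / sqrt(ar)),
                   (s_k / sqrt(ar), s_k * sqrt(ar))]
    return shapes

print(len(box_shapes(30, 60, [2])))      # 4 default boxes on conv4_3
print(len(box_shapes(60, 111, [2, 3])))  # 6 default boxes on fc7
```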
For the default boxes themselves, the relevant code is the PriorBox class in prior_box.py; a prior box is exactly the default box of the paper. It is discussed together with the network structure below.
For data loading as a whole, the main file is data/voc0712.py (the data-handling part):
"""VOC Dataset Classes Original author: Francisco Massa https://github.com/fmassa/vision/blob/voc_dataset/torchvision/datasets/voc.py Updated by: Ellis Brown, Max deGroot """ from .config import HOME import os.path as osp import sys import torch import torch.utils.data as data import cv2 import numpy as np if sys.version_info[0] == 2: import xml.etree.cElementTree as ET else: import xml.etree.ElementTree as ET VOC_CLASSES = ( # always index 0 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor') # note: if you used our download scripts, this should be right VOC_ROOT = osp.join(HOME, "data/VOCdevkit/") class VOCAnnotationTransform(object): """Transforms a VOC annotation into a Tensor of bbox coords and label index Initilized with a dictionary lookup of classnames to indexes Arguments: class_to_ind (dict, optional): dictionary lookup of classnames -> indexes (default: alphabetic indexing of VOC's 20 classes) keep_difficult (bool, optional): keep difficult instances or not (default: False) height (int): height width (int): width """ def __init__(self, class_to_ind=None, keep_difficult=False): self.class_to_ind = class_to_ind or dict( zip(VOC_CLASSES, range(len(VOC_CLASSES)))) #{'person': 14, 'pottedplant': 15, 'dog': 11, 'aeroplane': 0, 'chair': 8, 'horse': 12, 'diningtable': 10, ... 'bicycle': 1} self.keep_difficult = keep_difficult def __call__(self, target, width, height): """ Arguments: target (annotation) : the target annotation to be made usable will be an ET.Element Returns: a list containing lists of bounding boxes [bbox coords, class name] """ res = [] for obj in target.iter('object'): difficult = int(obj.find('difficult').text) == 1 if not self.keep_difficult and difficult: continue name = obj.find('name').text.lower().strip() bbox = obj.find('bndbox') pts = ['xmin', 'ymin', 'xmax', 'ymax'] bndbox = [] for i, pt in enumerate(pts): cur_pt = int(bbox.find(pt).text) - 1 # scale height or width cur_pt = cur_pt / width if i % 2 == 0 else cur_pt / height bndbox.append(cur_pt) label_idx = self.class_to_ind[name] bndbox.append(label_idx) res += [bndbox] # [xmin, ymin, xmax, ymax, label_ind] # img_id = target.find('filename').text[:-4] return res # [[xmin, ymin, xmax, ymax, label_ind], ... ] class VOCDetection(data.Dataset): """VOC Detection Dataset Object input is image, target is annotation Arguments: root (string): filepath to VOCdevkit folder. image_set (string): imageset to use (eg. 
'train', 'val', 'test') transform (callable, optional): transformation to perform on the input image target_transform (callable, optional): transformation to perform on the target `annotation` (eg: take in caption string, return tensor of word indices) dataset_name (string, optional): which dataset to load (default: 'VOC2007') """ def __init__(self, root, image_sets=[('2007', 'trainval'), ('2012', 'trainval')], transform=None, target_transform=VOCAnnotationTransform(), dataset_name='VOC0712'): self.root = root self.image_set = image_sets self.transform = transform self.target_transform = target_transform self.name = dataset_name self._annopath = osp.join('%s', 'Annotations', '%s.xml') self._imgpath = osp.join('%s', 'JPEGImages', '%s.jpg') self.ids = list() for (year, name) in image_sets: rootpath = osp.join(self.root, 'VOC' + year) for line in open(osp.join(rootpath, 'ImageSets', 'Main', name + '.txt')): self.ids.append((rootpath, line.strip())) def __getitem__(self, index): im, gt, h, w = self.pull_item(index) return im, gt def __len__(self): return len(self.ids) def pull_item(self, index): img_id = self.ids[index] target = ET.parse(self._annopath % img_id).getroot() img = cv2.imread(self._imgpath % img_id) height, width, channels = img.shape if self.target_transform is not None: target = self.target_transform(target, width, height) if self.transform is not None: target = np.array(target) img, boxes, labels = self.transform(img, target[:, :4], target[:, 4]) # to rgb img = img[:, :, (2, 1, 0)] # img = img.transpose(2, 0, 1) target = np.hstack((boxes, np.expand_dims(labels, axis=1))) return torch.from_numpy(img).permute(2, 0, 1), target, height, width # return torch.from_numpy(img), target, height, width def pull_image(self, index): '''Returns the original image object at index in PIL form Note: not using self.__getitem__(), as any transformations passed in could mess up this functionality. Argument: index (int): index of img to show Return: PIL img ''' img_id = self.ids[index] return cv2.imread(self._imgpath % img_id, cv2.IMREAD_COLOR) def pull_anno(self, index): '''Returns the original annotation of image at index Note: not using self.__getitem__(), as any transformations passed in could mess up this functionality. Argument: index (int): index of img to get annotation of Return: list: [img_id, [(label, bbox coords),...]] eg: ('001718', [('dog', (96, 13, 438, 332))]) ''' img_id = self.ids[index] anno = ET.parse(self._annopath % img_id).getroot() gt = self.target_transform(anno, 1, 1) return img_id[1], gt def pull_tensor(self, index): '''Returns the original image at an index in tensor form Note: not using self.__getitem__(), as any transformations passed in could mess up this functionality. Argument: index (int): index of img to show Return: tensorized version of img, squeezed ''' return torch.Tensor(self.pull_image(index)).unsqueeze_(0)
4.2 SSD Network Architecture
backbone (VGG)
The network uses VGG as the base backbone; other backbones, such as SqueezeNet, can be substituted to build a smaller model.
#VGG_BASE
#This function is derived from torchvision's VGG make_layers() (essentially lifted from torchvision)
#cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M', 512, 512, 512]
def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5,
               conv6, nn.ReLU(inplace=True),
               conv7, nn.ReLU(inplace=True)]
    return layers
Extra layers (for feature scaling)
#cfg: [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256]
def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling
    layers = []
    in_channels = i
    flag = False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                                     kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers
Multibox
#The multibox function builds, for every feature map, the convolutions that predict default-box locations and class scores
#cfg: [4, 6, 6, 6, 4, 4]
#indices of the selected feature maps inside the two layer lists (vgg and add_extras):
#vgg: 21, -2
#the end of conv4_3 (before its ReLU); the 1x1 conv of conv7 (before its ReLU)
#add_extras: 1, 3, 5, 7
#the ends of conv8_2, conv9_2, conv10_2 and conv11_2
#cfg = [4, 6, 6, 6, 4, 4]  # number of default boxes per location on each feature map
def multibox(vgg, extra_layers, cfg, num_classes):
    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                                  cfg[k] * num_classes, kernel_size=3, padding=1)]
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k] * 4,
                                 kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k] * num_classes,
                                  kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)
The overall network is assembled in ssd.py:
class SSD(nn.Module): def __init__(self, phase, size, base, extras, head, num_classes): #phase:"train"/"test"; size:输入图像尺寸,300; #base, extras, head:分别为上文中三个函数的输出 super(SSD, self).__init__() self.phase = phase self.num_classes = num_classes self.cfg = (coco, voc)[num_classes == 21] self.priorbox = PriorBox(self.cfg) #默认框的获取,将在其他博客中分析 self.priors = Variable(self.priorbox.forward(), volatile=True)#0.4.1之后,取消了Variable,都是Tensor,这句就是便是不参与更新的Tensor self.size = size # SSD network self.vgg = nn.ModuleList(base) # Layer learns to scale the l2 normalized features from conv4_3 self.L2Norm = L2Norm(512, 20) self.extras = nn.ModuleList(extras) self.loc = nn.ModuleList(head[0]) self.conf = nn.ModuleList(head[1]) if phase == 'test': self.softmax = nn.Softmax(dim=-1) self.detect = Detect(num_classes, 0, 200, 0.01, 0.45) def forward(): sources = list() #6张特征图 loc = list() #所有默认框的位置预测结果,列表中一个元素对应一张特征图 conf = list() #所有默认框的分类预测结果,列表中一个元素对应一张特征图 # 前向传播vgg至conv4_3 relu 得到第1个特征图 for k in range(23): x = self.vgg[k](x) s = self.L2Norm(x) sources.append(s) # 继续前向传播vgg至fc7得到第2个特征图 for k in range(23, len(self.vgg)): x = self.vgg[k](x) sources.append(x) # 在extra layers中前向传播得到另外4个特征图 for k, v in enumerate(self.extras): x = F.relu(v(x), inplace=True) if k % 2 == 1: sources.append(x) # 将各个特征图中的定位和分类预测结果append进列表中 for (x, l, c) in zip(sources, self.loc, self.conf): loc.append(l(x).permute(0, 2, 3, 1).contiguous()) #6*(N,C,H,W)->6*(N,H,W,C) C=k*4 conf.append(c(x).permute(0, 2, 3, 1).contiguous()) #6*(N,C,H,W)->6*(N,H,W,C) C=k*num_class loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1) #[N,-1] conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1) #[N,-1] if self.phase == "test": #如果是测试阶段需要对定位和分类的预测结果进行分析得到最终的预测框 output = self.detect( loc.view(loc.size(0), -1, 4), # loc preds ->[N,num_priors,4] self.softmax(conf.view(conf.size(0), -1, self.num_classes)), # conf preds [N,num_priors,num_classes] 最后一维softmax self.priors.type(type(x.data)) # default boxes [num_priors,4] 4:[cx,cy,w,h] ) #output: [N,num_classes,num_remain*5] else: #如果是训练阶段则直接输出定位和分类预测结果以计算损失函数 output = ( loc.view(loc.size(0), -1, 4), #[N,num_priors,4] conf.view(conf.size(0), -1, self.num_classes), #[N,num_priors,num_classes] self.priors #[num_priors,4] ) return output
prior_box.py
class PriorBox(object): """Compute priorbox coordinates in center-offset form for each source feature map. """ def __init__(self, cfg): super(PriorBox, self).__init__() self.image_size = cfg['min_dim'] # number of priors for feature map location (either 4 or 6) self.num_priors = len(cfg['aspect_ratios']) self.variance = cfg['variance'] or [0.1] #用于后续回归计算的权重值 self.feature_maps = cfg['feature_maps'] self.min_sizes = cfg['min_sizes'] self.max_sizes = cfg['max_sizes'] self.steps = cfg['steps'] self.aspect_ratios = cfg['aspect_ratios'] self.clip = cfg['clip'] self.version = cfg['name'] for v in self.variance: if v <= 0: raise ValueError('Variances must be greater than 0') def forward(self): """需要用到的参数: min_dim = 300 "输入图最短边的尺寸" feature_maps = [38, 19, 10, 5, 3, 1] steps = [8, 16, 32, 64, 100, 300] "共有6个特征图: feature_maps指的是在某一层特征图中,遍历一行/列需要的步数 steps指特征图中两像素点相距n则在原图中相距steps[k]*n 由于steps由于网络结构所以为固定,所以作者应该是由300/steps[k]得到 feature_maps" min_sizes = [30, 60, 111, 162, 213, 264] max_sizes = [60, 111, 162, 213, 264, 315] "min_sizes和max_sizes共同使用为用于计算aspect_ratios=1时 rel size: sqrt(s_k * s_(k+1))时所用" aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]] "各层除1以外的aspect_ratios,可以看出是各不相同的, 这样每层特征图的每个像素点分别有[4,6,6,6,4,4]个default boxes 作者也在原文中提到这个可以根据自己的场景适当调整" """ mean = [] #对于每一个特征图生成box for k, f in enumerate(feature_maps): #对特定特征图的每一个像素点生成适当数量的default boxes for i, j in product(range(f), repeat=2): f_k = image_size / steps[k] #f_k 是第k个特征图的大小 """每个default box的中心点,从论文以及代码复现可知0<cx,cy<1 即对应于原图的一个比例""" cx = (j + 0.5) / f_k cy = (i + 0.5) / f_k #第一种情形: # aspect_ratio: 1 # rel size: min_size s_k = min_sizes[k]/image_size mean += [cx, cy, s_k, s_k] #第二种情形: # aspect_ratio: 1 # rel size: sqrt(s_k * s_(k+1)) s_k_prime = sqrt(s_k * (max_sizes[k]/image_size)) mean += [cx, cy, s_k_prime, s_k_prime] # 剩余情形 for ar in aspect_ratios[k]: mean += [cx, cy, s_k*sqrt(ar), s_k/sqrt(ar)] mean += [cx, cy, s_k/sqrt(ar), s_k*sqrt(ar)] # back to torch land output = torch.Tensor(mean).view(-1, 4) #[num_priors,4] (cx,cy,w,h)
Every anchor-based detector needs NMS (non-maximum suppression) to filter out overlapping boxes:
def nms(boxes, scores, overlap=0.7, top_k=200): """ 输入: boxes: 存储一个图片的所有预测框。[num_positive,4]. scores:置信度。如果为多分类则需要将nms函数套在一个循环内。[num_positive]. overlap: nms抑制时iou的阈值. top_k: 先选取置信度前top_k个框再进行nms. 返回: nms后剩余预测框的索引. """ keep = scores.new(scores.size(0)).zero_().long() # 保存留下来的box的索引 [num_positive] # 函数new(): 构建一个有相同数据类型的tensor #如果输入box为空则返回空Tensor if boxes.numel() == 0: return keep x1 = boxes[:, 0] y1 = boxes[:, 1] x2 = boxes[:, 2] y2 = boxes[:, 3] area = torch.mul(x2 - x1, y2 - y1) #并行化计算所有框的面积 v, idx = scores.sort(0) # 升序排序 idx = idx[-top_k:] # 前top-k的索引,从小到大 xx1 = boxes.new() yy1 = boxes.new() xx2 = boxes.new() yy2 = boxes.new() w = boxes.new() h = boxes.new() count = 0 while idx.numel() > 0: i = idx[-1] # 目前最大score对应的索引 keep[count] = i #存储在keep中 count += 1 if idx.size(0) == 1: #跳出循环条件:box被筛选完了 break idx = idx[:-1] # 去掉最后一个 #剩下boxes的信息 torch.index_select(x1, 0, idx, out=xx1) torch.index_select(y1, 0, idx, out=yy1) torch.index_select(x2, 0, idx, out=xx2) torch.index_select(y2, 0, idx, out=yy2) # 计算当前最大置信框与其他剩余框的交集,作者这段代码写的不好,容易误导 xx1 = torch.clamp(xx1, min=x1[i]) #max(x1,xx1) yy1 = torch.clamp(yy1, min=y1[i]) #max(y1,yy1) xx2 = torch.clamp(xx2, max=x2[i]) #min(x2,xx2) yy2 = torch.clamp(yy2, max=y2[i]) #min(y2,yy2) w.resize_as_(xx2) h.resize_as_(yy2) w = xx2 - xx1 #w=min(x2,xx2)−max(x1,xx1) h = yy2 - yy1 #h=min(y2,yy2)−max(y1,yy1) w = torch.clamp(w, min=0.0) #max(w,0) h = torch.clamp(h, min=0.0) #max(h,0) inter = w*h #计算当前最大置信框与其他剩余框的IOU # IoU = i / (area(a) + area(b) - i) rem_areas = torch.index_select(area, 0, idx) # 剩余的框的面积 union = rem_areas + area[i]- inter #并集 IoU = inter/union # 计算iou # 选出IoU <= overlap的boxes(注意le函数的使用) idx = idx[IoU.le(overlap)] return keep, count#[num_remain], num_remain
4.3 Training Strategy
4.3.1 Matching rule: selecting the positive samples
def match(threshold, truths, priors, variances, labels, loc_t, conf_t, idx): """ 输入: threshold:匹配boxes的阈值. truths: Ground truth boxes [num_objects,4] priors: Prior boxes from priorbox layers, [num_priors,4]. variances: bbox回归时需要用到的参数,[num_priors, 4]. labels: Ground truth boxes的类别标签, [num_objects,1]. loc_t: 存储匹配后各default boxes的offset信息 [batch, num_priors, 4] conf_t: 存储匹配后各default boxes的真实类别标记 [batch, num_priors] idx: (int) current batch index 返回: 函数本身不返回值,但它会把匹配框的位置和置信信息存储在loc_t, conf_t两个tensor中。 """ overlaps = jaccard( #[num_objects,num_priors] truth,defaults truths, point_form(priors) ) # 互相匹配 # [num_objects,1]每个真实框对应的默认框 best_prior_overlap, best_prior_idx = overlaps.max(1, keepdim=True) # [1,num_priors]每个默认框对应的真实框 best_truth_overlap, best_truth_idx = overlaps.max(0, keepdim=True) best_truth_idx.squeeze_(0) best_truth_overlap.squeeze_(0) #[num_priors] best_prior_idx.squeeze_(1) best_prior_overlap.squeeze_(1) #[num_objects] #各个best_prior的best_truth_overlap都修改为2,确保匹配的框不会因为阈值太低被过滤掉 best_truth_overlap.index_fill_(0, best_prior_idx, 2) #每一个真实框覆盖其匹配到的默认框匹配的真实框 for j in range(best_prior_idx.size(0)): best_truth_idx[best_prior_idx[j]] = j matches = truths[best_truth_idx] #[num_priors,4] 对每一个默认框都匹配一个真实框 conf = labels[best_truth_idx] + 1 #[num_priors] 每个默认框匹配到的真实框的类 conf[best_truth_overlap < threshold] = 0 #如果匹配的iou小于阈值则定为背景 loc = encode(matches, priors, variances) #编码,用于训练阶段,生成matched和默认框之间的offset,后续在进行整理 loc_t[idx] = loc # [num_priors,4] encoded offsets to learn conf_t[idx] = conf # [num_priors] top class label for each prior
Loss function:
\begin{equation}
L(x, c, l, g)=\frac{1}{N}\left(L_{conf}(x, c)+\alpha L_{loc}(x, l, g)\right)
\end{equation}
where l denotes the predicted boxes, N the number of matched default boxes, and g the ground truth.
The localization loss is the same as in Faster R-CNN:
\begin{equation}
L_{loc}(x, l, g)=\sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \operatorname{smooth}_{L1}\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)
\end{equation}
\begin{equation}
\hat{g}_{j}^{cx}=\left(g_{j}^{cx}-d_{i}^{cx}\right) / d_{i}^{w} \qquad \hat{g}_{j}^{cy}=\left(g_{j}^{cy}-d_{i}^{cy}\right) / d_{i}^{h}
\end{equation}
\begin{equation}
\hat{g}_{j}^{w}=\log\left(\frac{g_{j}^{w}}{d_{i}^{w}}\right) \qquad \hat{g}_{j}^{h}=\log\left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)
\end{equation}
Only positive samples contribute to the localization loss.
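The $\hat{g}$ targets above are what the repo's encode() in layers/box_utils.py computes; here is a minimal sketch of the idea, with the variance scaling left out for clarity and boxes assumed to be in (cx, cy, w, h) form:

```python
import torch

def encode_offsets(gt, priors):
    """gt, priors: [num_priors, 4] tensors in (cx, cy, w, h) form.
    Returns the regression targets g_hat from the equations above;
    the repo's encode() additionally divides by the 'variance' values."""
    g_cxcy = (gt[:, :2] - priors[:, :2]) / priors[:, 2:]   # (g_cx - d_cx)/d_w, (g_cy - d_cy)/d_h
    g_wh = torch.log(gt[:, 2:] / priors[:, 2:])            # log(g_w/d_w), log(g_h/d_h)
    return torch.cat([g_cxcy, g_wh], dim=1)                # [num_priors, 4]
```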
smooth L1 is defined as:
\begin{equation}
\operatorname{smooth}_{L_1}(x)=\begin{cases} 0.5 x^{2} & \text{if } |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}
\end{equation}
(Interview material: why use the smooth L1 loss? Answer: for large |x| the gradient is ±1, so outliers cannot produce huge gradients during training, while near 0 the gradient is x, which shrinks the updates and makes training smoother.)
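The piecewise definition written out as a tiny sketch (PyTorch also provides it directly as F.smooth_l1_loss / nn.SmoothL1Loss):

```python
import torch

def smooth_l1(x):
    """Elementwise smooth L1: 0.5*x^2 if |x| < 1, |x| - 0.5 otherwise."""
    abs_x = x.abs()
    return torch.where(abs_x < 1, 0.5 * x ** 2, abs_x - 0.5)

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
print(smooth_l1(x))  # tensor([2.5000, 0.1250, 0.0000, 0.1250, 2.5000])
```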
The classification loss takes both positive and negative samples into account:
\begin{equation}
L_{conf}(x, c)=-\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right)-\sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right) \quad \text{where} \quad \hat{c}_{i}^{p}=\frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}
\end{equation}
where:
\begin{equation}
x_{ij}^{p} \in \{1, 0\}
\end{equation}
A value of 1 means default box i is matched to ground-truth box j of category p, and 0 marks a negative; the classification loss is the cross entropy over these confidences.
Hard negative example mining
During training the default boxes are heavily imbalanced: negatives vastly outnumber positives, so if all of them were used they would dominate the loss and training would suffer; the negative contribution to the loss has to be reduced. The negatives are therefore sorted by their confidence loss and only the hardest ones are kept, so that the positive-to-negative ratio is at most pos:neg = 1:3.
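A minimal single-image sketch of that selection (the real implementation lives in layers/modules/multibox_loss.py and operates on batched tensors):

```python
import torch

def hard_negative_mask(conf_loss, pos_mask, neg_pos_ratio=3):
    """conf_loss: [num_priors] per-box confidence loss;
    pos_mask:  [num_priors] bool, True for matched (positive) boxes.
    Returns a bool mask selecting the hardest negatives, at most 3x the positives."""
    loss = conf_loss.clone()
    loss[pos_mask] = 0                           # rank only the negatives
    num_neg = min(neg_pos_ratio * int(pos_mask.sum()), int((~pos_mask).sum()))
    _, idx = loss.sort(descending=True)          # hardest negatives first
    neg_mask = torch.zeros_like(pos_mask)
    neg_mask[idx[:num_neg]] = True
    return neg_mask
```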
4.4 Training Code
#-*- coding:utf-8 -*- from data import * #data是一个目录,主要引入的是data中的__init__.py。其中记录了coco以及voc数据集处理的操作 from utils.augmentations import SSDAugmentation #记录了ssd训练过程中的一系列数据增强的工作 from layers.modules import MultiBoxLoss #MultiBoxLoss from ssd import build_ssd #如何进行ssd网络的初始化 import os import sys import time import torch from torch.autograd import Variable #0.4.1之后向后兼容 import torch.nn as nn import torch.optim as optim import torch.backends.cudnn as cudnn import torch.nn.init as init import torch.utils.data as data import numpy as np import argparse #将str类型转换成bool类型 def str2bool(v): return v.lower() in ("yes", "true", "t", "1") parser = argparse.ArgumentParser( description='Single Shot MultiBox Detector Training With Pytorch') train_set = parser.add_mutually_exclusive_group() #选择性的定义数据集,VOC or COCO parser.add_argument('--dataset', default='VOC', choices=['VOC', 'COCO'], type=str, help='VOC or COCO') parser.add_argument('--dataset_root', default=VOC_ROOT, help='Dataset root directory path') parser.add_argument('--basenet', default='vgg16_reducedfc.pth', help='Pretrained base model') parser.add_argument('--batch_size', default=32, type=int, help='Batch size for training') parser.add_argument('--resume', default=None, type=str, help='Checkpoint state_dict file to resume training from') parser.add_argument('--start_iter', default=0, type=int, help='Resume training at this iter') parser.add_argument('--num_workers', default=4, type=int, help='Number of workers used in dataloading') parser.add_argument('--cuda', default=True, type=str2bool, help='Use CUDA to train model') parser.add_argument('--lr', '--learning-rate', default=1e-3, type=float, help='initial learning rate') parser.add_argument('--momentum', default=0.9, type=float, help='Momentum value for optim') #学习率衰减 parser.add_argument('--weight_decay', default=5e-4, type=float, help='Weight decay for SGD') parser.add_argument('--gamma', default=0.1, type=float, help='Gamma update for SGD') #可视化的操作 parser.add_argument('--visdom', default=False, type=str2bool, help='Use visdom for loss visualization') parser.add_argument('--save_folder', default='weights/', help='Directory for saving checkpoint models') args = parser.parse_args() #定义网络中tensor的类型 if torch.cuda.is_available(): if args.cuda: torch.set_default_tensor_type('torch.cuda.FloatTensor') if not args.cuda: print("WARNING: It looks like you have a CUDA device, but aren't " + "using CUDA. 
Run with --cuda for optimal training speed.") torch.set_default_tensor_type('torch.FloatTensor') else: torch.set_default_tensor_type('torch.FloatTensor') if not os.path.exists(args.save_folder): os.mkdir(args.save_folder) def train(): #进行数据集的选择 if args.dataset == 'COCO': if args.dataset_root == VOC_ROOT: if not os.path.exists(COCO_ROOT): parser.error('Must specify dataset_root if specifying dataset') print("WARNING: Using default COCO dataset_root because " + "--dataset_root was not specified.") args.dataset_root = COCO_ROOT cfg = coco dataset = COCODetection(root=args.dataset_root, transform=SSDAugmentation(cfg['min_dim'], MEANS)) #MEANS = (104, 117, 123) elif args.dataset == 'VOC': if args.dataset_root == COCO_ROOT: parser.error('Must specify dataset if specifying dataset_root') cfg = voc #在utils中的augmentations.py中有相対應的及其複雜的數據增強的工作 dataset = VOCDetection(root=args.dataset_root, transform=SSDAugmentation(cfg['min_dim'], MEANS)) if args.visdom: import visdom viz = visdom.Visdom() #用于ssd网络的初始化 ssd_net = build_ssd('train', cfg['min_dim'], cfg['num_classes']) net = ssd_net #多卡并行计算 if args.cuda: net = torch.nn.DataParallel(ssd_net) cudnn.benchmark = True #进行初始化以及将网络放到gpu上 if args.resume: print('Resuming training, loading {}...'.format(args.resume)) ssd_net.load_weights(args.resume) else: vgg_weights = torch.load(args.save_folder + args.basenet) print('Loading base network...') ssd_net.vgg.load_state_dict(vgg_weights) if args.cuda: net = net.cuda() if not args.resume: print('Initializing weights...') # initialize newly added layers' weights with xavier method ssd_net.extras.apply(weights_init) ssd_net.loc.apply(weights_init) ssd_net.conf.apply(weights_init) optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay) #计算loss,这部分的代码需要仔细的看下,里面涉及到很多不太好明白的代码,可以使用print进行调试下,还包括了难样本挖掘的操作。
criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5, False, args.cuda) #表示訓練的狀態 net.train() # loss counters loc_loss = 0 conf_loss = 0 epoch = 0 print('Loading the dataset...') epoch_size = len(dataset) // args.batch_size print('Training SSD on:', dataset.name) print('Using the specified args:') print(args) step_index = 0 if args.visdom: vis_title = 'SSD.PyTorch on ' + dataset.name vis_legend = ['Loc Loss', 'Conf Loss', 'Total Loss'] iter_plot = create_vis_plot('Iteration', 'Loss', vis_title, vis_legend) epoch_plot = create_vis_plot('Epoch', 'Loss', vis_title, vis_legend) #读取数据集 data_loader = data.DataLoader(dataset, args.batch_size, num_workers=args.num_workers, shuffle=True, collate_fn=detection_collate, pin_memory=True) # create batch iterator batch_iterator = iter(data_loader) for iteration in range(args.start_iter, cfg['max_iter']): if args.visdom and iteration != 0 and (iteration % epoch_size == 0): update_vis_plot(epoch, loc_loss, conf_loss, epoch_plot, None, 'append', epoch_size) # reset epoch loss counters loc_loss = 0 conf_loss = 0 epoch += 1 #学习率衰减 if iteration in cfg['lr_steps']: step_index += 1 adjust_learning_rate(optimizer, args.gamma, step_index) ####warning,进行修改########## # load train data try: images, targets = next(batch_iterator) except StopIteration: batch_iterator = iter(data_loader) images, targets = next(batch_iterator) ##否则会出现loss下降很快但结果不好 if args.cuda: images = Variable(images.cuda()) with torch.no_grad(): target = [variable(ann.cuda()) for ann in targets] #targets = [Variable(ann.cuda(), volatile=True) for ann in targets] else: images = Variable(images) with torch.no_grad(): targets = [Variable(ann) for ann in targets] #targets = [Variable(ann, volatile=True) for ann in targets] # forward t0 = time.time() #out包括三個部分 #out[0] is loc==>: size is [batch_size, 8732, 4] #out[1] is conf=>: size is [batch_size, 8732, 21] #out[2] is priors=>: size is [8732, 4] out = net(images) # backprop optimizer.zero_grad() loss_l, loss_c = criterion(out, targets) loss = loss_l + loss_c loss.backward() optimizer.step() t1 = time.time() loc_loss += loss_l.item() conf_loss += loss_c.item() if iteration % 10 == 0: print('timer: %.4f sec.' 
% (t1 - t0)) print('iter ' + repr(iteration) + ' || Loss: %.4f ||' % (loss.item()), end=' ') if args.visdom: update_vis_plot(iteration, loss_l.item(), loss_c.item(), iter_plot, epoch_plot, 'append') if iteration != 0 and iteration % 5000 == 0: print('Saving state, iter:', iteration) torch.save(ssd_net.state_dict(), 'weights/ssd300_COCO_' + repr(iteration) + '.pth') torch.save(ssd_net.state_dict(), args.save_folder + '' + args.dataset + '.pth') #学习率衰减 def adjust_learning_rate(optimizer, gamma, step): """Sets the learning rate to the initial LR decayed by 10 at every specified step # Adapted from PyTorch Imagenet example: # https://github.com/pytorch/examples/blob/master/imagenet/main.py """ lr = args.lr * (gamma ** (step)) for param_group in optimizer.param_groups: param_group['lr'] = lr def xavier(param): init.xavier_uniform(param) def weights_init(m): if isinstance(m, nn.Conv2d): xavier(m.weight.data) m.bias.data.zero_() def create_vis_plot(_xlabel, _ylabel, _title, _legend): return viz.line( X=torch.zeros((1,)).cpu(), Y=torch.zeros((1, 3)).cpu(), opts=dict( xlabel=_xlabel, ylabel=_ylabel, title=_title, legend=_legend ) ) def update_vis_plot(iteration, loc, conf, window1, window2, update_type, epoch_size=1): viz.line( X=torch.ones((1, 3)).cpu() * iteration, Y=torch.Tensor([loc, conf, loc + conf]).unsqueeze(0).cpu() / epoch_size, win=window1, update=update_type ) # initialize epoch plot on first iteration if iteration == 0: viz.line( X=torch.zeros((1, 3)).cpu(), Y=torch.Tensor([loc, conf, loc + conf]).unsqueeze(0).cpu(), win=window2, update=True ) if __name__ == '__main__': train()
# For the finer details, step into the corresponding files and debug with print. To wrap up, a summary of SSD (worth memorising for interviews):
1. SSD applies an L2Norm layer to conv4_3, which normalizes each pixel across the channel dimension, unlike BN, which normalizes over [batch_size, height, width]. Six feature maps are then used for detection.
2. The network output has two parts, class confidences and box locations, both produced by a single pass of 3x3 convolutions over the 6 feature maps. If k is the number of default boxes per location on a feature map, the classification branch needs k*c output channels (c = number of object classes + 1 for background) and the localization branch needs k*4. SSD300 predicts 8732 boxes in total.
3. SSD300 runs in real time.
4. Matching default boxes to ground truth is done in two steps: first, each GT is matched to the default box with the highest IoU; then every default box whose IoU with some GT exceeds 0.5 is also marked positive. Positives are still scarce, so SSD uses hard negative mining: negatives are sorted by confidence loss in descending order and only the top-k hardest are kept, keeping the positive-to-negative ratio at 1:3.
5. Inference is straightforward: for each predicted box, take the class with the highest confidence as its label and score, and discard boxes assigned to the background. Then drop boxes below a confidence threshold (e.g. 0.5), decode the survivors against their default boxes (usually followed by a clip so boxes stay inside the image), sort by confidence and keep only the top-k (e.g. 400). Finally run NMS to remove heavily overlapping boxes; whatever remains is the detection result (see the decoding sketch below).
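A minimal sketch of the decoding step from point 5, i.e. the inverse of the offset encoding shown earlier (the repo's decode() in layers/box_utils.py additionally applies the variance terms):

```python
import torch

def decode_offsets(loc, priors):
    """loc: predicted offsets [num_priors, 4]; priors: default boxes in (cx, cy, w, h).
    Inverts the offset encoding (variance scaling omitted) and returns
    (xmin, ymin, xmax, ymax) boxes clipped to the image."""
    cxcy = priors[:, :2] + loc[:, :2] * priors[:, 2:]   # cx = d_cx + l_cx * d_w, ...
    wh = priors[:, 2:] * torch.exp(loc[:, 2:])          # w = d_w * exp(l_w), ...
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)
    return boxes.clamp(0, 1)                            # clip to the [0, 1] image range
```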