
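YOLO V2 Study Notes

Preface

YOLO has two upgraded versions: YOLOv2 and YOLO9000. The authors applied a series of optimizations to the YOLO model structure to produce YOLOv2, which was state of the art in accuracy at the time while remaining fast. They then proposed a method for jointly training on classification and detection data: WordTree is used to merge detection and classification datasets, and training on COCO and ImageNet simultaneously yields YOLO9000, which detects more than 9000 object categories in real time.

YOLO V2 was proposed by the original authors as an improvement on V1, and the title of the paper already summarizes the work:

Better: compared with YOLO, YOLO V2 is more accurate.
Faster: the network structure was modified so that detection runs faster.
Stronger: this refers to YOLO9000. A joint training method uses object detection and image classification datasets together to train YOLO V2, and the resulting model can recognize more than 9000 object categories in real time, hence the name YOLO9000.

Following the structure of the original paper, this post walks through YOLO V2 under the same three headings: Better, Faster, and Stronger.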

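I. Better

Building on YOLO V1, the authors propose a number of improvements to raise detection performance (mAP). They fall into three groups: changes to the network structure (items 1, 3, 5, 6), the introduction of anchor boxes (items 3, 4, 5), and changes to the training procedure (items 2, 7).

1.1 Batch Normalization (BN)

Batch Normalization speeds up convergence and provides some regularization. The authors add a BN layer after every conv layer and remove the dropout from the original model; experiments show this improves mAP by about 2%.

A BN layer applies two transformations. (1) Each feature of the batch (for an intermediate layer, each neuron) is normalized separately so that its distribution has mean 0 and variance 1, which gives every batch a similar distribution at every layer. This step introduces no extra parameters. (2) The output of the first step is passed through a linear transformation: if the output of step (1) is Z, then Z1 = γZ + β, where γ and β are trainable parameters. This step is added because forcing the feature distribution to change in step (1) may reduce the representational power of the original data; the linear transformation gives the network a chance to recover it.

For more on batch normalization, see "Batch Normalization原理与实战" (Batch Normalization: Principles and Practice).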
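To make the two steps concrete, here is a minimal NumPy sketch of the training-time forward pass. The function name `batch_norm_forward` and the `eps` constant are my own choices for illustration, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN for a (batch, features) array.

    Step 1 normalizes each feature over the batch to mean 0, variance 1.
    Step 2 applies the learned affine map Z1 = gamma * Z + beta.
    """
    mean = x.mean(axis=0)                 # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    z = (x - mean) / np.sqrt(var + eps)   # step 1: no learned parameters
    return gamma * z + beta               # step 2: trainable gamma, beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # toy activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma = 1` and `beta = 0` the output is just the normalized activations; during training the network is free to move both away from these values.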

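1.2 High Resolution Classifier

First, note that object detection needs higher image resolution than image classification. Also, networks are usually not trained from randomly initialized parameters; instead a pre-trained network, typically a classification network trained on ImageNet, is fine-tuned.

YOLO V1 pre-trains with 224x224 inputs and detects with 448x448 inputs, so when switching from classification to detection the model has to adapt to the change in resolution.
YOLO V2 splits pre-training into two steps: it first trains with 224x224 inputs for about 160 epochs, then switches the input to 448x448 and trains for another 10 epochs. The detection network is then fine-tuned from this model, so the transition to 448x448 detection inputs is smooth.

This raises mAP by 3.7%.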

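1.3 Anchor Boxes

YOLO predicts bounding box coordinates directly from the fully connected layers at the end of the network. YOLO V2 borrows the anchor idea from Faster R-CNN.

The fully connected layers and the last pooling layer of the YOLO network are removed so that the feature extractor produces higher-resolution features.
The network input is 416×416 instead of 448×448, because the authors want a feature map with odd dimensions. With an odd width and height the grid has a single center cell (a 7x7 or 9x9 grid has one center cell, while an 8x8 or 10x10 grid has four). Why a single center cell? Large objects tend to occupy the center of the image, so it is better to predict them from one center cell than from four. The network turns a 416x416 input into a 13x13 feature map, a downsampling factor of 32 (5 pooling layers, each halving the input size).
Anchor boxes: in YOLO each grid cell predicts only two boxes, so at most 98 boxes are predicted (7×7×2=98). Faster R-CNN produces roughly 6000 boxes for a 1000×600 input, and SSD300 produces 8732. More boxes are meant to improve localization accuracy, and having so few clearly limits YOLO's localization. With anchor boxes, YOLO V2 predicts more than a thousand boxes (for a 13×13 output feature map with 9 anchor boxes per grid cell, 13×13×9=1521 boxes).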
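The grid sizes and box counts quoted above are simple arithmetic. The helper below is only a sketch to reproduce them; `num_predicted_boxes` is a name I made up, and the stride of 64 for YOLO V1 is simply 448 / 7.

```python
def num_predicted_boxes(input_size, stride, boxes_per_cell):
    """Return (grid side length, total number of predicted boxes)."""
    grid = input_size // stride
    return grid, grid * grid * boxes_per_cell

print(num_predicted_boxes(448, 64, 2))   # YOLO V1: (7, 98)
print(num_predicted_boxes(416, 32, 9))   # 13x13 grid, 9 anchors: (13, 1521)
print(num_predicted_boxes(416, 32, 5))   # 13x13 grid, 5 anchors: (13, 845)
```

The last line uses the 5 anchors per cell that YOLO V2 ends up with after dimension clustering.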

After introducing anchor boxes, recall rises sharply from 81% (the model without anchor boxes) to 88%, while mAP drops slightly, by 0.3 points (69.5 to 69.2).

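1.4 Dimension Clusters (Clustering to Choose Prior Box Sizes)

With anchor boxes, one question is how to choose the anchor positions and sizes. In Faster R-CNN they are hand-picked: an anchor is placed at every stride, and 9 anchor boxes (3 scales × 3 aspect ratios) are generated per position. If better, more representative prior box dimensions are chosen up front, the network should find it easier to learn accurate predictions. The authors' solution is k-means clustering: cluster the ground truth boxes in the dataset to find their statistical regularities, use the number of clusters k as the number of anchor boxes, and use the width and height of the k cluster centroids as the anchor box dimensions.

With standard k-means and Euclidean distance, large boxes produce more error than small ones. What we actually want are boxes that give good IOU scores, independent of box size, so the following distance metric is used:

\[ d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid}) \]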
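Here is a rough sketch of k-means under this distance, run on synthetic box sizes. It is not the authors' code: the helper names are mine, boxes are compared by width and height only (as if they shared a corner), and updating each centroid with the mean of its members is one common choice that I am assuming here.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (N, 2) boxes and (K, 2) centroids given as (w, h) pairs."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with d = 1 - IOU; return centroids and average IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)      # the distance from the text
        assign = dist.argmin(axis=1)               # nearest centroid per box
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:                   # keep old centroid if empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()
    return centroids, avg_iou

# Synthetic (w, h) pairs in grid-cell units, standing in for real annotations.
rng = np.random.default_rng(1)
boxes = np.abs(rng.normal(loc=[4.0, 6.0], scale=[2.0, 3.0], size=(500, 2))) + 0.1
anchors, avg_iou = kmeans_anchors(boxes, k=5)
print(anchors)
print(avg_iou)
```

On real annotations the `boxes` array would hold the ground truth widths and heights, and `avg_iou` is the quantity the Avg IOU figures refer to.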

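Figure 1 shows the clustering results on VOC and COCO:

Figure 1: Clustering dimensions on VOC and COCO

Experimental conclusions:

Priors obtained by clustering give a higher average IOU than hand-picked ones, which makes the model easier to train.
Average IOU increases with k. As a trade-off between model complexity and recall, k = 5 is chosen. Five clustered priors already reach 61.0 Avg IOU, comparable to the 60.9 Avg IOU of nine hand-picked anchor boxes.

The authors also note: "The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes." This is a side observation rather than a key result.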

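1.5 Direct Location Prediction

With anchor boxes, another problem is model instability, especially early in training. The authors attribute this mainly to the prediction of the bounding box center position.

Region proposal networks usually predict the center of the bounding box as an offset relative to the anchor box center:

\[ \begin{aligned} x &= (t_x \cdot w_a) + x_a \\ y &= (t_y \cdot h_a) + y_a \end{aligned} \]

Here \((x_a, y_a)\) is the center of the anchor box, \(w_a, h_a\) are its width and height, \((t_x, t_y)\) are the predicted offsets, and \((x, y)\) is the predicted center of the bounding box. This formula places no constraint on \((t_x, t_y)\), so a predicted box can shift in any direction; for example, with \(t_x = 1\) the box shifts right by one anchor-box width. Every predicted box can therefore end up anywhere in the image, which makes the model unstable.

YOLO V2 keeps the V1 approach and predicts the box center as an offset from the top-left corner of its grid cell. Each cell has 5 anchor boxes that predict 5 bounding boxes, and each box predicts 5 values: \(t_x, t_y, t_w, t_h, t_o\). The first four describe the position and size of the box, and \(t_o\) plays the role of the confidence in YOLO V1. \((c_x, c_y)\) is the offset of the current cell from the top-left corner of the image, and the anchor box prior has width and height \(p_w, p_h\). Referring to Figure 2, the predictions are:

\[ \begin{aligned} b_x &= \sigma(t_x) + c_x \\ b_y &= \sigma(t_y) + c_y \\ b_w &= p_w e^{t_w} \\ b_h &= p_h e^{t_h} \\ Pr(\text{object}) \cdot IOU(b, \text{object}) &= \sigma(t_o) \end{aligned} \]

To keep the box center inside the current cell, the sigmoid function \(\sigma\) squashes \(t_x, t_y\) into the range (0, 1), which makes the model more stable.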
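As a quick numerical check of these formulas, the sketch below decodes one raw prediction in plain NumPy. The function `decode_box` and its argument layout are my own, and all values are in grid-cell units.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell_xy, prior_wh):
    """Turn raw outputs (t_x, t_y, t_w, t_h, t_o) into a box in grid units."""
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell_xy            # top-left corner of the responsible cell
    p_w, p_h = prior_wh           # anchor (prior) width and height
    b_x = sigmoid(t_x) + c_x      # center stays inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)       # the prior is scaled, never negative
    b_h = p_h * np.exp(t_h)
    conf = sigmoid(t_o)           # objectness in (0, 1)
    return b_x, b_y, b_w, b_h, conf

# All-zero outputs give the cell center, the unscaled prior, and confidence 0.5.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), prior_wh=(3.34, 5.47))
print(box)
```

Whatever value the network outputs for `t_x`, the decoded center lies between `c_x` and `c_x + 1`, which is exactly the constraint the plain anchor offset formula lacks.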

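Figure 2: Bounding boxes with dimension priors and location prediction

Comparison experiments show that dimension clusters together with direct location prediction improve accuracy by almost 5% over the version that uses anchor boxes alone.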

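1.6 Fine-Grained Features

The last convolutional layer of YOLO V2 outputs a 13x13 feature map, and detection is performed at this resolution. This is sufficient for large objects, but small objects need finer-grained features, because the smaller an object is, the less likely it is to survive in the final feature map after repeated pooling.

Faster R-CNN and SSD both generate proposals on feature maps at several levels to cover a range of scales. YOLO V2 instead introduces a passthrough layer, somewhat similar to the identity mappings in ResNet, which joins shallow and deep features of different sizes. Here the higher-resolution feature map from the earlier layer is concatenated onto the low-resolution one: the earlier feature map is 26x26x512; before the last pooling it is split one-to-four into four 13x13x512 maps (13x13x2048 in total), which are concatenated with the final feature map (13x13x1024) into a 13x13x(1024+2048) map, and the prediction convolution runs on this (see Figure 3 for details). This amounts to a feature fusion step that helps detect small objects.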
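The one-to-four split is a space-to-depth rearrangement. A minimal NumPy sketch for a single image follows; the helper `space_to_depth` is my own illustration and the feature maps are random placeholders.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange (H, W, C) into (H/block, W/block, C*block*block).

    Each block x block spatial neighbourhood is stacked along the channel axis,
    so no values are discarded; only the layout changes.
    """
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, block * block * c)

fine = np.random.rand(26, 26, 512).astype(np.float32)     # earlier, higher-resolution map
coarse = np.random.rand(13, 13, 1024).astype(np.float32)  # final low-resolution map
merged = np.concatenate([space_to_depth(fine), coarse], axis=-1)
print(merged.shape)   # (13, 13, 3072), i.e. 13x13x(2048+1024)
```

Since the rearrangement only moves values around, the fine spatial detail is still there for the prediction convolution to use.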

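1.7 Multi-Scale Training

In practice the input image size may vary, and the authors take this into account. Because the network is fully convolutional (only conv and pooling layers, no fully connected layers), it can process images of any size. To handle different sizes, YOLO V2 trains on images of varying sizes.

Specifically, during training the network input size is changed at random at a fixed interval (every 10 batches in the paper). YOLO V2 applies 5 max-pooling steps, for a total downsampling factor of 32, so the sizes are drawn from multiples of 32: {320, 352, …, 608}. The smallest is 320×320 and the largest 608×608.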
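The size schedule itself can be sketched in a few lines. This is an illustration only: `pick_input_size` and its `every` parameter are hypothetical names, and the actual resizing of images and labels is omitted.

```python
import random

STRIDE = 32
SIZES = list(range(320, 608 + 1, STRIDE))   # 320, 352, ..., 608

def pick_input_size(step, current_size, every=10, rng=random):
    """Draw a new square input size every `every` steps, else keep the current one."""
    if step % every == 0:
        return rng.choice(SIZES)
    return current_size

size = 416
for step in range(1, 31):
    size = pick_input_size(step, size)
    # ... resize the batch to (size, size) and run one training step here ...
print(SIZES)
```

Every candidate is a multiple of 32, so the output grid always has an integer side length, from 10x10 (320) to 19x19 (608).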

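This training regime forces the model to adapt to different input resolutions. The model runs faster on smaller inputs, so YOLOv2 can trade speed against accuracy as needed. At low resolution (288×288), YOLOv2 runs at about 90 FPS with accuracy on par with Fast R-CNN. At high resolution, YOLOv2 reaches state-of-the-art accuracy on VOC 2007 (78.6 mAP).

Figure 4 shows accuracy against frame rate for the popular detectors (Faster R-CNN, SSD, YOLO). The authors draw a vertical line at 30 fps as the threshold for real-time processing. Faster R-CNN falls short of it, while the different YOLO V2 points show its performance at different input resolutions. Table 1 gives the detailed numbers (tested on VOC 2007).

Table 1: Performance of YOLOv2 and other models on VOC 2007

Summary

YOLO V2 makes several changes to address YOLO's inaccurate localization and low recall. The main one is to borrow the anchor box idea from Faster R-CNN, with k-means clustering used to determine the sizes and shapes of the anchor boxes. To obtain finer-grained features, the network borrows from ResNet and fuses shallow high-resolution features with deep features, which helps detect small objects. Finally, because the YOLO V2 network is fully convolutional and can process images of any size, it is trained on images at multiple scales to cope with varying input sizes.

At the end of the Better section, the authors give a table showing which measures contributed most to the improvement.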

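II. Faster

To balance accuracy and speed, the authors also made changes aimed at speed. Most detection networks rely on VGG-16 as the feature extractor. VGG-16 is a powerful and accurate classification network, but it is overly complex: a single forward pass over a 224×224 image takes 30.69 billion floating point operations in the convolutional layers alone.

YOLO uses a custom network based on GoogLeNet, which is faster than VGG-16 at only 8.52 billion operations per forward pass, at slightly lower accuracy. For single-crop top-5 accuracy at 224×224, YOLO's custom network reaches 88% (VGG-16 reaches 90%).

2.1 Darknet-19

YOLOv2 uses a new classification network as its feature extractor, building on prior work. Like VGG, it uses mostly 3×3 filters and doubles the number of channels after every pooling step. Following Network In Network, it uses global average pooling for prediction and places 1×1 filters between the 3×3 convolutions to compress the features. Batch normalization is used to stabilize training, speed up convergence, and regularize the model.

The resulting base model is Darknet-19, with 19 convolutional layers and 5 max pooling layers. Darknet-19 needs 5.58 billion operations to process an image and reaches 72.9% top-1 and 91.2% top-5 accuracy on ImageNet. The full network structure is given in Table 3.

2.2 Training for Classification

The authors train the classification model on the ImageNet 1000-class dataset, using data augmentation including random crops, rotations, and hue, saturation, and exposure shifts. After this pre-training, the model is fine-tuned on higher-resolution images (448×448). The network trained at high resolution reaches 76.5% top-1 and 93.3% top-5 accuracy.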

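2.3 Training for Detection

To turn the classification network into a detection network, the authors remove the last convolutional layer of the classification model and replace it with three 3×3 convolutional layers with 1024 filters each, followed by a final 1×1 convolutional layer whose number of filters equals the number of outputs needed for detection. For VOC, 5 boxes are predicted per cell, each with 5 coordinates and 20 classes, for 5 × (5 + 20) = 125 output dimensions, so the final layer has 125 filters (5×20 + 5×5). A passthrough layer is also added from the final 3×3×512 convolutional layer to the second-to-last convolutional layer so that the model can use fine-grained features.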
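The filter count, and how the output tensor is later split per anchor, can be checked with a short sketch. `detection_filters` is a name I made up, and the all-zero array merely stands in for a real network output.

```python
import numpy as np

def detection_filters(num_anchors, num_classes):
    """Each anchor predicts 5 box values (t_x, t_y, t_w, t_h, t_o) plus class scores."""
    return num_anchors * (5 + num_classes)

voc_filters = detection_filters(5, 20)    # 125
coco_filters = detection_filters(5, 80)   # 425

# Split a VOC-sized output into per-anchor box values and class scores.
raw = np.zeros((13, 13, voc_filters), dtype=np.float32)
per_anchor = raw.reshape(13, 13, 5, 5 + 20)
box_params = per_anchor[..., :5]      # shape (13, 13, 5, 5)
class_scores = per_anchor[..., 5:]    # shape (13, 13, 5, 20)
print(voc_filters, coco_filters, box_params.shape, class_scores.shape)
```

The same reshape with 80 classes gives the 425-channel output used by the COCO model in the source code section.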

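III. Stronger

As mentioned earlier, object classification labels a whole image, for example "this image contains a person" or "the object in that image is a dog", whereas object detection must predict the class of each object and also draw a box around its position. Among classification datasets, the best known, ImageNet, has tens of thousands of object categories, while detection datasets such as COCO have only 80, because labelling for detection and segmentation costs far more than labelling for classification. The authors therefore propose jointly training on classification and detection data.

3.1 Joint Classification and Detection

Images from the detection dataset are used to learn detection-specific information such as bounding box coordinates, objectness, and per-class probabilities. Images from the classification dataset, which carry only class labels, are used to expand the set of categories that can be detected. During training the detection and classification data are mixed. The basic idea: for a detection sample the loss includes both classification and localization error, while for a classification sample it includes only classification error. A typical training strategy is to first train on the detection dataset for some epochs until the box loss is roughly stable, then alternate between the classification and detection datasets. To balance the amount of data, the authors oversample COCO so that ImageNet is only about four times larger.

Joint training on classification and detection data differs from pre-training the backbone on ImageNet: pre-training only makes the convolutional filters more robust, whereas joint training expands the set of recognizable categories. Take dogs: ImageNet contains more than 100 breeds. For joint training, these labels need to be merged.

Most classification methods use a softmax over all classes. Softmax assumes the classes are mutually exclusive, which does not hold here (for example, "dog" and "shepherd dog" overlap). We therefore need a multi-label model to combine the datasets, one that does not assume mutual exclusivity.

In the end the authors use WordTree to merge the datasets, which resolves the category mismatch between ImageNet and COCO.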

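3.2 Dataset Combination with WordTree

WordTree can be used to combine multiple datasets: the categories in each dataset are simply mapped to synsets in the tree. Figure 5 shows the ImageNet and COCO labels combined with WordTree:

A tree is a natural fit for the "is-a" relations between objects: the top-level category is "object", under it are animal, artifact, natural object and so on, and under animal there are more specific categories. With this structure, softmax is not applied over all categories at once but over each group of sibling categories at the same level:

As shown in Figure 6, softmax is applied within each group of the same color, so that only one category in each group has the highest score. At prediction time, the tree is traversed from the root downwards, choosing the child with the highest score at each step, until the product of the scores of the chosen nodes falls below a threshold. During training, if the label is "person", the loss is computed only for the "person" node and all of its ancestors; its children (man, woman, child and so on) are excluded from the loss.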
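The per-group softmax and the top-down traversal can be sketched on a toy tree. Everything here is illustrative: the hierarchy in `PARENT`, the logits, and the threshold are invented, and the real WordTree is far larger.

```python
import numpy as np

# Hypothetical toy hierarchy, stored as child -> parent.
PARENT = {
    "animal": "object", "artifact": "object",
    "dog": "animal", "cat": "animal",
    "terrier": "dog", "shepherd": "dog",
}

def children_of(node):
    return [c for c, p in PARENT.items() if p == node]

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def conditional_probs(logits):
    """Softmax within each sibling group, giving P(child | parent)."""
    probs = {}
    for parent in set(PARENT.values()):
        kids = children_of(parent)
        probs.update(zip(kids, softmax(np.array([logits[k] for k in kids]))))
    return probs

def predict(logits, threshold=0.5, root="object"):
    """Walk down from the root, taking the best child while the product stays above threshold."""
    probs = conditional_probs(logits)
    node, score = root, 1.0
    while children_of(node):
        best = max(children_of(node), key=lambda c: probs[c])
        if score * probs[best] < threshold:
            break
        node, score = best, score * probs[best]
    return node, score

def labelled_nodes(label, root="object"):
    """Nodes that receive loss for a training label: the label and its ancestors."""
    nodes = [label]
    while nodes[-1] != root:
        nodes.append(PARENT[nodes[-1]])
    return nodes

logits = {"animal": 3.0, "artifact": 0.0, "dog": 2.0, "cat": 0.0,
          "terrier": 0.1, "shepherd": 0.0}
node, score = predict(logits)
print(node, score)
print(labelled_nodes("dog"))
```

With these made-up logits the traversal stops at "dog": the network is fairly sure it sees a dog, but picking a breed would push the product of probabilities below the threshold.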

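The end result is that YOLO v2 can recognize more than 9000 object categories, which the authors call YOLO9000. The paper also points out that child-node predictions are only reliable when the parent node has appeared in the detection set. If the child nodes are trousers, T-shirts, skirts and so on, and the parent node "clothing" never appears in the detection set, nearly the whole branch fails at detection. This is reasonable: a network that has only been shown dogs cannot be expected to predict cats; neural networks are not that smart yet.

IV. Source Code

The TensorFlow version is 1.14. The code structure is shown in Figure 7.

Figure 7: Code structure

The training set is the COCO dataset; for convenience, I use the yolo2_coco_checkpoint weight file directly.

4.1 Object Detection on Images

yolo_pic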

import tensorflow as tf
import numpy as np
import cv2
from keras import backend as K


def leaky_relu(x):    # leaky ReLU activation (slope 0.1), commonly used in deeper networks
    return tf.maximum(0.1*x,x)

class yolov2(object):

    def __init__(self,cls_name):

        self.anchor_size = [[0.57273, 0.677385], #coco
                           [1.87446, 2.06253],
                           [3.33843, 5.47434],
                           [7.88282, 3.52778],
                           [9.77052, 9.16828]]
        self.num_anchors = len(self.anchor_size)
        if cls_name == 'coco':
            self.CLASS = ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train',
                          'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
                          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
                          'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
                          'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
                          'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
                          'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
                          'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
                          'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant',
                          'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse',
                          'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
                          'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
                          'hair drier', 'toothbrush']  #coco
            self.f_num = 425

        else:
            self.CLASS = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
            self.f_num = 125

        self.num_class = len(self.CLASS)
        self.feature_map_size = (13,13)
        self.object_scale = 5. # confidence loss weight for cells that contain an object
        self.no_object_scale = 1.   # confidence loss weight for cells that contain no object
        self.class_scale = 1.  # classification loss weight
        self.coordinates_scale = 1.  # coordinate loss weight


#################################Network

    def conv2d(self,x,filters_num,filters_size,pad_size=0,stride=1,batch_normalize=True,activation=leaky_relu,use_bias=False,name='conv2d'):

        if pad_size > 0:
            x = tf.pad(x,[[0,0],[pad_size,pad_size],[pad_size,pad_size],[0,0]])

        out = tf.layers.conv2d(x,filters=filters_num,kernel_size=filters_size,strides=stride,padding='VALID',activation=None,use_bias=use_bias,name=name)
        # BN goes between the convolution and the activation
        # (a conv followed by BN needs no bias, and the activation comes after BN)
        if batch_normalize:
            out = tf.layers.batch_normalization(out,axis=-1,momentum=0.9,training=False,name=name+'_bn')
        if activation:
            out = activation(out)
        return out

    def maxpool(self,x, size=2, stride=2, name='maxpool'):
        return tf.layers.max_pooling2d(x, pool_size=size, strides=stride,name=name)

    # passthrough
    def passthrough(self,x, stride):
        return tf.space_to_depth(x, block_size=stride)
        # alternatively, tf.extract_image_patches

    def darknet(self):

        x = tf.placeholder(dtype=tf.float32,shape=[None,416,416,3])

        net = self.conv2d(x, filters_num=32, filters_size=3, pad_size=1,
                     name='conv1')
        net = self.maxpool(net, size=2, stride=2, name='pool1')

        net = self.conv2d(net, 64, 3, 1, name='conv2')
        net = self.maxpool(net, 2, 2, name='pool2')

        net = self.conv2d(net, 128, 3, 1, name='conv3_1')
        net = self.conv2d(net, 64, 1, 0, name='conv3_2')
        net = self.conv2d(net, 128, 3, 1, name='conv3_3')
        net = self.maxpool(net, 2, 2, name='pool3')

        net = self.conv2d(net, 256, 3, 1, name='conv4_1')
        net = self.conv2d(net, 128, 1, 0, name='conv4_2')
        net = self.conv2d(net, 256, 3, 1, name='conv4_3')
        net = self.maxpool(net, 2, 2, name='pool4')

        net = self.conv2d(net, 512, 3, 1, name='conv5_1')
        net = self.conv2d(net, 256, 1, 0, name='conv5_2')
        net = self.conv2d(net, 512, 3, 1, name='conv5_3')
        net = self.conv2d(net, 256, 1, 0, name='conv5_4')
        net = self.conv2d(net, 512, 3, 1, name='conv5_5')  #

        # this feature map is used by the passthrough layer below
        shortcut = net
        net = self.maxpool(net, 2, 2, name='pool5')  #

        net = self.conv2d(net, 1024, 3, 1, name='conv6_1')
        net = self.conv2d(net, 512, 1, 0, name='conv6_2')
        net = self.conv2d(net, 1024, 3, 1, name='conv6_3')
        net = self.conv2d(net, 512, 1, 0, name='conv6_4')
        net = self.conv2d(net, 1024, 3, 1, name='conv6_5')


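        # For detection training, the last convolutional layer of the classification network is removed.
        # Three 3*3 convolutional layers with 1024 filters are appended, followed by a 1*1
        # convolutional layer with B * (5 + C) filters.
        # For VOC with a 416*416 input, the final output is a 13*13 grid; each cell predicts
        # 5 boxes, each with 5 coordinate values and 20 conditional class probabilities,
        # so the output is 13 * 13 * 5 * (5 + 20) = 13 * 13 * 125.
        #
        # The detection network adds a passthrough layer from the last 26*26*512
        # convolutional layer to the second of the three new 3*3 convolutional layers,
        # giving the model fine-grained features.

        # the part below is for detection training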
        net = self.conv2d(net, 1024, 3, 1, name='conv7_1')
        net = self.conv2d(net, 1024, 3, 1, name='conv7_2')

        # the shortcut goes through an extra 1*1 convolution with 64 filters before the passthrough,
        # so the feature map goes 26*26*512 -> 26*26*64 -> 13*13*256
        shortcut = self.conv2d(shortcut, 64, 1, 0, name='conv_shortcut')
        shortcut = self.passthrough(shortcut, 2)

        # after concatenation the shape is 13*13*(1024+256)
        # Concatenate along the channel axis with the original features. The passthrough layer resembles
        # a ResNet shortcut: it takes the earlier higher-resolution feature map and joins it to the later low-resolution one.
        net = tf.concat([shortcut, net],-1)
        net = self.conv2d(net, 1024, 3, 1, name='conv8')

        # detection layer: a final 1*1 convolution adjusts the channels, with no BN and no activation; output is S*S*(B*(5+C)), here 13*13*425
        output = self.conv2d(net, filters_num=self.f_num, filters_size=1, batch_normalize=False, activation=None,
                        use_bias=True, name='conv_dec')

        return output,x




# generate anchors ---> decode
    def decode(self,net):

        self.anchor_size = tf.constant(self.anchor_size , dtype=tf.float32)

        net = tf.reshape(net, [-1, 13 * 13, self.num_anchors, self.num_class + 5]) #[batch,169,5,85]

        # offsets, confidence, classes
        # center offset relative to the top-left corner of the cell, squashed into (0,1) by the sigmoid
        xy_offset = tf.nn.sigmoid(net[:, :, :, 0:2])
        wh_offset = tf.exp(net[:, :, :, 2:4])
        obj_probs = tf.nn.sigmoid(net[:, :, :, 4])  # objectness, equivalent to the confidence in v1
        class_probs = tf.nn.softmax(net[:, :, :, 5:])  #

        # generate anchors at every feature map position, five per position
        height_index = tf.range(self.feature_map_size[0], dtype=tf.float32)
        width_index = tf.range(self.feature_map_size[1], dtype=tf.float32)

        # tf.meshgrid takes the x (width) values first; the grid is square here, so the result is unchanged
        x_cell, y_cell = tf.meshgrid(width_index, height_index)
        x_cell = tf.reshape(x_cell, [1, -1, 1])  # matches [H*W,num_anchors,num_class+5] above
        y_cell = tf.reshape(y_cell, [1, -1, 1])

        # decode
        bbox_x = (x_cell + xy_offset[:, :, :, 0]) / 13
        bbox_y = (y_cell + xy_offset[:, :, :, 1]) / 13
        bbox_w = (self.anchor_size[:, 0] * wh_offset[:, :, :, 0]) / 13
        bbox_h = (self.anchor_size[:, 1] * wh_offset[:, :, :, 1]) / 13

        bboxes = tf.stack([bbox_x - bbox_w / 2, bbox_y - bbox_h / 2, bbox_x + bbox_w / 2, bbox_y + bbox_h / 2],
                          axis=3)

        return bboxes, obj_probs, class_probs

    # clip the parts of each bounding box that fall outside the image, (0,0) to (415,415)
    def bboxes_cut(self,bbox_min_max, bboxes):
        bboxes = np.copy(bboxes)
        bboxes = np.transpose(bboxes)
        bbox_min_max = np.transpose(bbox_min_max)
        # cut the box
        bboxes[0] = np.maximum(bboxes[0], bbox_min_max[0])  # xmin
        bboxes[1] = np.maximum(bboxes[1], bbox_min_max[1])  # ymin
        bboxes[2] = np.minimum(bboxes[2], bbox_min_max[2])  # xmax
        bboxes[3] = np.minimum(bboxes[3], bbox_min_max[3])  # ymax
        bboxes = np.transpose(bboxes)
        return bboxes

    def bboxes_sort(self,classes, scores, bboxes, top_k=400):
        index = np.argsort(-scores)
        classes = classes[index][:top_k]
        scores = scores[index][:top_k]
        bboxes = bboxes[index][:top_k]
        return classes, scores, bboxes


    def bboxes_iou(self,bboxes1, bboxes2):
        bboxes1 = np.transpose(bboxes1)
        bboxes2 = np.transpose(bboxes2)

        int_ymin = np.maximum(bboxes1[0], bboxes2[0])
        int_xmin = np.maximum(bboxes1[1], bboxes2[1])
        int_ymax = np.minimum(bboxes1[2], bboxes2[2])
        int_xmax = np.minimum(bboxes1[3], bboxes2[3])

        int_h = np.maximum(int_ymax - int_ymin, 0.)
        int_w = np.maximum(int_xmax - int_xmin, 0.)

        # compute IOU
        int_vol = int_h * int_w  # intersection area
        vol1 = (bboxes1[2] - bboxes1[0]) * (bboxes1[3] - bboxes1[1])  # area of bboxes1
        vol2 = (bboxes2[2] - bboxes2[0]) * (bboxes2[3] - bboxes2[1])  # area of bboxes2
        IOU = int_vol / (vol1 + vol2 - int_vol)  # IOU = intersection / union
        return IOU

    # NMS; tf.image.non_max_suppression is an alternative
    def bboxes_nms(self,classes, scores, bboxes, nms_threshold=0.2):
        keep_bboxes = np.ones(scores.shape, dtype=bool)  # np.bool is removed in recent NumPy; the builtin bool works everywhere
        for i in range(scores.size - 1):
            if keep_bboxes[i]:
                overlap = self.bboxes_iou(bboxes[i], bboxes[(i + 1):])
                keep_overlap = np.logical_or(overlap < nms_threshold,
                                             classes[(i + 1):] != classes[i])  # keep boxes whose IOU is below nms_threshold or whose class differs
                keep_bboxes[(i + 1):] = np.logical_and(keep_bboxes[(i + 1):], keep_overlap)

        idxes = np.where(keep_bboxes)
        return classes[idxes], scores[idxes], bboxes[idxes]

    def postprocess(self,bboxes, obj_probs, class_probs, image_shape=(416, 416), threshold=0.5):

        bboxes = np.reshape(bboxes, [-1, 4])
        # scale every box back to its real position in the image
        bboxes[:, 0:1] *= float(image_shape[1])
        bboxes[:, 1:2] *= float(image_shape[0])
        bboxes[:, 2:3] *= float(image_shape[1])
        bboxes[:, 3:4] *= float(image_shape[0])
        bboxes = bboxes.astype(np.int32)  # cast to int


        bbox_min_max = [0, 0, image_shape[1] - 1, image_shape[0] - 1]
        bboxes = self.bboxes_cut(bbox_min_max, bboxes)


        obj_probs = np.reshape(obj_probs, [-1])  # 13*13*5
        class_probs = np.reshape(class_probs, [len(obj_probs), -1])  # (13*13*5,80)
        class_max_index = np.argmax(class_probs, axis=1)  # index of the highest class probability
        class_probs = class_probs[np.arange(len(obj_probs)), class_max_index]
        scores = obj_probs * class_probs  # confidence * max class probability = class confidence score

        # (1) keep the bounding boxes whose score exceeds the threshold
        keep_index = scores > threshold
        class_max_index = class_max_index[keep_index]
        scores = scores[keep_index]
        bboxes = bboxes[keep_index]

        # (2) sort and keep the top_k (400 by default)
        class_max_index, scores, bboxes = self.bboxes_sort(class_max_index, scores, bboxes)
        # (3) NMS
        class_max_index, scores, bboxes = self.bboxes_nms(class_max_index, scores, bboxes)
        return bboxes, scores, class_max_index



    def preprocess_image(self,image, image_size=(416, 416)):

        image_cp = np.copy(image).astype(np.float32)
        image_rgb = cv2.cvtColor(image_cp, cv2.COLOR_BGR2RGB)
        image_resized = cv2.resize(image_rgb, image_size)
        image_normalized = image_resized.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
        image_expanded = np.expand_dims(image_normalized, axis=0)
        return image_expanded


    '''
    train part
    '''


    def preprocess_true_boxes(self,true_box,anchors,img_size = (416,416)):
        '''
        :param true_box: positions and classes of the ground truth boxes, 2D tensor: (num_boxes, 5)
        :param anchors: anchor box sizes, 2D, one [w, h] row per anchor (the paper uses five);
            w and h are expressed in units of grid cells, e.g.
           [1.08, 1.19], [3.42, 4.41], [6.63, 11.38], [9.42, 5.11], [16.62, 10.52]
        :param img_size:
        :return:
           -detectors_mask: 0 or 1, shape [13,13,5,1];
                1 marks the anchor box responsible for predicting a true box
           -matching_true_boxes: shape [13,13,5,5].
        '''
        w,h = img_size
        feature_w = w // 32
        feature_h = h // 32

        num_box_params = true_box.shape[1]
        detectors_mask = np.zeros((feature_h,feature_w,self.num_anchors,1),dtype=np.float32)
        matching_true_boxes = np.zeros((feature_h,feature_w,self.num_anchors,num_box_params),dtype=np.float32)

        for i in true_box:
            # class index of this box (a scalar, so it fits into the array below)
            box_class = i[4]
            # convert to grid cell units
            box = i[0:4] * np.array([feature_w, feature_h, feature_w, feature_h])
            k = np.floor(box[1]).astype('int') # grid cell index along y
            j = np.floor(box[0]).astype('int') # grid cell index along x
            best_iou = 0
            best_anchor = 0

            # IOU between each anchor box and the true box; one best anchor per true box
            for m,anchor in enumerate(anchors):
                box_maxes = box[2:4] / 2.
                box_mins = -box_maxes
                anchor_maxes = (anchor / 2.)
                anchor_mins = -anchor_maxes

                intersect_mins = np.maximum(box_mins, anchor_mins)
                intersect_maxes = np.minimum(box_maxes, anchor_maxes)
                intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
                intersect_area = intersect_wh[0] * intersect_wh[1]
                box_area = box[2] * box[3]
                anchor_area = anchor[0] * anchor[1]
                iou = intersect_area / (box_area + anchor_area - intersect_area)
                if iou > best_iou:
                    best_iou = iou
                    best_anchor = m

            if best_iou > 0:
                detectors_mask[k, j, best_anchor] = 1

                adjusted_box = np.array(  # regression target for the best-matching anchor box
                    [
                        box[0] - j, box[1] - k, # x,y relative to the grid cell: top-left [0,0], bottom-right [1,1]
                        np.log(box[2] / anchors[best_anchor][0]), # log of the ratio between true box w,h and anchor box w,h
                        np.log(box[3] / anchors[best_anchor][1]), box_class # class index of the true box
                    ],
                    dtype=np.float32)
                matching_true_boxes[k, j, best_anchor] = adjusted_box
        # return after all true boxes are processed (the original returned inside the loop, after the first box)
        return detectors_mask, matching_true_boxes



    def yolo_head(self,feature_map, anchors, num_classes):
        '''
        Converts the features of the YOLO output layer into box centers x,y and sizes w,h,
        both normalized by the grid size.
        pred_confidence is the probability that an object is present; pred_class_prob holds the per-class probabilities after softmax.
        :param feature_map: output of the last network layer, [none,13,13,125]/[none,13,13,425]
        :param anchors: [5,2]
        :param num_classes: number of classes
        :return: x,y,w,h are used in the loss function to compute the IOU and then the IOU loss;
                the IOU is combined with pred_confidence for confidence_loss, and pred_class_prob is used for classification_loss.
                box_xy : center x,y of every predicted box in every grid cell of every image, normalized by the grid size.
                shape:[-1,13,13,5,2].
                box_wh : w,h of every predicted box in every grid cell of every image, normalized by the grid size.
                shape:[-1,13,13,5,2].
                box_confidence : probability that each predicted box contains a detectable object.
                shape:[-1,13,13,5,1]. Dimensions as above.
                box_class_pred : per-class probabilities (after softmax) for each predicted box.
                shape:[-1,13,13,5,20/80]
'''
        anchors = tf.reshape(tf.constant(anchors,dtype=tf.float32),[1,1,1,self.num_anchors,2])
        num_gird_cell = tf.shape(feature_map)[1:3] #[13,13]
        conv_height_index = K.arange(0,stop=num_gird_cell[0])
        conv_width_index = K.arange(0,stop=num_gird_cell[1])

        conv_height_index = tf.tile(conv_height_index, [num_gird_cell[1]])

        conv_width_index = tf.tile(
            tf.expand_dims(conv_width_index, 0), [num_gird_cell[0], 1])
        conv_width_index = K.flatten(K.transpose(conv_width_index))
        conv_index = K.transpose(K.stack([conv_height_index,conv_width_index]))
        conv_index = K.reshape(conv_index,[1,num_gird_cell[0],num_gird_cell[1],1,2])#[1，13，13，1，2]
        conv_index = K.cast(conv_index,K.dtype(feature_map))
        #[[0,0][0,1]....[0,12],[1,0]...]
        feature_map = K.reshape(feature_map,[-1,num_gird_cell[0],num_gird_cell[1],self.num_anchors,self.num_class + 5])
        num_gird_cell = K.cast(K.reshape(num_gird_cell,[1,1,1,1,2]),K.dtype(feature_map))

        box_xy = K.sigmoid(feature_map[...,:2])
        box_wh = K.exp(feature_map[...,2:4])
        confidence = K.sigmoid(feature_map[...,4:5])
        cls_prob = K.softmax(feature_map[...,5:])

        xy = (box_xy + conv_index) / num_gird_cell
        wh = box_wh * anchors / num_gird_cell

        return xy,wh,confidence,cls_prob



    def loss(self,
             net,
             true_boxes,
             detectors_mask,
             matching_true_boxes,
             anchors,
             num_classes):
        '''
        Confidence (IOU) loss, classification loss, coordinate loss.
        confidence_loss:
                There are 845 anchor boxes in total. Those matched to a true box are responsible for predicting it;
                the rest predict background. Among the unmatched anchor boxes, those whose predicted box has an
                IOU above 0.6 with some true box are ignored and contribute no loss, since they already overlap
                an object closely. Unmatched anchor boxes with IOU below 0.6 are treated as background, and any
                confidence they predict is penalized; this is no_objects_loss.
                objects_loss is the prediction error of the anchor boxes matched to true boxes. The weights differ
                from YOLOv1: no_objects_loss and objects_loss are weighted 0.5 and 1 in YOLOv1, and 1 and 5 in YOLOv2.
        classification_loss:
                Mean squared error of the softmax output, a 20-dimensional vector for a dataset with 20 classes.
        coordinates_loss:
                The x,y error changes from the mean squared error of offsets relative to the whole image (416x416)
                to the mean squared error of offsets relative to the grid cell (offsets in (0,1) produced by the sigmoid),
                and its weight changes from 5 to 1. The w,h error changes from the mean squared error of the
                difference of square roots of w,h to the log of the ratio between w,h and the size of the matched anchor box.
                As in YOLOv1, for equal absolute errors this penalizes large objects less and small objects more.
                Its weight also changes from 5 to 1.
        :param net: [batch_size,13,13,125], output of the last network layer
        :param true_boxes: positions and classes of the true boxes, [batch, num_true_boxes, 5]
        :param detectors_mask: 0 or 1, [batch_size,13,13,5,1];
                1 marks the anchor box responsible for predicting a true box
        :param matching_true_boxes: [-1,13,13,5,5]
        :param anchors:
        :param num_classes: 20
        :return:
        '''

        xy, wh, confidence, cls_prob = self.yolo_head(net,anchors,num_classes)
        shape = tf.shape(net)
        feature_map = tf.reshape(net,[-1,shape[1],shape[2],self.num_anchors,num_classes + 5])
        # used with matching_true_boxes to compute the coordinate loss
        pred_box = tf.concat([K.sigmoid(feature_map[...,0:2]),feature_map[...,2:4]],-1)

        pred_xy = tf.to_float(tf.expand_dims(xy,4))#[-1,13,13,5,2]-->[-1,13,13,5,1,2]
        pred_wh = tf.to_float(tf.expand_dims(wh,4))

        pred_min = tf.to_float(pred_xy - pred_wh / 2.0)
        pred_max = tf.to_float(pred_xy + pred_wh / 2.0)

        true_box_shape = K.shape(true_boxes)
        true_boxes = K.reshape(true_boxes,[-1,1,1,1,true_box_shape[1], 5])
        #[-1,1,1,1,-1,5],batch, conv_height, conv_width, num_anchors, num_true_boxes, box_params'

        true_xy = tf.to_float(true_boxes[...,0:2])
        true_wh = tf.to_float(true_boxes[...,2:4])
        true_min = tf.to_float(true_xy - true_wh / 2.0)
        true_max = tf.to_float(true_xy + true_wh / 2.0)

        # IOU between every predicted box and every true box
        intersect_mins = tf.maximum(pred_min, true_min)
        intersect_maxes = tf.minimum(pred_max, true_max)
        intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = tf.to_float(intersect_wh[..., 0] * intersect_wh[..., 1])
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
        true_areas = true_wh[..., 0] * true_wh[..., 1]
        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores = intersect_areas / union_areas


        # several true boxes may fall in the same cell; keep only the largest IOU
        # tf.argmax(iou_scores,4)
        best_ious = K.max(iou_scores, axis=4)
        best_ious = tf.expand_dims(best_ious,axis=-1)

        # mark predictions whose best IOU exceeds 0.6; these are not penalized as background
        obj_dec = tf.cast(best_ious > 0.6,dtype=K.dtype(best_ious))


        # confidence loss
        # the background penalty applies only to anchors that are unmatched and whose best IOU is below 0.6
        no_obj_w = (self.no_object_scale * (1 - obj_dec) * (1 - detectors_mask))
        no_obj_loss = no_obj_w * tf.square(-confidence)
        obj_loss = self.object_scale * detectors_mask * tf.square(1 - confidence)
        confidence_loss = no_obj_loss + obj_loss


        #class loss
        match_cls = tf.cast(matching_true_boxes[...,4],dtype=tf.int32)
        match_cls = tf.one_hot(match_cls,num_classes)

        class_loss = (self.class_scale * detectors_mask * tf.square(match_cls - cls_prob))

        # coordinate loss
        match_box = matching_true_boxes[...,0:4]
        coord_loss = self.coordinates_scale * detectors_mask * tf.square(match_box - pred_box)


        confidence_loss_sum = K.sum(confidence_loss)
        class_loss_sum = K.sum(class_loss)
        coord_loss_sum = K.sum(coord_loss)
        all_loss = 0.5 * (confidence_loss_sum + class_loss_sum + coord_loss_sum)

        return all_loss


    def draw_detection(self,im, bboxes, scores, cls_inds, labels):

        imgcv = np.copy(im)
        h, w, _ = imgcv.shape
        for i, box in enumerate(bboxes):
            cls_indx = cls_inds[i]
            thick = int((h + w) / 1000)
            cv2.rectangle(imgcv, (box[0], box[1]), (box[2], box[3]), (0, 0, 255), thick)
            print("[xmin, ymin, xmax, ymax]=[%d, %d, %d, %d]" % (box[0], box[1], box[2], box[3]))
            mess = '%s: %.3f' % (labels[cls_indx], scores[i])
            text_loc = (box[0], box[1] - 10)
            cv2.putText(imgcv, mess, text_loc, cv2.FONT_HERSHEY_SIMPLEX, 1e-3 * h, (0, 0, 255), thick)
        # return imgcv
        cv2.imshow("detection_results", imgcv)  # show the image
        cv2.waitKey(0)

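# v1 -> v2, v2 -> v3
# 1. Add BN layers (batch normalization): input --> distribution with mean 0 and variance 1
#    ---> whitening --> transform the input to a zero-mean, unit-variance distribution
# usage: input * w --> bn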

if __name__ == '__main__':
    network = yolov2('coco')

    net,x = network.darknet()
    



    saver = tf.train.Saver()
    ckpt_path = './model/v2/yolo2_coco.ckpt'
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    saver.restore(sess,ckpt_path)

    img = cv2.imread('./test/3.jpg')
    #shape = img.shape[:2]
    img_r = network.preprocess_image(img)



    bboxes, obj_probs, class_probs = network.decode(net)
    bboxes, obj_probs, class_probs = sess.run([bboxes, obj_probs, class_probs],feed_dict={x:img_r})
    bboxes, scores, class_max_index = network.postprocess(bboxes, obj_probs, class_probs)
   
   
    print('confidence: ',end="")
    print(scores)
    print('classes: ',end="")
    print(class_max_index)

    img_detection = network.draw_detection(cv2.resize(img,(416,416)), bboxes, scores, class_max_index, network.CLASS)


    

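'''
 1.
    block 1: conv, maxpool
    block 2: conv, maxpool
    block 3: 3 convs, maxpool
    block 4: 3 convs, maxpool
    block 5: 5 convs, maxpool -----------
    block 6: 5 convs                      | concat (passthrough)
    block 7: 3 convs ---------------------
    conv (detection layer)
 2.
    anchor generation and decode
 3.
    clip, keep the TOP_K, NMS
'''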

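Results:

Test 1:

The same test image was run through the V1 and V2 detectors; the comparison is shown below.

The comparison shows that for this image YOLO V1 detects only two objects, the person and the cat, with confidences of just 0.249 and 0.504. V2 not only detects more objects but also raises the confidences for the person and the cat to 0.778 and 0.797, which suggests accuracy has improved as well. In addition to drawing the bounding boxes, the program prints their coordinates.

Test 2:

V1 failed on one test image, the one where pedestrians and vehicles are dense and small. Unfortunately V2 did not detect any objects in it either.

Test 3:

Finally, to close the tests, a group photo of tsy, my favorite actress, with her cast and crew. The results are quite good.

Figure 11: YOLO V2 image detection result (success)

4.2 Object Detection on Video

yolo_video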

import tensorflow as tf
import numpy as np
import cv2
from keras import backend as K



def leaky_relu(x):    # leaky ReLU activation (slope 0.1), commonly used in deeper networks
    return tf.maximum(0.1*x,x)

class yolov2(object):

    def __init__(self,cls_name):

        self.anchor_size = [[0.57273, 0.677385], #coco
                           [1.87446, 2.06253],
                           [3.33843, 5.47434],
                           [7.88282, 3.52778],
                           [9.77052, 9.16828]]
        self.num_anchors = len(self.anchor_size)
        if cls_name == 'coco':
            self.CLASS = ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train',
                          'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
                          'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
                          'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
                          'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
                          'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
                          'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
                          'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
                          'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant',
                          'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse',
                          'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
                          'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
                          'hair drier', 'toothbrush']  #coco
            self.f_num = 425

        else:
            self.CLASS = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
            self.f_num = 125

        self.num_class = len(self.CLASS)
        self.feature_map_size = (13,13)
        self.object_scale = 5.  # confidence-loss weight for anchors responsible for an object
        self.no_object_scale = 1.  # confidence-loss weight for anchors with no object
        self.class_scale = 1.  # weight of the classification loss
        self.coordinates_scale = 1.  # weight of the coordinate loss


#  Network

    def conv2d(self,x,filters_num,filters_size,pad_size=0,stride=1,batch_normalize=True,activation=leaky_relu,use_bias=False,name='conv2d'):

        if pad_size > 0:
            x = tf.pad(x,[[0,0],[pad_size,pad_size],[pad_size,pad_size],[0,0]])

        out = tf.layers.conv2d(x,filters=filters_num,kernel_size=filters_size,strides=stride,padding='VALID',activation=None,use_bias=use_bias,name=name)
        # BN should sit between the conv layer and the activation function
        # (a conv followed by BN needs no bias, and the activation comes after BN)
        if batch_normalize:
            out = tf.layers.batch_normalization(out,axis=-1,momentum=0.9,training=False,name=name+'_bn')
        if activation:
            out = activation(out)
        return out

    def maxpool(self,x, size=2, stride=2, name='maxpool'):
        return tf.layers.max_pooling2d(x, pool_size=size, strides=stride,name=name)

    # passthrough
    def passthrough(self,x, stride):
        return tf.space_to_depth(x, block_size=stride)
        # alternatively, tf.extract_image_patches

    def darknet(self):

        x = tf.placeholder(dtype=tf.float32,shape=[None,416,416,3])

        net = self.conv2d(x, filters_num=32, filters_size=3, pad_size=1,
                     name='conv1')
        net = self.maxpool(net, size=2, stride=2, name='pool1')

        net = self.conv2d(net, 64, 3, 1, name='conv2')
        net = self.maxpool(net, 2, 2, name='pool2')

        net = self.conv2d(net, 128, 3, 1, name='conv3_1')
        net = self.conv2d(net, 64, 1, 0, name='conv3_2')
        net = self.conv2d(net, 128, 3, 1, name='conv3_3')
        net = self.maxpool(net, 2, 2, name='pool3')

        net = self.conv2d(net, 256, 3, 1, name='conv4_1')
        net = self.conv2d(net, 128, 1, 0, name='conv4_2')
        net = self.conv2d(net, 256, 3, 1, name='conv4_3')
        net = self.maxpool(net, 2, 2, name='pool4')

        net = self.conv2d(net, 512, 3, 1, name='conv5_1')
        net = self.conv2d(net, 256, 1, 0, name='conv5_2')
        net = self.conv2d(net, 512, 3, 1, name='conv5_3')
        net = self.conv2d(net, 256, 1, 0, name='conv5_4')
        net = self.conv2d(net, 512, 3, 1, name='conv5_5')  #

        # this feature map is used later by the passthrough layer
        shortcut = net
        net = self.maxpool(net, 2, 2, name='pool5')  #

        net = self.conv2d(net, 1024, 3, 1, name='conv6_1')
        net = self.conv2d(net, 512, 1, 0, name='conv6_2')
        net = self.conv2d(net, 1024, 3, 1, name='conv6_3')
        net = self.conv2d(net, 512, 1, 0, name='conv6_4')
        net = self.conv2d(net, 1024, 3, 1, name='conv6_5')


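        # When training for detection, the last conv layer of the classification network is removed
        # and three 3 * 3 conv layers with 1024 filters each are appended; the last of them is followed
        # by a 1 * 1 conv layer with B * (5 + C) filters.
        # For the VOC dataset with a 416 * 416 input, the final output is a 13 * 13 grid; each cell
        # predicts 5 boxes, and each box carries 5 values (4 coordinates plus 1 confidence) and
        # 20 conditional class probabilities, so the output is 13 * 13 * 5 * (5 + 20) = 13 * 13 * 125.
        #
        # The detection network also adds a passthrough layer that links the last 26 * 26 * 512 conv
        # output to the output of the second of the three newly added 3 * 3 conv layers,
        # giving the model fine-grained features.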

        # The part below is mainly the layers added when training for detection
        net = self.conv2d(net, 1024, 3, 1, name='conv7_1')
        net = self.conv2d(net, 1024, 3, 1, name='conv7_2')

        # The shortcut gets an extra intermediate conv layer: 64 1*1 filters first, then passthrough,
        # so the feature map goes 26*26*512 -> 26*26*64 -> 13*13*256
        shortcut = self.conv2d(shortcut, 64, 1, 0, name='conv_shortcut')
        shortcut = self.passthrough(shortcut, 2)

        # After concatenation the shape is 13*13*(1024+256).
        # The channels are merged (concatenated with the original features). The passthrough layer is similar to
        # a ResNet shortcut: it takes an earlier, higher-resolution feature map and attaches it to the later,
        # lower-resolution one.
        net = tf.concat([shortcut, net],axis=-1)
        net = self.conv2d(net, 1024, 3, 1, name='conv8')

        # detection layer: a final 1*1 conv adjusts the channel count; it has no BN layer and no activation,
        # giving S*S*(B*(5+C)), here 13*13*425
        output = self.conv2d(net, filters_num=self.f_num, filters_size=1, batch_normalize=False, activation=None,
                        use_bias=True, name='conv_dec')

        return output,x




    # generate anchors ---> decode
    def decode(self,net):

        # use a local tensor so self.anchor_size keeps its Python list and decode can be called again
        anchor_size = tf.constant(self.anchor_size, dtype=tf.float32)

        net = tf.reshape(net, [-1, 13 * 13, self.num_anchors, self.num_class + 5]) #[batch,169,5,85]

        # offsets, confidence, class probabilities
        # centre offset relative to the top-left corner of the cell, squashed to (0,1) by sigmoid
        xy_offset = tf.nn.sigmoid(net[:, :, :, 0:2])
        wh_offset = tf.exp(net[:, :, :, 2:4])
        obj_probs = tf.nn.sigmoid(net[:, :, :, 4])  # objectness, equivalent to the confidence in v1
        class_probs = tf.nn.softmax(net[:, :, :, 5:])  # per-class probabilities

        # generate anchors at every feature-map location, five per cell
        height_index = tf.range(self.feature_map_size[0], dtype=tf.float32)
        width_index = tf.range(self.feature_map_size[1], dtype=tf.float32)

        # x_cell holds the column index and y_cell the row index of each cell
        x_cell, y_cell = tf.meshgrid(width_index, height_index)
        x_cell = tf.reshape(x_cell, [1, -1, 1])  # matches the [H*W,num_anchors,num_class+5] layout above
        y_cell = tf.reshape(y_cell, [1, -1, 1])

        # decode
        bbox_x = (x_cell + xy_offset[:, :, :, 0]) / 13
        bbox_y = (y_cell + xy_offset[:, :, :, 1]) / 13
        bbox_w = (anchor_size[:, 0] * wh_offset[:, :, :, 0]) / 13
        bbox_h = (anchor_size[:, 1] * wh_offset[:, :, :, 1]) / 13

        bboxes = tf.stack([bbox_x - bbox_w / 2, bbox_y - bbox_h / 2, bbox_x + bbox_w / 2, bbox_y + bbox_h / 2],
                          axis=3)

        return bboxes, obj_probs, class_probs

    # clip the parts of the bounding boxes that fall outside the image, (0,0) to (415,415)
    def bboxes_cut(self,bbox_min_max, bboxes):
        bboxes = np.copy(bboxes)
        bboxes = np.transpose(bboxes)
        bbox_min_max = np.transpose(bbox_min_max)
        # cut the box
        bboxes[0] = np.maximum(bboxes[0], bbox_min_max[0])  # xmin
        bboxes[1] = np.maximum(bboxes[1], bbox_min_max[1])  # ymin
        bboxes[2] = np.minimum(bboxes[2], bbox_min_max[2])  # xmax
        bboxes[3] = np.minimum(bboxes[3], bbox_min_max[3])  # ymax
        bboxes = np.transpose(bboxes)
        return bboxes

    def bboxes_sort(self,classes, scores, bboxes, top_k=400):
        index = np.argsort(-scores)
        classes = classes[index][:top_k]
        scores = scores[index][:top_k]
        bboxes = bboxes[index][:top_k]
        return classes, scores, bboxes


    def bboxes_iou(self,bboxes1, bboxes2):
        bboxes1 = np.transpose(bboxes1)
        bboxes2 = np.transpose(bboxes2)

        # boxes are [xmin, ymin, xmax, ymax]; compute the corners of the intersection rectangle
        int_xmin = np.maximum(bboxes1[0], bboxes2[0])
        int_ymin = np.maximum(bboxes1[1], bboxes2[1])
        int_xmax = np.minimum(bboxes1[2], bboxes2[2])
        int_ymax = np.minimum(bboxes1[3], bboxes2[3])

        int_w = np.maximum(int_xmax - int_xmin, 0.)
        int_h = np.maximum(int_ymax - int_ymin, 0.)

        # compute the IoU
        int_vol = int_h * int_w  # intersection area
        vol1 = (bboxes1[2] - bboxes1[0]) * (bboxes1[3] - bboxes1[1])  # area of bboxes1
        vol2 = (bboxes2[2] - bboxes2[0]) * (bboxes2[3] - bboxes2[1])  # area of bboxes2
        IOU = int_vol / (vol1 + vol2 - int_vol)  # IoU = intersection / union
        return IOU

    # NMS; tf.image.non_max_suppression could be used instead
    def bboxes_nms(self,classes, scores, bboxes, nms_threshold=0.2):
        # np.bool was removed from recent NumPy releases, so use the built-in bool
        keep_bboxes = np.ones(scores.shape, dtype=bool)
        for i in range(scores.size - 1):
            if keep_bboxes[i]:
                overlap = self.bboxes_iou(bboxes[i], bboxes[(i + 1):])
                # keep a box if its IoU with box i is below nms_threshold or it belongs to a different class
                keep_overlap = np.logical_or(overlap < nms_threshold,
                                             classes[(i + 1):] != classes[i])
                keep_bboxes[(i + 1):] = np.logical_and(keep_bboxes[(i + 1):], keep_overlap)

        idxes = np.where(keep_bboxes)
        return classes[idxes], scores[idxes], bboxes[idxes]

    def postprocess(self,bboxes, obj_probs, class_probs, image_shape=(416, 416), threshold=0.5):

        bboxes = np.reshape(bboxes, [-1, 4])
        # scale all boxes back to pixel coordinates in the image
        bboxes[:, 0:1] *= float(image_shape[1])
        bboxes[:, 1:2] *= float(image_shape[0])
        bboxes[:, 2:3] *= float(image_shape[1])
        bboxes[:, 3:4] *= float(image_shape[0])
        bboxes = bboxes.astype(np.int32)  # convert to int


        bbox_min_max = [0, 0, image_shape[1] - 1, image_shape[0] - 1]
        bboxes = self.bboxes_cut(bbox_min_max, bboxes)


        obj_probs = np.reshape(obj_probs, [-1])  # 13*13*5
        class_probs = np.reshape(class_probs, [len(obj_probs), -1])  # (13*13*5,80)
        class_max_index = np.argmax(class_probs, axis=1)  # index of the most probable class
        class_probs = class_probs[np.arange(len(obj_probs)), class_max_index]
        scores = obj_probs * class_probs  # objectness * max class probability = class confidence score

        # (1) keep the boxes whose class confidence score exceeds threshold
        keep_index = scores > threshold
        class_max_index = class_max_index[keep_index]
        scores = scores[keep_index]
        bboxes = bboxes[keep_index]

        # (2) sort and keep top_k (400 by default)
        class_max_index, scores, bboxes = self.bboxes_sort(class_max_index, scores, bboxes)
        # (3) NMS
        class_max_index, scores, bboxes = self.bboxes_nms(class_max_index, scores, bboxes)
        return bboxes, scores, class_max_index



    def preprocess_image(self,image, image_size=(416, 416)):

        image_cp = np.copy(image).astype(np.float32)
        image_rgb = cv2.cvtColor(image_cp, cv2.COLOR_BGR2RGB)
        image_resized = cv2.resize(image_rgb, image_size)
        image_normalized = image_resized.astype(np.float32) / 255.0  # scale 8-bit pixel values to [0, 1]
        image_expanded = np.expand_dims(image_normalized, axis=0)
        return image_expanded


 


    '''
    train part
    '''


    def preprocess_true_boxes(self,true_box,anchors,img_size = (416,416)):
        '''
        :param true_box: positions and classes of the ground-truth boxes, 2D tensor: (batch,5)
        :param anchors: the anchor box sizes (the paper uses five), a 2D array.
            Second dimension: [w,h], both expressed as multiples of the grid cell width and height, e.g.
           [1.08, 1.19], [3.42, 4.41], [6.63, 11.38], [9.42, 5.11], [16.62, 10.52]
        :param img_size: input image size (w, h)
        :return:
           -detectors_mask: values are 0 or 1; the shape here is [13,13,5,1].
                Fourth dimension: 0/1, where 1 marks the anchor box used to predict that true box
           -matching_true_boxes: the shape here is [13,13,5,5].
        '''
        w,h = img_size
        feature_w = w // 32
        feature_h = h // 32

        num_box_params = true_box.shape[1]
        detectors_mask = np.zeros((feature_h,feature_w,self.num_anchors,1),dtype=np.float32)
        matching_true_boxes = np.zeros((feature_h,feature_w,self.num_anchors,num_box_params),dtype=np.float32)

        for i in true_box:
            # class of the box, taken as a scalar so it can be packed into a flat array later
            box_class = i[4]
            # convert to grid-cell units
            box = i[0:4] * np.array([feature_w, feature_h, feature_w, feature_h])
            k = np.floor(box[1]).astype('int')  # row: which grid cell along y
            j = np.floor(box[0]).astype('int')  # column: which grid cell along x
            best_iou = 0
            best_anchor = 0

            # IoU between each anchor box and the true box; each true box gets one best anchor
            for m,anchor in enumerate(anchors):
                box_maxes = box[2:4] / 2.
                box_mins = -box_maxes
                anchor_maxes = (anchor / 2.)
                anchor_mins = -anchor_maxes

                intersect_mins = np.maximum(box_mins, anchor_mins)
                intersect_maxes = np.minimum(box_maxes, anchor_maxes)
                intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
                intersect_area = intersect_wh[0] * intersect_wh[1]
                box_area = box[2] * box[3]
                anchor_area = anchor[0] * anchor[1]
                iou = intersect_area / (box_area + anchor_area - intersect_area)
                if iou > best_iou:
                    best_iou = iou
                    best_anchor = m

            if best_iou > 0:
                detectors_mask[k, j, best_anchor] = 1

                adjusted_box = np.array(  # regression target for the best-matching anchor box
                    [
                        box[0] - j, box[1] - k,  # x, y relative to the grid cell: top-left [0,0], bottom-right [1,1]
                        np.log(box[2] / anchors[best_anchor][0]),  # log of the ratio of true-box w to anchor w
                        np.log(box[3] / anchors[best_anchor][1]), box_class  # log ratio for h; class index of the true box
                    ],
                    dtype=np.float32)
                matching_true_boxes[k, j, best_anchor] = adjusted_box
        # return only after every true box has been processed
        return detectors_mask, matching_true_boxes



    def yolo_head(self,feature_map, anchors, num_classes):
        '''
        Converts the raw output features of the YOLO network into box centres (x, y) and sizes (w, h),
        the probability that a box contains an object (confidence) and the per-class probabilities
        after softmax (cls_prob).
        :param feature_map: output of the last network layer, [None,13,13,125] / [None,13,13,425]
        :param anchors: [5,2]
        :param num_classes: number of classes
        :return: x,y,w,h are used in the loss function to compute the IoU;
                confidence is used for confidence_loss and cls_prob for classification_loss.
                xy : centre x,y of every predicted box in every grid cell of every image: the sigmoid
                offset inside the cell plus the cell index, divided by the grid size.
                shape:[-1,13,13,5,2].
                wh : w,h of every predicted box: exp(output) times the anchor size, divided by the grid size.
                shape:[-1,13,13,5,2].
                confidence : probability that each predicted box contains a detectable object.
                shape:[-1,13,13,5,1].
                cls_prob : per-class probabilities of each predicted box (after softmax).
                shape:[-1,13,13,5,20/80]
        '''
        anchors = tf.reshape(tf.constant(anchors,dtype=tf.float32),[1,1,1,self.num_anchors,2])
        num_gird_cell = tf.shape(feature_map)[1:3] #[13,13]
        conv_height_index = K.arange(0,stop=num_gird_cell[0])
        conv_width_index = K.arange(0,stop=num_gird_cell[1])

        conv_height_index = tf.tile(conv_height_index, [num_gird_cell[1]])

        conv_width_index = tf.tile(
            tf.expand_dims(conv_width_index, 0), [num_gird_cell[0], 1])
        conv_width_index = K.flatten(K.transpose(conv_width_index))
        conv_index = K.transpose(K.stack([conv_height_index,conv_width_index]))
        conv_index = K.reshape(conv_index,[1,num_gird_cell[0],num_gird_cell[1],1,2])  # [1,13,13,1,2]
        conv_index = K.cast(conv_index,K.dtype(feature_map))
        #[[0,0][0,1]....[0,12],[1,0]...]
        feature_map = K.reshape(feature_map,[-1,num_gird_cell[0],num_gird_cell[1],self.num_anchors,self.num_class + 5])
        num_gird_cell = K.cast(K.reshape(num_gird_cell,[1,1,1,1,2]),K.dtype(feature_map))

        box_xy = K.sigmoid(feature_map[...,:2])
        box_wh = K.exp(feature_map[...,2:4])
        confidence = K.sigmoid(feature_map[...,4:5])
        cls_prob = K.softmax(feature_map[...,5:])

        xy = (box_xy + conv_index) / num_gird_cell
        wh = box_wh * anchors / num_gird_cell

        return xy,wh,confidence,cls_prob



    def loss(self,
             net,
             true_boxes,
             detectors_mask,
             matching_true_boxes,
             anchors,
             num_classes):
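        '''
        Confidence (IoU) loss, classification loss and coordinate loss.
        confidence_loss:
                There are 845 anchor boxes in total (13 x 13 x 5). Anchors matched to a true box are
                responsible for predicting that object; unmatched anchors should predict background.
                Among the unmatched anchors, those whose predicted box has a best IoU below 0.6 with
                every true box are penalised for any confidence they output; this is no_objects_loss.
                Unmatched anchors whose best IoU exceeds 0.6 already overlap an object closely and are
                left out of this term.
                objects_loss is the confidence error of the anchors matched to a true box.
                Compared with YOLOv1 the weights change: no_objects_loss and objects_loss use
                0.5 and 1 in YOLOv1, but 1 and 5 in YOLOv2.
        classification_loss:
                Mean squared error of the 20-dimensional vector after softmax (the dataset has 20 classes).
        coordinates_loss:
                The x,y error is the squared error of the offset relative to the grid cell
                (the offset is a sigmoid output in (0,1)). The weight is changed from 5 to 1.
                The w,h error changes from the squared difference of the square roots of w,h to the
                squared difference of the log of the ratio between w,h and the matched anchor box size.
                The idea is the same as in YOLOv1: for equal absolute errors, penalise large objects less
                and small objects more. The weight is also changed from 5 to 1.
        :param net: [batch_size,13,13,125], output of the last network layer
        :param true_boxes: positions and classes of the ground-truth boxes [batch,5]
        :param detectors_mask: values are 0 or 1, [batch_size,13,13,5,1];
                1 marks the anchor box used to predict that true box
        :param matching_true_boxes: [-1,13,13,5,5]
        :param anchors: anchor box sizes
        :param num_classes: 20
        :return: the total loss
        '''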

        xy, wh, confidence, cls_prob = self.yolo_head(net,anchors,num_classes)
        shape = tf.shape(net)
        feature_map = tf.reshape(net,[-1,shape[1],shape[2],self.num_anchors,num_classes + 5])
        # used together with matching_true_boxes to compute the coordinate loss
        pred_box = tf.concat([K.sigmoid(feature_map[...,0:2]),feature_map[...,2:4]],axis=-1)

        pred_xy = tf.to_float(tf.expand_dims(xy,4))  # [-1,13,13,5,2] --> [-1,13,13,5,1,2]
        pred_wh = tf.to_float(tf.expand_dims(wh,4))

        pred_min = tf.to_float(pred_xy - pred_wh / 2.0)
        pred_max = tf.to_float(pred_xy + pred_wh / 2.0)

        true_box_shape = K.shape(true_boxes)
        true_boxes = K.reshape(true_boxes,[-1,1,1,1,true_box_shape[1], 5])
        # [-1,1,1,1,-1,5]: batch, conv_height, conv_width, num_anchors, num_true_boxes, box_params

        true_xy = tf.to_float(true_boxes[...,0:2])
        true_wh = tf.to_float(true_boxes[...,2:4])
        true_min = tf.to_float(true_xy - true_wh / 2.0)
        true_max = tf.to_float(true_xy + true_wh / 2.0)

        # IoU between every predicted box and every true box
        intersect_mins = tf.maximum(pred_min, true_min)
        intersect_maxes = tf.minimum(pred_max, true_max)
        intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = tf.to_float(intersect_wh[..., 0] * intersect_wh[..., 1])
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
        true_areas = true_wh[..., 0] * true_wh[..., 1]
        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores = intersect_areas / union_areas


        # several true boxes may overlap the same prediction; keep only the largest IoU
        # tf.argmax(iou_scores,4)
        best_ious = K.max(iou_scores, axis=4)
        best_ious = tf.expand_dims(best_ious,axis=-1)

        # flag predictions whose best IoU exceeds 0.6; the rest count as background
        obj_dec = tf.cast(best_ious > 0.6,dtype=K.dtype(best_ious))


        # confidence (IoU) loss
        # the no-object term applies only to anchors that are neither matched to a true box
        # nor overlap one with IoU > 0.6
        no_obj_w = (self.no_object_scale * (1 - obj_dec) * (1 - detectors_mask))
        no_obj_loss = no_obj_w * tf.square(-confidence)
        obj_loss = self.object_scale * detectors_mask * tf.square(1 - confidence)
        confidence_loss = no_obj_loss + obj_loss


        #class loss
        match_cls = tf.cast(matching_true_boxes[...,4],dtype=tf.int32)
        match_cls = tf.one_hot(match_cls,num_classes)

        class_loss = (self.class_scale * detectors_mask * tf.square(match_cls - cls_prob))

        # coordinate loss
        match_box = matching_true_boxes[...,0:4]
        coord_loss = self.coordinates_scale * detectors_mask * tf.square(match_box - pred_box)


        confidence_loss_sum = K.sum(confidence_loss)
        class_loss_sum = K.sum(class_loss)
        coord_loss_sum = K.sum(coord_loss)
        all_loss = 0.5 * (confidence_loss_sum + class_loss_sum + coord_loss_sum)

        return all_loss


    def draw_detection(self,j,im, bboxes, scores, cls_inds, labels):
        imgcv = np.copy(im)
        h, w, _ = imgcv.shape
        # append the box coordinates of this frame to a text file; the file is closed automatically
        with open('./output/final.txt', "a") as f:
            for i, box in enumerate(bboxes):
                cls_indx = cls_inds[i]
                thick = int((h + w) / 1000)
                cv2.rectangle(imgcv, (box[0], box[1]), (box[2], box[3]), (0, 0, 255), thick)
                # the stored values are the box corners, not a width and height
                f.write('[xmin, ymin, xmax, ymax]=['+str(box[0])+','+str(box[1])+','+str(box[2])+','+str(box[3])+']\n')
                mess = '%s: %.3f' % (labels[cls_indx], scores[i])
                text_loc = (box[0], box[1] - 10)
                cv2.putText(imgcv, mess, text_loc, cv2.FONT_HERSHEY_SIMPLEX, 1e-3 * h, (0, 0, 255), thick)
            # a blank line separates consecutive frames in the text file
            f.write('\n')

        # save each processed frame to disk
        address = './output/' + str(j)+ '.jpg'
        cv2.imwrite(address,imgcv)
        return imgcv

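# v1 -> v2, v2 -> v3
# 1. Add a BN (batch normalization) layer: input --> normal distribution with mean 0 and variance 1
#    ---> whitening --> transform the input to a distribution with zero mean and unit variance
# Usage: input * w --> bn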

if __name__ == '__main__':
    network = yolov2('coco')
  
    net,x = network.darknet()
    _bboxes, _obj_probs, _class_probs = network.decode(net)
    

    saver = tf.train.Saver()
    ckpt_path = './model/v2/yolo2_coco.ckpt'
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    saver.restore(sess,ckpt_path)



    # open the video file
    cap = cv2.VideoCapture("./test/3.mp4")
    # or read from a camera instead
    # cap = cv2.VideoCapture(1)
    # read and process the frames one by one
    j=0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:  # end of the video or a failed read: frame is None, so stop
            break
        img_r = network.preprocess_image(frame)

        bboxes, obj_probs, class_probs = sess.run([_bboxes, _obj_probs, _class_probs],feed_dict={x:img_r})
        bboxes, scores, class_max_index = network.postprocess(bboxes, obj_probs, class_probs)
        #print(scores, box_classes)
        img_detection = network.draw_detection(j, cv2.resize(frame,(416,416)), bboxes, scores, class_max_index, network.CLASS)
        j=j+1
    cap.release()




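'''
 1. Network:
    block 1: 1 conv, maxpool
    block 2: 1 conv, maxpool
    block 3: 3 convs, maxpool
    block 4: 3 convs, maxpool
    block 5: 5 convs, maxpool  -----------
    block 6: 5 convs                      | passthrough + concat
    block 7: 3 convs ---------------------
    final 1x1 detection conv
 2. Anchors:
    anchor generation and decode
 3. Post-processing:
    clip boxes, keep the TOP_K, NMS
'''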
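
The connection drawn between the fifth and seventh groups of layers in the summary above is the passthrough layer, which the script implements with `tf.space_to_depth` followed by a channel concatenation. As a rough illustration, the NumPy sketch below moves each 2 x 2 spatial block into the channel axis. `space_to_depth_np` is a hypothetical helper written for this example; it is only intended to mirror how the TensorFlow op treats NHWC input and is not part of the original code.

```python
import numpy as np

def space_to_depth_np(x, block=2):
    """Move each block x block spatial patch of an NHWC array into the channel axis."""
    n, h, w, c = x.shape
    assert h % block == 0 and w % block == 0, "spatial size must be divisible by block"
    x = x.reshape(n, h // block, block, w // block, block, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # -> (n, h/block, w/block, block_row, block_col, c)
    return x.reshape(n, h // block, w // block, block * block * c)

# Tiny example: one 4x4 single-channel map whose entries are 0..15.
tiny = np.arange(16).reshape(1, 4, 4, 1)
tiny_out = space_to_depth_np(tiny)

# Shapes used by the network: the 26x26x64 shortcut becomes 13x13x256,
# and is then concatenated with the 13x13x1024 feature map.
shortcut = np.zeros((1, 26, 26, 64), dtype=np.float32)
deep = np.zeros((1, 13, 13, 1024), dtype=np.float32)
merged = np.concatenate([space_to_depth_np(shortcut), deep], axis=-1)
print(tiny_out.shape, merged.shape)
```

No values are averaged or dropped in this step: the higher-resolution map is only rearranged, which is why the fine-grained features survive the reduction from 26 x 26 to 13 x 13.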

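Results (frame 30):

The video is the test video from the previous article. Because uploading is difficult, only a single frame is shown here again. The comparison is below (V1 on top, V2 underneath).

The comparison shows that, relative to V1's detection result for frame 30, V2 detects more objects and does so with higher accuracy.

Original video: see the link.

Processed video: see the link.

In addition, far too many bbox positions were detected to show in a screenshot, so I wrote all of them to a txt file. See the link.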

References:
https://pjreddie.com/darknet/yolo/
https://xmfbit.github.io/2017/02/04/yolo-paper/
https://www.cnblogs.com/AntonioSu/p/12164255.html
https://zhuanlan.zhihu.com/p/25052190
http://lanbing510.info/2017/09/04/YOLOV2.html
https://segmentfault.com/a/1190000016842636#comment-area
https://www.youtube.com/watch?v=VOC3huqHrss
https://www.cnblogs.com/wangguchangqing/p/10480995.html
https://zhuanlan.zhihu.com/p/74540100


Original article: https://www.cnblogs.com/han-sy/p/13301054.html