P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds


Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, and Yang Xiao

National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

qihaozhe, chen feng, zgcao@hust.edu.cn, fzhao@alumni.hust.edu.cn, Yang Xiao@hust.edu.cn

Figure 1. An illustration of how P2B works, from seed sampling to 3D target proposal and verification. Panel labels: target template; seed points with target-specific features; cluster of potential target centers; final predicted 3D target box; $s^p$: proposal-wise targetness score.

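Abstract

Paper walkthrough (in Chinese): https://zhuanlan.zhihu.com/p/146512901

For 3D object tracking in point clouds, we propose a novel point-to-box network, termed P2B, that is learned end-to-end. Our main idea is to first localize potential target centers in a 3D search area embedded with target information, and then to execute point-driven 3D target proposal and verification jointly. In this way, the time-consuming 3D exhaustive search can be avoided. Specifically, we first sample seeds from the point clouds in the template and the search area respectively. Then we execute permutation-invariant feature augmentation to embed the target clues from the template into the search area seeds and represent them with target-specific features. Consequently, the augmented search area seeds regress the potential target centers via Hough voting. These centers are further strengthened with seed-wise targetness scores. Finally, each center clusters its neighbors to leverage the ensemble power for joint 3D target proposal and verification. We apply PointNet++ as our backbone, and experiments on the KITTI tracking dataset demonstrate P2B's superiority (~10% improvement over the state of the art). Note that P2B can run at 40 FPS on a single NVIDIA 1080Ti GPU. Our code and model are available at https://github.com/HaozheQi/P2B.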

1. Introduction

3D object tracking in point clouds is essential for applications in autonomous driving and robotics vision [25, 26, 7]. However, the sparsity and disorder of point clouds impose great challenges on this task, so well-established 2D object tracking approaches (e.g., Siamese network [3]) cannot be directly applied. Most existing 3D object tracking methods [1, 4, 24, 16, 15] inherit experience from 2D tracking and rely heavily on RGB-D information. But they may fail when RGB visual information is degraded by illumination change or is even inaccessible. We hence focus on 3D object tracking using only point clouds. The first pioneering effort on this topic appears in [11]. It mainly executes 3D template matching using Kalman filtering [12] to generate bunches of 3D target proposals. Meanwhile, it uses shape completion to regularize feature learning on point sets. Nevertheless, it tends to suffer from four main defects: 1) its tracking network cannot be trained end-to-end; 2) 3D search with Kalman filtering consumes much time; 3) each target proposal is represented with only a one-dimensional global feature, which may lose fine local geometric information; 4) the shape completion network brings a strong class prior, which weakens generality.

To address the above concerns, we propose a novel point-to-box network, termed P2B, for 3D object tracking, which can be trained end-to-end. Differing from the intuitive box-based 3D search in [11], we address 3D object tracking by first localizing potential target centers and then executing point-driven target proposal and verification jointly. Our intuition is two-fold. First, the point-wise tracking paradigm may help better exploit 3D local geometric information to characterize the target in point clouds. Second, formulating the 3D object tracking task in an end-to-end manner gives a stronger ability to fit the variation of the target's 3D appearance during tracking.

We exemplify how P2B works in Fig. 1. We first feed the template and the search area into the backbone respectively and obtain their seeds. The search area seeds will subsequently predict potential target centers for joint target proposal and verification. The search area seeds are then augmented with target-specific features, yielding three main components: 1) their 3D position coordinates to retain spatial geometric information, 2) their point-wise similarity with the template seeds to mine resembling patterns and reveal the local tracking clue, and 3) the encoded global feature of the target from the template. This augmentation is invariant to the permutation of seeds and yields consistent target-specific features. After that, the augmented seeds are projected to the potential target centers via Hough voting [28]. Meanwhile, each seed is assessed with its targetness to regularize earlier feature learning; the resulting targetness score further strengthens the representation of its predicted target center. Finally, each potential target center clusters its neighbors to leverage the ensemble power for joint target proposal and verification.

Experiments on the KITTI tracking dataset [10] demonstrate that P2B significantly outperforms the state-of-the-art method [11] by a large margin (~10% on both Success and Precision). Note that P2B can run at about 40 FPS on a single NVIDIA 1080Ti GPU.

Overall, the main contributions of this paper include:

• P2B: a novel point-to-box network for 3D object tracking in point clouds, which can be trained end-to-end;

• Target-specific feature augmentation to include global and local 3D visual clues for 3D object tracking;

• Integration of 3D target proposal and verification.

2. Related Works

We briefly introduce the works most related to P2B: 3D object tracking, 2D Siamese tracking, deep learning on point sets, target proposal, and Hough voting.

3D object tracking. To the best of our knowledge, 3D object tracking using only point clouds had seldom been studied before the recent pioneering attempt [11]. Earlier related tracking methods [24, 16, 15, 27, 1, 4] generally resort to RGB-D information. Despite the efforts made from different theoretical aspects, they may suffer from two main defects: 1) they rely on the RGB visual clue and may fail if it is degraded or even inaccessible, which limits some real applications; 2) they have no networks designed for 3D tracking, which may limit their representative power. Besides, some of them [24, 16, 15] focus on generating 2D boxes. The above concerns are addressed in [11]. Leveraging deep learning on point sets and 3D target proposal, it achieves the state-of-the-art result on 3D object tracking using only point clouds. However, it still suffers from the drawbacks discussed in Sec. 1, which motivates our research.

2D Siamese tracking. Numerous state-of-the-art 2D tracking methods [33, 3, 34, 13, 42, 35, 20, 8, 40, 36, 21] are built upon the Siamese network. Generally, a Siamese network has two branches, for the template and the search area, with shared weights to measure their similarity in an implicitly embedded space. Recently, [21] united the region proposal network and the Siamese network to boost performance, so that time-consuming multi-scale search and online fine-tuning are both avoided. Afterwards, many efforts [42, 20, 40, 36, 8] followed this paradigm. However, the above methods are all driven by 2D CNNs, which are inapplicable to point clouds. We hence aim to extend the Siamese tracking paradigm to 3D object tracking with effective 3D target proposal.

Deep learning on point sets. Recently, deep learning on point sets has drawn increasing research interest [5, 30]. To address the disorder, sparsity, and rotation variance of point clouds, these efforts have facilitated research in 3D object recognition [18, 23], 3D object detection [28, 29, 32, 39], 3D pose estimation [22, 9, 6], and 3D object tracking [11]. However, the 3D tracking network in [11] cannot execute end-to-end 3D target proposal and verification jointly, which constitutes P2B's focus.

Figure legend: $s^s$: seed-wise targetness score. MLP: multi-layer perceptron with fully connected layers, batch normalization, and ReLU.

Figure 3. The concept of permutation invariance. To represent $r_j$, we first compute the point-wise similarity $Sim_{j,:}$ between $r_j$ and all template seeds $Q = \{q_i\}_{i=1}^{M_1}$. However, $Sim_{j,:}$ keeps changing due to $Q$'s disorder (the order of $Q$ can vary irregularly). This motivates our feature augmentation for a consistent (i.e., permutation-invariant) $f^t_{r_j}$. The markers in the figure denote the dimensions of $Sim_{j,:}$ and $f^t_{r_j}$.

Target proposal. In 2D tracking tasks, many tracking-by-detection methods [41, 37, 14] exploit the target clue contained in the template to obtain high-quality target-specific proposals. They operate on (2D) area-based pixels with either edge features [41], a region-proposal network [37], or an attention map [14] in a target-aware manner. Comparatively, P2B regards each point as a regressor towards a potential target center, which directly relates to 3D target proposal.

Hough voting. The seminal work on Hough voting [19] proposes a highly flexible learned representation for object shape, which can combine the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform [2]. Recently, [28] embedded Hough voting into an end-to-end trainable deep network for 3D object detection in point clouds, which further aggregates local context and yields promising results. But how to effectively apply it to 3D object tracking remains unexplored.

3. P2B: A Novel Network on Point Set for 3D Object Tracking

3.1. Overview

In 3D object tracking, we focus on localizing the target (defined by the template) in the search area frame by frame. We aim to embed the template's target clue into the search area to predict potential target centers, and to execute joint target proposal and verification in an end-to-end manner. P2B has two main parts (Fig. 2): 1) target-specific feature augmentation, and 2) 3D target proposal and verification. We first feed the template and the search area respectively into the backbone and obtain their seeds. Then the template seeds help augment the search area seeds with target-specific features. After that, these augmented search area seeds are projected to potential target centers via Hough voting. Seed-wise targetness scores are also calculated to regularize feature learning and strengthen the discriminative power of these potential target centers. Then each potential target center clusters its neighbors for 3D target proposal. The proposal with the maximal proposal-wise targetness score is verified as the final result. We detail these steps below. The main symbols within P2B are defined in Table 1. For easy comprehension, we also sketch the detailed technical flow in Algorithm 1.

Φ and Θ denote MLP-Maxpool-MLP networks that operate on feature channels.

Algorithm 1 (technical flow of P2B)

Input: points in the template $P_{tmp}$ (of size $N_1$) and in the search area $P_{sea}$ (of size $N_2$).

Output: the proposal with the highest proposal-wise targetness score $s^p$.

1: Feature extraction. Feed $P_{tmp}$ and $P_{sea}$ into the backbone and obtain seeds $Q = \{q_i\}_{i=1}^{M_1}$ and $R = \{r_j\}_{j=1}^{M_2}$ respectively, with features $f \in \mathbb{R}^{d_1}$. Each seed is represented with its 3D position and $f$, yielding a dimension of $3 + d_1$.

2: Point-wise similarity. Compute the point-wise similarity $Sim_{j,:}$ between each seed $r_j$ and $Q$. Over all search area seeds and all template seeds we obtain $Sim \in \mathbb{R}^{M_2 \times M_1}$.

3: Feature augmentation. Augment each $Sim_{j,:}$ with $Q$ to a size of $M_1 \times (1 + 3 + d_1)$. Feed the result into Φ to obtain $r_j$'s target-specific feature $f^t_{r_j} \in \mathbb{R}^{d_2}$. $r_j$ is represented with its 3D position and $f^t_{r_j}$, yielding a dimension of $3 + d_2$.

4: Potential target center generation. Each seed $r_j$ 1) predicts a potential target center $c_j$ with feature $f_{c_j} \in \mathbb{R}^{d_2}$ via Hough voting, and 2) is assessed with a seed-wise targetness score $s^s_j \in \mathbb{R}$. $c_j$ is represented by concatenating $s^s_j$, its 3D position, and $f_{c_j}$, yielding a dimension of $1 + 3 + d_2$.

5: Clustering. Sample a subset of size $K$ from $C = \{c_j\}$. For each sampled $c_j$, generate a cluster $T_j$ with ball query, where $T_j$ contains $n_j$ potential target centers.

6: 3D target proposal. Feed each $T_j$ into Θ to generate a 3D target proposal $p^t_j$ with proposal-wise targetness score $s^p_j$. $K$ proposals are predicted in total.

3.2. Target-specific feature augmentation

Here we aim to merge the template's target information into the search area seeds to include both the global target clue and the local tracking clue. We first feed the template and the search area respectively into the feature backbone and obtain their seeds. With the target information embedded in the template, we then augment the search area seeds with target-specific features in the spirit of pattern matching, in a way that also satisfies permutation invariance to address the disorder of point clouds.

Feature encoding on point cloud. We feed the points in the template $P_{tmp}$ (of size $N_1$) and the search area $P_{sea}$ (of size $N_2$) to a feature backbone and obtain $M_1$ template seeds $Q = \{q_i\}_{i=1}^{M_1}$ and $M_2$ search area seeds $R = \{r_j\}_{j=1}^{M_2}$ with features $f \in \mathbb{R}^{d_1}$. We applied the hierarchical feature learning architecture of PointNet++ [30] as the backbone (but are not restricted to it), so that $Q$ and $R$ could preserve local context within $P_{tmp}$ and $P_{sea}$. Each seed is finally represented with $[x; f] \in \mathbb{R}^{3+d_1}$ ($x$ denotes the seed's 3D position).

Permutation-invariant target-specific feature augmentation. To embed $Q$'s target information into $R$, a natural idea is to compute the point-wise similarity $Sim$ (of size $M_2 \times M_1$) between $Q$ and $R$, e.g., using cosine distance:

$$Sim_{j,i} = \frac{f_{q_i}^{\top} f_{r_j}}{\|f_{q_i}\|_2 \, \|f_{r_j}\|_2}, \quad \forall q_i \in Q,\ r_j \in R. \quad (1)$$

Note that $Sim_{j,:}$ (row $j$ in $Sim$) denotes the similarity between $r_j$ and all seeds in $Q$. We may first consider $Sim_{j,:}$ as $r_j$'s target-specific feature. However, as in Fig. 3, $Sim_{j,:}$ is unstable due to $Q$'s disorder. This contradicts our need for a consistent feature, i.e., a feature invariant to the permutation of $Q$. We accordingly apply a symmetric function (specifically, Maxpool) to ensure permutation invariance. As in Fig. 4, we first augment each $Sim_{j,:}$ (local tracking clue) with $Q$'s spatial coordinates and features (global target clue), yielding a tensor of size $M_1 \times (1 + 3 + d_1)$. Then we feed the tensor into network Φ (MLP-Maxpool-MLP) and obtain $r_j$'s target-specific feature $f^t_{r_j} \in \mathbb{R}^{d_2}$. $r_j$ is finally represented with $[x_{r_j}; f^t_{r_j}] \in \mathbb{R}^{3+d_2}$ ($x_{r_j}$ denotes $r_j$'s 3D position).

There are other options for extracting $f^t$: leaving out $Q$'s feature, leaving out $Sim$, or adding $R$'s feature. All of them turn out inferior in Sec. 4.3.1.
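The augmentation step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names `cosine` and `augment_seed` are hypothetical, the learned MLPs inside Φ are omitted, and only the tensor layout and the max-pooling over template seeds are shown. Because a shared per-row MLP followed by max-pooling is still symmetric in the rows, the invariance shown here carries over to the full MLP-Maxpool-MLP design.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def augment_seed(r_feat, tmpl_xyz, tmpl_feat):
    """Build the M1 x (1 + 3 + d1) tensor for one search-area seed r_j and
    max-pool it over the M1 template seeds (a symmetric function).
    Assumption: the learned MLPs of Phi are left out of this sketch."""
    rows = [[cosine(q_f, r_feat)] + list(q_x) + list(q_f)
            for q_x, q_f in zip(tmpl_xyz, tmpl_feat)]
    pooled = [max(col) for col in zip(*rows)]
    return rows, pooled


# Toy template with M1 = 3 seeds and d1 = 2 feature channels.
tmpl_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tmpl_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r_feat = [1.0, 0.0]

rows, pooled = augment_seed(r_feat, tmpl_xyz, tmpl_feat)

# Shuffle the template seeds: the per-seed rows change order,
# but the pooled feature stays the same.
order = [2, 0, 1]
rows_perm, pooled_perm = augment_seed(
    r_feat, [tmpl_xyz[i] for i in order], [tmpl_feat[i] for i in order])
```

Comparing `rows` with `rows_perm` shows the instability of the raw similarity row, while `pooled == pooled_perm` shows why a symmetric pooling step gives a consistent feature.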


3.3. Target proposal based on potential target centers

Embedded with the target clue, each $r_j$ can directly predict one target proposal. But our intuition is that an individual seed can only capture a limited local clue, which may not suffice for the final prediction. We follow the idea within VoteNet [28] to 1) regress the search area seeds into potential target centers via Hough voting, and 2) cluster neighboring centers to leverage the ensemble power and obtain target proposals.

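Potential target center generation. Each seed $r_j$ with feature $f^t_{r_j}$ can roughly predict a potential target center $c_j$ via Hough voting. Following VoteNet [28], the voting module applies an MLP to regress the coordinate offset $\Delta x_j$ between $r_j$ and the ground-truth target center, as well as a residual $\Delta f^t_{r_j}$ for $f^t_{r_j}$. Hence, $c_j$ is represented with $[x_{c_j}; f_{c_j}] \in \mathbb{R}^{3+d_2}$, where $x_{c_j} = x_{r_j} + \Delta x_j$ and $f_{c_j} = f^t_{r_j} + \Delta f^t_{r_j}$. The loss for $\Delta x_j$ is termed

$$L_{reg} = \frac{1}{M^{ts}} \sum_j \|\Delta x_j - \Delta gt_j\| \cdot \mathbb{I}[r_j \text{ is on target}]. \quad (2)$$

Here, $\Delta gt_j$ denotes the ground-truth offset from $r_j$ to the target center; $\mathbb{I}[r_j \text{ is on target}]$ indicates that we only train those seeds located on the surface of the ground-truth target; $M^{ts}$ denotes the number of trained seeds.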
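The regression loss described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function name `vote_regression_loss` is hypothetical, and the Euclidean norm is an assumed choice because the text does not state which norm is used.

```python
import math


def vote_regression_loss(pred_offsets, seed_xyz, gt_center, on_target):
    """Average offset error over seeds that lie on the target surface.

    pred_offsets: predicted offset (dx, dy, dz) for each seed
    seed_xyz:     3D position of each seed
    gt_center:    ground-truth target center
    on_target:    True for seeds on the ground-truth target, else False
    Assumption: the error is measured with the Euclidean norm.
    """
    total, trained = 0.0, 0
    for dx, x, flag in zip(pred_offsets, seed_xyz, on_target):
        if not flag:
            continue  # background seeds are not supervised
        gt_offset = [c - p for c, p in zip(gt_center, x)]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(dx, gt_offset)))
        trained += 1
    return total / trained if trained else 0.0


seeds = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
preds = [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 0.0)]
loss = vote_regression_loss(preds, seeds, (0.0, 0.0, 0.0), [True, True, False])
```

In this toy case the first seed votes exactly for the center, the second is off by one meter, and the third is ignored as background, so the loss averages only the two on-target errors.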

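Clustering and target proposal. For each $c_j$, we use ball query to generate a cluster $T_j$ with radius $R$: $T_j = \{c_k \mid \|c_k - c_j\|_2 < R\}$. Since neighboring clusters may capture similar region-level context, we sample a subset of size $K$ from all potential target centers as cluster centroids for efficiency. In Sec. 4.3.3, P2B turns out to be robust to a wide range of $K$. Finally, we feed each $T_j$ into Θ (MLP-Maxpool-MLP) and obtain the target proposal $p^t_j$ with proposal-wise targetness score $s^p_j$ ($K$ proposals are generated in total):

$$\{(p^t_j, s^p_j)\}_{j=1}^{K} = \Theta(\{T_j\}_{j=1}^{K}). \quad (3)$$

$p^t_j$ has four parameters: the offsets of the 3D position and the rotation in the X-Y plane. We will detail how to learn Θ in Sec. 3.5.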
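The clustering step can be sketched as a brute-force ball query in plain Python. The function name `ball_query_clusters` and the use of a seeded random sampler are assumptions for illustration; a practical implementation would run a batched GPU kernel and cap the number of neighbors per cluster.

```python
import math
import random


def ball_query_clusters(centers, k, radius, seed=0):
    """Sample up to k centroids among the potential target centers and, for
    each, collect the indices of all centers closer than `radius`.
    Assumption: uniform random sampling of the centroids."""
    rng = random.Random(seed)
    centroid_ids = rng.sample(range(len(centers)), min(k, len(centers)))
    clusters = {}
    for j in centroid_ids:
        clusters[j] = [idx for idx, c in enumerate(centers)
                       if math.dist(c, centers[j]) < radius]
    return clusters


centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 5.0, 5.0)]
clusters = ball_query_clusters(centers, k=4, radius=0.3)
```

Every cluster contains its own centroid, so an isolated center such as the last one still forms a singleton cluster that Θ can score.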

3.4. Improved target proposal with seed-wise targetness score

We consider that each seed with a target-specific feature can be directly assessed with its targetness to 1) regularize earlier feature learning and 2) strengthen the representation of its predicted potential target center. Therefore, we can obtain target proposals of higher quality.

Seed-wise targetness score $s^s$. We learn an MLP to generate $s^s_j$ for each $r_j$. Those search area seeds located on the surface of the ground-truth target are regarded as positives, and the rest as negatives. We use a standard binary cross entropy loss $L_{cla}$ for $s^s$. Since $s^s_j$ tightly relates to $f^t_{r_j}$, $L_{cla}$ can explicitly constrain the point feature learning and the subsequent target-specific feature augmentation.
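A standard binary cross entropy over seed-wise scores can be written as follows. This is a minimal sketch: the function name `bce_loss` is hypothetical, and it assumes the scores have already been mapped to probabilities in (0, 1), for example by a sigmoid, which the text does not specify.

```python
import math


def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross entropy between predicted targetness probabilities
    and 0/1 labels (1 = seed on the ground-truth target surface).
    Assumption: `scores` are probabilities, not raw logits."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)


good = bce_loss([0.9, 0.1], [1, 0])  # confident and correct
bad = bce_loss([0.1, 0.9], [1, 0])   # confident and wrong
```

The loss is small when on-target seeds receive high scores and background seeds receive low ones, which is the signal that regularizes the earlier feature learning.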


Improved target proposal. To inherit more discriminative power from $s^s_j$, we update $c_j$'s representation with $[s^s_j; x_{c_j}; f_{c_j}] \in \mathbb{R}^{1+3+d_2}$. Subsequently, we update the clusters with ball query and the target proposals with Equation (3). We consider that $s^s$ can implicitly help pick out representative potential target centers to benefit the final target proposal.

3.5. Final target verification

With the $K$ proposals generated above (refer to Θ in Equation (3)), the proposal with the highest proposal-wise targetness score is verified as the final tracking result.

We follow VoteNet [28] to learn Θ. Specifically, we consider proposals whose centers are near the target center (within 0.3 meters) as positives and those far away (by more than 0.6 meters) as negatives. Other proposals are left unpenalized. We use a standard binary cross entropy loss $L_{prop}$ for $s^p_j$. As for $p^t_j$, only the positives' box parameters are supervised via a Huber (smooth-L1 [31]) loss $L_{box}$. We aggregate all the mentioned losses as our final loss $L$:

$$L = L_{reg} + \gamma_1 L_{cla} + \gamma_2 L_{prop} + \gamma_3 L_{box}.$$

Here $\gamma_1$ (= 0.2), $\gamma_2$ (= 1.5), and $\gamma_3$ (= 0.2) are used to normalize all the component losses to the same scale.
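The labeling rule, the loss aggregation, and the final verification can be sketched together. All function names below are hypothetical, and the weighted-sum form of the total loss (with the regression term unweighted) is an assumption inferred from the three weights and four component losses named in the text.

```python
import math


def label_proposals(proposal_centers, gt_center, near=0.3, far=0.6):
    """1 = positive (closer than `near` meters), 0 = negative (farther than
    `far` meters), None = left unpenalized."""
    labels = []
    for c in proposal_centers:
        d = math.dist(c, gt_center)
        labels.append(1 if d < near else 0 if d > far else None)
    return labels


def total_loss(l_reg, l_cla, l_prop, l_box, g1=0.2, g2=1.5, g3=0.2):
    """Assumed aggregation: unweighted vote regression loss plus the three
    weighted terms."""
    return l_reg + g1 * l_cla + g2 * l_prop + g3 * l_box


def verify(proposals, scores):
    """Return the proposal with the highest proposal-wise targetness score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return proposals[best]


labels = label_proposals(
    [(0.1, 0.0, 0.0), (0.45, 0.0, 0.0), (1.0, 0.0, 0.0)], (0.0, 0.0, 0.0))
final_box = verify(["box_a", "box_b", "box_c"], [0.2, 0.7, 0.1])
```

The middle proposal falls between the two thresholds and therefore contributes nothing to the proposal-wise classification loss, which avoids penalizing ambiguous cases.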


4. Experiments

We used the KITTI tracking dataset [10] (with point clouds scanned using lidar) as the benchmark. We followed the settings in [11] (abbreviated as SC3D for simplicity) for data split, tracklet generation, and evaluation metric for fair comparisons. Since cars appear in KITTI in the largest quantity and diversity, we mainly focused on car tracking and performed the ablation study on it, as in SC3D. We also ran extensive experiments with three other target types (Pedestrian, Van, Cyclist) for better comparisons.

4.1. Experimental setting

4.1.1 Dataset

Since the ground truth for the test set in KITTI is inaccessible offline, we used its training set to train and test P2B. This tailored dataset had 21 outdoor scenes and 8 types of targets. We generated tracklets for target instances within all videos and split the dataset as follows: scenes 0-16 for training, 17-18 for validation, and 19-20 for testing.

Point cloud's sparsity. Though each frame reports an average of 120k points, we suppose that the points on a target might be quite sparse because of frequent occlusion and lidar's weakness on distant objects. To validate this idea, we counted the number of points on KITTI's cars in Fig. 5. We can observe that about 34% of cars held fewer than 50 points. The situation may be worse for smaller pedestrians and cyclists. This sparsity imposes a great challenge on point cloud based 3D object tracking.

Frames containing the same target instance, e.g., a car, are concatenated in time order to form a tracklet.

Figure 5. Histogram of the number of points on KITTI's cars, illustrating the sparsity of points on targets.

4.1.2 Evaluation metric

We used One Pass Evaluation (OPE) [38] to measure the Success and Precision of different methods. "Success" is defined as the IOU between the predicted box and the ground-truth (GT) box. "Precision" is defined as the AUC for errors (the distance between the two boxes' centers) from 0 to 2 m.
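The two quantities behind these metrics can be sketched in plain Python. This is only an illustration: the function names are hypothetical, the IoU below assumes axis-aligned boxes (the tracked boxes also carry a rotation in the X-Y plane, which this sketch ignores), and the AUC is approximated by averaging the precision curve over evenly spaced thresholds.

```python
def iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax). Rotation is ignored (assumption)."""
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    return inter / (volume(box_a) + volume(box_b) - inter)


def precision_auc(center_errors, max_error=2.0, steps=21):
    """Approximate area under the precision curve: for each threshold in
    [0, max_error], the fraction of frames whose center error is within it."""
    thresholds = [max_error * i / (steps - 1) for i in range(steps)]
    curve = [sum(e <= t for e in center_errors) / len(center_errors)
             for t in thresholds]
    return sum(curve) / len(curve)


overlap = iou_axis_aligned((0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2))
auc = precision_auc([1.0])
```

A tracker that always hits the center exactly gets a precision AUC of 1, and one whose error always exceeds 2 m gets 0.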


4.1.3 Implementation details

Template and search area. For the template, we collected its points and normalized their number to $N_1 = 512$ by randomly discarding or duplicating points. For the search area, we similarly collected the points and normalized their number to $N_2 = 1024$. The ways to generate the template and the search area differ between training and testing, as detailed below.
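The point-count normalization can be sketched as follows. The function name `regularize_points` is hypothetical, and the exact sampling policy (sampling without replacement when there are too many points, appending random duplicates when there are too few) is an assumption consistent with "randomly discarding or duplicating".

```python
import random


def regularize_points(points, target_size, seed=0):
    """Return exactly `target_size` points by randomly dropping surplus
    points or randomly duplicating existing ones."""
    if not points:
        raise ValueError("cannot regularize an empty point cloud")
    rng = random.Random(seed)
    if len(points) >= target_size:
        return rng.sample(points, target_size)
    extra = [rng.choice(points) for _ in range(target_size - len(points))]
    return list(points) + extra


sparse = [(float(i), 0.0, 0.0) for i in range(100)]
dense = [(float(i), 0.0, 0.0) for i in range(2000)]
template_small = regularize_points(sparse, 512)
template_large = regularize_points(dense, 512)
```

Fixing the size lets templates and search areas be batched as tensors of constant shape regardless of how many lidar points actually fall on the target.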


Network architecture. We adopted PointNet++ [30] as our backbone. We tailored it to contain three set-abstraction (SA) layers, with receptive radii of 0.3, 0.5, and 0.7 meters, each halving the number of points. This yielded $M_1 = 64$ ($= N_1/2^3$) template seeds and $M_2 = 128$ ($= N_2/2^3$) search area seeds. We applied random sampling and removed the up-sampling layers of PointNet++ because of the points' sparsity. The output feature had $d_1 = 256$ dimensions.

Throughout our method, all MLPs had three layers. The size of these layers was 256 (hence $d_2 = 256$), except for the last layer ($size_{ly}$) in the following MLPs:

• For the MLP that predicts $s^s$, $size_{ly} = 1$.

• For Θ, which predicts $s^p$ and $p^t$, $size_{ly} = 5$.

Clustering. $K = 64$ randomly sampled potential target centers each clustered their neighbors within $R = 0.3$ meters.

Training. 1) Data augmentation: we applied a random offset to the previous GT and fused the point clouds within the resulting box and the first GT to obtain more template samples; we enlarged the current GT by 2 meters to include background (negative seeds), applied a similar random offset, and collected the point cloud inside to obtain more search area samples. 2) We trained P2B from scratch with the augmented samples. We used the Adam optimizer [17]. The learning rate was initially 0.001 and was reduced by a factor of 5 after 10 epochs. The batch size was 32. In practice, we observed that P2B converged to a satisfying result after about 40 epochs.

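Table 2. Comprehensive comparison with SC3D. The right three columns differ in the way the search area is generated.

Table 3. Extensive comparison with SC3D. The right five columns show results for different target types and their mean. Columns: Method, Car, Pedestrian, Van, Cyclist, Mean.

Template and search area are in the form of point clouds. GT and result are in the form of 3D boxes.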

Testing. We used the trained P2B to infer 3D bounding boxes within tracklets frame by frame. For the current frame, the template initially adopted the first GT's point cloud and afterwards the fusion of the first GT's and the previous result's point clouds. We enlarged the previous result by 2 meters in the current frame and collected the point cloud inside to obtain the search area.
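The test-time search-area and template construction can be sketched as below. Both function names are hypothetical, and two simplifications are assumed: the previous box is treated as axis-aligned (its yaw is ignored), and "enlarged by 2 meters" is interpreted as a 2 m margin on every side.

```python
def crop_search_area(points, box_center, box_size, margin=2.0):
    """Keep the points that fall inside the previous result box after it is
    enlarged by `margin` meters on every side.
    Assumptions: axis-aligned box, margin applied per side."""
    half = [s / 2.0 + margin for s in box_size]
    return [p for p in points
            if all(abs(p[i] - box_center[i]) <= half[i] for i in range(3))]


def fuse_template(first_gt_points, prev_result_points):
    """Template for the current frame: points inside the first GT box plus
    points inside the previous predicted box."""
    return list(first_gt_points) + list(prev_result_points)


frame = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
search = crop_search_area(frame, box_center=(0.0, 0.0, 0.0), box_size=(4.0, 2.0, 2.0))
template = fuse_template([(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0), (0.2, 0.0, 0.0)])
```

The cropped and fused point sets would then be normalized to fixed sizes before being fed to the network.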


4.2. Comprehensive comparisons

We only compared P2B with SC3D [11], the first and only work on point cloud based 3D object tracking. We reported results for 3D car tracking in Table 2.

We generated the search area centered on the previous result, the previous GT, or the current GT. Using the previous result as the search center meets the requirement of real scenarios, while using the previous GT helps to approximately assess short-term tracking performance. For these two situations, SC3D applies Kalman filtering to generate proposals. Using the current GT is unreasonable, but it is considered in SC3D to approximate exhaustive search and assess SC3D's discriminative power. Specifically, SC3D conducts grid search around the target center to include the GT box in the generated proposals. P2B, however, clusters potential target centers to generate proposals without explicit dependence on the GT box. That is, P2B may adapt to various scenarios, while SC3D could degrade when the GT boxes are removed, as demonstrated in Table 2. Overall, P2B outperformed SC3D by a large margin. All later experiments adopted the more realistic setting of using the previous result ("Testing" in Sec. 4.1.3).

Extensive comparisons. We further compared P2B with SC3D on Pedestrian, Van, and Cyclist (Table 3). P2B outperformed SC3D by ~10% on average. P2B's advantage was significant on the data-rich Car and Pedestrian classes. But P2B degraded when training data decreased, as was the case for Van and Cyclist. We conjecture that P2B may rely on more data to learn better networks, especially when regressing potential target centers. Comparatively, SC3D needs relatively less data to measure the similarity between two regions. To validate this, we used the model trained on the data-rich Car class to test on Van, in the belief that a car resembles a van and contains potentially transferable information. As expected, the Success/Precision result of P2B improved to 49.9/59.9 (original: 40.8/48.4), while that of SC3D declined to 37.2/45.9 (original: 40.4/47.0).

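Table 4. Different ways for target-specific feature augmentation (tsfa). The ways to obtain search area features A and B are shown in Fig. 6.

Figure 6. Two ways to include search area features in target-specific feature augmentation. For A, we duplicate the features of the search area seeds and append them after the template features, which are duplicated along each column of the similarity map; for B, we concatenate the search area features with the features after Maxpool (Fig. 4).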

4.3. Ablation study

4.3.1 Ways for target-specific feature augmentation

Besides our default setting in P2B (Sec. 3.2), there are another four possible ways for feature augmentation: removing (the duplication of) template features, removing the similarity map, and using search area feature A or B (Fig. 6).

We compared the five settings in Table 4. Removing the template features or the similarity map degraded performance by about 1% or 3%, which validates the contributions of these two parts in our default setting. Search area features A and B did not improve performance and could even harm it. Note that we already combined template features in both conditions. This may reveal that search area features only capture spatial context rather than the target clue, and hence turn out useless for target-specific feature augmentation. In comparison, our default setting brings a richer target clue from the template seeds to yield a more "directed" proposal generation.

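Table 6. Different ways for template generation. "The first & previous" denotes "the first GT and previous result".

Figure 7. Illustration of seed-wise targetness scores and potential target centers. Green lines show the projection from seeds (colored points in the first row) to potential target centers (colored points in the second row). We mark the informative points, i.e., those with higher targetness scores, in red and the opposite ones in yellow. Paired seeds and potential centers are marked with the same color to show their correspondence.

Figure 8. Results with different numbers of proposals, showing that our method is compatible with a wide range of this parameter.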

4.3.2 Effectiveness of seed-wise targetness

In Sec. 3.4, we obtain seed-wise targetness scores $s^s$ and concatenate them with potential target centers to guide the proposal and verification. Here we tested P2B without this concatenation, or even without the whole branch of $s^s$ (Table 5). We can observe that leaving out the concatenation dropped the performance by ~1%, while removing the whole branch dropped it by ~3%. This verifies that $s^s$ offers good supervision on learning the whole network for improved target proposal and verification.

4.3.3 Robustness with different number of proposals

We tested P2B (without re-training) and SC3D with different numbers of proposals. From the results in Fig. 8, P2B obtained satisfying results even with only 20 proposals, but SC3D degraded dramatically when using fewer than 40 proposals. To conclude, P2B is more robust to a smaller number of proposals, showing that P2B can generate proposals with both higher quality and higher efficiency.

4.3.4 Ways for template generation

For template generation, SC3D concatenates the points in all previous results, while P2B concatenates the points within the first GT and the previous result to update the template for efficiency. Here we reported results with four settings for template generation: the first GT, the previous result, the fusion of the first GT and the previous result, and all previous results. The results in Table 6 show P2B's consistent advantage over SC3D in all settings, even in "All previous shapes", where P2B reported a degraded result. We attribute the degradation to the facts that 1) we did not include shape completion [11] and 2) we did not train P2B with all previous results, while SC3D considered both.

4.4. Qualitative analysis

4.4.1 Advantageous cases

We first exemplified the discriminative power of our target-specific features in Fig. 7. The first row visualizes the seeds' targetness scores to demonstrate their possibility of belonging to the target (Car). We can observe that P2B had learnt to discriminate the target seeds from the background ones. The second row visualizes how P2B projects seeds to potential target centers. We can observe that the potential centers with more target information gathered tightly around the GT target center, which further validates our discriminative target-specific features. Besides, P2B can address occlusion because it can generate groups of informative potential target centers for the final prediction.

We then visualize P2B's advantage over SC3D in addressing the sparsity of point clouds in Fig. 9. We can observe that in the sparse scenarios where SC3D tracked off course or even failed, our predicted box held tight to the target center.

4.4.2 Failure cases

Here we searched for tracklets where P2B failed and found that most failure cases arose when the initial template in the first frame was too sparse and hence yielded little target information. As exemplified in Fig. 10, when P2B faced such a case and tracked off course in a cluttered background, points from the initial template could not correct the current erroneous predictions or re-obtain an informative template. This failure may also reveal that P2B inherits target information from the template instead of the search area.

We believe that when fed with more points containing potentially rich target information, P2B could generate proposals of higher quality to yield better results. Our intuition is validated in Fig. 11.

4.5. Running speed

Here we averaged the running time over all test frames for Car to measure P2B's speed. P2B achieved 45.5 FPS, including 7.0 ms for processing the point cloud, 14.3 ms for network forward propagation, and 0.9 ms for post-processing, on a single NVIDIA 1080Ti GPU. SC3D in its default setting ran at 1.8 FPS on the same platform.

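5. Conclusions

In this work we propose a novel point-to-box (P2B) network for 3D object tracking. We focus on embedding the target information within the template into the search space and formulate an end-to-end method for joint point-driven target proposal and verification. P2B operates on sampled seeds instead of 3D boxes, which reduces the search space by a large margin. Experiments justify the superiority of our proposition. The experiments also reveal that P2B needs more data to obtain satisfying results. We can therefore look forward to a less data-dependent P2B, or collect more data to address this problem in the era of big data. Besides, we can seek better feature augmentation approaches on the search area and test our method in more challenging scenarios.

Figure 11. Influence of the number of points on the car in the first frame on our method. We computed the average Success for each interval (horizontal axis) in the test set.

Acknowledgment. This work is jointly supported by the National Natural Science Foundation of China (Grant No. U1913602, 61876211, and 61502187), the Equipment Pre-research Field Fund of China (Grant No. 61403120405), the National Key Laboratory Open Fund of China (Grant No. 6142113180211), and the Fundamental Research Funds for the Central Universities (Grant No. 2019kfyXKJC024).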
