  • Spatiotemporal Transformer for Video-based Person Re-identification

    Abstract

    1. Key issue: how to extract discriminative information from a tracklet.
    2. Problem: a vanilla Transformer overfits on this task.
    3. Solution: a pipeline that pre-trains on synthesized video data and then transfers to the downstream domain with a perception-constrained Spatiotemporal Transformer (STT) module and a Global Transformer (GT) module.

    [Figure: STT]

    STT & GT

    1. A constrained attention learning scheme is proposed to prevent the Transformer from over-focusing on local regions.
    2. A two-stage design separates spatial (S) and temporal (T) information to reduce overfitting. However, patches in one frame cannot communicate with patches in another frame, so a global attention learning branch is added to associate patches across frames.

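    An impromptu joke: a crowd of Shadow Fiends is queuing up to solo, going in one by one. Which SF is the strongest? -- The p-th one, because the p-th Shadow Fiend is the strongest.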

    Constrained Attention Learning

    Constraints are imposed on the Spatial Transformer (ST) and the Temporal Transformer (TT) to relieve overfitting.

    Feature: the feature map is flattened into H x W one-dimensional tokens, plus one extra classification token (H x W + 1 tokens in total).

    Spatial constraint loss \(L_{SpaC} = L_{spa\_part\_xent} + L_{spa\_xent}\):

    1. \(L_{spa\_xent}\) -- cross-entropy loss for learning a discriminative representation.
    2. \(L_{spa\_part\_xent}\) -- because the dataset is small for a Transformer, ST may focus on limited regions and ignore detailed cues. To avoid this, a spatial part cross-entropy loss is proposed: the tokens are divided horizontally into P groups, as in some re-id methods, and average pooling is applied within each group. \(L_{spa\_part\_xent} = \frac{1}{P}\sum\limits_{p=1}^{P} L_{spa\_part\_xent}^{(p)}\), where p denotes the p-th group.
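
    The part-based loss can be sketched in plain Python. This is only an illustration under assumptions that are not in the notes: tokens are stored in row-major order, "horizontal" groups are stripes of consecutive rows, H is divisible by P, and each part has its own classifier producing logits. The names `part_pool`, `cross_entropy` and `spa_part_xent` are hypothetical.

```python
import math

def part_pool(tokens, H, W, P):
    """Split an H x W grid of patch tokens into P horizontal stripes
    and average-pool each stripe.

    tokens: list of H*W feature vectors in row-major order
            (the classification token is excluded).
    Returns a list of P pooled feature vectors.
    """
    # Assumption: H is divisible by P so every stripe has the same height.
    assert len(tokens) == H * W and H % P == 0
    rows_per_part = H // P
    dim = len(tokens[0])
    parts = []
    for p in range(P):
        start = p * rows_per_part * W
        group = tokens[start:start + rows_per_part * W]
        parts.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return parts

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def spa_part_xent(part_logits, label):
    """Average of the per-part cross-entropy losses: (1/P) * sum_p L^(p)."""
    return sum(cross_entropy(z, label) for z in part_logits) / len(part_logits)

# Toy grid: H=4, W=2, 1-dim features 0..7, split into P=2 stripes.
pooled = part_pool([[float(i)] for i in range(8)], H=4, W=2, P=2)
print(pooled)  # [[1.5], [5.5]]
```

    In a real model the pooled part features would each pass through a classifier before `spa_part_xent` is applied; that step is omitted here.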

    Temporal constraint loss \(L_{TemC} = L_{tem\_trip} + L_{tem\_attn}\):

    1. \(L_{tem\_xent}\) -- supervises the final output.
    2. \(L_{tem\_trip}\) -- shrinks the distances between positive pairs (in the same tracklet).
    3. \(L_{tem\_attn} = \sum\limits_{i=1}^{N}\left[\exp\left(\sum\limits_{k=1}^{L}\alpha_{i,k}\log(\alpha_{i,k})\right) - \alpha\right]_{+}\) -- increases the information entropy of the attention weights in each tracklet, while the margin parameter \(\alpha\) leaves the Transformer room to decide which frames are more critical.
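
    A minimal sketch of the temporal attention loss, to show how the hinge behaves. The inner sum of a * log(a) is the negative entropy of a tracklet's attention weights, so its exponential is 1 for one-hot attention and 1/L for uniform attention over L frames. The function name `tem_attn_loss` and the margin value 0.5 are illustrative assumptions; the notes do not give the value of alpha.

```python
import math

def tem_attn_loss(attn, alpha):
    """Sum over tracklets of [exp(sum_k a_k * log(a_k)) - alpha]_+ .

    attn:  list of N attention distributions, one per tracklet,
           each over the L frames of that tracklet.
    alpha: margin; a tracklet is penalized only while
           exp(-entropy) exceeds it.
    """
    total = 0.0
    for weights in attn:
        # 0 * log(0) is treated as 0, the usual entropy convention.
        neg_entropy = sum(a * math.log(a) for a in weights if a > 0.0)
        total += max(math.exp(neg_entropy) - alpha, 0.0)
    return total

uniform = [0.25, 0.25, 0.25, 0.25]  # exp(-entropy) = 0.25, below the margin
one_hot = [1.0, 0.0, 0.0, 0.0]      # exp(-entropy) = 1.0, penalized
print(tem_attn_loss([uniform], alpha=0.5))  # 0.0
print(tem_attn_loss([one_hot], alpha=0.5))  # 0.5
```

    Once a tracklet's attention is spread enough that exp(-entropy) drops below alpha, the hinge is zero and the loss stops pushing toward uniform weights.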

    Global Attention Learning

    The aim is to establish the relationships between patches of different frames, which are ignored in the former design.

    Components: a Global Transformer (GT) module that takes the H x W patches of all L frames in a tracklet as its input. Together with one extra classification token, H x W x L + 1 tokens are fed into GT.
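
    To make the token count concrete, the sketch below builds the GT input sequence from per-frame patch tokens. Placing the classification token first is an assumption (the notes do not say where it sits), and `build_global_tokens` is a hypothetical helper name.

```python
def build_global_tokens(frames, cls_token):
    """Flatten the patch tokens of all frames into one sequence.

    frames:    list of L frames, each a list of H*W patch tokens.
    cls_token: the extra classification token.
    Returns a sequence of H*W*L + 1 tokens.
    """
    tokens = [cls_token]  # assumption: classification token goes first
    for frame in frames:
        tokens.extend(frame)
    return tokens

# Toy sizes, chosen only for illustration: H=2, W=3, L=4.
H, W, L = 2, 3, 4
frames = [[(f, i) for i in range(H * W)] for f in range(L)]
sequence = build_global_tokens(frames, cls_token="cls")
print(len(sequence))  # 25, i.e. H * W * L + 1
```

    Because every patch of every frame sits in the same sequence, self-attention in GT can relate patches across frames, which the per-frame spatial stage cannot do.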

    Then, a cross-entropy loss \(L_{global\_xent}\) is adopted to supervise the learning of GT.

    The final representation is generated by the concatenation of outputs of STT and GT.

    Synthesized Video Pre-training

    1. Adopt the UnrealPerson toolkit, with 4 environments x 34 cameras.
    2. Add a perturbation so that persons may not appear in the center of the frame.
    3. Add a perturbation so that severely occluded frames are also kept.

    Implementation

    • Architecture
      1. CNN baseline -- the first 4 residual blocks of ResNet-50.
      2. CNN + Transformer -- the 4th residual block is replaced by Transformer blocks.
      3. ST and TT share the same architecture design, with 1 layer and 6 heads.
      4. The Global Transformer has 2 layers and 6 heads.
    • Workflow:
      • The output feature maps of the CNN backbone go through a conv layer and are flattened to patch tokens.
      • The embedding dimension of all Transformers is set to 768.
      • Positional embeddings are only used in ST.
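
    The hyperparameters listed above can be collected into a small configuration sketch. The class and field names are hypothetical. The per-head dimension assumes the embedding is split evenly across heads, as is standard in multi-head attention; the notes do not state it explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    layers: int
    heads: int
    embed_dim: int = 768         # shared by all Transformers
    use_pos_embed: bool = False  # positional embeddings are only used in ST

    @property
    def head_dim(self):
        # Assumption: embed_dim is split evenly across attention heads.
        assert self.embed_dim % self.heads == 0
        return self.embed_dim // self.heads

ST = TransformerConfig(layers=1, heads=6, use_pos_embed=True)
TT = TransformerConfig(layers=1, heads=6)
GT = TransformerConfig(layers=2, heads=6)

print(ST.head_dim)  # 128
```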

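    Idea: would it work better to swap a person's background and then pull the pair together with the triplet loss? Synthetic data such as UnrealPerson is well suited to swapping backgrounds directly; otherwise masks or something similar would have to be produced.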

    Ablation Study

    1. Spatial Attention

    [Figure: spatial attention]

    2. Temporal Attention

    [Figure: temporal attention]

    Both of them regularize the attention regions to avoid overfitting.

    3. Variable Controlling

    [Figure: ablation]

    4. Results

    [Figure: results]

  • Original source: https://www.cnblogs.com/ZhengPeng7/p/14606608.html