  • Deep Generative Video Compression (NeurIPS 2019)

    Based on VAE

    Steps:

    1. Transform a sequence of frames \(x_{1:T}=(x_1,...,x_T)\) into a sequence of latent states \(z_{1:T}\) and, optionally, a global state \(f\). This transformation is lossy, but the video is not yet optimally compressed, because correlations remain among the latent variables.
    2. The latent states must therefore be entropy-coded into a binary representation.
    3. The bit stream can then be sent to a receiver, where it is decoded back into video frames.

    (Q: Why transform to latent variables first? Why not entropy-code the frames directly?)

    So we need two models:

    1. an optimal lossy transformation into the latent space;
    2. a predictive model required for entropy coding.

    The temporal model is the most important component for video, because videos exhibit strong temporal correlations in addition to the spatial correlations present in images.

    So the paper proposes to learn a temporally-conditioned prior distribution, parameterized by a deep generative model, to efficiently code the latent variables associated with each frame.

    Notation:
    \(x_{1:T}=(x_1,...,x_T)\) = video sequence, \(z_{1:T}\) = associated latent variables, \(f\) = global variables (optional)

    Arithmetic coding:
    Coding the entire sequence of discretized latent states \(z_{1:T}\) into a single number: conditional probabilities \(p(z_t|z_{<t})\) are used to iteratively refine the real interval \([0,1)\) into a progressively smaller one. (Q: How exactly is the interval refined?) Once a final (very small) interval is obtained, a binarized floating-point number from that interval is stored, encoding the entire sequence of latents.

    Decoder: f(latent)=data
    Use a stochastic recurrent variational autoencoder that transforms a sequence of local latent variables \(z_{1:T}\) and a global state \(f\) into the frame sequence \(x_{1:T}\):

    \[p_\theta(x_{1:T},z_{1:T},f)=p_\theta(f)\,p_\theta(z_{1:T})\prod_{t=1}^{T}p_\theta(x_t|z_t,f)\]

    \[p_\theta(x_t|z_t,f)=\mathrm{Laplace}\big(\mu_\theta(z_t,f),\,\lambda^{-1}\mathbf{1}\big)\quad\text{(frame likelihood)}\\ \widetilde{x}_t=\mu_\theta(z_t,f)=\text{decoder mean}\]
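    With a Laplace likelihood, maximizing \(\log p_\theta(x_t|z_t,f)\) is the same as minimizing an L1 reconstruction error scaled by \(\lambda\). A minimal numpy sketch with toy values (not the paper's code):

```python
import numpy as np

def laplace_log_likelihood(x, mu, lam):
    # Elementwise Laplace with mean mu and scale 1/lam:
    # log p(x) = log(lam / 2) - lam * |x - mu|, summed over pixels,
    # so maximizing it minimizes the lam-weighted L1 error.
    return float(np.sum(np.log(lam / 2.0) - lam * np.abs(x - mu)))

x = np.array([0.0, 1.0])   # toy "frame"
mu = np.zeros(2)           # toy decoder mean
ll = laplace_log_likelihood(x, mu, lam=2.0)
```

    A perfect reconstruction (`mu == x`) maximizes the log-likelihood, which is why this term plays the role of the distortion in the rate-distortion objective below.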

    Encoder:
    Use amortized variational inference to predict a distribution over latent codes given the input video:

    \[q_\phi(z_{1:T},f|x_{1:T})=q_\phi(f|x_{1:T})\prod_{t=1}^{T}q_\phi(z_t|x_t)\]

    A fixed-width uniform distribution centered on the mean is used:

    \[\widetilde{f}\sim q_\phi(f|x_{1:T})=\mathcal{U}\big(\hat{f}-\tfrac{1}{2},\,\hat{f}+\tfrac{1}{2}\big)\\ \widetilde{z}_t\sim q_\phi(z_t|x_t)=\mathcal{U}\big(\hat{z}_t-\tfrac{1}{2},\,\hat{z}_t+\tfrac{1}{2}\big)\]
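    Sampling from this unit-width uniform posterior is the usual additive-noise relaxation of quantization: during training \(\hat{z}\) receives \(\mathcal{U}(-\tfrac12,\tfrac12)\) noise so gradients can flow, while at test time it is rounded to the nearest integer so the latents take the discrete values the entropy coder expects. A small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_quantize(z_hat, training):
    # Training: sample from U(z_hat - 1/2, z_hat + 1/2), the fixed-width
    # uniform posterior; test time: round to the nearest integer so the
    # latent can be entropy-coded.
    if training:
        return z_hat + rng.uniform(-0.5, 0.5, size=z_hat.shape)
    return np.round(z_hat)

z_hat = np.array([0.2, 1.7, -0.4])
z_train = soft_quantize(z_hat, training=True)    # within 1/2 of z_hat
z_test = soft_quantize(z_hat, training=False)    # integers: [0., 2., -0.]
```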

    The means are produced by additional encoder neural networks:

    \[\hat{f}=\mu_\phi(x_{1:T})\\ \hat{z}_t=\mu_\phi(x_t)\]

    The mean for the global state is parameterized by convolutions over \(x_{1:T}\), followed by a bi-directional LSTM whose output is processed by an MLP.

    The encoder mean for the local state is simpler, consisting of convolutions over each frame followed by an MLP.

    The paper assumes that the global prior \(p_\theta(f)\) is fixed, while \(p_\theta(z_{1:T})\) is given by a temporal sequence model:

    \[p_\theta(f)=\prod_{i}^{\dim(f)}p_\theta(f^i)*\mathcal{U}\big(-\tfrac{1}{2},\tfrac{1}{2}\big)\\ p_\theta(z_{1:T})=\prod_{t}^{T}\prod_{i}^{\dim(z)}p_\theta(z_t^i|z_{<t})*\mathcal{U}\big(-\tfrac{1}{2},\tfrac{1}{2}\big)\]
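    The \(*\,\mathcal{U}(-\tfrac12,\tfrac12)\) convolution means the prior assigns each integer latent value the mass of a unit-width bin, computable from the underlying CDF. A sketch with a standard normal standing in for the learned density:

```python
import math

def box_convolved_mass(cdf, k):
    # Convolving a density with U(-1/2, 1/2) and evaluating at integer k
    # yields cdf(k + 1/2) - cdf(k - 1/2): exactly the per-symbol
    # probability mass the arithmetic coder needs for a quantized latent.
    return cdf(k + 0.5) - cdf(k - 0.5)

# Stand-in prior density: standard normal (the paper learns this instead).
normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

masses = [box_convolved_mass(normal_cdf, k) for k in range(-5, 6)]
```

    The masses over the integer grid sum to (essentially) one, so they form a valid discrete distribution for entropy coding.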

    There are two ways to model the latent-variable sequence \(z_{1:T}\):

    1. A recurrent LSTM prior architecture for \(p_\theta(z_t^i|z_{<t})\), which conditions on all previous frames in a segment.
    2. A single-frame context, \(p_\theta(z_t^i|z_{<t})=p_\theta(z_t^i|z_{t-1})\), which is essentially a Kalman filter.
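    The two choices differ only in how much context the conditional sees. With toy Gaussian conditionals (hypothetical stand-ins for the learned networks), both factorizations reduce to a sum of per-step log-probabilities:

```python
import math

def gauss_logpdf(x, mean, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def seq_logprob(z, context_mean):
    # context_mean maps the past latents z_{<t} to a predicted mean.
    total = gauss_logpdf(z[0], 0.0)              # first step: no context
    for t in range(1, len(z)):
        total += gauss_logpdf(z[t], context_mean(z[:t]))
    return total

markov = lambda past: past[-1]                   # single-frame (Kalman-like)
full = lambda past: sum(past) / len(past)        # toy stand-in for "all past"

z = [0.0, 0.5, 1.0, 1.5]
lp_markov = seq_logprob(z, markov)
lp_full = seq_logprob(z, full)
```

    For this steadily drifting toy sequence the single-frame context predicts each step better, so `lp_markov > lp_full`; which kind of context wins on real video is what the two prior architectures trade off.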

    The encoder (variational model) and decoder (generative model) are learned jointly by maximizing the \(\beta\)-VAE objective:

    \[\mathcal{L}(\phi,\theta)=\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T}\sim q_\phi}\big[\log p_\theta(x_{1:T}|\widetilde{f},\widetilde{z}_{1:T})\big]+\beta\,\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T}\sim q_\phi}\big[\log p_\theta(\widetilde{f},\widetilde{z}_{1:T})\big]\]

    The first term represents distortion; the second is the cross-entropy between the approximate posterior and the prior (the rate term).
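    The trade-off can be sketched by Monte-Carlo estimating both terms for a single frame, with a toy identity decoder and a standard-normal stand-in prior (all numbers illustrative; the real model uses learned networks):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def objective(x, z_hat, lam=1.0, beta=0.1, n_samples=64):
    # E_q[log p(x | z)] + beta * E_q[log p(z)], estimated by sampling the
    # fixed-width uniform posterior around z_hat.
    normal_logpdf = lambda z: -0.5 * (z ** 2 + math.log(2 * math.pi))
    total = 0.0
    for _ in range(n_samples):
        z = z_hat + rng.uniform(-0.5, 0.5, size=z_hat.shape)  # posterior sample
        x_tilde = z                                           # toy decoder mean
        distortion = np.sum(np.log(lam / 2.0) - lam * np.abs(x - x_tilde))
        rate = np.sum(normal_logpdf(z))                       # log-prior term
        total += distortion + beta * rate
    return float(total / n_samples)

x = np.array([0.1, -0.3, 0.8])
elbo = objective(x, z_hat=x.copy())   # objective to be maximized
```

    Raising `beta` weights the log-prior (rate) term more heavily, trading reconstruction quality for a lower bitrate.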

    Model: (architecture figure not preserved)

  • Original post: https://www.cnblogs.com/hhhhhxh/p/13198639.html