  • Deep Generative Video Compression (NeurIPS 2019)

    Based on VAE

    Steps:

    1. Transform a sequence of frames \(x_{1:T}=(x_1,\dots,x_T)\) into a sequence of latent states \(z_{1:T}\) and optionally a global state \(f\). This transformation is lossy, but the video is not yet optimally compressed, as correlations remain among the latent-space variables.
    2. The latents must therefore be entropy coded into a binary stream.
    3. The bit stream can then be sent to a receiver where it is decoded into video frames.
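    The three steps above can be sketched end to end with toy stand-in components (the pooling transform, the factorized symbol model, and the nearest-neighbour decoder are all illustrative assumptions, not the paper's networks):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    frames = rng.random((4, 8, 8))              # x_{1:T}: T=4 tiny frames

    # Step 1: lossy transform to discrete latents z_{1:T}
    # (stand-in: 2x2 mean pooling, then rounding to integers)
    latents = frames.reshape(4, 4, 2, 4, 2).mean(axis=(2, 4))
    z = np.round(latents * 8)                   # quantized latent states

    # Step 2: ideal code length under a (stand-in) factorized symbol model
    symbols, counts = np.unique(z, return_counts=True)
    p = counts / counts.sum()
    bits = -np.sum(counts * np.log2(p))         # entropy-coded size in bits

    # Step 3: receiver decodes z back to frames
    # (stand-in: nearest-neighbour upsampling of the dequantized latents)
    x_hat = np.repeat(np.repeat(z / 8, 2, axis=1), 2, axis=2)
    print(f"{bits:.1f} bits, MSE {np.mean((frames - x_hat) ** 2):.4f}")
    ```

    The latents here are already decorrelated only spatially; the paper's point is that a learned temporal prior over \(z_{1:T}\) makes step 2 far cheaper than this factorized model.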

    (Q: why transform to latent variables first? Why not entropy-code the frames directly?)

    So we need two models:

    1. optimal lossy transformation into the latent space.
    2. predictive model required for entropy coding.

    The temporal model is the most important one for video, because videos exhibit strong temporal correlations in addition to the spatial correlations present in images.

    So we propose to learn a temporally-conditioned prior distribution parameterized by a deep generative model to efficiently code the latent variables associated with each frame.

    Notation:
    \(x_{1:T}=(x_1,\dots,x_T)\) = video sequence, \(z_{1:T}\) = associated latent variables, \(f\) = global variables (optional)

    Arithmetic coding:
    Coding the entire sequence of discretized latent states \(z_{1:T}\) into a single number. Conditional probabilities \(p(z_t|z_{<t})\) are used to iteratively refine the real interval \([0,1)\) into a progressively smaller interval. (Q: how exactly is the interval refined?) After a final (very small) interval is obtained, a binarized floating-point number from that interval is stored to encode the entire sequence of latents.
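    The interval refinement can be sketched as follows. This is an illustrative floating-point version (a real coder uses finite-precision integer arithmetic with renormalization), and the fixed i.i.d. symbol model is a stand-in for the learned temporal prior:

    ```python
    # Each symbol's conditional probability p(z_t | z_{<t}) shrinks [low, high)
    # to the sub-interval assigned to that symbol, proportionally to its mass.
    def arithmetic_encode(symbols, cond_prob):
        low, high = 0.0, 1.0
        for t, s in enumerate(symbols):
            probs = cond_prob(symbols[:t])      # distribution over the alphabet at step t
            width = high - low
            cum = 0.0
            for sym, p in probs.items():
                if sym == s:
                    high = low + (cum + p) * width
                    low = low + cum * width
                    break
                cum += p
        return low, high                        # any number in [low, high) encodes the sequence

    # Stand-in "model": a fixed distribution, ignoring the history argument.
    model = lambda history: {"a": 0.7, "b": 0.3}
    low, high = arithmetic_encode("aab", model)
    print(low, high)  # final width 0.7 * 0.7 * 0.3 = 0.147, i.e. about log2(1/0.147) bits
    ```

    The better the model predicts each symbol, the wider the sub-intervals it selects, and the fewer bits the final number needs.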

    Decoder: f(latent)=data
    Use a stochastic recurrent variational autoencoder that transforms a sequence of local latent variables \(z_{1:T}\) and a global state \(f\) into the frame sequence \(x_{1:T}\).

    \[p_\theta(x_{1:T},z_{1:T},f)=p_\theta(f)\,p_\theta(z_{1:T})\prod_{t=1}^{T}p_\theta(x_t|z_t,f)\]

    \[p_\theta(x_t|z_t,f)=\mathrm{Laplace}(\mu_\theta(z_t,f),\lambda^{-1}\mathbf{1})\quad\text{(frame likelihood)}\\ \widetilde{x}_t=\mu_\theta(z_t,f)=\text{decoder mean}\]
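    Under this Laplace likelihood the log-likelihood of a frame is, up to a constant, a \(\lambda\)-scaled negative L1 error between the frame and the decoder mean. A tiny numerical check (\(\lambda\) and the pixel values are made-up numbers):

    ```python
    import numpy as np

    # log p(x|z,f) = sum_i [ log(lam/2) - lam * |x_i - mu_i| ]  for a Laplace
    # with mean mu and scale 1/lam, so maximizing it minimizes an L1 distortion.
    lam = 10.0
    x = np.array([0.2, 0.5, 0.9])      # true pixel values
    mu = np.array([0.25, 0.5, 0.8])    # decoder mean, i.e. the reconstruction

    log_lik = np.sum(np.log(lam / 2) - lam * np.abs(x - mu))
    print(log_lik)
    ```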

    Encoder:
    Use amortized variational inference to predict a distribution over latent codes given the input video.

    \[q_\phi(z_{1:T},f|x_{1:T})=q_\phi(f|x_{1:T})\prod_{t=1}^{T}q_\phi(z_t|x_t)\]

    The approximate posteriors are uniform distributions of fixed width centered at the means:

    \[\widetilde{f} \sim q_\phi(f|x_{1:T})=\mathcal{U}(\hat{f}-\tfrac{1}{2},\hat{f}+\tfrac{1}{2})\\ \widetilde{z}_t \sim q_\phi(z_t|x_t)=\mathcal{U}(\hat{z}_t-\tfrac{1}{2},\hat{z}_t+\tfrac{1}{2})\]

    The means are obtained from additional encoder neural networks:

    \[\hat{f}=\mu_\phi(x_{1:T})\\ \hat{z}_t=\mu_\phi(x_t)\]
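    Sampling from these unit-width uniforms is the same as adding uniform noise to the encoder means, which is the standard differentiable proxy for quantization during training; at test time the noise is replaced by rounding, giving integer latents that can be entropy coded. A minimal sketch (the encoder means are made-up numbers):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    z_hat = np.array([1.3, -0.2, 4.7])                     # encoder means \hat{z}_t

    # Training: sample from U(z_hat - 1/2, z_hat + 1/2), i.e. mean + uniform noise.
    z_train = z_hat + rng.uniform(-0.5, 0.5, z_hat.shape)

    # Test time: hard quantization to the nearest integer.
    z_test = np.round(z_hat)

    assert np.all(np.abs(z_train - z_hat) <= 0.5)          # noise stays within the bin
    print(z_test)
    ```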

    The mean for the global state is parametrized by convolutions over \(x_{1:T}\), followed by a bi-directional LSTM whose output is processed by an MLP.

    The encoder mean for the local state is simpler, consisting of convolutions over each frame followed by an MLP.

    The paper assumes the global prior \(p_\theta(f)\) is fixed, while \(p_\theta(z_{1:T})\) is given by a temporal sequence model:

    \[p_\theta(f)=\prod_{i}^{\dim(f)}p_\theta(f^i)*\mathcal{U}(-\tfrac{1}{2},\tfrac{1}{2})\\ p_\theta(z_{1:T})=\prod_{t}^{T}\prod_{i}^{\dim(z)}p_\theta(z_t^i|z_{<t})*\mathcal{U}(-\tfrac{1}{2},\tfrac{1}{2})\]
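    Convolving a continuous density with \(\mathcal{U}(-\frac{1}{2},\frac{1}{2})\) gives the probability mass the prior assigns to each integer-quantized latent: the CDF difference over a unit-width bin. A sketch using a standard normal as the base density (an assumption for illustration; the paper learns the density):

    ```python
    import math

    def gauss_cdf(x, mu=0.0, sigma=1.0):
        """CDF of a Gaussian, via the error function."""
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

    def mass(k, mu=0.0, sigma=1.0):
        """p(z = k) after convolution with U(-1/2, 1/2): CDF difference over one bin."""
        return gauss_cdf(k + 0.5, mu, sigma) - gauss_cdf(k - 0.5, mu, sigma)

    # The masses over a wide integer support sum to (nearly) 1, so they form
    # a valid discrete distribution that an arithmetic coder can consume.
    total = sum(mass(k) for k in range(-10, 11))
    print(mass(0), total)
    ```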

    There are two ways to model the latent-variable sequence \(z_{1:T}\):

    1. A recurrent LSTM prior architecture for \(p_\theta(z_t^i|z_{<t})\), which conditions on all previous frames in a segment.
    2. A single-frame context, \(p_\theta(z_t^i|z_{<t})=p_\theta(z_t^i|z_{t-1})\), which is essentially a Kalman filter.

    The encoder (variational model) and decoder (generative model) are learned jointly by maximizing the \(\beta\)-VAE objective:

    \[\mathcal{L}(\phi,\theta)=\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T} \sim q_\phi}[\log p_\theta(x_{1:T}|\widetilde{f},\widetilde{z}_{1:T})]+\beta\,\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T} \sim q_\phi}[\log p_\theta(\widetilde{f},\widetilde{z}_{1:T})]\]

    The first term measures distortion; the second is the cross-entropy between the approximate posterior and the prior.
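    Read as a rate-distortion trade-off (and negated for minimization), the objective can be sketched with toy numbers; \(\lambda\), \(\beta\), the pixel values, and the latent probabilities below are all made up:

    ```python
    import math

    # Distortion: the Laplace likelihood reduces to a lam-scaled L1 error.
    # Rate: -log2 of the latents' probability under the temporal prior,
    # i.e. the code length an arithmetic coder would spend on them.
    beta = 0.1
    lam = 10.0
    x, x_hat = [0.2, 0.5], [0.25, 0.45]       # true pixels vs reconstruction
    latent_probs = [0.3, 0.5]                 # p(z_t | z_{<t}) from the prior

    distortion = lam * sum(abs(a - b) for a, b in zip(x, x_hat))
    rate_bits = -sum(math.log2(p) for p in latent_probs)

    loss = distortion + beta * rate_bits      # minimize distortion + beta * rate
    print(distortion, rate_bits, loss)
    ```

    Larger \(\beta\) penalizes the code length more, trading reconstruction quality for a smaller bit stream.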

    Model:

  • Original article: https://www.cnblogs.com/hhhhhxh/p/13198639.html