  • Deep Generative Video Compression (NeurIPS 2019)

    Based on VAE

    Steps:

    1. Transform a sequence of frames \(x_{1:T}=(x_1,...,x_T)\) into a sequence of latent states \(z_{1:T}\) and, optionally, a global state \(f\). This transformation is lossy, but the video is not yet optimally compressed, because correlations remain among the latent variables.
    2. The latent states must therefore be entropy-coded into a binary representation.
    3. The bit stream can then be sent to a receiver, where it is decoded back into video frames.

    (Q: Why transform to latent variables first? Why not entropy-code the frames directly?)

    So we need two models:

    1. an optimal lossy transformation into the latent space;
    2. a predictive model required for entropy coding.

    The temporal model is the most important component for video, because videos exhibit strong temporal correlations in addition to the spatial correlations present in images.

    So the paper proposes to learn a temporally-conditioned prior distribution, parameterized by a deep generative model, to efficiently code the latent variables associated with each frame.

    Notation:
    \(x_{1:T}=(x_1,...,x_T)\) = video sequence, \(z_{1:T}\) = associated latent variables, \(f\) = global variables (optional)

    Arithmetic coding:
    Coding the entire sequence of discretized latent states \(z_{1:T}\) into a single number: conditional probabilities \(p(z_t|z_{<t})\) are used to iteratively refine the real interval \([0,1)\) into a progressively smaller one. (Q: How exactly is the interval refined?) Once a final (very small) interval is obtained, a binarized floating-point number from that interval is stored, encoding the entire sequence of latents.

    Decoder: f(latent)=data
    Use a stochastic recurrent variational autoencoder that transforms a sequence of local latent variables \(z_{1:T}\) and a global state \(f\) into the frame sequence \(x_{1:T}\):

    \[p_\theta(x_{1:T},z_{1:T},f)=p_\theta(f)\,p_\theta(z_{1:T})\prod_{t=1}^{T}p_\theta(x_t|z_t,f)\]

    \[p_\theta(x_t|z_t,f)=\mathrm{Laplace}\big(\mu_\theta(z_t,f),\,\lambda^{-1}\mathbf{1}\big)\quad\text{(frame likelihood)}\\ \widetilde{x}_t=\mu_\theta(z_t,f)=\text{decoder mean}\]
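    With a Laplace likelihood, maximizing \(\log p_\theta(x_t|z_t,f)\) is the same as minimizing an L1 reconstruction error scaled by \(\lambda\). A minimal numpy sketch with toy values (not the paper's code):

```python
import numpy as np

def laplace_log_likelihood(x, mu, lam):
    # Elementwise Laplace with mean mu and scale 1/lam:
    # log p(x) = log(lam / 2) - lam * |x - mu|, summed over pixels,
    # so maximizing it minimizes the lam-weighted L1 error.
    return float(np.sum(np.log(lam / 2.0) - lam * np.abs(x - mu)))

x = np.array([0.0, 1.0])   # toy "frame"
mu = np.zeros(2)           # toy decoder mean
ll = laplace_log_likelihood(x, mu, lam=2.0)
```

    A perfect reconstruction (`mu == x`) maximizes the log-likelihood, which is why this term plays the role of the distortion in the rate-distortion objective below.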

    Encoder:
    Use amortized variational inference to predict a distribution over latent codes given the input video:

    \[q_\phi(z_{1:T},f|x_{1:T})=q_\phi(f|x_{1:T})\prod_{t=1}^{T}q_\phi(z_t|x_t)\]

    A fixed-width uniform distribution centered on the mean is used:

    \[\widetilde{f}\sim q_\phi(f|x_{1:T})=\mathcal{U}\big(\hat{f}-\tfrac{1}{2},\,\hat{f}+\tfrac{1}{2}\big)\\ \widetilde{z}_t\sim q_\phi(z_t|x_t)=\mathcal{U}\big(\hat{z}_t-\tfrac{1}{2},\,\hat{z}_t+\tfrac{1}{2}\big)\]
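    Sampling from this unit-width uniform posterior is the usual additive-noise relaxation of quantization: during training \(\hat{z}\) receives \(\mathcal{U}(-\tfrac12,\tfrac12)\) noise so gradients can flow, while at test time it is rounded to the nearest integer so the latents take the discrete values the entropy coder expects. A small illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_quantize(z_hat, training):
    # Training: sample from U(z_hat - 1/2, z_hat + 1/2), the fixed-width
    # uniform posterior; test time: round to the nearest integer so the
    # latent can be entropy-coded.
    if training:
        return z_hat + rng.uniform(-0.5, 0.5, size=z_hat.shape)
    return np.round(z_hat)

z_hat = np.array([0.2, 1.7, -0.4])
z_train = soft_quantize(z_hat, training=True)    # within 1/2 of z_hat
z_test = soft_quantize(z_hat, training=False)    # integers: [0., 2., -0.]
```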

    The means are produced by additional encoder neural networks:

    \[\hat{f}=\mu_\phi(x_{1:T})\\ \hat{z}_t=\mu_\phi(x_t)\]

    The mean for the global state is parameterized by convolutions over \(x_{1:T}\), followed by a bi-directional LSTM whose output is processed by an MLP.

    The encoder mean for the local state is simpler, consisting of convolutions over each frame followed by an MLP.

    The paper assumes that the global prior \(p_\theta(f)\) is fixed, while \(p_\theta(z_{1:T})\) is given by a temporal sequence model:

    \[p_\theta(f)=\prod_{i}^{\dim(f)}p_\theta(f^i)*\mathcal{U}\big(-\tfrac{1}{2},\tfrac{1}{2}\big)\\ p_\theta(z_{1:T})=\prod_{t}^{T}\prod_{i}^{\dim(z)}p_\theta(z_t^i|z_{<t})*\mathcal{U}\big(-\tfrac{1}{2},\tfrac{1}{2}\big)\]
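    The \(*\,\mathcal{U}(-\tfrac12,\tfrac12)\) convolution means the prior assigns each integer latent value the mass of a unit-width bin, computable from the underlying CDF. A sketch with a standard normal standing in for the learned density:

```python
import math

def box_convolved_mass(cdf, k):
    # Convolving a density with U(-1/2, 1/2) and evaluating at integer k
    # yields cdf(k + 1/2) - cdf(k - 1/2): exactly the per-symbol
    # probability mass the arithmetic coder needs for a quantized latent.
    return cdf(k + 0.5) - cdf(k - 0.5)

# Stand-in prior density: standard normal (the paper learns this instead).
normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

masses = [box_convolved_mass(normal_cdf, k) for k in range(-5, 6)]
```

    The masses over the integer grid sum to (essentially) one, so they form a valid discrete distribution for entropy coding.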

    There are two ways to model the latent-variable sequence \(z_{1:T}\):

    1. A recurrent LSTM prior architecture for \(p_\theta(z_t^i|z_{<t})\), which conditions on all previous frames in a segment.
    2. A single-frame context, \(p_\theta(z_t^i|z_{<t})=p_\theta(z_t^i|z_{t-1})\), which is essentially a Kalman filter.
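    The two choices differ only in how much context the conditional sees. With toy Gaussian conditionals (hypothetical stand-ins for the learned networks), both factorizations reduce to a sum of per-step log-probabilities:

```python
import math

def gauss_logpdf(x, mean, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def seq_logprob(z, context_mean):
    # context_mean maps the past latents z_{<t} to a predicted mean.
    total = gauss_logpdf(z[0], 0.0)              # first step: no context
    for t in range(1, len(z)):
        total += gauss_logpdf(z[t], context_mean(z[:t]))
    return total

markov = lambda past: past[-1]                   # single-frame (Kalman-like)
full = lambda past: sum(past) / len(past)        # toy stand-in for "all past"

z = [0.0, 0.5, 1.0, 1.5]
lp_markov = seq_logprob(z, markov)
lp_full = seq_logprob(z, full)
```

    For this steadily drifting toy sequence the single-frame context predicts each step better, so `lp_markov > lp_full`; which kind of context wins on real video is what the two prior architectures trade off.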

    The encoder (variational model) and decoder (generative model) are learned jointly by maximizing the \(\beta\)-VAE objective:

    \[\mathcal{L}(\phi,\theta)=\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T}\sim q_\phi}\big[\log p_\theta(x_{1:T}|\widetilde{f},\widetilde{z}_{1:T})\big]+\beta\,\mathbb{E}_{\widetilde{f},\widetilde{z}_{1:T}\sim q_\phi}\big[\log p_\theta(\widetilde{f},\widetilde{z}_{1:T})\big]\]

    The first term represents distortion; the second is the cross-entropy between the approximate posterior and the prior (the rate term).
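    The trade-off can be sketched by Monte-Carlo estimating both terms for a single frame, with a toy identity decoder and a standard-normal stand-in prior (all numbers illustrative; the real model uses learned networks):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def objective(x, z_hat, lam=1.0, beta=0.1, n_samples=64):
    # E_q[log p(x | z)] + beta * E_q[log p(z)], estimated by sampling the
    # fixed-width uniform posterior around z_hat.
    normal_logpdf = lambda z: -0.5 * (z ** 2 + math.log(2 * math.pi))
    total = 0.0
    for _ in range(n_samples):
        z = z_hat + rng.uniform(-0.5, 0.5, size=z_hat.shape)  # posterior sample
        x_tilde = z                                           # toy decoder mean
        distortion = np.sum(np.log(lam / 2.0) - lam * np.abs(x - x_tilde))
        rate = np.sum(normal_logpdf(z))                       # log-prior term
        total += distortion + beta * rate
    return float(total / n_samples)

x = np.array([0.1, -0.3, 0.8])
elbo = objective(x, z_hat=x.copy())   # objective to be maximized
```

    Raising `beta` weights the log-prior (rate) term more heavily, trading reconstruction quality for a lower bitrate.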

    Model: (architecture figure not preserved)

  • Original post: https://www.cnblogs.com/hhhhhxh/p/13198639.html