Entropy, relative entropy and mutual information

    Entropy

\[ H(X) = -\sum_{x} p(x) \log p(x), \]

Entropy is non-negative, and it attains its minimum value 0 if and only if \(X\) is deterministic, i.e. \(P(X=x_0)=1\) for some \(x_0\).

    Proof:

Since \(0 < p(x) \le 1\), each term \(\log \frac{1}{p(x)}\) is non-negative, so

\[
\begin{array}{ll}
H(X) &= -\sum_{x} p(x) \log p(x) \\
&= \sum_{x} p(x) \log \frac{1}{p(x)} \\
&\ge \log 1 = 0.
\end{array}
\]
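As a quick numeric illustration, here is a minimal Python sketch (the example distributions are made up; entropy is measured in bits):

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); the convention 0*log 0 = 0 is used."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: maximal for two outcomes
print(entropy([1.0, 0.0]))  # 0.0: a deterministic X has minimal entropy
print(entropy([0.9, 0.1]))  # ~0.469: between the two extremes
```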

    Joint Entropy

\[ H(X, Y) := -\mathbb{E}_{p(x, y)} [\log p(x, y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \]

    Conditional Entropy

\[
\begin{array}{ll}
H(Y|X) &= \mathbb{E}_{p(x)} [H(Y|X=x)] \\
&= \sum_{x \in \mathcal{X}} p(x) H(Y|X=x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x) p(y|x) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x).
\end{array}
\]

Note the distinction between \(H(Y|X)\) and \(H(Y|X=x)\): the former averages the latter over \(p(x)\).

Chain Rule

\[ H(X, Y) = H(X) + H(Y|X). \]

Proof:

From \(p(y|x) = \frac{p(x, y)}{p(x)}\) and the derivation above:

\[
\begin{array}{ll}
H(Y|X) &= H(X, Y) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) \\
&= H(X, Y) - H(X).
\end{array}
\]
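A small numeric check of the chain rule, with \(H(Y|X)\) computed directly from its definition (the 2×2 joint table is invented for illustration):

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

# A made-up joint distribution p(x, y), keyed by (x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}  # marginal p(x)

# H(Y|X) = sum_x p(x) H(Y|X=x), from the definition.
h_y_given_x = sum(
    px[x] * entropy([joint[(x, y)] / px[x] for y in (0, 1)])
    for x in (0, 1))

print(entropy(joint.values()))             # H(X,Y) ~ 1.722
print(entropy(px.values()) + h_y_given_x)  # H(X) + H(Y|X): the same value
```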

Corollary:

\[ H(X, Y|Z) = H(X|Z) + H(Y|X, Z). \]

\[
\begin{array}{ll}
H(Y|X, Z) &= \mathbb{E}_{p(x, z)} [H(Y|X=x, Z=z)] \\
&= -\sum_{x, z} p(x, z) \sum_{y} p(y|x, z) \log p(y|x, z) \\
&= -\sum_{x, y, z} p(x, y, z) [\log p(x, y|z) - \log p(x|z)] \\
&= \mathbb{E}_{z} [H(X, Y|Z=z)] - \mathbb{E}_{z} [H(X|Z=z)] = H(X, Y|Z) - H(X|Z).
\end{array}
\]

    Mutual Information

\[ I(X;Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}. \]

• \(I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y;X) = H(X) + H(Y) - H(X, Y) \ge 0\)

• \(I(X;X) = H(X)\)
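These identities are easy to verify numerically; here is a minimal sketch (reusing the invented 2×2 joint table from earlier) that computes \(I(X;Y)\) both from the definition and via \(H(X) + H(Y) - H(X, Y)\):

```python
import math

def entropy(p):
    return -sum(v * math.log2(v) for v in p if v > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

# Direct definition: I(X;Y) = sum_{x,y} p(x,y) log p(x,y)/(p(x)p(y)).
i_direct = sum(v * math.log2(v / (px[x] * py[y]))
               for (x, y), v in joint.items() if v > 0)
# Via entropies: I(X;Y) = H(X) + H(Y) - H(X,Y).
i_entropy = (entropy(px.values()) + entropy(py.values())
             - entropy(joint.values()))
print(i_direct, i_entropy)  # both ~ 0.278 bits
```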

    Relative Entropy

\[ D(p\|q) := \mathbb{E}_p \left[\log \frac{p(x)}{q(x)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \]
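A minimal Python sketch of the definition, which also shows that \(D(p\|q)\) is not symmetric in its arguments (the two distributions are made up):

```python
import math

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q))  # ~0.737
print(kl(q, p))  # ~0.531: D is not symmetric
```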

    Chain Rules

    Chain Rule for Entropy

\((X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)\):

\[ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i|X_{i-1}, \ldots, X_1). \]

Proof:

Induction, together with \(H(X, Y) = H(X) + H(Y|X)\).

    Chain Rule for Mutual Information

    Conditional Mutual Information

Definition:

\[ I(X;Y|Z) := H(X|Z) - H(X|Y, Z) = \mathbb{E}_{p(x, y, z)} \left[\log \frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]. \]

Property:

\[ I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y|X_{i-1}, \ldots, X_1). \]

Proof:

\[
\begin{array}{ll}
I(X_1, X_2, \ldots, X_n; Y) &= H(X_1, \ldots, X_n) + H(Y) - H(X_1, \ldots, X_n, Y) \\
&= H(X_1, \ldots, X_{n-1}) + H(X_n|X_1, \ldots, X_{n-1}) + H(Y) - H(X_1, \ldots, X_n, Y) \\
&= I(X_1, X_2, \ldots, X_{n-1}; Y) + H(X_n|X_1, \ldots, X_{n-1}) - H(X_n|X_1, \ldots, X_{n-1}, Y) \\
&= I(X_1, X_2, \ldots, X_{n-1}; Y) + I(X_n; Y|X_1, \ldots, X_{n-1}).
\end{array}
\]
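For concreteness, a small numeric check of the \(n = 2\) case, \(I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y|X_1)\), on an arbitrary made-up joint distribution:

```python
import math

def mi(joint):
    """I(U;V) for a joint distribution given as a dict keyed by (u, v)."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

# An arbitrary joint distribution p(x1, x2, y) on {0,1}^3.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
     (1, 0, 0): 0.12, (1, 0, 1): 0.08, (1, 1, 0): 0.10, (1, 1, 1): 0.20}

# I(X1,X2;Y): treat the pair (x1, x2) as one variable.
i_joint = mi({((x1, x2), y): v for (x1, x2, y), v in p.items()})

# I(X1;Y): marginalize out x2.
xy1 = {}
for (x1, x2, y), v in p.items():
    xy1[(x1, y)] = xy1.get((x1, y), 0.0) + v
i_1 = mi(xy1)

# I(X2;Y|X1) = sum_{x1} p(x1) I(X2;Y | X1=x1).
i_2_given_1 = 0.0
for a in (0, 1):
    pa = sum(v for (x1, _, _), v in p.items() if x1 == a)
    cond = {(x2, y): v / pa for (x1, x2, y), v in p.items() if x1 == a}
    i_2_given_1 += pa * mi(cond)

print(i_joint)            # I(X1,X2;Y)
print(i_1 + i_2_given_1)  # I(X1;Y) + I(X2;Y|X1): the same value
```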

    Chain Rule for Relative Entropy

Definition:

\[
\begin{array}{ll}
D(p(y|x)\|q(y|x)) &:= \mathbb{E}_{p(x, y)} \left[\log \frac{p(Y|X)}{q(Y|X)}\right] \\
&= \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}.
\end{array}
\]

Property:

\[ D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x)). \]

Proof:

\[
\begin{array}{ll}
D(p(x, y)\|q(x, y)) &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{q(x, y)} \\
&= \sum_{x, y} p(x, y) \log \frac{p(y|x)p(x)}{q(y|x)q(x)} \\
&= \sum_{x, y} p(x, y) \left[\log \frac{p(y|x)}{q(y|x)} + \log \frac{p(x)}{q(x)}\right] \\
&= D(p(x)\|q(x)) + D(p(y|x)\|q(y|x)).
\end{array}
\]
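A quick numeric check of this chain rule (both joints on \(\{0,1\}^2\) are invented; \(q\) is uniform):

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two joint distributions over (x, y), listed in the order
# (0,0), (0,1), (1,0), (1,1).
p = [0.40, 0.10, 0.10, 0.40]
q = [0.25, 0.25, 0.25, 0.25]

px, qx = [p[0] + p[1], p[2] + p[3]], [q[0] + q[1], q[2] + q[3]]

# D(p(y|x)||q(y|x)) = sum_x p(x) D(p(y|x=x) || q(y|x=x)).
d_cond = (px[0] * kl([p[0]/px[0], p[1]/px[0]], [q[0]/qx[0], q[1]/qx[0]])
          + px[1] * kl([p[2]/px[1], p[3]/px[1]], [q[2]/qx[1], q[3]/qx[1]]))

print(kl(p, q))             # D(p(x,y)||q(x,y)) ~ 0.278
print(kl(px, qx) + d_cond)  # same value, per the chain rule
```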

Symmetrically:

\[ D(p(x, y) \| q(x, y)) = D(p(y) \| q(y)) + D(p(x|y) \| q(x|y)). \]

Hence, when \(p(x) = q(x)\), we obtain

\[ D(p(x, y) \| q(x, y)) = D(p(y|x) \| q(y|x)) \ge D(p(y) \| q(y)). \]

1. \(D(p(y|x)\|q(y|x)) = D(p(x, y)\|p(x)q(y|x))\)

2. \(D(p(x_1, x_2, \ldots, x_n)\|q(x_1, x_2, \ldots, x_n)) = \sum_{i=1}^n D(p(x_i|x_{i-1}, \ldots, x_1)\|q(x_i|x_{i-1}, \ldots, x_1))\)

3. \(D(p(y)\|q(y)) \le D(p(y|x)\|q(y|x))\) when \(q(x) = p(x)\).

The proofs of 1, 2, and 3 all follow from the derivations above with minor modifications.

    Jensen's Inequality

If \(f\) is convex, then

\[ \mathbb{E}[f(X)] \ge f(\mathbb{E}[X]). \]
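A tiny Monte Carlo illustration with the convex function \(f(x) = x^2\) and \(X \sim \mathrm{Uniform}(-1, 2)\) (both choices arbitrary):

```python
import random

random.seed(0)
xs = [random.uniform(-1, 2) for _ in range(100_000)]
mean = sum(xs) / len(xs)

print(sum(x * x for x in xs) / len(xs))  # E[X^2] ~ 1.0
print(mean * mean)                       # f(E[X]) = (E[X])^2 ~ 0.25
```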

    Properties

• \(D(p\|q) \ge 0\), with equality if and only if \(p = q\).
• \(I(X;Y) \ge 0\), with equality if and only if \(X\) and \(Y\) are independent.
• \(D(p(y|x)\|q(y|x)) \ge 0\) (by the property above), with equality if and only if \(p(y|x) = q(y|x)\) for all \(x\) with \(p(x) > 0\).
• \(I(X;Y|Z) \ge 0\), with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\).
• \(H(X|Y) \le H(X)\), with equality if and only if \(X\) and \(Y\) are independent.
• \(H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n H(X_i)\), with equality if and only if all the \(X_i\) are independent.

    Log Sum Inequality

• \(D(p\|q)\) is convex in the pair \((p, q)\), i.e. \(\forall\, 0 \le \lambda \le 1\):

\[ D(\lambda p_1 + (1-\lambda)p_2 \,\|\, \lambda q_1 + (1-\lambda)q_2) \le \lambda D(p_1\|q_1) + (1-\lambda)D(p_2\|q_2). \]

One proof goes through the convexity of \(p \log \frac{p}{q}\); a more interesting proof constructs joint distributions \(p(x, c)\) and \(q(x, c)\) whose \(x\)-marginals are the mixtures,

\[ \sum_c p(x, c) = \lambda p_1(x) + (1-\lambda) p_2(x), \quad \sum_c q(x, c) = \lambda q_1(x) + (1-\lambda) q_2(x), \]

\[ p(x|c=0) = p_1(x), \quad p(x|c=1) = p_2(x), \quad q(x|c=0) = q_1(x), \quad q(x|c=1) = q_2(x), \\ p(c=0) = q(c=0) = \lambda, \quad p(c=1) = q(c=1) = 1-\lambda. \]

Then apply \(D(p(x)\|q(x)) \le D(p(x|c)\|q(x|c)) = \lambda D(p_1\|q_1) + (1-\lambda) D(p_2\|q_2)\), which holds because \(p(c) = q(c)\) (property 3 above).
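A numeric spot-check of the convexity inequality, with arbitrary made-up distributions and \(\lambda = 0.3\):

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.9, 0.1], [0.5, 0.5]
p2, q2 = [0.2, 0.8], [0.6, 0.4]

lam = 0.3
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
q_mix = [lam * a + (1 - lam) * b for a, b in zip(q1, q2)]

print(kl(p_mix, q_mix))                           # ~0.074, left-hand side
print(lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))  # ~0.497, larger as claimed
```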

• \(H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)\) is a concave function of \(p\).
• \(I(X;Y) = \sum_{x, y} p(y|x) p(x) \log \frac{p(y|x)}{p(y)}\) is a concave function of \(p(x)\) for fixed \(p(y|x)\), and a convex function of \(p(y|x)\) for fixed \(p(x)\).

We prove only the second half. Take any \(p_1(y|x), p_2(y|x)\); since \(p(x)\) is fixed, for all \(0 \le \lambda \le 1\):

\[ p(x, y) := \lambda p_1(x, y) + (1-\lambda) p_2(x, y) = [\lambda p_1(y|x) + (1-\lambda) p_2(y|x)] p(x), \\ p(y) := \sum_x p(x, y) = \lambda \sum_x p_1(x, y) + (1-\lambda) \sum_x p_2(x, y), \\ q(x, y) := p(x)p(y) = \lambda p(x) \sum_{x'} p_1(x', y) + (1-\lambda) p(x) \sum_{x'} p_2(x', y) =: \lambda q_1(x, y) + (1-\lambda) q_2(x, y). \]

\[ I(X;Y) = D(p(x, y) \| p(x)p(y)) = D(p(x, y) \| q(x, y)), \]

Since the KL divergence is convex in the pair \((p, q)\), it follows that \(I\) is convex in \(p(y|x)\).

    Data-Processing Inequality

Suppose the data form a Markov chain \(X \rightarrow Y \rightarrow Z\), i.e. \(P(X, Y, Z) = P(X)P(Y|X)P(Z|Y)\); for example, \(Y = f(X)\) and \(Z = g(Y)\).

\[ I(Y, Z; X) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z), \]

\[ I(X;Z|Y) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, z|y)}{p(x|y)p(z|y)} = \sum_{x, y, z} p(x, y, z) \log 1 = 0, \\ I(X;Y|Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x|y)}{p(x|z)} \ge 0. \]

Hence
\[ I(X;Z) \le I(X;Y), \\ I(X;Y|Z) \le I(X;Y). \]
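A numeric illustration: push a fair bit through two made-up noisy channels in series and compare \(I(X;Y)\) with \(I(X;Z)\):

```python
import math

def mi(joint):
    """I(U;V) from a joint distribution dict keyed by (u, v)."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

# Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y).
p_x = {0: 0.5, 1: 0.5}
p_y_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # channel X -> Y
p_z_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}  # channel Y -> Z

xy, xz = {}, {}
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            pxyz = p_x[x] * p_y_x[x][y] * p_z_y[y][z]
            xy[(x, y)] = xy.get((x, y), 0.0) + pxyz
            xz[(x, z)] = xz.get((x, z), 0.0) + pxyz

print(mi(xy))  # I(X;Y) ~ 0.397
print(mi(xz))  # I(X;Z) ~ 0.057: smaller, as the DPI guarantees
```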

    Sufficient Statistics

    Statistics and Mutual Information

• A family of probability distributions \(\{f_{\theta}(x)\}\).

• Let \(X \sim f_{\theta}(x)\) and let \(T(X)\) be a statistic of \(X\); then

  \[ \theta \rightarrow X \rightarrow T(X) \]

• and hence

  \[ I(\theta; X) \ge I(\theta; T(X)). \]

    Sufficient Statistics and Compression

Definition (sufficient statistic): a function \(T(X)\) is a sufficient statistic for a family of probability distributions \(\{f_{\theta}(x)\}\) if, given \(T(X) = t\), the conditional distribution of \(X\) does not depend on \(\theta\), i.e.

\[ f_{\theta}(x) = f(x|t) f_{\theta}(t) \;\Rightarrow\; \theta \rightarrow T(X) \rightarrow X \;\Rightarrow\; I(\theta; T(X)) \ge I(\theta; X). \]

Combined with the inequality in the other direction above, \(I(\theta; T(X)) = I(\theta; X)\).
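A concrete check, assuming a uniform prior over two Bernoulli parameters and two i.i.d. observations, with \(T(X_1, X_2) = X_1 + X_2\) (the success count, which is sufficient for Bernoulli families):

```python
import math
from itertools import product

def mi(joint):
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

prior = {0.3: 0.5, 0.7: 0.5}  # uniform prior over two parameter values

# Joints p(theta, x) and p(theta, t) for x = (x1, x2) i.i.d. Bernoulli(theta)
# and t = x1 + x2; math.prod needs Python 3.8+.
jx, jt = {}, {}
for th, w in prior.items():
    for x in product((0, 1), repeat=2):
        pr = w * math.prod(th if xi else 1 - th for xi in x)
        jx[(th, x)] = pr
        jt[(th, sum(x))] = jt.get((th, sum(x)), 0.0) + pr

print(mi(jx))  # I(theta; X)
print(mi(jt))  # I(theta; T(X)): equal, since T is sufficient
```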

Definition (minimal sufficient statistic): a sufficient statistic \(T(X)\) is minimal if, for every other sufficient statistic \(U(X)\) of \(\{f_{\theta}(x)\}\),

\[ \theta \rightarrow T(X) \rightarrow U(X) \rightarrow X. \]
