  • PRML 1: Gaussian Distribution

    1. Overview of Machine Learning

    P.S. A quick flip through the textbook to review a few basic concepts from probability theory.


      Probability is a measure defined on a sample space, satisfying non-negativity, normalization, and countable additivity; it quantifies our degree of belief in random events.
      A random variable is a real-valued function of the outcome of a random experiment. For a one-dimensional random variable we can define a real function that is monotonic, bounded, and right-continuous, called the cumulative distribution function; for multidimensional random variables we can further define joint, marginal, and conditional distributions.
      In one dimension we can define the mean of a random variable and the covariance between two random variables, $\mathrm{cov}[\vec{x},\vec{y}]=E[\vec{x}^T\vec{y}]-E[\vec{x}]^T E[\vec{y}]$; in higher dimensions the corresponding notions are the mean vector and the covariance matrix. Two (scalar) random variables are uncorrelated when their covariance is zero, which holds if and only if the mean of their product equals the product of their means (equivalently, the variance of their sum equals the sum of their variances). Independence is a stronger, special case of uncorrelatedness: it requires the product of the two marginal distribution functions to equal the joint distribution function.
      Probability theory has two celebrated limit theorems, the law of large numbers and the central limit theorem: the former shows that as the number of trials grows without bound, the relative frequency of an outcome converges in probability to the probability of the corresponding event; the latter states that the mean of a large number of i.i.d. random variables is approximately normally distributed, which is one reason the normal distribution is so popular.
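
      A quick numerical illustration of the two limit theorems (a minimal NumPy sketch; the coin bias, sample sizes, and seed are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(0)

        # Law of large numbers: the empirical frequency of "heads" (p = 0.3)
        # approaches 0.3 as the number of trials grows.
        for n in (100, 10_000, 1_000_000):
            freq = rng.binomial(1, 0.3, size=n).mean()
            print(f"n = {n:>9}: frequency = {freq:.4f}")

        # Central limit theorem: the mean of n i.i.d. Uniform(0, 1) variables is
        # approximately Gaussian with variance (1/12)/n.
        n, trials = 50, 20_000
        means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
        print("sample variance of the mean:", means.var(), " theory:", (1 / 12) / n)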

    2. The Gaussian Distribution

        $Gauss(\vec{x} \mid \vec{\mu},\Sigma)=\frac{1}{(2\pi)^{D/2}}\cdot\frac{1}{|\Sigma|^{1/2}}\cdot\exp\{-\frac{1}{2}(\vec{x}-\vec{\mu})^T\Sigma^{-1}(\vec{x}-\vec{\mu})\}$
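
      This density can be evaluated directly and checked against SciPy (a minimal sketch; the particular $\vec{\mu}$, $\Sigma$, and $\vec{x}$ below are made up):

        import numpy as np
        from scipy.stats import multivariate_normal

        def gauss_pdf(x, mu, Sigma):
            """Multivariate Gaussian density, following the formula above."""
            D = len(mu)
            diff = x - mu
            norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
            return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

        mu = np.array([1.0, -2.0])
        Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
        x = np.array([0.5, -1.0])
        print(gauss_pdf(x, mu, Sigma))                # direct evaluation
        print(multivariate_normal(mu, Sigma).pdf(x))  # SciPy reference, same value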

      Lemma: If $Mat = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$, then $Mat^{-1}=\begin{bmatrix} S^{-1} & -S^{-1}BD^{-1} \\ -D^{-1}CS^{-1} & D^{-1}(I+CS^{-1}BD^{-1}) \end{bmatrix}$,

    where $S=A-BD^{-1}C$ is the Schur complement of $Mat$ with respect to $D$.
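
      A quick numerical check of the lemma on a random partitioned matrix (a sketch; the block sizes and the diagonal shift used for conditioning are arbitrary):

        import numpy as np

        rng = np.random.default_rng(1)
        p, q = 2, 3
        M = rng.normal(size=(p + q, p + q)) + 5 * np.eye(p + q)  # keep M well-conditioned
        A, B = M[:p, :p], M[:p, p:]
        C, D = M[p:, :p], M[p:, p:]

        Dinv = np.linalg.inv(D)
        S = A - B @ Dinv @ C                   # Schur complement with respect to D
        Sinv = np.linalg.inv(S)
        M_inv = np.block([
            [Sinv,              -Sinv @ B @ Dinv],
            [-Dinv @ C @ Sinv,  Dinv @ (np.eye(q) + C @ Sinv @ B @ Dinv)],
        ])
        print(np.allclose(M_inv, np.linalg.inv(M)))   # True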

      Partitioned Gaussians: Suppose $\vec{x} = [\vec{x}_1^T,\vec{x}_2^T]^T$ follows a Gaussian distribution with mean vector $\vec{\mu} = [\vec{\mu}_1^T,\vec{\mu}_2^T]^T$ and covariance matrix $\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$; then:

        (1) Marginal Distribution: $p(\vec{x}_1)=Gauss(\vec{x}_1 \mid \vec{\mu}_1,\Sigma_{11})$, $p(\vec{x}_2)=Gauss(\vec{x}_2 \mid \vec{\mu}_2,\Sigma_{22})$;

        (2) Conditional Distribution: $p(\vec{x}_1 \mid \vec{x}_2)=Gauss(\vec{x}_1 \mid \vec{\mu}_1+\Sigma_{12}\Sigma_{22}^{-1}(\vec{x}_2-\vec{\mu}_2),\ \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$.
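
      The conditional formula can be verified numerically against the ratio $p(\vec{x}_1,\vec{x}_2)/p(\vec{x}_2)$ (a sketch with SciPy; the joint parameters and evaluation points are made up):

        import numpy as np
        from scipy.stats import multivariate_normal

        mu = np.array([0.0, 1.0, -1.0])
        Sigma = np.array([[2.0, 0.5, 0.3],
                          [0.5, 1.0, 0.2],
                          [0.3, 0.2, 1.5]])
        i1, i2 = [0], [1, 2]                       # partition: x1 = x[0], x2 = x[1:3]
        S11, S12 = Sigma[np.ix_(i1, i1)], Sigma[np.ix_(i1, i2)]
        S21, S22 = Sigma[np.ix_(i2, i1)], Sigma[np.ix_(i2, i2)]

        x2 = np.array([0.7, -0.4])                 # value we condition on
        mu_c = mu[i1] + S12 @ np.linalg.solve(S22, x2 - mu[i2])
        Sig_c = S11 - S12 @ np.linalg.solve(S22, S21)

        x1 = np.array([0.3])
        lhs = multivariate_normal(mu_c, Sig_c).pdf(x1)
        rhs = (multivariate_normal(mu, Sigma).pdf(np.concatenate([x1, x2]))
               / multivariate_normal(mu[i2], S22).pdf(x2))
        print(np.isclose(lhs, rhs))                # True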

      Linear Gaussian Model: Given $p(\vec{x})=Gauss(\vec{x} \mid \vec{\mu},\Lambda^{-1})$ and $p(\vec{y} \mid \vec{x})=Gauss(\vec{y} \mid A\vec{x}+\vec{b},L^{-1})$, we have:

        (1)  $p(\vec{y})=Gauss(\vec{y} \mid A\vec{\mu}+\vec{b},\ L^{-1}+A\Lambda^{-1}A^T)$;

        (2)  $p(\vec{x} \mid \vec{y})=Gauss(\vec{x} \mid \Sigma\{A^T L(\vec{y}-\vec{b})+\Lambda\vec{\mu}\},\ \Sigma)$, where $\Sigma=(\Lambda+A^T L A)^{-1}$.
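
      A Monte Carlo check of the marginal $p(\vec{y})$ (a sketch; the particular $A$, $\vec{b}$, $\Lambda$, and $L$ are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(2)
        mu = np.array([1.0, -1.0])
        Lam = np.array([[2.0, 0.4], [0.4, 1.0]])   # precision of p(x)
        A = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]])
        b = np.array([0.1, 0.2, -0.3])
        L = np.diag([4.0, 2.0, 1.0])               # precision of p(y | x)

        n = 200_000
        x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
        noise = rng.multivariate_normal(np.zeros(3), np.linalg.inv(L), size=n)
        y = x @ A.T + b + noise

        print("empirical mean:", y.mean(axis=0))
        print("predicted mean:", A @ mu + b)
        print("empirical cov:\n", np.cov(y.T))
        print("predicted cov:\n", np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)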

      Maximum Likelihood Estimate: The mean vector can be estimated sequentially via $\vec{\mu}_{ML}^{(n)}=\vec{\mu}_{ML}^{(n-1)}+\frac{1}{n}(\vec{x}_n-\vec{\mu}_{ML}^{(n-1)})$, whereas the covariance matrix is obtained in batch form as $\Sigma_{ML}=\frac{1}{N}\sum_{i=1}^{N}(\vec{x}_i-\vec{\mu}_{ML})(\vec{x}_i-\vec{\mu}_{ML})^T$.
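
      The sequential update reproduces the batch sample mean exactly, and the ML covariance uses the biased $1/N$ normalization (a quick sketch on synthetic data):

        import numpy as np

        rng = np.random.default_rng(3)
        X = rng.normal(size=(1000, 2))              # any data set will do

        mu = np.zeros(2)
        for n, x_n in enumerate(X, start=1):
            mu = mu + (x_n - mu) / n                # sequential mean update
        print(np.allclose(mu, X.mean(axis=0)))      # True: matches the batch estimate

        Sigma_ml = (X - mu).T @ (X - mu) / len(X)   # biased ML covariance (1/N)
        print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))   # True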

      Convolution: $\int Gauss(\vec{t} \mid \vec{y},\Sigma_2)\cdot Gauss(\vec{y} \mid \vec{\mu},\Sigma_1)\,d\vec{y}=Gauss(\vec{t} \mid \vec{\mu},\Sigma_1+\Sigma_2)$.
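
      In one dimension this identity can be checked by numerical integration (a sketch; $\mu$, $\sigma_1$, $\sigma_2$, and the evaluation point are made up):

        import numpy as np
        from scipy.stats import norm

        mu, s1, s2 = 0.5, 1.2, 0.8                   # s1, s2 are standard deviations
        t = 1.7                                      # point at which to evaluate

        y = np.linspace(-10, 10, 20_001)
        integrand = norm.pdf(t, loc=y, scale=s2) * norm.pdf(y, loc=mu, scale=s1)
        lhs = np.trapz(integrand, y)                 # numerically integrate over y
        rhs = norm.pdf(t, loc=mu, scale=np.sqrt(s1**2 + s2**2))
        print(np.isclose(lhs, rhs))                  # True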

    3. The Exponential Family

      A distribution belongs to the exponential family as long as it can be written as $p(\vec{x} \mid \vec{\eta})=g(\vec{\eta})\cdot h(\vec{x})\cdot\exp\{\vec{\eta}^T\vec{u}(\vec{x})\}$.

      (1) For the Multinomial Distribution $p(\vec{x} \mid \vec{\eta})=(1+\sum_{k=1}^{K-1}\exp\{\eta_k\})^{-1}\cdot\exp\{\vec{\eta}^T\vec{x}\}$:

        $\vec{\eta}=[\eta_1,\eta_2,\dots,\eta_{K-1}]^T$,  where  $\eta_k=\ln\left(\frac{\mu_k}{1-\sum_{i=1}^{K-1}\mu_i}\right)$;

      (2) For the Univariate Gaussian Distribution $p(x \mid \vec{\eta})=\frac{1}{\sqrt{2\pi}\,\sigma}\cdot\exp\{-\frac{\mu^2}{2\sigma^2}\}\cdot\exp\{\vec{\eta}^T\vec{u}(x)\}$:

        $\vec{\eta}=[\frac{\mu}{\sigma^2},-\frac{1}{2\sigma^2}]^T$,  where  $\vec{u}(x)=[x,x^2]^T$.

      To make a maximum likelihood estimate of the parameters $\vec{\eta}$, one need only maintain the sufficient statistics of the data set, $\sum\vec{u}(\vec{x})$. For the multinomial distribution, maintaining $\sum\vec{x}$ is enough, whereas for the univariate Gaussian distribution both $\sum x$ and $\sum x^2$ are required.
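
      For the univariate Gaussian these two running sums (plus the count $N$) are all that is needed to recover the ML parameters (a minimal sketch on synthetic data):

        import numpy as np

        rng = np.random.default_rng(4)
        data = rng.normal(loc=2.0, scale=3.0, size=100_000)

        # Sufficient statistics: only these numbers need to be stored.
        N, s1, s2 = len(data), data.sum(), (data ** 2).sum()

        mu_ml = s1 / N
        var_ml = s2 / N - mu_ml ** 2          # E[x^2] - E[x]^2, the biased ML variance
        print(mu_ml, np.sqrt(var_ml))         # close to the true values 2.0 and 3.0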

      Let $\vec{\eta}=\vec{\eta}(\vec{w}^T\vec{x})$ and $p(y \mid \vec{x},\vec{w})=h(y)\,g(\vec{\eta})\,e^{\vec{\eta}^T\vec{u}(y)}$; we then have a Generalized Linear Model (GLM), for which $E[y \mid \vec{x},\vec{w}]=f(\vec{w}^T\vec{x})$, an activation function $f$ acting on a linear function of the feature variables. Three standard instances follow, with a small sketch of their activation functions after the list.

      (1) Linear Regression:  $p(y \mid \vec{x},\vec{w})=Gauss(y \mid \vec{w}^T\vec{x},\sigma^2)$ for $y\in\mathbb{R}$,  $E[y \mid \vec{x},\vec{w}]=\vec{w}^T\vec{x}$,

        $\vec{\eta}(\vec{w}^T\vec{x})=[\frac{\vec{w}^T\vec{x}}{\sigma^2},-\frac{1}{2\sigma^2}]^T$,  $g(\vec{\eta})=\frac{1}{\sigma}e^{-\frac{(\vec{w}^T\vec{x})^2}{2\sigma^2}}$,  $\vec{u}(y)=[y,y^2]^T$,  $h(y)=\frac{1}{\sqrt{2\pi}}$;

      (2) Logistic Regression:  $p(y \mid \vec{x},\vec{w})=\sigma(\vec{w}^T\vec{x})^y(1-\sigma(\vec{w}^T\vec{x}))^{1-y}$ for $y=0,1$,  $E[y \mid \vec{x},\vec{w}]=\sigma(\vec{w}^T\vec{x})$,

        $\eta(\vec{w}^T\vec{x})=\vec{w}^T\vec{x}$,  $g(\eta)=(1+e^{\vec{w}^T\vec{x}})^{-1}$,  $u(y)=y$,  $h(y)=1$;

      (3) Poisson Regression:  $p(y \mid \vec{x},\vec{w})=\frac{\lambda^y e^{-\lambda}}{y!}$ for $y=0,1,2,\dots$, where $\lambda=E[y \mid \vec{x},\vec{w}]=e^{\vec{w}^T\vec{x}}$,

        $\eta(\vec{w}^T\vec{x})=\vec{w}^T\vec{x}$,  $g(\eta)=e^{-\lambda}$,  $u(y)=y$,  $h(y)=\frac{1}{y!}$.
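
      As noted above, here is a minimal sketch of the three activation functions (identity, logistic sigmoid, exponential) applied to the linear predictor $\vec{w}^T\vec{x}$; the weight and feature vectors below are made-up values:

        import numpy as np

        def glm_mean(w, x, family):
            """E[y | x, w] = f(w^T x) for the three GLMs listed above."""
            a = w @ x                              # the linear predictor w^T x
            if family == "linear":                 # identity activation
                return a
            if family == "logistic":               # sigmoid activation
                return 1.0 / (1.0 + np.exp(-a))
            if family == "poisson":                # exponential activation
                return np.exp(a)
            raise ValueError(family)

        w = np.array([0.5, -1.0, 2.0])
        x = np.array([1.0, 0.3, 0.2])
        for fam in ("linear", "logistic", "poisson"):
            print(fam, glm_mean(w, x, fam))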

    P.S. Two approaches to density estimation:

      (1) Parzen window (fix the volume $V$, count $k$): $p(\vec{x})=\frac{1}{N}\sum_{n=1}^{N}Gauss(\vec{x}_n \mid \vec{x},\lambda I)$;

      (2) k-nearest neighbours (fix $k$, measure the volume $V$ of the smallest enclosing hypersphere): $p(\vec{x})=\frac{k}{N}\cdot(\frac{4}{3}\pi\,r(\vec{x})^3)^{-1}$, written here for three-dimensional data with $r(\vec{x})$ the distance to the $k$-th nearest neighbour.
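
      Both estimators in a minimal sketch on three-dimensional data, evaluated at the origin and compared with the true density $(2\pi)^{-3/2}\approx 0.0635$ (the bandwidth $\lambda$, the value of $k$, and the sample size are arbitrary choices):

        import numpy as np
        from scipy.stats import multivariate_normal

        rng = np.random.default_rng(5)
        N = 50_000
        X = rng.standard_normal((N, 3))            # 3-D samples from N(0, I)
        x0 = np.zeros(3)                           # estimate the density here

        # (1) Parzen window: fixed Gaussian kernel width lambda
        lam = 0.02
        parzen = multivariate_normal(x0, lam * np.eye(3)).pdf(X).mean()

        # (2) kNN: fixed k, volume of the smallest ball holding the k nearest points
        k = 200
        r = np.sort(np.linalg.norm(X - x0, axis=1))[k - 1]
        knn = (k / N) / (4 / 3 * np.pi * r ** 3)

        print(parzen, knn)                         # both roughly 0.06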

    References:

      1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006
