EM and GMM (Theory)

    Part 1: Theory

     

Contents:

    • What's GMM?
    • How to solve GMM?
    • What's EM?
    • Explanation of the result

    What's GMM?

GMM is short for Gaussian Mixture Model, which can be represented as follows:
\[
p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\, p(\mathbf{x}|\theta_k)
\]

where
\[
p(\mathbf{x}|\theta_k) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma_k|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}\left(\mathbf{x} - \boldsymbol{\mu}_k\right)^T\Sigma_k^{-1}\left(\mathbf{x} - \boldsymbol{\mu}_k\right)\right]
\]

represents the $k$th Gaussian component of the GMM, and $\pi_k$ represents the mixing weight of the $k$th component, with $\sum_{k=1}^{K}\pi_k = 1$.

GMM can be used to estimate the PDF of given data; that is to say, we can suppose that the given data obey a GMM distribution. (We could also suppose that the data obey a single Gaussian distribution, but a GMM can describe more complex distributions.)
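
To make this concrete, here is a minimal sketch of evaluating the two formulas above at a point (written in Python/NumPy purely for illustration; the author's companion code is in MATLAB):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Gaussian component density p(x | theta_k) = N(x | mu_k, Sigma_k)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def gmm_pdf(x, pis, mus, Sigmas):
    """GMM density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, Sig)
               for pi, mu, Sig in zip(pis, mus, Sigmas))
```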

Here is the problem: if the given data look as shown in Figure 1, how can we estimate the distribution of these data?

    Figure 1

If we use MLE (Maximum Likelihood Estimation) to solve this problem, namely:
\[
\begin{split}
&\max L = \max \log \prod_{n=1}^{N}p(\mathbf{x}_n) = \max \sum_{n=1}^{N}\log\sum_{k=1}^{K}\pi_k\, p(\mathbf{x}_n|\theta_k)\\
&\nabla_{\pi_k}L = 0 \quad \nabla_{\mu_k}L = 0 \quad \nabla_{\Sigma_k}L = 0
\end{split}
\]

We can't get an analytic solution, because the sum over components sits inside the logarithm and couples all the parameters; thus we need another algorithm to solve GMM.
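
For reference, the objective that MLE maximizes is easy to evaluate numerically even though it has no closed-form maximizer; a sketch, reusing `gmm_pdf` from the block above:

```python
def log_likelihood(X, pis, mus, Sigmas):
    """L = sum_n log sum_k pi_k N(x_n | theta_k). The sum over k sits inside
    the log, which is why the gradient equations have no closed-form solution."""
    return sum(np.log(gmm_pdf(x, pis, mus, Sigmas)) for x in X)
```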

    How to solve GMM?

To begin with, let's analyze this GMM problem. If we can get the parameters of the GMM, which are $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$, we have solved it. So our algorithm should estimate $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$.

To simplify the problem: if we knew each data point's Gaussian distribution separately, in other words, if each data point belonged to one certain Gaussian distribution and we knew which one, then we could use MLE to solve each component of the GMM separately.

For example, in Figure 2, if we knew that data points of the same color come from the same Gaussian distribution, we could apply MLE to each color group separately to estimate $\Sigma_k$ and $\boldsymbol{\mu}_k$. If the five color groups contain the same quantity of data points, then $\pi_k = 0.2$ for $k=1,2,3,4,5$. In this situation, GMM can be easily solved (see the sketch after Figure 2).

    Figure 2
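
A sketch of this "labels known" case, assuming a hypothetical `labels` array (values 0..K-1) marking which component each point belongs to:

```python
def mle_per_group(X, labels, K):
    """With assignments known, each component's MLE is just its own group's
    sample mean, (biased) sample covariance, and share of the points."""
    N = len(X)
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[labels == k]                    # points assigned to component k
        mu = Xk.mean(axis=0)
        diff = Xk - mu
        pis.append(len(Xk) / N)                # pi_k = N_k / N
        mus.append(mu)
        Sigmas.append(diff.T @ diff / len(Xk)) # MLE covariance of group k
    return np.array(pis), np.array(mus), np.array(Sigmas)
```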


But the problem is, we don't know which Gaussian distribution each data point belongs to! Thus there should be a hidden parameter controlling which Gaussian distribution the $n$th data point belongs to.

Now let's define
$$z_{nk}\in\{0,1\}$$

$z_{nk}=1$ means the $n$th point belongs to the $k$th Gaussian distribution;

$z_{nk}=0$ means the $n$th point does not belong to the $k$th Gaussian distribution.

Using $z_{nk}$, we can rewrite the likelihood function as follows:
\[
L = \log \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_k^{z_{nk}}\, p(\mathbf{x}_n|\theta_k)^{z_{nk}}
\]

Notice that once we define $z_{nk}$, each data point is described by exactly one Gaussian distribution. Thus $\prod_{k=1}^{K}\pi_k^{z_{nk}}p(\mathbf{x}_n|\theta_k)^{z_{nk}}$ can be used to describe each data point's probability density: although it has the form of a product, for a given $n$, $z_{nk}$ equals 1 for only one value of $k$.
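
For a concrete instance, suppose $K=3$ and the $n$th point comes from the second component, so $(z_{n1}, z_{n2}, z_{n3}) = (0, 1, 0)$; the product then collapses to a single factor:
\[
\prod_{k=1}^{3}\pi_k^{z_{nk}}\, p(\mathbf{x}_n|\theta_k)^{z_{nk}} = \pi_1^{0}p(\mathbf{x}_n|\theta_1)^{0}\cdot \pi_2^{1}p(\mathbf{x}_n|\theta_2)^{1}\cdot \pi_3^{0}p(\mathbf{x}_n|\theta_3)^{0} = \pi_2\, p(\mathbf{x}_n|\theta_2)
\]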

Let's continue with the likelihood function:
\[
\begin{split}
L &= \log \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_k^{z_{nk}}\, p(\mathbf{x}_n|\theta_k)^{z_{nk}}\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}\log \pi_k^{z_{nk}}\, p(\mathbf{x}_n|\theta_k)^{z_{nk}}\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}\left[z_{nk}\log\pi_k + z_{nk}\log p(\mathbf{x}_n|\theta_k)\right]\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}\left[z_{nk}\log\pi_k + z_{nk}\log p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)\right]
\end{split}
\]

In this likelihood function there are three exposed parameters, $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$, which we want to solve for. There is also one hidden parameter, $z_{nk}$, which is not included in the final result.

Now, how do we solve for the exposed parameters $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$ in the presence of the hidden parameter $z_{nk}$?

    What's EM?

To solve the above question, we use the EM algorithm, which has two parts: the E (Expectation) part and the M (Maximization) part.

E part: calculate the expectation of the likelihood function with respect to the hidden parameter.

M part: find the exposed parameters that maximize that expectation, then go back to the E part and iterate. (Notice that the hidden parameter and the exposed parameters influence each other! Thus, when we return to the E part, the expectation changes.)

      As for the above GMM problem, the hidden parameter is $z_{nk}$.

So, in the E part, we should calculate the expectation of the likelihood function with respect to $z_{nk}$, which is:

\[
\begin{split}
Q &= E_{z_{nk}}[L]\\
&= E_{z_{nk}}\left\{\sum_{n=1}^{N}\sum_{k=1}^{K}\left[z_{nk}\log\pi_k + z_{nk}\log p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)\right]\right\}\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}p(z_{nk}=1)\left[\log\pi_k + \log p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)\right] + \sum_{n=1}^{N}\sum_{k=1}^{K}p(z_{nk}=0)\left[0\cdot\log\pi_k + 0\cdot\log p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)\right]\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}p(z_{nk}=1)\left[\log\pi_k + \log p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)\right]
\end{split}
\]

Notice that in the iteration process (supposing the current $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$ are known), by Bayes' rule:
\[
p(z_{nk}=1) = \frac{\pi_k\, p(\mathbf{x}_n|\Sigma_k,\boldsymbol{\mu}_k)}{\sum_{j=1}^{K}\pi_j\, p(\mathbf{x}_n|\Sigma_j,\boldsymbol{\mu}_j)}
\]
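
In code, the E part is just this Bayes'-rule computation for every pair $(n, k)$; a sketch reusing `gaussian_pdf` from above, where `gamma[n, k]` stands for $p(z_{nk}=1)$:

```python
def e_step(X, pis, mus, Sigmas):
    """E part: responsibilities gamma[n, k] = p(z_nk = 1) via Bayes' rule."""
    K = len(pis)
    gamma = np.array([[pis[k] * gaussian_pdf(x, mus[k], Sigmas[k])
                       for k in range(K)] for x in X])
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over k
```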

In the M part:
\[
\nabla_{\pi_k}Q = 0 \quad \nabla_{\mu_k}Q = 0 \quad \nabla_{\Sigma_k}Q = 0
\]

We can get:
\[
\begin{split}
&\boldsymbol{\mu}_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}p(z_{nk}=1)\,\mathbf{x}_n\\
&\Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}p(z_{nk}=1)(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^T\\
&\pi_k^{new} = \frac{N_k}{N}\\
&N_k = \sum_{n=1}^{N}p(z_{nk}=1)
\end{split}
\]
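
A sketch of the M part, implementing exactly these four update formulas (`gamma` is the responsibility matrix from the E part):

```python
def m_step(X, gamma):
    """M part: closed-form updates for pi_k, mu_k, Sigma_k."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)             # N_k = sum_n p(z_nk = 1)
    pis = Nk / N                       # pi_k^new = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]  # mu_k^new: responsibility-weighted means
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]              # x_n - mu_k^new
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(Sigmas)
```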

Thus we can first initialize $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$, then calculate $p(z_{nk}=1)$, then calculate the new $\pi_k$, $\Sigma_k$ and $\boldsymbol{\mu}_k$, then recalculate $p(z_{nk}=1)$, and so on, until the solution converges.
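
Putting the pieces together, a minimal sketch of this loop; the initialization here (uniform weights, random data points as means, the global covariance for every component) is one common choice, assumed for illustration rather than taken from the original post:

```python
def fit_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """Alternate the E and M parts until the log-likelihood stops improving."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)                      # initial pi_k
    mus = X[rng.choice(N, size=K, replace=False)]  # initial mu_k
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, pis, mus, Sigmas)        # E part
        pis, mus, Sigmas = m_step(X, gamma)        # M part
        ll = log_likelihood(X, pis, mus, Sigmas)
        if ll - prev < tol:                        # converged
            break
        prev = ll
    return pis, mus, Sigmas
```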

    Explanation of the result

Analyzing the result, here is an intuitive explanation:

The result can be treated as clustering, which clusters $N$ people into $K$ groups:

1. The number of people in the $k$th group ($N_k$) is the sum of the membership degrees ($p(z_{nk}=1)$), each of which represents how much the $n$th person belongs to the $k$th group.

2. Each person has a weight ($\mathbf{x}_n$), so when we cluster people into groups, we want to know the average weight ($\boldsymbol{\mu}_k$) in each group, and the weight variance ($\Sigma_k$) in each group.

3. When we calculate the average weight in one group, we compute the total weight in this group ($\sum_{n=1}^{N}p(z_{nk}=1)\,\mathbf{x}_n$), and then divide by the number of people in this group ($N_k$).

4. When we calculate the weight variance in one group, we compute the total weight variance in this group ($\sum_{n=1}^{N}p(z_{nk}=1)(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^T$), and then divide by the number of people in this group ($N_k$).

5. $\pi_k$ can be treated as the proportion of the population that the $k$th group takes up.

MATLAB code for the EM algorithm can be found in "EM and GMM (Code)".
