Notes on Probabilistic Latent Semantic Analysis (PLSA)

    I highly recommend reading the more detailed version at http://arxiv.org/abs/1212.3900.

    Formulation of PLSA

    There are two ways to formulate PLSA. They are equivalent but may lead to different inference processes.

    1. P(d,w) = P(d) \sum_{z} P(w|z)P(z|d)
    2. P(d,w) = \sum_{z} P(w|z)P(d|z)P(z)

    Let’s see why these two equations are equivalent by using Bayes rule.

    P(z|d) = \frac{P(d|z)P(z)}{P(d)}
    \rightarrow P(z|d)P(d) = P(d|z)P(z)
    \rightarrow P(w|z)P(z|d)P(d) = P(w|z)P(d|z)P(z)
    \rightarrow P(d) \sum_{z} P(w|z)P(z|d) = \sum_{z} P(w|z)P(d|z)P(z)
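
    To make the equivalence concrete, here is a quick numeric check (a throwaway sketch; the array sizes and random initialization below are illustrative assumptions, not part of the original note). It builds a consistent set of distributions from a joint P(d,z) and verifies that both factorizations produce the same P(d,w):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    D, W, Z = 4, 6, 3                                       # documents, words, topics

    # Build consistent distributions from a joint P(d,z) and a conditional P(w|z).
    P_dz = rng.random((D, Z)); P_dz /= P_dz.sum()           # joint P(d,z)
    P_w_given_z = rng.random((Z, W))
    P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)   # P(w|z), shape (Z, W)

    P_d = P_dz.sum(axis=1)                                  # P(d)
    P_z = P_dz.sum(axis=0)                                  # P(z)
    P_z_given_d = P_dz / P_d[:, None]                       # P(z|d), shape (D, Z)
    P_d_given_z = (P_dz / P_z[None, :]).T                   # P(d|z), shape (Z, D)

    # Formulation 1: P(d,w) = P(d) * sum_z P(w|z) P(z|d)
    P_dw_1 = P_d[:, None] * (P_z_given_d @ P_w_given_z)
    # Formulation 2: P(d,w) = sum_z P(w|z) P(d|z) P(z)
    P_dw_2 = (P_d_given_z.T * P_z[None, :]) @ P_w_given_z

    assert np.allclose(P_dw_1, P_dw_2)
    ```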

    The probability of generating the whole data set is (we assume that all words are generated independently):

    D = \prod_{d} \prod_{w} P(d,w)^{n(d,w)}

    The log-likelihoods of the whole data set for formulations (1) and (2) are:

    L_{1} = \sum_{d} \sum_{w} n(d,w) \log \left[ P(d) \sum_{z} P(w|z)P(z|d) \right]

    L_{2} = \sum_{d} \sum_{w} n(d,w) \log \left[ \sum_{z} P(w|z)P(d|z)P(z) \right]
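
    As a reference for later, here is how these two log-likelihoods could be computed from a document–word count matrix (a minimal sketch; the function names, array layout, and the small epsilon inside the logarithm are my own choices):

    ```python
    import numpy as np

    def log_likelihood_f1(n_dw, P_d, P_w_given_z, P_z_given_d):
        """L1 = sum_{d,w} n(d,w) log[ P(d) sum_z P(w|z) P(z|d) ].

        n_dw: (D, W) word counts; P_d: (D,); P_w_given_z: (Z, W); P_z_given_d: (D, Z).
        """
        P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)
        return np.sum(n_dw * np.log(P_dw + 1e-12))          # epsilon guards against log(0)

    def log_likelihood_f2(n_dw, P_z, P_w_given_z, P_d_given_z):
        """L2 = sum_{d,w} n(d,w) log[ sum_z P(w|z) P(d|z) P(z) ].

        P_z: (Z,); P_w_given_z: (Z, W); P_d_given_z: (Z, D).
        """
        P_dw = (P_d_given_z.T * P_z[None, :]) @ P_w_given_z
        return np.sum(n_dw * np.log(P_dw + 1e-12))
    ```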

    EM

    For L_{1} or L_{2}, the optimization is hard because of the logarithm of a sum. Therefore, an algorithm called Expectation-Maximization (EM) is usually employed. Before we introduce anything about EM, please note that EM is only guaranteed to find a local optimum (although it may happen to be a global one).

    First, let us see how EM works in general. As shown for PLSA, we usually want to estimate the likelihood of the data, namely P(X|\theta), given the parameter \theta. The easiest way is to obtain a maximum likelihood estimator by maximizing P(X|\theta) directly. However, we sometimes also want to include hidden variables that are useful for our task. Therefore, what we really want to maximize is P(X|\theta) = \sum_{z} P(X|z,\theta)P(z|\theta), the complete likelihood. Our attention now turns to this complete likelihood. Again, directly maximizing this likelihood is usually difficult. What we would like to do here is obtain a lower bound of the likelihood and maximize that lower bound.

    We need Jensen’s Inequality to help us obtain this lower bound. For any convex function f(x), Jensen’s Inequality states that:

    \lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y)

    Thus, it is not difficult to show that:

    E[f(x)] = \sum_{x} P(x) f(x) \geq f\left(\sum_{x} P(x) x\right) = f(E[x])

    and for concave functions (like the logarithm), it is:

    E[f(x)] \leq f(E[x])
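
    As a quick numeric illustration of the concave case (not part of the original derivation; the sampled values and distribution below are arbitrary), we can check E[\log x] \leq \log E[x]:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 10.0, size=1000)    # values of a positive random variable
    p = rng.random(1000); p /= p.sum()       # an arbitrary distribution over those values

    lhs = np.sum(p * np.log(x))              # E[log x]
    rhs = np.log(np.sum(p * x))              # log E[x]
    assert lhs <= rhs                        # Jensen's inequality for the concave log
    ```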

    Back to our complete likelihood, we can obtain the following lower bound by using the concave version of Jensen’s Inequality:

    \log \sum_{z} P(X|z,\theta)P(z|\theta)
    = \log \sum_{z} P(X|z,\theta)P(z|\theta)\frac{q(z)}{q(z)}
    = \log E_{q(z)}\left[\frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]
    \geq E_{q(z)}\left[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]

    where q(z) is any distribution over the hidden variable and the expectation is taken with respect to q(z).

    Therefore, we have obtained a lower bound of the complete likelihood, and we want to make this bound as tight as possible. EM is an algorithm that maximizes this lower bound in an iterative fashion. Usually, EM first fixes the current \theta value and maximizes over q(z), and then uses the new q(z) to obtain a new guess of \theta; it is essentially a two-stage maximization process. The first step can be shown as follows:

    E_{q(z)}\left[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]
    = \sum_{z} q(z) \log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}
    = \sum_{z} q(z) \log \frac{P(z|X,\theta)P(X|\theta)}{q(z)}
    = \sum_{z} q(z) \log P(X|\theta) + \sum_{z} q(z) \log \frac{P(z|X,\theta)}{q(z)}
    = \log P(X|\theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z|X,\theta)}
    = \log P(X|\theta) - KL(q(z) \| P(z|X,\theta))

    The first term does not depend on q(z). Therefore, in order to maximize the whole expression, we need to minimize the KL divergence between q(z) and P(z|X,\theta), which leads to the optimum q(z) = P(z|X,\theta). So, in the E-step, we use the current guess of \theta to compute the posterior distribution of the hidden variable as the new update. The M-step is problem-dependent; we will see how to do it in the later discussions.

    Another explanation of EM is in terms of optimizing a so-called Q-function. We decompose the data-generation process as P(X,H|\theta) = P(H|X,\theta)P(X|\theta), where H denotes the hidden variables. Therefore, the complete log-likelihood can be written as:

    L_{c}(\theta) = \log P(X,H|\theta) = \log P(X|\theta) + \log P(H|X,\theta) = L(\theta) + \log P(H|X,\theta)

    where L(\theta) = \log P(X|\theta) is the (incomplete) log-likelihood.

    Think about how to maximize L_{c}(\theta). Instead of maximizing it directly, we can iteratively maximize the improvement over the current estimate \theta^{(n)}:

    L(\theta) - L(\theta^{(n)}) = L_{c}(\theta) - \log P(H|X,\theta) - L_{c}(\theta^{(n)}) + \log P(H|X,\theta^{(n)}) = L_{c}(\theta) - L_{c}(\theta^{(n)}) + \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

    Now, taking the expectation of both sides with respect to P(H|X,\theta^{(n)}), we have:

    L(\theta) - L(\theta^{(n)}) = \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)}) + \sum_{H} P(H|X,\theta^{(n)}) \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

    The last term is always non-negative since it can be recognized as the KL-divergence between P(H|X,\theta^{(n)}) and P(H|X,\theta). Therefore, we obtain a lower bound of the likelihood:

    L(\theta) \geq \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) + L(\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)})

    The last two terms can be treated as constants because they do not contain the variable \theta, so the lower bound is essentially the first term, which is sometimes called the “Q-function”:

    Q(\theta;\theta^{(n)}) = E[L_{c}(\theta)] = \sum_{H} L_{c}(\theta) P(H|X,\theta^{(n)})

    EM of Formulation 1

    In the case of Formulation 1, let us introduce hidden variables R(z,w,d) to indicate which hidden topic z is selected to generate w in d (with \sum_{z} R(z,w,d) = 1). Therefore, the complete likelihood can be formulated as:

    L_{c1} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \left[ P(d) P(w|z)P(z|d) \right] = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \left[ \log P(d) + \log P(w|z) + \log P(z|d) \right]

    From the equation above, we can write our Q-function for the complete likelihood E[L_{c1}]:

    E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) \left[ \log P(d) + \log P(w|z) + \log P(z|d) \right]

    For the E-step, simply using Bayes’ rule, we can obtain:

    P(z|w,d) = \frac{P(z,w,d)}{P(w,d)} = \frac{P(w|z)P(z|d)P(d)}{\sum_{z'} P(w|z')P(z'|d)P(d)} = \frac{P(w|z)P(z|d)}{\sum_{z'} P(w|z')P(z'|d)}

    For the M-step, we need to maximize the Q-function together with the normalization constraints, using Lagrange multipliers:

    H = E[L_{c1}] + \alpha \left[1 - \sum_{d} P(d)\right] + \beta \sum_{z}\left[1 - \sum_{w} P(w|z)\right] + \gamma \sum_{d}\left[1 - \sum_{z} P(z|d)\right]

    and set all partial derivatives to zero:

    \frac{\partial H}{\partial P(d)} = \sum_{w} \sum_{z} n(d,w) \frac{P(z|w,d)}{P(d)} - \alpha = 0 \rightarrow \sum_{w} \sum_{z} n(d,w) P(z|w,d) - \alpha P(d) = 0

    \frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0 \rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0

    \frac{\partial H}{\partial P(z|d)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z|d)} - \gamma = 0 \rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(z|d) = 0

    Therefore, we can easily obtain:

    P(d) = \frac{\sum_{w} \sum_{z} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)} = \frac{n(d)}{\sum_{d} n(d)}

    P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d)}

    P(z|d) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{z} \sum_{w} n(d,w) P(z|w,d)} = \frac{\sum_{w} n(d,w) P(z|w,d)}{n(d)}
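
    Putting the E-step and the M-step together, a compact NumPy sketch of the Formulation 1 updates might look like the following (the function name, random initialization, fixed iteration count, and the small epsilons are illustrative choices, not prescribed by the derivation above):

    ```python
    import numpy as np

    def plsa_em_f1(n_dw, n_topics, n_iter=100, seed=0):
        """EM for PLSA Formulation 1: P(d,w) = P(d) sum_z P(w|z) P(z|d)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape

        # P(d) has the closed form n(d) / sum_d n(d) and does not change across iterations.
        P_d = n_dw.sum(axis=1) / n_dw.sum()

        # Random normalized initialization of P(w|z) (Z, W) and P(z|d) (D, Z).
        P_w_given_z = rng.random((n_topics, W))
        P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
        P_z_given_d = rng.random((D, n_topics))
        P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # E-step: P(z|w,d) proportional to P(w|z) P(z|d), stored as a (D, W, Z) array.
            post = P_z_given_d[:, None, :] * P_w_given_z.T[None, :, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12

            # Expected counts n(d,w) P(z|w,d).
            weighted = n_dw[:, :, None] * post               # (D, W, Z)

            # M-step: P(w|z) from sum_d n(d,w) P(z|w,d), P(z|d) from sum_w n(d,w) P(z|w,d).
            P_w_given_z = weighted.sum(axis=0).T             # (Z, W)
            P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True) + 1e-12
            P_z_given_d = weighted.sum(axis=1)               # (D, Z)
            P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True) + 1e-12

        return P_d, P_w_given_z, P_z_given_d
    ```

    Here n_dw would be any document–word count matrix; the dense (D, W, Z) posterior is fine for toy data but would need a sparse or per-document reformulation for a real corpus.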

    EM of Formulation 2

    Using a similar method, we introduce hidden variables to indicate which z is selected to generate w and d, and we have the following complete likelihood:

    L_{c2} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \left[ P(z) P(w|z)P(d|z) \right] = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \left[ \log P(z) + \log P(w|z) + \log P(d|z) \right]

    Therefore, the Q-function E[L_{c2}] would be:

    E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) \left[ \log P(z) + \log P(w|z) + \log P(d|z) \right]

    For the E-step, again simply using Bayes’ rule, we can obtain:

    P(z|w,d) = \frac{P(z,w,d)}{P(w,d)} = \frac{P(w|z)P(d|z)P(z)}{\sum_{z'} P(w|z')P(d|z')P(z')}

    For the M-step, we maximize the constrained version of the Q-function:

    H = E[L_{c2}] + \alpha \left[1 - \sum_{z} P(z)\right] + \beta \sum_{z}\left[1 - \sum_{w} P(w|z)\right] + \gamma \sum_{z}\left[1 - \sum_{d} P(d|z)\right]

    and set all partial derivatives to zero:

    \frac{\partial H}{\partial P(z)} = \sum_{d} \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z)} - \alpha = 0 \rightarrow \sum_{d} \sum_{w} n(d,w) P(z|w,d) - \alpha P(z) = 0

    \frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0 \rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0

    \frac{\partial H}{\partial P(d|z)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(d|z)} - \gamma = 0 \rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(d|z) = 0

    Therefore, we can easily obtain:

    P(z) = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)} = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w)}

    P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d)}

    P(d|z) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}
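
    Analogously, here is a sketch of the Formulation 2 updates (again, the naming, initialization, and iteration count are illustrative assumptions):

    ```python
    import numpy as np

    def plsa_em_f2(n_dw, n_topics, n_iter=100, seed=0):
        """EM for PLSA Formulation 2: P(d,w) = sum_z P(w|z) P(d|z) P(z)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape

        # Uniform P(z), plus random normalized P(w|z) (Z, W) and P(d|z) (Z, D).
        P_z = np.full(n_topics, 1.0 / n_topics)
        P_w_given_z = rng.random((n_topics, W))
        P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
        P_d_given_z = rng.random((n_topics, D))
        P_d_given_z /= P_d_given_z.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # E-step: P(z|w,d) proportional to P(w|z) P(d|z) P(z), stored as a (D, W, Z) array.
            post = P_d_given_z.T[:, None, :] * P_w_given_z.T[None, :, :] * P_z[None, None, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12

            weighted = n_dw[:, :, None] * post               # n(d,w) P(z|w,d), shape (D, W, Z)

            # M-step, following the three update equations above.
            P_z = weighted.sum(axis=(0, 1))
            P_z /= P_z.sum()                                 # P(z) from sum_{d,w} n(d,w) P(z|w,d)
            P_w_given_z = weighted.sum(axis=0).T             # (Z, W)
            P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True) + 1e-12
            P_d_given_z = weighted.sum(axis=1).T             # (Z, D)
            P_d_given_z /= P_d_given_z.sum(axis=1, keepdims=True) + 1e-12

        return P_z, P_w_given_z, P_d_given_z
    ```

    The two parameterizations describe the same model, related by P(d)P(z|d) = P(z)P(d|z) = P(d,z), so the resulting topics should match up to this reparameterization.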
