Notes on Probabilistic Latent Semantic Analysis (PLSA)

    I highly recommend reading the more detailed version at http://arxiv.org/abs/1212.3900.

    Formulation of PLSA

    There are two ways to formulate PLSA. They are equivalent but may lead to different inference processes.

    1. P(d,w) = P(d) \sum_{z} P(w|z)P(z|d)
    2. P(d,w) = \sum_{z} P(w|z)P(d|z)P(z)

    Let’s see why these two equations are equivalent by using Bayes rule.

    P(z|d) = \frac{P(d|z)P(z)}{P(d)}
    \rightarrow P(z|d)P(d) = P(d|z)P(z)
    \rightarrow P(w|z)P(z|d)P(d) = P(w|z)P(d|z)P(z)
    \rightarrow P(d) \sum_{z} P(w|z)P(z|d) = \sum_{z} P(w|z)P(d|z)P(z)
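
    To make the equivalence concrete, here is a quick numeric check (a throwaway sketch; the array sizes and random initialization below are illustrative assumptions, not part of the original note). It builds a consistent set of distributions from a joint P(d,z) and verifies that both factorizations produce the same P(d,w):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    D, W, Z = 4, 6, 3                                       # documents, words, topics

    # Build consistent distributions from a joint P(d,z) and a conditional P(w|z).
    P_dz = rng.random((D, Z)); P_dz /= P_dz.sum()           # joint P(d,z)
    P_w_given_z = rng.random((Z, W))
    P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)   # P(w|z), shape (Z, W)

    P_d = P_dz.sum(axis=1)                                  # P(d)
    P_z = P_dz.sum(axis=0)                                  # P(z)
    P_z_given_d = P_dz / P_d[:, None]                       # P(z|d), shape (D, Z)
    P_d_given_z = (P_dz / P_z[None, :]).T                   # P(d|z), shape (Z, D)

    # Formulation 1: P(d,w) = P(d) * sum_z P(w|z) P(z|d)
    P_dw_1 = P_d[:, None] * (P_z_given_d @ P_w_given_z)
    # Formulation 2: P(d,w) = sum_z P(w|z) P(d|z) P(z)
    P_dw_2 = (P_d_given_z.T * P_z[None, :]) @ P_w_given_z

    assert np.allclose(P_dw_1, P_dw_2)
    ```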

    The probability of generating the whole data set is (we assume that all words are generated independently):

    D = \prod_{d} \prod_{w} P(d,w)^{n(d,w)}

    The log-likelihoods of the whole data set for formulations (1) and (2) are:

    L_{1} = \sum_{d} \sum_{w} n(d,w) \log \left[ P(d) \sum_{z} P(w|z)P(z|d) \right]

    L_{2} = \sum_{d} \sum_{w} n(d,w) \log \left[ \sum_{z} P(w|z)P(d|z)P(z) \right]
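
    As a reference for later, here is how these two log-likelihoods could be computed from a document–word count matrix (a minimal sketch; the function names, array layout, and the small epsilon inside the logarithm are my own choices):

    ```python
    import numpy as np

    def log_likelihood_f1(n_dw, P_d, P_w_given_z, P_z_given_d):
        """L1 = sum_{d,w} n(d,w) log[ P(d) sum_z P(w|z) P(z|d) ].

        n_dw: (D, W) word counts; P_d: (D,); P_w_given_z: (Z, W); P_z_given_d: (D, Z).
        """
        P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)
        return np.sum(n_dw * np.log(P_dw + 1e-12))          # epsilon guards against log(0)

    def log_likelihood_f2(n_dw, P_z, P_w_given_z, P_d_given_z):
        """L2 = sum_{d,w} n(d,w) log[ sum_z P(w|z) P(d|z) P(z) ].

        P_z: (Z,); P_w_given_z: (Z, W); P_d_given_z: (Z, D).
        """
        P_dw = (P_d_given_z.T * P_z[None, :]) @ P_w_given_z
        return np.sum(n_dw * np.log(P_dw + 1e-12))
    ```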

    EM

    For L_{1} or L_{2}, the optimization is hard because of the logarithm of a sum. Therefore, an algorithm called Expectation-Maximization (EM) is usually employed. Before we introduce anything about EM, please note that EM is only guaranteed to find a local optimum (although it may happen to be a global one).

    First, let us see how EM works in general. As shown for PLSA, we usually want to estimate the likelihood of the data, namely P(X|\theta), given the parameter \theta. The easiest way is to obtain a maximum likelihood estimator by maximizing P(X|\theta) directly. However, we sometimes also want to include hidden variables that are useful for our task. Therefore, what we really want to maximize is P(X|\theta) = \sum_{z} P(X|z,\theta)P(z|\theta), the complete likelihood. Our attention now turns to this complete likelihood. Again, directly maximizing this likelihood is usually difficult. What we would like to do here is obtain a lower bound of the likelihood and maximize that lower bound.

    We need Jensen’s Inequality to help us obtain this lower bound. For any convex function f(x), Jensen’s Inequality states that:

    \lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y)

    Thus, it is not difficult to show that:

    E[f(x)] = \sum_{x} P(x) f(x) \geq f\left(\sum_{x} P(x) x\right) = f(E[x])

    and for concave functions (like the logarithm), it is:

    E[f(x)] \leq f(E[x])
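
    As a quick numeric illustration of the concave case (not part of the original derivation; the sampled values and distribution below are arbitrary), we can check E[\log x] \leq \log E[x]:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 10.0, size=1000)    # values of a positive random variable
    p = rng.random(1000); p /= p.sum()       # an arbitrary distribution over those values

    lhs = np.sum(p * np.log(x))              # E[log x]
    rhs = np.log(np.sum(p * x))              # log E[x]
    assert lhs <= rhs                        # Jensen's inequality for the concave log
    ```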

    Back to our complete likelihood, we can obtain the following lower bound by using the concave version of Jensen’s Inequality:

    \log \sum_{z} P(X|z,\theta)P(z|\theta)
    = \log \sum_{z} P(X|z,\theta)P(z|\theta)\frac{q(z)}{q(z)}
    = \log E_{q(z)}\left[\frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]
    \geq E_{q(z)}\left[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]

    where q(z) is any distribution over the hidden variable and the expectation is taken with respect to q(z).

    Therefore, we have obtained a lower bound of the complete likelihood, and we want to make this bound as tight as possible. EM is an algorithm that maximizes this lower bound in an iterative fashion. Usually, EM first fixes the current \theta value and maximizes over q(z), and then uses the new q(z) to obtain a new guess of \theta; it is essentially a two-stage maximization process. The first step can be shown as follows:

    E_{q(z)}\left[\log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}\right]
    = \sum_{z} q(z) \log \frac{P(X|z,\theta)P(z|\theta)}{q(z)}
    = \sum_{z} q(z) \log \frac{P(z|X,\theta)P(X|\theta)}{q(z)}
    = \sum_{z} q(z) \log P(X|\theta) + \sum_{z} q(z) \log \frac{P(z|X,\theta)}{q(z)}
    = \log P(X|\theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z|X,\theta)}
    = \log P(X|\theta) - KL(q(z) \| P(z|X,\theta))

    The first term does not depend on q(z). Therefore, in order to maximize the whole expression, we need to minimize the KL divergence between q(z) and P(z|X,\theta), which leads to the optimum q(z) = P(z|X,\theta). So, in the E-step, we use the current guess of \theta to compute the posterior distribution of the hidden variable as the new update. The M-step is problem-dependent; we will see how to do it in the later discussions.

    Another explanation of EM is in terms of optimizing a so-called Q-function. We decompose the data-generation process as P(X,H|\theta) = P(H|X,\theta)P(X|\theta), where H denotes the hidden variables. Therefore, the complete log-likelihood can be written as:

    L_{c}(\theta) = \log P(X,H|\theta) = \log P(X|\theta) + \log P(H|X,\theta) = L(\theta) + \log P(H|X,\theta)

    where L(\theta) = \log P(X|\theta) is the (incomplete) log-likelihood.

    Think about how to maximize L_{c}(\theta). Instead of maximizing it directly, we can iteratively maximize the improvement over the current estimate \theta^{(n)}:

    L(\theta) - L(\theta^{(n)}) = L_{c}(\theta) - \log P(H|X,\theta) - L_{c}(\theta^{(n)}) + \log P(H|X,\theta^{(n)}) = L_{c}(\theta) - L_{c}(\theta^{(n)}) + \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

    Now, taking the expectation of both sides with respect to P(H|X,\theta^{(n)}), we have:

    L(\theta) - L(\theta^{(n)}) = \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)}) + \sum_{H} P(H|X,\theta^{(n)}) \log \frac{P(H|X,\theta^{(n)})}{P(H|X,\theta)}

    The last term is always non-negative since it can be recognized as the KL-divergence between P(H|X,\theta^{(n)}) and P(H|X,\theta). Therefore, we obtain a lower bound of the likelihood:

    L(\theta) \geq \sum_{H} L_{c}(\theta)P(H|X,\theta^{(n)}) + L(\theta^{(n)}) - \sum_{H} L_{c}(\theta^{(n)})P(H|X,\theta^{(n)})

    The last two terms can be treated as constants because they do not contain the variable \theta, so the lower bound is essentially the first term, which is sometimes called the “Q-function”:

    Q(\theta;\theta^{(n)}) = E[L_{c}(\theta)] = \sum_{H} L_{c}(\theta) P(H|X,\theta^{(n)})

    EM of Formulation 1

    In the case of Formulation 1, let us introduce hidden variables R(z,w,d) to indicate which hidden topic z is selected to generate w in d (with \sum_{z} R(z,w,d) = 1). Therefore, the complete likelihood can be formulated as:

    L_{c1} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \left[ P(d) P(w|z)P(z|d) \right] = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \left[ \log P(d) + \log P(w|z) + \log P(z|d) \right]

    From the equation above, we can write our Q-function for the complete likelihood E[L_{c1}]:

    E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) \left[ \log P(d) + \log P(w|z) + \log P(z|d) \right]

    For the E-step, simply using Bayes’ rule, we can obtain:

    P(z|w,d) = \frac{P(z,w,d)}{P(w,d)} = \frac{P(w|z)P(z|d)P(d)}{\sum_{z'} P(w|z')P(z'|d)P(d)} = \frac{P(w|z)P(z|d)}{\sum_{z'} P(w|z')P(z'|d)}

    For the M-step, we need to maximize the Q-function together with the normalization constraints, using Lagrange multipliers:

    H = E[L_{c1}] + \alpha \left[1 - \sum_{d} P(d)\right] + \beta \sum_{z}\left[1 - \sum_{w} P(w|z)\right] + \gamma \sum_{d}\left[1 - \sum_{z} P(z|d)\right]

    and set all partial derivatives to zero:

    \frac{\partial H}{\partial P(d)} = \sum_{w} \sum_{z} n(d,w) \frac{P(z|w,d)}{P(d)} - \alpha = 0 \rightarrow \sum_{w} \sum_{z} n(d,w) P(z|w,d) - \alpha P(d) = 0

    \frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0 \rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0

    \frac{\partial H}{\partial P(z|d)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z|d)} - \gamma = 0 \rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(z|d) = 0

    Therefore, we can easily obtain:

    P(d) = \frac{\sum_{w} \sum_{z} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)} = \frac{n(d)}{\sum_{d} n(d)}

    P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d)}

    P(z|d) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{z} \sum_{w} n(d,w) P(z|w,d)} = \frac{\sum_{w} n(d,w) P(z|w,d)}{n(d)}
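
    Putting the E-step and the M-step together, a compact NumPy sketch of the Formulation 1 updates might look like the following (the function name, random initialization, fixed iteration count, and the small epsilons are illustrative choices, not prescribed by the derivation above):

    ```python
    import numpy as np

    def plsa_em_f1(n_dw, n_topics, n_iter=100, seed=0):
        """EM for PLSA Formulation 1: P(d,w) = P(d) sum_z P(w|z) P(z|d)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape

        # P(d) has the closed form n(d) / sum_d n(d) and does not change across iterations.
        P_d = n_dw.sum(axis=1) / n_dw.sum()

        # Random normalized initialization of P(w|z) (Z, W) and P(z|d) (D, Z).
        P_w_given_z = rng.random((n_topics, W))
        P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
        P_z_given_d = rng.random((D, n_topics))
        P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # E-step: P(z|w,d) proportional to P(w|z) P(z|d), stored as a (D, W, Z) array.
            post = P_z_given_d[:, None, :] * P_w_given_z.T[None, :, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12

            # Expected counts n(d,w) P(z|w,d).
            weighted = n_dw[:, :, None] * post               # (D, W, Z)

            # M-step: P(w|z) from sum_d n(d,w) P(z|w,d), P(z|d) from sum_w n(d,w) P(z|w,d).
            P_w_given_z = weighted.sum(axis=0).T             # (Z, W)
            P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True) + 1e-12
            P_z_given_d = weighted.sum(axis=1)               # (D, Z)
            P_z_given_d /= P_z_given_d.sum(axis=1, keepdims=True) + 1e-12

        return P_d, P_w_given_z, P_z_given_d
    ```

    Here n_dw would be any document–word count matrix; the dense (D, W, Z) posterior is fine for toy data but would need a sparse or per-document reformulation for a real corpus.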

    EM of Formulation 2

    Using a similar method, we introduce hidden variables to indicate which z is selected to generate w and d, and we have the following complete likelihood:

    L_{c2} = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \left[ P(z) P(w|z)P(d|z) \right] = \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \left[ \log P(z) + \log P(w|z) + \log P(d|z) \right]

    Therefore, the Q-function E[L_{c2}] would be:

    E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z|w,d) \left[ \log P(z) + \log P(w|z) + \log P(d|z) \right]

    For the E-step, again simply using Bayes’ rule, we can obtain:

    P(z|w,d) = \frac{P(z,w,d)}{P(w,d)} = \frac{P(w|z)P(d|z)P(z)}{\sum_{z'} P(w|z')P(d|z')P(z')}

    For the M-step, we maximize the constrained version of the Q-function:

    H = E[L_{c2}] + \alpha \left[1 - \sum_{z} P(z)\right] + \beta \sum_{z}\left[1 - \sum_{w} P(w|z)\right] + \gamma \sum_{z}\left[1 - \sum_{d} P(d|z)\right]

    and set all partial derivatives to zero:

    \frac{\partial H}{\partial P(z)} = \sum_{d} \sum_{w} n(d,w) \frac{P(z|w,d)}{P(z)} - \alpha = 0 \rightarrow \sum_{d} \sum_{w} n(d,w) P(z|w,d) - \alpha P(z) = 0

    \frac{\partial H}{\partial P(w|z)} = \sum_{d} n(d,w) \frac{P(z|w,d)}{P(w|z)} - \beta = 0 \rightarrow \sum_{d} n(d,w) P(z|w,d) - \beta P(w|z) = 0

    \frac{\partial H}{\partial P(d|z)} = \sum_{w} n(d,w) \frac{P(z|w,d)}{P(d|z)} - \gamma = 0 \rightarrow \sum_{w} n(d,w) P(z|w,d) - \gamma P(d|z) = 0

    Therefore, we can easily obtain:

    P(z) = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w) P(z|w,d)} = \frac{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w)}

    P(w|z) = \frac{\sum_{d} n(d,w) P(z|w,d)}{\sum_{w} \sum_{d} n(d,w) P(z|w,d)}

    P(d|z) = \frac{\sum_{w} n(d,w) P(z|w,d)}{\sum_{d} \sum_{w} n(d,w) P(z|w,d)}
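
    Analogously, here is a sketch of the Formulation 2 updates (again, the naming, initialization, and iteration count are illustrative assumptions):

    ```python
    import numpy as np

    def plsa_em_f2(n_dw, n_topics, n_iter=100, seed=0):
        """EM for PLSA Formulation 2: P(d,w) = sum_z P(w|z) P(d|z) P(z)."""
        rng = np.random.default_rng(seed)
        D, W = n_dw.shape

        # Uniform P(z), plus random normalized P(w|z) (Z, W) and P(d|z) (Z, D).
        P_z = np.full(n_topics, 1.0 / n_topics)
        P_w_given_z = rng.random((n_topics, W))
        P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True)
        P_d_given_z = rng.random((n_topics, D))
        P_d_given_z /= P_d_given_z.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # E-step: P(z|w,d) proportional to P(w|z) P(d|z) P(z), stored as a (D, W, Z) array.
            post = P_d_given_z.T[:, None, :] * P_w_given_z.T[None, :, :] * P_z[None, None, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12

            weighted = n_dw[:, :, None] * post               # n(d,w) P(z|w,d), shape (D, W, Z)

            # M-step, following the three update equations above.
            P_z = weighted.sum(axis=(0, 1))
            P_z /= P_z.sum()                                 # P(z) from sum_{d,w} n(d,w) P(z|w,d)
            P_w_given_z = weighted.sum(axis=0).T             # (Z, W)
            P_w_given_z /= P_w_given_z.sum(axis=1, keepdims=True) + 1e-12
            P_d_given_z = weighted.sum(axis=1).T             # (Z, D)
            P_d_given_z /= P_d_given_z.sum(axis=1, keepdims=True) + 1e-12

        return P_z, P_w_given_z, P_d_given_z
    ```

    The two parameterizations describe the same model, related by P(d)P(z|d) = P(z)P(d|z) = P(d,z), so the resulting topics should match up to this reparameterization.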
