Notes on EM and pLSA

(Appendix added on August 4th, 2014; see the end.)

Introduction to EM

EM (Expectation Maximization) is a general technique for solving model-learning problems that involve latent variables. Given a set of observed data \mathbf{y} and a set of latent variables \mathbf{z}, we want to learn the parameters by maximizing the log-likelihood of the data, i.e. L(\bm\theta) = \log P(\mathbf{y}|\bm{\theta}). Including the latent variables, the objective function for learning becomes:

(1)   \begin{equation*} L(\bm\theta) = \log P(\mathbf y|\bm\theta) = \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) \end{equation*}

However, this \log\sum structure makes it difficult to apply standard optimization techniques such as gradient descent. Facing this difficulty, we can bring in two ideas: first, we can do the learning step by step, i.e. iteratively, so that at each step i+1 we have an old set of parameters \bm\theta^{(i)} from the i-th iteration that we can use; second, we can find a lower bound for the objective function 1 and optimize towards that instead. Since it is a lower bound that we can guarantee to improve (or at least not decrease), we can (in a sense made precise below) guarantee that the objective function is increasing (or at least not decreasing). Keeping these two ideas in mind, we can do the derivation as below:

(2)   \begin{equation*}  \begin{split} L(\bm\theta) &= \log P(\mathbf y|\bm\theta) \\ &= \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) \\ &= \log \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } \\ &\ge \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } \text{  (Jensen's inequality)} \\ &\propto \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log P(\mathbf{y,z}|\bm\theta) \text{  (dropping const)} \\ \end{split} \end{equation*}
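As a quick sanity check on the direction of the inequality used above: \log is concave, so the log of an average is at least the average of the logs. For two equally weighted points 1 and 9 (a numeric example of my own, not from the derivation),

\log \frac{1 + 9}{2} = \log 5 \approx 1.61 \ \ge \ \frac{\log 1 + \log 9}{2} = \log 3 \approx 1.10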

In the above derivation, we use the parameters estimated in the last iteration to construct a lower bound, which we can then maximize to indirectly improve the original likelihood function. The equality holds when the random variable X in Jensen's inequality is constant; in the above case that means \forall \mathbf z, \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } = c for some constant c. This equality condition gives us an estimate of the latent variables \mathbf z under the model:

(3)   \begin{equation*} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) = \frac{ P(\mathbf{y,z}|\bm\theta) }{ \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) } \end{equation*}

Many textbooks on EM use the so-called Q-function:

(4)   \begin{equation*}  Q(\bm\theta|\bm\theta^{(i)}) = \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log P(\mathbf{y,z}|\bm\theta) \end{equation*}

Here we can view \log P(\mathbf{y,z}|\bm\theta) as the complete-data log-likelihood (\mathbf y is the incomplete data, \mathbf z are the latent variables), P(\mathbf z|\mathbf y, \bm\theta^{(i)}) as the prediction of the latent variables using the currently estimated parameters, and the Q-function as the expectation of the complete-data log-likelihood. Then we can formalize the procedure for applying the EM algorithm:

Pre-Initialization:

Prepare the Q-function 4

Initialization:

Randomly initialize the parameters \bm\theta^{(0)}

E step:

Estimate the latent variables using the currently estimated parameters: P(\mathbf z|\mathbf y, \bm\theta^{(i)})

M step:

Maximize the Q-function 4 w.r.t. the parameters \bm\theta, given the last estimated parameters \bm\theta^{(i)}

Some properties of EM: it can be shown that under the EM procedure, if the lower bound (the Q-function) is increasing (or at least not decreasing), then the log-likelihood function is increasing (or at least not decreasing), and the iterations eventually reach a converged point (one way to see this is to look at the difference L(\bm\theta) - L(\bm\theta^{(i)})). It is worth noting that EM only finds a local maximum (a point that cannot be improved locally, but not necessarily the highest one), so the initialization of EM matters.
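To make the E and M steps concrete before moving on to pLSA, here is a minimal sketch of the generic EM loop in Python/NumPy. The model (a two-component one-dimensional Gaussian mixture with shared variance), the function name em_gmm_1d, and all numerical details are my own illustration and are not taken from the text above:

import numpy as np

def em_gmm_1d(y, n_iter=50, seed=0):
    # EM for a two-component 1-D Gaussian mixture with shared variance;
    # the parameters are theta = (mixing weight pi, means mu, variance var).
    rng = np.random.default_rng(seed)
    pi, mu, var = 0.5, rng.choice(y, size=2, replace=False), np.var(y)
    for _ in range(n_iter):
        # E step: responsibilities P(z | y, theta^(i)); the shared Gaussian
        # normalizer is identical for both components, so it cancels.
        p1 = pi * np.exp(-(y - mu[0]) ** 2 / (2 * var))
        p2 = (1 - pi) * np.exp(-(y - mu[1]) ** 2 / (2 * var))
        r = p1 / (p1 + p2)
        # M step: maximize Q(theta | theta^(i)); closed form for this model.
        pi = r.mean()
        mu = np.array([np.sum(r * y) / np.sum(r),
                       np.sum((1 - r) * y) / np.sum(1 - r)])
        var = np.sum(r * (y - mu[0]) ** 2 + (1 - r) * (y - mu[1]) ** 2) / len(y)
    return pi, mu, var

# Toy usage: data drawn from two well-separated Gaussians.
y = np.concatenate([np.random.normal(-3.0, 1.0, 200), np.random.normal(3.0, 1.0, 300)])
print(em_gmm_1d(y))

The same skeleton (initialize the parameters, then alternate a posterior computation with a closed-form maximization of the Q-function) is exactly what we instantiate for pLSA below.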

Applying EM to pLSA

pLSA (Probabilistic Latent Semantic Analysis [1, 2]) is a technique for mining topics from text. In pLSA, the log-likelihood function is:

(5)   \begin{equation*}  \begin{split} \log P(\mathbf{d,w}) &= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log P(d_i, w_j) \\ &= \sum_{i=1}^N n(d_i) [\log P(d_i) + \sum_{j=1}^M \frac{n(d_i, w_j)}{n(d_i)} \log \sum_{k=1}^K P(w_j|z_k) P(z_k|d_i)] \\ & \propto \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k) P(z_k|d_i) \text{  (dropping const)}  \\ \end{split} \end{equation*}

Here d denotes documents, w denotes words, and z denotes topic assignments; N is the number of documents, M is the size of the vocabulary, and K is the number of topics.
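The only statistics the EM updates below need from the corpus are the co-occurrence counts n(d_i, w_j), together with the document lengths n(d_i) = \sum_{j=1}^M n(d_i, w_j). As a small, purely illustrative sketch (the toy corpus and all variable names are mine), the count matrix can be built like this:

import numpy as np

# Toy corpus: each document is a list of tokens.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry", "cherry"]]

vocab = sorted({w for doc in docs for w in doc})   # vocabulary, size M
word_id = {w: j for j, w in enumerate(vocab)}

n_dw = np.zeros((len(docs), len(vocab)))           # N x M count matrix n(d_i, w_j)
for i, doc in enumerate(docs):
    for w in doc:
        n_dw[i, word_id[w]] += 1

n_d = n_dw.sum(axis=1)                             # document lengths n(d_i)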

In the log-likelihood function 5 we find a \log\sum structure with a latent variable, very similar to the log-likelihood function 1 of a general EM problem. We could repeat a derivation similar to the one used for EM directly on the log-likelihood function 5, but we can also simply apply the EM algorithm. To do so, we need to derive the Q-function. First, note that equation 5 is not the complete-data log-likelihood, which is:

(6)   \begin{equation*}  \log P(\mathbf{d,w,z}) = \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log P(w_j|z_{ij}) P(z_{ij}|d_i) \end{equation*}

To apply the Q-function we have to be careful, because things get tricky once we include the random variable \mathbf z: it is a multivariate random variable with n slots (one per document-word occurrence), each taking one of K states. In the Q-function we have to sum over all possible \mathbf z, which at first glance seems to lead to an exponential number of terms. That is not the case, however, because the posteriors of the individual slots are independent of each other, i.e. P(z_1, ..., z_n|\mathbf{d,w}) factorizes. We can write the Q-function as:

(7)   \begin{equation*}  \begin{split} Q(\bm\theta|\bm\theta^{(i)}) &= \sum_{z_1, ..., z_n} \sum_{l=1}^n \log P(\mathbf{d,w},z_l) P(z_1, ..., z_n|\mathbf{d,w}) \\ &= \sum_{z_1,z_2,...,z_n} [ \log P(\mathbf{d,w},z_1) + ... + \log P(\mathbf{d,w},z_n)] \\ & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(z_1|\mathbf{d,w}) P(z_2|\mathbf{d,w}) ... P(z_n|\mathbf{d,w}) \\ &= \sum_{z_1,z_2,...,z_n} [\log P(d_1,w_1,z_1) + ... + \log P(d_n,w_n,z_n)] \\ & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(z_1|d_1,w_1) P(z_2|d_2,w_2) ... P(z_n|d_n,w_n) \\ &= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P (z_k|d_i, w_j)\log P(w_j|z_k) P(z_k|d_i) \\ \end{split} \end{equation*}

(Even so, the above derivation glosses over some details, given how tricky this step is.)

Now that we have prepared the Q-function 7 and established the E step, which is to calculate P(z_k|d_i, w_j) (using Bayes' rule), we can go ahead and optimize the Q-function. Before doing so, we should note the constraints \sum_{j=1}^M P(w_j|z_k) = 1 and \sum_{k=1}^K P(z_k|d_i) = 1; we can include Lagrange multipliers \tau_k and \rho_i in the Q-function to form the complete objective function:

(8)   \begin{equation*}  \begin{split} L_{new} =& \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P (z_k|d_i, w_j)\log P(w_j|z_k) P(z_k|d_i) \\ &+ \sum_{k=1}^K \tau_k (1 - \sum_{j=1}^M P(w_j|z_k ) ) + \sum_{i=1}^N \rho_i (1 - \sum_{k=1}^K P(z_k|d_i) ) \\ \end{split} \end{equation*}

Setting the derivatives of equation 8 with respect to the parameters P(w_j|z_k) and P(z_k|d_i) to zero yields:

(9)   \begin{equation*}  \begin{split} &\sum_{i=1}^N n(d_i, w_j) P(z_k|d_i,w_j) - \tau_k P(w_j|z_k) = 0, \ 1 \le j \le M, \ 1 \le k \le K, \\ &\sum_{j=1}^M n(d_i, w_j) P(z_k|d_i,w_j) - \rho_i P(z_k|d_i) = 0, \ 1 \le i \le N, \ 1 \le k \le K. \\ \end{split} \end{equation*}

By substituting equation 9 into the normalization constraints \sum_{j=1}^M P(w_j|z_k) = 1 and \sum_{k=1}^K P(z_k|d_i) = 1, we obtain closed-form expressions for the maximizing parameter estimates:

(10)   \begin{equation*} \begin{split} &P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i,w_j) P(z_k|d_i,w_j)}{\sum_{m=1}^M\sum_{i=1}^N n(d_i,w_m) P(z_k|d_i,w_m)} \\ &P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i,w_j)P(z_k|d_i,w_j)}{n(d_i)} \\ \end{split} \end{equation*}
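To spell out that substitution: summing the first line of equation 9 over j and the second line over k, then applying the two normalization constraints, pins down the multipliers as

\begin{equation*} \begin{split} &\tau_k = \sum_{j=1}^M \sum_{i=1}^N n(d_i, w_j) P(z_k|d_i,w_j), \\ &\rho_i = \sum_{k=1}^K \sum_{j=1}^M n(d_i, w_j) P(z_k|d_i,w_j) = \sum_{j=1}^M n(d_i, w_j) = n(d_i), \\ \end{split} \end{equation*}

and dividing each line of equation 9 by its multiplier gives exactly equation 10; in particular, \rho_i collapsing to n(d_i) is what produces the simple denominator of the second update.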

To summarize, the EM algorithm for solving pLSA is:

Initialization:

Initialize the parameters P(z|d) and P(w|z). (Implementation-wise, for P(z|d) a random initialization works better than a uniform one, which tends to get stuck; for P(w|z), uniform works better than random.)

E step:

P (z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l) P(z_l|d_i)}

M step:

P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i,w_j) P(z_k|d_i,w_j)}{\sum_{m=1}^M\sum_{i=1}^N n(d_i,w_m) P(z_k|d_i,w_m)}

P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i,w_j)P(z_k|d_i,w_j)}{n(d_i)}
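Putting the pieces together, below is a compact NumPy sketch of the whole loop that directly implements the E and M steps above. The function name plsa_em, the fixed iteration budget, and the small epsilon guarding against division by zero are my own choices; the initialization follows the note above (random P(z|d), uniform P(w|z)):

import numpy as np

def plsa_em(n_dw, K, n_iter=100, seed=0):
    # n_dw: (N, M) document-word count matrix n(d_i, w_j); K: number of topics.
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    # Initialization: random P(z|d), uniform P(w|z).
    p_z_d = rng.random((N, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)             # each row sums to 1
    p_w_z = np.full((K, M), 1.0 / M)
    for _ in range(n_iter):
        # E step: P(z_k|d_i,w_j) proportional to P(w_j|z_k) P(z_k|d_i).
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]    # shape (N, M, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M step: closed-form re-estimation (equation 10).
        weighted = n_dw[:, :, None] * post                # n(d_i,w_j) P(z_k|d_i,w_j)
        p_w_z = weighted.sum(axis=0).T                    # shape (K, M)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)                      # shape (N, K)
        p_z_d /= n_dw.sum(axis=1, keepdims=True) + 1e-12  # divide by n(d_i)
    return p_w_z, p_z_d

With the count matrix n_dw from the earlier snippet, plsa_em(n_dw, K=2) returns the estimated P(w|z) table of shape (K, M) and the P(z|d) table of shape (N, K).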

Appendix

Two more general ways of understanding EM:

Lower bound optimization: we are optimizing an objective function that includes latent variables, \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta). Since doing this directly is difficult, we apply Jensen's inequality in a specific way (see Eq. 2) to get a lower bound on the original objective. The trick of EM is that we improve the objective function by improving this tight lower bound: in the E step, we estimate the distribution of the latent variables such that (after the E step) the lower bound equals the objective (to do so, the equality condition of Jensen's inequality is used); in the M step, we optimize the lower bound w.r.t. the parameters (dropping constants, the M step is optimizing the Q-function). Since after the E step the objective is tightly lower bounded, the improvement of the objective in the M step is guaranteed to be at least the improvement of the lower bound. Repeating the E and M steps, you can see that 1. the objective will not decrease (as the lower bound will not decrease), and 2. the objective will converge, since it does not decrease and there must be an upper limit.

A similar, and even more general, treatment from this lower-bound-optimization point of view is given in Bishop's book "Pattern Recognition and Machine Learning", chapter 9.4. I very much recommend that readers take a look at it.

The end.

[1] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the fifteenth conference on uncertainty in artificial intelligence, 1999, pp. 289-296.
[2] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine learning, vol. 42, iss. 1-2, pp. 177-196, 2001.