Notes on Latent Dirichlet Allocation: The Generative Model and Gibbs Sampling Learning

Latent Dirichlet Allocation, originally proposed in [1], is a generative graphical model for mining latent topics from text or from any data with a similar underlying statistical structure.

The Generative Model

The generative process for the words in a document is as follows (a small simulation sketch is given after the list):


For each of the K topics:

Choose a word distribution \beta_k \sim Dir(\eta).

For each document d in the corpus:

Choose topic distribution \theta_d \sim Dir(\alpha).

For each of the N_d word occurrences in the document:

Choose a topic z_{dn} \sim Mult(\theta_d).

Choose a word w_{dn} \sim Mult(\beta_{z_{dn}}).
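To make this concrete, here is a minimal simulation sketch in Python/NumPy (the notation is explained in detail right after this sketch). The values of K, V, M, the hyperparameters, and the document lengths are toy choices of mine for illustration only, not values from [1].

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 10, 5           # toy numbers of topics, vocabulary words, and documents
alpha = np.full(K, 0.1)      # K-dimensional hyperparameter for topic distributions
eta = np.full(V, 0.01)       # V-dimensional hyperparameter for word distributions

# For each of the K topics, choose a word distribution beta_k ~ Dir(eta).
beta = rng.dirichlet(eta, size=K)             # shape (K, V)

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)            # topic distribution of document d
    N_d = rng.poisson(20) + 1                 # document length (any choice works here)
    z_d = rng.choice(K, size=N_d, p=theta_d)  # topic assignment z_dn for each word slot
    # Choose each word w_dn from the word distribution of its assigned topic.
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])
    corpus.append((z_d, w_d))
```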


Here is a brief explanation of the notation:

\alpha is the K-dimensional (K is the number of topics) hyperparameter for the topic distributions.

\eta is the |V|-dimensional (|V| is the vocabulary size) hyperparameter for the word distribution of a given topic.

\beta_{kw} is the probability of choosing word w given that it is drawn from topic k.

\theta_d is the topic distribution for document d.

z_{dn} is the corresponding topic for the n-th word in document d.

Later on, I will also use the following notation (a small counting sketch follows the list):

W = \{W_d\}, W_d = \{W_{dn}\}, are the observed words.

Z = \{Z_d\}, Z_d = \{z_{dn}\}, are the topic assignments of the word occurrences.

n_{k} is a |V|-dimensional vector, in which n_{kw} is the number of occurrences of word w in the corpus assigned to topic k.

n_{k}^{-i} is n_{k} computed without counting the word occurrence at position i.

m_{d} is a K-dimensional vector, in which m_{dk} is the number of words in document d assigned to topic k.

m_{d}^{-i} is m_{d} computed without counting the word occurrence at position i.
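The count vectors n_k and m_d can be read off directly from W and Z. The short sketch below builds them as NumPy arrays, continuing the toy corpus from the simulation sketch above (the names n_kw and m_dk are mine, not from the papers).

```python
# Topic-word counts n_{kw} and document-topic counts m_{dk}.
n_kw = np.zeros((K, V), dtype=np.int64)
m_dk = np.zeros((M, K), dtype=np.int64)
for d, (z_d, w_d) in enumerate(corpus):
    for z, w in zip(z_d, w_d):
        n_kw[z, w] += 1   # word w in the corpus is assigned to topic z
        m_dk[d, z] += 1   # a word in document d is assigned to topic z
```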

The graphical representation of the above generative process is as follows (each node represents a random variable, shaded gray when observed; links denote dependencies; rectangular plates denote repetition, i.e., the big plate indicates there are M such documents, and the small plate indicates there are N_d word occurrences in each document):

[Figure: plate diagram of the LDA generative process]

To keep this note short, I am not going to explain the generative process in detail here.

The Likelihood

The quantities of interest in this model are the random variables \beta, \theta, and perhaps z. Before turning to inference of these latent variables, let us first follow the usual routine and write down the likelihood of the data, i.e.:

(1)   \begin{equation*} \begin{split} &P(W|\alpha, \eta) \\ = &\int P(\beta|\eta) [\prod_{d}^{M} \int (P(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum\limits_{z_{dn}} P(z_{dn}|\theta_d) P(W_{dn}|z_{dn}, \beta)) d\theta_d] d\beta \end{split} \end{equation*}

Where

    \[ P(\beta|\eta) = \prod_k^K \frac{\Gamma(\sum_{v=1}^V \eta_v)}{\prod_{v=1}^V \Gamma(\eta_v)} \beta_{k1}^{\eta_1-1} \cdots \beta_{kV}^{\eta_V-1} \]

    \[ P(\theta_d|\alpha) = \frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \theta_{d1}^{\alpha_1-1} \cdots \theta_{dK}^{\alpha_K-1} \]

    \[ P(z_{dn}|\theta_d) = (\theta_d)_{z_{dn}} \]

    \[ P(W_{dn}|z_{dn}, \beta) = \beta_{z_{dn}W_{dn}} \]

 

Gibbs Sampling Learning

Gibbs sampling learning for LDA was first proposed in [2]. The learning or inference problem targeted by [2] is to estimate \theta and \beta. I have not found a fully satisfying justification for using Gibbs sampling for LDA yet. One justification is that, given a valid sample Z, we can estimate P(\theta|Z, \alpha) \propto P(Z|\theta) P(\theta|\alpha), and either maximizing it or taking its expectation/mean (since this expression is a Dirichlet p.d.f., the expectation/mean is easy to obtain, and in the literature this is the common practice rather than using the mode obtained by maximum likelihood) gives us an estimate of \theta; a similar procedure holds for \beta as well. However, this justification is not satisfying, in the sense that it considers only a single sample of Z|W rather than its distribution, and it says nothing about why one sample of Z|W should be good enough. By considering the distribution of Z|W (which is intractable and thus requires sampling methods), one can derive that

(2)   \begin{equation*} \begin{split} P(\theta|W, \alpha) &= \int P(\theta|Z, \alpha) P(Z|W) dZ \\ &\simeq \frac{1}{L} \sum_{l}^L P(\theta|Z^{(l)}, \alpha) \end{split} \end{equation*}

Where each sample Z^{(l)} is drawn from the distribution P(Z|W). Here the expectation/mean is easier to obtain than the mode from maximum likelihood, because we have a summation over p.d.f.s and each p.d.f. is a Dirichlet. The expectation of a Dirichlet variable given one sample Z^{(l)} can be found in textbooks or on the Wiki page, and the expectation of the same variable over a set of samples is just the average of the per-sample expectations. Technically, we need multiple de-correlated samples of Z|W according to E.q. 2; however, one sample may already be good enough depending on the data (for example, for \beta, even a single sample of Z|W contains many assignments z for the same word w).

I think E.q. 2 gives us enough motivation and justification for doing Gibbs sampling on LDA. Basically, we need to be able to generate samples from P(Z|W), which is an intractable distribution; one popular way to do this is Gibbs sampling, which samples one variable at a time while keeping the others fixed.

To prepare for the derivation of Gibbs sampling for LDA, let me first highlight some important functions (for the mathematics behind MCMC and Gibbs sampling, readers can refer to the many good tutorials on the Internet):

Dirichlet Distribution:

    \[ Dir(\theta;\alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{|\alpha|} \theta_i^{\alpha_i - 1} \]

Multinomial Beta function:

    \[ B(\alpha) = \frac{\prod_{i}^{|\alpha|}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{|\alpha|}\alpha_i)} \]

Gamma function:

    \[ \Gamma(n) = (n-1)! \]

Multinomial distribution:

    \[ Mult(x;\theta) = \frac{n!}{\prod_{i=1}^{|x|} x_i!} \prod_{i=1}^{|x|} \theta_i^{x_i}, \quad n = \sum_{i=1}^{|x|} x_i \]

Note that in LDA, as in many other Bayesian statistical models, conjugate distributions are used for the likelihood and the prior, e.g., multinomial and Dirichlet. The point of conjugacy is that the posterior belongs to the same family as the prior, so we can easily find a closed form for the normalizing integral of the posterior. In the case of multinomial and Dirichlet, we have

    \[ Dir(\theta;x+\alpha) \propto Mult(x;\theta)\, Dir(\theta;\alpha) \]

Since

    \[ 1 = \int Dir(\theta;x+\alpha) d\theta = \int \frac{1}{B(x+\alpha)} \prod_{i=1}^{|\alpha|} \theta_i^{x_i+\alpha_i-1} d\theta \]

i.e.:

    \[ \int \prod_{i=1}^{|\alpha|} \theta_i^{x_i+\alpha_i-1} d\theta = B(x+\alpha) \]

This formula is extremely useful for deriving the learning algorithm below. (One more note: although it is called a multinomial distribution in the generative process above, each draw is actually a single-trial multinomial, which degenerates into a discrete (categorical) distribution; this does not affect the learning much, since the formula here also holds for a sequence of discrete draws.)
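As a quick numerical sanity check of this conjugacy (not part of the original derivation, and using standalone toy values of mine), the sketch below evaluates the unnormalized product Mult(x;\theta)\,Dir(\theta;\alpha) at two random points on the simplex and verifies that its ratio to Dir(\theta;x+\alpha) is the same constant at both points, i.e., the two differ only by normalization.

```python
import numpy as np
from scipy.stats import dirichlet, multinomial

rng_check = np.random.default_rng(1)
prior = np.array([0.5, 1.0, 2.0])   # toy Dirichlet hyperparameter alpha
counts = np.array([3, 1, 4])        # toy observed counts x
n = counts.sum()

ratios = []
for _ in range(2):
    theta = rng_check.dirichlet(np.ones(3))   # a random point on the simplex
    unnormalized = multinomial.pmf(counts, n, theta) * dirichlet.pdf(theta, prior)
    posterior = dirichlet.pdf(theta, counts + prior)
    ratios.append(unnormalized / posterior)

# The two ratios agree, so Mult * Dir is proportional to Dir(x + alpha).
print(ratios)
```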

Then let’s derive the joint distribution of W and Z which will be useful just one step later (notations are explained in the beginning of the post):

(3)   \begin{equation*} \begin{split} &P(W,Z|\alpha,\eta) \\ =& P(W|Z, \eta) P(Z|\alpha) \\ =& \int P(W|Z, \beta) P(\beta|\eta) d\beta \prod_d \int P(Z_d|\theta_d) P(\theta_d|\alpha) d\theta_d \\ =& \int \prod_k \prod_w \beta_{kw}^{n_{kw}} \prod_k \frac{1}{B(\eta)} \prod_w \beta_{kw}^{\eta_w - 1} d\beta \\ & \ \ \ \ \prod_d \int \prod_k \theta_{dk}^{m_{dk}} \frac{1}{B(\alpha)} \prod_k \theta_{dk}^{\alpha_k - 1} d\theta_d \\ =& \frac{1}{(B(\eta))^K} \int \prod_k \prod_w \beta_{kw}^{n_{kw}+\eta_w-1} d\beta \prod_d \frac{1}{B(\alpha)} \int \prod_k \theta_{dk}^{m_{dk}+\alpha_k - 1} d\theta_d \\ =& \prod_k \frac{B(n_k+\eta)}{B(\eta)} \prod_d \frac{B(m_d+\alpha)}{B(\alpha)} \end{split} \end{equation*}
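The last line of E.q. 3 is easy to evaluate in log space. Below is a small helper of my own (not from [2]) that computes \log P(W,Z|\alpha,\eta) from the count matrices built earlier, with scipy.special.gammaln standing in for \log\Gamma inside the multinomial Beta functions.

```python
from scipy.special import gammaln

def log_multi_beta(vec):
    """Log of the multinomial Beta function B(vec)."""
    return gammaln(vec).sum() - gammaln(vec.sum())

def log_joint(n_kw, m_dk, alpha, eta):
    """log P(W, Z | alpha, eta), following the last line of E.q. 3."""
    topic_term = sum(log_multi_beta(n_kw[k] + eta) - log_multi_beta(eta)
                     for k in range(n_kw.shape[0]))
    doc_term = sum(log_multi_beta(m_dk[d] + alpha) - log_multi_beta(alpha)
                   for d in range(m_dk.shape[0]))
    return topic_term + doc_term
```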

Now we can derive the key formula for Gibbs sampling, which is P(z_i|Z_{-i}, W, \alpha, \eta). Here Z_{-i} denotes the topic assignments of all word occurrences other than the one at position i; without loss of generality, we assume that position i is in document d and is associated with the observed word w. Since

    \[ P(z_i|Z_{-i}, W, \alpha, \eta) = \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W|\alpha, \eta)} \propto \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \]

We have:

(4)   \begin{equation*} \begin{split} P(z_i=k|Z_{-i}, W, \alpha, \eta) \propto& \frac{P(z_i = k, Z_{-i}, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \\ =& \frac{B(n_k + \eta)}{B(n_k^{-i} + \eta)} \frac{B(m_d + \alpha)}{B(m_d^{-i} + \alpha)} \\ =& \frac{n_{kw}+\eta_w-1}{[\sum_{w'} (n_{kw'} + \eta_{w'})] - 1} \frac{m_{dk}+\alpha_k-1}{[\sum_{k'} (m_{dk'} + \alpha_{k'})] - 1} \\ \propto& \frac{n_{kw}^{-i}+\eta_w}{\sum_{w'} (n_{kw'}^{-i} + \eta_{w'})} (m_{dk}^{-i} + \alpha_k ) \end{split} \end{equation*}

The above E.q. 4 gives a very clear picture for implementation: one just needs to initialize Z, set up counters for the count terms in the final line of E.q. 4, and, each time a word is sampled, decrease the corresponding counters, sample a topic according to the normalized E.q. 4, and then increase the counters according to the sampling result. After “burn-in”, one can read out (de-correlated) samples and estimate the parameters. By E.q. 2, Bayes' rule, and the expectation of the Dirichlet distribution, the expectations of \theta and \beta given a sample Z^{(l)} are as follows (a sampler sketch is given after the equations):

(5)   \begin{equation*} \theta_{dk} = \frac{m_{dk}+\alpha_k}{\sum_{k'} (m_{dk'} + \alpha_{k'})} \end{equation*}

(6)   \begin{equation*} \beta_{kw} = \frac{n_{kw}+\eta_w}{\sum_{w'} (n_{kw'} + \eta_{w'})} \end{equation*}

As mentioned above, one might also want to average parameters over multiple de-correlated samples according to E.q. 2.
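Putting E.q. 4, 5, and 6 together, here is a minimal collapsed Gibbs sampler sketch in Python/NumPy, continuing the toy corpus and hyperparameters from the earlier sketches (all variable and function names here are my own); a production implementation such as GibbsLDA++ is of course far more careful about efficiency.

```python
def gibbs_sweep(corpus, z_assign, n_kw, m_dk, n_k, alpha, eta, rng):
    """One full Gibbs sweep: re-sample the topic of every word occurrence via E.q. 4."""
    for d, (_, w_d) in enumerate(corpus):
        for i, w in enumerate(w_d):
            k_old = z_assign[d][i]
            # Remove the current assignment from the counters (the "-i" counts).
            n_kw[k_old, w] -= 1
            m_dk[d, k_old] -= 1
            n_k[k_old] -= 1
            # Unnormalized conditional P(z_i = k | Z_{-i}, W), last line of E.q. 4.
            p = (n_kw[:, w] + eta[w]) / (n_k + eta.sum()) * (m_dk[d] + alpha)
            k_new = rng.choice(len(p), p=p / p.sum())
            # Put the new assignment back into the counters.
            z_assign[d][i] = k_new
            n_kw[k_new, w] += 1
            m_dk[d, k_new] += 1
            n_k[k_new] += 1

# Initialization: random topic assignments and the corresponding counters.
z_assign = [rng.integers(K, size=len(w_d)) for _, w_d in corpus]
n_kw = np.zeros((K, V), dtype=np.int64)
m_dk = np.zeros((M, K), dtype=np.int64)
for d, (_, w_d) in enumerate(corpus):
    for i, w in enumerate(w_d):
        n_kw[z_assign[d][i], w] += 1
        m_dk[d, z_assign[d][i]] += 1
n_k = n_kw.sum(axis=1)

for sweep in range(200):   # "burn-in" plus sampling sweeps
    gibbs_sweep(corpus, z_assign, n_kw, m_dk, n_k, alpha, eta, rng)

# Point estimates from one sample Z^(l), following E.q. 5 and E.q. 6.
theta_hat = (m_dk + alpha) / (m_dk + alpha).sum(axis=1, keepdims=True)
beta_hat = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
```

To average over several de-correlated samples as E.q. 2 suggests, one would simply record theta_hat and beta_hat every few sweeps after burn-in and take their means.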

Likelihood and perplexity 

Until now, I didn’t mention the formula for calculating the likelihood of data P(W), which can be important some time for testing of convergence (or using P(W|Z)). In [2], the authors mentioned an approximation technique for computing P(W), it is the harmonic mean of a set of values of P(W|Z^{(n)}) from N samples, where Z^{(n)} is sampled from the posterior P(Z|W). So that is:

    \[ P(W) \approx \frac{N}{\sum_n^N \frac{1}{P(W|Z^{(n)})}} \]

Where

    \[ Z^{(n)} \sim P(Z|W) \]

Using the previous derivation, we can easily compute that:

(7)   \begin{equation*} \begin{split} &P(W|Z^{(n)}) \\ =& \prod_k \frac{B(n_k + \eta)}{B(\eta)} \\ =& \bigg(\frac{\Gamma(|V|\eta_0)}{\Gamma(\eta_0)^{|V|}}\bigg)^K \prod_k \frac{\prod_w \Gamma(n_{kw}+\eta_0)}{\Gamma(\sum_w n_{kw} +  |V|\eta_0)} \end{split} \end{equation*}

Where we have assumed a symmetric \eta (every component equal to \eta_0). However, E.q. 7 can easily lead to numerical underflow/overflow if computed directly (due to the large factorial terms). So we take its log and see what we get:

(8)   \begin{equation*} \begin{split} &\log P(W|Z^{(n)}) \\ =& C + \sum_k \bigg[ \sum_w \log  \Gamma(n_{kw}+\eta_0) - \log \Gamma\Big(\sum_{w'} n_{kw'} + |V|\eta_0\Big) \bigg] \end{split} \end{equation*}

Where C is a constant:

    \[ C = K\log \Gamma(|V|\eta_0) - K|V| \log \Gamma(\eta_0) \]
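A direct transcription of E.q. 8 into code can look like the small sketch below (again a helper of my own, assuming a symmetric \eta with scalar value eta0); scipy.special.gammaln plays the role of \log\Gamma.

```python
from scipy.special import gammaln

def log_p_w_given_z(n_kw, eta0):
    """log P(W | Z) for a symmetric eta, following E.q. 7 and E.q. 8."""
    K, V = n_kw.shape
    C = K * gammaln(V * eta0) - K * V * gammaln(eta0)
    per_topic = gammaln(n_kw + eta0).sum(axis=1) - gammaln(n_kw.sum(axis=1) + V * eta0)
    return C + per_topic.sum()
```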

The log-Gamma function is usually built into the programming language (or its standard libraries), so you can use it directly. But now we only have \log P(W|Z^{(n)}), which can be quite “negatively large” (maybe around -10^{10}, depending on the corpus), and this again creates a computational difficulty when we compute the harmonic mean of P(W|Z^{(n)}). So we again work with \log P(W), and the trick below enables us to compute it accurately:

(9)   \begin{equation*} \begin{split} \log P(W) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Here t_n = \log P(W|Z^{(n)}), and t_0 is introduced to make the log-sum-exp computation numerically safe; usually we can set t_0 = \min \{t_n\}. Now, even if the corpus contains millions of documents, the above likelihood calculation still works, since everything stays within the range of floating-point numbers.
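In code, E.q. 9 is only a few lines. The sketch below assumes ts is the list of per-sample values t_n = \log P(W|Z^{(n)}) computed, e.g., with the helper above.

```python
def log_harmonic_mean(ts):
    """log P(W) via the harmonic-mean estimator, computed stably as in E.q. 9."""
    ts = np.asarray(ts, dtype=np.float64)   # ts[n] = log P(W | Z^(n))
    t0 = ts.min()                           # shift so that every exponent t0 - t_n <= 0
    return np.log(len(ts)) - np.log(np.exp(t0 - ts).sum()) + t0
```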

The likelihood of held-out data is also a useful quantity at times: it reflects the generalization (predictive) power of the model, and it is usually measured by perplexity, which is essentially the inverse of the geometric mean of the per-word likelihood. It is defined by:

    \[ perplexity(D_{test}) = \exp \{ - \frac{\sum_d \log P(W_d)}{\sum_d N_d} \} \]

This quantity can be evaluated as follows: first hold out some documents, then sample the topic assignments for the words in those documents given a point estimate \hat\beta, and finally calculate the likelihood (and hence the perplexity) of the held-out documents as described above.
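Given the per-document log-likelihoods of the held-out documents (obtained, for instance, with the harmonic-mean estimator above applied per document), the perplexity itself is a one-liner; logp_docs and doc_lengths below are assumed inputs of mine, not quantities defined in [1] or [2].

```python
def perplexity(logp_docs, doc_lengths):
    """Perplexity of held-out documents: exp(- sum_d log P(W_d) / sum_d N_d)."""
    return np.exp(-np.sum(logp_docs) / np.sum(doc_lengths))
```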

Some resources on Gibbs Sampling for LDA 

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation using Gibbs sampling.

WinBUGS: Gibbs sampling tools for graphical models.

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228-5235, 2004.