Some Existing Work on Linked Documents Analysis


Different from Topic Modeling (which analyzes the topics of documents), Linked Documents Analysis (or Topic Modeling with Links) changes the problem setting by considering the links among documents. The goal is still similar, which is to analyze the topics of documents, except that sometimes we may also want to predict the presence of some queried link. The basic idea behind Linked Documents Analysis is that, on one hand, link existence can provide useful information about the topic distributions, and on the other hand, topic distributions can provide useful information about link existence; in other words, they are correlated. In this post, I am going to review some existing work on topic modeling with links. Basically, these models can be categorized into generative approaches and discriminative approaches: the former try to explain why links exist or not via the topic/membership distributions of nodes, while the latter try to utilize existing links in a discriminative way to better model the topic/membership distributions.

Generative Approaches

  • Citation Influence Model [?]: the basic idea is that if a document cites others, it generates the topic for each word occurrence either by (1) drawing from the topic distribution of one of its cited documents, or (2) drawing from its own topic distribution. Other similar models that utilize citations as a pathway for drawing topic assignments: [?].
  • PHITS+PLSA [?]: it treats each document as a "word" when performing link generation, so it is like running PLSA twice, once on document-word co-occurrences and once on document-document co-occurrences. This method was later extended into "Link-LDA", which puts a prior on the topic distribution, extending PLSA to LDA. Both models suffer from not (explicitly) utilizing word topical information for link analysis (link analysis is based purely on link co-occurrence information), and they also fail to model the topical dependence between the cited and citing documents explicitly.
  • Link-PLSA-LDA [?]: overcomes the difficulties of PHITS+PLSA and "Link-LDA" by connecting the topic-word and topic-document distributions through a symmetric parametrization of the merged cited documents.
  • LDA+MMSB [?]: this may be the most natural generative process for both words and links; it doesn't treat documents as a kind of dictionary as PHITS+PLSA and Link-PLSA-LDA do, and thus it generalizes easily to new documents without previously seen links. However, due to the generative process, this model may not be scale-friendly (though some subsampling techniques might help alleviate the computational complexity).
  • LDA+ZAlign (RTM [?]): defines the link probability according to the alignment of the latent variables Z; the alignment is quantified by a dot product, and the transformation from alignment to probability is through either a sigmoid function or an exponential function. The authors pointed out a common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB): due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets, allowing links to be explained by one subset of topics and words to be explained by the other; to improve on this, RTM enforces that words and links are drawn from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is likewise not scale-friendly.
  • Others [?].

Discriminative Approaches

  • PLSA+\thetaReg (NetPLSA [?]): adds a regularization loss on \theta, which acts like smoothing, to the original PLSA (log-)likelihood. This model can easily deal with weighted links.
  • PLSA+\thetaCond (iTopicModel [?]): defines a conditional distribution of \theta_i given its neighbors, which leads to a likelihood of the \theta configuration given the network of documents; this conditional \theta log-likelihood is then added as a smoothing term to the original PLSA log-likelihood. Like PLSA+\thetaReg, this model can easily deal with weighted links; unlike it, this model fits directed networks more naturally, and it has a closed-form solution for \theta, so it is much friendlier for optimization than PLSA+\thetaReg.
  • Others [?].

Gibbs Sampling for Mixed Membership Stochastic Blockmodels


Mixed Membership Stochastic Blockmodels (MMSB) [?] is a generative model for links in a network. Simply put, given an adjacency matrix E, MMSB assumes E is governed by the memberships/topics of each node. Since the original paper [?] only introduced Variational Inference, this post derives a Gibbs Sampling method for the purpose of parameter estimation in MMSB (actually the MMSB presented in this post is slightly different from [?], as I put a prior on the interaction matrix B).

The generative process of MMSB is given below:

For each node i:

Draw a membership distribution vector \theta_i \sim Dir(\alpha)

For each entry (m,n) in B:

Draw B_{mn} \sim Beta(\xi^1_{mn}, \xi^2_{mn})

For each pair of nodes (i,j):

Draw a membership Z_{i: i\rightarrow j} \sim Mult(\theta_i)

Draw a membership Z_{j: i\rightarrow j} \sim Mult(\theta_j)

Draw E_{ij} \sim Ber(B_{Z_{i: i\rightarrow j},Z_{j: i\rightarrow j}})
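To make the process concrete, here is a minimal sketch (in Python/NumPy) that draws one directed network from this generative process; it assumes a symmetric scalar \alpha and shared scalars \xi^1, \xi^2, and the function name is mine:

```python
import numpy as np

def sample_mmsb(n_nodes, n_topics, alpha=0.1, xi1=1.0, xi2=1.0, seed=0):
    """Sample one directed network from the MMSB generative process above."""
    rng = np.random.default_rng(seed)
    # Per-node membership distributions theta_i ~ Dir(alpha)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_nodes)
    # Interaction matrix entries B_mn ~ Beta(xi1, xi2)
    B = rng.beta(xi1, xi2, size=(n_topics, n_topics))
    E = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            zi = rng.choice(n_topics, p=theta[i])  # Z_{i: i->j}
            zj = rng.choice(n_topics, p=theta[j])  # Z_{j: i->j}
            E[i, j] = rng.binomial(1, B[zi, zj])   # E_ij ~ Ber(B_{zi, zj})
    return theta, B, E
```

Sampling a network like this is also a handy way to sanity-check an inference implementation on data whose ground truth is known.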

Here is a simple explanation of the notations:

\alpha is K (number of topics) dimensional hyperparameter for membership/topic distribution.

\xi^1_{mn}, \xi^2_{mn} are two scalar hyperparameters for B_{mn}.

\theta_i is the membership/topic distribution for node i.

B is the interaction probability matrix of memberships.

Z_{i: i\rightarrow j} is the membership of node i when it interacts with node j (the link can be treated as directed or undirected). Note that Z_{i\rightarrow j} is an abbreviation for \{Z_{i: i\rightarrow j}, Z_{j: i\rightarrow j}\}.

E is the adjacency matrix, each entry is either one or zero.

Later on, I am also going to use the following notations:

 m_{ip} is the count of node i being assigned to membership p when it interacts with the rest of the nodes in the network. For a directed network, it can be divided into two parts, m_{ip\rightarrow} and m_{ip\leftarrow} (m_{ip} = m_{ip\rightarrow} + m_{ip\leftarrow}): the former counts the assignments of node i when it links to other nodes, the latter counts the assignments of node i when other nodes link to it.

n_{pq+} is the count of linked node pairs with membership assignments p and q (summing over (p,q) and (q,p) if direction is not considered).

n_{pq-} is the count of unlinked node pairs with membership assignments p and q (summing over (p,q) and (q,p) if direction is not considered).

To perform Gibbs Sampling, we need the joint probability of the data and the latent variables Z. Well, here it is:

(1)   \begin{equation*} \begin{split} &P(E, Z|\alpha,\xi^1,\xi^2) \\ =&\prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, \xi^1,\xi^2) P (Z_{i\rightarrow j}|\alpha) \\ =& \int \prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, B) P(B|\xi^1,\xi^2) dB \\ & \ \ \ \ \prod_i \int \prod_j P(Z_{i:i\rightarrow j}|\theta)P(Z_{i:j\rightarrow i}|\theta) P(\theta|\alpha)d\theta \\ =& \int \prod_{pq} B_{pq}^{n_{pq+}} (1-B_{pq})^{n_{pq-}} \prod_{pq} \frac{B_{pq}^{\xi^1-1} (1-B_{pq})^{\xi^2-1}}{Beta(\xi^1,\xi^2)} dB \\ & \ \ \ \ \prod_i \int \prod_p \theta_{ip}^{m_{ip}} \frac{1}{Beta(\alpha)} \prod_p \theta_{ip}^{\alpha_p-1} d\theta_i \\ =& \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)} \prod_i \frac{Beta(m_i+\alpha)}{Beta(\alpha)} \end{split} \end{equation*}

Now that we have obtained the joint probability of the data and the latent variables, we can derive the conditional probability of the latent variables as follows:

(2)   \begin{equation*} \begin{split} &P(Z_{i\rightarrow j}|Z_{-(i\rightarrow j)}, E, \alpha,\xi^1,\xi^2) \\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{ij})P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ \propto& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \end{split} \end{equation*}


(3)   \begin{equation*} \begin{split} &P(Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|Z_{-(i\rightarrow j)}, E, \alpha,\xi^1,\xi^2)\\ \propto& \frac{P(E, Z_{-(i\rightarrow j)}, Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \\ =&\frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(n^{-(i\rightarrow j)}_{pq+}+\xi^1, n^{-(i\rightarrow j)}_{pq-}+\xi^2)} \\ & \ \frac{Beta(m_i+\alpha)}{Beta(m^{-(i \rightarrow j)}_i+\alpha)}\frac{Beta(m_j+\alpha)}{Beta(m^{-(i \rightarrow j)}_j+\alpha)} \\ \propto& \frac{(n^{-(i\rightarrow j)}_{pq+}+\xi^1)^{E_{ij}} (n^{-(i\rightarrow j)}_{pq-}+\xi^2)^{1-E_{ij}}}{n^{-(i\rightarrow j)}_{pq+}+n^{-(i\rightarrow j)}_{pq-}+\xi^1+\xi^2}  (m^{-(i\rightarrow j)}_{ip}+\alpha_p)(m^{-(i\rightarrow j)}_{jq}+\alpha_q) \end{split} \end{equation*}
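As a sketch of how E.q. 3 turns into a sampler, the following hypothetical routine (names are mine) performs one collapsed update for a single pair (i, j), assuming a symmetric scalar \alpha, scalar \xi^1, \xi^2, and NumPy count arrays m, n_plus, n_minus that currently include the pair being resampled:

```python
import numpy as np

def resample_pair(E_ij, i, j, z_src, z_dst, m, n_plus, n_minus,
                  alpha, xi1, xi2, rng):
    """One collapsed Gibbs update for the pair (i, j) following E.q. 3."""
    K = m.shape[1]
    p_old, q_old = z_src[i, j], z_dst[i, j]
    # Remove the pair's current assignments from the counts.
    m[i, p_old] -= 1; m[j, q_old] -= 1
    if E_ij: n_plus[p_old, q_old] -= 1
    else:    n_minus[p_old, q_old] -= 1
    # Unnormalized conditional over all (p, q) pairs at once.
    link = (n_plus + xi1) if E_ij else (n_minus + xi2)
    w = link / (n_plus + n_minus + xi1 + xi2)
    w = w * np.outer(m[i] + alpha, m[j] + alpha)
    flat = rng.choice(K * K, p=(w / w.sum()).ravel())
    p_new, q_new = divmod(flat, K)  # row index is p, column index is q
    # Record the new assignments and add them back to the counts.
    z_src[i, j], z_dst[i, j] = p_new, q_new
    m[i, p_new] += 1; m[j, q_new] += 1
    if E_ij: n_plus[p_new, q_new] += 1
    else:    n_minus[p_new, q_new] += 1
```

A full sweep would call this once per (i, j) pair; note the counts are conserved by each update.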

Given samples of Z generated from Gibbs Sampling, parameters can be estimated in the following ways:

For \theta, we have

(4)   \begin{equation*} \begin{split} &P(\theta_i|Z_i, \alpha) \\ \propto& \prod_j  P(Z_{i:i\rightarrow j}|\theta_i) P(Z_{i:j\rightarrow i}|\theta_i) P(\theta_i|\alpha)\\ \propto& \prod_p \theta^{m_{ip}+\alpha_p-1}_{ip} \end{split} \end{equation*}

E.q. 4 is a Dirichlet p.d.f.; using the property of Dirichlet variables, we get the expectation (mean) of \theta:

    \[ \theta_{ip} \propto m_{ip} + \alpha_p \]

For B, we have

(5)   \begin{equation*} \begin{split} &P(B_{pq}|Z, E, \xi^1, \xi^2) \\ \propto&\prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, B) P(B_{pq}|\xi^1,\xi^2)  \\ \propto& B_{pq}^{n_{pq+}+\xi^1-1} (1-B_{pq})^{n_{pq-}+\xi^2-1} \end{split} \end{equation*}

E.q. 5 is a Beta p.d.f.; using the property of Beta variables, we get the expectation (mean) of B:

    \[ B_{pq} = \frac{n_{pq+}+\xi^1}{n_{pq+}+n_{pq-}+\xi^1+\xi^2} \]
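Given the counts from one sample of Z, both posterior means above are one-liners; here is a sketch (symmetric scalar hyperparameters assumed, function name mine):

```python
import numpy as np

def estimate_params(m, n_plus, n_minus, alpha, xi1, xi2):
    """Posterior-mean estimates of theta and B from one sample of Z.

    m:       (n_nodes, K) membership counts m_{ip}
    n_plus:  (K, K) linked-pair counts n_{pq+}
    n_minus: (K, K) unlinked-pair counts n_{pq-}
    """
    theta = (m + alpha) / (m + alpha).sum(axis=1, keepdims=True)
    B = (n_plus + xi1) / (n_plus + n_minus + xi1 + xi2)
    return theta, B
```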

One may want to draw multiple de-correlated samples of \{Z\} to calculate the above parameters.

Likelihood and perplexity

So far I haven't mentioned the formula for calculating the likelihood of the data, P(E), which can sometimes be important for testing convergence (or one can use P(E|Z)). In [?], the authors mentioned an approximation technique for computing P(E): it is the harmonic mean of a set of values P(E|Z^{(n)}) from N samples, where Z^{(n)} is sampled from the posterior P(Z|E). That is:

    \[ P(E) = \frac{N}{\sum_n^N \frac{1}{P(E|Z^{(n)})}} \]


    \[ Z^{(n)} \sim P(Z|E) \]

Using the previous derivation, we can easily compute that:

(6)   \begin{equation*} \begin{split} &\log P(E|Z^{(n)}) \\ =&\log \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)}\\ =& \sum_{pq} \log \Gamma(n_{pq+}+\xi^1) + \log \Gamma(n_{pq-}+\xi^2) - \log \Gamma(n_{pq+}+n_{pq-}+\xi^1+\xi^2) \\ &+ C \end{split} \end{equation*}

Where C is a constant:

    \[ C = K^2 [\log \Gamma(\xi^1+\xi^2) - \log\Gamma(\xi^1) - \log\Gamma(\xi^2)] \]

Now with \log P(E|Z^{(n)}), we can compute \log P(E) in the following way:

(7)   \begin{equation*} \begin{split} \log P(E) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Where t_n = \log P(E|Z^{(n)}), and t_0 is introduced to make the log-sum-exp computed correctly; usually we can set t_0 = \min_n \{t_n\}.
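The harmonic-mean estimator with this log-sum-exp trick is only a few lines; a sketch (the function name is mine):

```python
import math

def harmonic_mean_loglik(t):
    """log P(E) from per-sample log-likelihoods t_n = log P(E|Z^(n)),
    via the harmonic-mean estimator with the log-sum-exp trick."""
    N = len(t)
    t0 = min(t)  # ensures every exponent t0 - t_n is <= 0, so no overflow
    return math.log(N) - math.log(sum(math.exp(t0 - tn) for tn in t)) + t0
```

With identical samples the estimator simply returns the common log-likelihood, which is an easy sanity check.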

The likelihood of held-out data is sometimes also a useful quantity; it reflects the generalization/predictive power of the model, and it is usually measured by perplexity, which is essentially the inverse of the geometric mean per-link likelihood. It is defined by:

    \[ perplexity(H) = \exp \{ - \frac{\sum_{(i,j) \in H} \log P(E_{ij}|E_{observed})}{|H|} \} \]

This can be evaluated by first holding out a subset of links and non-links, and then computing:

(8)   \begin{equation*} \begin{split} &P(E_{ij}|E_{observed}) \\ \approx &\sum_{Z_{i:i\rightarrow j}} \sum_{Z_{j:i\rightarrow j}} P(E_{ij}|Z_{i:i \rightarrow j}, Z_{j:i \rightarrow j}, \hat B) P(Z_{i: i \rightarrow j}|\hat\theta_i) P(Z_{j:i\rightarrow j}|\hat\theta_j) \end{split} \end{equation*}

Where \hat B and \hat\theta are estimated on the training data.


The Mixed Membership Stochastic Blockmodel is not scalable, as it requires examining O(N^2) edges (this is a lot even for a medium-size network with only thousands of nodes!), in sharp contrast to the sparse networks that arise more often in real datasets. In practice, instead of using all non-existing edges, we can sample a small portion of them; this can effectively reduce the computational complexity, especially for sparse networks, where we expect the required sampling ratio of non-existing edges to be even smaller. However, the relationship between the loss of performance and the sampling ratio of non-existing links remains to be examined.

Also due to the scalability issue, one may consider an MMSB without a prior on each node's membership distribution, which can then be fitted by EM rather than Gibbs Sampling. One can also derive Gibbs Sampling for the combined LDA and MMSB model [?] in ways similar to Gibbs Sampling for LDA (see here) and MMSB individually (basically the key is to derive P(W,E,Z) \propto P(W|Z^W) P(E|Z^E) P(Z), where P(Z) = \int P(\theta) P(Z^E|\theta) P(Z^W|\theta) d\theta).

After the Talk of Andrew McCallum

Today Prof. Andrew McCallum came to Northeastern to give a talk; his topic was probabilistic knowledge base construction, which I am not very familiar with. So instead of learning something from his topic, most of my inspiration from his talk is actually on the research methodology side (which he didn't talk about much, but which I think about a lot).

Here are three levels of innovation in the big "artificial intelligence" (including machine learning, data mining, etc.) research communities (purely my opinion):

  1. Proposal of new real-world problems or demands.
  2. Proposal of new problem settings.
  3. Proposal of new methodologies/tricks for bounding/solving existing problems in existing settings.

Although certain people may have different preferences, in my view these should be treated equally in terms of contribution. In terms of research style, however, research should combine them in a healthy proportion. In an application-driven research style, people can easily spend too much effort proposing new real-world problems or demands, such that some of the proposed problems do not really contribute: the real demand is not as large as claimed, or the problem setting is equivalent to some classic setting that has been around for a while. Theory-driven research, on the other hand, runs the risk of drifting far away from real problems/demands; and since it is based purely on a fixed problem setting, once the setting changes, the theory may be useless for the new setting. In short, application-driven research tends to focus more on innovation 1, while theory-driven research tends to focus more on innovation 3. But in some sense they can be unified or proposed together, which may be called applied-theory-driven research. This is my favorite one. You don't want to propose new real problems all the time while only using classic methods on classic settings to solve them, and not many people want to always focus on classic problem settings and find new solutions. A mixed proportion of these three levels of innovation (aligning theories with real problems and their abstracted settings) works better for certain people, like me.

To further illustrate the idea above, let me list the examples in Andrew's talk. He does a lot of NLP work such as named entity recognition, coreference resolution, etc.; these are classical problems with classical settings, and researchers who work on these topics usually need innovation 3, thus they proposed CRFs and other variants. But sometimes the problem setting can also change. When they faced the problem of extracting entity relationships, a real demand in building knowledge bases, instead of following the traditional setting of the problem, which is to first define a schema of relationships and then learn those relationships for entities, they let the words (mostly verbs) automatically form such relationships; but the learning algorithms then need to group them, otherwise they are not very useful (e.g., president(Obama, U.S.) and leader(Obama, U.S.) are actually the same or similar). By changing the problem setting, the real demand can be better served (eventually we need some abstraction from the real world to form solvable problem settings, and once defined, a setting may not always remain reasonable or optimal). Andrew also mentioned something about probabilistic programming languages; maybe this is not his idea, but I think this demand is actually an innovation 1: you find that building and debugging graphical models using "regular" programming languages (such as C++, Java, etc.) can be difficult, and the demand for designing programming languages to address this issue arises naturally.

A Collection of Machine Learning Classes Available Online

Machine learning and data mining have become more and more popular in recent years, and against this background, many machine learning classes are now available online. People interested in these topics will find a lot of lectures covering similar things, but perhaps at different levels, or with a different focus or point of view. So it may not be a bad idea to make a collection of such classes and sort things out a little bit. (Note that most of the courses listed below have videos online, and the attached links may not be the newest versions of the classes as time goes by; this post will keep updating if I find something more.)

Advanced machine learning topics:

Notes on Latent Dirichlet Allocation: The Generative Model and Gibbs Sampling Learning


Latent Dirichlet Allocation, originally proposed in [?], is a generative graphical model for mining latent topics of texts or any data with similar underlying statistical structures.

The Generative Model

The generative process of words in a document is as follows:

For each of the K topics:

Choose \beta_k \sim Dir(\eta)

For each document d in the corpus:

Choose topic distribution \theta_d \sim Dir(\alpha).

For each of the N_d word occurrences in the document:

Choose a topic z_{dn} \sim Mult(\theta_d).

Choose a word w_{dn} \sim Mult(\beta_{z_{dn}}).
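A minimal sketch of this generative process in Python/NumPy (symmetric scalar hyperparameters and a fixed document length are assumed for brevity; the function name is mine):

```python
import numpy as np

def sample_corpus(n_docs, doc_len, K, V, alpha=0.1, eta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process above."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions beta_k ~ Dir(eta)
    beta = rng.dirichlet(np.full(V, eta), size=K)
    docs = []
    for _ in range(n_docs):
        theta_d = rng.dirichlet(np.full(K, alpha))   # doc-topic distribution
        z = rng.choice(K, size=doc_len, p=theta_d)   # topic per word slot
        w = [rng.choice(V, p=beta[zn]) for zn in z]  # word per slot
        docs.append(w)
    return beta, docs
```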

Here is a simple explanation of the notations:

\alpha is K (number of topics) dimensional hyperparameter for topic distribution.

\eta is |V| (number of words) dimensional hyperparameter for word distribution over a given topic.

\beta_{kw} is the probability of choosing word w given its drawn from topic k.

\theta_d is the topic distribution for document d.

z_{dn} is the corresponding topic for the n-th word in document d.

Later on, I am also going to use the following notations:

W = \{W_d\}, W_d = \{W_{dn}\}, are the observed words.

Z = \{Z_d\}, Z_d = \{z_{dn}\}, are the topic assignments for word occurrence.

n_{k} is a |V| dimensional vector, in which n_{kw} is the count of word w in the corpus assigned to topic k.

n_{k}^{-i} is n_{k} without counting information at word occurrence at position i.

m_{d} is a K dimensional vector, in which m_{dk} is the count of words in document d assigned to topic k.

m_{d}^{-i} is m_{d} without counting information at word occurrence at position i.

The graphical representation of the above generative process is as follows (each node represents a random variable, filled with gray when observed; links denote dependencies; rectangular plates indicate repetition, i.e. the big plate indicates there are M such documents, and the small plate indicates there are N word occurrences in each document):

[Figure: plate diagram of the LDA graphical model]

To keep (writing) this note simple, I am not going to explain the generative process in details here.

The Likelihood

What we are interested in under this model are the random variables \beta, \theta, and maybe z. But before going to inference of these latent random variables, let us first follow the routine and write down the likelihood of the data, i.e.:

(1)   \begin{equation*} \begin{split} &P(W|\alpha, \eta) \\ = &\int P(\beta|\eta) [\prod_{d}^{M} \int (P(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum\limits_{z_{dn}} P(z_{dn}|\theta_d) P(W_{dn}|z_{dn}, \beta)) d\theta_d] d\beta \end{split} \end{equation*}


    \[ P(\beta|\eta) = \prod_k^K \frac{\Gamma(\sum_{v=1}^V \eta_v)}{\prod_{v=1}^V \Gamma(\eta_v)} \beta_{k1}^{\eta_1-1} \cdots \beta_{kV}^{\eta_V-1} \]

    \[ P(\theta_d|\alpha) = \frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \theta_{d1}^{\alpha_1-1} \cdots \theta_{dK}^{\alpha_K-1} \]

    \[ P(z_{dn}|\theta_d) = (\theta_d)_{z_{dn}} \]

    \[ P(W_{dn}|z_{dn}, \beta) = \beta_{z_{dn}W_{dn}} \]


Gibbs Sampling Learning

Gibbs sampling learning for LDA was first proposed in [?]. The learning/inference problem targeted by [?] is to estimate \theta and \beta. I haven't found a fully satisfying justification for using Gibbs sampling for LDA. One justification is that, given a valid sample Z, we can estimate P(\theta|Z, \alpha) \propto P(Z|\theta) P(\theta|\alpha), and either maximizing it or taking its expectation/mean (since this expression is a Dirichlet p.d.f., getting its expectation/mean is easy, and in the literature this is the common practice rather than using the mode obtained by maximum likelihood) gives us an estimate of \theta; a similar procedure holds for \beta as well. However, this justification is not satisfying in the sense that it considers only one sample of Z|W rather than the distribution of Z|W, and it says nothing about why a single sample of Z|W should be good enough. By considering the distribution of Z|W (which is intractable and thus requires sampling methods), one can derive that

(2)   \begin{equation*} \begin{split} P(\theta|W, \alpha) &= \int P(\theta|Z, \alpha) P(Z|W) dZ \\ &\simeq \frac{1}{L} \sum_{l}^L P(\theta|Z^{(l)}, \alpha) \end{split} \end{equation*}

Where each sample Z^{(l)} is drawn from the distribution P(Z|W). Here the expectation/mean is easier to obtain than the mode by maximum likelihood, because we have a summation over p.d.f.s and each p.d.f. is of a Dirichlet distribution. The expectation of a Dirichlet variable from one sample Z^{(l)} can be found in textbooks or on the Wiki, and the expectation of the same variable over a set of samples is just the average of the expectations over the individual samples. Technically, we need to obtain multiple de-correlated samples of Z|W according to E.q. 2; however, one sample may be good enough depending on the data (for example, for \beta, even one sample of Z|W contains many samples of z given the same word w).

I think E.q. 2 gives us enough motivation and justification for doing Gibbs sampling on LDA. Basically, we need to be able to generate samples from P(Z|W), which is an intractable distribution; one popular way to do this is Gibbs sampling, which samples each variable in turn while keeping the others fixed.

In preparation for the derivation of Gibbs sampling for LDA, let me first highlight some important functions (as for the mathematics behind MCMC and Gibbs Sampling, readers can refer to some good tutorials on the Internet):

Dirichlet Distribution:

    \[ Dir(\theta;\alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{|\alpha|} \theta_i^{\alpha_i - 1} \]

Multinomial Beta function:

    \[ B(\alpha) = \frac{\prod_{i=1}^{|\alpha|}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{|\alpha|}\alpha_i)} \]

Gamma function:

    \[ \Gamma(n) = (n-1)! \ \ (\text{for positive integers } n) \]

Multinomial distribution:

    \[ Mult(\theta;x) = \frac{n!}{\prod_{i=1}^{|x|} x_i!} \prod_{i=1}^{|x|} \theta_i^{x_i}, \ \ \text{where } n = \sum_{i=1}^{|x|} x_i \]

Note that in LDA, as in many other Bayesian statistical models, people use conjugate distributions for the likelihood and the prior, e.g., Multinomial and Dirichlet. The trick behind this is that the posterior will be the same type of distribution as the prior, so that we can easily find a closed form for the integration of the posterior. In the case of Multinomial and Dirichlet, we have

    \[ Dir(\theta;x+\alpha) \propto Mult(\theta;x) Dir(\theta;\alpha) \]


    \[ 1 = \int Dir(\theta;x+\alpha) d\theta = \int \frac{1}{B(x+\alpha)} \prod_{i=1}^{|\alpha|} \theta^{x_i+\alpha_i-1} d\theta \]


    \[ \int \prod_{i=1}^{|\alpha|} \theta^{x_i+\alpha_i-1} d\theta = B(x+\alpha) \]

This formula is extremely useful for deriving the learning algorithm below. (One more note: although it is called a multinomial distribution in the generative process above, it is actually a single-trial multinomial, which degenerates into a discrete (categorical) distribution; however, this doesn't affect the learning much, as the formula here also holds for a sequence of discrete distributions.)
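To see the conjugacy numerically, here is a small sketch that computes the multinomial Beta function via log-gamma and the posterior mean of \theta under a Dirichlet prior with observed counts (the concrete numbers are made up for illustration):

```python
import math

def log_multinomial_beta(a):
    """log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)."""
    return sum(math.lgamma(ai) for ai in a) - math.lgamma(sum(a))

# Conjugacy in action: with prior Dir(alpha) and observed counts x,
# the posterior is Dir(x + alpha), whose normalizer is B(x + alpha)
# and whose mean is (x_i + alpha_i) / sum_j (x_j + alpha_j).
alpha = [0.5, 0.5, 0.5]
x = [3, 1, 0]
posterior = [xi + ai for xi, ai in zip(x, alpha)]
log_norm = log_multinomial_beta(posterior)
mean = [p / sum(posterior) for p in posterior]
```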

Now let's derive the joint distribution of W and Z, which will be useful just one step later (notations are explained at the beginning of the post):

(3)   \begin{equation*} \begin{split} &P(W,Z|\alpha,\eta) \\ =& \prod_d P(W_d, Z_d| \alpha, \eta) \\ =& \prod_d P(W_d|Z_d, \eta) \prod_d P(Z_d|\alpha) \\ =& \int \prod_d P(W_d|Z_d, \beta) P(\beta|\eta) d\beta \prod_d \int P(Z_d|\theta_d) P(\theta_d|\alpha) d\theta_d \\ =& \int \prod_k \prod_w \beta_{kw}^{n_{kw}} \prod_k \frac{1}{B(\eta)} \prod_w \beta_{kw}^{\eta_w - 1} d\beta \\ & \ \ \ \ \prod_d \int \prod_k \theta_{dk}^{m_{dk}} \frac{1}{B(\alpha)} \prod_k \theta_{dk}^{\alpha_k - 1} d\theta_d \\ =& \frac{1}{(B(\eta))^K} \int \prod_k \prod_w \beta_{kw}^{n_{kw}+\eta_w-1} d\beta \prod_d \frac{1}{B(\alpha)} \int \prod_k \theta_{dk}^{m_{dk}+\alpha_k - 1} d\theta_d \\ =& \prod_k \frac{B(n_k+\eta)}{B(\eta)} \prod_d \frac{B(m_d+\alpha)}{B(\alpha)} \end{split} \end{equation*}

Now we can derive the key formula for Gibbs sampling, which is P(z_i|Z_{-i}, W, \alpha, \eta). Here Z_{-i} means the topic assignments for all word occurrences other than the one at position i; without loss of generality, we assume that position i is in document d and is associated with the observed word w. Since

    \[ P(z_i|Z_{-i}, W, \alpha, \eta) = \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W|\alpha, \eta)} \propto \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \]

We have:

(4)   \begin{equation*} \begin{split} P(z_i=k|Z_{-i}, W, \alpha, \eta) \propto& \frac{P(z_i = k, Z_{-i}, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \\ =& \frac{B(n_k + \eta)}{B(n_k^{-i} + \eta)} \frac{B(m_d + \alpha)}{B(m_d^{-i} + \alpha)} \\ =& \frac{n_{kw}+\eta_w-1}{[\sum_{w'} n_{kw'} + \eta_{w'}] - 1} \frac{m_{dk}+\alpha_k-1}{[\sum_{k'} m_{dk'} + \alpha_{k'}] - 1} \\ \propto& \frac{n_{kw}^{-i}+\eta_w}{\sum_{w'} n_{kw'}^{-i} + \eta_{w'}} (m_{dk}^{-i} + \alpha_k ) \end{split} \end{equation*}

The above E.q. 4 gives us a very clear picture for implementing Gibbs sampling: basically, one just needs to initialize Z, set up counters for the three terms in the final line of E.q. 4, and, each time a word is sampled, decrease the corresponding counters, sample according to the normalized E.q. 4, then increase the corresponding counters according to the sampling result. After "burn-in", one can read out (de-correlated) samples and start estimating parameters. By E.q. 2, Bayes' rule, and the expectation of the Dirichlet distribution, we obtain the expectations of \theta and \beta given a sample Z^{(l)} as follows:

(5)   \begin{equation*} \theta_{dk} = \frac{m_{dk}+\alpha_k}{\sum_{k'} (m_{dk'} + \alpha_{k'})} \end{equation*}

(6)   \begin{equation*} \beta_{kw} = \frac{n_{kw}+\eta_w}{\sum_{w'} (n_{kw'} + \eta_{w'})} \end{equation*}

As mentioned above, one might also want to average parameters over multiple de-correlated samples according to E.q. 2.
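Putting E.q. 4, 5 and 6 together, here is a compact sketch of the whole collapsed Gibbs sampler (Python/NumPy, symmetric scalar \alpha and \eta assumed; for simplicity it reads the parameters out of the final sample only, rather than averaging over de-correlated samples):

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, eta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA following E.q. 4.
    docs: list of word-id lists; returns point estimates and assignments."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_kw = np.zeros((K, V))   # topic-word counts n_{kw}
    n_k = np.zeros(K)         # total words per topic
    m_dk = np.zeros((D, K))   # document-topic counts m_{dk}
    z = [rng.integers(K, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):   # count the initial assignments
        for n, w in enumerate(doc):
            k = z[d][n]
            n_kw[k, w] += 1; n_k[k] += 1; m_dk[d, k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]          # remove the current assignment
                n_kw[k, w] -= 1; n_k[k] -= 1; m_dk[d, k] -= 1
                # E.q. 4: p(z_i = k | Z_{-i}, W), up to normalization
                p = (n_kw[:, w] + eta) / (n_k + V * eta) * (m_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k          # add the new assignment back
                n_kw[k, w] += 1; n_k[k] += 1; m_dk[d, k] += 1
    # E.q. 5 and 6: posterior means from the final sample
    theta = (m_dk + alpha) / (m_dk + alpha).sum(axis=1, keepdims=True)
    beta = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, beta, z
```

A proper run would discard the burn-in iterations and average the estimates over several well-separated samples, as discussed above.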

Likelihood and perplexity 

So far I haven't mentioned the formula for calculating the likelihood of the data, P(W), which can sometimes be important for testing convergence (or one can use P(W|Z)). In [?], the authors mentioned an approximation technique for computing P(W): it is the harmonic mean of a set of values P(W|Z^{(n)}) from N samples, where Z^{(n)} is sampled from the posterior P(Z|W). That is:

    \[ P(W) = \frac{N}{\sum_n^N \frac{1}{P(W|Z^{(n)})}} \]


    \[ Z^{(n)} \sim P(Z|W) \]

Using the previous derivation, we can easily compute that:

(7)   \begin{equation*} \begin{split} &P(W|Z^{(n)}) \\ =& \prod_k \frac{B(n_k + \eta)}{B(\eta)} \\ =& \bigg(\frac{\Gamma(|V|\eta_0)}{\Gamma(\eta_0)^{|V|}}\bigg)^K \prod_k \frac{\prod_w \Gamma(n_{kw}+\eta_0)}{\Gamma(\sum_w n_{kw} +  |V|\eta_0)} \end{split} \end{equation*}

Where we have assumed a symmetric \eta (i.e. \eta_v = \eta_0 for all v). However, E.q. 7 can easily lead to numerical underflow/overflow if computed directly (due to the large factorial terms). So let's take its log and see what we get:

(8)   \begin{equation*} \begin{split} &\log P(W|Z^{(n)}) \\ =& C + \sum_k \sum_w \log  \Gamma(n_{kw}+\eta_0) - \log \Gamma(\sum_{w'} n_{kw'} + |V|\eta_0) \end{split} \end{equation*}

Where C is constant:

    \[ C = K\log \Gamma(|V|\eta_0) - K|V| \log \Gamma(\eta_0) \]

The log-gamma function is usually built into the programming language, so you can use it directly. But since we only have \log P(W|Z^{(n)}), and it can be quite "negatively large" (maybe -10^{10}, depending on the corpus), this again creates a computational difficulty when we compute the harmonic mean of P(W|Z^{(n)}). So we again work with \log P(W), and the trick below enables us to compute it accurately:

(9)   \begin{equation*} \begin{split} \log P(W) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Here t_n = \log P(W|Z^{(n)}), and t_0 is introduced to make the log-sum-exp computed correctly; usually we can set t_0 = \min_n \{t_n\}. Now, even if the corpus contains millions of documents, the above likelihood calculation should still work, as the scale stays within the range of floating-point numbers.

The likelihood of held-out data is sometimes also a useful quantity; it reflects the generalization/predictive power of the model, and it is usually measured by perplexity, which is essentially the inverse of the geometric mean per-word likelihood. It is defined by:

    \[ perplexity(D_{test}) = \exp \{ - \frac{\sum_d \log P(W_d)}{\sum_d N_d} \} \]

This can be evaluated as follows: first hold out some documents, then sample topic assignments for the words in those documents given a point estimate \hat\beta, and then calculate the likelihood (and hence the perplexity) of the held-out documents as stated above.

Some resources on Gibbs Sampling for LDA 

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation using Gibbs sampling.

WinBUGS: Gibbs sampling tools for graphical models.