## Bayesian – the good and the bad

The following thoughts are general, not necessarily entitled to every models.

The good (not exclusive)
1. probabilistic modeling, principled.

The good (exclusive)
1. generative process enables to plug in “domain knowledge” very easily.
2. prior enables to plug in further “domain knowledge”.
3. integration/summation over latent variables usually yields better performance.

1. setting prior can be difficult and may require hyper-parameter turning.

1. the specific generative process with “domain knowledge” can lower the model’s ability to explore complicity in the data.
2. the design of generative process requires deep understanding about both data, “domain knowledge” and math.
2. the inference is complicated and slow, especially with integration/summation over latent variables.

## 1. Introduction

Consider the following statistical problem setting: we are given and , where is a set of observations, and is a set of latent random variables, the goal is to compute the conditional posterior distribution of given , i.e. . This is a very general setting, because here can be anything that we haven’t observed, which includes naturally aroused latent variables, as well as those parameters been considered as random variables.

## 2. Mean-field Approximation

### 2.1. Evidence Lower Bound

In many complex models, the posterior is intractable (but is tractable), which means we cannot compute directly given data. So the idea is to find a approximation distribution over , i.e. . Under the mean-field approximation, there are two ways to introduce term by approximation.

KL-divergence. We want to quantify the approximation by using KL-divergence between true distribution and proposed distribution, that is

(1)

To make the approximation accurate, we want to minimize the KL-divergence over . However, once again we encounter the intractable distribution in the KL-divergence. To solve this problem, we expend E.q.1:

Here we abbreviate distribution and as and . Now since has nothing to do with how we determine . So minimizing KL-divergence is equivalent to maximizing the evidence lower bound (this name would make more sense when we come to the other way of deriving it later), i.e. :

(2)

Here term is called variational free energy, is the entropy of . Thus to minimize the approximation error is to maximize the lower bound over :

Jensen Inequality. Another way of deriving approximation distribution is by considering the estimation of data log-likelihood:

It is also easy to see that the difference between data likelihood and lower bound is the KL-divergence between and .

### 2.2. Deriving

Now that we have evidence lower bound containing a tractable distribution (good) and a unknown distribution over all latent variables (not so good), we still need a way to quantify . Under mean-field variational Bayes, we will make an assumption: latent variables can be factorized into several independent sets (specified by users), i.e.,

(3)

Plug E.q. 3 into lower bound E.q. 2, we can derive the optimum solution of while others fixed is (expressed in logarithm):

or

Where

Optimal can be derived from here, although might be difficult to work with for some model (for most models it is not).

Once the optimal for all are found, we can alternatively update each latent variable until convergence (which is guaranteed due to the convexity of ELBO). Noted the convergent point is local optimal.

#### 2.2.1. for Exponential Family Conditionals

If is in exponential family, then optimal is in the same family of distribution as . This provides a shortcut for doing mean-field variance inference on graphical models with conditional exponential families (which is common for many graphical models): using the theory mentioned here, we can simply write down the optimal variation distributions and their undetermined variational parameters, then setting derivative of ELBO to zero w.r.t. those parameters (or using gradient ascent) can form a coordinate ascent learning procedure. See David Blei’s note.

## 3. Expectation Propagation – A glance

Same as mean-field approximation, EP also tries to find the approximation distribution by minimizing the KL-divergence. However, different from mean-field which minimizes the KL-divergence by maximizing the ELBO, EP tries directly maximize KL-divergence, which might be difficult for any distributions, but practical for some distributions, such as exponential families. The details might be contained in future posts.

## Synthetic Experiments with PLSA

Probabilistic Latent Semantic Analysis[1] is a technique for modeling the topic structures of a given set of documents, which is similar to LDA (Latent Dirichlet Allocation) [2], except that the latter has a better (more complete) generative property. In this post, I am conducting some simple experiments for testing the PLSA’s ability to recover the real parameter (here I focused on , the word distribution over topics).

The generative process for the data is described in this post, to make visualization easier, we will not use some random , instead, is set to be visualization friendly as suggested by [3]:

Here, each small box is a for topic , each small areas inside the box indicate words, the white areas mean the probabilities of the word for that given topic is high, while the black areas mean the reverse. These boxes are generated by the following procedure: consider 10 topics and 25 words, for each topic, each is a vector, reshape it into box, and uniformly fill 5 entries according to the above figure.

After the data is generated, we can test it with PLSA (optimized by EM algorithm), and see how the results below. To get more through understandings of the performance, I leverage some parameters used in generative process and learning process of PLSA to create different data and learning scenarios.

### Scenario I – Prototype data

For prototype setting, I used 100 documents, each document with expected 100 words, is set of 0.2 so that documents have more skewed topic distribution. Here shows some of the documents:

You can see they are quite mixed, but still appears some kind of pattern. For the document word matrix, we can re-organize it so that documents with the same (highest) topic will appear together (noted this re-organization require ground truth theta, which is not known to PLSA beforehand):

Here you can see the patterns even more clearly, thus it should not be too difficult for PLSA to recover those parameters like .

The figure below is what PLSA recovered:

Compared with the original boxes, they are quite similar I would say (although there are still some differences from the true , but not far away). So, not surprisingly, PLSA recover the parameters fair enough.

### Scenario II – More uniform topic distributions

Since prototype is like the most basic and simplest setting, it is not surprising PLSA works. Now, consider when documents have much more uniform topic distributions, it can be achieved by setting a grater , say 2. And other settings are the same with Scenario one. Here shows some of the documents and the re-organized document word matrix:

Compared with prototype, it is now much difficult for human to discover some patterns. Now how is PLSA doing on recovering :

Worse I would say, but still can count as fair. So it seems to be good at handling highly mixed topics.

### Scenario III – Few words

Now, I will still use the same settings as in prototype, with the changes on number of expected words per document: now it is 10 instead of 100 in prototype. Some of the documents and the re-organized document word matrix:

This is PLSA recovered:

Well, this is not very good, quite sparse compared with true . So fewer words in the documents are definitely not good for PLSA.

### Scenario IV – Few documents

How about few documents, now change the number of documents from 100 in prototype to 10. Still see some of the documents and the re-organized document word matrix:

Okay, see the recovered :

Wow… This is.. Extremely bad.. So we learn, small number of documents are definitely bad for PLSA.

### Scenario V – Different sized topics (small topics)

All above we assume the size of different topics are uniformly distributed, as we use the same alpha for each topics, now, let’s change it, I will use [1 2 3 4 5 6 7 8 9 10] for the 10 different topics, other than this, all settings are still same as in prototype. See some of the documents and the re-organized document word matrix:

What the.. I don’t think any human can learn anything about  from these figures.. But how much can PLSA recovered

:

Hoho, not much.. This suggest for different sized topics, PLSA perform poorly.

### Scenario VI – Learning with more topics

In all above scenarios, learning for PLSA is conducted with true number of topics. What if we don’t know it and use the wrong number of topics? Here I use the same settings as in prototype, but when learning PLSA, instead of using true number of topics, which is 10, I use 20. Now see what PLSA has recovered:

It seems most topics are broken down into many topics, but it wouldn’t matter too much as long as for each topic word distribution it is not polluted by other ones, and the linear combination of those topics can still well recover the true topics.

### Scenario VII – Learning with fewer topics

How about fewer topics, here using topic number of 5. Recovered :

Huh.. This is like… merged together? Definitely not very good.

## Conclusions

So here comes a summarization of performances of PLSA in recovering the true parameter :

• PLSA can recover the hidden topics well if the set of documents are more like “prototype”, meaning with a lot of documents, a lot of words per document, each document is not highly mixed with many topics.
• When the mixing of topics are high, so that each document’s topic distribution is more uniform, PLSA performs a little bit worse, but still fair enough, so highly mixing of topics is not a nightmare for PLSA.
• Few words in documents and few documents in the corpus are nightmares, especially when there are only a few documents, PLSA is likely to fail.
• When there are various sized topics, some topics are very popular and appear in many documents, while some topics are cold and only appear on a few documents, PLSA will also perform worse, especially for the cold topics.
• When using not “exact” number of topics when training, it would not be very problematic if the used number is slightly larger than real one, but it would be if the number is smaller.

It seems those conclusions from the synthetic experiments are someway similar to the conclusion of ICML 2014’s best paper ([4], in Section 4), though which is described for LDA and provide much much deeper studies than provided here.

[1] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the fifteenth conference on uncertainty in artificial intelligence, 1999, pp. 289-296.
[Bibtex]
@inproceedings{hofmann1999probabilistic,
title={Probabilistic latent semantic analysis},
author={Hofmann, Thomas},
booktitle={Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence},
pages={289--296},
year={1999},
organization={Morgan Kaufmann Publishers Inc.}
}
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” The journal of machine learning research, vol. 3, pp. 993-1022, 2003.
[Bibtex]
@article{blei2003latent,
title={Latent dirichlet allocation},
author={Blei, David M and Ng, Andrew Y and Jordan, Michael I},
journal={the Journal of machine Learning research},
volume={3},
pages={993--1022},
year={2003},
publisher={JMLR. org}
}
[3] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the national academy of sciences of the united states of america, vol. 101, iss. Suppl 1, pp. 5228-5235, 2004.
[Bibtex]
@article{griffiths2004finding,
title={Finding scientific topics},
author={Griffiths, Thomas L and Steyvers, Mark},
journal={Proceedings of the National academy of Sciences of the United States of America},
volume={101},
number={Suppl 1},
pages={5228--5235},
year={2004},
}
[4] J. Tang, Z. Meng, X. Nguyen, Q. Mei, and M. Zhang, “Understanding the limiting factors of topic modeling via posterior contraction analysis,” in Proceedings of the 31st international conference on machine learning, 2014, pp. 190-198.
[Bibtex]
@inproceedings{tang2014understanding,
title={Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis},
author={Tang, Jian and Meng, Zhaoshi and Nguyen, Xuanlong and Mei, Qiaozhu and Zhang, Ming},
booktitle={Proceedings of The 31st International Conference on Machine Learning},
pages={190--198},
year={2014}
}

## Some Existing Work on Linked Documents Analysis

Different from Topic Modeling (which analyzes the topic of documents), Linked Documents Analysis (or Topic Modeling with Links) changes the problem setting by considering the links among documents. The goal is still similar, which is to analyze the topic of documents, expect that sometime we may also want to predict the presence of some queried link. The basic idea behind Linked Document Analysis is that, on one hand, link existence can provide some useful information on the topic distribution, on the other hand, topic distribution can also provide useful information on link existence; in other words, they are correlated. In this post, I am going to review some existing work on topic modeling with links. Basically they can be categorized into generative approaches and discriminative approaches, while former tries to explain why link existing or non-existing by topic/membership distributions of nodes, and the latter tries to utilize existing links in a discriminative way to better model topic/membership distributions.

### Generative Approaches

• Citation Influence Model [1]: The basic idea is that if a document has cited others, the way it generates the topics for word occurrences is either (1) draw from the topic distribution of its cited documents, or (2) draw from its own topic distribution. Other similar models that utilize the citation as a pathway of drawing topic assignments: [2].
• PHITS+PLSA [3, 4]: it treats each document as “words” when performing link generation, it is like PLSA twice, once on document-word, once on document-document. This method was later extended into “Link-LDA“, which puts a prior on topic distribution extending PLSA to LDA. Both models suffer from not (explicitly) utilizing word topical information for link analysis, link analysis is purely based on the information of link co-occurrence, and also fails to model the topical dependence between the cited and citing documents explicitly.
• Link-PLSA-LDA[4]: overcoming the difficulties in PHITS+PLSA and “Link-LDA” by connecting word in topic-word and document topic-document by symmetric parametrization of merged cited documents.
• LDA+MMSB[5]: this may be the most natural generative process for both words and links, it doesn’t treat documents as some kind of dictionaries as in PHITS+PLSA and Link-PLSA-LDA,  and thus can be generalized easily to new documents without previously seen links. However, due to the generative process, this model may not be scale-friendly (though some subsampling techniques might help alleviate the computational complexity).
• LDA+ZAlign (RTM[6]): defining link probability according to the alignment of latent variables Z. the alignment is quantified by dot production, and the transformation from alignment to probability is through either a Sigmoid function or an Exponential function. The authors pointed out an common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB) that due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets, and allow for links to be explained by one subset of topics, and the words to be explained by the other; to improve over this, they enforce the word and link are draw from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is also not scale-friendly.
• Others [7].

### Discriminative Approaches

• PLSA+Reg (NetPLSA[8]): adding regularization loss, acts like smoothing, on to original PLSA (log-)likelihood. This model can easily deal weighted links.
• PLSA+Cond (iTopicModel[9]): defining a condition of given its neighbors, which leads to an likelihood of configuration given network of documents, then add this conditioned log-likelihood as a smoothing to original PLSA log-likelihood. Compared with PLSA+Reg, samely this model can also easily deal weighted links, differently this model can fit directed networks more naturally, and also it has a closed form solution of so it is much friendly for optimization compared with PLSA+Reg.
• Others [10].
[1] L. Dietz, S. Bickel, and T. Scheffer, “Unsupervised prediction of citation influences,” in Proceedings of the 24th international conference on machine learning, 2007, pp. 233-240.
[Bibtex]
@inproceedings{dietz2007unsupervised,
title={Unsupervised prediction of citation influences},
author={Dietz, Laura and Bickel, Steffen and Scheffer, Tobias},
booktitle={Proceedings of the 24th international conference on Machine learning},
pages={233--240},
year={2007},
organization={ACM}
}
[2] A. Gruber, M. Rosen-Zvi, and Y. Weiss, “Latent topic models for hypertext,” Arxiv preprint arxiv:1206.3254, 2012.
[Bibtex]
@article{gruber2012latent,
title={Latent topic models for hypertext},
author={Gruber, Amit and Rosen-Zvi, Michal and Weiss, Yair},
journal={arXiv preprint arXiv:1206.3254},
year={2012}
}
[3] D. C. T. Hofmann, “The missing link-a probabilistic model of document content and hypertext connectivity.” 2001.
[Bibtex]
@inproceedings{hofmann2001missing,
title={The missing link-a probabilistic model of document content and hypertext connectivity},
author={Hofmann, David Cohn Thomas},
year={2001}
}
[4] R. Nallapati and W. W. Cohen, “Link-plsa-lda: a new unsupervised model for topics and influence of blogs..” 2008.
[Bibtex]
@inproceedings{nallapati2008link,
title={Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs.},
author={Nallapati, Ramesh and Cohen, William W},
year={2008}
}
[5] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, “Joint latent topic models for text and citations,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, 2008, pp. 542-550.
[Bibtex]
@inproceedings{nallapati2008joint,
title={Joint latent topic models for text and citations},
author={Nallapati, Ramesh M and Ahmed, Amr and Xing, Eric P and Cohen, William W},
booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},
pages={542--550},
year={2008},
organization={ACM}
}
[6] J. Chang and D. M. Blei, “Relational topic models for document networks,” in International conference on artificial intelligence and statistics, 2009, pp. 81-88.
[Bibtex]
@inproceedings{chang2009relational,
title={Relational topic models for document networks},
author={Chang, Jonathan and Blei, David M},
booktitle={International Conference on Artificial Intelligence and Statistics},
pages={81--88},
year={2009}
}
[7] Y. Zhu, X. Yan, L. Getoor, and C. Moore, “Scalable text and link analysis with mixed-topic link models,” in Proceedings of the 19th acm sigkdd international conference on knowledge discovery and data mining, 2013, pp. 473-481.
[Bibtex]
@inproceedings{zhu2013scalable,
author={Zhu, Yaojia and Yan, Xiaoran and Getoor, Lise and Moore, Cristopher},
booktitle={Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining},
pages={473--481},
year={2013},
organization={ACM}
}
[8] Q. Mei, D. Cai, D. Zhang, and C. Zhai, “Topic modeling with network regularization,” in Proceedings of the 17th international conference on world wide web, 2008, pp. 101-110.
[Bibtex]
@inproceedings{mei2008topic,
title={Topic modeling with network regularization},
author={Mei, Qiaozhu and Cai, Deng and Zhang, Duo and Zhai, ChengXiang},
booktitle={Proceedings of the 17th international conference on World Wide Web},
pages={101--110},
year={2008},
organization={ACM}
}
[9] Y. Sun, J. Han, J. Gao, and Y. Yu, “Itopicmodel: information network-integrated topic modeling,” in Data mining, 2009. icdm’09. ninth ieee international conference on, 2009, pp. 493-502.
[Bibtex]
@inproceedings{sun2009itopicmodel,
title={itopicmodel: Information network-integrated topic modeling},
author={Sun, Yizhou and Han, Jiawei and Gao, Jing and Yu, Yintao},
booktitle={Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on},
pages={493--502},
year={2009},
organization={IEEE}
}
[10] T. Yang, R. Jin, Y. Chi, and S. Zhu, “Combining link and content for community detection: a discriminative approach,” in Proceedings of the 15th acm sigkdd international conference on knowledge discovery and data mining, 2009, pp. 927-936.
[Bibtex]
@inproceedings{yang2009combining,
title={Combining link and content for community detection: a discriminative approach},
author={Yang, Tianbao and Jin, Rong and Chi, Yun and Zhu, Shenghuo},
booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining},
pages={927--936},
year={2009},
organization={ACM}
}

## Gibbs Sampling for Mixed Membership Stochastic Blockmodels

Mixed Membership Stochastic Blockmodels (MMSB)[1] is a generative model for links in a network. Simply put, give an adjacency matrix , MMSB assume  is governed by the memberships/topics  of each nodes. Since the original paper[1] only introduced the Variational Inference, this post derives a Gibbs Sampling method for the purpose of parameter estimation in MMSB (actually the MMSB presented in this post is slightly different from[1], as I put a prior on interaction matrix ).

The generative process of MMSB is given below:

For each node :

Draw a membership distribution vector

For each entry in :

Draw

For each pair of nodes :

Draw a membership

Draw a membership

Draw

Here is a simple explanation for the notations:

is (number of topics) dimensional hyperparameter for membership/topic distribution.

are two scalar hyperparameters for .

is the membership/topic distribution for node .

is the interaction probability matrix of memberships.

is the membership of node when it interacts with node (the link can be treated as directional, as well as unidirectional). Note that  is abbreviated for .

is the adjacency matrix, each entry is either one or zero.

Later on, I am also going to use the following notations:

is the count of node assigning to membership when it interacts with the rest nodes in the network. For directional network, it can be divided into two sets: and  (), the former is the assignment of node when it links to other nodes, the latter is the assignment of node when other nodes link to it.

is the count of linked node pairs with membership assignments and (sum up  and if don’t consider direction).

is the count of un-linked node pairs with membership assignment and (sum up and if don’t consider direction).

To perform Gibbs Sampling, we need to find the posterior for joint probability of both data and latent variables . Well, here it is:

(1)

Now that we obtained the joint probability of data and latent variables, we can start to derive the conditioned probability of latent variable as following:

(2)

(3)

Given samples of generated from Gibbs Sampling, parameters can be estimated in the following ways:

For , we have

(4)

E.q. 4 is a Dirichlet p.d.f, using Dirichlet variable property, we get the expectation(mean) of :

For , we have

(5)

E.q. 5 is a Beta p.d.f, using Beta variable property, we get the expectation(mean) of :

One may want to sample multiple de-correlated samples of to calculate the above parameters.

#### Likelihood and perplexity

Until now, I didn’t mention the formula for calculating the likelihood of data , which can be important some time for testing of convergence (or using ). In [2], the authors mentioned an approximation technique for computing , it is the harmonic mean of a set of values of from samples, where is sampled from the posterior . So that is:

Where

Using the previous derivation, we can easily compute that:

(6)

Where is a constant:

Now with , we can compute in the way:

(7)

Where , is introduced to make log-sum-exp computed correctly, usually we can set .

The likelihood of hold out data sometime is also an useful quantity, it reflects generalized predictive power of the model, and usually it is measured by perplexity, essentially it is the inverse of the geometric mean per-link likelihood. It is defined by:

This term can be evaluated by first hold out some subset of links and non-links, and evaluated by computing:

(8)

Where and are estimated in the training data.

#### Discussions

The Mixed Membership Stochastic Blockmodels is not scalable as it requires to examining edges (this is a lot even for medium size network with only thousands of nodes!), in sharp contrast to sparse networks as arise more often in real datasets. But in practice, for those non-existing edges, instead of using them all, we can sample a small portion of them, this could effectively reduce the computational complexity, especially for sparse networks where we expect the sample ratio of non-existing edges to be even fewer. However, it is still remained to be examined that the relationship between the loss of performance and sampling ratio of non-existing links.

Also due to the scalability issue, one may consider the MMSB without prior for each document topic distribution, which can be also by EM rather than Gibbs Sampling. Also, one can derive Gibbs Sampling for combined LDA and MMSB [3] in similar ways to Gibbs Sampling for LDA (see here) and MMSB individually (basically key is to derive , where ).

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” in Advances in neural information processing systems, 2009, pp. 33-40.
[Bibtex]
@inproceedings{airoldi2009mixed,
title={Mixed membership stochastic blockmodels},
author={Airoldi, Edoardo M and Blei, David M and Fienberg, Stephen E and Xing, Eric P},
booktitle={Advances in Neural Information Processing Systems},
pages={33--40},
year={2009}
}
[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the national academy of sciences of the united states of america, vol. 101, iss. Suppl 1, pp. 5228-5235, 2004.
[Bibtex]
@article{griffiths2004finding,
title={Finding scientific topics},
author={Griffiths, Thomas L and Steyvers, Mark},
journal={Proceedings of the National academy of Sciences of the United States of America},
volume={101},
number={Suppl 1},
pages={5228--5235},
year={2004},
}
[3] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, “Joint latent topic models for text and citations,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, 2008, pp. 542-550.
[Bibtex]
@inproceedings{nallapati2008joint,
title={Joint latent topic models for text and citations},
author={Nallapati, Ramesh M and Ahmed, Amr and Xing, Eric P and Cohen, William W},
booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},
pages={542--550},
year={2008},
organization={ACM}
}

## Notes on Latent Dirichlet Allocation: The Generative Model and Gibbs Sampling Learning

Latent Dirichlet Allocation, originally proposed in [1], is a generative graphical model for mining latent topics of texts or any data with similar underlying statistical structures.

## The Generative Model

The generative process of words in a document are as following:

For each of the topic:

Choose

For each document in the corpus:

Choose topic distribution .

For each of the word occurrences in the document:

Choose a topic .

Choose a word

Here presents a simple explanation of the notations:

is (number of topics) dimensional hyperparameter for topic distribution.

is (number of words) dimensional hyperparameter for word distribution over a given topic.

is the probability of choosing word given its drawn from topic .

is the topic distribution for document .

is the corresponding topic for the -th word in document .

Later on, I am also going to use the following notations:

, , are the observed words.

, , are the topic assignments for word occurrence.

is a dimensional vector, in which is the count of word in the corpus assigned to topic .

is without counting information at word occurrence at position .

is a dimensional vector, in which is the count of words in document been assigned for topic .

is without counting information at word occurrence at position .

The graphical representation of above generative process is as following (each node represents a random variable, filled with gray when observed, links denote dependences, and rectangle plates mean repetition, i.e. the big plate indicates there are documents like that, the small plate indicates there are occurrences words in each document):

To keep (writing) this note simple, I am not going to explain the generative process in details here.

## The Likelihood

What we are interested in this model are random variables , , and maybe . But before going to inferences of these latent random variables, we may first follow the routine and write down the likelihood of the data, i.e.:

(1)

Where

## Gibbs Sampling Learning

Gibbs sampling learning for LDA was first proposed in [2]. The learning problem or inference problem targeted by [2] is to estimate and . I didn’t find any good justification yet for using Gibbs sampling for LDA, one justification is that given the valid sample , we estimate , by either maximizing it or getting its expectation/mean (since this expression is a Dirichlet p.d.f., getting its expectation/mean is easy, and in literature this is often the common practice rather than using mode obtained by maximum likelihood) can give us an estimation of , similar procedures hold for as well. However, this justification is not satisfying in the sense that it didn’t consider the distribution of rather than just a sample of it, and it shows nothing about why a sample of will be good enough. By considering the distribution of (which is intractable thus requires sampling methods), one can derive that

(2)

Where each sample is drawn from distribution . Here the expectation/mean is easier to obtain than the mode by maximum likelihood because we have a summation over p.d.f. and the p.d.f. is of Dirichlet distribution. The expectation of a Dirichlet variable from one sample of  can be found in book or Wiki, and the expectation of the same variable over a set of samples is just an average of expectations of each sample. Technically, we need to obtain multiple de-correlated samples of as according to E.q. 2, however, one sample may be good enough depends on the data (for example, it can be shown that for , even one sample of , it could contain many times of samples given the same word ).

I think E.q. 2 gives us enough motivation and justification for doing Gibbs sampling on LDA. Basically we need to be able to generate samples from , which is an intractable distribution, one popular way for this purpose is Gibbs sampling, which generates samples from each variable once at a time while others fixed.

Preparing for derivation of Gibbs sampling for LDA, let me first lightlight some important functions (as for the mathematics behind MCMC and Gibbs Sampling, readers can refer to some good tutorials on the Internet):

Dirichlet Distribution:

Multinomial Beta function:

Gamma function:

Multinomial distribution:

Noted that in LDA or many of other bayesian statistical models, people use conjugate distribution for likelihood and prior, e.g., Multinomial and Dirichlet. The trick behind this is that posterior will be the same distribution as the prior, so that we can easily find a closed form for the integration of the posterior. In the case of Multinomial and Dirichlet, we have

Since

i.e.:

This piece of formula is extremely useful for deriving the following learning algorithm. (One more note would be: although people call it multinomial distribution in the generative process above, but actually it is one time multinomial, which has already degenerated into discrete distribution, however, this doesn’t affect much for the learning, as we see the formula here will also hold for a sequence of discrete distribution.)

Then let’s derive the joint distribution of and which will be useful just one step later (notations are explained in the beginning of the post):

(3)

Now we can start to derive key formula for Gibbs sampling, which is , here means the topic assignments for all word occurrences other than the one at position , without loss of generality, we assume here that the position is in document and associated with the observed word , since

We have:

(4)

The above E.q. 4 gives us a very clear picture for implementing Gibbs sampling, basically, one just needs to initialize , setting counter for three terms in final of E.q. 4, and each time when sample a word, decrease the corresponding counters, sample according to normalized E.q. 4, than increase corresponding counters according to the sampling result. After “burn-in”, one can read out (de-correlated) samples and start estimation of parameters. By E.q. 2, bayes’ rule and expectation of Dirichlet Distribution, we obtain the expectation of and given a sample of as following:

(5)

(6)

As mentioned above, one might also want to average parameters over multiple de-correlated samples according to E.q. 2.

Likelihood and perplexity

Until now, I didn’t mention the formula for calculating the likelihood of data , which can be important some time for testing of convergence (or using ). In [2], the authors mentioned an approximation technique for computing , it is the harmonic mean of a set of values of from samples, where is sampled from the posterior . So that is:

Where

Using the previous derivation, we can easily compute that:

(7)

Where we have assumed a symmetric . However, the E.q. 7 can easily lead to numerical underflow/overflow if directly computed (due to the large factorial terms). Here we can take a log of it and see what we got:

(8)

Where is constant:

The log gamma function should be built in the programming language so you can use it directly. But now since we only have , and it can be quite “negatively large” (maybe depending on the corpus), this again create a computational difficulty when we compute the Harmonic average of . Again we are going to log , and using some trick below enable us to compute it accurately:

(9)

Here , is introduced to make log-sum-exp computed correctly, usually we can set . Now, even the corpus contains millions of documents, the above likelihood calculation should still work as the scale is under the range of floating point number.

The likelihood of hold out data sometime is also an useful quantity, it reflects generalized predictive power of the model, and usually it is measured by perplexity, essentially it is the inverse of the geometric mean per-link likelihood. It is defined by:

This term can be evaluated by: first hold out some documents, and then sample topic assignment for words in those documents given a point estimate , then the likelihood (there go the perplexity) of hold out documents can be calculated as stated above.

Some resources on Gibbs Sampling for LDA

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation using Gibbs sampling.

WinBUGS: Gibbs sampling tools for graphical models.

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” The journal of machine learning research, vol. 3, pp. 993-1022, 2003.
[Bibtex]
@article{blei2003latent,
title={Latent dirichlet allocation},
author={Blei, David M and Ng, Andrew Y and Jordan, Michael I},
journal={the Journal of machine Learning research},
volume={3},
pages={993--1022},
year={2003},
publisher={JMLR. org}
}
[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the national academy of sciences of the united states of america, vol. 101, iss. Suppl 1, pp. 5228-5235, 2004.
[Bibtex]
@article{griffiths2004finding,
title={Finding scientific topics},
author={Griffiths, Thomas L and Steyvers, Mark},
journal={Proceedings of the National academy of Sciences of the United States of America},
volume={101},
number={Suppl 1},
pages={5228--5235},
year={2004},
}

## Graphical model and factorization

Original I want to write more backgrounds and illustrated by real life examples, but I found this process both time and words consuming, so I will assume readers with some related backgrounds and only write down important technical part of the contents in this post, and also include some references for more details.

A graphical model is a model build on a graph where nodes are random variables and edges are their dependencies. In a factor graph , where is a set of random variables,  is a set of factors connecting random variables, and is the set of edges between and . For illustration purpose, below is a diagram of the relationship between naive Bayes, logistic regression, HMMs, linear- chain CRFs, generative models, and general CRFs from [1] (which is a good tutorial for this topic). In the lower row of the diagram they are the targets of this post: undirected graphical models using factor graph modeling. It is worth mentioning, in the figure, the features is also denoted as node, however, in the notation of this post, features are encoded into the factor (so each node would have a single connection to a factor that associates the configuration of this node with the features for this node), nodes only represent to random variables.

The probability of random variables in the factor graph is (here I use to denote factor graph structure, and for features associated with the factor nodes) can be factorized as following:

(1)

Where is a set of clique templates where in each clique template all cliques share the same potential function  (a clique formed by a set of random variables connected by a factor, it can be a single node, a pair of nodes or a set of nodes), and is a potential function based on configurations  of the clique, features of the clique and parameters of clique template . can be thought as a way of expressing how likely a clique with specific configurations and features is. is a global normalization term to make the whole thing a probability.

## Instantiation by Exponential Family

In equation 1  we have not specified the potential function defined in each clique instance, the most common practice for instantiating it is using the Exponential Family (similar to log linear model):

(2)

Here is the number of parameters or features in clique template , is the feature function (or factor function), which turns a set of feature and labels into a real or binary value. You can self define the factor functions according to some domain knowledge, but a requirement has to be met is that factor function should be unique for each label , i.e. if and only if . To be more specific, let us consider two type of common used features: node attribute feature for a single random variable and label correlation feature among a set of variables. For node attribute feature, originally it should be a dimensional constant vector for a node over all different configurations, however, to meet the above requirement, one has to form a dimensional feature vector associated with labels ( is the number of possible labels), for each label, feature vector is sparse, only non-zero dimensions correspond to the original feature, since for each label the non-zero dimensions are disjoint, so the feature vector is unique w.r.t. the label; for label correlation feature, it can be only defined on cliques that involve more than one node, for example,  suppose each node can take 0/1 two states, then a  dimensional feature vector can be defined over a pair of node, for each label, the feature vector is only non-zero on the 2 associated dimensions (one for two nodes taking the same states, the other for them taking differently), again the feature vector is unique w.r.t. the label as non-zero parts of the feature vector are disjoint. (Regards to these two type of features, an interesting fact is: if only consider the node attribute features, the factor graph model degenerates to logistic regression)

With above instantiation, the normalization constant can also be specified below:

(3)

To model the real world problem, this instantiation is still not enough, not only the structures of the graph depend on the real problem, but also those features are also needed to be specified by the model designer. Once a factor graph model is specified, there are two main problems left: parameter learning problem and Inference problem, which are covered in following two sections.

## Parameter learning

The goal of parameter learning is to learn a set of parameter that can best explain observed data under the factor graph model (suppose we are given all labels during training), which is a maximum likelihood estimation of the parameters, the objective function is:

(4)

A straight-forward method is gradient based approach, such as gradient ascent, or Newton-Raphson method, which all rely on the gradients of parameters. So let us take a look at the derivative of  of objective function (detailed derivation can be found in the appendix of the post):

(5)

Where is the expectation of the factor function (for a specific clique template and feature ) under the empirical distribution of , i.e. labels are given, and is the expectation of the factor function under the model distribution of , i.e. label distributions are predicted by current model. This is fairly intuitive as when we learn parameters, we want these two terms to be close so that model distribution can fit empirical distribution well.

However, we still have one problem left: how to calculate the derivatives containing marginal probability of ? Here is a solution: we can first obtain the marginal probability for each random variable in the graph using inference techniques (introduced in the next section), and then calculate using the chain rule for each clique that consist of more than one nodes. You can see in this solution, the computation complexity is constrained by the size of the clique and the way we specify the factor function, if we utilize factor functions that rely on the permutation of labels, then the correlation feature space would be exponential to the size of the clique according to the chain rule (this is also why factorizing into in derivative of the normalization term is important).

## Inference

Inference has double meanings here: one is to obtain the marginal probabilities of random variables, e.g. ; the other is to infer best configuration of the graph, i.e. find out such that can be maximized. Here I introduce two algorithms I have used: for obtaining marginal probability, sum-product algorithm, and for the best configuration problem, max-sum algorithm.

Sum-product, or also known as belief propagation, which is more explanatory as the algorithm runs by propagating belief in the network. So what is belief? Belief is the estimated marginal probability of a random variable node. However, although it is called belief propagation, those been really passed among factor graph are messages, a partial and unnormalized belief. There are two types of messages, one is the message from a variable node to a factor node telling that ‘s states based on other factors’ messages; the other is the message from a factor node to a variable node telling the states of based on other variables’ messages.

To be more specific, a message from a variable to a neighbor factor is computed as following:

(6)

Here and in the subscript indicate a random variable node and a factor node, and in parentheses is a real valued scalar indicate likelihood for a particular state of random variable , a message vector then can be formed by concatenating the scalars for all possible states of the random variable . Note that  a message may not sum to one, you need to normalize it when a message is gathered.

A message from a factor to a neighbor variable is computed as following:

(7)

Here are the neighbor variables of .

There are two type of special nodes: leaf nodes that only have one neighbor and observed variable that its state is fixed. For a random variable that is observed with evidence , the message to its neighbor factor should take the following form:

(8)

For a leaf factor , the message to its neighbor variable  should take the following form:

(9)

Once the messages are computed, we can then turn to calculate the believes, or marginal probabilities, of those random variables. The marginal probability of variable can be calculated by the production of all messages from its neighbors:

(10)

The marginal probability for a factor is:

(11)

Here are neighbor variables of .

So to sum up, the algorithm works like this: first form a tree (by, e.g., BFS) of graph, from leaf to root and from root to leaf, passing message according to E.q. 6 to 9, then you can calculated the marginal probabilities of all variable nodes with E.q. 10. It is worth noting that if the graph is a tree, then the marginal probabilities are exact, however, if it is not, then we can still do it in the same or similar ways (with more iterations of “leaf-root and root-leaf”), that is called loopy belief propagation, the algorithm will not be guaranteed to converge, but empirically works nicely [2].

Although the Sum-product algorithm can be utilized to find best configuration by taking the highest probability for each variable node, but this may not lead to the global optimal as we only consider the marginal probability of each node at a time, this may also lead to some illegal configurations. A better algorithm for finding the best configuration is call Max-sum, which works very similar to Sum-product, we can still follow the similar procedure and utilize E.q. 6 to 9, but need to change E.q. 7 to the following:

(12)

Note that when doing Max-sum, we only need to passing message from leaf to root but no need to passing messages back, while passing the message using operation, we also need to keep track of the best configurations of variables so that we can retrieve the global best configuration at the end. The best likelihood of the configuration can be computed at the root as following:

(13)

Again, this work exactly on a tree structure, with loops, it can not be guaranteed to work properly.

## Learn and infer in one graph – a semi-supervised learning approach

In some factor graphs, a part of random variables are labeled while others are not labeled. If we are given such a partially labeled factor graph, is there a better way for us to learn the parameters and infer the unseen labels simultaneously (rather than learn the parameters on observed subgraph and apply it on the unobserved subgraph)?
Although we can do coordinate ascent, but a more straight forward way is to learn by optimize , which naturally utilize the structure of the whole graph including those unseen labels as we sum over all possible . The log-likelihood function is:

(14)

Here ,  are variables whose labels are known to us (i.e. labels are fixed), wheres  are the ones whose labels are unknown to us.

The semi-supervised learning of parameters can follow exact the same gradient based approach as supervised learning introduced above, the derivative for parameter is

(15)

Once again, we got to this gradient with the form of two expectation terms, the first expectation of factors is under the model distribution with labeled variables fixed, the second expectation of factors is under the model distribution with none labels fixed. It is still very intuitive in this semi-supervised setting. For the calculation of terms in gradient, it follows the similar procedure with the above supervised setting.

## Appendix

Derivation of gradient from normalization term from E.q. 4. It may not be that obvious see how the second term in E.q. 5 (corresponds to normalization term in objective function in E.q. 4) been derived. So here I attach a derivation of gradient for term :

(16)

Derivation for E.q. 15 from E.q. 14 also follows the similar derivation.

Calculation/approximation of likelihood E.q. 4. It is difficult to directly compute the likelihood function as it sum over all states of all variables in the graph, so approximation needs to be adopted for a more efficient calculation. The detail will be posted in Part II of the note in the future.

Piecewise & pseudo-likelihood and more. Since The difficulty in computing global normalization, there are some ways to approximate the original likelihood 4, two effective ways are Piecewise training and pseudo-likelihood [3]. Detail about his part will also be included in Part II of the note in the future.

## References

[1] C. Sutton and A. McCallum, “An introduction to conditional random fields,” Arxiv preprint arxiv:1011.4088, 2010.
[Bibtex]
@article{sutton2010introduction,
title={An introduction to conditional random fields},
author={Sutton, Charles and McCallum, Andrew},
journal={arXiv preprint arXiv:1011.4088},
year={2010}
}
[2] K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: an empirical study,” in Proceedings of the fifteenth conference on uncertainty in artificial intelligence, 1999, pp. 467-475.
[Bibtex]
@inproceedings{murphy1999loopy,
title={Loopy belief propagation for approximate inference: An empirical study},
author={Murphy, Kevin P and Weiss, Yair and Jordan, Michael I},
booktitle={Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence},
pages={467--475},
year={1999},
organization={Morgan Kaufmann Publishers Inc.}
}
[3] C. Sutton and A. McCallum, “Piecewise pseudolikelihood for efficient training of conditional random fields,” in Proceedings of the 24th international conference on machine learning, 2007, pp. 863-870.
[Bibtex]
@inproceedings{sutton2007piecewise,
title={Piecewise pseudolikelihood for efficient training of conditional random fields},
author={Sutton, Charles and McCallum, Andrew},
booktitle={Proceedings of the 24th international conference on Machine learning},
pages={863--870},
year={2007},
organization={ACM}
}