1. Introduction

Consider the following statistical problem setting: we are given $X$ and $Z$, where $X$ is a set of observations and $Z$ is a set of latent random variables. The goal is to compute the posterior distribution of $Z$ given $X$, i.e. $P(Z|X)$. This is a very general setting, because $Z$ can be anything that we have not observed: naturally arising latent variables as well as parameters that are treated as random variables.

2. Mean-field Approximation

2.1. Evidence Lower Bound

In many complex models, the posterior $P(Z|X)$ is intractable (but the joint $P(Z,X)$ is tractable), which means we cannot compute the posterior directly from the data. So the idea is to find an approximating distribution over $Z$, i.e. $Q(Z)$. Under the mean-field approximation, there are two ways to arrive at the objective used to fit $Q(Z)$.

KL-divergence. We want to quantify the quality of the approximation using the KL-divergence between the true distribution and the proposed distribution, that is

$$KL(Q(Z)\,\|\,P(Z|X)) = \sum_Z Q(Z)\log\frac{Q(Z)}{P(Z|X)} \tag{1}$$

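To make the approximation accurate, we want to minimize the KL-divergence over $Q$. However, once again we encounter the intractable distribution $P(Z|X)$ inside the KL-divergence. To solve this problem, we expand Eq. 1 using $P(Z|X) = P(Z,X)/P(X)$:

$$KL(Q\,\|\,P) = \sum_Z Q(Z)\log\frac{Q(Z)}{P(Z,X)} + \log P(X)$$

Here we abbreviate the distributions $Q(Z)$ and $P(Z|X)$ as $Q$ and $P$. Since $\log P(X)$ does not depend on how we choose $Q$, minimizing the KL-divergence is equivalent to maximizing the evidence lower bound (this name will make more sense when we derive it the other way below), i.e. $\mathcal{L}(Q)$:

$$\mathcal{L}(Q) = \sum_Z Q(Z)\log P(Z,X) - \sum_Z Q(Z)\log Q(Z) = E_Q[\log P(Z,X)] + H(Q) \tag{2}$$

Here $E_Q[\log P(Z,X)]$ is the expected log joint under $Q$ (a negative energy term), $H(Q)$ is the entropy of $Q$, and $-\mathcal{L}(Q)$ is what is usually called the variational free energy. Thus minimizing the approximation error amounts to maximizing the lower bound over $Q$:

$$Q^* = \arg\max_Q \mathcal{L}(Q)$$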

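Jensen Inequality. Another way of deriving the approximating distribution $Q$ is to lower-bound the data log-likelihood:

$$\log P(X) = \log\sum_Z P(Z,X) = \log\sum_Z Q(Z)\frac{P(Z,X)}{Q(Z)} \ge \sum_Z Q(Z)\log\frac{P(Z,X)}{Q(Z)} = \mathcal{L}(Q)$$

The inequality is Jensen's inequality applied to the concave logarithm. It is also easy to see that the difference between the data log-likelihood $\log P(X)$ and the lower bound $\mathcal{L}(Q)$ is the KL-divergence between $Q$ and $P$.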
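The identity $\log P(X) = \mathcal{L}(Q) + KL(Q\,\|\,P)$ can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the joint probabilities and the choice of $Q$ are made-up numbers, and the names `joint`, `elbo`, and `kl` are hypothetical helpers, not part of any library.

```python
import numpy as np

# Hypothetical toy model: one latent Z with 3 states and a fixed observation X.
# joint[z] = P(Z=z, X=x_obs); the numbers are made up for illustration.
joint = np.array([0.10, 0.25, 0.05])

log_evidence = np.log(joint.sum())   # log P(X)
posterior = joint / joint.sum()      # P(Z | X)

def elbo(q):
    """L(Q) = sum_Z Q(Z) log P(Z, X) - sum_Z Q(Z) log Q(Z)."""
    return float(np.sum(q * np.log(joint)) - np.sum(q * np.log(q)))

def kl(q, p):
    """KL(Q || P) = sum_Z Q(Z) log(Q(Z) / P(Z))."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.3, 0.2])        # an arbitrary approximation Q(Z)

gap = log_evidence - elbo(q)
print(gap, kl(q, posterior))         # the two numbers agree up to rounding
```

Setting `q = posterior` closes the gap, since the KL-divergence is then zero and the bound is tight.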

2.2. Deriving $Q(Z)$

Now that we have an evidence lower bound containing a tractable distribution $P(Z,X)$ (good) and an unknown distribution over all latent variables $Q(Z)$ (not so good), we still need a way to specify $Q(Z)$. Under mean-field variational Bayes, we make one assumption: the latent variables can be partitioned into several independent sets $Z_1, \dots, Z_M$ (specified by the user), so that $Q$ factorizes, i.e.,

$$Q(Z) = \prod_{i=1}^{M} Q_i(Z_i) \tag{3}$$

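Plugging Eq. 3 into the lower bound in Eq. 2, we can derive the optimal $Q_i(Z_i)$ with the other factors held fixed (expressed in logarithm):

$$\log Q_i^*(Z_i) = E_{-i}[\log P(Z,X)] + \text{const}$$

or

$$Q_i^*(Z_i) \propto \exp\{E_{-i}[\log P(Z,X)]\}$$

where

$$E_{-i}[\log P(Z,X)] = \sum_{Z_j,\, j\neq i}\Big(\prod_{j\neq i} Q_j(Z_j)\Big)\log P(Z,X)$$

The optimal $Q_i(Z_i)$ can be derived from here, although the expectation may be difficult to work with for some models (for most models it is not).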
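To see where this update comes from, one can isolate the terms of the lower bound that depend on a single factor. The sketch below assumes the factorization is written as $Q(Z) = \prod_j Q_j(Z_j)$ and uses $E_{-i}$ for the expectation under all factors except $Q_i$. The symbol $\tilde{P}_i$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}(Q)
 &= \sum_Z \prod_j Q_j(Z_j)\Big[\log P(Z,X) - \sum_k \log Q_k(Z_k)\Big] \\
 &= \sum_{Z_i} Q_i(Z_i)\, E_{-i}[\log P(Z,X)]
    - \sum_{Z_i} Q_i(Z_i)\log Q_i(Z_i) + \text{const} \\
 &= -KL\big(Q_i \,\|\, \tilde{P}_i\big) + \text{const},
 \qquad \tilde{P}_i(Z_i) \propto \exp\{E_{-i}[\log P(Z,X)]\}
\end{aligned}
```

The constants collect everything that does not depend on $Q_i$ (the entropies of the other factors and the normalizer of $\tilde{P}_i$). Since the KL-divergence is non-negative and zero only when its arguments coincide, the bound is maximized over $Q_i$ by taking $Q_i = \tilde{P}_i$.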

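Once the form of the optimal $Q_i(Z_i)$ is found for all $i$, we update the factors in turn until convergence. Convergence is guaranteed because each update maximizes the ELBO over one factor (the ELBO is concave in each factor with the others held fixed), so the ELBO never decreases and is bounded above by $\log P(X)$. Note that the point of convergence is only a local optimum.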
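The alternating updates can be sketched on a model small enough to enumerate. This is a minimal illustration, assuming two binary latent variables with a made-up joint table. It applies the rule $\log Q_i^*(Z_i) = E_{-i}[\log P(Z,X)] + \text{const}$ to each factor in turn, and the helper names (`elbo`, `normalize_exp`) are hypothetical.

```python
import numpy as np

# Hypothetical toy model: two binary latents Z1, Z2 and a fixed observation X.
# joint[a, b] = P(Z1=a, Z2=b, X=x_obs); the values are made up.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.15]])
log_joint = np.log(joint)

def elbo(q1, q2):
    """ELBO for the factorized Q(Z) = Q1(Z1) Q2(Z2)."""
    q = np.outer(q1, q2)
    return float(np.sum(q * log_joint) - np.sum(q * np.log(q)))

def normalize_exp(log_unnorm):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    w = np.exp(log_unnorm - log_unnorm.max())
    return w / w.sum()

q1 = np.array([0.5, 0.5])   # arbitrary initial factors
q2 = np.array([0.6, 0.4])
history = [elbo(q1, q2)]

for _ in range(50):
    # log Q1*(Z1) = E_{Q2}[log P(Z, X)] + const
    q1 = normalize_exp(log_joint @ q2)
    # log Q2*(Z2) = E_{Q1}[log P(Z, X)] + const
    q2 = normalize_exp(q1 @ log_joint)
    history.append(elbo(q1, q2))
    if history[-1] - history[-2] < 1e-12:   # stop once the ELBO stalls
        break

print(q1, q2, history[-1], np.log(joint.sum()))
```

The recorded `history` never decreases and stays below $\log P(X)$. Because the true posterior of this toy model does not factorize, the final ELBO remains strictly below $\log P(X)$.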

2.2.1. $Q_i(Z_i)$ for Exponential Family Conditionals

If $P(Z_i | Z_{-i}, X)$ is in the exponential family, then the optimal $Q_i(Z_i)$ is in the same family of distributions as $P(Z_i | Z_{-i}, X)$. This provides a shortcut for mean-field variational inference on graphical models whose conditionals are in exponential families (which is common for many graphical models): using this result, we can simply write down the optimal variational distributions with their undetermined variational parameters, and then set the derivative of the ELBO with respect to those parameters to zero (or use gradient ascent), which yields a coordinate ascent learning procedure. See David Blei’s note.
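A short derivation makes the claim concrete. It assumes the conditional has the standard exponential-family form, with base measure $h$, natural parameter $\eta$, sufficient statistic $t$, and log-normalizer $A$. This notation is an assumption introduced here, with $E_{-i}$ denoting expectation under all factors other than $Q_i$.

```latex
\begin{aligned}
P(Z_i \mid Z_{-i}, X)
 &= h(Z_i)\exp\{\eta(Z_{-i},X)^\top t(Z_i) - A(\eta(Z_{-i},X))\} \\
\log Q_i^*(Z_i)
 &= E_{-i}[\log P(Z_i \mid Z_{-i}, X)] + \text{const} \\
 &= \log h(Z_i) + E_{-i}[\eta(Z_{-i},X)]^\top t(Z_i) + \text{const} \\
Q_i^*(Z_i)
 &\propto h(Z_i)\exp\{E_{-i}[\eta(Z_{-i},X)]^\top t(Z_i)\}
\end{aligned}
```

The second line uses $\log P(Z,X) = \log P(Z_i \mid Z_{-i}, X) + \log P(Z_{-i}, X)$, where the last term does not depend on $Z_i$. The optimal factor is therefore in the same family, with natural parameter $E_{-i}[\eta(Z_{-i},X)]$, which is the undetermined variational parameter to solve for.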

3. Expectation Propagation – A glance

Like the mean-field approximation, EP also tries to find an approximating distribution by minimizing a KL-divergence. However, unlike mean-field, which minimizes the KL-divergence indirectly by maximizing the ELBO, EP tries to minimize the KL-divergence directly. This may be difficult for arbitrary distributions, but is practical for some, such as exponential families. The details may be covered in future posts.