Notes on Factor Graph Model (Part I)

Graphical model and factorization

Originally I wanted to write more background and illustrate it with real-life examples, but I found that process too time- and word-consuming, so I will assume readers have some related background, write down only the important technical parts in this post, and include some references for more details.

A graphical model is a model built on a graph whose nodes are random variables and whose edges encode their dependencies. A factor graph is FG=\{Y, F, E\}, where Y is a set of random variables, F is a set of factors connecting random variables, and E is the set of edges between Y and F. For illustration, below is a diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs from [1] (which is a good tutorial on this topic). The lower row of the diagram contains the targets of this post: undirected graphical models expressed as factor graphs. It is worth mentioning that in the figure the features X are also drawn as nodes; in the notation of this post, however, features are encoded into the factors (so each node has a single connection to a factor that associates the configuration of this node with the features for this node), and nodes represent only random variables.

[Figure: relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs, adapted from [1]]

The probability of the random variables in the factor graph, P(Y|G, X) (here I use G to denote the factor graph structure and X for the features associated with the factor nodes), can be factorized as follows:

(1)   \begin{equation*} P(Y|G, X;\bm\theta) = \frac{1}{Z(G, X;\bm\theta)} \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \Psi_{cp} (\bm{y}_c, \bm{x}_c; \bm\theta_p) \end{equation*}

Where \bm{C}=\{C_1, C_2, \cdots, C_p\} is a set of clique templates; within each clique template all cliques share the same potential function \Psi_{cp}. A clique is formed by a set of random variables connected by a factor; it can be a single node, a pair of nodes, or a larger set of nodes. \Psi_{cp} is a potential function of the configuration \bm{y}_c of the clique, the features \bm{x}_c of the clique, and the parameters \bm\theta_p of clique template C_p; it can be thought of as expressing how likely a clique with a specific configuration and specific features is. Z(G, X;\bm\theta) = \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \Psi_{cp} (\bm{y}_c, \bm{x}_c; \bm\theta_p) is a global normalization term that makes the whole thing a probability.
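To make Eq. 1 concrete, here is a minimal Python sketch (the graph and the potential values are hypothetical, chosen only for illustration) that enumerates all configurations of a tiny factor graph with three binary variables and two pairwise factors, and computes the probability of a configuration by brute force, including the global normalization Z:

import itertools

# A tiny hypothetical factor graph: three binary variables y0, y1, y2
# and two pairwise factors (cliques) {y0, y1} and {y1, y2}.
# Each factor is a table Psi(y_a, y_b) of non-negative potentials
# (features are folded into the potential tables here).
psi_01 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi_12 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}

def unnormalized_score(y):
    # Product of clique potentials for a full configuration y = (y0, y1, y2), as in Eq. 1.
    return psi_01[(y[0], y[1])] * psi_12[(y[1], y[2])]

# Global normalization Z: sum of unnormalized scores over all configurations.
Z = sum(unnormalized_score(y) for y in itertools.product([0, 1], repeat=3))

def prob(y):
    # P(Y = y | G) from Eq. 1.
    return unnormalized_score(y) / Z

print(prob((0, 0, 1)))                                             # one configuration
print(sum(prob(y) for y in itertools.product([0, 1], repeat=3)))   # sanity check: 1.0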

Instantiation by Exponential Family

In Eq. 1 we have not yet specified the potential function \Psi_{cp} defined on each clique instance. The most common practice is to instantiate it with the exponential family (similar to a log-linear model):

(2)   \begin{equation*} \Psi_{cp}(\bm{y}_c, \bm{x}_c; \bm\theta_p) = \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \end{equation*}

Here K(p) is the number of parameters or features in clique template C_p, and f_{pk} (\bm{y}_c, \bm{x}_c) is the feature function (or factor function), which maps a set of features \bm{x}_c and labels \bm{y}_c to a real or binary value. You can define the factor functions yourself according to domain knowledge, but one requirement has to be met: the factor functions should be unique for each label \bm{y}_c, i.e. f_{pk}(\bm{y}_c, \bm{x}_c) = f_{pk}(\bm{y}'_c, \bm{x}_c) for all k if and only if \bm{y}_c = \bm{y}'_c.

To be more specific, let us consider two commonly used types of features: node attribute features for a single random variable, and label correlation features among a set of variables. A node attribute feature would originally be a k-dimensional constant vector for a node, identical across all configurations; to meet the above requirement, one instead forms an N_l * k dimensional feature vector (N_l is the number of possible labels). For each label this vector is sparse, with only the k dimensions associated with that label set to the original feature values; since the non-zero dimensions of different labels are disjoint, the feature vector is unique w.r.t. the label. A label correlation feature can only be defined on cliques that involve more than one node. For example, suppose each node can take two states, 0/1; then an N_l * 2 dimensional feature vector can be defined over a pair of nodes, where for each label the feature vector is non-zero only on the 2 associated dimensions (one for the two nodes taking the same state, the other for them taking different states). Again the feature vector is unique w.r.t. the label, as the non-zero parts for different labels are disjoint. (Regarding these two types of features, an interesting fact is that if we only consider the node attribute features, the factor graph model degenerates to logistic regression.)
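As a concrete illustration (the dimensions and parameters below are made up, and the pairwise construction is one reasonable reading of the correlation feature described above), here is how the two feature types and the exponential-family potential of Eq. 2 could look in Python:

import numpy as np

N_LABELS = 3   # hypothetical number of possible labels N_l for a node
K_ATTR = 4     # dimension k of the raw attribute vector of a node

def node_attribute_features(y, x):
    # Expand a raw k-dim attribute vector x into an (N_l * k)-dim vector that is
    # non-zero only in the block belonging to label y, so that the feature vector
    # is unique w.r.t. the label (the requirement discussed above).
    f = np.zeros(N_LABELS * K_ATTR)
    f[y * K_ATTR:(y + 1) * K_ATTR] = x
    return f

def pair_correlation_features(y_i, y_j, n_states=2):
    # One reading of the pairwise correlation feature above (binary states):
    # an (n_states * 2)-dim vector with a single 1 in the block of y_i's label,
    # at offset 0 if the two nodes take the same state and 1 if they differ.
    f = np.zeros(n_states * 2)
    f[y_i * 2 + (0 if y_i == y_j else 1)] = 1.0
    return f

theta_attr = np.random.randn(N_LABELS * K_ATTR)   # theta_p for the node template

def node_potential(y, x):
    # Psi_cp(y, x; theta) = exp(theta . f(y, x)), the exponential-family form of Eq. 2.
    return np.exp(theta_attr @ node_attribute_features(y, x))

x = np.array([0.3, -1.2, 0.7, 2.0])
print([node_potential(y, x) for y in range(N_LABELS)])   # one potential per label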

With the above instantiation, the normalization constant Z(G, X; \bm\theta) can also be written explicitly:

(3)   \begin{equation*} Z(G, X;\bm\theta) = \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \end{equation*}

To model a real-world problem this instantiation is still not enough: the structure of the graph depends on the problem, and the features also need to be specified by the model designer. Once a factor graph model is specified, two main problems remain: the parameter learning problem and the inference problem, which are covered in the following two sections.

Parameter learning

The goal of parameter learning is to learn a set of parameters \bm\theta_{p} that best explains the observed data under the factor graph model (suppose we are given all labels Y during training). This is a maximum likelihood estimation of the parameters, with the objective function:

(4)   \begin{equation*} \begin{split} &\underset{\bm\theta}{arg\ max} \log P(Y|G,X;\bm\theta) \\ =&\underset{\bm\theta}{arg\ max} \{\sum_{C_p \in \bm{C}} \sum_{c \in C_p} \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ - \log \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \} \\ \end{split} \end{equation*}

A straightforward method is a gradient-based approach, such as gradient ascent or the Newton-Raphson method, all of which rely on the gradients of the parameters. So let us take a look at the derivative of the objective function \log P(Y|G,X;\bm\theta) with respect to \bm\theta_{pk} (a detailed derivation can be found in the appendix of this post):

(5)   \begin{equation*} \begin{split} \nabla_{\bm\theta_{pk}} &= \sum_{c \in C_p} f_{pk} (\bm{y}_c, \bm{x}_c) - \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, \bm{x}_c) f_{pk} (\bm{y}_c, \bm{x}_c) \\ &= \mathbb{E}_{\widetilde{Y}}[f_{pk}] - \mathbb{E}_{P(Y|G, X;\bm\theta)} [f_{pk}]\\ \end{split} \end{equation*}

Where \mathbb{E}_{\widetilde{Y}}[f_{pk}] is the expectation of the factor function (for a specific clique template p and feature k) under the empirical distribution of Y, i.e. with the labels given, and \mathbb{E}_{P(Y|G,X;\bm\theta)} [f_{pk}] is its expectation under the model distribution of Y, i.e. with label distributions predicted by the current model. This is fairly intuitive: when we learn the parameters, we want these two terms to be close so that the model distribution fits the empirical distribution well.

However, we still have one problem left: how do we calculate the derivatives \nabla_{\bm\theta}, which contain the marginal probabilities of \bm{y}_c? Here is one solution: first obtain the marginal probability of each random variable in the graph, p(y|G, X), using inference techniques (introduced in the next section), and then calculate p(\bm{y}_c|G, X) using the chain rule for each clique c that consists of more than one node. In this solution the computational complexity is constrained by the size of the clique and the way we specify the factor functions: if we use factor functions that depend on the joint assignment of labels, then the correlation feature space is exponential in the size of the clique according to the chain rule (this is also why factorizing Y into \bm{y}_c in the derivative of the normalization term is important).
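To show the shape of the learning procedure, below is a minimal, hypothetical sketch of gradient ascent with the gradient of Eq. 5 for the degenerate case mentioned earlier (node attribute features only, so the model reduces to logistic regression and the clique marginals have a closed form). In a general graph, node_marginals would instead be produced by the inference step of the next section; the data, learning rate, and helper names here are invented for illustration:

import numpy as np

N_LABELS = 3
K_ATTR = 4

def node_features(y, x):
    # Per-label block feature vector, as in the sketch above.
    f = np.zeros(N_LABELS * K_ATTR)
    f[y * K_ATTR:(y + 1) * K_ATTR] = x
    return f

def node_marginals(theta, x):
    # Marginal p(y | x; theta) for an isolated node clique.  With node attribute
    # features only, this has a closed form (softmax); in a general graph it
    # would come from the inference step.
    scores = np.array([theta @ node_features(y, x) for y in range(N_LABELS)])
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def gradient(theta, data):
    # Eq. 5: empirical feature counts minus expected feature counts under the model.
    g = np.zeros_like(theta)
    for y_obs, x in data:
        g += node_features(y_obs, x)                    # E_empirical[f]
        p = node_marginals(theta, x)
        for y in range(N_LABELS):
            g -= p[y] * node_features(y, x)             # E_model[f]
    return g

# Plain gradient ascent on the log-likelihood, with made-up toy data.
rng = np.random.default_rng(0)
data = [(int(rng.integers(N_LABELS)), rng.normal(size=K_ATTR)) for _ in range(50)]
theta = np.zeros(N_LABELS * K_ATTR)
for _ in range(200):
    theta += 0.05 * gradient(theta, data)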

Inference

Inference has two meanings here: one is to obtain the marginal probabilities of the random variables, e.g. p(y|G, X); the other is to infer the best configuration of the graph, i.e. to find Y such that P(Y|G, X) is maximized. Here I introduce two algorithms I have used: the sum-product algorithm for obtaining marginal probabilities, and the max-sum algorithm for the best-configuration problem.

Sum-product is also known as belief propagation, a name which is more descriptive, as the algorithm runs by propagating beliefs through the network. So what is a belief? A belief is the estimated marginal probability of a random variable node. Although it is called belief propagation, what is really passed around the factor graph are messages: partial and unnormalized beliefs. There are two types of messages: a message from a variable node u to a factor node v tells v about u's states based on the messages from u's other factors; a message from a factor node v to a variable node u tells u about its states based on the messages from v's other variables.

To be more specific, the message from a variable x to a neighboring factor f is computed as follows:

(6)   \begin{equation*} \mu_{x \to f} (x) = \prod_{f' \in ne(x) \backslash f} \mu_{f' \to x} (x) \end{equation*}

Here x and f in the subscript denote a random variable node and a factor node, and the x in parentheses indicates a particular state of the random variable x; the message value is a real-valued scalar indicating the likelihood of that state, and a message vector m can be formed by concatenating the scalars for all possible states of x. Note that a message may not sum to one, so you should normalize it when it is gathered.

A message from a factor f to a neighboring variable x is computed as follows:

(7)   \begin{equation*} \mu_{f \to x} (x) = \sum_{x_1} \cdots \sum_{x_N} \Psi_f(x, x_1, \cdots, x_N) \prod_{n = 1}^N \mu_{x_n \to f} (x_n) \end{equation*}

Here x_1, \cdots, x_N are the neighbor variables of f other than x.

There are two types of special nodes: leaf nodes, which have only one neighbor, and observed variables, whose states are fixed. For a random variable x that is observed with evidence x_0, the message to a neighboring factor f takes the following form:

(8)   \begin{equation*} \mu_{x \to f} (x) = \begin{cases} 1, & x = x_0 \\ 0, & \text{otherwise} \\ \end{cases} \end{equation*}

For a leaf factor f, the message to its neighbor variable x should take the following form:

(9)   \begin{equation*} \mu_{f \to x} (x) = \Psi_f (x) \end{equation*}
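In an implementation, one common way (a sketch, not the only option) to realize the indicator message of Eq. 8 is to clamp the evidence before message passing by zeroing out every state of the observed variable's unary potential except the observed one; up to normalization of the messages this has the same effect as sending the indicator message:

import numpy as np

def clamp_evidence(unary_potential, observed_state):
    # Keep only the observed state of this variable's unary potential; after
    # message normalization this is equivalent to the indicator message of Eq. 8.
    clamped = np.zeros_like(unary_potential)
    clamped[observed_state] = unary_potential[observed_state]
    return clamped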

Once the messages are computed, we can calculate the beliefs, or marginal probabilities, of the random variables. The marginal probability of a variable x is proportional to the product of all messages from its neighbors:

(10)   \begin{equation*} p(x) \propto \prod_{f \in ne(x)} \mu_{f \to x} (x) \end{equation*}

The marginal probability for a factor f is:

(11)   \begin{equation*} p(x_1, \cdots, x_n) \propto \Psi_f (x_1, \cdots, x_n) \mu_{x_1 \to f} (x_1) \cdots \mu_{x_n \to f} (x_n) \end{equation*}

Here x_1, \cdots, x_n are neighbor variables of f.
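As a small sketch of Eq. 11 (assuming a pairwise factor over binary variables; the names are invented here), the factor belief is the potential table times the two incoming variable-to-factor messages, normalized; these clique marginals are exactly the p(\bm{y}_c|G, X) needed for the gradient of Eq. 5:

import numpy as np

def pairwise_clique_marginal(psi, mu_a, mu_b):
    # psi: 2x2 potential table Psi_f(x_a, x_b); mu_a, mu_b: incoming
    # variable-to-factor messages from the factor's two neighbor variables.
    belief = psi * np.outer(mu_a, mu_b)
    return belief / belief.sum()   # normalize so the clique marginal sums to one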

To sum up, the algorithm works like this: first form a tree over the graph (by, e.g., BFS), then pass messages from the leaves to the root and from the root back to the leaves according to Eqs. 6 to 9; afterwards you can calculate the marginal probabilities of all variable nodes with Eq. 10. It is worth noting that if the graph is a tree, the marginal probabilities are exact. If it is not, we can still proceed in the same or a similar way (with more iterations of "leaf-to-root and root-to-leaf"); this is called loopy belief propagation, which is not guaranteed to converge but works well empirically [2].
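Putting Eqs. 6 to 10 together, here is a minimal, self-contained Python sketch of sum-product on a chain-shaped factor graph (a tree), with made-up unary and pairwise potentials; the unary factors act as the leaf factors of Eq. 9, and messages are normalized as they are gathered:

import numpy as np

# A tiny chain-shaped factor graph over three binary variables:
#   y0 -- f01 -- y1 -- f12 -- y2, plus unary (leaf) factors g0, g1, g2.
# The potential values are arbitrary positive numbers chosen for illustration.
unary = [np.array([1.0, 2.0]), np.array([1.5, 0.5]), np.array([1.0, 1.0])]
pairwise = [np.array([[2.0, 0.5], [0.5, 2.0]]),      # f01, indexed [y0, y1]
            np.array([[1.0, 3.0], [3.0, 1.0]])]      # f12, indexed [y1, y2]

def chain_marginals(unary, pairwise):
    # Sum-product on a chain: forward and backward factor-to-variable messages
    # (Eq. 7), then node marginals as the normalized product of the incoming
    # messages and the unary leaf-factor message (Eqs. 9 and 10).
    n = len(unary)
    fwd = [np.ones(2) for _ in range(n)]   # message arriving at y_i from the left
    bwd = [np.ones(2) for _ in range(n)]   # message arriving at y_i from the right
    for i in range(1, n):
        # Variable-to-factor message (Eq. 6) = unary message times message from the
        # left; the factor-to-variable message (Eq. 7) then sums out the left variable.
        fwd[i] = pairwise[i - 1].T @ (unary[i - 1] * fwd[i - 1])
        fwd[i] /= fwd[i].sum()             # messages need not sum to one; normalize
    for i in range(n - 2, -1, -1):
        bwd[i] = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    marginals = []
    for i in range(n):
        b = unary[i] * fwd[i] * bwd[i]     # belief: product of all incoming messages
        marginals.append(b / b.sum())
    return marginals

print(chain_marginals(unary, pairwise))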

The sum-product algorithm can be used to find a good configuration by taking the highest-probability state for each variable node, but this may not lead to the global optimum, since we only consider the marginal probability of one node at a time, and it may even produce illegal configurations. A better algorithm for finding the best configuration is called max-sum, which works very similarly to sum-product: we follow the same procedure and use Eqs. 6 to 9, but replace Eq. 7 with the following (shown here in product form, often called max-product; taking logarithms turns the products into sums, hence the name max-sum):

(12)   \begin{equation*} \mu_{f \to x} (x) = \max_{x_1} \cdots \max_{x_N} \Psi_f(x, x_1, \cdots, x_N) \prod_{n = 1}^N \mu_{x_n \to f} (x_n) \end{equation*}

Note that in max-sum we only need to pass messages from the leaves to the root; there is no need to pass messages back. While passing messages with the \max operation, we also need to keep track of the maximizing configurations of the variables so that we can retrieve the globally best configuration at the end. The best likelihood of the configuration can be computed at the root as follows:

(13)   \begin{equation*} p^{\max} = \max_{x} (\prod_{f \in ne(x)} \mu_{f \to x }(x) ) \end{equation*}

Again, this works exactly on a tree structure; with loops, it is not guaranteed to work properly.
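For comparison, here is the corresponding max-product sketch on the same hypothetical chain as above: Eq. 12 replaces the sum in Eq. 7 with a max, backpointers record the maximizing states during the single leaf-to-root pass, and the best configuration is read off by backtracking from the root (the returned score is the unnormalized counterpart of Eq. 13):

import numpy as np

def chain_map(unary, pairwise):
    # Max-product on a chain with backpointers for recovering the best configuration.
    n = len(unary)
    msg = [np.ones(2) for _ in range(n)]        # max-message arriving from the left
    back = [np.zeros(2, dtype=int) for _ in range(n)]
    for i in range(1, n):
        # candidates[y_prev, y_i] = Psi(y_prev, y_i) * mu_{y_prev -> f}(y_prev)
        candidates = pairwise[i - 1] * (unary[i - 1] * msg[i - 1])[:, None]
        msg[i] = candidates.max(axis=0)         # Eq. 12: max instead of sum
        back[i] = candidates.argmax(axis=0)     # best y_{i-1} for each y_i
    root_scores = unary[-1] * msg[-1]           # Eq. 13 at the root (unnormalized)
    y = [0] * n
    y[-1] = int(root_scores.argmax())
    for i in range(n - 1, 0, -1):               # follow backpointers leaf-ward
        y[i - 1] = int(back[i][y[i]])
    return y, float(root_scores.max())

unary = [np.array([1.0, 2.0]), np.array([1.5, 0.5]), np.array([1.0, 1.0])]
pairwise = [np.array([[2.0, 0.5], [0.5, 2.0]]), np.array([[1.0, 3.0], [3.0, 1.0]])]
print(chain_map(unary, pairwise))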

Learn and infer in one graph – a semi-supervised learning approach

In some factor graphs, part of the random variables, Y^L, are labeled while the others, Y^U, are not. Given such a partially labeled factor graph, is there a better way to learn the parameters and infer the unseen labels simultaneously (rather than learning the parameters on the observed subgraph and applying them to the unobserved subgraph)?
Although we could do coordinate ascent, a more straightforward way is to learn by optimizing P(Y^L|G), which naturally utilizes the structure of the whole graph, including the unseen labels Y^U, as we sum over all possible Y^U. The log-likelihood function is:

(14)   \begin{equation*} \begin{split} &\log p(Y^L|G, X; \bm{\theta}) \\ =& \log \sum_{Y^U} P(Y^L, Y^U|G, X; \bm\theta) \\ =& \log \sum_{Y^U} \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp\{ \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \} - \log Z(G, X;\bm\theta) \\ \end{split} \end{equation*}

Here \bm{y}_c = \{ \bm{y}^l_c, \bm{y}^u_c \}, where \bm{y}^l_c are the variables whose labels are known to us (i.e. their labels are fixed), whereas \bm{y}^u_c are the ones whose labels are unknown.

Semi-supervised learning of the parameters can follow exactly the same gradient-based approach as the supervised learning introduced above; the derivative for parameter \bm\theta_{pk} is:

(15)   \begin{equation*} \begin{split} \nabla_{\bm\theta_{pk}} &= \sum_{c \in C_p} \sum_{\bm{y}^u_c} p(\bm{y}^u_c|Y^L, G, X) f_{pk} (\bm{y}^l_c, \bm{y}^u_c, \bm{x}_c) - \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, X) f_{pk}(\bm{y}_c, \bm{x}_c) \\ &= \mathbb{E}_{P(Y^U|Y^L, G, X;\bm\theta)}[f_{pk}] - \mathbb{E}_{P(Y|G, X;\bm\theta)} [f_{pk}]\\ \end{split} \end{equation*}

Once again, the gradient takes the form of two expectation terms: the first is the expectation of the factor functions under the model distribution with the labeled variables fixed, the second under the model distribution with no labels fixed. This is still very intuitive in the semi-supervised setting. The terms in the gradient are calculated with a procedure similar to the supervised setting above.
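Finally, a short sketch of the shape of Eq. 15 (the callables, the clique encoding, and all names here are assumptions, not part of the post): both expectations are computed from clique marginals produced by an inference routine, run once with the labeled variables clamped as evidence and once with nothing fixed:

import numpy as np

def semi_supervised_gradient(theta, cliques, feature_fn, n_configs,
                             clamped_marginal, free_marginal):
    # Eq. 15 for one clique template: expected feature counts with Y^L clamped to
    # its observed values minus expected feature counts with no labels fixed.
    # clamped_marginal(c, y_c) and free_marginal(c, y_c) are assumed to come from
    # an inference routine such as the belief propagation sketched above.
    grad = np.zeros_like(theta)
    for c, x_c in cliques:               # c identifies a clique instance, x_c its features
        for y_c in range(n_configs):     # enumerate clique configurations y_c
            f = feature_fn(y_c, x_c)
            grad += clamped_marginal(c, y_c) * f   # E_{P(Y^U | Y^L, G, X)}[f]
            grad -= free_marginal(c, y_c) * f      # E_{P(Y | G, X)}[f]
    return grad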

Appendix

Derivation of the gradient of the normalization term in Eq. 4. It may not be obvious how the second term in Eq. 5 (which corresponds to the normalization term in the objective function of Eq. 4) is derived, so here is a derivation of the gradient of the term \log Z(G, X;\bm\theta):

(16)   \begin{equation*} \begin{split} &\nabla_{\bm\theta_{pk}} \log Z(G, X;\bm\theta) \\ =& \frac{ \sum_Y\{ \prod_{C_p \in C} \prod_{c \in C_p} \exp(\sum_k \bm\theta_{pk} f_{pk}(\bm{y}_c, \bm{x}_c)) \sum_{c \in C_p} f_{pk} (\bm{y}_c, \bm{x}_c)\} }{ \sum_Y \prod_{C_p \in C} \prod_{c \in C_p} \exp(\sum_k \bm\theta_{pk} f_{pk}(\bm{y}_c, \bm{x}_c)) } \\ =& \sum_Y P(Y|G, X;\bm\theta) \sum_{c \in C_p} f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_Y \sum_{c \in C_p} P(Y|G, X;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_{c \in C_p} \sum_Y P(Y|G, X;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, \bm{x}_c;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ \end{split} \end{equation*}

The derivation of Eq. 15 from Eq. 14 follows a similar procedure.

Calculation/approximation of the likelihood in Eq. 4. It is difficult to compute the likelihood function directly, as it sums over all states of all variables in the graph, so approximations need to be adopted for a more efficient calculation. The details will be posted in Part II of these notes.

Piecewise & pseudo-likelihood and more. Because of the difficulty of computing the global normalization, there are ways to approximate the original likelihood in Eq. 4; two effective ones are piecewise training and pseudo-likelihood [3]. Details about this part will also be included in Part II of these notes.

References

[1] C. Sutton and A. McCallum, "An introduction to conditional random fields," arXiv preprint arXiv:1011.4088, 2010.
[2] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: an empirical study," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 467-475.
[3] C. Sutton and A. McCallum, "Piecewise pseudolikelihood for efficient training of conditional random fields," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 863-870.