## Graphical model and factorization

*Originally I wanted to write more background and illustrate it with real-life examples, but I found that process both time- and word-consuming, so I will assume readers have some related background, write down only the important technical parts in this post, and include references for more details.*

A graphical model is a model built on a graph whose nodes are random variables and whose edges are their dependencies. A factor graph is a triple $G = (V, F, E)$, where $V$ is a set of random variable nodes, $F$ is a set of factor nodes connecting random variables, and $E$ is the set of edges between $V$ and $F$. For illustration, [1] (a good tutorial for this topic) contains a diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs. The lower row of that diagram shows the targets of this post: undirected graphical models expressed as factor graphs. It is worth mentioning that in that figure the features are also drawn as nodes; in the notation of this post, however, features are encoded into the factors (so each node has a single connection to a factor that associates the node's configuration with its features), and nodes represent only random variables.

The probability of the random variables in a factor graph (here I use $G$ to denote the factor-graph structure and $\mathbf{x}$ for the features associated with the factor nodes) can be factorized as follows:

$$p(\mathbf{y} \mid \mathbf{x}, G) = \frac{1}{Z} \prod_{t \in T} \prod_{c \in C_t} \psi_t(\mathbf{y}_c, \mathbf{x}_c; \theta_t) \qquad (1)$$

where $T$ is a set of clique templates such that within each clique template $t$ all cliques $C_t$ share the same potential function (a clique is formed by a set of random variables connected by a factor; it can be a single node, a pair of nodes, or a larger set of nodes), and $\psi_t(\mathbf{y}_c, \mathbf{x}_c; \theta_t)$ is a potential function based on the configuration $\mathbf{y}_c$ of the clique, the features $\mathbf{x}_c$ of the clique, and the parameters $\theta_t$ of clique template $t$. $\psi_t$ can be thought of as a way of expressing how likely a clique with a specific configuration and features is. $Z$ is a global normalization term that makes the whole expression a probability.

## Instantiation by Exponential Family

In Equation 1 we have not yet specified the potential function defined on each clique instance. The most common practice is to instantiate it with the exponential family (similar to a log-linear model):

$$\psi_t(\mathbf{y}_c, \mathbf{x}_c; \theta_t) = \exp\left( \sum_{k=1}^{K_t} \theta_{tk} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) \right) \qquad (2)$$

Here $K_t$ is the number of parameters (or features) in clique template $t$, and $f_{tk}$ is a feature function (or factor function) that turns a set of features and labels into a real or binary value. You can define the factor functions yourself according to domain knowledge, but one requirement must be met: the feature vector should be unique for each label configuration, i.e. $f_t(\mathbf{y}_c, \mathbf{x}_c) = f_t(\mathbf{y}'_c, \mathbf{x}_c)$ if and only if $\mathbf{y}_c = \mathbf{y}'_c$.

To be more specific, consider two commonly used types of features: node attribute features for a single random variable, and label correlation features among a set of variables. A node attribute feature would originally be a $d$-dimensional constant vector for a node, identical over all configurations; to meet the above requirement, one instead forms a $(d \times L)$-dimensional feature vector associated with the labels (where $L$ is the number of possible labels). For each label this feature vector is sparse: only the dimensions in the block corresponding to that label are non-zero and equal to the original features. Since the non-zero blocks for different labels are disjoint, the feature vector is unique w.r.t. the label. Label correlation features can only be defined on cliques that involve more than one node. For example, suppose each node can take two states, 0/1; then a 2-dimensional feature vector can be defined over a pair of nodes, where for each joint label only the associated dimension is non-zero (one dimension for the two nodes taking the same state, the other for them taking different states). Again the feature vector is unique w.r.t. the label, as the non-zero parts are disjoint. (Regarding these two types of features, an interesting fact: if one considers only the node attribute features, the factor graph model degenerates to logistic regression.)
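To make the feature construction concrete, here is a small Python sketch (the function names and toy numbers are my own, purely for illustration): a log-linear clique potential as in Eq. 2, and a node-attribute feature vector made label-dependent by placing the original attributes in the block selected by the label.

```python
import math

# Hypothetical sketch of Eq. 2: a log-linear clique potential.
# theta: weights for this clique template; feature_funcs: the f_{tk}.
def potential(theta, feature_funcs, y_c, x_c):
    return math.exp(sum(t * f(y_c, x_c) for t, f in zip(theta, feature_funcs)))

# Node-attribute features made label-dependent by one-hot blocks:
# a d-dim attribute vector becomes a (d * n_labels)-dim sparse vector
# whose only non-zero block is selected by the label, so the vector
# is unique w.r.t. the label.
def node_features(y, x, n_labels):
    d = len(x)
    vec = [0.0] * (d * n_labels)
    vec[y * d:(y + 1) * d] = x
    return vec

print(node_features(1, [0.5, 2.0], 3))  # -> [0.0, 0.0, 0.5, 2.0, 0.0, 0.0]
```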

With the above instantiation, the normalization constant can also be specified:

$$Z = \sum_{\mathbf{y}} \prod_{t \in T} \prod_{c \in C_t} \exp\left( \sum_{k=1}^{K_t} \theta_{tk} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) \right) \qquad (3)$$

To model a real-world problem, this instantiation is still not enough: both the structure of the graph and the features depend on the problem and need to be specified by the model designer. Once a factor graph model is specified, two main problems remain: the parameter learning problem and the inference problem, which are covered in the following two sections.
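Before moving on, a toy enumeration may help fix ideas (the model below is my own construction, not from the references): with two binary variables, $Z$ can be computed by brute force as in Eq. 3, and the factorized scores then normalize to a probability.

```python
import itertools, math

# Toy model: two binary variables with node potentials and one
# pairwise potential; Z sums the unnormalized score over all joint states.
def psi_node(y):          # node potential, favors state 1
    return math.exp(0.5 * y)

def psi_pair(y1, y2):     # pairwise potential, favors agreement
    return math.exp(1.0 if y1 == y2 else 0.0)

def unnormalized(y1, y2):
    return psi_node(y1) * psi_node(y2) * psi_pair(y1, y2)

states = list(itertools.product([0, 1], repeat=2))
Z = sum(unnormalized(*s) for s in states)
p = {s: unnormalized(*s) / Z for s in states}
assert abs(sum(p.values()) - 1.0) < 1e-12   # it is now a distribution
```

Here the most likely state is (1, 1), since both the node and the pairwise potentials favor it.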

## Parameter learning

The goal of parameter learning is to learn a set of parameters $\theta$ that best explains the observed data under the factor graph model (suppose we are given all labels during training). This is a maximum likelihood estimation of the parameters; the objective function is the log-likelihood:

$$\ell(\theta) = \sum_{t \in T} \sum_{c \in C_t} \sum_{k=1}^{K_t} \theta_{tk} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) - \log Z \qquad (4)$$

A straightforward method is a gradient-based approach, such as gradient ascent or the Newton-Raphson method, both of which rely on the gradients w.r.t. the parameters. So let us take a look at the derivative of the objective function (a detailed derivation can be found in the appendix of this post):

$$\frac{\partial \ell(\theta)}{\partial \theta_{tk}} = \sum_{c \in C_t} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) - \sum_{c \in C_t} \sum_{\mathbf{y}'_c} p(\mathbf{y}'_c \mid \mathbf{x}, G)\, f_{tk}(\mathbf{y}'_c, \mathbf{x}_c) \qquad (5)$$

where the first term is the expectation of the factor function (for a specific clique template $t$ and feature $k$) under the empirical distribution of $\mathbf{y}$, i.e. with the labels given, and the second term is its expectation under the model distribution of $\mathbf{y}$, i.e. the label distributions predicted by the current model. This is fairly intuitive: when we learn the parameters, we want these two terms to be close so that the model distribution fits the empirical distribution well.

However, one problem is left: how to calculate the terms containing the marginal probabilities $p(\mathbf{y}'_c \mid \mathbf{x}, G)$? One solution is to first obtain the marginal probability of each random variable in the graph using inference techniques (introduced in the next section), and then use the chain rule to compute the marginals of each clique that consists of more than one node. In this solution the computational complexity is constrained by the size of the clique and by the way we specify the factor functions: if we use factor functions that depend on the joint assignment of labels, the correlation feature space is exponential in the size of the clique (this is also why factorizing the derivative of the normalization term into per-clique marginals is important).
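As a sanity-check sketch of Eq. 5 (the toy model and names are mine), for a single pairwise "agreement" feature the gradient of the log-likelihood w.r.t. its weight is the observed feature value minus its model expectation, here computed by brute-force enumeration:

```python
import itertools, math

# One pairwise feature: do the two binary variables agree?
def f_k(y):
    return 1.0 if y[0] == y[1] else 0.0

def model_expectation(theta):
    # E_model[f_k] under p(y) defined by this single log-linear factor
    states = list(itertools.product([0, 1], repeat=2))
    weights = [math.exp(theta * f_k(s)) for s in states]
    Z = sum(weights)
    return sum(w * f_k(s) for w, s in zip(weights, states)) / Z

def gradient(observed_y, theta):
    # Eq. 5 for a single clique: empirical value minus model expectation
    return f_k(observed_y) - model_expectation(theta)

# at theta = 0 all four joint states are equally likely, so E_model = 0.5
print(gradient((1, 1), 0.0))  # -> 0.5
```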

## Inference

Inference has two meanings here: one is to obtain the marginal probabilities of random variables, e.g. $p(y_v \mid \mathbf{x}, G)$; the other is to infer the best configuration of the graph, i.e. to find $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}, G)$. Here I introduce two algorithms I have used: the sum-product algorithm for obtaining marginal probabilities, and the max-sum algorithm for the best-configuration problem.

**Sum-product** is also known as **belief propagation**, a more explanatory name, as the algorithm runs by propagating beliefs through the network. So what is a belief? A belief is the estimated marginal probability of a random variable node. However, although it is called belief propagation, what is really passed around the factor graph are messages: partial and unnormalized beliefs. There are two types of messages: a message from a variable node $v$ to a factor node $f$, describing $v$'s states based on the messages from $v$'s other neighboring factors; and a message from a factor node $f$ to a variable node $v$, describing the states of $v$ based on the messages from $f$'s other neighboring variables.

To be more specific, a message from a variable $v$ to a neighboring factor $f$ is computed as follows:

$$\mu_{v \to f}(y_v) = \prod_{f' \in N(v) \setminus \{f\}} \mu_{f' \to v}(y_v) \qquad (6)$$

Here the subscripts $v$ and $f$ indicate a random variable node and a factor node, respectively, and $\mu_{v \to f}(y_v)$ is a real-valued scalar indicating the likelihood of a particular state $y_v$ of the random variable $v$; a message vector can then be formed by concatenating the scalars for all possible states of $v$. Note that a message may not sum to one; you need to normalize it when it is gathered.

A message from a factor $f$ to a neighboring variable $v$ is computed as follows:

$$\mu_{f \to v}(y_v) = \sum_{\mathbf{y}_{N(f) \setminus \{v\}}} \psi_f(\mathbf{y}_{N(f)}, \mathbf{x}_f) \prod_{v' \in N(f) \setminus \{v\}} \mu_{v' \to f}(y_{v'}) \qquad (7)$$

Here $N(f)$ are the neighboring variables of $f$.
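The two message types can be sketched in a few lines of Python (the dict-based representation below is my own choice, not a standard API):

```python
import math

# A message is a dict state -> score; a factor is a dict mapping joint
# state tuples of its neighbor variables to potential values.

def msg_var_to_factor(states, incoming):
    # Eq. 6: product of messages from all *other* neighbor factors of v
    return {s: math.prod(m[s] for m in incoming) for s in states}

def msg_factor_to_var(factor, target_idx, states, incoming):
    # Eq. 7: sum over the other variables' joint states, weighting the
    # potential by their messages; incoming is a list of (index, message)
    out = {s: 0.0 for s in states}
    for joint, psi in factor.items():
        w = psi
        for idx, m in incoming:
            w *= m[joint[idx]]
        out[joint[target_idx]] += w
    return out

# pairwise factor favoring agreement, with a biased message from variable 0
psi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
m0 = msg_var_to_factor((0, 1), [{0: 1.0, 1: 3.0}])
print(msg_factor_to_var(psi, 1, (0, 1), [(0, m0)]))  # -> {0: 5.0, 1: 7.0}
```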

There are two special types of nodes: leaf nodes, which have only one neighbor, and observed variables, whose states are fixed. For a random variable $v$ observed with evidence $\hat{y}_v$, the message to its neighboring factor $f$ takes the following form:

$$\mu_{v \to f}(y_v) = \mathbb{1}[y_v = \hat{y}_v] \qquad (8)$$

For a leaf factor $f$, the message to its neighboring variable $v$ takes the following form:

$$\mu_{f \to v}(y_v) = \psi_f(y_v, \mathbf{x}_f) \qquad (9)$$

Once the messages are computed, we can turn to calculating the beliefs, i.e. marginal probabilities, of the random variables. The marginal probability of a variable $v$ is proportional to the product of all messages from its neighbors:

$$p(y_v \mid \mathbf{x}, G) \propto \prod_{f \in N(v)} \mu_{f \to v}(y_v) \qquad (10)$$

The marginal probability for a factor $f$ is:

$$p(\mathbf{y}_{N(f)} \mid \mathbf{x}, G) \propto \psi_f(\mathbf{y}_{N(f)}, \mathbf{x}_f) \prod_{v \in N(f)} \mu_{v \to f}(y_v) \qquad (11)$$

Here $N(f)$ are the neighboring variables of $f$.

To sum up, the algorithm works like this: first form a tree over the graph (by, e.g., BFS), then pass messages from the leaves to the root and from the root back to the leaves according to Eqs. 6 to 9; afterwards you can calculate the marginal probabilities of all variable nodes with Eq. 10. It is worth noting that if the graph is a tree, the marginal probabilities are exact. If it is not, we can still proceed in the same or a similar way (with more iterations of "leaf-to-root and root-to-leaf"); this is called **loopy belief propagation**. It is not guaranteed to converge, but it works nicely in practice [2].
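A quick numerical check of the procedure (the toy factors are mine): on a two-variable chain, one message pass plus Eq. 10 reproduces the brute-force marginal.

```python
# Toy chain: y1 -- pairwise factor -- y2, each with a node factor.
phi1 = {0: 1.0, 1: 2.0}                       # node factor on y1
phi2 = {0: 1.5, 1: 1.0}                       # node factor on y2
psi  = {(0, 0): 2.0, (0, 1): 1.0,
        (1, 0): 1.0, (1, 1): 2.0}             # pairwise factor

# message from the pairwise factor to y2, absorbing phi1 through y1 (Eq. 7)
m_pair_to_y2 = {y2: sum(psi[(y1, y2)] * phi1[y1] for y1 in (0, 1))
                for y2 in (0, 1)}
# belief at y2 = product of incoming messages (Eq. 10), then normalize
belief = {y2: phi2[y2] * m_pair_to_y2[y2] for y2 in (0, 1)}
Z = sum(belief.values())
marginal_bp = {y2: belief[y2] / Z for y2 in (0, 1)}

# brute-force marginal for comparison
joint = {(y1, y2): phi1[y1] * phi2[y2] * psi[(y1, y2)]
         for y1 in (0, 1) for y2 in (0, 1)}
Zb = sum(joint.values())
marginal_bf = {y2: sum(joint[(y1, y2)] for y1 in (0, 1)) / Zb
               for y2 in (0, 1)}

assert all(abs(marginal_bp[s] - marginal_bf[s]) < 1e-12 for s in (0, 1))
```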

Although the sum-product algorithm could be used to find a configuration by taking the highest marginal probability for each variable node, this may not lead to the global optimum, as we only consider the marginal probability of one node at a time; it may even lead to illegal configurations. A better algorithm for finding the best configuration is **max-sum**, which works very similarly to sum-product: we can follow the same procedure and use Eqs. 6 to 9 (in the log domain, with products replaced by sums), but we need to change Eq. 7 to the following:

$$\mu_{f \to v}(y_v) = \max_{\mathbf{y}_{N(f) \setminus \{v\}}} \left[ \log \psi_f(\mathbf{y}_{N(f)}, \mathbf{x}_f) + \sum_{v' \in N(f) \setminus \{v\}} \mu_{v' \to f}(y_{v'}) \right] \qquad (12)$$

Note that when doing max-sum, we only need to pass messages from the leaves to the root; there is no need to pass messages back. While passing messages using the $\max$ operation, we also need to keep track of the best configurations of the variables (the argmax at each step) so that we can retrieve the globally best configuration at the end by backtracking. The best log-likelihood of the configuration can be computed at the root $r$ as follows:

$$\max_{\mathbf{y}} \log p(\mathbf{y} \mid \mathbf{x}, G) = \max_{y_r} \sum_{f \in N(r)} \mu_{f \to r}(y_r) - \log Z \qquad (13)$$

Again, this works exactly on a tree structure; with loops, it is not guaranteed to work properly.
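A minimal max-sum sketch on the same kind of two-variable chain (the construction is mine) shows the message of Eq. 12, the backtracking table, and agreement with brute-force search:

```python
import math, itertools

# Same toy chain as before, now in the log domain.
log_phi1 = {0: 0.0, 1: math.log(2.0)}
log_phi2 = {0: math.log(1.5), 1: 0.0}
log_psi  = {(0, 0): math.log(2.0), (0, 1): 0.0,
            (1, 0): 0.0, (1, 1): math.log(2.0)}

# Eq. 12: max over y1 of log_psi + log_phi1, remembering the argmax per y2
msg, back = {}, {}
for y2 in (0, 1):
    scores = {y1: log_psi[(y1, y2)] + log_phi1[y1] for y1 in (0, 1)}
    best = max(scores, key=scores.get)
    msg[y2], back[y2] = scores[best], best

# pick the best y2 at the "root", then backtrack y1
y2_star = max((0, 1), key=lambda y2: log_phi2[y2] + msg[y2])
y1_star = back[y2_star]

# agrees with brute-force argmax over the joint
bf = max(itertools.product((0, 1), repeat=2),
         key=lambda s: log_phi1[s[0]] + log_phi2[s[1]] + log_psi[s])
assert (y1_star, y2_star) == bf
```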

## Learn and infer in one graph – a semi-supervised learning approach

In some factor graphs, part of the random variables are labeled while the others are not. Given such a partially labeled factor graph, is there a better way to learn the parameters and infer the unseen labels simultaneously (rather than learning the parameters on the observed subgraph and applying them to the unobserved subgraph)?

Although we could do coordinate ascent, a more straightforward way is to learn by optimizing the marginal likelihood $p(\mathbf{y}^L \mid \mathbf{x}, G)$, which naturally utilizes the structure of the whole graph, including the unseen labels, as we sum over all possible $\mathbf{y}^U$. The log-likelihood function is:

$$\ell(\theta) = \log \sum_{\mathbf{y}^U} p(\mathbf{y}^L, \mathbf{y}^U \mid \mathbf{x}, G) \qquad (14)$$

Here $\mathbf{y}^L$ are the variables whose labels are known to us (i.e. their labels are fixed), whereas $\mathbf{y}^U$ are the ones whose labels are unknown.

The semi-supervised learning of the parameters can follow exactly the same gradient-based approach as the supervised learning introduced above; the derivative for parameter $\theta_{tk}$ is:

$$\frac{\partial \ell(\theta)}{\partial \theta_{tk}} = \mathbb{E}_{p(\mathbf{y}^U \mid \mathbf{y}^L, \mathbf{x}, G)}\left[ \sum_{c \in C_t} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) \right] - \mathbb{E}_{p(\mathbf{y} \mid \mathbf{x}, G)}\left[ \sum_{c \in C_t} f_{tk}(\mathbf{y}_c, \mathbf{x}_c) \right] \qquad (15)$$

Once again, we arrive at a gradient in the form of two expectation terms: the first is the expectation of the factors under the model distribution with the labeled variables fixed; the second is the expectation under the model distribution with no labels fixed. This is still very intuitive in the semi-supervised setting. The calculation of the terms in the gradient follows a procedure similar to the supervised setting above.
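As a toy illustration of Eq. 15 (the model and names are mine), both expectations can be computed by enumeration, clamping the labeled variables in the first term:

```python
import itertools, math

# One pairwise "agreement" feature over two binary variables.
def f_k(y):
    return 1.0 if y[0] == y[1] else 0.0

def expectation(theta, clamp):
    # E[f_k] under the model, restricted to states consistent with
    # clamp (a dict var_index -> fixed label; empty means no clamping)
    states = [s for s in itertools.product([0, 1], repeat=2)
              if all(s[i] == v for i, v in clamp.items())]
    weights = {s: math.exp(theta * f_k(s)) for s in states}
    Z = sum(weights.values())
    return sum(w * f_k(s) for s, w in weights.items()) / Z

def gradient(theta, clamp):
    # Eq. 15: clamped-model expectation minus free-model expectation
    return expectation(theta, clamp) - expectation(theta, {})

# y1 observed as 1, y2 hidden; at theta = 0 both expectations are 0.5
print(gradient(0.0, {0: 1}))  # -> 0.0
```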

## Appendix

**Derivation of the gradient of the normalization term in Eq. 4.** It may not be obvious how the second term in Eq. 5 (corresponding to the normalization term in the objective function of Eq. 4) is derived, so here is a derivation of the gradient of the term $\log Z$:

$$\begin{aligned} \frac{\partial \log Z}{\partial \theta_{tk}} &= \frac{1}{Z} \frac{\partial Z}{\partial \theta_{tk}} \\ &= \frac{1}{Z} \sum_{\mathbf{y}'} \left( \prod_{t' \in T} \prod_{c \in C_{t'}} \exp\left( \sum_{k'=1}^{K_{t'}} \theta_{t'k'} f_{t'k'}(\mathbf{y}'_c, \mathbf{x}_c) \right) \right) \sum_{c \in C_t} f_{tk}(\mathbf{y}'_c, \mathbf{x}_c) \\ &= \sum_{\mathbf{y}'} p(\mathbf{y}' \mid \mathbf{x}, G) \sum_{c \in C_t} f_{tk}(\mathbf{y}'_c, \mathbf{x}_c) \\ &= \sum_{c \in C_t} \sum_{\mathbf{y}'_c} p(\mathbf{y}'_c \mid \mathbf{x}, G)\, f_{tk}(\mathbf{y}'_c, \mathbf{x}_c) \end{aligned} \qquad (16)$$

The derivation of Eq. 15 from Eq. 14 follows similarly.

**Calculation/approximation of the likelihood in Eq. 4.** It is difficult to compute the likelihood function directly, as it sums over all states of all variables in the graph, so approximations need to be adopted for more efficient calculation. The details will be covered in Part II of this note.

**Piecewise & pseudo-likelihood and more.** Given the difficulty of computing the global normalization, there are ways to approximate the original likelihood in Eq. 4; two effective ones are piecewise training and pseudo-likelihood [3]. Details about this part will also be included in Part II of this note.

## References

[1] C. Sutton and A. McCallum, "An introduction to conditional random fields," arXiv preprint arXiv:1011.4088, 2010.

[Bibtex]
```
@article{sutton2010introduction,
title={An introduction to conditional random fields},
author={Sutton, Charles and McCallum, Andrew},
journal={arXiv preprint arXiv:1011.4088},
year={2010}
}
```

[2] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: an empirical study," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 467-475.

[Bibtex]
```
@inproceedings{murphy1999loopy,
title={Loopy belief propagation for approximate inference: An empirical study},
author={Murphy, Kevin P and Weiss, Yair and Jordan, Michael I},
booktitle={Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence},
pages={467--475},
year={1999},
organization={Morgan Kaufmann Publishers Inc.}
}
```

[3] C. Sutton and A. McCallum, "Piecewise pseudolikelihood for efficient training of conditional random fields," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 863-870.

[Bibtex]
```
@inproceedings{sutton2007piecewise,
title={Piecewise pseudolikelihood for efficient training of conditional random fields},
author={Sutton, Charles and McCallum, Andrew},
booktitle={Proceedings of the 24th international conference on Machine learning},
pages={863--870},
year={2007},
organization={ACM}
}
```