After Andrew McCallum's Talk

Today Prof. Andrew McCallum came to Northeastern to give a talk on probabilistic knowledge base construction, a topic I am not very familiar with. So instead of learning something from the technical content, most of the inspiration I took from his talk is on the research methodology side (which he didn't talk about much, but which I think about a lot).

Here are three levels of innovation in the broad "artificial intelligence" research community (including machine learning, data mining, etc.), purely my own opinion:

  1. Proposal of new real-world problems or demands.
  2. Proposal of new problem settings.
  3. Proposal of new methodologies/tricks for bounding/solving existing problems in existing settings.

Although different people have different preferences, in my view these should be treated equally in terms of contribution. In terms of research style, however, a body of research should keep a healthy proportion of the three. In application-driven research, it is very easy to spend too much effort proposing new real-world problems or demands, so that some of the proposed problems are not real contributions: the actual demand is not as large as claimed, and the problem setting is equivalent to some classic setting that has been around for a while. Theory-driven research runs the risk of drifting far away from real problems and demands; and since it is built on one fixed problem setting, once the setting changes the theory may become useless for the new setting. In short, application-driven research tends to focus on innovation 1, while theory-driven research tends to focus on innovation 3. But in some sense they can be unified, or proposed together, which might be called applied-theory-driven research. This is my favorite style. You don't want to propose new real problems all the time and solve them with classic methods on classic settings, and not many people want to always stick to classic problem settings and look for new solutions. A mixed proportion of these three levels of innovation (aligning theories with real problems and their abstracted settings) works better for certain people, like me.

To further illustrate the idea above, let me list some examples from Andrew's talk. He works on many NLP tasks such as named entity recognition and coreference resolution. These are classical problems with classical settings, and researchers working on them usually need innovation 3; that is how CRFs and their variants were proposed. But sometimes the problem setting itself can change. When faced with the problem of extracting entity relationships, a real demand for building knowledge bases, instead of following the traditional setting, which is to first define a schema of relationships and then learn those relationships for entities, they let the words (mostly verbs) automatically form such relationships; the learning algorithm then needs to group them, otherwise they are not very useful (e.g., president(Obama, U.S.) and leader(Obama, U.S.) are actually the same or similar). By changing the problem setting, the real demand can be better served (eventually we need some abstraction from the real world to form a solvable problem setting, and once defined it may not always be reasonable or optimal). Andrew also mentioned probabilistic programming languages. Maybe this is not his own idea, but I think this demand is actually an innovation 1: building and debugging graphical models in "regular" programming languages (such as C++, Java, etc.) can be difficult, and the demand for designing programming languages to address this issue arises naturally.

A Collection of Machine Learning Classes Available Online

Machine learning and data mining have been getting hotter and hotter in recent years, and against this background many machine learning classes are now available online. People who are interested in the topic will find that many lectures cover similar things, but at different levels, with different focuses or points of view. So it may not be a bad idea to make a collection of such classes and sort things out a little bit. (Note that most of the courses listed below have videos online, and the links attached may not be the newest version of the classes as time goes by; this post will keep updating if I find more.)

Advanced machine learning topics:

Notes on Latent Dirichlet Allocation: The Generative Model and Gibbs Sampling Learning

Latent Dirichlet Allocation, originally proposed in [1], is a generative graphical model for mining latent topics of texts or any data with similar underlying statistical structures.

The Generative Model

The generative process of the words in a document is as follows:


For each of the K topics:

  Choose \beta_k \sim Dir(\eta).

For each document d in the corpus:

  Choose a topic distribution \theta_d \sim Dir(\alpha).

  For each of the N_d word occurrences in the document:

    Choose a topic z_{dn} \sim Mult(\theta_d).

    Choose a word w_{dn} \sim Mult(\beta_{z_{dn}}).


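To make the process concrete, below is a minimal Python/NumPy sketch of the generative story above; the function name, the list-of-arrays corpus format, and the symmetric scalar values used for \alpha and \eta are my own illustrative choices, not part of [1]:

```python
import numpy as np

def generate_corpus(K, V, M, doc_lengths, alpha, eta, seed=0):
    """Sample a toy corpus from the LDA generative process described above."""
    rng = np.random.default_rng(seed)
    # one word distribution beta_k ~ Dir(eta) per topic
    beta = rng.dirichlet(np.full(V, eta), size=K)                # K x V
    corpus, assignments = [], []
    for d in range(M):
        # topic distribution theta_d ~ Dir(alpha) for this document
        theta_d = rng.dirichlet(np.full(K, alpha))
        z_d = rng.choice(K, size=doc_lengths[d], p=theta_d)      # z_dn ~ Mult(theta_d)
        w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # w_dn ~ Mult(beta_{z_dn})
        corpus.append(w_d)
        assignments.append(z_d)
    return corpus, assignments, beta

docs, z, beta = generate_corpus(K=3, V=20, M=5, doc_lengths=[10] * 5, alpha=0.5, eta=0.1)
```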
Here is a brief explanation of the notation:

\alpha is the K-dimensional (K is the number of topics) hyperparameter of the topic distributions.

\eta is the |V|-dimensional (|V| is the vocabulary size) hyperparameter of the word distribution of a given topic.

\beta_{kw} is the probability of choosing word w given that it is drawn from topic k.

\theta_d is the topic distribution of document d.

z_{dn} is the corresponding topic of the n-th word in document d.

Later on, I am also going to use the following notations:

W = \{W_d\}, W_d = \{W_{dn}\}, are the observed words.

Z = \{Z_d\}, Z_d = \{z_{dn}\}, are the topic assignments of the word occurrences.

n_{k} is a |V| dimensional vector, in which n_{kw} is the count of word w in the corpus assigned to topic k.

n_{k}^{-i} is n_{k} without counting the word occurrence at position i.

m_{d} is a K dimensional vector, in which m_{dk} is the count of words in document d assigned to topic k.

m_{d}^{-i} is m_{d} without counting the word occurrence at position i.

The graphical representation of the above generative process is as follows (each node represents a random variable, filled with gray when observed; links denote dependencies; and rectangular plates denote repetition, i.e. the big plate indicates there are M such documents, and the small plate indicates there are N_d word occurrences in each document):

[Figure: graphical model representation of LDA]

To keep (writing) this note simple, I am not going to explain the generative process in detail here.

The Likelihood

The random variables we are interested in for this model are \beta, \theta, and maybe z. But before going to the inference of these latent random variables, let us first follow the routine and write down the likelihood of the data, i.e.:

(1)   \begin{equation*} \begin{split} &P(W|\alpha, \eta) \\ = &\int P(\beta|\eta) [\prod_{d}^{M} \int (P(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum\limits_{z_{dn}} P(z_{dn}|\theta_d) P(W_{dn}|z_{dn}, \beta)) d\theta_d] d\beta \end{split} \end{equation*}

Where

    \[ P(\beta|\eta) = \prod_k^K \frac{\Gamma(\sum_{v=1}^V \eta_v)}{\prod_{v=1}^V \Gamma(\eta_v)} \beta_{k1}^{\eta_1-1} \cdots \beta_{kV}^{\eta_V-1} \]

    \[ P(\theta_d|\alpha) = \frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \theta_{d1}^{\alpha_1-1} \cdots \theta_{dK}^{\alpha_K-1} \]

    \[ P(z_{dn}|\theta_d) = (\theta_d)_{z_{dn}} \]

    \[ P(W_{dn}|z_{dn}, \beta) = \beta_{z_{dn}W_{dn}} \]

 

Gibbs Sampling Learning

Gibbs sampling learning for LDA was first proposed in [2]. The learning or inference problem targeted by [2] is to estimate \theta and \beta. I have not yet found a fully satisfying justification for using Gibbs sampling for LDA. One justification is that, given a valid sample Z, we estimate P(\theta|Z, \alpha) \propto P(Z|\theta) P(\theta|\alpha); either maximizing it or taking its expectation/mean (since this expression is a Dirichlet p.d.f., getting its expectation/mean is easy, and in the literature this is the common practice rather than using the mode obtained by maximum likelihood) gives us an estimate of \theta, and a similar procedure holds for \beta as well. However, this justification is not satisfying in the sense that it considers only one sample of Z|W rather than its distribution, and it says nothing about why a single sample of Z|W would be good enough. By considering the distribution of Z|W (which is intractable and thus requires sampling methods), one can derive that

(2)   \begin{equation*} \begin{split} P(\theta|W, \alpha) &= \int P(\theta|Z, \alpha) P(Z|W) dZ \\ &\simeq \frac{1}{L} \sum_{l}^L P(\theta|Z^{(l)}, \alpha) \end{split} \end{equation*}

Where each sample Z^{(l)} is drawn from the distribution P(Z|W). Here the expectation/mean is easier to obtain than the mode by maximum likelihood, because we have a summation over p.d.f.s, each of which is a Dirichlet. The expectation of a Dirichlet variable given one sample Z^{(l)} can be found in textbooks or on the Web, and the expectation of the same variable over a set of samples is just the average of the expectations under each sample. Technically, we need to obtain multiple de-correlated samples of Z|W according to E.q. 2; however, one sample may be good enough depending on the data (for example, for \beta, even one sample of Z|W contains many draws of z for the same word w).

I think E.q. 2 gives us enough motivation and justification for doing Gibbs sampling on LDA. Basically we need to be able to generate samples from P(Z|W), which is an intractable distribution; one popular way to do this is Gibbs sampling, which samples one variable at a time while keeping the others fixed.

To prepare for the derivation of Gibbs sampling for LDA, let me first highlight some important functions (for the mathematics behind MCMC and Gibbs sampling, readers can refer to some good tutorials on the Internet):

Dirichlet Distribution:

    \[ Dir(\theta;\alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{|\alpha|} \theta_i^{\alpha_i - 1} \]

Multinomial Beta function:

    \[ B(\alpha) = \frac{\prod_{i}^{|\alpha|}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{|\alpha|}\alpha_i)} \]

Gamma function:

    \[ \Gamma(n) = (n-1)! \]

Multinomial distribution:

    \[ Mult(\theta;x) = \frac{n!}{\prod_{i=1}^{|x|} x_i!} \prod_{i=1}^{|x|} \theta_i^{x_i}, \ \ n = \sum_{i=1}^{|x|} x_i \]

Note that in LDA, as in many other Bayesian statistical models, people use conjugate distributions for the likelihood and the prior, e.g. Multinomial and Dirichlet. The trick behind this is that the posterior will be of the same family as the prior, so that we can easily find a closed form for the integral of the posterior. In the case of Multinomial and Dirichlet, we have

    \[ Dir(\theta;x+\alpha) \propto Mult(\theta;x) Dir(\theta;\alpha) \]

Since

    \[ 1 = \int Dir(\theta;x+\alpha) d\theta = \int \frac{1}{B(x+\alpha)} \prod_{i=1}^{|\alpha|} \theta_i^{x_i+\alpha_i-1} d\theta \]

i.e.:

    \[ \int \prod_{i=1}^{|\alpha|} \theta_i^{x_i+\alpha_i-1} d\theta = B(x+\alpha) \]

This piece of formula is extremely useful for deriving the learning algorithm below. (One more note: although it is called a multinomial distribution in the generative process above, it is actually a single-trial multinomial, which degenerates into a discrete (categorical) distribution; however, this does not affect the learning much, as the formula here also holds for a sequence of draws from a discrete distribution.)

Then let us derive the joint distribution of W and Z, which will be useful just one step later (the notation is explained at the beginning of the post):

(3)   \begin{equation*} \begin{split} &P(W,Z|\alpha,\eta) \\ =& \prod_d P(W_d, Z_d| \alpha, \eta) \\ =& \prod_d P(W_d|Z_d, \eta) \prod_d P(Z_d|\alpha) \\ =& \int \prod_d P(W_d|Z_d, \beta) P(\beta|\eta) d\beta \prod_d \int P(Z_d|\theta_d) P(\theta_d|\alpha) d\theta_d \\ =& \int \prod_k \prod_w \beta_{kw}^{n_{kw}} \prod_k \frac{1}{B(\eta)} \prod_w \beta_{kw}^{\eta_w - 1} d\beta \\ & \ \ \ \ \prod_d \int \prod_k \theta_{dk}^{m_{dk}} \frac{1}{B(\alpha)} \prod_k \theta_{dk}^{\alpha_k - 1} d\theta_d \\ =& \frac{1}{(B(\eta))^K} \int \prod_k \prod_w \beta_{kw}^{n_{kw}+\eta_w-1} d\beta \prod_d \frac{1}{B(\alpha)} \int \prod_k \theta_{dk}^{m_{dk}+\alpha_k - 1} d\theta_d \\ =& \prod_k \frac{B(n_k+\eta)}{B(\eta)} \prod_d \frac{B(m_d+\alpha)}{B(\alpha)} \end{split} \end{equation*}

Now we can start deriving the key formula for Gibbs sampling, which is P(z_i|Z_{-i}, W, \alpha, \eta). Here Z_{-i} means the topic assignments of all word occurrences other than the one at position i. Without loss of generality, we assume that position i is in document d and is associated with the observed word w. Since

    \[ P(z_i|Z_{-i}, W, \alpha, \eta) = \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W|\alpha, \eta)} \propto \frac{P(Z, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \]

We have:

(4)   \begin{equation*} \begin{split} P(z_i=k|Z_{-i}, W, \alpha, \eta) \propto& \frac{P(z_i = k, Z_{-i}, W|\alpha, \eta)}{P(Z_{-i}, W_{-i}|\alpha, \eta)} \\ =& \frac{B(n_k + \eta)}{B(n_k^{-i} + \eta)} \frac{B(m_d + \alpha)}{B(m_d^{-i} + \alpha)} \\ =& \frac{n_{kw}+\eta_w-1}{[\sum_{w'} n_{kw'} + \eta_{w'}] - 1} \frac{m_{dk}+\alpha_k-1}{[\sum_{k'} m_{dk'} + \alpha_{k'}] - 1} \\ \propto& \frac{n_{kw}^{-i}+\eta_w}{\sum_{w'} n_{kw'}^{-i} + \eta_{w'}} (m_{dk}^{-i} + \alpha_k ) \end{split} \end{equation*}

The above E.q. 4 gives us a very clear picture for implementing Gibbs sampling: basically, one just needs to initialize Z, set up counters for the terms in the final line of E.q. 4, and each time a word is sampled, decrease the corresponding counters, sample according to the normalized E.q. 4, then increase the corresponding counters according to the sampling result. After "burn-in", one can read out (de-correlated) samples and start estimating the parameters. By E.q. 2, Bayes' rule and the expectation of the Dirichlet distribution, we obtain the expectations of \theta and \beta given a sample Z^{(l)} as follows:

(5)   \begin{equation*} \theta_{dk} = \frac{m_{dk}+\alpha_k}{\sum_{k'} m_{dk'} + \alpha_{k'}} \end{equation*}

(6)   \begin{equation*} \beta_{kw} = \frac{n_{kw}+\eta_w}{\sum_{w'} n_{kw'} + \eta_{w'}} \end{equation*}

As mentioned above, one might also want to average parameters over multiple de-correlated samples according to E.q. 2.
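To make the procedure concrete, here is a minimal Python/NumPy sketch of collapsed Gibbs sampling following E.q. 4, with the read-out of \theta and \beta from E.q. 5 and 6. The corpus format (a list of word-id arrays) and the symmetric scalar hyperparameters are assumptions made for illustration, not part of [2]:

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha, eta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of arrays of word ids."""
    rng = np.random.default_rng(seed)
    n_kw = np.zeros((K, V))            # word-topic counts n_{kw}
    n_k = np.zeros(K)                  # total words assigned to each topic
    m_dk = np.zeros((len(docs), K))    # document-topic counts m_{dk}
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initialization of Z
    for d, doc in enumerate(docs):     # fill the counters from the initialization
        for n, w in enumerate(doc):
            k = z[d][n]
            n_kw[k, w] += 1; n_k[k] += 1; m_dk[d, k] += 1
    for it in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current assignment from the counters (the "-i" counts)
                n_kw[k, w] -= 1; n_k[k] -= 1; m_dk[d, k] -= 1
                # E.q. 4: unnormalized conditional P(z_i = k | Z_{-i}, W)
                p = (n_kw[:, w] + eta) / (n_k + V * eta) * (m_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # add the new assignment back
                n_kw[k, w] += 1; n_k[k] += 1; m_dk[d, k] += 1
                z[d][n] = k
    # E.q. 5 and 6: posterior expectations of theta and beta from the last sample
    theta = (m_dk + alpha) / (m_dk + alpha).sum(axis=1, keepdims=True)
    beta = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return theta, beta, z
```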

Likelihood and perplexity 

Until now, I have not mentioned the formula for calculating the likelihood of the data P(W), which can sometimes be important for testing convergence (or one can use P(W|Z) instead). In [2], the authors mention an approximation technique for computing P(W): it is the harmonic mean of a set of values P(W|Z^{(n)}) from N samples, where Z^{(n)} is sampled from the posterior P(Z|W). That is:

    \[ P(W) \approx \frac{N}{\sum_n^N \frac{1}{P(W|Z^{(n)})}} \]

Where

    \[ Z^{(n)} \sim P(Z|W) \]

Using the previous derivation, we can easily compute that:

(7)   \begin{equation*} \begin{split} &P(W|Z^{(n)}) \\ =& \prod_k \frac{B(n_k + \eta)}{B(\eta)} \\ =& \bigg(\frac{\Gamma(|V|\eta_0)}{\Gamma(\eta_0)^{|V|}}\bigg)^K \prod_k \frac{\prod_w \Gamma(n_{kw}+\eta_0)}{\Gamma(\sum_w n_{kw} +  |V|\eta_0)} \end{split} \end{equation*}

Where we have assumed a symmetric \eta (with every component equal to \eta_0). However, E.q. 7 can easily lead to numerical underflow/overflow if computed directly (due to the large factorial terms). So we take the log of it and see what we get:

(8)   \begin{equation*} \begin{split} &\log P(W|Z^{(n)}) \\ =& C + \sum_k \sum_w \log  \Gamma(n_{kw}+\eta_0) - \log \Gamma(\sum_{w'} n_{kw'} + |V|\eta_0) \end{split} \end{equation*}

Where C is constant:

    \[ C = K\log \Gamma(|V|\eta_0) - K|V| \log \Gamma(\eta_0) \]

The log-gamma function is built into most programming languages or numerical libraries, so you can use it directly. But now we only have \log P(W|Z^{(n)}), and it can be quite "negatively large" (maybe -10^{10} depending on the corpus), which again creates a computational difficulty when we compute the harmonic mean of P(W|Z^{(n)}). So again we work with \log P(W), and the trick below enables us to compute it accurately:

(9)   \begin{equation*} \begin{split} \log P(W) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Here t_n = \log P(W|Z^{(n)}), and t_0 is introduced to make the log-sum-exp computed correctly; usually we can set t_0 = \min_n \{t_n\}. Now, even if the corpus contains millions of documents, the above likelihood calculation still works, as the scale stays within the range of floating point numbers.
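As a quick sketch of E.q. 7 to 9 in Python (assuming SciPy's gammaln, a symmetric scalar eta0, and the topic-word count matrix n_kw from the sampler sketched above):

```python
import numpy as np
from scipy.special import gammaln

def log_p_w_given_z(n_kw, eta0):
    """E.q. 8: log P(W|Z) for one sample; n_kw is the K x V topic-word count matrix."""
    K, V = n_kw.shape
    C = K * gammaln(V * eta0) - K * V * gammaln(eta0)            # the constant C
    return C + np.sum(gammaln(n_kw + eta0)) - np.sum(gammaln(n_kw.sum(axis=1) + V * eta0))

def log_p_w(t):
    """E.q. 9: harmonic-mean estimate of log P(W) from t_n = log P(W|Z^{(n)})."""
    t = np.asarray(t)
    t0 = t.min()
    return np.log(len(t)) - np.log(np.sum(np.exp(t0 - t))) + t0
```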

The likelihood of held-out data is also a useful quantity sometimes: it reflects the generalization/predictive power of the model, and it is usually measured by perplexity, which is essentially the inverse of the geometric mean of the per-word likelihood. It is defined by:

    \[ perplexity(D_{test}) = \exp \{ - \frac{\sum_d \log P(W_d)}{\sum_d N_d} \} \]

This quantity can be evaluated as follows: first hold out some documents, then sample topic assignments for the words in those documents given a point estimate \hat\beta, and then the likelihood (and thus the perplexity) of the held-out documents can be calculated as stated above.

Some resources on Gibbs Sampling for LDA 

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation using Gibbs sampling.

WinBUGS: Gibbs sampling tools for graphical models.

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[2] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228-5235, 2004.

Notes on Neural Networks Part I – Backpropagation

As far as I know, the story behind deep learning goes like this: originally there were neural networks with only one hidden layer, which are not powerful enough. To increase their power, one may add more hidden layers; however, with the standard gradient-based learning algorithm, i.e. backpropagation, the gradients vanish quickly in the hidden layers when there is more than one of them, and due to the non-convexity of the objective, deep neural networks also suffer from poor local optima and overfitting, which kept neural networks in a downturn for a while. Neural networks seem to be born with the concept of "feature/representation learning", but it was done purely in a supervised fashion until 2006, when Hinton et al. proposed unsupervised feature/representation learning to address the critical issues in training deep neural nets. This is called unsupervised pre-training, and it aims to initialize a deep neural network with much better weights than randomly assigned ones. After pre-training, the subsequent supervised learning (called fine-tuning) works much better than the original supervised learning. There are mainly two ways to do the pre-training: one uses Restricted Boltzmann Machines, the other uses stacked Autoencoders (however, some recent studies suggest that with rectifier units, deep neural nets can be trained well even without unsupervised pre-training).

Okay, back to basics. Let me scribe the classic backpropagation learning algorithm for neural networks here:

Notations:

a_i^k : scalar output of node i at the k-th layer; a^k is an n^k \times 1 matrix; note that a^1 at the 1st layer, i.e. the input layer, is the input vector x_j of the j-th instance, and a^K at the output layer is the output prediction, while t denotes the target output.

\delta_i^k : "pseudo-error" of node i at the k-th layer, used for calculating gradients; \delta^k is an n^k \times 1 matrix.

W_{ij}^k : weight of the link from node i at the k-th layer to node j at the (k+1)-th layer; W^k is an n^k \times n^{k+1} matrix.

b_j^k : bias from the k-th layer to node j at the (k+1)-th layer; b^k is an n^{k+1} \times 1 matrix.

h(x) : element-wise activation function for x, usually it would be sigmoid, tanh, or rectifier.

\eta : learning rate.

x \circ y : element-wise product of two matrices.

Backpropagation with stochastic gradient descent on squared error loss function and sigmoid activation function h(x):


Initialization

For each training instance, repeat the following until convergence or the maximum number of iterations is reached:

% forward propagation

For each layer k > 1:

(1)   \begin{equation*} a^k = h( (W^{k-1})^T a^{k-1} + b^{k-1}) \end{equation*}

% Back propagation of “pseudo errors”

For the top layer, say the K-th layer,

(2)   \begin{equation*} \delta^K = a^K \circ (\mathbf{1} - a^K) \circ (a^K - t) \end{equation*}

For each k-th hidden layer,

(3)   \begin{equation*} \delta^k = a^k \circ (\mathbf{1} - a^k) \circ ( (W^k)^T \delta^{k+1}) \end{equation*}

% Calculate and apply gradients

For the k-th layer (excluding the output layer),

(4)   \begin{equation*} \nabla_{W^k} = a^k (\delta^{k+1})^T \end{equation*}

(5)   \begin{equation*} \nabla_{b^k} = \delta^{k+1} \end{equation*}

(6)   \begin{equation*} W_{new} = W_{old} - \eta \nabla_{W} \end{equation*}

(7)   \begin{equation*} b_{new} = b_{old} - \eta \nabla_{b} \end{equation*}


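Below is a minimal NumPy sketch of the whole procedure for a fully-connected network with sigmoid activations and squared error, following E.q. 1 to 7; the 0-indexed layer convention, the function names, and the default learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W, b, eta=0.1):
    """One stochastic gradient descent step on a single instance (x, t).
    W[k], b[k] connect layer k to layer k+1 (0-indexed here)."""
    # forward propagation, E.q. 1
    a = [x]
    for Wk, bk in zip(W, b):
        a.append(sigmoid(Wk.T @ a[-1] + bk))
    # back propagation of "pseudo errors", E.q. 2 and 3
    delta = [a[-1] * (1 - a[-1]) * (a[-1] - t)]            # top layer
    for k in range(len(W) - 1, 0, -1):                     # hidden layers
        delta.insert(0, a[k] * (1 - a[k]) * (W[k] @ delta[0]))
    # gradients and updates, E.q. 4 to 7
    for k in range(len(W)):
        W[k] -= eta * np.outer(a[k], delta[k])
        b[k] -= eta * delta[k]
    return a[-1]
```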
For general loss functions and activation functions, we only need to replace equations 2 and 3 with:

(8)   \begin{equation*} \delta^K = h'(a^K) \circ \frac{\partial E}{\partial h(a^K)} \end{equation*}

And,

(9)   \begin{equation*} \delta^k = h'(a^k) \circ ( (W^k)^T \delta^{k+1}) \end{equation*}

Where h'(a^k) is the gradient of each dimension of the function h with respect to the corresponding dimension of its input vector a^k, and \frac{\partial E}{\partial h(a^K)} is the gradient of the error/loss function with respect to the predicted output h(a^K), taken "element-wise".

Notes on Factor Graph Model (Part I)

Graphical model and factorization

Originally I wanted to write more background and illustrate it with real-life examples, but I found this process consuming both time and words, so I will assume readers have some related background, write down only the important technical parts in this post, and include some references for more details.

A graphical model is a model built on a graph where nodes are random variables and edges are their dependencies. In a factor graph FG=\{Y, F, E\}, Y is a set of random variables, F is a set of factors connecting random variables, and E is the set of edges between Y and F. For illustration purposes, below is a diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs from [1] (which is a good tutorial on this topic). The lower row of the diagram contains the targets of this post: undirected graphical models using factor graph modeling. It is worth mentioning that in the figure the features X are also denoted as nodes; however, in the notation of this post, features are encoded into the factors (so each node has a single connection to a factor that associates the configuration of this node with the features for this node), and nodes represent only random variables.

[Figure: relationships between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs, from [1]]

The probability of the random variables in the factor graph, P(Y|G, X) (here I use G to denote the factor graph structure and X for the features associated with the factor nodes), can be factorized as follows:

(1)   \begin{equation*} P(Y|G, X;\bm\theta) = \frac{1}{Z(G, X;\bm\theta)} \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \Psi_{cp} (\bm{y}_c, \bm{x}_c; \bm\theta_p) \end{equation*}

Where \bm{C}=\{C_1, C_2, \cdots, C_p\} is a set of clique templates; within each clique template all cliques share the same potential function \Psi_{cp} (a clique is formed by a set of random variables connected by a factor; it can be a single node, a pair of nodes, or a set of nodes), and \Psi_{cp} is a potential function based on the configuration \bm{y}_c of the clique, the features \bm{x}_c of the clique, and the parameters \bm\theta_p of clique template C_p. \Psi_{cp} can be thought of as a way of expressing how likely a clique with a specific configuration and features is. Z(G, X;\bm\theta) = \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \Psi_{cp} (\bm{y}_c, \bm{x}_c; \bm\theta_p) is a global normalization term that makes the whole thing a probability.

Instantiation by Exponential Family

In equation 1 we have not specified the potential function \Psi_{cp} defined on each clique instance; the most common practice for instantiating it is to use the exponential family (similar to a log-linear model):

(2)   \begin{equation*} \Psi_{cp}(\bm{y}_c, \bm{x}_c; \bm\theta_p) = \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \end{equation*}

Here K(p) is the number of parameters or features in clique template C_p, and f_{pk} (\bm{y}_c, \bm{x}_c) is the feature function (or factor function), which turns a set of features \bm{x}_c and labels \bm{y}_c into a real or binary value. You can define the factor functions yourself according to some domain knowledge, but one requirement has to be met: the factor function should be unique for each label \bm{y}_c, i.e. f_{pk}(\bm{y}_c, \bm{x}_c) = f_{pk}(\bm{y}'_c, \bm{x}_c) if and only if \bm{y}_c = \bm{y}'_c. To be more specific, let us consider two commonly used types of features: node attribute features for a single random variable, and label correlation features among a set of variables. A node attribute feature would originally be a k-dimensional constant vector for a node over all different configurations; however, to meet the above requirement, one has to form an N_l * k dimensional feature vector associated with the N_l labels (N_l is the number of possible labels). For each label the feature vector is sparse, with only the k non-zero dimensions corresponding to the original feature; since the non-zero dimensions for different labels are disjoint, the feature vector is unique w.r.t. the label. A label correlation feature can only be defined on cliques that involve more than one node; for example, suppose each node can take one of two states 0/1, then an N_l * 2 dimensional feature vector can be defined over a pair of nodes, and for each label the feature vector is non-zero only on the 2 associated dimensions (one for the two nodes taking the same state, the other for them taking different states); again the feature vector is unique w.r.t. the label since the non-zero parts are disjoint. (Regarding these two types of features, an interesting fact is that if we only consider the node attribute features, the factor graph model degenerates into logistic regression.)

With above instantiation, the normalization constant Z(G, X; \bm\theta) can also be specified below:

(3)   \begin{equation*} Z(G, X;\bm\theta) = \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \end{equation*}

To model a real-world problem, this instantiation is still not enough: not only do the structures of the graph depend on the real problem, but the features also need to be specified by the model designer. Once a factor graph model is specified, there are two main problems left: the parameter learning problem and the inference problem, which are covered in the following two sections.

Parameter learning

The goal of parameter learning is to learn a set of parameters \bm\theta_{p} that best explain the observed data under the factor graph model (suppose we are given all labels Y during training), which is a maximum likelihood estimation of the parameters; the objective function is:

(4)   \begin{equation*} \begin{split} &\underset{\bm\theta}{arg\ max} \log P(Y|G,X;\bm\theta) \\ =&\underset{\bm\theta}{arg\ max} \{\sum_{C_p \in \bm{C}} \sum_{c \in C_p} \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ - \log \sum_Y \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \} \\ \end{split} \end{equation*}

A straightforward method is a gradient-based approach, such as gradient ascent or the Newton-Raphson method, which all rely on the gradients of the parameters. So let us take a look at the derivative of the objective function \log P(Y|G,X;\bm\theta) with respect to \bm\theta_{pk} (a detailed derivation can be found in the appendix of the post):

(5)   \begin{equation*} \begin{split} \nabla_{\bm\theta_{pk}} &= \sum_{c \in C_p} f_{pk} (\bm{y}_c, \bm{x}_c) - \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, \bm{x}_c) f_{pk} (\bm{y}_c, \bm{x}_c) \\ &= \mathbb{E}_{\widetilde{Y}}[f_{pk}] - \mathbb{E}_{P(Y|G, X;\bm\theta)} [f_{pk}]\\ \end{split} \end{equation*}

Where \mathbb{E}_{\widetilde{Y}}[f_{pk}] is the expectation of the factor function (for a specific clique template p and feature k) under the empirical distribution of Y, i.e. the labels are given, and \mathbb{E}_{P(Y|G,X;\bm\theta)} [f_{pk}] is the expectation of the factor function under the model distribution of Y, i.e. the label distributions predicted by the current model. This is fairly intuitive: when we learn the parameters, we want these two terms to be close so that the model distribution fits the empirical distribution well.

However, we still have one problem left: how do we calculate the derivative \nabla_{\bm\theta}, which contains the marginal probabilities of \bm{y}_c? Here is a solution: we can first obtain the marginal probability of each random variable in the graph, p(y|G, X), using inference techniques (introduced in the next section), and then calculate p(\bm{y}_c|G, X) using the chain rule for each clique c that consists of more than one node. You can see that in this solution the computational complexity is constrained by the size of the clique and the way we specify the factor function: if we utilize factor functions that rely on permutations of labels, then the correlation feature space would be exponential in the size of the clique according to the chain rule (this is also why factorizing Y into \bm{y}_c in the derivative of the normalization term is important).

Inference

Inference has two meanings here: one is to obtain the marginal probabilities of random variables, e.g. p(y|G, X); the other is to infer the best configuration of the graph, i.e. find Y such that P(Y|G, X) is maximized. Here I introduce two algorithms I have used: for obtaining marginal probabilities, the sum-product algorithm, and for the best-configuration problem, the max-sum algorithm.

Sum-product is also known as belief propagation, which is the more explanatory name, as the algorithm runs by propagating beliefs in the network. So what is a belief? A belief is the estimated marginal probability of a random variable node. However, although it is called belief propagation, what is really passed around the factor graph are messages, which are partial and unnormalized beliefs. There are two types of messages: one is the message from a variable node u to a factor node v, telling v about u's states based on the messages from other factors; the other is the message from a factor node v to a variable node u, telling u about its states based on the messages from other variables.

To be more specific, a message from a variable x to a neighboring factor f is computed as follows:

(6)   \begin{equation*} \mu_{x \to f} (x) = \prod_{f' \in ne(x) \backslash f} \mu_{f' \to x} (x) \end{equation*}

Here x and f in the subscript indicate a random variable node and a factor node, and x in parentheses is a real-valued scalar indicating the likelihood of a particular state of the random variable x; a message vector m can then be formed by concatenating the scalars for all possible states of the random variable x. Note that a message may not sum to one; you need to normalize it when the message is gathered.

A message from a factor f to a neighboring variable x is computed as follows:

(7)   \begin{equation*} \mu_{f \to x} (x) = \sum_{x_1} \cdots \sum_{x_N} \Psi_f(x, x_1, \cdots, x_N) \prod_{n = 1}^N \mu_{x_n \to f} (x_n) \end{equation*}

Here x_1, \cdots, x_N are the neighboring variables of the factor f other than x.

There are two types of special nodes: leaf nodes that have only one neighbor, and observed variables whose states are fixed. For a random variable x that is observed with evidence x_0, the message to its neighboring factor f should take the following form:

(8)   \begin{equation*} \mu_{x \to f} = \begin{cases} 1, x = x_0 \\ 0, \text{otherwise} \\ \end{cases} \end{equation*}

For a leaf factor f, the message to its neighbor variable x should take the following form:

(9)   \begin{equation*} \mu_{f \to x} (x) = \Psi_f (x) \end{equation*}

Once the messages are computed, we can then calculate the beliefs, or marginal probabilities, of the random variables. The marginal probability of variable x can be calculated as the product of all messages from its neighbors:

(10)   \begin{equation*} p(x) \propto \prod_{f \in ne(x)} \mu_{f \to x} (x) \end{equation*}

The marginal probability for a factor f is:

(11)   \begin{equation*} p(x_1, \cdots, x_n) \propto \Psi_f (x_1, \cdots, x_n) \mu_{x_1 \to f} (x_1) \cdots \mu_{x_n \to f} (x_n) \end{equation*}

Here x_1, \cdots, x_n are neighbor variables of f.

So to sum up, the algorithm works like this: first form a tree of the graph (by, e.g., BFS), then pass messages from the leaves to the root and from the root to the leaves according to E.q. 6 to 9; then you can calculate the marginal probabilities of all variable nodes with E.q. 10. It is worth noting that if the graph is a tree, the marginal probabilities are exact; if it is not, we can still proceed in the same or a similar way (with more iterations of "leaf-to-root and root-to-leaf"), which is called loopy belief propagation; the algorithm is not guaranteed to converge, but empirically it works nicely [2].
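As a concrete (and deliberately tiny) sketch, here is sum-product on a chain of discrete variables with pairwise potentials, written in Python/NumPy. The chain structure, the potential matrices, and the function name are my own illustrative assumptions, not a general factor graph implementation:

```python
import numpy as np

def chain_sum_product(unary, pairwise):
    """Exact marginals on a chain x_1 - x_2 - ... - x_T.
    unary: list of length-S arrays, the single-node potentials Psi_t(x_t).
    pairwise: list of S x S matrices, the pairwise potentials Psi(x_t, x_{t+1})."""
    T = len(unary)
    # forward messages: from the factor (t-1, t) into variable t
    fwd = [np.ones_like(unary[0])]
    for t in range(1, T):
        m = pairwise[t - 1].T @ (fwd[t - 1] * unary[t - 1])   # E.q. 6 and 7 combined
        fwd.append(m / m.sum())                               # normalize for stability
    # backward messages: from the factor (t, t+1) into variable t
    bwd = [np.ones_like(unary[0]) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        m = pairwise[t] @ (bwd[t + 1] * unary[t + 1])
        bwd[t] = m / m.sum()
    # E.q. 10: the belief of each variable is the product of incoming messages
    marginals = []
    for t in range(T):
        b = fwd[t] * bwd[t] * unary[t]
        marginals.append(b / b.sum())
    return marginals

# a 3-variable binary chain
unary = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
pairwise = [np.array([[0.9, 0.1], [0.1, 0.9]])] * 2
print(chain_sum_product(unary, pairwise))
```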

Although the sum-product algorithm can be used to find a best configuration by taking the highest-probability state of each variable node, this may not lead to the global optimum, since we only consider the marginal probability of one node at a time, and it may even lead to illegal configurations. A better algorithm for finding the best configuration is called max-sum, which works very similarly to sum-product: we can still follow a similar procedure and utilize E.q. 6 to 9, but we need to change E.q. 7 to the following:

(12)   \begin{equation*} \mu_{f \to x} (x) = \max_{x_1} \cdots \max_{x_N} \Psi_f(x, x_1, \cdots, x_N) \prod_{n = 1}^N \mu_{x_n \to f} (x_n) \end{equation*}

Note that when doing max-sum, we only need to pass messages from the leaves to the root, not back; while passing messages using the \max operation, we also need to keep track of the best configurations of the variables so that we can retrieve the globally best configuration at the end. The best likelihood of the configuration can be computed at the root as follows:

(13)   \begin{equation*} p^{\max} = \max_{x} (\prod_{f \in ne(x)} \mu_{f \to x }(x) ) \end{equation*}

Again, this works exactly on a tree structure; with loops, it is not guaranteed to work properly.
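In the same toy chain setting as above, the max-product variant (the product-space analogue of max-sum, matching E.q. 12 as stated with products) only replaces the sum with a max and records argmaxes for backtracking; again the names and the chain structure are illustrative assumptions:

```python
import numpy as np

def chain_max_product(unary, pairwise):
    """Most likely configuration of the same chain model as in chain_sum_product."""
    T, S = len(unary), len(unary[0])
    msg = np.ones(S)
    backptr = []
    # leaf-to-root pass with max instead of sum (E.q. 12)
    for t in range(1, T):
        scores = pairwise[t - 1] * (unary[t - 1] * msg)[:, None]   # S x S scores
        backptr.append(scores.argmax(axis=0))                      # best predecessor
        msg = scores.max(axis=0)
    # E.q. 13 at the root, then backtrack the recorded argmaxes
    best = [(unary[-1] * msg).argmax()]
    for ptr in reversed(backptr):
        best.insert(0, ptr[best[0]])
    return best
```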

Learn and infer in one graph – a semi-supervised learning approach

In some factor graphs, some of the random variables Y^L are labeled while the others Y^U are not. If we are given such a partially labeled factor graph, is there a better way to learn the parameters and infer the unseen labels simultaneously (rather than learning the parameters on the observed subgraph and applying them to the unobserved subgraph)?
Although we could do coordinate ascent, a more straightforward way is to learn by optimizing P(Y^L|G), which naturally utilizes the structure of the whole graph, including the unseen labels Y^U, since we sum over all possible Y^U. The log-likelihood function is:

(14)   \begin{equation*} \begin{split} &\log p(Y^L|G, X; \bm{\theta}) \\ =& \log \sum_{Y^U} P(Y^L, Y^U|G, X; \bm\theta) \\ =& \log \sum_{Y^U} \prod_{C_p \in \bm{C}} \prod_{c \in C_p} \exp\{ \sum_{k=1}^{K(p)} \bm\theta_{pk} f_{pk} (\bm{y}_c, \bm{x}_c) \} - \log Z(G, X;\bm\theta) \\ \end{split} \end{equation*}

Here \bm{y}_c = \{ \bm{y}^l_c, \bm{y}^u_c \}, where \bm{y}^l_c are the variables whose labels are known to us (i.e. fixed), whereas \bm{y}^u_c are the ones whose labels are unknown.

The semi-supervised learning of the parameters can follow exactly the same gradient-based approach as the supervised learning introduced above; the derivative for parameter \bm\theta_{pk} is

(15)   \begin{equation*} \begin{split} \nabla_{\bm\theta_{pk}} &= \sum_{c \in C_p} \sum_{\bm{y}^u_c} p(\bm{y}^l_c, \bm{y}^u_c|G, X) f_{pk} (\bm{y}^l_c, \bm{y}^u_c, \bm{x}_c) - \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, X) f_{pk}(\bm{y}_c, \bm{x}_c) \\ &= \mathbb{E}_{P(Y^U|Y^L, G, X;\bm\theta)}[f_{pk}] - \mathbb{E}_{P(Y|G, X;\bm\theta)} [f_{pk}]\\ \end{split} \end{equation*}

Once again, we arrive at a gradient in the form of two expectation terms: the first is the expectation of the factors under the model distribution with the labeled variables fixed, and the second is the expectation of the factors under the model distribution with no labels fixed. This is still very intuitive in the semi-supervised setting. The calculation of the terms in the gradient follows a procedure similar to the supervised setting above.

Appendix

Derivation of the gradient of the normalization term in E.q. 4. It may not be obvious how the second term in E.q. 5 (which corresponds to the normalization term in the objective function E.q. 4) is derived, so here I attach a derivation of the gradient of the term \log Z(G, X;\bm\theta):

(16)   \begin{equation*} \begin{split} &\nabla_{\bm\theta_{pk}} \log Z(G, X;\bm\theta) \\ =& \frac{ \sum_Y\{ \prod_{C_p \in C} \prod_{c \in C_p} \exp(\sum_k \bm\theta_{pk} f_{pk}(\bm{y}_c, \bm{x}_c)) \sum_{c \in C_p} f_{pk} (\bm{y}_c, \bm{x}_c)\} }{ \sum_Y \prod_{C_p \in C} \prod_{c \in C_p} \exp(\sum_k \bm\theta_{pk} f_{pk}(\bm{y}_c, \bm{x}_c)) } \\ =& \sum_Y P(Y|G, X;\bm\theta) \sum_{c \in C_p} f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_Y \sum_{c \in C_p} P(Y|G, X;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_{c \in C_p} \sum_Y P(Y|G, X;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ =& \sum_{c \in C_p} \sum_{\bm{y}_c} p(\bm{y}_c|G, \bm{x}_c;\bm\theta) f_{pk}(\bm{y}_c, \bm{x}_c) \\ \end{split} \end{equation*}

The derivation of E.q. 15 from E.q. 14 follows a similar path.

Calculation/approximation of the likelihood E.q. 4. It is difficult to compute the likelihood function directly, as it sums over all states of all variables in the graph, so approximations need to be adopted for a more efficient calculation. The details will be posted in Part II of these notes in the future.

Piecewise & pseudo-likelihood and more. Given the difficulty of computing the global normalization, there are some ways to approximate the original likelihood E.q. 4; two effective ones are piecewise training and pseudo-likelihood [3]. Details about this part will also be included in Part II of these notes.

References

[1] C. Sutton and A. McCallum, "An introduction to conditional random fields," arXiv preprint arXiv:1011.4088, 2010.
[2] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: an empirical study," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 467-475.
[3] C. Sutton and A. McCallum, "Piecewise pseudolikelihood for efficient training of conditional random fields," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 863-870.

Notes on EM and pLSA

(Appendix added on August 4th, 2014, see at the end)

Introduction to EM

EM (Expectation Maximization) is a general technique for solving model learning problems that include latent variables. Given a set of observed data \mathbf{y} and a set of latent variables \mathbf{z}, we want to learn the parameters by maximizing the log-likelihood of the data, i.e. L(\bm\theta) = \log P(\mathbf{y}|\bm{\theta}); including the latent variables, we have the objective function for learning:

(1)   \begin{equation*} L(\bm\theta) = \log P(\mathbf y|\bm\theta) = \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) \end{equation*}

However, this \log\sum makes it difficult for us to use standard optimization techniques like gradient descent. Facing this difficulty, we can use two ideas: 1. we can do the whole learning step by step, i.e. iteratively, so that at step i+1 we have an old set of parameters \bm\theta^{(i)} from the i-th iteration that we can use; 2. we can find a lower bound of objective function 1 and optimize towards it; since it is a lower bound that we can guarantee to improve (or not decrease), we can (somehow) guarantee the objective function to be increasing (or not decreasing). Keeping these two ideas in mind, we can do the derivation as below:

(2)   \begin{equation*}  \begin{split} L(\bm\theta) &= \log P(\mathbf y|\bm\theta) \\ &= \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) \\ &= \log \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } \\ &\ge \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } \text{  (Jensen's inequality)} \\ &\propto \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log P(\mathbf{y,z}|\bm\theta) \text{  (dropping const)} \\ \end{split} \end{equation*}

In the above derivation, we use the parameters estimated in the last iteration to find a lower bound, which we can maximize to indirectly improve the original likelihood function. The equality holds when X in Jensen's inequality is a constant variable; in the above case that means \forall \mathbf z, \frac{ P(\mathbf{y,z}|\bm\theta) }{ P(\mathbf z|\mathbf y, \bm\theta^{(i)}) } = c, where c is a constant. This equality condition gives us an estimate of the latent variables \mathbf z under the model:

(3)   \begin{equation*} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) = \frac{ P(\mathbf{y,z}|\bm\theta) }{ \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta) } \end{equation*}

Many textbooks about EM utilize the so-called Q-function:

(4)   \begin{equation*}  Q(\bm\theta|\bm\theta^{(i)}) = \sum_{\mathbf z} P(\mathbf z|\mathbf y, \bm\theta^{(i)}) \log P(\mathbf{y,z}|\bm\theta) \end{equation*}

Where we can view \log P(\mathbf{y,z}|\bm\theta) as the complete-data log-likelihood (\mathbf y is the incomplete data, \mathbf z are the latent variables), P(\mathbf z|\mathbf y, \bm\theta^{(i)}) as the prediction of the latent variables using the currently estimated parameters, and the Q-function as an expectation of the log-likelihood of the complete data. Then we can formalize the procedure for applying the EM algorithm:

Pre-Initialization:

Prepare the Q-function 4

Initialization:

Randomly initialize the parameters \bm\theta^{(0)}

E step:

Estimate the latent variables according to the currently estimated parameters: P(\mathbf z|\mathbf y, \bm\theta^{(i)})

M step:

Maximize the Q-function 4 w.r.t. the parameters \bm\theta given last estimated parameters \bm\theta^{(i)}

Some properties of EM: it can be proved that, under the EM procedure, if the lower-bound Q-function is increasing (or not decreasing), then the log-likelihood function will be increasing (or not decreasing) and the iteration will finally reach a converged point (one way to see it is to subtract L(\bm\theta^{(i)}) from L(\bm\theta)). It is worth noting that EM can only find a local maximum (a steady high point, but not necessarily the highest one), so the initialization of EM matters.

Applying EM in pLSA

pLSA (Probabilistic Latent Semantic Analysis [1, 2]) is a technique for mining topics of texts. In pLSA, the log-likelihood function is:

(5)   \begin{equation*}  \begin{split} \log P(\mathbf{d,w}) &= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log P(d_i, w_j) \\ &= \sum_{i=1}^N n(d_i) [\log P(d_i) + \sum_{j=1}^M \frac{n(d_i, w_j)}{n(d_i)} \log \sum_{k=1}^K P(w_j|z_k) P(z_k|d_i)] \\ & \propto \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k) P(z_k|d_i) \text{  (dropping const)}  \\ \end{split} \end{equation*}

Where d indicates documents, w denotes words, and z denotes topic assignments. N is the number of documents, M is the size of the vocabulary, and K is the number of topics.

In the log-likelihood function 5, we find a \log\sum structure with latent variables that is very similar to the log-likelihood function 1 of a general EM problem, so we could repeat the derivation used for EM directly on the log-likelihood function 5; however, we can also simply apply the EM algorithm. To do so, we need to derive the Q-function for EM. First, note that equation 5 is not the complete-data log-likelihood, which is:

(6)   \begin{equation*}  \log P(\mathbf{d,w,z}) = \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log P(w_j|z_{ij}) P(z_{ij}|d_i) \end{equation*}

To apply the Q-function, we have to be very careful now, as it gets tricky when we include the random variable \mathbf z, which is a multivariate random variable with n slots, each of which has K states. In the Q-function, we have to sum over all possible \mathbf z, which in this case seems to lead to an exponential number of terms. However, that is not the case, as the p(z_k|d_i,w_j) are independent of each other. We can write the Q-function as:

(7)   \begin{equation*}  \begin{split} Q(\bm\theta|\bm\theta^{(i)}) &= \sum_{z_1, ..., z_n} \sum_{l=1}^n \log P(\mathbf{d,w},z_l) P(z_1, ..., z_n|\mathbf{d,w}) \\ &= \sum_{z_1,z_2,...,z_n} [ \log P(\mathbf{d,w},z_1) + ... + \log P(\mathbf{d,w},z_n)] \\ & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(z_1|\textbf{d,w}) P(z_2|\textbf{d,w}) ... P(z_n|\textbf{d,w}) \\ &= \sum_{z_1,z_2,...,z_n} [\log P(d_1,w_1,z_1) + ... + \log P(d_n,w_n,z_n)] \\ & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(z_1|d_1,w_1) P(z_2|d_2,w_2) ... P(z_n|d_n,w_n) \\ &= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P (z_k|d_i, w_j)\log P(w_j|z_k) P(z_k|d_i) \\ \end{split} \end{equation*}

(Still, the above derivation is not at the most detailed level given how tricky it is.)

Now that we have prepared the Q-function 7 and established the E step, which is to calculate P(z_k|d_i, w_j) (using Bayes' rule), we can go ahead and optimize the Q-function. Before doing so, we should notice the constraints \sum_{j=1}^M P(w_j|z_k) = 1 and \sum_{k=1}^K P(z_k|d_i) = 1; we can include Lagrange multipliers \tau_k and \rho_i in the Q-function to form the complete objective function:

(8)   \begin{equation*}  \begin{split} L_{new} =& \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P (z_k|d_i, w_j)\log P(w_j|z_k) P(z_k|d_i) \\ &+ \sum_{k=1}^K \tau_k (1 - \sum_{j=1}^M P(w_j|z_k ) ) + \sum_{i=1}^N \rho_i (1 - \sum_{k=1}^K P(z_k|d_i) ) \\ \end{split} \end{equation*}

Setting the derivatives of equation 8 w.r.t. the parameters P(w_j|z_k) and P(z_k|d_i) to zero yields:

(9)   \begin{equation*}  \begin{split} &\sum_{i=1}^N n(d_i, w_j) P(z_k|d_i,w_j) - \tau_k P(w_j|z_k) = 0, \ 1 \le j \le M, \ 1 \le k \le K, \\ &\sum_{j=1}^M n(d_i, w_j) P(z_k|d_i,w_j) - \rho_i P(z_k|d_i) = 0, \ 1 \le i \le N, \ 1 \le k \le K. \\ \end{split} \end{equation*}

By substituting equation 9 into the normalization constraints \sum_{j=1}^M P(w_j|z_k) = 1 and \sum_{k=1}^K P(z_k|d_i) = 1, we find a closed form for the maximizing parameter estimates:

(10)   \begin{equation*} \begin{split} &P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i,w_j) P(z_k|d_i,w_j)}{\sum_{m=1}^M\sum_{i=1}^N n(d_i,w_m) P(z_k|d_i,w_m)} \\ &P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i,w_j)P(z_k|d_i,w_j)}{n(d_i)} \\ \end{split} \end{equation*}

To summarize, the EM algorithm for solving pLSA is:

Initialization:

Initialize the parameters p(z|d) and p(w|z) (note that, implementation-wise, for p(z|d) random initialization is better than uniform, which gets stuck; for p(w|z), uniform is better than random)

E step:

P (z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l) P(z_l|d_i)}

M step

P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i,w_j) P(z_k|d_i,w_j)}{\sum_{m=1}^M\sum_{i=1}^N n(d_i,w_m) P(z_k|d_i,w_m)}

P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i,w_j)P(z_k|d_i,w_j)}{n(d_i)}
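Here is a minimal NumPy sketch of these E and M steps operating on a dense document-word count matrix; the array names and the dense-matrix representation are illustrative assumptions (a sparse representation is preferable for real corpora):

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """counts: N x M matrix with n(d_i, w_j); returns p(z|d) (N x K) and p(w|z) (K x M)."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    p_z_d = rng.random((N, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)        # random init for p(z|d)
    p_w_z = np.full((K, M), 1.0 / M)                 # uniform init for p(w|z)
    for _ in range(n_iter):
        # E step: p(z_k|d_i, w_j) proportional to p(w_j|z_k) p(z_k|d_i)
        p_z_dw = p_z_d[:, :, None] * p_w_z[None, :, :]           # N x K x M
        p_z_dw /= p_z_dw.sum(axis=1, keepdims=True) + 1e-12
        # M step: re-estimate p(w|z) and p(z|d) from the expected counts
        expected = counts[:, None, :] * p_z_dw                    # n(d,w) * p(z|d,w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= counts.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```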

Appendix

Two more general ways of understanding EM:

Lower bound optimization: given that we are optimizing an objective function that includes latent variables, \log \sum_{\mathbf z} P(\mathbf{y,z}|\bm\theta), and since this is difficult directly, we apply Jensen's inequality in a specific way (see E.q. 2) to get a lower bound of the original objective. The trick of EM is that we improve the objective function by improving this tight lower bound: in the E step, we estimate the distribution of the latent variables such that (after the E step) the lower bound is equal to the objective (to do so, the equality condition of Jensen's inequality is used); in the M step, we optimize the lower bound w.r.t. the parameters (getting rid of constants, the M step is optimizing the Q-function). Since after the E step the objective is tightly lower bounded, the improvement of the objective in the M step is guaranteed to be greater than or equal to the improvement of the lower bound. Repeating the E and M steps, you can see that 1. the objective will not decrease (as the lower bound will not decrease), and 2. the objective will converge, since it does not decrease and there must be an upper limit.

Similar to the lower-bound-optimization point of view, what is shown in Bishop's book "Pattern Recognition and Machine Learning", chapter 9.4, is even more general. I very much recommend readers take a look at it.

The end.

[1] T. Hofmann, "Probabilistic latent semantic analysis," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289-296.
[2] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177-196, 2001.

A Brief Note on Recommendation Models

(Caution! This is a study note, not the experience of someone who has worked on recommendation for years; please do not take the assertions in this post at face value.)

What is recommendation? It is estimating how much a user will like an item. In the simplest case we are given a sparse rating matrix representing users' scores on items; since the matrix is sparse, the direct goal of recommendation is to fill in its missing values. On top of this simplest setting there are many extensions, e.g. adding user attributes, item attributes, timestamps, external context, users' social relationships and other behaviors. Therefore, the quality of a recommendation algorithm/system should be evaluated not only by speed and accuracy, but also by its extensibility: whether it can incorporate different attributes and adapt to new users, new items, and so on.

Recommendation methods are generally divided into content-based filtering and collaborative filtering. Content-based filtering starts from the items themselves: for example, if a user sleeps badly, recommend a sleeping aid; very direct and easy to understand. But used alone it is not very effective, because once there are many items and the demands get complex, such direct correspondences are hard to find. Collaborative filtering is based on user behavior: for example, if user A and user B have similar tastes in movies, then movies A loves are likely to be loved by B as well. At present, collaborative filtering is more effective and more popular, and content-based filtering is often folded into collaborative filtering anyway.

Collaborative filtering (CF) is further divided into memory-based and model-based methods. The former directly uses the user behavior history and comes in user-based and item-based flavors; the latter makes model assumptions about user behavior, and the most common models in this class are matrix factorization (Matrix Factorization) and probabilistic matrix factorization (PMF). Below I briefly discuss each of them.

Memory-Based Model

User-based Recommendation

To compute a user's score for an item, the basic idea of this method is to first compute the similarity between users (cosine distance, Pearson-r correlation; SVD can be used for dimensionality reduction) and use it as each user's weight, then take a weighted average of the scores for the item, so that more similar users have a larger influence. The formula is:

    \[ R_{u,i} = \bar{r_i} + \frac{\sum_{v \in U}{sim(u,v)(R_{v,i}-\bar{r_i})}}{\sum_{v \in U}{|sim(u,v)|}} \]

Where \bar{r_i} is the average score of item i, and U is the set of all users.

Item-based Recommendation

Again to compute a user's score for an item, the idea here is to first compute the similarity between items and use it as the item weights; the user's score for an item is then aggregated from the items the user has already rated, with more similar items again having a larger influence. The formula is:

    \[ R_{u,i} = \frac{\sum_{j \in R(u)}{sim(i,j)R_{u,j}}}{\sum_{j \in R(u)} |sim(i,j)|} \]

Where R(u) is the set of items that user u has rated.

Items are relatively "stable" in a recommender system; they do not change that much. We can pre-compute the item-item similarities offline, so item-based recommendation has a lower computational cost than user-based recommendation (which needs to compute user-user similarities). The paper "Item-Based Collaborative Filtering Recommendation Algorithms" (Badrul Sarwar et al.) also points out that item-based methods outperform user-based methods in online performance.
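A minimal sketch of item-based prediction with pre-computed cosine similarities, on a dense rating matrix with zeros for missing entries; the matrix layout and function names are my own illustrative assumptions:

```python
import numpy as np

def item_similarities(R):
    """Cosine similarity between the item columns of the rating matrix R (users x items)."""
    norms = np.linalg.norm(R, axis=0) + 1e-12
    return (R.T @ R) / np.outer(norms, norms)

def predict_item_based(R, S, u, i):
    """Predict user u's rating of item i from the items u has already rated."""
    rated = np.nonzero(R[u])[0]                 # items rated by u, i.e. R(u)
    if len(rated) == 0:
        return 0.0
    sims = S[i, rated]
    return float(sims @ R[u, rated] / (np.abs(sims).sum() + 1e-12))

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4.]])
S = item_similarities(R)
print(predict_item_based(R, S, u=0, i=2))       # fill one missing value
```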

Latent Factor Model — Matrix Factorization

The essence of latent factor models is the assumption that every user and every item has some latent factors (or topics). For example, take user U_1 = (0.5, 0.1, 0.2) and movie I_1 = (0.5, 0.2, 0.3), where the three dimensions represent the responses to action movies, comedies, and historical movies respectively. We can see that user U_1 responds strongly to action movies (he loves watching them), while movie I_1 responds strongly on the action dimension (the movie contains a lot of action elements). The dot product U_1^T I_1 then represents the expected rating of user U_1 for movie I_1. This linear dot product matches intuition well, and the idea of latent factors reflects the real human decision process nicely, so this model is currently the most popular and a fairly accurate model for recommendation. For the simplest introduction to latent factor models via matrix factorization, see this article (it explains the idea, modeling, optimization objective and optimization algorithm of basic SVD in detail).

Fusing user/item features and social relations

If we only use the rating matrix for recommendation modeling, it feels a bit unsatisfying; why leave the pile of other information unused? Let's see: users have attributes such as gender, age and tags; items have attributes such as categories and tags; users have social relations (following, being followed); and there is implicit feedback, such as movies the user watched but did not rate, or the user's click history. How do we use all this information to build a unified matrix factorization latent factor model? Tianqi Chen from Shanghai Jiao Tong University formalized a feature-based matrix factorization framework (similar things existed before, but apparently not such a general formalization). His model of a user's rating is:

    \[ R_{u,i} = \mu+\left(\sum_{g \in G} \gamma_{g} b_g + \sum_{m \in M} \alpha_m b^u_m + \sum_{n \in N} \beta_n b^i_n\right) +\left(\sum_{m \in M} \alpha_m \mathbf{p}_m\right)^T\left( \sum_{n \in N} \beta_n \mathbf{q}_n\right) \]

For a detailed explanation of the formula, see SVDfeature or the paper "Feature-Based Matrix Factorization". The idea is fairly straightforward. My understanding is: for the non-latent part, features by themselves have some effect on the rating score, e.g. the user's average rating over all items and the average rating an item receives; these are the first two terms of the formula. For the latent part, first assume there are N latent factor dimensions shared by users and items (so \mathbf{p} and \mathbf{q} have the same dimensionality); all features can be decomposed onto these latent factors. For example, different users have different "response" distributions over the latent factors, and so do different items; other features, such as the tag or category "action movie", are also decomposed in the latent factor space (the learned responses should be especially large on the one or few factors related to action movies). Then, for a user-item pair, multiplying the latent decompositions of its features (the third term of the formula) gives the expected rating contributed by the latent factors. A social relation can be treated simply as another kind of feature and folded into the formula above; there are also more sophisticated methods (Google "recommendation + social relation" yourself), but they are essentially not that different.

Matrix factorization based latent factor models basically all have a single layer of latent factors, and most of them can be expressed by modifying the formula for R_{u,i} above; the written-out formulas all look similar. The neural network based latent factor model mentioned below has a rather different form of expression. But I have not seen models with multiple layers of latent factors either.

Optimization objective

Usually an L2 objective is used:

    \[ Loss = \sum_{u,i}(r_{u,i} - \hat{r}_{u,i})^2 + regularization \]

Strictly speaking, this objective is convex in the user factors or the item factors separately but not jointly convex in both; in practice, taking derivatives directly and optimizing with gradient descent or quasi-Newton methods works well.
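A minimal sketch of plain matrix factorization trained with SGD on the L2 objective above (looping over observed ratings rather than the full matrix, as recommended in the tips below); the latent dimension, learning rate and regularization constant are illustrative assumptions:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=10, lr=0.01, reg=0.05, n_epochs=20, seed=0):
    """ratings: list of (user, item, score) triples; returns user and item factor matrices."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))    # user latent factors
    Q = 0.1 * rng.standard_normal((n_items, k))    # item latent factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - pu @ Q[i]                    # prediction error on one rating
            P[u] += lr * (err * Q[i] - reg * pu)   # gradient step with L2 regularization
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = mf_sgd(ratings, n_users=3, n_items=3)
print(P[0] @ Q[2])   # predicted missing rating for user 0, item 2
```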

References

There are many papers on latent factor models, with all kinds of variations. Here are a few worth reading:

  • Y. Koren, R. M. Bell, C. Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer
  • Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD 2008

In addition, the technical reports of the Netflix Prize and the KDD Cup 2011 and KDD Cup 2012 winners are worth reading.

Latent Factor Model — Neural Network

The main representative work is a two-layer neural network (one layer of latent factors) built with RBMs; see the paper "Restricted Boltzmann Machines for Collaborative Filtering", which is also fairly straightforward.

Interestingly, the authors wrote at the end of the paper that they would try deep networks for recommendation, but the paper was published years ago and there has been no deep-network follow-up, so presumably the results were not particularly good. Intuitively, though, latent factors also have structure: among action movies there are Jackie Chan action movies and Bruce Lee action movies; this kind of hierarchical structure, and other kinds of structure, seem like something a deep neural network could capture well, so this gap genuinely surprises me.

Tips for computational optimization

  • The rating matrix is usually sparse and large, so the space and time cost can be huge (just looping over and assigning a 10,000 x 10,000 matrix takes close to one minute on my 2.4 GHz laptop; for memory, with 64-bit ints at 8 bytes each, that is about 800 MB). So one optimization is to store it as a sparse matrix.
  • For large matrices, do not casually use dot products, full traversals, and similar operations; the complexity is too high. In particular, during training and inference, do not make the cost MN-dependent (M and N are the numbers of users and items); try to make it instance-dependent (dependent on the number of training/test samples, i.e. the number of ratings). Concretely, in code, do not write for-loops over the rating matrix; loop over the instances instead.
  • Vectorize your code and reduce for-loops.
  • If the training data is large, use SGD when computing gradients: split the data into batches and update the parameters after each batch as one iteration, instead of accumulating over all training samples before computing the gradient.
  • You can also use quasi-Newton methods such as L-BFGS instead of gradient descent; they descend faster.
  • On language choice: for convenience I chose Python, and later found that unoptimized Python is really not suited for this kind of compute-intensive work. With plain for-loops, unoptimized Python is tens of times slower than C++, and in places like matrix operations it can be hundreds of times slower. What does that mean? With Python you write the program quickly but wait two or three hours for one result; with C/C++ you may spend longer writing, but one run takes only ten-odd minutes. If you need to tune parameters and try different models, this gap is terrifying.

Software

  • SVDfeature. Extremely fast, not only because it is written in C++ but also because the author has carefully optimized it. The author has used this tool to get good results in several competitions.
  • PREA. Written in Java and implements many algorithms, but again slower than C/C++. Provides a benchmark. The code structure is simple, which makes it good for learning the various algorithms.
  • MyMediaLite. Also implements quite a few algorithms, and provides benchmarks on several datasets.

(值得注意的是,本文在评价推荐算法时候大都是基于RMSE值最小化的目标,而在现实中,可能是rank-based,或者有其他更多指标,因此,并不能简单认为那些RMSE值跑的好看的算法就是实际中更实用的算法,具体的问题还需要具体的评价指标和算法优化)