Bayesian – the good and the bad

The following thoughts are general and do not necessarily apply to every model.

The good (not exclusive)
1. principled probabilistic modeling.

The good (exclusive)
1. the generative process makes it easy to plug in “domain knowledge”.
2. the prior makes it possible to plug in further “domain knowledge”.
3. integration/summation over latent variables usually yields better performance.

The bad (not exclusive)
1. setting the prior can be difficult and may require hyper-parameter tuning.

The bad (exclusive)
1. a specific generative process with built-in “domain knowledge” can limit the model’s ability to capture complexity in the data.
2. designing the generative process requires a deep understanding of the data, the “domain knowledge”, and the math.
3. inference is complicated and slow, especially with integration/summation over latent variables.

A brief review of traditional graph embedding methods

From my point of view, the traditional graph embedding problem can be summarized as follows:

Construct a proximity graph G on data points, then embed nodes (data points) into some K-dimensional space, such that the reconstructed G’ is close to original G in some sense.

Before going into details, let’s first justify why such an idea is useful. Given data points in some high-dimensional feature space, it is pretty difficult to separate them directly (assuming you have some ML experience). However, in many real datasets we care about, although the feature dimension is high, most data actually lie on some low-dimensional manifold. The following figure is an example: although the data spread over a 2-d space, they are mainly organized in two 1-d non-linear subspaces (circle-shaped manifolds). Building a proximity graph over the data can help us find such low-dimensional non-linear manifolds. For example, the two clusters found by KMeans in the figure are pretty messed up (which means the original 2-d representation of the data does not preserve the manifolds directly), but if we use the 1-NN of each data point, we find that points in the inner circle and points in the outer circle are completely disconnected, i.e. they have zero similarity, so we have a better chance of separating them. Now that we have constructed a proximity graph which captures the intrinsic manifolds in the data, can we use it directly as features for all kinds of tasks, such as clustering? No. Because this graph G is of the size of the number of data points, which is usually large, and it is also sparse, directly using the neighbors of a node as its features may not be a good idea. A practical idea is to embed the nodes (data points) into some lower-dimensional space such that the proximity in the original graph G is preserved, where preservation is measured by the “closeness” of the original proximity graph G and the reconstructed graph G’.
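To make the 1-NN argument concrete, here is a minimal pure-Python sketch (the two-circles data, point counts, and function names are all my own illustrative choices, not from any particular method): every point’s nearest neighbor lies on its own circle, so the 1-NN proximity graph completely disconnects the two manifolds.

```python
import math, random

random.seed(0)

def circle_points(r, n, noise=0.02):
    """Sample n noisy points on a circle of radius r centered at the origin."""
    pts = []
    for k in range(n):
        a = 2 * math.pi * k / n
        pts.append((r * math.cos(a) + random.gauss(0, noise),
                    r * math.sin(a) + random.gauss(0, noise)))
    return pts

inner = circle_points(1.0, 40)
outer = circle_points(3.0, 40)
points = inner + outer  # indices 0..39 inner circle, 40..79 outer circle

def nearest_neighbor(i):
    """Index of the 1-nearest neighbor of point i (squared Euclidean distance)."""
    best, best_d = None, float("inf")
    xi, yi = points[i]
    for j, (xj, yj) in enumerate(points):
        if j == i:
            continue
        d = (xi - xj) ** 2 + (yi - yj) ** 2
        if d < best_d:
            best, best_d = j, d
    return best

# Count 1-NN edges that cross between the circles: the two manifolds are
# completely disconnected in the 1-NN proximity graph.
cross_edges = sum(1 for i in range(len(points))
                  if (i < 40) != (nearest_neighbor(i) < 40))
print(cross_edges)  # 0
```

With the circles well separated relative to the within-circle point spacing, no 1-NN edge crosses between them, which is exactly the zero-similarity property discussed above.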


Given that we have justified why the graph embedding task is meaningful, there are several essential design choices one has to make before implementing it, which lead to different concrete methods. The questions are:

1. How to construct the proximity graph G?

If we simply construct it from the original feature space, e.g. using the Euclidean distance between data points, we end up with MDS (Multi-Dimensional Scaling), which is not desirable since it cannot directly preserve manifolds. We can use KNN plus shortest paths as in Isomap, which can easily be hindered by noisy data points (the short-circuit issue). Another good choice is the Gaussian kernel, in which the similarity decays exponentially with distance; this practice is adopted by many methods, including Laplacian Eigenmaps, t-SNE, etc. (I am just too lazy to add any references…)

What if we are given a graph (adjacency matrix) instead of data points in a high-dimensional space? There is something called “structure preserving embedding”, which aims to learn a kernel that best “aligns” with the given graph and treats that as the proximity graph.

Anyway, there are some more considerations in this part which I will not discuss here.

2. How to “reconstruct” G?

Once we have the proximity graph G, we turn to embedding the nodes of the graph so as to preserve the graph structure. We achieve this by “reconstructing” the proximity graph G, directly or indirectly, from the embedding vectors. In MDS, Isomap, Laplacian Eigenmaps, and so on, we “reconstruct” by simply taking the Euclidean distance between two vectors in the embedding space. In MDS and Isomap, the distance is forced to match the weight between a pair of nodes, while in Laplacian Eigenmaps the squared distances are weighted by the edge weights and their weighted sum is minimized. A choice less often seen in the traditional graph embedding community, but popular in many representation learning methods, is to use the inner product of two vectors.

3. How to measure the “closeness” between G and G’?

Now that we have a way to “reconstruct” the original graph G, we need to measure the quality of the reconstruction, so that it provides a means for learning the embedding vectors. This can be seen as the choice of loss function. The L2 loss is commonly seen, e.g. in MDS and Isomap. In Laplacian Eigenmaps, the weighted alignment loss is used. Other losses, such as KL-divergence, hinge loss, and log loss, can also be adopted in appropriate contexts.
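As a toy illustration of the L2 loss, the sketch below embeds three points on a line by gradient descent so that pairwise distances in the embedding match given target distances. This is a miniature MDS-style objective; all numbers and names are arbitrary choices of mine, not taken from any of the cited methods.

```python
import random

random.seed(1)

# Target pairwise distances for three points; exactly realizable on a line
# (toy values chosen for illustration).
target = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}

def stress(x):
    """L2 reconstruction loss: squared mismatch between embedded and target distances."""
    return sum((abs(x[i] - x[j]) - d) ** 2 for (i, j), d in target.items())

x = [random.uniform(-1, 1) for _ in range(3)]  # 1-D embedding, random init
s0 = stress(x)

lr = 0.01
for _ in range(5000):
    grad = [0.0, 0.0, 0.0]
    for (i, j), d in target.items():
        diff = x[i] - x[j]
        dist = abs(diff) or 1e-12        # guard against zero distance
        g = 2 * (dist - d) * diff / dist  # d(stress)/dx_i for this pair
        grad[i] += g
        grad[j] -= g
    x = [xi - lr * gi for xi, gi in zip(x, grad)]

s1 = stress(x)
print(s0, "->", s1)  # the stress decreases as the embedding fits the distances
```

Swapping the L2 stress for KL-divergence or a log loss in the same loop is exactly the design choice this question is about.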

Other extensions

As I stated at the beginning of the post, the idea is to construct a proximity graph and embed data points so as to preserve the structure of that graph. However, we may also learn meaningful low-dimensional representations directly from data in the original feature space; the auto-encoder is an example. Incorporating supervision is another difficult but meaningful extension. And even within the “data -> graph -> embedding” idea, we still have other choices for constructing the graph, embedding the nodes, and defining the loss, depending on the real problem.

AI and data – do you need to be worried?

Recently some tech celebrities (Stephen Hawking, Elon Musk, Bill Gates, etc.; you can google the details, here’s a news article) have expressed their concerns about harmful AI and humanity’s future. But really, do you need to worry? On this subject and beyond, I have a few words to say.

Nowadays, it seems to me that most people do not directly say they are working on AI; I feel the term AI has been abandoned by academia for some years since the “letdown” of the first generation of AI, centered around rule-based logic and reasoning. Because AI is hard, people have found other ways to work on it, mainly via machine learning, data mining, computer vision, speech recognition, natural language processing, etc. With the rise of these fields, relying on the availability of data, today we say “big data” more often than AI. But what does data have to do with AI?

Here I give an informal definition of “new AI”: AI is the ability to understand real data and to generate (non-random and original) data. This definition might be imprecise, and I am not giving a rigorous argument here, but think about it: if a machine can understand all data, from visual data, voice data, and natural language text to network data, human interaction data, and more, is it less intelligent than a human? Maybe something is missing: if the machine can also create (non-random and original) data, e.g., generate non-random figures, sounds, texts, theorems, etc., then we can basically call it a painter, a writer, or a scientist, because it has all the expertise and can do creative work; it is then more intelligent than most people.

The data point of view is superior to whatever was originally called AI, because it enables us to make real progress and do so much more (I am not going to demonstrate this here, and I suppose many “AI” researchers hold similar views). If we look at what we are currently doing in “AI” research, we are basically doing so-called data mining (please don’t confuse this with the data mining community in academia), especially focusing on data understanding. Take machine learning: the basic principle is to feed data to machines and make them understand/recognize it on their own, so they can extract something useful, make predictions, and so on. But not create! Machine learning is currently not focused on generating real data (although there are some trends in that direction).

If we say the machine’s ability to understand real data is weak AI, and the machine’s ability to generate (non-random and original) data is strong AI, then we are squarely in the phase of weak AI. And we can easily imagine that, without strong AI, weak-AI machines are not so dangerous. You can say cars or weapons are dangerous, and maybe they are, but ultimately that depends on the people using them and the conditions of use. However, people’s worry about AI machines is different: that AI machines can get out of control and may destroy the human race. This might eventually become true, but I think weak AI can never do it (without humans).

So how far away are we from strong AI? I think it is pretty far. We might achieve a little bit of strong AI in some special fields, but general strong AI is still well beyond our reach. We will get there eventually, and people might need to worry about it at some point in the future before it arrives, but I guess not now. Of course, this might just be the pessimism of a practitioner, and the opposite can also be the optimism of non-practitioners.

To conclude, I think the take-aways from the post are:

  • We need to adopt a new point of view about AI, which is all about data, and there is so much we can do with it without achieving what people usually think of as human-like AI agents (we did not build a big bird to fly around, and we did not build a robot arm to wash clothes, did we?).
  • For researchers working with data, we really need to think about the big picture of AI and work towards it solidly, e.g. by establishing the fundamental principles and methodologies of learning from data, rather than being trapped by all kinds of data applications.
  • Let’s not worry about harmful AI right now (though you should worry about things like information security, which is related). People didn’t worry like this before the car or the plane came around, right? (Well, maybe some people did.) “Weak AI” (defined above) is more powerful than, but similar to, cars and planes: it is ultimately controlled by humans, and it can be dangerous if humans mishandle it. The real danger, machines getting out of control and conquering human beings, is possible, and you will need to worry about it, but not before we can really approach strong AI (so don’t worry about it now).

Notes on Variational Inference Part I

1. Introduction

Consider the following statistical problem setting: we are given \mathbf X and \mathbf Z, where \mathbf X is a set of observations and \mathbf Z is a set of latent random variables, and the goal is to compute the posterior distribution of \mathbf Z given \mathbf X, i.e. P(\mathbf Z|\mathbf X). This is a very general setting, because \mathbf Z here can be anything we haven’t observed, which includes naturally arising latent variables as well as parameters treated as random variables.

2. Mean-field Approximation

2.1. Evidence Lower Bound

In many complex models, the posterior P(\mathbf Z|\mathbf X) is intractable (although P(\mathbf Z, \mathbf X) is tractable), which means we cannot compute it directly given the data. So the idea is to find an approximating distribution over \mathbf Z, i.e. q(\mathbf Z). Under the mean-field approximation, there are two ways to introduce the term q(\mathbf Z).

KL-divergence. We quantify the approximation using the KL-divergence between the proposed distribution q and the true posterior, that is

(1)   \begin{equation*}  D_{KL}(q(\mathbf Z)||P(\mathbf Z|\mathbf X)) = \int q(\mathbf Z) \log \frac{q(\mathbf Z)}{P(\mathbf Z|\mathbf X)} d\mathbf Z \end{equation*}

To make the approximation accurate, we want to minimize the KL-divergence over q. However, once again we encounter the intractable distribution P(\mathbf Z|\mathbf X) inside the KL-divergence. To solve this problem, we expand Eq. 1:

    \[ D_{KL}(q||p) = \int q(\mathbf Z) \log \frac{q(\mathbf Z)}{P(\mathbf Z,\mathbf X)} d\mathbf Z + \log P(\mathbf X) \]

Here we abbreviate the distributions q(\mathbf Z) and P(\mathbf Z|\mathbf X) as q and p. Since P(\mathbf X) has nothing to do with how we determine q, minimizing the KL-divergence is equivalent to maximizing the evidence lower bound (the name will make more sense when we come to the other way of deriving it later), i.e. \mathcal L(q):

(2)   \begin{equation*}  \mathcal L(q) = - \int q(\mathbf Z) \log \frac{q(\mathbf Z)}{P(\mathbf Z,\mathbf X)} d\mathbf Z = \mathbb E_q [\log P(\mathbf Z, \mathbf X)] - \mathbb E_q [\log q(\mathbf Z)] \end{equation*}

Here the term \mathbb E_q [\log P(\mathbf Z, \mathbf X)] is called the variational free energy, and -\mathbb E_q [\log q(\mathbf Z)] is the entropy of q. Thus, minimizing the approximation error amounts to maximizing the lower bound over q:

    \[ \max_{q} \mathcal L(q) \]

Jensen’s Inequality. Another way of deriving the approximating distribution q is to consider bounding the data log-likelihood:

    \begin{equation*} \begin{split} \log P(\mathbf X) &= \log \int P(\mathbf Z, \mathbf X) d\mathbf Z\\ &= \log \int q(\mathbf Z) \frac{P(\mathbf Z, \mathbf X)}{q(\mathbf Z)} d\mathbf Z\\ &\ge \int q(\mathbf Z) \log \frac{P(\mathbf Z, \mathbf X)}{q(\mathbf Z)} d\mathbf Z\\ &= \mathbb E_q [\log P(\mathbf Z, \mathbf X)] - \mathbb E_q [\log q(\mathbf Z)]\\ &= \mathcal L(q) \end{split} \end{equation*}

It is also easy to see that the difference between the data log-likelihood \log P(\mathbf X) and the lower bound \mathcal L(q) is exactly the KL-divergence between q and p.
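This decomposition, \log P(\mathbf X) = \mathcal L(q) + D_{KL}(q||p), can be verified numerically on a toy discrete model (the probabilities below are arbitrary values of my own choosing):

```python
import math

# Toy joint P(Z, X) for one observed X and a binary latent Z.
joint = [0.3, 0.1]                         # P(Z=0, X), P(Z=1, X)
evidence = sum(joint)                      # P(X) = 0.4
posterior = [p / evidence for p in joint]  # P(Z|X) = [0.75, 0.25]

q = [0.6, 0.4]  # an arbitrary variational distribution over Z

# ELBO: E_q[log P(Z, X)] - E_q[log q(Z)]
elbo = sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, joint))
# KL(q || p): the gap between the log-evidence and the ELBO
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

# The decomposition log P(X) = L(q) + KL(q||p) holds exactly,
# and by Jensen the ELBO never exceeds the log-evidence.
assert abs(math.log(evidence) - (elbo + kl)) < 1e-12
assert elbo <= math.log(evidence)
print(round(elbo, 4), round(math.log(evidence), 4))  # -0.9704 -0.9163
```

Setting q equal to the true posterior drives the KL term to zero and makes the bound tight.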

2.2. Deriving q(\mathbf Z)

Now that we have an evidence lower bound containing a tractable distribution P(\mathbf Z, \mathbf X) (good) and an unknown distribution over all latent variables q(\mathbf Z) (not so good), we still need a way to specify q(\mathbf Z). Under mean-field variational Bayes, we make an assumption: the latent variables can be factorized into several independent sets \{\mathbf Z_i\} (specified by the user), i.e.,

(3)   \begin{equation*}  q(\mathbf Z) = \prod_i q_i(\mathbf Z_i| \mathbf X) \end{equation*}

Plugging Eq. 3 into the lower bound Eq. 2, we can derive the optimal solution of q_i, while the others are held fixed, as (expressed in logarithm):

    \[ \log q_i^*(\mathbf Z_i) = \mathbb E_{-i} [\log P(\mathbf Z, \mathbf X)] + \text{const} \]


Exponentiating both sides and normalizing, we obtain

    \begin{equation*} \begin{split} q_i^*(\mathbf Z_i) &\propto \exp \{ \mathbb E_{-i} [\log P(\mathbf Z, \mathbf X)] \} \\ &\propto \exp \{ \mathbb E_{-i} [\log P(\mathbf Z_i|\mathbf Z_{-i}, \mathbf X)] \} \end{split} \end{equation*}


where the expectation is taken with respect to all factors except q_i:

    \[ \mathbb E_{-i}[\log P(\mathbf Z_i|\mathbf Z_{-i}, \mathbf X)] = \int \log P(\mathbf Z_i |\mathbf Z_{-i}, \mathbf X) \prod_{j \neq i} q_j(\mathbf Z_j) d\mathbf Z_{-i} \]

The optimal q_i(\mathbf Z_i) can be derived from here, although it might be difficult to work with for some models (for most common models it is not).

Once the optimal form of q_i for every \mathbf Z_i is found, we can alternately update each factor until convergence (which is guaranteed since the ELBO is convex with respect to each individual q_i). Note that the convergence point is only a local optimum.
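To make the coordinate ascent concrete, here is a classic textbook example rather than a model from this post: the factorized approximation of a correlated bivariate Gaussian (as in Bishop’s PRML, Section 10.1.2). The optimal Gaussian factors have means satisfying coupled fixed-point equations, which we iterate to convergence; all numbers are my own toy choices.

```python
# Target: z ~ N(mu, inv(lam)), lam a 2x2 precision matrix.
# Mean-field q(z1)q(z2) has Gaussian factors whose means obey
#   m1 = mu1 - (lam12/lam11) * (m2 - mu2)
#   m2 = mu2 - (lam21/lam22) * (m1 - mu1)
mu = [1.0, -2.0]
lam = [[2.0, 0.8],
       [0.8, 1.5]]  # symmetric positive definite precision matrix

m = [0.0, 0.0]      # variational means, arbitrary initialization
for _ in range(100):
    # Coordinate ascent: update each factor given the current other factor.
    m[0] = mu[0] - (lam[0][1] / lam[0][0]) * (m[1] - mu[1])
    m[1] = mu[1] - (lam[1][0] / lam[1][1]) * (m[0] - mu[0])

# The means converge to the true mean; the factor variances, however,
# underestimate the true marginal variances (a known mean-field effect).
print([round(v, 6) for v in m])  # [1.0, -2.0]
```

The iteration is a contraction here (the coupling ratio is well below one), so convergence is geometric, illustrating the guaranteed coordinate-wise ascent mentioned above.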

2.2.1. q(\mathbf Z) for Exponential Family Conditionals

If P(\mathbf Z_i |\mathbf Z_{-i}, \mathbf X) is in the exponential family, then the optimal q_i(\mathbf Z_i) is in the same family of distributions as P(\mathbf Z_i |\mathbf Z_{-i}, \mathbf X). This provides a shortcut for doing mean-field variational inference on graphical models with exponential-family conditionals (which is common for many graphical models): using this result, we can simply write down the optimal variational distributions with their undetermined variational parameters, and then setting the derivative of the ELBO w.r.t. those parameters to zero (or using gradient ascent) forms a coordinate ascent learning procedure. See David Blei’s notes.

3. Expectation Propagation – A glance

Like the mean-field approximation, EP also tries to find an approximating distribution q by minimizing a KL-divergence. However, different from mean-field, which minimizes KL(q||p) by maximizing the ELBO, EP minimizes the reverse KL-divergence KL(p||q) directly, which can be difficult for arbitrary distributions but is practical for some, such as exponential families. The details may come in future posts.


Synthetic Experiments with PLSA


Probabilistic Latent Semantic Analysis [?] is a technique for modeling the topic structure of a given set of documents. It is similar to LDA (Latent Dirichlet Allocation) [?], except that the latter has a better (more complete) generative formulation. In this post, I conduct some simple experiments testing PLSA’s ability to recover the true parameters (here I focus on \beta, the word distributions over topics).

The generative process for the data is described in this post. To make visualization easier, we will not use a random \beta; instead, \beta is set to be visualization-friendly as suggested by [?]:



Here, each small box is a \beta_k for topic k, and each small area inside a box indicates a word; a white area means the probability of that word for the given topic is high, while a black area means the reverse. These \beta boxes are generated by the following procedure: consider 10 topics and 25 words; for each topic, \beta_k is a 1x25 vector; reshape it into a 5x5 box and uniformly fill 5 entries according to the above figure.

After the data is generated, we can fit it with PLSA (optimized by the EM algorithm) and see the results below. To get a more thorough understanding of the performance, I vary some parameters of the generative process and the learning process of PLSA to create different data and learning scenarios.
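For reference, a minimal pure-Python sketch of PLSA’s EM updates might look as follows. This is my own toy implementation (the function name and the tiny corpus are invented for illustration), not the exact code behind the figures.

```python
import math, random

random.seed(0)

def plsa_em(counts, K, iters=50):
    """EM for PLSA. counts[d][w] = word counts; returns (theta, beta, logliks).

    theta[d][k] = P(topic k | doc d), beta[k][w] = P(word w | topic k)."""
    D, W = len(counts), len(counts[0])
    theta = [[random.random() + 0.1 for _ in range(K)] for _ in range(D)]
    beta = [[random.random() + 0.1 for _ in range(W)] for _ in range(K)]
    for row in theta + beta:
        s = sum(row)
        row[:] = [v / s for v in row]

    logliks = []
    for _ in range(iters):
        new_theta = [[0.0] * K for _ in range(D)]
        new_beta = [[0.0] * W for _ in range(K)]
        ll = 0.0
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: responsibilities P(z | d, w) ∝ P(z|d) P(w|z)
                post = [theta[d][k] * beta[k][w] for k in range(K)]
                norm = sum(post)
                ll += counts[d][w] * math.log(norm)
                # M-step accumulators: expected counts per (d,k) and (k,w)
                for k in range(K):
                    r = counts[d][w] * post[k] / norm
                    new_theta[d][k] += r
                    new_beta[k][w] += r
        for row in new_theta + new_beta:
            s = sum(row)
            row[:] = [v / s for v in row]
        theta, beta = new_theta, new_beta
        logliks.append(ll)
    return theta, beta, logliks

# Two documents with disjoint vocabularies -> two clean topics.
counts = [[10, 8, 0, 0], [0, 0, 7, 9]]
theta, beta, logliks = plsa_em(counts, K=2)
# EM guarantees a non-decreasing data log-likelihood.
assert all(b >= a - 1e-9 for a, b in zip(logliks, logliks[1:]))
```

The scenarios below only change the inputs to such a procedure (corpus size, document length, topic skew, and the number of topics K used for learning).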

Scenario I – Prototype data

For the prototype setting, I used 100 documents, each with 100 expected words; \alpha is set to 0.2 so that documents have more skewed topic distributions. Here are some of the documents:


You can see they are quite mixed, but some patterns still appear. For the document-word matrix, we can re-organize it so that documents with the same (highest) topic appear together (note this re-organization requires the ground-truth \theta, which is not known to PLSA beforehand):


Here you can see the patterns even more clearly, so it should not be too difficult for PLSA to recover parameters like \beta.

The figure below is what PLSA recovered:


Compared with the original \beta boxes, they are quite similar, I would say (there are still some differences from the true \beta, but not far off). So, not surprisingly, PLSA recovers the parameters fairly well.

Scenario II – More uniform topic distributions

Since the prototype is the most basic and simplest setting, it is not surprising that PLSA works. Now, consider documents with much more uniform topic distributions, which can be achieved by setting a greater \alpha, say 2. The other settings are the same as in Scenario I. Here are some of the documents and the re-organized document-word matrix:


Compared with the prototype, it is now much more difficult for a human to discover any patterns. Now, how does PLSA do on recovering \beta:



Worse, I would say, but it still counts as fair. So PLSA seems to be good at handling highly mixed topics.

Scenario III – Few words

Now I will use the same settings as in the prototype, except for the number of expected words per document: it is now 10 instead of 100. Some of the documents and the re-organized document-word matrix:


This is what PLSA recovered:


Well, this is not very good: quite sparse compared with the true \beta. So fewer words per document are definitely not good for PLSA.

Scenario IV – Few documents

How about few documents? Now change the number of documents from 100 in the prototype to 10. Again, some of the documents and the re-organized document-word matrix:


Okay, see the recovered \beta:


Wow… this is… extremely bad… So we learn that a small number of documents is definitely bad for PLSA.

Scenario V – Different sized topics (small topics)

All of the above assumes the sizes of the different topics are uniformly distributed, as we use the same \alpha for each topic. Now let’s change that: I will use [1 2 3 4 5 6 7 8 9 10] for the 10 different topics; other than this, all settings are the same as in the prototype. See some of the documents and the re-organized document-word matrix:


What the… I don’t think any human can learn anything about \beta from these figures. But how much can PLSA recover?


Hoho, not much. This suggests that for topics of different sizes, PLSA performs poorly.

Scenario VI – Learning with more topics

In all the above scenarios, PLSA was trained with the true number of topics. What if we don’t know it and use the wrong number? Here I use the same settings as in the prototype, but when learning PLSA, instead of the true number of topics, which is 10, I use 20. Now see what PLSA recovered:


It seems most topics are broken down into several topics, but this wouldn’t matter too much as long as each topic-word distribution is not polluted by other topics, and a linear combination of the learned topics can still recover the true topics well.

Scenario VII – Learning with fewer topics

How about fewer topics? Here I use a topic number of 5. Recovered \beta:


Huh… these look… merged together? Definitely not very good.


So here comes a summary of PLSA’s performance in recovering the true parameter \beta:

  • PLSA can recover the hidden topics well if the set of documents is more like the “prototype”: many documents, many words per document, and each document not highly mixed over many topics.
  • When the mixing of topics is high, so that each document’s topic distribution is more uniform, PLSA performs a little worse, but still fairly well; a high mixing of topics is not a nightmare for PLSA.
  • Few words per document and few documents in the corpus are nightmares; especially when there are only a few documents, PLSA is likely to fail.
  • When topics have various sizes, with some topics very popular, appearing in many documents, and some topics cold, appearing in only a few, PLSA also performs worse, especially on the cold topics.
  • When training with an inexact number of topics, it is not very problematic if the number used is slightly larger than the real one, but it is if the number is smaller.

These conclusions from the synthetic experiments are somewhat similar to the conclusions of ICML 2014’s best paper ([?], Section 4), though that paper addresses LDA and provides much, much deeper studies than those here.

Some Existing Work on Linked Documents Analysis


Different from Topic Modeling (which analyzes the topics of documents), Linked Document Analysis (or Topic Modeling with Links) changes the problem setting by considering the links among documents. The goal is still similar, which is to analyze the topics of documents, except that sometimes we may also want to predict the presence of a queried link. The basic idea behind Linked Document Analysis is that, on one hand, link existence provides useful information about the topic distributions, and on the other hand, topic distributions provide useful information about link existence; in other words, they are correlated. In this post, I review some existing work on topic modeling with links. It can basically be categorized into generative approaches and discriminative approaches: the former try to explain why links exist or not via the topic/membership distributions of nodes, while the latter try to utilize existing links in a discriminative way to better model the topic/membership distributions.

Generative Approaches

  • Citation Influence Model [?]: The basic idea is that if a document has cited others, the way it generates the topics for word occurrences is either (1) draw from the topic distribution of its cited documents, or (2) draw from its own topic distribution. Other similar models that utilize the citation as a pathway of drawing topic assignments: [?].
  • PHITS+PLSA [?]: it treats each document as a “word” when performing link generation; it is like running PLSA twice, once on document-word and once on document-document. This method was later extended into “Link-LDA“, which puts a prior on the topic distribution, extending PLSA to LDA. Both models suffer from not (explicitly) utilizing word topical information for link analysis (link analysis is purely based on link co-occurrence), and they also fail to model the topical dependence between the cited and citing documents explicitly.
  • Link-PLSA-LDA [?]: overcomes the difficulties of PHITS+PLSA and “Link-LDA” by connecting the topic-word and topic-document distributions through a symmetric parametrization of the merged cited documents.
  • LDA+MMSB [?]: this may be the most natural generative process for both words and links; it doesn’t treat documents as a kind of dictionary as in PHITS+PLSA and Link-PLSA-LDA, and thus generalizes easily to new documents without previously seen links. However, due to its generative process, this model may not be scale-friendly (though subsampling techniques might help alleviate the computational complexity).
  • LDA+ZAlign (RTM [?]): defines link probability according to the alignment of the latent variables Z; the alignment is quantified by a dot product, and the transformation from alignment to probability goes through either a sigmoid function or an exponential function. The authors pointed out a common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB): due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets and allow links to be explained by one subset of topics and words by the other; to improve over this, RTM enforces that words and links are drawn from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is not scale-friendly either.
  • Others [?].

Discriminative Approaches

  • PLSA+\thetaReg (NetPLSA [?]): adds a regularization loss on \theta, which acts like smoothing, to the original PLSA (log-)likelihood. This model can easily deal with weighted links.
  • PLSA+\thetaCond (iTopicModel [?]): defines a conditional distribution of \theta_i given its neighbors, which leads to a likelihood of the \theta configuration given the network of documents, and then adds this conditional \theta log-likelihood as a smoothing term to the original PLSA log-likelihood. Like PLSA+\thetaReg, this model can easily deal with weighted links; unlike it, this model fits directed networks more naturally, and it also has a closed-form solution for \theta, so it is much friendlier for optimization.
  • Others [?].

Gibbs Sampling for Mixed Membership Stochastic Blockmodels


Mixed Membership Stochastic Blockmodels (MMSB) [?] is a generative model for links in a network. Simply put, given an adjacency matrix E, MMSB assumes E is governed by the memberships/topics of each node. Since the original paper [?] only introduced variational inference, this post derives a Gibbs sampling method for parameter estimation in MMSB (actually the MMSB presented in this post is slightly different from [?], as I put a prior on the interaction matrix B).

The generative process of MMSB is given below:

For each node i:

Draw a membership distribution vector \theta_i \sim Dir(\alpha)

For each entry (m,n) in B:

Draw B_{mn} \sim Beta(\xi^1_{mn}, \xi^2_{mn})

For each pair of nodes (i,j):

Draw a membership Z_{i: i\rightarrow j} \sim Mult(\theta_i)

Draw a membership Z_{j: i\rightarrow j} \sim Mult(\theta_j)

Draw E_{ij} \sim Ber(B_{Z_{i: i\rightarrow j},Z_{j: i\rightarrow j}})

Here is a simple explanation for the notations:

\alpha is a K-dimensional (K = number of topics) hyperparameter for the membership/topic distributions.

\xi^1_{mn}, \xi^2_{mn} are two scalar hyperparameters for B_{mn}.

\theta_i is the membership/topic distribution for node i.

B is the interaction probability matrix of memberships.

Z_{i: i\rightarrow j} is the membership of node i when it interacts with node j (the links can be treated as directed or undirected). Note that Z_{i\rightarrow j} is an abbreviation for \{Z_{i: i\rightarrow j},Z_{j: i\rightarrow j}\}.

E is the adjacency matrix, each entry is either one or zero.
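The generative process above can be simulated directly; below is a minimal pure-Python sketch (the network size, hyperparameter values, and helper names are my own illustrative choices):

```python
import random

random.seed(42)

def dirichlet(alpha):
    """Sample from Dir(alpha) via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def categorical(p):
    """Sample an index from a discrete distribution p."""
    u, acc = random.random(), 0.0
    for k, pk in enumerate(p):
        acc += pk
        if u < acc:
            return k
    return len(p) - 1

def sample_mmsb(N, K, alpha=0.1, xi1=1.0, xi2=5.0):
    """Forward-sample an MMSB network: memberships theta, interactions B, edges E."""
    theta = [dirichlet([alpha] * K) for _ in range(N)]           # theta_i ~ Dir(alpha)
    B = [[random.betavariate(xi1, xi2) for _ in range(K)]        # B_mn ~ Beta(xi1, xi2)
         for _ in range(K)]
    E = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            zi = categorical(theta[i])   # Z_{i: i->j} ~ Mult(theta_i)
            zj = categorical(theta[j])   # Z_{j: i->j} ~ Mult(theta_j)
            E[i][j] = 1 if random.random() < B[zi][zj] else 0  # E_ij ~ Ber(B_{zi,zj})
    return theta, B, E

theta, B, E = sample_mmsb(N=20, K=3)
assert all(e in (0, 1) for row in E for e in row)
```

Networks sampled this way can serve as ground-truth data for checking the Gibbs sampler derived below.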

Later on, I am also going to use the following notations:

m_{ip} is the count of node i being assigned to membership p when it interacts with the rest of the nodes in the network. For a directed network, it can be divided into two parts: m_{ip\rightarrow} and m_{ip\leftarrow} (m_{ip} = m_{ip\rightarrow}+m_{ip\leftarrow}); the former counts the assignments of node i when it links to other nodes, the latter when other nodes link to it.

n_{pq+} is the count of linked node pairs with membership assignments p and q (summing (p,q) and (q,p) if direction is not considered).

n_{pq-} is the count of unlinked node pairs with membership assignments p and q (summing (p,q) and (q,p) if direction is not considered).

To perform Gibbs sampling, we need the joint probability of the data and the latent variables Z. Here it is:

(1)   \begin{equation*} \begin{split} &P(E, Z|\alpha,\xi^1,\xi^2) \\ =&\prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, \xi^1,\xi^2) P (Z_{i\rightarrow j}|\alpha) \\ =& \int \prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, B) P(B|\xi^1,\xi^2) dB \\ & \ \ \ \ \prod_i \int \prod_j P(Z_{i:i\rightarrow j}|\theta)P(Z_{i:j\rightarrow i}|\theta) P(\theta|\alpha)d\theta \\ =& \int \prod_{pq} B_{pq}^{n_{pq+}} (1-B_{pq})^{n_{pq-}} \prod_{pq} \frac{B_{pq}^{\xi^1-1} (1-B_{pq})^{\xi^2-1}}{Beta(\xi^1,\xi^2)} dB \\ & \ \ \ \ \prod_i \int \prod_p \theta_{ip}^{m_{ip}} \frac{1}{Beta(\alpha)} \prod_p \theta_{ip}^{\alpha_p-1} d\theta_i \\ =& \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)} \prod_i \frac{Beta(m_i+\alpha)}{Beta(\alpha)} \end{split} \end{equation*}

Now that we have obtained the joint probability of the data and latent variables, we can derive the conditional probability of a latent variable pair as follows:

(2)   \begin{equation*} \begin{split} &P(Z_{i\rightarrow j}|Z_{-(i\rightarrow j)}, E, \alpha,\xi^1,\xi^2) \\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{ij}|E_{-(ij)}, Z_{-(i\rightarrow j)})P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ \propto& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \end{split} \end{equation*}


(3)   \begin{equation*} \begin{split} &P(Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|E_{ij}, \alpha,\xi^1,\xi^2)\\ \propto& \frac{P(E, Z_{-(i\rightarrow j)}, Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \\ =&\frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(n^{-(i\rightarrow j)}_{pq+}+\xi^1, n^{-(i\rightarrow j)}_{pq-}+\xi^2)} \\ & \ \frac{Beta(m_i+\alpha)}{Beta(m^{-(i \rightarrow j)}_i+\alpha)}\frac{Beta(m_j+\alpha)}{Beta(m^{-(i \rightarrow j)}_j+\alpha)} \\ \propto& \frac{(n^{-(i\rightarrow j)}_{pq+}+\xi^1)^{E_{ij}} (n^{-(i\rightarrow j)}_{pq-}+\xi^2)^{1-E_{ij}}}{n^{-(i\rightarrow j)}_{pq+}+n^{-(i\rightarrow j)}_{pq-}+\xi^1+\xi^2}  (m^{-(i\rightarrow j)}_{ip}+\alpha)(m^{-(i\rightarrow j)}_{jq}+\alpha) \end{split} \end{equation*}

Given the samples of Z generated by Gibbs sampling, the parameters can be estimated as follows:

For \theta, we have

(4)   \begin{equation*} \begin{split} &P(\theta_i|Z_i, \alpha) \\ \propto& \prod_j  P(Z_{i:i\rightarrow j}|\theta_i) P(Z_{i:j\rightarrow i}|\theta_i) P(\theta_i|\alpha)\\ \propto& \prod_p \theta^{m_{ip}+\alpha_p-1}_{ip} \end{split} \end{equation*}

Eq. 4 is a Dirichlet p.d.f.; using the mean of a Dirichlet random variable, we get the expectation (mean) of \theta:

    \[ \theta_{ip} \propto m_{ip} + \alpha_p \]

For B, we have

(5)   \begin{equation*} \begin{split} &P(B_{pq}|Z, \xi^1, \xi^2) \\ \propto&\prod_{(i,j)} P(E_{ij}|B) P(B_{pq}|\xi^1,\xi^2)  \\ \propto& B_{pq}^{n_{pq+}+\xi^1-1} (1-B_{pq})^{n_{pq-}+\xi^2-1} \end{split} \end{equation*}

Eq. 5 is a Beta p.d.f.; using the mean of a Beta random variable, we get the expectation (mean) of B:

    \[ B_{pq} = \frac{n_{pq+}+\xi^1}{n_{pq+}+n_{pq-}+\xi^1+\xi^2} \]

One may want to draw multiple decorrelated samples of Z and average the above parameter estimates over them.

Likelihood and perplexity

Until now, I haven’t mentioned the formula for calculating the likelihood of the data P(E), which can be important for testing convergence (alternatively, one can use P(E|Z)). In [?], the authors mention an approximation technique for computing P(E): it is the harmonic mean of the values P(E|Z^{(n)}) over N samples, where Z^{(n)} is sampled from the posterior P(Z|E). That is:

    \[ P(E) = \frac{N}{\sum_n^N \frac{1}{P(E|Z^{(n)})}} \]


    \[ Z^{(n)} \sim P(Z|E) \]

Using the previous derivation, we can easily compute that:

(6)   \begin{equation*} \begin{split} &\log P(E|Z^{(n)}) \\ =&\log \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)}\\ =& \sum_{pq} \log \Gamma(n_{pq+}+\xi^1) + \log \Gamma(n_{pq-}+\xi^2) - \log \Gamma(n_{pq+}+n_{pq-}+\xi^1+\xi^2) \\ &+ C \end{split} \end{equation*}

Where C is a constant:

    \[ C = K^2 [\log \Gamma(\xi^1+\xi^2) - \log\Gamma(\xi^1) - \log\Gamma(\xi^2)] \]
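Eq. 6 is a sum of log-Gamma terms over the K \times K count tables, which maps directly onto `math.lgamma`. A minimal sketch (function name mine):

```python
from math import lgamma

def log_edge_likelihood(n_plus, n_minus, xi1, xi2):
    """log P(E|Z) per Eq. 6:
    sum_{pq} [lgamma(n_{pq+}+xi1) + lgamma(n_{pq-}+xi2)
              - lgamma(n_{pq+}+n_{pq-}+xi1+xi2)] + C,
    with C = K^2 [lgamma(xi1+xi2) - lgamma(xi1) - lgamma(xi2)].
    """
    K = len(n_plus)
    total = K * K * (lgamma(xi1 + xi2) - lgamma(xi1) - lgamma(xi2))
    for p in range(K):
        for q in range(K):
            total += (lgamma(n_plus[p][q] + xi1)
                      + lgamma(n_minus[p][q] + xi2)
                      - lgamma(n_plus[p][q] + n_minus[p][q] + xi1 + xi2))
    return total
```

As a sanity check, with all counts zero the sum cancels C exactly and the log-likelihood of an empty count table is 0, as expected.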

Now with \log P(E|Z^{(n)}), we can compute \log P(E) in the following way:

(7)   \begin{equation*} \begin{split} \log P(E) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Where t_n = \log P(E|Z^{(n)}), and t_0 is introduced so that the log-sum-exp can be computed without numerical underflow; usually we can set t_0 = \min_n \{t_n\}.
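Eq. 7 with the t_0 shift is the usual log-sum-exp trick; here is a small sketch of the harmonic-mean estimator taking the per-sample log-likelihoods t_n as input (function name mine):

```python
import numpy as np

def harmonic_mean_log_evidence(t):
    """log P(E) ~= log N - log sum_n exp(t0 - t_n) + t0  (Eq. 7),
    where t_n = log P(E|Z^(n)) and t0 = min_n t_n, so that every
    exponent t0 - t_n is <= 0 and cannot overflow."""
    t = np.asarray(t, dtype=float)
    t0 = t.min()
    return np.log(len(t)) - np.log(np.exp(t0 - t).sum()) + t0
```

When all samples give the same log-likelihood c, the estimator reduces to c, which is a quick way to check the algebra.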

The likelihood of held-out data is sometimes also a useful quantity: it reflects the generalized predictive power of the model, and it is usually measured by perplexity, which is essentially the inverse of the geometric mean per-link likelihood. It is defined by:

    \[ perplexity(H) = \exp \{ - \frac{\sum_{(i,j) \in H} \log P(E_{ij}|E_{observed})}{|H|} \} \]

This can be evaluated by first holding out a subset of links and non-links, and then computing:

(8)   \begin{equation*} \begin{split} &P(E_{ij}|E_{observed}) \\ \approx &\sum_{Z_{i:i\rightarrow j}} \sum_{Z_{j:i\rightarrow j}} P(E_{ij}|Z_{i:i \rightarrow j}, Z_{j:i \rightarrow j}, \hat B) P(Z_{i: i \rightarrow j}|\hat\theta_i) P(Z_{j:i\rightarrow j}|\hat\theta_j) \end{split} \end{equation*}

Where \hat B and \hat \theta are estimated from the training data.
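Since \hat\theta_i and \hat\theta_j are just probability vectors, the double sum in Eq. 8 is a bilinear form \hat\theta_i^\top M \hat\theta_j with M_{pq} = \hat B_{pq}^{E_{ij}} (1-\hat B_{pq})^{1-E_{ij}}. A sketch of the predictive probability and the resulting perplexity (function names mine):

```python
import numpy as np

def link_predictive_prob(E_ij, theta_i, theta_j, B):
    """Eq. 8: sum_{p,q} theta_ip theta_jq B_pq^E (1-B_pq)^(1-E),
    written as a bilinear form over the K x K table."""
    M = B if E_ij == 1 else 1.0 - B
    return float(theta_i @ M @ theta_j)

def perplexity(holdout, theta, B):
    """Perplexity over held-out (i, j, E_ij) triples: the inverse
    geometric mean of the per-link predictive likelihoods."""
    log_probs = [np.log(link_predictive_prob(e, theta[i], theta[j], B))
                 for i, j, e in holdout]
    return float(np.exp(-np.mean(log_probs)))
```

A perplexity close to 1 means the model predicts the held-out links almost perfectly; larger values mean weaker predictive power.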


The Mixed Membership Stochastic Blockmodel is not scalable, as it requires examining O(N^2) edges (a lot even for a medium-sized network with only thousands of nodes!), in sharp contrast with the sparse networks that arise more often in real datasets. In practice, instead of using all the non-existing edges, we can sample a small portion of them, which effectively reduces the computational complexity, especially for sparse networks where we expect the needed sampling ratio of non-existing edges to be even smaller. However, the relationship between the loss of performance and the sampling ratio of non-existing links remains to be examined.
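To make the subsampling idea concrete, here is a minimal sketch of building the training pair set from all observed links plus a random subset of non-links; the function name and the `neg_ratio` parameter (number of sampled non-links per link) are my own choices, not from any particular MMSB implementation.

```python
import random

def sample_training_pairs(edges, num_nodes, neg_ratio=1.0, seed=0):
    """Keep all observed directed links, but only a random subset of
    non-links, so the sampler touches far fewer than O(N^2) pairs.

    edges     : set of (i, j) directed links
    neg_ratio : how many non-links to sample per observed link
    """
    rng = random.Random(seed)
    num_neg = int(neg_ratio * len(edges))
    negatives = set()
    # Rejection-sample ordered pairs until enough non-links are found;
    # fine for sparse graphs where most random pairs are non-links.
    while len(negatives) < num_neg:
        i, j = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if i != j and (i, j) not in edges:
            negatives.add((i, j))
    return list(edges), list(negatives)
```

The Gibbs sweep would then iterate only over these positive and sampled negative pairs instead of all N(N-1) ordered pairs.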

Also due to the scalability issue, one may consider MMSB without a prior on each document's topic distribution, which can then be trained by EM rather than Gibbs sampling. One can also derive Gibbs sampling for the combined LDA and MMSB [?] in a way similar to Gibbs sampling for LDA (see here) and for MMSB individually (the key is to derive P(W,E,Z) \propto P(W|Z^W) P(E|Z^E) P(Z), where P(Z) = \int P(\theta) P(Z^E|\theta) P(Z^W|\theta) d\theta).