Some Existing Work on Linked Documents Analysis

Different from Topic Modeling (which analyzes the topic of documents), Linked Documents Analysis (or Topic Modeling with Links) changes the problem setting by considering the links among documents. The goal is still similar, which is to analyze the topic of documents, expect that sometime we may also want to predict the presence of some queried link. The basic idea behind Linked Document Analysis is that, on one hand, link existence can provide some useful information on the topic distribution, on the other hand, topic distribution can also provide useful information on link existence; in other words, they are correlated. In this post, I am going to review some existing work on topic modeling with links. Basically they can be categorized into generative approaches and discriminative approaches, while former tries to explain why link existing or non-existing by topic/membership distributions of nodes, and the latter tries to utilize existing links in a discriminative way to better model topic/membership distributions.

Generative Approaches

  • Citation Influence Model [1]: The basic idea is that if a document has cited others, the way it generates the topics for word occurrences is either (1) draw from the topic distribution of its cited documents, or (2) draw from its own topic distribution. Other similar models that utilize the citation as a pathway of drawing topic assignments: [2].
  • PHITS+PLSA [3, 4]: it treats each document as “words” when performing link generation, it is like PLSA twice, once on document-word, once on document-document. This method was later extended into “Link-LDA“, which puts a prior on topic distribution extending PLSA to LDA. Both models suffer from not (explicitly) utilizing word topical information for link analysis, link analysis is purely based on the information of link co-occurrence, and also fails to model the topical dependence between the cited and citing documents explicitly.
  • Link-PLSA-LDA[4]: overcoming the difficulties in PHITS+PLSA and “Link-LDA” by connecting word in topic-word and document topic-document by symmetric parametrization of merged cited documents.
  • LDA+MMSB[5]: this may be the most natural generative process for both words and links, it doesn’t treat documents as some kind of dictionaries as in PHITS+PLSA and Link-PLSA-LDA,  and thus can be generalized easily to new documents without previously seen links. However, due to the generative process, this model may not be scale-friendly (though some subsampling techniques might help alleviate the computational complexity).
  • LDA+ZAlign (RTM[6]): defining link probability according to the alignment of latent variables Z. the alignment is quantified by dot production, and the transformation from alignment to probability is through either a Sigmoid function or an Exponential function. The authors pointed out an common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB) that due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets, and allow for links to be explained by one subset of topics, and the words to be explained by the other; to improve over this, they enforce the word and link are draw from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is also not scale-friendly.
  • Others [7].

Discriminative Approaches

  • PLSA+\thetaReg (NetPLSA[8]): adding regularization loss, acts like smoothing, on \theta to original PLSA (log-)likelihood. This model can easily deal weighted links.
  • PLSA+\thetaCond (iTopicModel[9]): defining a condition of \theta_i given its neighbors, which leads to an likelihood of \theta configuration given network of documents, then add this conditioned \theta log-likelihood as a smoothing to original PLSA log-likelihood. Compared with PLSA+\thetaReg, samely this model can also easily deal weighted links, differently this model can fit directed networks more naturally, and also it has a closed form solution of \theta so it is much friendly for optimization compared with PLSA+\thetaReg.
  • Others [10].
[1] L. Dietz, S. Bickel, and T. Scheffer, “Unsupervised prediction of citation influences,” in Proceedings of the 24th international conference on machine learning, 2007, pp. 233-240.
title={Unsupervised prediction of citation influences},
author={Dietz, Laura and Bickel, Steffen and Scheffer, Tobias},
booktitle={Proceedings of the 24th international conference on Machine learning},
[2] A. Gruber, M. Rosen-Zvi, and Y. Weiss, “Latent topic models for hypertext,” Arxiv preprint arxiv:1206.3254, 2012.
title={Latent topic models for hypertext},
author={Gruber, Amit and Rosen-Zvi, Michal and Weiss, Yair},
journal={arXiv preprint arXiv:1206.3254},
[3] D. C. T. Hofmann, “The missing link-a probabilistic model of document content and hypertext connectivity.” 2001.
title={The missing link-a probabilistic model of document content and hypertext connectivity},
author={Hofmann, David Cohn Thomas},
[4] R. Nallapati and W. W. Cohen, “Link-plsa-lda: a new unsupervised model for topics and influence of blogs..” 2008.
title={Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs.},
author={Nallapati, Ramesh and Cohen, William W},
[5] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, “Joint latent topic models for text and citations,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, 2008, pp. 542-550.
title={Joint latent topic models for text and citations},
author={Nallapati, Ramesh M and Ahmed, Amr and Xing, Eric P and Cohen, William W},
booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},
[6] J. Chang and D. M. Blei, “Relational topic models for document networks,” in International conference on artificial intelligence and statistics, 2009, pp. 81-88.
title={Relational topic models for document networks},
author={Chang, Jonathan and Blei, David M},
booktitle={International Conference on Artificial Intelligence and Statistics},
[7] Y. Zhu, X. Yan, L. Getoor, and C. Moore, “Scalable text and link analysis with mixed-topic link models,” in Proceedings of the 19th acm sigkdd international conference on knowledge discovery and data mining, 2013, pp. 473-481.
title={Scalable text and link analysis with mixed-topic link models},
author={Zhu, Yaojia and Yan, Xiaoran and Getoor, Lise and Moore, Cristopher},
booktitle={Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining},
[8] Q. Mei, D. Cai, D. Zhang, and C. Zhai, “Topic modeling with network regularization,” in Proceedings of the 17th international conference on world wide web, 2008, pp. 101-110.
title={Topic modeling with network regularization},
author={Mei, Qiaozhu and Cai, Deng and Zhang, Duo and Zhai, ChengXiang},
booktitle={Proceedings of the 17th international conference on World Wide Web},
[9] Y. Sun, J. Han, J. Gao, and Y. Yu, “Itopicmodel: information network-integrated topic modeling,” in Data mining, 2009. icdm’09. ninth ieee international conference on, 2009, pp. 493-502.
title={itopicmodel: Information network-integrated topic modeling},
author={Sun, Yizhou and Han, Jiawei and Gao, Jing and Yu, Yintao},
booktitle={Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on},
[10] T. Yang, R. Jin, Y. Chi, and S. Zhu, “Combining link and content for community detection: a discriminative approach,” in Proceedings of the 15th acm sigkdd international conference on knowledge discovery and data mining, 2009, pp. 927-936.
title={Combining link and content for community detection: a discriminative approach},
author={Yang, Tianbao and Jin, Rong and Chi, Yun and Zhu, Shenghuo},
booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining},

Gibbs Sampling for Mixed Membership Stochastic Blockmodels

Mixed Membership Stochastic Blockmodels (MMSB)[1] is a generative model for links in a network. Simply put, give an adjacency matrix E, MMSB assume E is governed by the memberships/topics  of each nodes. Since the original paper[1] only introduced the Variational Inference, this post derives a Gibbs Sampling method for the purpose of parameter estimation in MMSB (actually the MMSB presented in this post is slightly different from[1], as I put a prior on interaction matrix B).

The generative process of MMSB is given below:

For each node i:

Draw a membership distribution vector \theta_i \sim Dir(\alpha)

For each entry (m,n) in B:

Draw B_{mn} \sim Beta(\xi^1_{mn}, \xi^2_{mn})

For each pair of nodes (i,j):

Draw a membership Z_{i: i\rightarrow j} \sim Mult(\theta_i)

Draw a membership Z_{j: i\rightarrow j} \sim Mult(\theta_j)

Draw E_{ij} \sim Ber(B_{Z_{i: i\rightarrow j},Z_{j: i\rightarrow j}})

Here is a simple explanation for the notations:

\alpha is K (number of topics) dimensional hyperparameter for membership/topic distribution.

\xi^1_{mn}, \xi^2_{mn} are two scalar hyperparameters for B_{mn}.

\theta_d is the membership/topic distribution for node d.

B is the interaction probability matrix of memberships.

Z_{i: i\rightarrow j} is the membership of node i when it interacts with node j (the link can be treated as directional, as well as unidirectional). Note that Z_{i\rightarrow j} is abbreviated for \{Z_{i: i\rightarrow j},Z_{j: i\rightarrow j}\}.

E is the adjacency matrix, each entry is either one or zero.

Later on, I am also going to use the following notations:

 m_{ip} is the count of node i assigning to membership p when it interacts with the rest nodes in the network. For directional network, it can be divided into two sets: m_{ip\rightarrow} and m_{ip\leftarrow} (m_{ip} =m_{ip\rightarrow}+m_{ip\leftarrow}), the former is the assignment of node i when it links to other nodes, the latter is the assignment of node i when other nodes link to it.

n_{pq+} is the count of linked node pairs with membership assignments p and q (sum up (p,q) and (q,p) if don’t consider direction).

n_{pq-} is the count of un-linked node pairs with membership assignment p and q (sum up (p,q) and (q,p) if don’t consider direction).

To perform Gibbs Sampling, we need to find the posterior for joint probability of both data and latent variables Z. Well, here it is:

(1)   \begin{equation*} \begin{split} &P(E, Z|\alpha,\xi^1,\xi^2) \\ =&\prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, \xi^1,\xi^2) P (Z_{i\rightarrow j}|\alpha) \\ =& \int \prod_{(i,j)} P(E_{ij}|Z_{i\rightarrow j}, B) P(B|\xi^1,\xi^2) dB \\ & \ \ \ \ \prod_i \int \prod_j P(Z_{i:i\rightarrow j}|\theta)P(Z_{i:j\rightarrow i}|\theta) P(\theta|\alpha)d\theta \\ =& \int \prod_{pq} B_{pq}^{n_{pq+}} (1-B_{pq})^{n_{pq-}} \prod_{pq} \frac{B_{pq}^{\xi^1-1} (1-B_{pq}^{\xi^2-1})}{Beta(\xi^1,\xi^2)} dB \\ & \ \ \ \ \prod_i \int \prod_p \theta_{ip}^{m_{ip}} \frac{1}{Beta(\alpha)} \prod_p \theta_{ip}^{\alpha_p-1} d\theta_i \\ &= \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)} \prod_i \frac{Beta(m_i+\alpha)}{Beta(\alpha)} \end{split} \end{equation*}

Now that we obtained the joint probability of data and latent variables, we can start to derive the conditioned probability of latent variable as following:

(2)   \begin{equation*} \begin{split} &P(Z_{i\rightarrow j}|Z_{-(i\rightarrow j)}E_{ij}, \alpha,\xi^1,\xi^2) \\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ =& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{ij})P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)}\\ \propto& \frac{P(E, Z|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \end{split} \end{equation*}


(3)   \begin{equation*} \begin{split} &P(Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|E_{ij}, \alpha,\xi^1,\xi^2)\\ \propto& \frac{P(E, Z_{-(i\rightarrow j)}, Z_{i:i\rightarrow j}=p, Z_{j:i\rightarrow j}=q|\alpha,\xi^1,\xi^2)}{P(E_{-(ij)}, Z_{-(i\rightarrow j)}|\alpha,\xi^1,\xi^2)} \\ =&\frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(n^{-(i\rightarrow j)}_{pq+}+\xi^1, n^{-(i\rightarrow j)}_{pq-}+\xi^2)} \\ & \ \frac{Beta(m_i+\alpha)}{Beta(m^{-(i \rightarrow j)}_i+\alpha)}\frac{Beta(m_j+\alpha)}{Beta(m^{-(i \rightarrow j)}_j+\alpha)} \\ \propto& \frac{(n^{-(i\rightarrow j)}_{pq+}+\xi^1)^{E_{ij}} (n^{-(i\rightarrow j)}_{pq-}+\xi^2)^{1-E_{ij}}}{n^{-(i\rightarrow j)}_{pq+}+n^{-(i\rightarrow j)}_{pq-}+\xi^1+\xi^2}  (m^{-(i\rightarrow j)}_{ip}+\alpha)(m^{-(i\rightarrow j)}_{jq}+\alpha) \end{split} \end{equation*}

Given samples of Z generated from Gibbs Sampling, parameters can be estimated in the following ways:

For \theta, we have

(4)   \begin{equation*} \begin{split} &P(\theta_i|Z_i, \alpha) \\ =& \prod_j  P(Z_{i:i\rightarrow j}|\theta_i) P(Z_{i:j\rightarrow i}|\theta_i) P(\theta_i|\alpha)\\ \propto& \prod_p \theta^{m_{ip}+\alpha-1}_{ip} \end{split} \end{equation*}

E.q. 4 is a Dirichlet p.d.f, using Dirichlet variable property, we get the expectation(mean) of \theta:

    \[ \theta_{ip} \propto m_{ip} + \alpha_p \]

For B, we have

(5)   \begin{equation*} \begin{split} &P(B_{pq}|Z, \xi^1, \xi^2) \\ =&\prod_{(i,j)} P(E_{ij}|B) P(B_{pq}|\xi^1,\xi^2)  \\ \propto& B_{pq}^{n_{pq+}+\xi^1-1} (1-B_{pq})^{n_{pq-}+\xi^1-1} \end{split} \end{equation*}

E.q. 5 is a Beta p.d.f, using Beta variable property, we get the expectation(mean) of B:

    \[ B_{pq} = \frac{n_{pq+}+\xi^1}{n_{pq+}+n_{pq-}+\xi^1+\xi^2} \]

One may want to sample multiple de-correlated samples of \{Z\} to calculate the above parameters.

Likelihood and perplexity

Until now, I didn’t mention the formula for calculating the likelihood of data P(E), which can be important some time for testing of convergence (or using P(E|Z)). In [2], the authors mentioned an approximation technique for computing P(E), it is the harmonic mean of a set of values of P(E|Z^{(n)}) from N samples, where Z^{(n)} is sampled from the posterior P(Z|E). So that is:

    \[ P(E) = \frac{N}{\sum_n^N \frac{1}{P(E|Z^{(n)})}} \]


    \[ Z^{(n)} \sim P(Z|E) \]

Using the previous derivation, we can easily compute that:

(6)   \begin{equation*} \begin{split} &\log P(E|Z^{(n)}) \\ =&\log \prod_{pq} \frac{Beta(n_{pq+}+\xi^1, n_{pq-}+\xi^2)}{Beta(\xi^1,\xi^2)}\\ =& \sum_{pq} \log \Gamma(n_{np+}+\xi^1) + \log \Gamma(n_{np-}+\xi^2) - \log \Gamma(n_{pq+}+n_{pq-}+\xi^1+\xi^2) \\ &+ C \end{split} \end{equation*}

Where C is a constant:

    \[ C = K^2 [\log \Gamma(\xi^1+\xi^2) - \log\Gamma(\xi^1) - \log\Gamma(\xi^2)] \]

Now with \log P(E|Z^{(n)}), we can compute \log P(W) in the way:

(7)   \begin{equation*} \begin{split} \log P(E) =& \log N - \log \sum_n e^{-t_n} \\ =& \log N - \log \sum_n e^{t_0 - t_n} + t_0 \end{split} \end{equation*}

Where t_n = \log P(W|Z^{(n)}), t_0 is introduced to make log-sum-exp computed correctly, usually we can set t_0 = \min \{t_n\}.

The likelihood of hold out data sometime is also an useful quantity, it reflects generalized predictive power of the model, and usually it is measured by perplexity, essentially it is the inverse of the geometric mean per-link likelihood. It is defined by:

    \[ perplexity(H) = \exp \{ - \frac{\sum_{(i,j \in H)} \log P(E_{ij}|E_{observed})}{|H|} \} \]

This term can be evaluated by first hold out some subset of links and non-links, and evaluated by computing:

(8)   \begin{equation*} \begin{split} &P(E_{ij}|E_{observed}) \\ \approx &\sum_{Z_{i:i\rightarrow j}} \sum_{Z_{j:i\rightarrow j}} P(E_{ij}|Z_{i:i \rightarrow j}, \hat B) P(E_{ij}|Z_{j:i \rightarrow j}, \hat B) P(Z_{i: i \rightarrow j}|\hat\theta_i) P(Z_{j:i\rightarrow j}|\hat\theta_j) \end{split} \end{equation*}

Where \hat B and \hat \theta are estimated in the training data.


The Mixed Membership Stochastic Blockmodels is not scalable as it requires to examining O(N^2) edges (this is a lot even for medium size network with only thousands of nodes!), in sharp contrast to sparse networks as arise more often in real datasets. But in practice, for those non-existing edges, instead of using them all, we can sample a small portion of them, this could effectively reduce the computational complexity, especially for sparse networks where we expect the sample ratio of non-existing edges to be even fewer. However, it is still remained to be examined that the relationship between the loss of performance and sampling ratio of non-existing links.

Also due to the scalability issue, one may consider the MMSB without prior for each document topic distribution, which can be also by EM rather than Gibbs Sampling. Also, one can derive Gibbs Sampling for combined LDA and MMSB [3] in similar ways to Gibbs Sampling for LDA (see here) and MMSB individually (basically key is to derive P(W,E,Z) \propto P(W|Z^W) P(E|Z^E) P(Z), where P(Z) = \int P(\theta) P(Z^E|\theta) P(Z^W|\theta) d\theta).

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, “Mixed membership stochastic blockmodels,” in Advances in neural information processing systems, 2009, pp. 33-40.
title={Mixed membership stochastic blockmodels},
author={Airoldi, Edoardo M and Blei, David M and Fienberg, Stephen E and Xing, Eric P},
booktitle={Advances in Neural Information Processing Systems},
[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the national academy of sciences of the united states of america, vol. 101, iss. Suppl 1, pp. 5228-5235, 2004.
title={Finding scientific topics},
author={Griffiths, Thomas L and Steyvers, Mark},
journal={Proceedings of the National academy of Sciences of the United States of America},
number={Suppl 1},
publisher={National Acad Sciences}
[3] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, “Joint latent topic models for text and citations,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, 2008, pp. 542-550.
title={Joint latent topic models for text and citations},
author={Nallapati, Ramesh M and Ahmed, Amr and Xing, Eric P and Cohen, William W},
booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},