Some Existing Work on Linked Documents Analysis

Different from Topic Modeling (which analyzes the topic of documents), Linked Documents Analysis (or Topic Modeling with Links) changes the problem setting by considering the links among documents. The goal is still similar, which is to analyze the topic of documents, expect that sometime we may also want to predict the presence of some queried link. The basic idea behind Linked Document Analysis is that, on one hand, link existence can provide some useful information on the topic distribution, on the other hand, topic distribution can also provide useful information on link existence; in other words, they are correlated. In this post, I am going to review some existing work on topic modeling with links. Basically they can be categorized into generative approaches and discriminative approaches, while former tries to explain why link existing or non-existing by topic/membership distributions of nodes, and the latter tries to utilize existing links in a discriminative way to better model topic/membership distributions.

Generative Approaches

  • Citation Influence Model [1]: The basic idea is that if a document has cited others, the way it generates the topics for word occurrences is either (1) draw from the topic distribution of its cited documents, or (2) draw from its own topic distribution. Other similar models that utilize the citation as a pathway of drawing topic assignments: [2].
  • PHITS+PLSA [3, 4]: it treats each document as “words” when performing link generation, it is like PLSA twice, once on document-word, once on document-document. This method was later extended into “Link-LDA“, which puts a prior on topic distribution extending PLSA to LDA. Both models suffer from not (explicitly) utilizing word topical information for link analysis, link analysis is purely based on the information of link co-occurrence, and also fails to model the topical dependence between the cited and citing documents explicitly.
  • Link-PLSA-LDA[4]: overcoming the difficulties in PHITS+PLSA and “Link-LDA” by connecting word in topic-word and document topic-document by symmetric parametrization of merged cited documents.
  • LDA+MMSB[5]: this may be the most natural generative process for both words and links, it doesn’t treat documents as some kind of dictionaries as in PHITS+PLSA and Link-PLSA-LDA,  and thus can be generalized easily to new documents without previously seen links. However, due to the generative process, this model may not be scale-friendly (though some subsampling techniques might help alleviate the computational complexity).
  • LDA+ZAlign (RTM[6]): defining link probability according to the alignment of latent variables Z. the alignment is quantified by dot production, and the transformation from alignment to probability is through either a Sigmoid function or an Exponential function. The authors pointed out an common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB) that due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets, and allow for links to be explained by one subset of topics, and the words to be explained by the other; to improve over this, they enforce the word and link are draw from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is also not scale-friendly.
  • Others [7].

Discriminative Approaches

  • PLSA+\thetaReg (NetPLSA[8]): adding regularization loss, acts like smoothing, on \theta to original PLSA (log-)likelihood. This model can easily deal weighted links.
  • PLSA+\thetaCond (iTopicModel[9]): defining a condition of \theta_i given its neighbors, which leads to an likelihood of \theta configuration given network of documents, then add this conditioned \theta log-likelihood as a smoothing to original PLSA log-likelihood. Compared with PLSA+\thetaReg, samely this model can also easily deal weighted links, differently this model can fit directed networks more naturally, and also it has a closed form solution of \theta so it is much friendly for optimization compared with PLSA+\thetaReg.
  • Others [10].
[1] L. Dietz, S. Bickel, and T. Scheffer, “Unsupervised prediction of citation influences,” in Proceedings of the 24th international conference on machine learning, 2007, pp. 233-240.
title={Unsupervised prediction of citation influences},
author={Dietz, Laura and Bickel, Steffen and Scheffer, Tobias},
booktitle={Proceedings of the 24th international conference on Machine learning},
[2] A. Gruber, M. Rosen-Zvi, and Y. Weiss, “Latent topic models for hypertext,” Arxiv preprint arxiv:1206.3254, 2012.
title={Latent topic models for hypertext},
author={Gruber, Amit and Rosen-Zvi, Michal and Weiss, Yair},
journal={arXiv preprint arXiv:1206.3254},
[3] D. C. T. Hofmann, “The missing link-a probabilistic model of document content and hypertext connectivity.” 2001.
title={The missing link-a probabilistic model of document content and hypertext connectivity},
author={Hofmann, David Cohn Thomas},
[4] R. Nallapati and W. W. Cohen, “Link-plsa-lda: a new unsupervised model for topics and influence of blogs..” 2008.
title={Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs.},
author={Nallapati, Ramesh and Cohen, William W},
[5] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, “Joint latent topic models for text and citations,” in Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining, 2008, pp. 542-550.
title={Joint latent topic models for text and citations},
author={Nallapati, Ramesh M and Ahmed, Amr and Xing, Eric P and Cohen, William W},
booktitle={Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining},
[6] J. Chang and D. M. Blei, “Relational topic models for document networks,” in International conference on artificial intelligence and statistics, 2009, pp. 81-88.
title={Relational topic models for document networks},
author={Chang, Jonathan and Blei, David M},
booktitle={International Conference on Artificial Intelligence and Statistics},
[7] Y. Zhu, X. Yan, L. Getoor, and C. Moore, “Scalable text and link analysis with mixed-topic link models,” in Proceedings of the 19th acm sigkdd international conference on knowledge discovery and data mining, 2013, pp. 473-481.
title={Scalable text and link analysis with mixed-topic link models},
author={Zhu, Yaojia and Yan, Xiaoran and Getoor, Lise and Moore, Cristopher},
booktitle={Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining},
[8] Q. Mei, D. Cai, D. Zhang, and C. Zhai, “Topic modeling with network regularization,” in Proceedings of the 17th international conference on world wide web, 2008, pp. 101-110.
title={Topic modeling with network regularization},
author={Mei, Qiaozhu and Cai, Deng and Zhang, Duo and Zhai, ChengXiang},
booktitle={Proceedings of the 17th international conference on World Wide Web},
[9] Y. Sun, J. Han, J. Gao, and Y. Yu, “Itopicmodel: information network-integrated topic modeling,” in Data mining, 2009. icdm’09. ninth ieee international conference on, 2009, pp. 493-502.
title={itopicmodel: Information network-integrated topic modeling},
author={Sun, Yizhou and Han, Jiawei and Gao, Jing and Yu, Yintao},
booktitle={Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on},
[10] T. Yang, R. Jin, Y. Chi, and S. Zhu, “Combining link and content for community detection: a discriminative approach,” in Proceedings of the 15th acm sigkdd international conference on knowledge discovery and data mining, 2009, pp. 927-936.
title={Combining link and content for community detection: a discriminative approach},
author={Yang, Tianbao and Jin, Rong and Chi, Yun and Zhu, Shenghuo},
booktitle={Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining},
  • chentingpc

    Further reading on other related work:
    1. Unsupervised Prediction of Citation Influences
    2. Latent Topic Models for Hypertext
    3. Combining Content and Link for Classification using Matrix Factorization
    4. Link-based Classification using Labeled and Unlabeled Data
    Just a reminder for myself.

    • chentingpc

      Generalized Latent Factor Models for Social Network Analysis: Has some results for comparison.