Unlike Topic Modeling, which analyzes the topics of documents alone, Linked Document Analysis (or Topic Modeling with Links) changes the problem setting by also considering the links among documents. The goal is still to analyze the topics of documents, except that sometimes we may also want to predict the presence of a queried link. The basic idea behind Linked Document Analysis is that, on one hand, link existence provides useful information about the topic distribution, and on the other hand, the topic distribution provides useful information about link existence; in other words, the two are correlated. In this post, I am going to review some existing work on topic modeling with links. These methods can be roughly categorized into generative approaches and discriminative approaches: the former try to explain why a link exists or does not exist through the topic/membership distributions of the nodes, while the latter use existing links in a discriminative way to better model the topic/membership distributions.
- Citation Influence Model [?]: The basic idea is that if a document cites others, it generates the topic for each word occurrence by either (1) drawing from the topic distribution of one of its cited documents, or (2) drawing from its own topic distribution. Other similar models also use citations as a pathway for drawing topic assignments: [?].
- PHITS+PLSA [?]: it treats each document as a "word" when performing link generation, so it is like running PLSA twice, once on document-word and once on document-document. This method was later extended into "Link-LDA", which puts a prior on the topic distribution, extending PLSA to LDA. Both models suffer from not (explicitly) using word topical information for link analysis: link analysis is based purely on link co-occurrence, and the models also fail to explicitly capture the topical dependence between the cited and citing documents.
- Link-PLSA-LDA [?]: overcomes the difficulties in PHITS+PLSA and "Link-LDA" by connecting the topic-word and topic-document distributions through a symmetric parametrization of the merged cited documents.
- LDA+MMSB [?]: this may be the most natural generative process for both words and links. It does not treat documents as some kind of dictionary, as PHITS+PLSA and Link-PLSA-LDA do, and thus generalizes easily to new documents without previously seen links. However, because the generative process has to account for every pair of documents, this model may not be scale-friendly (though some subsampling techniques might help alleviate the computational complexity).
- LDA+ZAlign (RTM [?]): defines the link probability according to the alignment of the latent variables Z. The alignment is quantified by a dot product, and the transformation from alignment to probability is done through either a sigmoid function or an exponential function. The authors pointed out a common weakness of previous models (PHITS+PLSA, Link-PLSA-LDA, LDA+MMSB): due to their underlying exchangeability assumptions, they might divide the topics into two independent subsets and allow the links to be explained by one subset of topics and the words by the other. To improve on this, RTM enforces that the words and links are drawn from the same set of topic assignments. In terms of scalability, since it also has to consider both existing and non-existing links, it is not scale-friendly either.
- Others [?].
- PLSA+Reg (NetPLSA [?]): adds a regularization loss, which acts like smoothing, to the original PLSA (log-)likelihood. This model can easily deal with weighted links.
- PLSA+Cond (iTopicModel [?]): defines a conditional distribution for each document given its neighbors, which leads to a likelihood of the configuration given the network of documents, and then adds this conditional log-likelihood as a smoothing term to the original PLSA log-likelihood. Like PLSA+Reg, this model can easily deal with weighted links; unlike PLSA+Reg, it fits directed networks more naturally, and it also has a closed-form solution, so it is much friendlier for optimization.
- Others [?].
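
To make two of the ideas above more concrete, here is a minimal Python sketch, not the authors' implementations. The first part illustrates the LDA+ZAlign (RTM) link function: the alignment of two documents is a weighted dot product of their mean topic assignments, which is then pushed through a sigmoid or an exponential. The second part illustrates a PLSA+Reg style smoothing term, assuming the regularizer is a weighted squared difference between the topic distributions of linked documents (that is my reading of the idea, not a quoted formula). All names (`mean_topic_assignment`, `link_probability`, `graph_smoothness_penalty`, `eta`, `nu`) and the toy numbers are hypothetical.

```python
import numpy as np

def mean_topic_assignment(z, num_topics):
    """Empirical topic proportions of one document from its per-word topic ids."""
    counts = np.bincount(np.asarray(z), minlength=num_topics)
    return counts / counts.sum()

def link_probability(zbar_a, zbar_b, eta, nu, kind="sigmoid"):
    """Map the weighted dot-product alignment of two documents to a link probability.

    eta (per-topic weights) and nu (offset) are assumed parameters of this sketch.
    """
    score = float(np.dot(eta, zbar_a * zbar_b) + nu)
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-score))
    if kind == "exponential":
        # exp(score) is only a valid probability when score <= 0; clipping here
        # is a guard added for this sketch, not part of any published model.
        return float(np.exp(min(score, 0.0)))
    raise ValueError(f"unknown link function: {kind}")

def graph_smoothness_penalty(theta, edges):
    """0.5 * sum over weighted links (u, v, w) of w * ||theta[u] - theta[v]||^2."""
    total = 0.0
    for u, v, w in edges:
        diff = theta[u] - theta[v]
        total += w * float(np.dot(diff, diff))
    return 0.5 * total

# Toy example: documents a and b share a dominant topic, document c does not.
num_topics = 3
zbar_a = mean_topic_assignment([0, 0, 1, 0], num_topics)
zbar_b = mean_topic_assignment([0, 0, 0, 1], num_topics)
zbar_c = mean_topic_assignment([2, 2, 2, 1], num_topics)
eta = np.full(num_topics, 4.0)

p_ab = link_probability(zbar_a, zbar_b, eta, nu=-2.0)
p_ac = link_probability(zbar_a, zbar_c, eta, nu=-2.0)
print(p_ab > p_ac)  # topically aligned documents get the higher link probability

# Weighted links: (doc index, doc index, weight).
theta = np.vstack([zbar_a, zbar_b, zbar_c])
edges = [(0, 1, 1.0), (0, 2, 2.0)]
penalty = graph_smoothness_penalty(theta, edges)
```

In a PLSA+Reg style objective, a penalty like this would be combined with the negative PLSA log-likelihood through a trade-off weight, so linked documents are pulled toward similar topic distributions. Edge weights enter simply as multipliers, which is why weighted links are easy to handle.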