Synthetic Experiments with PLSA

Probabilistic Latent Semantic Analysis (PLSA) [1] is a technique for modeling the topic structure of a given set of documents. It is similar to LDA (Latent Dirichlet Allocation) [2], except that LDA is a more complete generative model. In this post, I conduct some simple experiments to test PLSA's ability to recover the true parameters (here I focus on \beta, the per-topic word distributions).

The generative process for the data is described in this post. To make visualization easier, we will not use a random \beta; instead, \beta is set to a visualization-friendly pattern, as suggested by [3]:

[Figure: doc100-prototype-beta — the true \beta, one 5x5 box per topic]

Here, each small box is the \beta_k for topic k, and each cell inside a box corresponds to a word: white cells mean the word has high probability under that topic, while black cells mean the reverse. These \beta boxes are generated by the following procedure: with 10 topics and 25 words, each \beta_k is a 1x25 vector; reshape it into a 5x5 box and spread the probability mass uniformly over the 5 entries shown in the figure above.
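Concretely, assuming the bar pattern of [3] (5 horizontal and 5 vertical bars on the 5x5 word grid; the exact layout in the figure may differ), the true \beta can be built with a sketch like this:

```python
import numpy as np

def make_bar_topics(grid=5):
    """Build the 'bar' topics of [3]: grid horizontal and grid vertical
    bars on a grid x grid vocabulary; each topic's row sums to 1."""
    K, V = 2 * grid, grid * grid
    beta = np.zeros((K, V))
    for i in range(grid):
        box = np.zeros((grid, grid))
        box[i, :] = 1.0 / grid          # horizontal bar: topic i
        beta[i] = box.ravel()
        box = np.zeros((grid, grid))
        box[:, i] = 1.0 / grid          # vertical bar: topic grid + i
        beta[grid + i] = box.ravel()
    return beta

beta_true = make_bar_topics()           # shape (10, 25), rows sum to 1
```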

After the data is generated, we can fit PLSA to it (optimized by the EM algorithm) and see the results below. To get a more thorough understanding of the performance, I vary some parameters of the generative process and of the PLSA learning process to create different data and learning scenarios.
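For concreteness, here is a minimal EM sketch for PLSA on a document-word count matrix W (documents x words). This is my own illustration of the standard updates, not the exact code used for these experiments; `n_topics`, `n_iters`, and `seed` are parameters I introduce for it:

```python
import numpy as np

def plsa_em(W, n_topics, n_iters=100, seed=0):
    """PLSA via EM on a (docs x words) count matrix W.
    Returns theta (docs x topics) = P(z|d) and beta (topics x words) = P(w|z)."""
    rng = np.random.default_rng(seed)
    D, V = W.shape
    theta = rng.random((D, n_topics)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((n_topics, V)); beta /= beta.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z);
        # materializing the full (D, K, V) array is fine for tiny synthetic corpora
        p = theta[:, :, None] * beta[None, :, :]
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: expected counts n(d,w) * P(z|d,w), then renormalize
        nz = W[:, None, :] * p
        beta = nz.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = nz.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta
```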

Scenario I – Prototype data

For the prototype setting, I use 100 documents, each with 100 words in expectation, and \alpha is set to 0.2 so that documents have more skewed topic distributions.
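The corpus is then sampled following the generative process from the earlier post; a minimal sketch of what I assume here (LDA-style sampling, with a Poisson document length as my own assumption) is:

```python
import numpy as np

def generate_corpus(beta, n_docs=100, expected_len=100, alpha=0.2, seed=0):
    """Sample a (docs x words) count matrix W: theta_d ~ Dirichlet(alpha),
    document length ~ Poisson(expected_len), each word's topic ~ theta_d,
    word ~ beta[topic]. The Poisson length is my assumption; the original
    post may fix the length instead."""
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    W = np.zeros((n_docs, V), dtype=int)
    theta = rng.dirichlet(alpha * np.ones(K), size=n_docs)
    for d in range(n_docs):
        n_words = rng.poisson(expected_len)
        topics = rng.choice(K, size=n_words, p=theta[d])
        for z in topics:
            W[d, rng.choice(V, p=beta[z])] += 1
    return W, theta

W, theta_true = generate_corpus(beta_true)
```

Here are some of the documents generated this way: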

[Figure: doc100-prototype-Wsome — a few sampled documents, each shown as a 5x5 word-count box]

You can see they are quite mixed, but some patterns still appear. For the document-word matrix, we can reorganize it so that documents sharing the same (highest-probability) topic appear together (note that this reorganization requires the ground-truth \theta, which is not known to PLSA beforehand).
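This reordering is just a sort of the rows of W by each document's dominant true topic:

```python
order = np.argsort(theta_true.argmax(axis=1))  # group docs by dominant true topic
W_grouped = W[order]
```

The reorganized matrix: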

[Figure: doc100-prototype-W — the reorganized document-word matrix]

Here you can see the patterns even more clearly, so it should not be too difficult for PLSA to recover parameters like \beta.

The figure below is what PLSA recovered:

[Figure: doc100-prototype-beta-plsa — \beta recovered by PLSA]

Compared with the original \beta boxes, they are quite similar, I would say (there are still some differences from the true \beta, but not far off). So, not surprisingly, PLSA recovers the parameters fairly well.
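The recovered topics come back in arbitrary order, so to make "quite similar" a bit more concrete (this evaluation is my own addition, using the sketches above), one can greedily match each true topic to its closest recovered topic under total variation distance:

```python
theta_est, beta_est = plsa_em(W, n_topics=10)

def match_topics(beta_true, beta_est):
    """Greedily pair each true topic with its closest remaining recovered
    topic; distance is total variation, 0 = identical, 1 = disjoint."""
    pairs, unused = [], list(range(beta_est.shape[0]))
    for k in range(beta_true.shape[0]):
        tv = [0.5 * np.abs(beta_true[k] - beta_est[j]).sum() for j in unused]
        best = int(np.argmin(tv))
        pairs.append((k, unused.pop(best), min(tv)))
    return pairs

for k, j, d in match_topics(beta_true, beta_est):
    print(f"true topic {k} -> recovered topic {j}, TV distance {d:.3f}")
```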

Scenario II – More uniform topic distributions

Since the prototype is the most basic and simplest setting, it is not surprising that PLSA works. Now consider documents with much more uniform topic distributions, which can be achieved by setting a greater \alpha, say 2. All other settings are the same as in Scenario I. Here are some of the documents and the reorganized document-word matrix:

[Figures: doc100-highoverlapping-Wsome, doc100-highoverlapping-W — sample documents and the reorganized document-word matrix]

Compared with the prototype, it is now much more difficult for a human to discover any patterns. Now let's see how PLSA does at recovering \beta:

[Figure: doc100-highoverlapping-beta-plsa — \beta recovered by PLSA]

Worse, I would say, but it still counts as fair. So PLSA seems to handle highly mixed topics reasonably well.

Scenario III – Few words

Now I use the same settings as in the prototype, changing only the expected number of words per document: 10 instead of 100. Some of the documents and the reorganized document-word matrix:

[Figures: doc100-fewwords-Wsome, doc100-fewwords-W — sample documents and the reorganized document-word matrix]

This is what PLSA recovered:

[Figure: doc100-fewwords-beta-plsa — \beta recovered by PLSA]

Well, this is not very good: quite sparse compared with the true \beta. So fewer words per document is definitely not good for PLSA.

Scenario IV – Few documents

How about few documents? Now change the number of documents from 100 in the prototype to 10. Again, some of the documents and the reorganized document-word matrix:

[Figures: doc100-fewdocs-beta-plsa, doc100-fewdocs-W — sample documents and the reorganized document-word matrix]

Okay, now see the recovered \beta:

[Figure: doc100-fewdocs-beta-plsa — \beta recovered by PLSA]

Wow… this is extremely bad. So we learn that a small number of documents is definitely bad for PLSA.

Scenario V – Different sized topics (small topics)

In all the scenarios above, we assume the sizes of the different topics are uniformly distributed, since we use the same \alpha for every topic. Now let's change that: I use \alpha = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] for the 10 topics; other than this, all settings are the same as in the prototype. Some of the documents and the reorganized document-word matrix:

[Figures: doc100-diffsize-Wsome, doc100-diffsize-W — sample documents and the reorganized document-word matrix]

What the… I don't think any human can learn anything about \beta from these figures. But how much can PLSA recover?

[Figure: doc100-diffsize-beta-plsa — \beta recovered by PLSA]

Hoho, not much… This suggests that for different-sized topics, PLSA performs poorly.

Scenario VI – Learning with more topics

In all the scenarios above, PLSA is learned with the true number of topics. What if we don't know it and use a wrong number of topics? Here I use the same settings as in the prototype, but when learning PLSA, instead of the true number of topics, which is 10, I use 20. Now see what PLSA recovered:

[Figure: doc100-prototype-beta-plsa — \beta recovered by PLSA with 20 topics]

It seems most true topics are broken into several recovered topics, but that would not matter too much as long as each recovered word distribution is not polluted by other topics, so that a linear combination of the recovered topics can still recover the true topics well.
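One way to sanity-check this "linear combination" intuition (my own addition, assuming scipy is available) is to regress each true topic onto the 20 recovered ones with nonnegative least squares; a small residual means the true topic is well covered:

```python
from scipy.optimize import nnls

theta_20, beta_20 = plsa_em(W, n_topics=20)
for k in range(beta_true.shape[0]):
    coef, resid = nnls(beta_20.T, beta_true[k])  # beta_true[k] ~ beta_20.T @ coef
    print(f"true topic {k}: residual {resid:.4f}, uses {np.sum(coef > 1e-3)} topics")
```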

Scenario VII – Learning with fewer topics

How about fewer topics? Here I use 5 topics. The recovered \beta:

[Figure: doc100-prototype-beta-plsa — \beta recovered by PLSA with 5 topics]

Huh… the true topics look like they have been merged together. Definitely not very good.

Conclusions

So here is a summary of PLSA's performance in recovering the true parameter \beta:

  • PLSA can recover the hidden topics well if the set of documents is more like the "prototype": many documents, many words per document, and each document not highly mixed over many topics.
  • When the mixing of topics is high, so that each document's topic distribution is more uniform, PLSA performs a little worse but still fairly well; highly mixed topics are not a nightmare for PLSA.
  • Few words per document and few documents in the corpus are nightmares; especially when there are only a few documents, PLSA is likely to fail.
  • When topics have various sizes, with some topics very popular and appearing in many documents while others are cold and appear in only a few, PLSA also performs worse, especially on the cold topics.
  • When training with an inexact number of topics, it is not very problematic if the number used is somewhat larger than the real one, but it is if the number is smaller.

These conclusions from the synthetic experiments are somewhat similar to those of ICML 2014's best paper ([4], Section 4), although that paper studies LDA and provides a much deeper analysis than the one given here.

[1] T. Hofmann, "Probabilistic latent semantic analysis," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289-296.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[3] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228-5235, 2004.
[4] J. Tang, Z. Meng, X. Nguyen, Q. Mei, and M. Zhang, "Understanding the limiting factors of topic modeling via posterior contraction analysis," in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 190-198.