Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents
Abstract
1. Introduction
1.1. Related Literature
1.2. Organization and Notations
2. Entry-Wise Eigenvector Analysis for Topic Models
2.1. A Normalized Data Matrix
2.2. Entry-Wise Singular Analysis for
3. Improved Rates for Topic Modeling
3.1. The Topic-SCORE Algorithm
Algorithm 1 Topic-SCORE
Input: D, K, and a vertex hunting (VH) algorithm.
Output: the estimated topic matrix $\hat{A}$.
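The algorithm box above gives only the inputs and the output. As a rough illustration of how such a SCORE-type estimator could be organized, the sketch below assumes a common recipe: take the SVD of a normalized data matrix, form entry-wise ratios of singular vectors, run vertex hunting on the ratios, and convert barycentric weights back to topic vectors. The function names `topic_score` and `vertex_hunting_spa`, the normalization M = diag(D 1_n / n), and the use of successive projection as the VH step are assumptions made for illustration; they are not taken from the text above and should not be read as the authors' exact procedure.

```python
import numpy as np

def vertex_hunting_spa(R, K):
    """Successive projection (assumed VH step): indices of K rows of R used as vertices."""
    # Lift each point r to (1, r) so the K simplex vertices become linearly independent.
    Y = np.hstack([np.ones((R.shape[0], 1)), R])
    picked = []
    for _ in range(K):
        j = int(np.argmax(np.linalg.norm(Y, axis=1)))  # row with the largest remaining norm
        picked.append(j)
        u = Y[j] / np.linalg.norm(Y[j])
        Y = Y - np.outer(Y @ u, u)                     # project out the chosen direction
    return picked

def topic_score(D, K, vertex_hunting=vertex_hunting_spa):
    """Sketch of a Topic-SCORE-style estimator for a p x n word-frequency matrix D.

    Assumes every word occurs at least once, so the row means of D are positive.
    """
    p, n = D.shape
    m = D.mean(axis=1)                                 # diagonal of M = diag(D 1_n / n) (assumed)
    U, _, _ = np.linalg.svd(D / np.sqrt(m)[:, None], full_matrices=False)
    Xi = U[:, :K]                                      # first K left singular vectors
    if Xi[:, 0].sum() < 0:                             # fix the sign of the leading vector
        Xi[:, 0] = -Xi[:, 0]
    R = Xi[:, 1:] / Xi[:, [0]]                         # entry-wise ratios, p x (K-1)
    V = R[vertex_hunting(R, K)]                        # K x (K-1) estimated vertices
    # Barycentric coordinates: solve (1, r_j) = pi_j^T (1, V) for every row j.
    V1 = np.hstack([np.ones((K, 1)), V])
    R1 = np.hstack([np.ones((p, 1)), R])
    Pi = np.linalg.solve(V1.T, R1.T).T
    Pi = np.clip(Pi, 0.0, None)                        # drop negative weights caused by noise
    Pi /= Pi.sum(axis=1, keepdims=True)
    A = (np.sqrt(m) * Xi[:, 0])[:, None] * Pi          # undo the two normalizations
    return A / A.sum(axis=0, keepdims=True)            # each column sums to one
```

The columns of the returned matrix are identified only up to a permutation of the K topics, so any comparison with a reference topic matrix has to align columns first.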
3.2. The Improved Rates for Estimating A and W
3.3. Connections and Comparisons
4. Proof Ideas
4.1. Why the Leave-One-Out Technique Fails
4.2. The Proof Structure in [4] and Why It Is Not Sharp for Short Documents
4.3. Non-Stochastic Perturbation Analysis
4.4. Large-Deviation Analysis of
4.5. Proof Sketch of Theorem 1
5. Summary and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Preliminary Lemmas and Theorems
Appendix B. Proofs of Lemmas 1 and 2
Appendix B.1. Proof of Lemma 1
Appendix B.2. Proof of Lemma 2
Appendix C. The Complete Proof of Theorem 1
Appendix D. Entry-Wise Eigenvector Analysis and Proof of Lemma A3
Appendix D.1. Proof of Lemma A3
Appendix D.2. Proof of Lemma A4
- (a) We claim that:
- (b) We claim that under the event :
- (c) We aim to derive a high-probability bound of by the matrix Bernstein inequality (i.e., Theorem A1). We show that with probability , for some large :
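For orientation, one standard form of the matrix Bernstein inequality, as found in Tropp's user-friendly tail bounds listed in the references, is recalled below. The symbols $X_i$, $d_1$, $d_2$, $b$, $\sigma^2$ and $t$ are notation chosen here for illustration, and the exact statement of Theorem A1 may differ in constants or form.

```latex
% Setting (assumed notation): X_1, ..., X_n are independent, mean-zero
% d_1 x d_2 random matrices with \|X_i\| \le b almost surely.
\sigma^2 = \max\Bigl\{ \Bigl\| \sum_{i=1}^n \mathbb{E}\bigl[X_i X_i^\top\bigr] \Bigr\|,\;
                       \Bigl\| \sum_{i=1}^n \mathbb{E}\bigl[X_i^\top X_i\bigr] \Bigr\| \Bigr\},
\qquad
\mathbb{P}\Bigl( \Bigl\| \sum_{i=1}^n X_i \Bigr\| \ge t \Bigr)
  \le (d_1 + d_2)\, \exp\Bigl( \frac{-t^2/2}{\sigma^2 + b t/3} \Bigr)
  \quad \text{for all } t \ge 0 .
```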
Appendix D.3. Proof of Lemma A5
Appendix D.4. Proof of Lemma A6
Appendix E. Proofs of the Rates for Topic Modeling
Appendix E.1. Proof of Theorem 2
Appendix E.2. Proof of Theorem 3
Appendix E.3. Proof of Theorem 4
References
- Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the International ACM SIGIR Conference, Berkeley, CA, USA, 15–19 August 1999; pp. 50–57.
- Blei, D.; Ng, A.; Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Ke, Z.T.; Ji, P.; Jin, J.; Li, W. Recent advances in text analysis. Annu. Rev. Stat. Its Appl. 2023, 11, 347–372.
- Ke, Z.T.; Wang, M. Using SVD for topic modeling. J. Am. Stat. Assoc. 2024, 119, 434–449.
- de la Pena, V.H.; Montgomery-Smith, S.J. Decoupling inequalities for the tail probabilities of multivariate U-statistics. Ann. Probab. 1995, 23, 806–816.
- Arora, S.; Ge, R.; Moitra, A. Learning topic models–going beyond SVD. In Proceedings of the Foundations of Computer Science (FOCS), New Brunswick, NJ, USA, 20–23 October 2012; pp. 1–10.
- Arora, S.; Ge, R.; Halpern, Y.; Mimno, D.; Moitra, A.; Sontag, D.; Wu, Y.; Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 280–288.
- Bansal, T.; Bhattacharyya, C.; Kannan, R. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1997–2005.
- Bing, X.; Bunea, F.; Wegkamp, M. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. Bernoulli 2020, 26, 1765–1796.
- Erdős, L.; Knowles, A.; Yau, H.T.; Yin, J. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. Ann. Probab. 2013, 41, 2279–2375.
- Fan, J.; Wang, W.; Zhong, Y. An L-infinity eigenvector perturbation bound and its application to robust covariance estimation. J. Mach. Learn. Res. 2018, 18, 1–42.
- Fan, J.; Fan, Y.; Han, X.; Lv, J. SIMPLE: Statistical inference on membership profiles in large networks. J. R. Stat. Soc. Ser. B 2022, 84, 630–653.
- Abbe, E.; Fan, J.; Wang, K.; Zhong, Y. Entrywise eigenvector analysis of random matrices with low expected rank. Ann. Stat. 2020, 48, 1452–1474.
- Chen, Y.; Chi, Y.; Fan, J.; Ma, C. Spectral methods for data science: A statistical perspective. Found. Trends® Mach. Learn. 2021, 14, 566–806.
- Ke, Z.T.; Wang, J. Optimal network membership estimation under severe degree heterogeneity. arXiv 2022, arXiv:2204.12087.
- Paul, D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Stat. Sin. 2007, 17, 1617.
- Zipf, G.K. The Psycho-Biology of Language: An Introduction to Dynamic Philology; Routledge: London, UK, 2013.
- Davis, C.; Kahan, W.M. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 1970, 7, 1–46.
- Horn, R.; Johnson, C. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
- Jin, J. Fast community detection by SCORE. Ann. Stat. 2015, 43, 57–89.
- Ke, Z.T.; Jin, J. Special invited paper: The SCORE normalization, especially for heterogeneous network and text data. Stat 2023, 12, e545.
- Donoho, D.; Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 1141–1148.
- Araújo, M.C.U.; Saldanha, T.C.B.; Galvao, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73.
- Jin, J.; Ke, Z.T.; Moryoussef, G.; Tang, J.; Wang, J. Improved algorithm and bounds for successive projection. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
- Wu, R.; Zhang, L.; Tony Cai, T. Sparse topic modeling: Computational efficiency, near-optimal algorithms, and statistical inference. J. Am. Stat. Assoc. 2023, 118, 1849–1861.
- Klopp, O.; Panov, M.; Sigalla, S.; Tsybakov, A.B. Assigning topics to documents by successive projections. Ann. Stat. 2023, 51, 1989–2014.
- Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications; Cambridge University Press: Cambridge, UK, 2012; pp. 210–268.
- Tropp, J. User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 2012, 12, 389–434.
- De la Pena, V.; Giné, E. Decoupling: From Dependence to Independence; Springer Science & Business Media: Berlin, Germany, 2012.
- Freedman, D.A. On tail probabilities for martingales. Ann. Probab. 1975, 3, 100–118.
- Bloemendal, A.; Knowles, A.; Yau, H.T.; Yin, J. On the principal components of sample covariance matrices. Probab. Theory Relat. Fields 2016, 164, 459–552.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ke, Z.T.; Wang, J. Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents. Mathematics 2024, 12, 1682. https://doi.org/10.3390/math12111682