Soft-Prompted Semantic Normalization for Unsupervised Analysis of the Scientific Literature
- Ivan Malashin, Dmitry Martysyuk, Aleksei Borodulin, and 3 other authors
Mapping thematic structure in large scientific corpora enables the systematic analysis of research trends and conceptual organization. This work presents an unsupervised framework that leverages large language models (LLMs) as fixed semantic inference operators guided by structured soft prompts. The framework transforms raw abstracts into normalized semantic representations that reduce stylistic variability while retaining core conceptual content. These representations are embedded into a continuous vector space, where density-based clustering identifies latent research themes without predefining the number of topics. Cluster-level interpretation is performed using LLM-based semantic decoding to generate concise, human-readable descriptions of the discovered themes. Experiments on ICML and ACL 2025 abstracts demonstrate that the method produces coherent clusters reflecting problem formulations, methodological contributions, and empirical contexts. The findings indicate that prompt-driven semantic normalization combined with geometric analysis provides a scalable and model-agnostic approach for unsupervised thematic discovery across large scholarly corpora.
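The abstract's clustering step relies on a density-based method that discovers the number of themes from the data rather than fixing it in advance. As a minimal sketch of that idea, the toy DBSCAN below (pure Python, 2-D points standing in for the high-dimensional abstract embeddings; `eps` and `min_pts` are illustrative parameters, not values from the paper) recovers two dense groups and flags an outlier as noise without being told how many clusters exist.

```python
import math

def dbscan(points, eps=0.5, min_pts=3):
    """Minimal DBSCAN: assigns each point a cluster id or -1 for noise.
    The number of clusters is discovered from density, not predefined."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        cluster += 1  # new core point found: start a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:
                seeds.extend(jn)  # core point: keep growing the cluster
    return labels

# Two tight groups of "embeddings" plus one isolated outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (10.0, 10.0)]
labels = dbscan(points)
print(labels)  # two clusters emerge; the last point is labeled -1 (noise)
```

In the full framework the inputs would be the LLM-normalized abstract embeddings, and the discovered clusters would then be passed to the LLM-based semantic decoding step for human-readable theme descriptions.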
5 March 2026