Article

Topic Analysis of the Literature Reveals the Research Structure: A Case Study in Periodontics

1 Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy
2 Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy
3 Centre for Oral Clinical Research, Institute of Dentistry, Faculty of Medicine and Dentistry, Queen Mary University of London, London E1 2AD, UK
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(1), 7; https://doi.org/10.3390/bdcc9010007
Submission received: 17 October 2024 / Revised: 18 December 2024 / Accepted: 2 January 2025 / Published: 5 January 2025
(This article belongs to the Special Issue Application of Semantic Technologies in Intelligent Environment)

Abstract

Periodontics is a complex field characterized by a constantly growing body of research, which poses a challenge for researchers and stakeholders striving to stay abreast of the evolving literature. Traditional bibliometric surveys, while accurate, are labor-intensive and not scalable to meet the demands of such rapidly expanding domains. In this study, we employed BERTopic, a transformer-based topic modeling framework, to map the thematic landscape of periodontics research published in MEDLINE from 2009 to 2024. We identified 31 broad topics encompassing four major thematic axes—patient management, periomedicine, oral microbiology, and implant-related surgery—thereby illuminating core areas and their semantic relationships. Compared with a conventional Latent Dirichlet Allocation (LDA) approach, BERTopic yielded more contextually nuanced clusters and facilitated the isolation of distinct, smaller research niches. Although some documents remained unlabeled, potentially reflecting either semantic ambiguity or niche topics below the clustering threshold, our results underscore the flexibility, interpretability, and scalability of neural topic modeling in this domain. Future refinements—such as domain-specific embedding models and optimized granularity levels—could further enhance the precision and utility of this method, ultimately guiding researchers, educators, and policymakers in navigating the evolving landscape of periodontics.

1. Introduction

Periodontics is a discipline that aims at preserving—and possibly restoring—the integrity of the supporting structures of the teeth [1]. This specialized branch of dentistry operates at the crossroads of various scientific fields, including oral medicine, oral surgery, and tissue regeneration [2]. Drawing knowledge and techniques from such diverse areas, periodontics has to rely on multidisciplinary approaches to address the intricate pathophysiology of periodontal tissues [3].
The field of periodontics finds itself amidst a rapid evolution, marked by significant inflation in its scientific literature over the past decade [4]. This surge, while indicative of progress, also presents a formidable challenge to scholars—the sheer volume of published articles makes the task of retrieving pertinent information and staying abreast of cutting-edge innovations increasingly challenging [5]. This hyperpublication trend, however, mirrors a similar phenomenon that is occurring across all medical disciplines [6], fueled by a multitude of factors that include, in addition to scientific advancements, a surge in the global scientific community, career incentives toward publication, and the emergence of novel publishing models that support these hypertrophic publishing habits [7].
Investigating such a dynamic and expansive terrain entails understanding evolving research themes and the trends that are animating the scientific community. Narrative reviews are and will presumably remain—in the foreseeable future—the prime tool to obtain an overview of any specific topic in the field, but new methods are required to understand the epistemic structure of this whole area of science. Traditional search methodologies, such as manual searches on peer-reviewed journals or popular databases like MEDLINE [8], are at risk of becoming less effective when used alone against this overwhelming volume of information traffic or, using an effective expression by A. Appadurai, when used against such a hectic infoscape [9]. Automated procedures, on the other hand, e.g., those relying on topic modeling, represent a useful tool for a more comprehensive analysis of the scientific output [10].
Topic modeling is a machine learning task that consists of extracting the subject (the ‘about’) from unlabeled documents [11], i.e., in our case, scientific articles. This allows us to automatically screen large datasets of publications, classifying them according to their topics, which could even be useful for the faster identification of articles of interest [12,13]. While various quantitative methods have been employed for this task in the past [14], recent advancements in deep learning, particularly the use of embeddings, have opened new frontiers in neural approaches to topic modeling [15]. Deep learning models trained on extensive corpora assign vectorial representations to words, or even sentences, based on contextual proximity, yielding dense embeddings that encapsulate semantic similarities to hitherto unattained levels of performance [16].
To analyze the intricate field of periodontics, its most relevant lines of research, and their diachronic development, the present investigation relied on BERTopic, an advanced algorithm implemented by Grootendorst in 2022 [17] that leverages Bidirectional Encoder Representations from Transformers (BERT) embeddings. BERT, introduced by Google in 2018 [18], is built around the attention mechanism to generate contextual embeddings, allowing it to surpass static embedding algorithms, such as Word2Vec [19] or GloVe [20], in several tasks [21,22]. Unlike Word2Vec and GloVe, which provide fixed vector representations for words, BERT generates dynamic embeddings that capture the context-dependent meaning of words. This capability underlies BERT’s superior performance across diverse applications. To improve the quality of the topic representation in terms of human readability, we used the OpenHermes-2.5-Mistral Large Language Model, a form of artificial intelligence that is capable of expressing the topic as a brief phrase, instead of chaining a few representative keywords into a topic label, as BERTopic would do by default [23,24,25].
Therefore, the purpose of this study is to use topic modeling to map out the past and current research pursuits in periodontics, identify prevailing trends, and contribute to a more profound understanding of this very dynamic field. The overarching goal of our investigation is to provide a tool to dynamically monitor a whole field of dental research, which could prove very useful in directing novel research efforts and allocating resources to promising new areas.

2. Materials and Methods

2.1. Dataset

Data were produced, processed, and analyzed using Google Colab Pro notebooks powered by Python 3.10.12 [26] and running on a T4 GPU [27]. The corpus was compiled with the Biopython library [28], which employed the Entrez.esearch function to query MEDLINE. MEDLINE was selected as it provides comprehensive coverage of the biomedical and life sciences literature, ensuring access to a robust and diverse dataset for analysis. The query utilized for this exploration is as follows:
periodont*[All Fields] OR parodont[All Fields] OR periodont*[MeSH Terms]
This query was designed to capture all publications broadly related to periodontology, incorporating both general terms (periodont*) and MeSH-specific indexing terms.
To retrieve data systematically, the following iterative process was implemented:
  • The dataset spanned publications from January 2009 to August 2024.
  • For each publication year, the database was queried on a monthly basis to ensure complete data retrieval and avoid server-side limitations.
  • Retrieved information included PubMed ID (PMID), title, publication year, authors, abstract, and MeSH keywords.
The retrieved data were organized into a pandas DataFrame [29], a widely used Python library for structured data manipulation, which facilitated further preprocessing and analysis.
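The month-by-month retrieval described above can be sketched as follows. The `month_windows` helper is an illustrative reconstruction of the iteration logic (the function name is ours), and the actual `Entrez.esearch` call, which requires network access and an e-mail address registered with Entrez, is shown only as a comment:

```python
from datetime import date

QUERY = "periodont*[All Fields] OR parodont[All Fields] OR periodont*[MeSH Terms]"

def month_windows(start_year, start_month, end_year, end_month):
    """Yield (mindate, maxdate) strings, one calendar month at a time,
    in the YYYY/MM/DD format accepted by the Entrez date filters."""
    y, m = start_year, start_month
    while (y, m) <= (end_year, end_month):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        # last day of the month = day before the first of the next month
        last = date.fromordinal(date(ny, nm, 1).toordinal() - 1)
        yield date(y, m, 1).strftime("%Y/%m/%d"), last.strftime("%Y/%m/%d")
        y, m = ny, nm

# Illustrative Biopython call for one monthly window (not executed here):
# from Bio import Entrez
# Entrez.email = "you@example.org"
# handle = Entrez.esearch(db="pubmed", term=QUERY, datetype="pdat",
#                         mindate="2009/01/01", maxdate="2009/01/31",
#                         retmax=10000)
```

Each (mindate, maxdate) pair keeps individual result sets well below the server-side retrieval limits.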
An analysis of the publications was conducted on their titles. Our decision to place a primary focus on titles stemmed from the recognition that they serve as succinct and concentrated summaries [30]. Authors intentionally craft titles to encapsulate the core theme or essence of their work, making them useful for capturing the fundamental topics of each publication [31].

2.2. Data Analysis

2.2.1. The Dataset

Figure 1 summarizes the steps we followed for topic modeling with BERTopic. The dataset retrieved from MEDLINE underwent minimal preprocessing to preserve the original context and semantic integrity of the titles. Entries without titles were removed, as titles served as the primary focus of this analysis. Unlike previous studies, titles were intentionally not lowercased [32], allowing for the preservation of case-sensitive information that might hold semantic or domain-specific significance in the biomedical literature. Stopwords, defined as common grammatical words that typically carry little semantic meaning, were retained during this stage. This decision was based on the advanced capabilities of BERTopic, which leverages transformer embeddings to capture contextual and semantic relationships even in the presence of stopwords [33].
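A minimal sketch of this preprocessing step, assuming the retrieved records live in a pandas DataFrame with a `title` column (the column names and values below are invented for illustration):

```python
import pandas as pd

# Toy records standing in for the MEDLINE retrieval (values are invented)
df = pd.DataFrame({
    "pmid": ["100", "101", "102"],
    "title": [
        "Periodontal Regeneration with Stem Cells",
        None,  # entry without a title: to be removed
        "Peri-implantitis: a prospective cohort study",
    ],
})

# Drop entries without titles; titles are neither lowercased nor stripped
# of stopwords, so the transformer receives the original text
docs = df.dropna(subset=["title"])["title"].tolist()
```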

2.2.2. Embedding Generation

Embeddings, within natural language processing, are dense vectors that represent information objects—such as words, sentences, or documents—within a multidimensional space, capturing semantic relationships and contextual meaning [34]. Unlike previous embedding algorithms, such as Word2Vec [35], Bidirectional Encoder Representations from Transformers (BERT) can discern the context of a word and generate distinct embeddings for the same term depending on its contextual meaning [36,37]. A pre-trained BERT model is a deep learning architecture trained on large text corpora to capture the contextual relationships of words and sentences: it processes text bidirectionally, analyzing the context both before and after a given word, which allows it to generate embeddings that reflect the meaning of words or phrases in their specific context. This sophistication is particularly useful for understanding the—often nuanced—semantics of titles.
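As a toy illustration of how such vectors encode semantic proximity, cosine similarity compares the angle between embeddings (the three-dimensional vectors below are invented for the example; real sentence embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the standard closeness measure between embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented 3-dimensional "embeddings" for three terms
gingivitis    = np.array([0.9, 0.1, 0.0])
periodontitis = np.array([0.8, 0.2, 0.1])
zirconia      = np.array([0.0, 0.1, 0.9])
```

Semantically related terms end up close in the vector space, so `cosine(gingivitis, periodontitis)` is far larger than `cosine(gingivitis, zirconia)`.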
To create a semantic representation of the titles, we utilized the all-mpnet-base-v2 model from Huggingface. This pre-trained sentence-transformer model generates dense vector representations of text by analyzing bidirectional contextual relationships within the input. Compared to smaller models, such as all-MiniLM-L6-v2, it provides superior performance on biomedical datasets, as validated in our prior research [38]. The embeddings generated by this model are critical for accurately capturing the nuanced semantics of the titles.

2.2.3. Dimensionality Reduction

Given the high dimensionality of the embeddings, we applied Uniform Manifold Approximation and Projection (UMAP) to reduce them to a lower-dimensional space for efficient clustering [39]. UMAP is grounded in mathematical concepts from topology: it constructs a graph representation of the data in the original high-dimensional space and then optimizes the layout of this graph in the lower-dimensional space. For UMAP, the cuML library was used [40].

2.2.4. Clustering

Documents were clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), an unsupervised machine learning algorithm that identifies dense regions in the reduced embedding space to form clusters while marking less dense regions as noise [41]. This approach was chosen for its ability to dynamically determine the number of clusters, making it well suited for exploratory analyses like ours, in which the structure of the data is not predefined. For HDBSCAN, the cuML library was used [40].
After clustering, Class-Based Term Frequency-Inverse Document Frequency (cTf-Idf) was applied to identify topic keywords with higher saliency within each cluster. Tf-Idf, a widely used algorithm in computational linguistics, calculates the importance of a term by increasing its weight based on frequency in a document and decreasing its weight if the term is common across many documents [42]. cTf-Idf extends this approach by calculating term importance across an entire cluster of documents rather than individual ones, ensuring that the extracted keywords reflect the collective content of the cluster and differentiate it from others [43]. To refine the extracted keywords further, stopwords (e.g., common grammatical words) were removed using scikit-learn’s CountVectorizer function [44]. This step improved the interpretability of the topics by ensuring that only meaningful and descriptive terms contributed to the topic representation while the clustering process itself remained unaffected.
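A compact NumPy sketch of the cTf-Idf weighting, following the formulation given by Grootendorst for BERTopic, W(t, c) = tf(t, c) · log(1 + A / f(t)), where tf(t, c) is the count of term t in class (cluster) c, f(t) the total count of t across all classes, and A the average number of words per class:

```python
import numpy as np

def c_tf_idf(class_term_counts):
    """Class-based Tf-Idf: weight each term by its in-cluster frequency,
    discounted by how common the term is across all clusters."""
    counts = np.asarray(class_term_counts, dtype=float)  # shape: (classes, terms)
    f_t = counts.sum(axis=0)             # total frequency of each term
    A = counts.sum() / counts.shape[0]   # average number of words per class
    idf = np.log(1.0 + A / np.maximum(f_t, 1e-12))
    return counts * idf                  # W(t, c), shape: (classes, terms)
```

A term concentrated in a single cluster thus receives a higher weight than an equally frequent term spread across all clusters.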
We manually tuned the following parameters for UMAP dimensionality reduction and HDBSCAN clustering:
  • UMAP metric: cosine;
  • Size of the neighborhood: 25;
  • Number of components: 10;
  • HDBSCAN clustering metric: Euclidean;
  • Minimum cluster size: 250.
All other parameters not explicitly mentioned were left at their default values as provided by the respective software libraries and tools.
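Under these settings, the pipeline can be assembled roughly as follows. The parameter values are those listed above; the library calls themselves (which require the cuml and bertopic packages and a GPU) are shown as comments, so this is a configuration sketch rather than a verbatim reproduction of our notebook:

```python
# Hyperparameters tuned for this study (all others left at library defaults)
UMAP_PARAMS = {
    "n_neighbors": 25,    # size of the neighborhood
    "n_components": 10,   # number of components after reduction
    "metric": "cosine",
}
HDBSCAN_PARAMS = {
    "min_cluster_size": 250,  # density areas smaller than this become noise
    "metric": "euclidean",
}

# Illustrative assembly:
# from cuml.manifold import UMAP
# from cuml.cluster import HDBSCAN
# from bertopic import BERTopic
# topic_model = BERTopic(
#     embedding_model="all-mpnet-base-v2",
#     umap_model=UMAP(**UMAP_PARAMS),
#     hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
# )
# topics, probs = topic_model.fit_transform(titles)
```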

2.2.5. Keyword Refinement

BERTopic allows for topic fine-tuning by incorporating additional models to enhance topic representation. In this study, we employed KeyBERT (KeyBERTInspired package from the bertopic.representation module) and Maximal Marginal Relevance (MMR; the MaximalMarginalRelevance package from the bertopic.representation module) to refine the keywords extracted for each topic. These models were used selectively based on qualitative assessments of the adequacy of the default keywords generated by BERTopic. If the default keywords were deemed insufficient to fully describe a topic, these additional methods were applied to improve the representativeness and diversity of the keywords. A topic in this context is defined as a semantic cluster of documents sharing a common theme, while keywords are terms extracted from the documents in that cluster to summarize and represent the topic. KeyBERT, which was developed by Maarten Grootendorst—the creator of BERTopic—uses the transformers library to extract representative keywords more effectively [45]. The process involves embedding the documents using a pre-trained BERT model, tokenizing the text into smaller units (e.g., words or phrases), and generating embeddings for the candidate keywords and n-grams. These embeddings are then compared to the document embeddings based on semantic similarity, with keywords exhibiting the highest similarity scores selected as the most representative terms [46]. The MMR model complements this process by selecting keywords with greater overall diversity, ensuring that the keywords capture a broader range of the topic’s nuances [47]. Together, these methods enable the generation of more precise and comprehensive topic representations.
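The MMR criterion itself is simple enough to sketch in a few lines of NumPy; this is an illustrative reimplementation of the standard algorithm, not the bertopic.representation code:

```python
import numpy as np

def mmr(doc_emb, cand_embs, candidates, top_n=5, diversity=0.3):
    """Maximal Marginal Relevance: pick keywords that are similar to the
    document embedding but dissimilar to keywords already selected."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    relevance = np.array([cos(doc_emb, c) for c in cand_embs])
    selected = [int(np.argmax(relevance))]          # start with the best match
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            # penalize similarity to any already-selected keyword
            redundancy = max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [candidates[i] for i in selected]
```

With a high `diversity` value, a near-duplicate of an already-selected keyword is passed over in favor of a less similar but more informative one.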

2.2.6. Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) was employed as an additional topic modeling method. LDA is a generative statistical model that identifies latent topics by clustering co-occurring terms within a corpus [48]. Unlike HDBSCAN, which identifies clusters in a high-dimensional embedding space, LDA generates topics as probabilistic distributions over words [49].
The titles of the documents from the dataset were used as inputs for the LDA analysis. Unlike with sentence embeddings, titles were preprocessed using the CountVectorizer function from scikit-learn, which converts textual data into a document–term matrix [50]. The preprocessing included the removal of punctuation, tokenization, and the exclusion of English stopwords, while preserving case sensitivity to maintain semantic integrity. The vocabulary size was limited to the 1000 most frequent terms to enhance computational efficiency and topic coherence.
The LDA model was trained using the scikit-learn implementation, with the number of topics set to 30, based on BERTopic results. To improve the interpretability of the identified topics, a Large Language Model (LLM) was employed to generate concise and human-readable labels, as with BERTopic.
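A toy-scale sketch of this LDA setup with scikit-learn, using the same preprocessing choices as in the study (English stopwords removed, case preserved, vocabulary capped); the four titles are invented, and 2 topics stand in for the 30 used on the full corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

titles = [
    "Periodontal regeneration with stem cells",
    "Stem cell therapy for periodontal defects",
    "Immediate implant placement in extraction sockets",
    "Peri-implantitis around dental implants",
]

# Document-term matrix: stopwords removed, case preserved, vocabulary capped
vectorizer = CountVectorizer(stop_words="english", lowercase=False,
                             max_features=1000)
dtm = vectorizer.fit_transform(titles)

# The study used n_components=30, based on the BERTopic results
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)  # per-document topic distributions
```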

2.3. LLM Labeling

To improve the interpretability of the topics, we used the OpenHermes-2.5-Mistral Large Language Model (LLM) [51] to generate concise and human-readable labels, replacing the default sequence of keywords. Large language models are advanced artificial intelligence systems engineered to understand, interpret, and generate coherent and contextually relevant human language [52]. These models, while powerful, often require substantial computational resources. To address this challenge, quantized LLMs have emerged, which reduce the memory requirements and computational load with minimal performance loss [53].
For this study, we employed the quantized version of OpenHermes-2.5-Mistral-7B-GGUF (openhermes-2.5-mistral-7b.Q4_K_M.gguf), which is freely available on Huggingface.com. Quantization reduced the model’s 32-bit parameters to 4-bit values, striking a balance between efficiency and performance [54]. To generate labels, the LLM requires a prompt, which serves as a structured input to guide its text generation [55]. We used the following prompt to create topic labels, limiting the output to at most five words to ensure brevity and clarity:
"""Q:
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the above information, can you give a short label of the topic of at most 5 words?
A:
"""
This approach allowed the model to synthesize information from the cluster’s documents and keywords to produce descriptive labels that captured the core essence of each topic.
For LDA, the top 10 keywords of each topic were extracted based on their importance within the topic-word distribution generated by LDA. These keywords, along with a sample of documents, were provided as input to the LLM using a structured prompt designed to generate descriptive labels of five words or fewer, as follows:
"""Q:
The following keywords represent a topic identified by a Latent Dirichlet Allocation model:
Keywords: {', '.join(keywords)}.
Please provide a concise and descriptive label for this topic in 5 words or fewer.
A:
"""
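Filling such prompt templates is plain string substitution; the sketch below, with a hypothetical `fill_prompt` helper, mirrors how the placeholders are replaced with a sample of titles and the extracted keywords before the prompt is handed to the LLM:

```python
PROMPT = """Q:
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the above information, can you give a short label of the topic of at most 5 words?
A:
"""

def fill_prompt(documents, keywords, max_docs=4):
    """Hypothetical helper: substitute the placeholders with a sample of
    titles and the topic keywords."""
    doc_block = "\n".join(f"- {d}" for d in documents[:max_docs])
    return (PROMPT
            .replace("[DOCUMENTS]", doc_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))
```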

2.4. Data Visualization and Trend Analysis

Data were visualized using BERTopic’s inbuilt functions and the matplotlib [56] and seaborn [57] libraries. The chord plot was created with HoloViews [58], using the bokeh library as the rendering backend.
To assess the distributional characteristics of the generated topics, we computed Shannon’s entropy for both the BERTopic and LDA outputs. Shannon’s entropy quantifies the uniformity of document distribution across identified topics. In this context, a higher entropy value indicates a more even spread of documents among the topics, while lower entropy suggests the presence of a few dominant clusters. By comparing entropy values, we gained insight into how each method balanced the thematic subdivisions within the corpus.
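Shannon’s entropy can be computed directly from the per-topic document counts; a minimal implementation:

```python
import numpy as np

def topic_entropy(doc_counts):
    """Shannon entropy (in nats) of the document distribution over topics;
    higher values mean documents are spread more evenly among topics."""
    counts = np.asarray(doc_counts, dtype=float)
    p = counts / counts.sum()   # normalize counts into a distribution
    p = p[p > 0]                # by convention, 0 * log(0) contributes 0
    return float(-(p * np.log(p)).sum())
```

A perfectly uniform spread over k topics yields log(k), while a corpus dominated by a few large clusters yields a markedly lower value.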
To assess the alignment between the topics identified by BERTopic and LDA, we performed a normalized crosstab analysis. For each document in the dataset, we compared its topic assignments from both models: BERTopic and LDA. The resulting contingency table captured the frequency of documents shared between each BERTopic topic (y-axis) and LDA topic (x-axis). To facilitate interpretation, the table was normalized row-wise, converting absolute counts into proportions relative to the total number of documents in each BERTopic topic. The normalized values were then visualized as a heatmap using the seaborn library, where the color intensity reflects the degree of overlap between topics, ranging from dark purple (low alignment) to bright yellow (high alignment).
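The row-normalized contingency table can be reproduced with pandas’ `crosstab`; the per-document topic assignments below are invented for illustration:

```python
import pandas as pd

# Hypothetical topic assignments from the two models, one row per document
df = pd.DataFrame({
    "bertopic": [0, 0, 0, 1, 1, 2],
    "lda":      [3, 3, 5, 4, 4, 5],
})

# Row-wise normalization: each BERTopic topic's counts become proportions
overlap = pd.crosstab(df["bertopic"], df["lda"], normalize="index")

# Heatmap as in the study (requires seaborn; viridis runs dark purple
# for low values to bright yellow for high values):
# import seaborn as sns
# sns.heatmap(overlap, cmap="viridis")
```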

3. Results

3.1. BERTopic Analysis—Setting the Stage

The generated dataset comprised 93,971 articles published from 2009 to 2024. We analyzed the articles published in 2024 separately because the number of published papers was not comparable to that of the preceding years. Unsurprisingly, the distribution of papers over the years showed a progressive increase in the number of publications in the field (Figure 2), which corresponds to what is generally known about the life science and biomedical sectors.
Topic modeling algorithms can be tuned by operators using many parameters, which affect the number of topics the system can identify. To obtain a macroscopic overview of the research space in periodontics, we first set the minimum cluster size at n = 250, meaning that the HDBSCAN clustering step only identified density areas of at least 250 papers as a distinct group. With these settings, BERTopic identified 31 topics (Table 1). Each topic was indicated by an integer from 0 to 29, plus a ‘−1’ null topic, where all the unclassified documents were collected. This null group alone contained 43,271 titles, almost half of the total number of papers. Far from representing a failure of the algorithm, this high number of unclassified papers is a direct consequence of setting such a high cluster threshold: these 43,271 articles belonged to themes that did not comprise at least 250 manuscripts each.
As a default, BERTopic creates document clusters based on the similarity in their embedding representations and uses the four most frequent keywords found in each cluster to describe them, which then constitutes a topic. In addition to this crude method, we included additional topic representations obtained by applying the KeyBERT algorithm to improve keyword quality and the MMR algorithm to increase keyword diversity, thus possibly capturing more nuances of the topic, as well as the LLM label, whose purpose is to obtain a more immediate, comprehensive, and overall ‘human’ description of the topic (Supplementary Table S1).
The most common topic (indicated as #0) refers, perhaps unsurprisingly, to periodontal regeneration and stem cells (Topic #0 Periodontal Stem Cell Regeneration). Topic #1, Peri-Implant Soft Tissue Stability, revolves around implants, an area that is admittedly only tangentially related to periodontics (but close enough for the PubMed search to pick it up and include it in the dataset we used). Topic #2, Oral Health and Quality of Life, focuses on oral health and periodontal diseases. Simply going through a ranking of topics, however, does not provide the kind of insight we were aiming for, so we set out to investigate the topological structure of research topics in periodontics within the semantic space. This is made possible by the very nature of embeddings, which, once properly reduced to two-dimensional vectors, can be used as cartesian coordinates.

3.2. Overview of the Research Landscape

Figure 3, Figure 4, Figure 5 and Figure 6 illustrate the distribution of these topic clusters within a cartesian semantic space, as identified by BERTopic, with their LLM labels. LLM labels were preferred over BERTopic’s default labels in our iconography because they provide a more readable rendering of the topics. Each topic in Figure 3, Figure 4, Figure 5 and Figure 6 is represented as a gray circle, and its size is proportionate to the number of items it includes. The proximity of the circles indicates their semantic closeness, with the distance between clusters reflecting topic divergence. Some topics appear to be very closely related—even overlapping—so that four main topic groupings or clusters can be identified (Figure 3A).
The first cluster (Figure 3B) contains seven topics, including a broad and vast topic (#2 Oral Health and Quality of Life, n = 6649), about the epidemiology of periodontal diseases, as suggested by its descriptor keywords, as follows:
KeyBERT: [‘periodontal health’, ‘oral health’, ‘periodontal diseases’, ‘periodontal disease’, ‘periodontitis’, ‘periodontal therapy’, ‘dental care’, ‘periodontal’, ‘oral hygiene’, ‘aggressive periodontitis’].
This is also confirmed by browsing through the titles on this topic, as follows:
Disparities and social determinants of periodontal diseases [59];
Validity of individual self-report oral health measures in assessing periodontitis for causal research applications [60];
Periodontal Health Knowledge and Oral Health-Related Quality of Life in Caribbean Adults [61].
This cluster also contains topics related to the prevention of periodontal disease, i.e., #23, Toothbrush Plaque Removal, which includes papers such as
Effects of professional toothbrushing among patients with gingivitis [62]
and #28, Dentin Hypersensitivity Management, e.g.,
Effect of milk as a mouthwash on dentin hypersensitivity after non-surgical periodontal treatment [63].
This topic cluster further includes research on the effects of health conditions or habits on periodontal disease, i.e., Topics #25, HIV and Periodontal Disease, #19, COVID-19 and Dental Practice, and #13, Smoking and Periodontal Disease.
Interestingly, this cluster of topics also contains Topic #14, Periodontal Disease and Pregnancy Complications, which again has marked epidemiological traits, as follows:
A Six-Month Single-Center Study in 2021 on Oral Manifestations during Pregnancy in Bhubaneswar, India [64];
Unfavourable beliefs about oral health and safety of dental care during pregnancy: a systematic review [65];
Periodontal pathogens of the interdental microbiota in a 3 month pregnant population with an intact periodontium [66].
Cluster 2 (Figure 4A) contains 10 topics that are more closely related to surgery, surgical methods, and their outcomes. A closer look at its topics (Figure 4B) indicates that this topic cluster includes a vast topic (n = 7847) related to peri-implant soft tissues (#1 Peri-Implant Soft Tissue Stability). If we examine its descriptors more carefully, it becomes apparent that the LLM may have only partially captured the nature of this topic, as its KeyBERT keywords are as follows:
[‘implant placement’, ‘dental implant’, ‘dental implants’, ‘implants placed’, ‘immediate implant’, ‘peri implant’, ‘peri implantitis’, ‘ridge augmentation’, ‘alveolar ridge’, ‘implant supported’].
This indicates that the articles of this topic unit are more generally related to implants and implant-related challenges and are not as focused on soft tissue stability as the LLM label suggested.
A quick glance at randomly chosen titles from this group confirms this impression, as follows:
Predictive factors for the treatment success of peri-implantitis: a protocol for a prospective cohort study [67];
Evaluation of Peri-Implant Parameters and Functional Outcome of Immediately Placed and Loaded Mandibular Overdentures: A 5-year Follow-up Study [68];
Immediate implant placement into infected and noninfected extraction sockets: a pilot study [69].
It is thus less surprising that BERTopic identifies another closely associated topic, #15, Sinus Augmentation, in the same cluster, together with #21, Titanium Surface Studies. This cluster also comprises Topic #7, Bond Strength of Dental Restorations, which is described with a surprisingly specific label by the LLM, although KeyBERT and MMR once again reveal a broader scope, as follows:
KeyBERT: [‘ceramic crowns’, ‘resin composites’, ‘dental prostheses’, ‘composite resin’, ‘resin composite’, ‘composite restorations’, ‘denture’, ‘resins’, ‘zirconia crowns’, ‘resin based’];
MMR: [‘zirconia’, ‘resin’, ‘composite’, ‘ceramic’, ‘restorations’, ‘strength’, ‘crowns’, ‘bond’, ‘bond strength’, ‘adhesive’].
Title examination confirms this impression, as follows:
The use of zirconium and feldspathic porcelain in the management of the severely worn dentition: a case report [70];
Influence of hydrothermal aging on the shear bond strength of 3D printed denture-base resin to different relining materials [71].
Topic #7, when considered as a whole, thus appears to contain mostly material-centered reports.
Cluster #2 contains another substantial topic (#3 Giant Cell Granuloma Cases, n = 3796), which, despite the focus on giant cell granuloma, as highlighted in the LLM-generated description, actually contains reports on a much vaster array of oral diseases, as follows:
KeyBERT: [‘cell granuloma’, ‘granuloma’, ‘pyogenic granuloma’, ‘gingival fibromatosis’, ‘cell carcinoma’, ‘ossifying fibroma’, ‘giant cell’, ‘squamous cell’, ‘ameloblastoma’, ‘gingival’]
The following representative selection of titles supports this view:
Diagnosis and management of exuberant palatal pyogenic granuloma in a systemically compromised patient—Case report [72];
Radicular Cyst: A Cystic Lesion Involving the Hard Palate [73];
Management of Chronic Inflammatory Gingival Enlargement: A Short Review and Case Report, [74]
which firmly situates this topic group in the Oral Surgery field.
This at least partially explains why this topic is semantically close to #6, Cleft Lip and Palate Treatment, as they are both eminently surgical topics. This topic cluster contains five additional topics (#4 Antimicrobial Photodynamic Therapy with Diode Laser, #12 Gingival Recession Treatment, #9 Root Canal Therapy Outcomes, #29 Smile Esthetics, and #4 Cone Beam Computed Tomography Applications), which are mostly centered on surgical approaches to periodontal tissue and tend to concentrate in the right part of the diagram.
Figure 5 shows the structure of the third topic cluster, which contains five topics broadly related to the microbiology of periodontal diseases (#6 Porphyromonas Gingivalis Effects, #16 Periodontal Pathogens and Inflammation, and #11 Oral Microbiome and Health), mouthwashes (#8 Chlorhexidine and Herbal Mouthwash), which are intuitively associated to the reduction of the microbiological load, but also a more general topic on periodontal health and probiotics (#17 Probiotic Periodontal Health). This group, too, contains two closely related topics, #6 and #16, and judging by the LLM label alone, it could be assumed that Topic #6 could be a subset of Topic #16.
Therefore, it is once again necessary to investigate the content of these two topics by comparing their keyword descriptors. Topic #16 is characterized by the following keywords:
KeyBERT: [‘aggregatibacter actinomycetemcomitans’, ‘actinomycetemcomitans fusobacterium’, ‘induced aggregatibacter’, ‘actinomycetemcomitans leukotoxin’, ‘actinomycetemcomitans infection’, ‘pathogen aggregatibacter’, ‘actinomyces’, ‘actinobacillus’, ‘actinomycosis’, ‘actinomycetemcomitans’].
The majority of these keywords revolve around one single bacterial species, which would suggest that this topic could be most aptly labeled after this species. However, the MMR algorithm is specifically designed to increase the diversity in the keywords used for the topic representation, and, in this case, it yielded the following keywords:
MMR: [‘aggregatibacter’, ‘actinomycetemcomitans’, ‘aggregatibacter actinomycetemcomitans’, ‘fusobacterium’, ‘nucleatum’, ‘fusobacterium nucleatum’, ‘prevotella’, ‘jp2’, ‘leukotoxin’, ‘serotype’].
Thus, MMR reveals that this topic contains what could be broadly considered articles about oral microbiology. A glance at a selection of titles for this group confirms that reports on Aggregatibacter actinomycetemcomitans are indeed very numerous, but other bacterial species have also been investigated, as follows:
The prevalence of Fusobacterium nucleatum subspecies in the oral cavity stratifies by local health status [75];
Bacteriome analysis of Aggregatibacter actinomycetemcomitans-JP2 genotype-associated Grade C periodontitis in Moroccan adolescents [76];
The role of NLRP3 in regulating gingival epithelial cell responses evoked by Aggregatibacter actinomycetemcomitans [77].
Works on Porphyromonas gingivalis, on the contrary, are decisively predominant in Topic #6, as follows:
Gingival fibroblast activation by Porphyromonas gingivalis is driven by TLR2 and is independent of the LPS-TLR4 axis [78];
Emergence of Antibiotic-Resistant Porphyromonas gingivalis in United States Periodontitis Patients [79].
Even the MMR algorithm did not detect any extra bacterial species among the keywords for this group, as follows:
MMR: [‘porphyromonas’, ‘porphyromonas gingivalis’, ‘gingivalis’, ‘lipopolysaccharide’, ‘gingivalis lipopolysaccharide’, ‘cells’, ‘induced’, ‘human’, ‘expression’, ‘response’].
Interestingly, works on P. gingivalis have been clustered into a topic of their own and thus do not merely represent another bacterial species within a larger microbiology group. This is most likely a consequence of the sheer abundance of this literature (n = 1792 in this dataset), which made it possible for BERTopic to create a dedicated topic for these works, isolating them from the rest of the microbiological literature (Topic #16 comprises only 664 articles).
Figure 6 shows the last topic group, which mostly contains periomedicine topics (#10 Diabetes and Periodontal Disease, #18 Periodontal–Cardiovascular Disease Link, and #20 Rheumatoid Arthritis and Periodontal Disease) and topics centered on the association of periodontal disease with systemic conditions (and possibly, more specifically, the effects of periodontal disease on systemic diseases or on diseases localized in other regions of the organism), such as #24 Periodontitis–CKD Association, #27 Bisphosphonate-Related Osteonecrosis, and #26 Vitamin D and Periodontitis. Perhaps unexpectedly, it also contains a very large topic, Topic #0, Periodontal Stem Cell Regeneration (n = 11,715). Upon closer inspection, its keywords are as follows:
KeyBERT: [‘periodontal regeneration’, ‘periodontal tissue’, ‘human periodontal’, ‘periodontal ligament’, ‘bone regeneration’, ‘osteogenic differentiation’, ‘stem cells’, ‘periodontal disease’, ‘stem cell’, ‘tissue regeneration’];
MMR: [‘cells’, ‘periodontal’, ‘stem’, ‘stem cells’, ‘ligament’, ‘periodontal ligament’, ‘bone’, ‘human’, ‘regeneration’, ‘periodontitis’].
This confirms that this topic is mostly related to periodontal regeneration and its cellular basis. Representative titles conform to this view, as follows:
Multipotent adult progenitor cells acquire periodontal ligament characteristics in vivo [80];
Novel gene-activated matrix with embedded chitosan/plasmid DNA nanoparticles encoding PDGF for periodontal tissue engineering [81];
Cementogenesis and the induction of periodontal tissue regeneration by the osteogenic proteins of the transforming growth factor-beta superfamily [82].
However, in vitro or wet lab subjects are also present in this group, as follows:
The spatial transcriptomic landscape of human gingiva in health and periodontitis [83], or
Emerging roles of exosomes in oral diseases progression [84].
It can be speculated that the focus on cellular and molecular mechanisms might have caused BERTopic to cluster this topic in this otherwise differently oriented group.

3.3. Relations Between Topics

To gain deeper insights into how these research areas are related to each other, we ran a cosine similarity check on embeddings of the titles in the dataset. For each topic, we computed a single representative embedding by averaging the embeddings of all titles within the topic. This averaged embedding allowed us to compare topics directly and identify all topic pairs that exhibited a cosine similarity > 0.9.
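The averaging-and-thresholding procedure can be sketched in a few lines of NumPy; the embeddings and labels below are toy stand-ins for the sentence-transformer vectors and BERTopic assignments used in the study.

```python
import numpy as np

def topic_centroids(embeddings, labels):
    """Average the title embeddings within each topic to obtain one
    representative vector per topic (outliers labeled -1 are skipped)."""
    topics = sorted({t for t in labels if t != -1})
    return {t: embeddings[labels == t].mean(axis=0) for t in topics}

def similar_pairs(centroids, threshold=0.9):
    """Return all topic pairs whose centroid cosine similarity exceeds the threshold."""
    pairs = []
    topics = sorted(centroids)
    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            u, v = centroids[a], centroids[b]
            cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            if cos > threshold:
                pairs.append((a, b, cos))
    return pairs

# Toy example: 2-D "embeddings" for six titles across three topics.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                [0.1, 0.9], [0.8, 0.2], [0.7, 0.3]])
lab = np.array([0, 0, 1, 1, 2, 2])
print(similar_pairs(topic_centroids(emb, lab)))
```

In the study, the same comparison was run over the 31 topic centroids, retaining pairs with cosine similarity above 0.9 for the chord plot in Figure 7.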
We represented these associations as a chord plot in Figure 7. Unsurprisingly, papers on Topic #17, Probiotic Periodontal Health, display, on average, a high degree of similarity to papers in both Topic #11, Oral Microbiome and Health, and Topic #8, Chlorhexidine and Herbal Mouthwash, given the strong emphasis on the microbiological dimension of periodontics. Similarly intuitive is the association between Topic #18, Periodontal–Cardiovascular Disease Link, and Topics #24, Periodontal–CKD Association, and #10, Diabetes and Periodontal Disease, as all three center on the link between periodontal health and systemic conditions. Papers on Topic #23, Toothbrush Plaque Removal, display high similarity to papers on Topic #2, Oral Health Quality of Life. The only seemingly surprising association is that between Topic #9, Root Canal Therapy Outcomes, and Topic #4, Cone Beam Computed Tomography Applications (which is also related to Topic #1, Peri-Implant Soft Tissue Stability); it is explained by the fact that Topic #4 also includes radiographic studies focusing on techniques necessary for endodontic diagnosis.

3.4. LDA Topic Modeling: A Conventional Perspective

In addition to the BERTopic analysis, we employed Latent Dirichlet Allocation (LDA) to complement the exploration of the research field’s thematic structure. LDA provides a probabilistic framework for identifying latent topics, representing each document as a mixture of topics and each topic as a distribution over words. Using the same dataset of 93,971 titles, we trained the LDA model to identify 31 latent topics, a number chosen to match the BERTopic results. The topics are characterized by the following keywords:
Topic 0: health, related, oral, quality, life, dental, factors, periodontal, self, survey, prevalence, risk, practice, national, disorders, examination, reported, patients, evidence, impact;
Topic 1: gingivalis, porphyromonas, human, cells, gingival, response, fibroblasts, inflammatory, induced, expression, activation, lipopolysaccharide, production, epithelial, signaling, il, mediated, pathway, macrophages, periodontal;
Topic 2: cell, cancer, oral, carcinoma, squamous, giant, neck, granuloma, head, epithelial, tumor, cells, tongue, interactions, death, central, stem, hormone, gingiva, line;
Topic 3: bone, alveolar, tooth, ridge, movement, augmentation, orthodontic, using, defects, preservation, regeneration, graft, technique, reconstruction, grafting, autogenous, block, vertical, study, guided;
Topic 4: healing, laser, human, effects, low, titanium, wound, gingival, effect, acid, fibroblasts, level, oral, osteonecrosis, expression, therapy, diode, jaw, bisphosphonate, related;
Topic 5: orthodontic, patients, treatment, cleft, lip, palate, class, occlusal, ii, fixed, changes, force, skeletal, periodontal, analysis, classification, unilateral, scanning, malocclusion, mandibular;
Topic 6: oral, bacteria, biofilm, periodontal, pathogens, bacterial, il, subgingival, activity, pregnancy, microbial, antimicrobial, plaque, biofilms, fusobacterium, nucleatum, species, vitro, microbiota, periodontitis;
Topic 7: implantitis, peri, effect, surface, resin, vitro, based, calcium, composite, properties, strength, nanoparticles, antibacterial, different, candida, dentin, evaluation, bond, removal, biofilm;
Topic 8: bone, implant, implants, study, analysis, extraction, placement, immediate, peri, loss, element, finite, dental, marginal, sockets, sites, healing, evaluation, influence, stability;
Topic 9: patients, periodontitis, levels, gingival, chronic, fluid, crevicular, peri, salivary, periodontal, serum, implant, saliva, healthy, biomarkers, smokers, analysis, matrix, interleukin, expression;
Topic 10: study, pilot, vivo, actinomycetemcomitans, aggregatibacter, evaluation, comparative, experimental, gingivitis, oxide, root, effect, preliminary, ex, assessment, pocket, periodontal, probing, dogs, bleeding;
Topic 11: implants, implant, year, study, dental, clinical, retrospective, term, long, follow, supported, prospective, single, fixed, years, results, short, immediate, loading, posterior;
Topic 12: platelet, rich, fibrin, molar, molars, plasma, mandibular, teeth, impacted, second, permanent, extraction, lower, genome, healing, effect, prf, tooth, primary, periodontal;
Topic 13: periodontitis, function, syndrome, host, like, loss, differential, bone, oxygen, periodontium, neutrophil, linked, extracellular, reactive, role, epithelium, poly, homeostasis, immune, brain;
Topic 14: study, different, sinus, root, vitro, implant, laser, materials, evaluation, comparison, using, surface, er, floor, titanium, effect, elevation, yag, comparative, surfaces;
Topic 15: study, cross, periodontitis, periodontal, association, sectional, risk, patients, disease, population, control, cohort, status, adults, 19, loss, COVID, tooth, associated, case;
Topic 16: case, report, management, series, treatment, patient, maxillary, approach, cases, teeth, esthetic, root, surgical, rare, follow, tooth, incisor, implant, central, anterior;
Topic 17: treatment, endodontic, dentistry, periodontal, regenerative, teeth, clinical, delivery, drug, apical, application, diagnosis, smile, medicine, therapy, new, invasive, applications, strategies, procedures;
Topic 18: tomography, computed, cone, beam, using, maxillary, root, study, evaluation, canal, time, assessment, analysis, accuracy, periapical, detection, digital, cbct, teeth, thickness;
Topic 19: tissue, regeneration, soft, bone, dental, periodontal, collagen, pulp, guided, engineering, stem, cells, membrane, based, hard, using, membranes, scaffold, scaffolds, hydroxyapatite;
Topic 20: therapy, periodontitis, treatment, periodontal, clinical, patients, chronic, non, surgical, effect, photodynamic, efficacy, aggressive, randomized, effects, microbiological, root, study, scaling, adjunct;
Topic 21: gingival, tissue, graft, recession, connective, flap, treatment, multiple, advanced, coverage, free, technique, root, clinical, matrix, coronally, recessions, pain, overgrowth, induced;
Topic 22: periodontitis, induced, bone, rats, stress, experimental, loss, inflammation, signaling, oxidative, inflammatory, periodontal, pathway, alveolar, model, effects, mice, women, resorption, differentiation;
Topic 23: cells, periodontal, ligament, human, stem, differentiation, osteogenic, derived, mesenchymal, vitro, effect, effects, cell, proliferation, dental, expression, fibroblasts, fibroblast, fluoride, containing;
Topic 24: clinical, randomized, trial, controlled, defects, treatment, study, evaluation, mouth, periodontal, efficacy, enamel, double, matrix, intrabony, split, versus, randomised, blind, effect;
Topic 25: factor, growth, expression, streptococcus, receptor, protein, gene, alpha, necrosis, beta, mice, endothelial, tumor, vascular, formation, mutans, nuclear, interleukin, treponema, binding;
Topic 26: disease, periodontal, diseases, oral, role, periodontitis, health, systemic, association, chronic, gene, risk, polymorphisms, cardiovascular, microbiome, genetic, vitamin, relationship, potential, polymorphism;
Topic 27: odontogenic, lesions, dental, oral, analysis, diagnostic, periapical, current, research, identification, maxillofacial, based, cysts, frequency, learning, imaging, acute, future, gum, origin;
Topic 28: review, systematic, analysis, meta, literature, studies, periodontal, treatment, implant, narrative, efficacy, trials, peri, network, therapy, effectiveness, dental, effect, scoping, outcomes;
Topic 29: oral, diabetes, disease, type, periodontitis, arthritis, infection, patients, mellitus, periodontal, rheumatoid, cavity, associated, hiv, manifestations, peripheral, inflammatory, mechanisms, microbiota, possible;
Topic 30: dental, oral, health, children, care, caries, status, hygiene, patients, students, knowledge, adults, study, adolescents, education, tooth, india, factors, older, old.
To facilitate an easier interpretation of these topics, instead of just listing the keywords, we utilized a Large Language Model (LLM) to process the 20 most relevant keywords for each topic and generate topic labels, similar to what we achieved using BERTopic. These labels, which are listed in Table 2, provide a more accessible summary of the topics, enabling a clearer understanding of the research areas represented in the dataset.
The distribution of documents across topics identified by LDA and BERTopic reveals distinct clustering patterns. Shannon’s entropy scores, which measure the uniformity of distribution, underscore this difference; LDA yields a more uniform topic distribution (entropy = 4.8) compared to BERTopic (entropy = 3.1). This suggests that LDA topics tend to have more balanced sizes, whereas BERTopic produces clusters that include both large, broad topics and smaller, more niche ones. Notably, some LDA topics closely resemble BERTopic clusters, such as LDA Topic #0, ‘Oral Health & Quality of Life Factors’, which aligns with BERTopic Topic #2, ‘Oral Health Quality of Life’, and LDA Topic #21, ‘Gingival tissue grafts and treatments for periodontal recession’, which mirrors BERTopic Topic #12, ‘Gingival Recessions Treatment’.
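Shannon’s entropy over topic sizes can be computed directly from the assignment labels; the sketch below uses synthetic label lists to show how a uniform distribution yields a higher score than a skewed one.

```python
from collections import Counter
from math import log2

def topic_entropy(labels):
    """Shannon entropy (bits) of the document-over-topic distribution:
    higher values mean topic sizes are more uniform."""
    counts = Counter(t for t in labels if t != -1)  # drop BERTopic's -1 outliers
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Four equal-sized topics maximize entropy; one dominant topic lowers it.
uniform = [0] * 100 + [1] * 100 + [2] * 100 + [3] * 100
skewed = [0] * 370 + [1] * 10 + [2] * 10 + [3] * 10
print(topic_entropy(uniform), topic_entropy(skewed))
```

Applied to the real assignments, this is the computation behind the entropy values of 4.8 (LDA) and 3.1 (BERTopic) quoted above.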
To further explore the structure of topics identified by LDA, we projected the topic-word distributions (LDA components) into a 2D semantic space using UMAP (Figure 8). These components represent the probability distributions of words across each topic, in which each topic is defined by a unique combination of words weighted by their importance. As with BERTopic, topics positioned closely in the 2D semantic space indicate higher semantic similarity. Although the semantic clustering of LDA topics is less visually pronounced compared to BERTopic, certain relationships are apparent. For example, LDA Topic #0, ‘Oral Health & Quality of Life Factors’, and Topic #30, ‘Oral Health Education and Care for All Age Groups’, as well as Topic #11, ‘Dental implant studies’, and Topic #7, ‘Dental Implant Surface Properties and Biofilm Effects’, are located in similar areas of the semantic space, reflecting a similarity already suggested by their labels. Overall, however, the clustering of LDA topics appears less distinct than that observed for BERTopic.
To evaluate how well the topics identified by LDA and BERTopic align, we employed a contingency table generated with the pd.crosstab function and visualized it as a heatmap (Figure 9). This heatmap summarizes the frequency of documents assigned to each BERTopic-derived topic against the LDA-derived topics, representing the overlap between the two methods. To facilitate interpretation, the table was normalized row-wise to show the proportion of documents from each BERTopic topic aligning with LDA topics. As depicted in Figure 9, despite some label resemblances, most topics from the two methods do not align closely. A few notable exceptions include LDA Topic #15, ‘Periodontal Disease Risk Factors’, which partially aligns with BERTopic Topic #24, ‘Periodontitis-CKD association’ (11% overlap), Topic #23, ‘Toothbrush Plaque Removal’ (10%), and Topic #20, ‘Rheumatoid Arthritis and Periodontal Disease’ (10%). Similarly, LDA Topic #30, ‘Oral Health Education and Care for All Age Groups’, showed partial alignment with BERTopic Topic #25, ‘HIV and Periodontal Disease’, Topic #22, ‘Cleft Lip and Palate Treatment’, and Topic #17, ‘Probiotics Periodontal Health’ (each with 10% overlap). These results highlight the challenges in directly comparing outputs from the two algorithms, which employ fundamentally different methodologies to define and cluster topics.
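The row-normalized contingency table can be produced exactly as described with pandas; the document-level topic assignments below are invented for illustration.

```python
import pandas as pd

# Hypothetical per-document topic assignments from the two models.
df = pd.DataFrame({
    "bertopic": [24, 24, 24, 23, 23, 20, 20, 20, 20, 25],
    "lda":      [15, 15, 3,  15, 9,  15, 15, 2,  2,  30],
})

# Row-wise normalization: each row shows how the documents of one BERTopic
# topic distribute across LDA topics (rows sum to 1), as in Figure 9.
overlap = pd.crosstab(df["bertopic"], df["lda"], normalize="index")
print(overlap.round(2))
```

Passing `normalize="index"` to `pd.crosstab` performs the row-wise proportion calculation directly, so no separate normalization step is needed before plotting the heatmap.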

3.5. A Changing Mosaic: The Evolution of Research Topics

Over the years, most topics have exhibited some growth, although with noticeable oscillations. Figure 10 shows the number of publications that appeared in academic journals and were indexed in MEDLINE for the top 10 topics in our dataset, as identified by BERTopic.
Topic #0, Periodontal Stem Cell Regeneration, has grown steadily over the analyzed timeframe, while Topic #1, Peri-Implant Soft Tissue Stability, and Topic #2, Oral Health Quality of Life, have grown slightly but steadily over the same period. While no topic appeared to decline in the 2009–2023 timespan, some topics, such as Topic #3 and Topic #9, have maintained constant publication rates.

4. Discussion

Topic modeling algorithms open up significant possibilities for data analysis, particularly as novel tools, such as sentence transformers, become more capable of capturing subtle meaning differences within analyzed text corpora [38]. The tool of choice significantly influences the type and quality of outcomes, and in this context, BERTopic stands out with a plethora of parameter customization options, which range from the selection of sentence transformers, dimensionality reduction, and clustering preferences to the incorporation of supervised, semi-supervised, or purely guided approaches, thus granting considerable flexibility in refining the modeling process [17]. BERTopic can handle diverse paper corpora without requiring extensive preprocessing or text cleaning, enabling investigators to quickly analyze significantly larger datasets than was previously feasible [15,16].
In this study, we applied BERTopic to a corpus of periodontics research to uncover thematic structures and compare them with those derived using a traditional LDA approach [49]. We opted for a low-granularity setting, focusing on broad thematic clusters that could be clearly aligned with known research areas in periodontics. This approach facilitated straightforward mapping of the literature landscape.
According to these BERTopic settings, periodontics research is arranged along four major topic axes, which could tentatively be labeled (1) patient management and hygiene; (2) periodontal (and implant) surgery; (3) oral microbiology; and (4) periomedicine. The first and the last topic groups may be more challenging to characterize based on our data because they appear to overlap to some extent. Although the first topic cluster contains a topic on periodontal disease and low birth weight in pregnant women, which would intuitively be perceived as closer to periomedicine and thus as belonging to the fourth cluster, it is mostly centered on patient management and the epidemiological aspects of periodontal disease. Two semantic polarities can actually be identified in this first cluster: one pole comprising topics about toothbrushing and hygiene management (#23 Toothbrush Plaque Removal and #28 Dentin Hypersensitivity Management), and one pole that revolves around patient management by periodontists, including therapy and the challenges raised by recent public health emergencies (e.g., #19 COVID-19 and Dental Practice). We chose the periomedicine label for topic Cluster #4 to highlight the presence of topics investigating the links between periodontal disease and systemic diseases, although this group also comprises a very large topic on periodontal regeneration and stem cells. Presumably, the semantic proximity of these topics stems from their shared focus on the cellular mechanisms of bone physiology and its relation to general metabolism and the immune system.
Cluster #4 also contains two areas of topic density: one mostly focused on the associations between periodontal disease and systemic diseases (#10 Diabetes and Periodontal Disease, #18 Periodontal–Cardiovascular Disease Link, #20 Rheumatoid Arthritis and Periodontal Disease, and #24 Periodontitis–CKD Association) and one centered on bone, bone regeneration, osteonecrosis, and vitamin D, including Topics #0, #27, and #26.
The second topic cluster is hybrid in nature and highlights the deep connections between periodontics and implant dentistry. Many topics in this cluster focus on endosseous implants and related research (e.g., #21 Titanium Surface Studies), although periodontal surgery and maxillofacial surgery of the lip and palate form a constellation of topics gravitating around a core of implant research. In this cluster, too, two main areas of semantic density can be identified, with implant dentistry on one side and oral surgery on the other (Figure 4B).
The third topic cluster, unlike the previous ones, cannot be divided into further poles of semantic attraction, but its layout suggests a continuity of meaning that ranges from Topic #6, Porphyromonas Gingivalis Effects, through studies on the oral microbiome and probiotics and on infections by Aggregatibacter and other pathogenic bacterial species, to Topic #8, Chlorhexidine and Herbal Mouthwash. There is also a somewhat disconcerting semantic distance between Topic #6, Porphyromonas Gingivalis Effects, and the thematically close Topic #16, Aggregatibacter Actinomycetemcomitans Infections.
To gain a better understanding of the potential and limits of BERTopic, we decided to model the topics of this dataset with a non-neural algorithm, LDA. Unlike BERTopic, LDA requires predefining the number of topics, which we set to match the number of topics identified by BERTopic (n = 31). When paired with HDBSCAN, BERTopic significantly reduces reliance on the investigator’s subjective choices: by merely selecting a minimum cluster size, the researcher can let the algorithm automatically determine the number of distinct topic areas present in the corpus. The obvious drawback is that HDBSCAN generates a substantial number of unclassified documents, marked as −1 in the algorithm output, which, with our settings, exceeded 43,000 titles. HDBSCAN uses point density to compute clusters in an unsupervised fashion and has the clear advantage of not requiring operators to pre-set the number of clusters, unlike well-known unsupervised algorithms such as K-Means. On the other hand, HDBSCAN will not force every point into a cluster, as K-Means would, and leaves data points that do not fit into any cluster unlabeled. These unlabeled articles thus arise either from the inability of BERTopic/HDBSCAN to assign them to an existing topic group because of their semantic ambiguity, or from the inability to form new clusters, despite their semantics, because these would not contain enough manuscripts to meet the threshold. In the first case, the semantics of these titles may not be completely captured by the sentence transformers, and better-trained encoders might provide better results; in the latter case, however, the −1 null topic group contains articles that investigate areas that are still largely unexplored. In our experience, reducing the minimum topic size reduces the size of the −1 null topic group in this dataset, indicating that at least some of the papers in this group belonged to small niches.
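One way to probe the −1 group is to tentatively reassign outliers to the topic with the most similar centroid; BERTopic itself offers a comparable built-in utility (reduce_outliers) with several strategies. The NumPy sketch below implements a simple centroid-similarity variant on toy data and is illustrative only.

```python
import numpy as np

def reassign_outliers(embeddings, labels, min_similarity=0.5):
    """Tentatively move HDBSCAN's -1 documents to the topic whose centroid
    is most cosine-similar, leaving truly ambiguous documents at -1."""
    labels = labels.copy()
    topics = sorted({t for t in labels if t != -1})
    cents = np.vstack([embeddings[labels == t].mean(axis=0) for t in topics])
    cents /= np.linalg.norm(cents, axis=1, keepdims=True)  # unit-length centroids
    for i in np.where(labels == -1)[0]:
        v = embeddings[i] / np.linalg.norm(embeddings[i])
        sims = cents @ v  # cosine similarity to every topic centroid
        if sims.max() >= min_similarity:
            labels[i] = topics[int(sims.argmax())]
    return labels

# Toy data: the fifth title is an outlier close to Topic 0.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]])
lab = np.array([0, 0, 1, 1, -1])
print(reassign_outliers(emb, lab))
```

Raising `min_similarity` keeps more documents in the −1 group, mimicking the trade-off between coverage and semantic purity discussed above.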
However, this reduction risks over-fragmentation of the research landscape, leading to an excessive number of narrowly defined topics that primarily reflect lexical or minor semantic variations rather than substantive insights. Therefore, investigators must strike a balance between topic granularity and parcellation to avoid noise.
When compared side by side with LDA, BERTopic seems particularly adept at generating informative, albeit smaller, topics. For example, Topic #19, ‘COVID-19 and Dental Practice’ (n = 483), captures a niche that is completely absent from the LDA topics, and Topic #10, ‘Diabetes and Periodontal Disease’, Topic #18, ‘Periodontal–Cardiovascular Disease Link’, and Topic #20, ‘Rheumatoid Arthritis and Periodontal Disease’ correctly classify the literature exploring the link between periodontal disease and specific medical conditions, to which LDA can only juxtapose a generic topic, Topic #29, ‘Oral health and systemic diseases’. LDA produced more uniform topic distributions, with a Shannon’s entropy value of 4.8 compared to BERTopic’s 3.1, indicating less variation in topic size. This uniformity can make LDA appear more systematic, but it also leads to overly generic topics, such as LDA Topic #12, ‘Dental procedures and related genetics’, which lacks a clear thematic focus. Furthermore, the occurrence of LDA Topic #14, ‘Undetermined’, reflects instances in which even manual inspection or LLM-based labeling failed to derive a meaningful characterization from the available keywords. BERTopic, by contrast, provided greater inter-topic clarity, revealing relationships between related clusters more effectively in the semantic space. The use of transformer-based embeddings allows BERTopic to capture contextual and semantic similarities that LDA, which relies solely on word co-occurrence, struggles to detect. As a result, BERTopic placed related topics closer together in a visually interpretable manner.
Taken together, the comparison between BERTopic and LDA highlights key differences in how these methodologies parse complex research corpora, such as those related to periodontics. BERTopic’s capacity to produce smaller, more contextually nuanced clusters enables the identification of highly specific niche areas, which are difficult to capture using more conventional methods like LDA. On the other hand, LDA’s tendency to produce more evenly distributed but thematically broader topics underscores its comparative simplicity and systematic approach to topic formation, even if this comes at the cost of granularity and interpretability. It is also worth noting that, unlike BERTopic, LDA does not require a GPU, which translates into lower resource demands.
These findings suggest several avenues for future research and methodological refinement. One notable possibility is the incorporation of more specialized or domain-specific embeddings. At present, sentence transformers are trained on broad corpora, and although they already outperform more traditional bag-of-words approaches in capturing nuanced semantic relationships, there is room for improvement [38]. Tailoring embeddings to the biomedical or dental domain through fine-tuning on specialized corpora could yield even more accurate topic clusters and reduce the number of ambiguous or unlabeled documents [85].
Another potential improvement lies in experimenting with varying granularity settings [86]. Different thresholds affect the thematic fragmentation of the dataset. Striking an optimal balance between complexity and interpretability remains an open challenge. Too coarse a granularity leads to the loss of critical insights, while too fine a granularity risks overwhelming investigators with a multitude of narrowly defined themes that may lack practical significance. Systematic experimentation with these parameters, perhaps guided by empirical criteria, such as topic coherence measures, could yield more meaningful topic maps.
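Topic coherence can be estimated without external tooling; the sketch below implements the UMass coherence score, one common coherence measure, on a toy corpus (scores are higher, i.e., less negative, for word sets that tend to co-occur in the same documents). The corpus and word lists are invented for illustration.

```python
from math import log

def umass_coherence(topic_words, docs):
    """UMass coherence: sum over ordered word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents
    containing the given word(s). Assumes every topic word occurs
    in at least one document."""
    doc_sets = [set(d.split()) for d in docs]
    def d(*ws):
        return sum(all(w in s for w in ws) for s in doc_sets)
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            score += log((d(topic_words[i], topic_words[j]) + 1) / d(topic_words[j]))
    return score

docs = [
    "porphyromonas gingivalis lipopolysaccharide response",
    "porphyromonas gingivalis fibroblast activation",
    "implant marginal bone loss",
    "implant placement bone healing",
]
coherent = ["porphyromonas", "gingivalis"]   # always co-occur
incoherent = ["porphyromonas", "implant"]    # never co-occur
print(umass_coherence(coherent, docs), umass_coherence(incoherent, docs))
```

Scoring candidate models this way, across a range of granularity settings, is one empirical criterion for choosing the minimum topic size discussed above.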
Despite these limitations, topic modeling studies like the present one can prove very useful and inform broader meta-analyses. The discovered thematic axes—patient management, periodontal surgery and implantology, oral microbiology, and periomedicine—can serve as foundational pillars guiding researchers, clinicians, and policymakers. Identifying stable, reproducible topic structures across multiple analyses over time can provide valuable longitudinal insights, showing how the field of periodontics is evolving and where emerging trends or neglected areas lie. Such insights might be crucial for guiding funding decisions, informing curriculum development in educational programs, or pinpointing gaps where further inquiry is needed.
In addition to algorithmic approaches like BERTopic, ontology-based frameworks offer structured, domain-specific representations of complex biomedical phenomena. The Periodontitis-Ontology (PeriO) framework is a noteworthy example, as it provides a formalized model of processes, entities, and relationships pertinent to periodontitis pathogenesis, including molecular mechanisms, cellular interactions, and links to systemic conditions [87].
Comparing our BERTopic-derived clusters with the PeriO framework could reveal interesting complementarities. While our approach highlights broad thematic axes—such as patient management, periomedicine, oral microbiology, and implant-related surgery—PeriO delves into the underlying biological processes, immune responses, and osteoimmunological aspects of the disease. For instance, our periomedicine cluster, which identifies linkages between periodontal disease and systemic conditions like diabetes or cardiovascular disease, could be mapped onto corresponding causal pathways and biological processes specified within PeriO. This cross-referencing could also clarify where contemporary research frontiers—uncovered by data-driven topic modeling—align with or diverge from the established conceptual structures offered by domain ontologies.
Finally, while the current work focused on titles and limited textual information, future efforts may explore more content-rich data sources, including abstracts. The more data the modeling algorithms have access to, the more robust and detailed their thematic structures are likely to become. However, this also introduces additional methodological considerations regarding data cleaning and computational resources.
In summary, the interplay between topic modeling algorithms and the evolving research landscape in periodontics demonstrates both the promise and the limitations of these advanced text analysis methods. By carefully choosing parameters, leveraging domain expertise, refining the underlying language models, and experimenting with various degrees of topic granularity, researchers can improve the thematic clarity, coherence, and practical applicability of topic modeling outcomes. These steps will not only enhance our understanding of the periodontics literature but also serve as a blueprint for applying advanced topic modeling approaches to other complex, multidisciplinary fields.

5. Conclusions

In conclusion, this study illustrates how advanced topic modeling frameworks, particularly BERTopic combined with transformer-based embeddings and large language models, can effectively delineate the thematic landscape of a specialized research domain like periodontics. By fine-tuning key parameters, we identified broad thematic axes—such as patient management, periomedicine, oral microbiology, and implant surgery—as well as more granular topic clusters, including those exploring novel research areas or tracking the emergence and decline of specific themes over time. Compared to traditional methods like LDA, BERTopic offered greater semantic sensitivity and the ability to highlight niche topics that might otherwise remain undetected. Although challenges persist, such as the presence of a null category of unlabeled documents, the approach remains efficient, flexible, and readily interpretable. Future refinements—such as domain-adaptive embeddings, expert-guided adjustments, and careful calibration of topic granularity—hold the potential to further enhance our understanding of both the current state and the ongoing evolution of an entire research field.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bdcc9010007/s1: Table S1: List of the topics in the 2009–2023 period sorted by topic size (count = number of papers in the cluster). The table includes the KeyBERT and MMR keywords and the LLM-generated label.

Author Contributions

Conceptualization, C.G., M.M. and E.C.; methodology, C.G.; software, C.G.; formal analysis, C.G. and M.T.C.; data curation, S.G. and M.T.C.; writing—original draft preparation, C.G. and M.M.; writing—review and editing, E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. This diagram represents the computational workflow used to analyze and interpret topics within the dataset. Each step is shown sequentially, from data acquisition and preprocessing to embedding generation, clustering, and topic refinement. The workflow also includes visualization and trend analysis to interpret results. Different categories, such as data preparation, modeling, and visualization, are represented by distinct steps to illustrate the systematic approach to topic analysis.
Figure 2. Histogram showing the distribution of publications in the analyzed corpus. Because the database search was conducted at the beginning of 2024, the number of publications for that year is markedly lower than in the preceding years.
Figure 3. (A) Low-magnification view of the distribution of topics within the semantic space. Four clusters can be identified by their spatial arrangement. (B) High-magnification view of the topics that compose Cluster #1. The main topic of this cluster is highlighted in red in both panels (A,B).
Figure 4. (A) Low-magnification view of the distribution of topics within the semantic space. Four clusters can be identified by their spatial arrangement. (B) High-magnification view of the topics that compose Cluster #2. The main topic of this cluster is highlighted in red in both panels (A,B).
Figure 5. (A) Low-magnification view of the distribution of topics within the semantic space. Four clusters can be identified by their spatial arrangement. (B) High-magnification view of the topics that compose Cluster #3. The same topic is highlighted in red in both panels (A,B).
Figure 6. (A) Low-magnification view of the distribution of topics within the semantic space. Four clusters can be identified by their spatial arrangement. (B) High-magnification view of the topics that compose cluster #4. The main topic of this cluster is highlighted in red in both panels (A,B).
Figure 7. Chord plot representing semantic similarity between topics, calculated as the cosine similarity between the means of the embeddings of the titles in a topic group.
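The similarity metric behind this chord plot can be reproduced directly: average the title embeddings within each topic and take the cosine between the resulting centroids. The sketch below uses toy three-dimensional vectors as stand-ins for real sentence-transformer embeddings:

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional stand-ins for sentence-transformer title embeddings,
# grouped by (hypothetical) topic
topic_embeddings = {
    "peri-implantitis": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "periomedicine": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {t: centroid(vs) for t, vs in topic_embeddings.items()}
similarity = cosine(centroids["peri-implantitis"], centroids["periomedicine"])
```

In practice the same computation runs over the full embedding matrix (e.g., 384-dimensional vectors), yielding one pairwise similarity per topic pair, which the chord plot then visualizes.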
Figure 8. The figure illustrates the distribution of the 31 topics identified by the Latent Dirichlet Allocation (LDA) model visualized in a 2D semantic space using UMAP. Each point represents a topic, and its position reflects the semantic similarity to other topics based on the topic-word distributions. Topics that are closer together indicate higher semantic similarity, while distant topics are more distinct. Labels were generated using a Large Language Model (LLM) to provide concise descriptions of each topic.
Figure 9. This heatmap visualizes the alignment between topics generated by BERTopic (y-axis) and LDA (x-axis) using a normalized crosstab analysis. Each cell represents the proportion of documents that overlap between a specific BERTopic topic and an LDA topic, with values normalized row-wise. The color intensity, as indicated by the color bar, ranges from dark purple (low alignment) to bright yellow (high alignment), highlighting the degree of overlap between corresponding topics. Areas with higher alignment suggest similar thematic content identified by both methods, whereas low-intensity regions indicate divergence between the two topic models.
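The row-wise normalization behind this heatmap amounts to a crosstab of the two models' document-to-topic assignments, with each row divided by its total. A minimal sketch, using hypothetical assignments:

```python
from collections import Counter

def normalized_crosstab(rows, cols):
    """Row-normalized crosstab: for each row label (e.g., a BERTopic topic),
    the fraction of its documents falling into each column label
    (e.g., an LDA topic)."""
    pair_counts = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_labels = sorted(set(cols))
    return {r: {c: pair_counts[(r, c)] / total for c in col_labels}
            for r, total in row_totals.items()}

# Hypothetical per-document topic assignments from the two models
bertopic_assignments = [0, 0, 0, 1, 1, 2]
lda_assignments = [5, 5, 7, 7, 7, 5]
table = normalized_crosstab(bertopic_assignments, lda_assignments)
```

Each row sums to 1, so bright cells mark LDA topics that absorb most of a given BERTopic topic's documents, while flat rows indicate a BERTopic topic that LDA scatters across several clusters.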
Figure 10. Line chart tracking the number of papers published by year in the top 10 topics of our corpus, identified by topic ID.
Table 1. List of BERTopic-generated topics in the dataset with their LLM labels.
| Topic | N. Publications | LLM Label |
|---|---|---|
| −1 | 43,271 | Periodontal Health and Treatment |
| 0 | 11,715 | Periodontal Stem Cell Regeneration |
| 1 | 7847 | Peri-Implant |
| 2 | 6649 | Soft Tissue Stability |
| 3 | 3796 | Oral Health Quality of Life |
| 4 | 2052 | Giant Cell Granuloma Cases |
| 5 | 1969 | Cone Beam Computed Tomography Applications |
| 6 | 1792 | Antimicrobial Photodynamic |
| 7 | 1438 | Therapy with Diode Laser |
| 8 | 1177 | Porphyromonas Gingivalis Effects |
| 9 | 1126 | Bond Strength of Dental Restorations |
| 10 | 1003 | Chlorhexidine and herbal mouthwash |
| 11 | 961 | Root Canal Therapy |
| 12 | 920 | Outcomes |
| 13 | 898 | Diabetes and Periodontal Disease |
| 14 | 854 | Oral Microbiome and Health |
| 15 | 851 | Gingival Recessions Treatment |
| 16 | 664 | Smoking and Periodontal Disease |
| 17 | 553 | Periodontal disease and pregnancy complications |
| 18 | 528 | Sinus Augmentation |
| 19 | 483 | Aggregatibacter |
| 20 | 470 | Actinomycetemcomitans Infections |
| 21 | 437 | Probiotic Periodontal Health |
| 22 | 415 | Periodontal–Cardiovascular |
| 23 | 364 | Disease Link |
| 24 | 327 | COVID-19 and Dental Practice |
| 25 | 314 | Rheumatoid Arthritis and Periodontal Disease |
| 26 | 300 | Titanium Surface Studies |
| 27 | 269 | Cleft Lip and Palate Treatment |
| 28 | 267 | Toothbrush Plaque Removal |
| 29 | 261 | Periodontitis–CKD association |
Table 2. List of LDA-generated topics in the dataset with their LLM labels.
| Topic | N. Publications | LLM Label |
|---|---|---|
| 0 | 2922 | Oral Health and Quality of Life Factors |
| 1 | 3932 | Periodontitis microbes and host cell response |
| 2 | 2022 | Oral cancer and cell interactions |
| 3 | 3179 | Dental ridge augmentation techniques |
| 4 | 2134 | Laser wound healing in humans |
| 5 | 1895 | Orthodontic Treatment for Cleft Lip and Palate Patients |
| 6 | 2404 | Periodontal Disease Keywords |
| 7 | 2625 | Dental Implant Surface Properties and Biofilm Effects |
| 8 | 4006 | Immediate implant placement study |
| 9 | 3656 | Periodontitis Biomarker Analysis |
| 10 | 1417 | Periodontal Disease Evaluation in Dogs |
| 11 | 4552 | Dental implant studies |
| 12 | 1261 | Dental procedures and related genetics |
| 13 | 1282 | Periodontal disease and host immune response |
| 14 | 3136 | Undetermined |
| 15 | 5009 | Periodontal Disease Risk Factors |
| 16 | 5174 | Maxillary Incisor Treatment Series: Cases and Management Strategies |
| 17 | 2826 | Endodontic therapies and periodontal treatments in Dentistry |
| 18 | 2785 | 3D dental imaging evaluation using CBCT tomography |
| 19 | 2739 | Regenerative Dental Tissue Engineering |
| 20 | 3877 | Periodontitis Treatment Efficacy Study |
| 21 | 2088 | Gingival tissue grafts and treatments for periodontal recession |
| 22 | 3317 | Bone loss in stress-induced periodontitis |
| 23 | 4319 | Stem Cell Differentiation in Dental Tissues |
| 24 | 2652 | Dental Treatment Efficacy Study |
| 25 | 1577 | Streptococcus Growth and Gene Expression |
| 26 | 3531 | Oral Disease and Systemic Health Links |
| 27 | 1905 | Oral Diagnostic Analysis of Odontogenic Lesions |
| 28 | 2041 | Dental Implant Treatment Review |
| 29 | 3182 | Oral health and systemic diseases |
| 30 | 6526 | Oral Health Education and Care for All Age Groups |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Galli, C.; Colangelo, M.T.; Meleti, M.; Guizzardi, S.; Calciolari, E. Topic Analysis of the Literature Reveals the Research Structure: A Case Study in Periodontics. Big Data Cogn. Comput. 2025, 9, 7. https://doi.org/10.3390/bdcc9010007
