1. Introduction
Capsule endoscopy (CE) has revolutionized the management of small bowel diseases [1]. Since its introduction into clinical practice, it has evolved to include colon assessment as part of a panenteric approach [2]. Moreover, because the esophageal and gastric mucosa are captured opportunistically during CE, there has been growing interest in developing a more comprehensive panendoscopic solution that enables the additional evaluation of these locations [3]. Advances in imaging technologies, such as those enhancing mucosal visualization, further highlight the potential of this method for assessing the gastrointestinal (GI) tract [4].
Although minimally invasive capsule panendoscopy (CPE) may be appealing and is likely to be well tolerated by most patients, it presents technical challenges, including a higher burden of exam interpretation and a greater risk of missing crucial frames [3]. Consequently, automated reading assistance may be crucial for decreasing reading time while attaining a highly satisfactory diagnostic performance [5]. Doing so may reduce its financial costs, further reinforcing CPE as a cost-effective option [5].
Esophageal assessment by CE poses particular challenges. Insufflation and controlled movement are not possible in this procedure [6]. Moreover, rapid esophageal transit, particularly in the upright position, reduces the number of esophageal mucosa frames provided by each exam, especially of the Z-line, where pathology is frequently found [7]. While esophagogastroduodenoscopy (EGD) remains the gold standard for esophageal assessment, its invasiveness can cause patient discomfort and poses a non-negligible risk of complications [8]. Additionally, certain protocols, such as swallowing the capsule in a reclined position, have been identified to enhance the diagnostic performance of CE in detecting esophageal lesions, suggesting it may be feasible to detect common esophageal lesions at a level fairly comparable to EGD [9,10].
Artificial intelligence (AI) has been a hot topic in the medical community, specifically in areas with a strong imaging component, such as gastroenterology [11]. In CE, conventional machine-learning models are being supplanted by deep-learning (DL) algorithms, owing to their capacity to learn relevant image features directly from data [12]. The majority of DL models developed so far are based on convolutional neural networks (CNNs), although there is growing interest in applying vision transformer (ViT) methods to computer vision tasks [13]. While DL models have already demonstrated potential in the automatic detection of lesions, first in the small bowel and colon and, more recently, in the stomach, there is no evidence regarding their use in the esophagus [6,14,15].
The aim of this study was to develop and validate the first deep-learning model capable of the automatic detection of pleomorphic esophageal lesions.
2. Materials and Methods
2.1. Study Design
In this retrospective study across two centers (Centro Hospitalar Universitário de São João and ManopH Gastroenterology Clinic, both in Porto, Portugal), we reviewed frames from CE procedures, either small bowel CE or colon CE (CCE), performed from June 2021 to May 2023.
Since the project was designed without direct patient intervention, the patients’ clinical management remained unaffected. To address privacy concerns related to data protection, each patient’s identifying information was omitted and replaced with an arbitrary number. A legal team with Data Protection Officer certification (Maastricht University) also evaluated the privacy rules to ensure non-traceability and compliance with the General Data Protection Regulation.
2.2. Capsule Endoscopy Protocol
Three distinct devices were used to develop this DL model: PillCam™ SB3 (Medtronic Corp., Minneapolis, MN, USA), PillCam™ Crohn’s Capsule (Medtronic Corp., Minneapolis, MN, USA), and OMOM® HD Capsule (JINSHAN Co., Yubei, Chongqing, China). PillCam™ SB3 and PillCam™ Crohn’s frames were reviewed with PillCam™ Software version 9 (Medtronic, Minneapolis, MN, USA), whereas OMOM® HD images were examined using Vue Smart Software (Jinshan Science & Technology Co, Chongqing, Yubei, China, https://www.jinshangroup.com/product/omom-hd-capsule-endoscopy-camera/, accessed on 22 December 2024).
The European Society of Gastrointestinal Endoscopy’s recommendations guided the bowel preparation process. Patients were encouraged to follow a clear liquid diet on the day before capsule ingestion and to fast overnight. A 2 L polyethylene glycol (PEG) solution was used for small bowel preparation. For the PillCam Crohn’s capsule, a 4 L PEG solution was given as a split dose, similar to the preparation process for colonoscopy (patients were instructed to drink 2 L of PEG on the night before the procedure and 2 L on the morning of the procedure). An anti-foaming agent, simethicone, was also included in the capsule administration protocol.
2.3. Categorization of Lesions
Esophageal frames of CE exams were independently reviewed by three gastroenterologists with expertise in CE for identification of lesions in this location. Each frame was classified as either normal or as containing a pleomorphic lesion, which included at least one of the following: protruding lesions, ulcers and erosions, vascular lesions, hematic residues, and esophageal diverticula. Images were included in the final dataset only if their classification received unanimous agreement from the three physicians. The algorithm was built using a total of 7982 frames from three different types of CE devices, 2942 of which had pleomorphic esophageal lesions.
2.4. Development of the DL Model and Performance Analysis
We developed a vision transformer (ViT) model to automatically identify and categorize two types of frames in the esophagus: normal mucosa and pleomorphic lesions. The complete dataset was divided into two main groups for our study: one for training and validation, and the other for testing. To ensure consistency, frames from the same patient were grouped together during this division, following a patient-split design. The training set, comprising 70% of the data, was employed to train the model, while the validation set (20%) helped fine-tune its parameters. The remaining 10% of the data constituted the testing set, used to independently evaluate the diagnostic performance of our ViT model. The graphical representation of our study design can be found in Figure 1.
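The patient-split design described above can be illustrated with a minimal sketch. The function name `patient_split`, the seed, and the choice to apply the 70/20/10 fractions to patients (so that frame proportions are only approximate) are illustrative assumptions, not the authors' implementation.

```python
import random


def patient_split(frame_patient_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Assign frame indices to train/val/test so that no patient spans two sets.

    frame_patient_ids[i] is the (anonymized) patient ID of frame i.
    Fractions are applied to the number of patients, which is an assumption:
    the resulting frame proportions only approximate 70/20/10.
    """
    patients = sorted(set(frame_patient_ids))
    random.Random(seed).shuffle(patients)  # reproducible random patient order

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))

    # Map every patient to exactly one group.
    group_of = {}
    for position, patient in enumerate(patients):
        if position < n_train:
            group_of[patient] = "train"
        elif position < n_train + n_val:
            group_of[patient] = "val"
        else:
            group_of[patient] = "test"

    # Frames inherit the group of their patient.
    split = {"train": [], "val": [], "test": []}
    for index, patient in enumerate(frame_patient_ids):
        split[group_of[patient]].append(index)
    return split


if __name__ == "__main__":
    # Hypothetical toy data: 10 patients with 3 frames each.
    ids = [patient for patient in range(10) for _ in range(3)]
    result = patient_split(ids, seed=1)
    print({name: len(indices) for name, indices in result.items()})
```

Changing the seed between runs yields a different patient allocation each time, which mirrors the idea of repeating training with randomly selected patients.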
Our ViT model was initialized using pre-trained weights from ImageNet, a large image dataset widely used for object recognition [16]. We retained the feature extractor weights to leverage knowledge from ImageNet and defined our own fully connected layers to adapt the pre-trained model to our specific task. To prevent overfitting, we inserted dropout layers with a dropout rate of 0.2 between these fully connected layers. Subsequently, a dense layer was added to determine the binary classification result (normal or pleomorphic esophageal lesion). The hyperparameters, including the initial learning rate, batch size (32), and number of epochs, were determined through trial and error. Common data augmentation techniques, such as image rotation and mirroring, were applied during the training stage. Our computational setup consisted of an NVIDIA RTX A6000 graphics processing unit (NVIDIA Corp, Santa Clara, CA, USA) and a dual AMD EPYC 7282 16-Core processor (AMD, Santa Clara, CA, USA).
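The rotation and mirroring augmentations mentioned above can be illustrated with a minimal sketch. The rotation angles and the framework used are not stated in the text, so the 90° steps, the nested-list image representation, and the function names below are illustrative assumptions.

```python
def mirror(image):
    """Horizontal mirroring: reverse the pixel order within each row."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise (reverse row order, then transpose)."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original frame plus mirrored and rotated variants.

    Order: original, mirrored, rotated by 90, 180 and 270 degrees.
    Restricting rotation to multiples of 90 degrees is an assumption.
    """
    variants = [image, mirror(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    return variants


if __name__ == "__main__":
    # Hypothetical 2x2 "frame" with one intensity value per pixel.
    frame = [[1, 2],
             [3, 4]]
    for variant in augment(frame):
        print(variant)
```

In practice such transformations are applied on the fly during training only, so that the validation and testing frames remain unaltered.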
The model outputs the probability of each frame being normal or containing a pleomorphic esophageal lesion. Each frame was assigned the label with the highest calculated probability (Figure 2). We also generated heatmaps to visualize the frame features that contributed most to the model predictions (Figure 3). The final classification made by our algorithm was compared with the assessment of the three expert gastroenterologists, which served as the gold standard.
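The labeling rule (assign the class with the highest probability) can be sketched as follows. The text does not state how probabilities are obtained from the final dense layer, so the two-logit softmax, the label strings, and the function names here are illustrative assumptions.

```python
import math

# Hypothetical label order for the two output units.
LABELS = ("normal", "pleomorphic lesion")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def assign_label(logits):
    """Return the label with the highest probability and that probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[best], probabilities[best]


if __name__ == "__main__":
    # Made-up scores for a single frame.
    label, probability = assign_label([0.3, 1.9])
    print(label, round(probability, 3))
```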
2.5. Statistics and Reproducibility
We conducted three runs of training, each with the same proportions of training, validation, and testing data. The patients included in each run were randomly selected, resulting in a unique set of patients for every iteration. Sensitivity (proportion of frames with lesions that were correctly identified), specificity (proportion of frames without lesions that were correctly identified), accuracy (proportion of correctly classified frames, both true positives and true negatives, among all predictions), negative predictive value (NPV; proportion of true negatives among all frames predicted not to contain lesions), and positive predictive value (PPV; proportion of true positives among all frames predicted to contain lesions) were calculated for each test set. The final metrics for the model diagnostic performance were determined based on the median and range values of these variables. Additionally, we computed the area under the ROC curve (AUC-ROC) and the area under the precision–recall curve (AUC-PR) for each test set and calculated their average values. We chose to compute both precision–recall and conventional ROC curves to address the imbalance between normal mucosa frames (negatives) and frames containing pleomorphic lesions (positives), as this imbalance could lead to misinterpretation when relying solely on the ROC curve.
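The metric definitions above, and their summary as median and range across runs, can be sketched in a few lines. The function names and the confusion-matrix counts in the example are hypothetical and do not correspond to the study data.

```python
from statistics import median


def diagnostic_metrics(tp, fp, tn, fn):
    """Frame-level metrics from confusion-matrix counts of one test set."""
    return {
        "sensitivity": tp / (tp + fn),            # lesion frames detected
        "specificity": tn / (tn + fp),            # normal frames recognized
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
    }


def summarize(runs):
    """Median and (min, max) range of every metric across runs."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        summary[name] = (median(values), (min(values), max(values)))
    return summary


if __name__ == "__main__":
    # Made-up counts for three hypothetical runs.
    runs = [
        diagnostic_metrics(tp=8, fp=2, tn=88, fn=2),
        diagnostic_metrics(tp=7, fp=4, tn=86, fn=3),
        diagnostic_metrics(tp=9, fp=1, tn=89, fn=1),
    ]
    for name, (mid, (low, high)) in summarize(runs).items():
        print(f"{name}: median {mid:.3f} (range {low:.3f}-{high:.3f})")
```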
Furthermore, we assessed the computational efficiency of our ViT model on our hardware by measuring the processing time for all frames within the test set. Our statistical analysis was carried out using scikit-learn v0.22.2 [17].
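A per-frame processing time measurement of the kind described above could be sketched as follows. The `predict` callable stands in for the trained model, and summarizing the timings as mean and standard deviation is an assumption, since the text does not state which dispersion measure was used.

```python
import time
from statistics import mean, stdev


def time_per_frame(predict, frames):
    """Time a prediction function on each frame; return (mean, SD) in ms.

    `predict` is a hypothetical stand-in for the model's inference call.
    At least two frames are needed to compute a standard deviation.
    """
    durations_ms = []
    for frame in frames:
        start = time.perf_counter()
        predict(frame)
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(durations_ms), stdev(durations_ms)


if __name__ == "__main__":
    # Dummy "model" and dummy frames, for illustration only.
    dummy_frames = [[0.0] * 1000 for _ in range(20)]
    average, spread = time_per_frame(sum, dummy_frames)
    print(f"{average:.4f} ms per frame (SD {spread:.4f} ms)")
```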
3. Results
We constructed a ViT model using a total of 598 exams, 512 of which were small bowel CE exams (457 PillCam™ SB3 and 55 OMOM® HD Capsule), and the remaining 86 were colon CE exams (PillCam™ Crohn’s Capsule).
After triple validation, a total of 7982 frames were used, 2942 of which included pleomorphic esophageal lesions (protruding lesions, ulcers and erosions, vascular lesions, hematic residues, and esophageal diverticula). The number of frames and patients, as well as the types of CE device, are displayed in Table 1.
The metrics calculated for each run are displayed in Table 2. The median sensitivity, specificity, and accuracy were 75.8% (range 63.6–82.1%), 95.8% (range 93.7–97.9%), and 93.5% (range 91.8–93.8%), respectively. The median positive and negative predictive values were 71.9% (range 50.0–90.1%) and 96.4% (range 94.2–97.6%), respectively. The median AUC-ROC and AUC-PR for the detection of pleomorphic esophageal lesions were 0.82 and 0.93, respectively (Figure 4).
In the test set, each frame took 26 ± 3 milliseconds to process.
4. Discussion
To the best of our knowledge, this is the first DL model for the automatic detection of pleomorphic esophageal lesions during CE. This proof-of-concept model showed good overall diagnostic performance across different types of CE devices, and could represent a noteworthy contribution and the missing step toward the implementation of a minimally invasive CE-based panendoscopic evaluation.
Several strengths must be acknowledged. Firstly, the patient-split design ensures that frames from a given patient are assigned to only one of the training/validation or testing groups, reducing the risk of similar frames appearing in both, which could lead to the overfitting of the AI model. Secondly, several CE devices were used to train this algorithm, which increases the interoperability of the model, an important aspect for raising its technology readiness level and, consequently, its applicability in real-life clinical practice. Thirdly, this algorithm was trained using frames from two high-volume CE centers, which may increase the external validity of these results. Fourthly, its capacity to accurately identify different types of lesions during a single procedure also enhances its clinical utility. Moreover, the heatmaps, which highlight the region with the highest probability of containing a lesion, suggest that the model detects lesion patterns in the way we would ideally expect. This explainability feature addresses an important current topic, since it not only reduces the cognitive load of exam interpretation by directing attention to specific areas, but also empowers the physician to make the necessary corrections in case of erroneous predictions.
The fact that this model was based on a ViT architecture is also a strength of this paper and should be emphasized. ViT models are a type of DL algorithm derived from transformers, which were initially developed in the field of natural language processing, where self-attention mechanisms provide an enhanced capability to recognize complex relations [18]. This distinct feature enables models to achieve greater precision in tasks such as language paraphrasing and translation. More recently, there has been increasing interest in employing ViT models for complex visual tasks, with some evidence indicating they perform as well as, or better than, CNN models [13,18]. In terms of CE technology, there are a great number of published DL models based on CNN algorithms, but none of them include ViT models. To our knowledge, this model stands as a pioneer. Not only is it the first of its kind in the esophagus, but it also marks the inaugural application of a ViT method for automatic lesion detection in this specific location, potentially marking a significant double advancement in CE technology. The outcome metrics demonstrate a robust diagnostic performance, indicating a valuable balance between minimizing missed lesions and keeping the proportion of false positives low, which will be an essential consideration for an effective AI-enhanced capsule endoscopy assessment.
Nonetheless, some limitations have to be recognized. On the one hand, this study has a retrospective, bicentric design involving a relatively small number of patients, which may introduce a demographic (selection) bias, implying that these findings may not be broadly applicable to other population settings. Additionally, the lack of access to clinically relevant information about this sample further hinders our understanding of how this model might impact different patient groups. Subsequent prospective studies with a larger patient population and better control of clinical variables are needed to corroborate our results for future application. On the other hand, this algorithm was constructed using a relatively low number of CE still frames, which does not guarantee that the model will exhibit the same diagnostic performance when applied to full-length CE videos. We were also unable to conduct a subanalysis by device or lesion type because the limited sample size lacked sufficient statistical power. This prevents us from evaluating how device-specific image characteristics may influence the model performance, and from determining whether the accuracy is consistent across different lesion types. The paucity of esophageal mucosa frames is worth mentioning, as it is both a limitation and a strength of this work. Until now, there were no published DL models trained specifically for esophageal lesion detection. This may be explained by the limited number of esophageal frames provided by each exam, which impedes the development of dedicated databases. Despite being at a lower technology readiness level due to the limited data and methodological challenges, this work marks a significant achievement. It stands as the first published DL algorithm for the esophagus and also reflects the accumulated experience of the scientific group in the field of AI-enhanced CE, although we recognize that future studies are needed to explore its broader applicability.
The use of AI algorithms that further improve the diagnostic performance of CE could shift the balance that currently favors EGD and may render CE a cost-effective and patient-friendly option for the evaluation of esophageal pathology. In the case of esophageal varices, systematic reviews with meta-analyses have demonstrated that, with the correct protocol, the pooled accuracy for the detection of such lesions can reach up to 90% [10]. Similarly, for Barrett’s metaplasia in patients with gastroesophageal reflux disease, systematic reviews with meta-analyses indicate that CE’s diagnostic performance metrics can be comparable to those of EGD (pooled sensitivity of 78% and specificity of 86% for CE, versus 78% and 90%, respectively, for EGD) [9]. Considering that CE can detect esophageal lesions at a level similar to the gold standard, EGD, one may hypothesize that the addition of AI-assisted reading systems could further improve its diagnostic yield.