1. Introduction
Dysphagia is the inability of a person to smoothly ingest food owing to problems with the passage of food from the mouth to the esophagus [1]. Patients with stroke, dementia, traumatic brain injury, Parkinson’s disease, and cancer are at an increased risk of developing dysphagia [2]. Dysphagia is also common in aged individuals, including those who have conditions not accompanied by swallowing-related functional or anatomical abnormalities, such as chronic obstructive pulmonary disease [3,4,5]. Notably, in a nursing care facility survey in South Korea, more than half of the patients reported having dysphagia [6].
Oropharyngeal dysphagia is characterized by difficulty initiating a swallow, and it may be accompanied by penetration, aspiration, and a sensation of residual food remaining in the pharynx [7]. Penetration, a mild form of swallowing difficulty, means that a bolus enters the laryngeal vestibule but never reaches the level of the vocal folds; this condition generally clears spontaneously [8]. Aspiration is a severe form of dysphagia in which food or liquid is accidentally inhaled past the vocal cords into the airway [9]. The primary purpose of tube feeding in patients with dysphagia is to prevent aspiration and aspiration pneumonia, a form of pneumonia in which a foreign body such as food, saliva, or sputum enters the alveoli and lungs through the trachea rather than the esophagus [10]. The resulting inflammation and bacterial proliferation can lead to severe complications such as sepsis, which requires invasive management and long-term care; if left untreated, it can eventually lead to death [11]. Consuming food is a basic human instinct; tube feeding necessitated by impaired oral feeding causes deep neck pain, which can lead to severe depression, and serious complications such as gastrointestinal bleeding, which require rapid and accurate evaluation and management [12]. Additionally, in patients with stroke and dementia, aspiration pneumonia can lead to irreversible sequelae. The prevalence of aspiration pneumonia, which can be as high as 20%, has risen steeply in recent years, and it was ranked as the fourth leading cause of death in 2016 [13]. Treatment with broad-spectrum antibiotics requires weeks to months [14]. Therefore, after accurately diagnosing dysphagia, a healthcare provider should determine whether to tube feed the patient and select the appropriate enteral nutrition formulation.
The gold standard method for examining dysphagia is the videofluoroscopic swallowing study (VFSS) [15,16]. The VFSS is not the most common method of diagnosing dysphagia, but it is considered the gold standard; fiberoptic endoscopic evaluation of swallowing (FEES) and bedside testing are also common internationally [11]. The time required for a VFSS test increases when patient compliance is low [17]. In addition, even skilled specialists need considerable time to simultaneously analyze and draw inferences from the large amount of data in the test results [18]. Therefore, to overcome these shortcomings, researchers have recently attempted to diagnose VFSS images using artificial intelligence (AI). Among existing studies, Kim et al. [19] collected VFSS data from 190 patients with dysphagia, selected ten frame images from the swallowing process, and applied a convolutional neural network (CNN) to classify normal swallowing, penetration, and aspiration in the VFSS. They used 665 images from 133 patients as the training dataset and 285 images from 57 patients as the testing dataset. The ten images of the swallowing process comprised five peak images (position of the highest hyoid bone) and five lowest-peak images (position of the most inferior hyoid bone). The classification results were highly accurate: normal (AUC = 0.942), penetration (AUC = 0.878), and aspiration (AUC = 1.000). However, their method was limited in that an entire video could not be used for the deep learning analysis. Ariji et al. [20] performed videofluorography (VFG) on twelve patients (seven men and five women). A U-Net neural network was applied to automatically segment the food material in the VFG images of both patients who swallowed normally and those who had dysphagia. The training dataset comprised 1845 static images: 1005 images of 18 swallows from three patients with healthy swallows and 840 images of 12 swallows from two patients with aspiration or laryngeal conditions. For validation, 155 static images of six swallows from one patient with healthy swallows were used; 510 static images of eighteen swallows from three patients with healthy swallows served as test dataset 1, and 1400 static images of 18 swallows from three patients with aspiration or laryngeal conditions served as test dataset 2. Performance on the test datasets was high, exceeding 0.9.
However, previous studies using AI to diagnose swallowing disorders from VFSS videos have not examined the entire video. Their limitation is that they selected several images from 300 to 500 video frames and analyzed only those images for a swallowing disorder. This approach is impractical in clinical settings because it requires additional effort and time for clinicians to select the appropriate individual video frames. This study aimed to develop a system that solves these problems. We propose a web application for diagnosing dysphagia that is built on a web-based database: it handles the VFSS video data files before evaluation and analysis, labels the data required for AI development, and runs an AI model, trained on the labeled data, that can be applied in clinical settings. In addition, the developed system was applied to stroke patients on a trial basis to investigate its accuracy and reliability.
2. Materials and Methods
2.1. Study Design and Dataset
In this study, 249 of 1348 VFSS cases were randomly selected, and the corresponding video clips were used to conduct a pilot study. The patients were between 25 and 96 years old (mean age 68.3 ± 17.8 years) and included 169 males and 80 females. Of these videos, 31 were excluded because they showed patients unable to swallow food owing to severe oral-phase delay. In the end, 218 VFSS videos were included in developing the AI model. Based on conventional VFSS video readings by neuro-rehabilitation physicians, aspiration was diagnosed in 141 patients, and 77 patients had penetration alone.
The data were collected from patients diagnosed with dysphagia (International Classification of Diseases, 10th Revision, Clinical Modification Diagnosis Code R13.10) who underwent a VFSS between January 2017 and April 2022 at Wonkwang University Hospital, a 798-bed university-affiliated tertiary hospital in Iksan, South Korea. Finally, a video clip labeling dataset was created for AI learning: oral (n = 2355), pharyngeal (n = 2338), esophageal (n = 1480), penetration (n = 1856), and aspiration (n = 1320). Three rehabilitation physicians with more than ten years of experience in performing and reading VFSS examinations categorized all the video clips as “normal”, “penetration”, or “aspiration”. If the specialists were not unanimous in their decision on a video, they reviewed it frame by frame until they reached a consensus. To address potential sources of bias, the specialists and the data technicians were separated during the study periods. In total, aspiration or penetration was confirmed in 218 cases.
2.2. Web-Based Database to Manage VFSS Video Clips
The database schema was designed by analyzing the data generation rules and relationships of the system and by considering how the data would be read and processed. Because the system uses MongoDB, a NoSQL database, the schema was designed by mapping documents and organizing collections.
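As a minimal sketch of such a document mapping, the snippet below builds one MongoDB-style document for a VFSS study. The field names (patient_id, frame_count, labels, and so on) are hypothetical illustrations; the paper does not publish its actual schema.

```python
# Sketch of a document-mapped MongoDB record for one VFSS study.
# All field names here are assumptions for illustration only.
from datetime import datetime, timezone

def make_study_document(patient_id, frame_count, label_classes):
    """Build one MongoDB-style document describing a VFSS multiframe study."""
    return {
        "patient_id": patient_id,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "frame_count": frame_count,
        # Embedded (mapped) sub-documents instead of relational joins:
        "labels": [{"class": c, "frames": []} for c in label_classes],
    }

doc = make_study_document(
    "P-0001", 700,
    ["oral", "pharyngeal", "esophageal", "penetration", "aspiration"],
)
```

Embedding the per-class label lists inside the study document keeps each study self-contained, which suits the read pattern of loading one study at a time for labeling.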
2.3. Multiframe Medical Image-Labeling Web Application
We propose a web application for multiframe medical image labeling.
Figure 1 illustrates the overall structure of the system.
The proposed labeling web application manages video files of swallowing and performs the file handling. The swallowing test images are supported in a multiframe manner to capture the entire path of the food traveling from the oral cavity to the esophagus. Compared to the cross-sectional images obtained using computed tomography or magnetic resonance imaging, multiframe images have bulky data formats with file sizes of 300 MB to 1.5 GB, because they contain 200–700 images that capture movements in the body. When a file is uploaded, the node server stores the original image using Store Over the Web via RESTful Services (STOW-RS), separates the multiple frames for labeling, transforms them into a single image, and saves it. Subsequently, the file is converted to the Audio Video Interleave (AVI) format and saved for video analysis by the AI model. When the multiframe medical image-labeling web application requests image processing, such as using the Brush tool for labeling, the image is processed using a Python package in the Flask micro web framework (Figure 2).
DICOM, Digital Imaging and Communications in Medicine.
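The frame-separation step after upload can be sketched as follows. The function name and the (frames, height, width) array layout are illustrative assumptions, not the application’s actual code; real VFSS frames are 946 × 958 grayscale.

```python
# Minimal sketch of separating a multiframe pixel array into single frames,
# as done server-side after a STOW-RS upload. Array layout is assumed to be
# (frames, height, width); names are illustrative only.
import numpy as np

def split_multiframe(pixel_array: np.ndarray) -> list:
    """Split a (frames, height, width) array into a list of 2-D frames."""
    if pixel_array.ndim != 3:
        raise ValueError("expected a (frames, height, width) array")
    return [pixel_array[i] for i in range(pixel_array.shape[0])]

# e.g. a short 20-frame study at the VFSS frame resolution
multiframe = np.zeros((20, 946, 958), dtype=np.uint8)
frames = split_multiframe(multiframe)
```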
2.4. Categories of the Datasets and the Labeling Process
The three phases of swallowing are (1) the oral phase, in which food is chewed as necessary into a consistency and form that can be swallowed before the tongue pushes it back to trigger pharyngeal swallowing; (2) the pharyngeal phase, in which the food bolus reaches the pharynx and involuntary swallowing occurs, with the swallowing reflex causing rhythmic, involuntary contractions of muscles in the back of the mouth, pharynx, and esophagus that push the food into the pharynx and esophagus; and (3) the esophageal phase, in which the bolus is moved through the esophagus to the stomach by esophageal peristalsis.
Each stage in which the food material (a bolus) moved from the oral phase to the pharyngeal and esophageal phases was labeled, and boluses corresponding to penetration and aspiration were labeled [20]. After the boundaries of each phase were defined in advance, three technicians performed the labeling. Because of anatomical variation, the technicians’ boundary classifications sometimes differed; in such cases, the boundary classification and labeling followed the opinions of clinical experts. The data were divided into five classes and labeled as oral (n = 2355), pharyngeal (n = 2338), esophageal (n = 1480), penetration (n = 1856), and aspiration (n = 1320).
When the VFSS file was uploaded, the server checked the multiframe image and saved it as frames with a resolution of 946 × 958 pixels. The user moved through the images with the scroll bar or the left and right buttons to assign each image to a stage for labeling. They then moved to an image in which the food bolus was observed and adjusted the windowing values to see the bolus more clearly. The bolus was then drawn and labeled in pixel units using a brush. However, because manually labeling food boli pixel by pixel is inaccurate, the GrabCut algorithm was used to separate objects from the background based on a designated region of interest (ROI), together with the BackProjection function, which automatically selects an area whose histogram is similar to that of the ROI. This makes the labeling process relatively easy. A food bolus was labeled at each stage using this method. Finally, the user could export and save the segmentation in Portable Network Graphics (PNG) format to record the labeling results. The swallowing test data were saved as a single multiframe image comprising several frames so that a large number of labeled data could be stored together.
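The idea behind the BackProjection step can be illustrated with a simplified numpy sketch: score each pixel by how common its intensity is inside the user-designated ROI, then threshold the scores. The application itself uses OpenCV’s GrabCut and backprojection functions; this is only a conceptual illustration on grayscale frames.

```python
# Simplified histogram backprojection on a grayscale frame: pixels whose
# intensities are frequent inside the seed ROI get high likelihood scores.
# Conceptual sketch only; the real app uses OpenCV (GrabCut, calcBackProject).
import numpy as np

def backproject(frame: np.ndarray, roi: np.ndarray, bins: int = 32) -> np.ndarray:
    """Return a per-pixel likelihood map based on the ROI's intensity histogram."""
    hist, _ = np.histogram(roi, bins=bins, range=(0, 256), density=True)
    # Map each pixel intensity to its histogram bin, then look up its density.
    idx = np.clip((frame // (256 // bins)).astype(int), 0, bins - 1)
    return hist[idx]

frame = np.zeros((8, 8), dtype=np.uint8)
frame[2:5, 2:5] = 200          # bright "bolus" region
roi = frame[2:5, 2:5]          # user-designated ROI over the bolus
likelihood = backproject(frame, roi)
mask = likelihood > 0          # candidate bolus pixels
```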
2.5. AI Model for Detecting Aspiration and Penetration
The VFSS videos of 218 patients were used to develop an AI model to identify airway involvement, a severe phenomenon in swallowing disorders. The AI model was tested on the labeled data so that airway involvement could be diagnosed as normal, aspiration, or penetration. Separately, the ability to classify the normal swallowing process into the oral, pharyngeal, and esophageal phases was added to the AI model.
When the VFSS study was downloaded from the labeling web application, all the data were compressed into folders and downloaded. Each extracted archive contained all the labeled files, with file names such as Aspiration_1.3.12.2.1107.5.3.33.7367.4.202205261015030277_2_1_0020.png; the files were separated and saved into folders for each class, with the first token of the file name denoting the class. Then, using the SplitFolders utility, each class folder was split into three folders in a ratio of 7:1:2 for the training, validation, and test sets, respectively, which were used as the AI training data.
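This post-processing can be sketched with the standard library alone: derive the class from the file name’s first token, then split the file list 7:1:2. The study used the SplitFolders utility; the function names below are illustrative stand-ins for the same idea.

```python
# Sketch of the download post-processing: parse the class from the first
# underscore-separated token, then split files 7:1:2 into train/val/test.
# Reproduces the SplitFolders idea with the standard library; names are
# illustrative, not the study's actual code.
import random

def class_of(filename: str) -> str:
    """First underscore-separated token encodes the class label."""
    return filename.split("_", 1)[0]

def split_dataset(files, seed=0):
    """Shuffle deterministically and slice into 70% / 10% / 20% partitions."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train, n_val = int(n * 0.7), int(n * 0.1)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

names = [f"Aspiration_{i:04d}.png" for i in range(100)]
train, val, test = split_dataset(names)
```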
The AI training was tested in two ways. First, for comparison with related research [15], an AI model that classified each class was developed using EfficientNetV2 [16], a CNN-based classifier. Second, an AI model for object detection in each class was developed using YOLOv7 [21].
In CNN models, training speed typically decreases as the dataset grows. EfficientNetV2, however, is a fast-learning model that achieves 4 times faster training with 6.8 times fewer parameters than EfficientNetV1. In addition, because EfficientNetV2 training slows as image size increases, the original frame size of the VFSS inspection data (946 × 958) was reduced by half (473 × 479).
For the YOLOv7 model adopted in this study, the original image size was used as the training input; however, the downloaded label files were image data, which cannot be used directly by the YOLO model. Therefore, the center of the contour of each labeled image was calculated and converted to YOLO coordinates describing a box that encloses the object around its center. The number of classes and the class names were changed to match the current dataset, and the remaining hyperparameters were left at their defaults.
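The label-to-YOLO conversion can be sketched as follows: take the bounding box of the labeled mask and express its center and size normalized to the image dimensions, which is the (cx, cy, w, h) form YOLO training labels expect. The function below is an illustrative simplification of that conversion, not the study’s actual code.

```python
# Sketch of converting a labeled mask (from a PNG label image) into a
# normalized YOLO box (cx, cy, w, h). Illustrative simplification only.
import numpy as np

def mask_to_yolo(mask: np.ndarray):
    """Return (cx, cy, w, h) in [0, 1] for the nonzero region of a 2-D mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("empty mask")
    h_img, w_img = mask.shape
    x0, x1 = xs.min(), xs.max() + 1   # half-open pixel extents
    y0, y1 = ys.min(), ys.max() + 1
    return ((x0 + x1) / 2 / w_img, (y0 + y1) / 2 / h_img,
            (x1 - x0) / w_img, (y1 - y0) / h_img)

mask = np.zeros((100, 200), dtype=np.uint8)
mask[40:60, 50:150] = 1   # labeled bolus: rows 40-59, cols 50-149
cx, cy, w, h = mask_to_yolo(mask)
```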
4. Discussion
This study proposes an AI web application for diagnosing aspiration or penetration in the swallowing process. An AI model that can produce results from the multiframe data generated by VFSS examinations, without preprocessing or separate handling, was developed for clinical application. Labeling application software was developed to manage large amounts of data and to generate training data effectively. In particular, we aimed to reduce labeling errors by including clinician verification in the labeling process. Additionally, a YOLO-based model that detects food material was developed to diagnose dysphagia, achieving an accuracy of at least 0.8. After optimization and modularization, the model performed diagnosis across five classes, and the classification results were integrated into the web application.
Research using artificial intelligence to analyze medical images has been actively published in recent years. Most of these studies used AI to diagnose diseases or investigate functions from still images, such as MRI and CT images, rather than videos; few studies analyze video with AI for diagnosis. Konradi et al. [24] developed explainable artificial intelligence (XAI) to analyze flexible endoscopic evaluation of swallowing (FEES) videos. In that pilot study, the accuracy in diagnosing swallowing disorders was reported as 0.925 on the training data and 0.571 on the testing data [24]. Similar to our study, Jeong et al. [25] attempted to diagnose swallowing disorders from VFSS videos with a ResNet3D AI model. Multiple indicators, including oral phase duration, pharyngeal delay time, pharyngeal response time, and pharyngeal transit time, were used, and accuracies of 0.901–0.981 were reported. Medical images stored as videos (e.g., echocardiography, gastroscopy, fetal ultrasound, and gait analysis) are difficult to use for AI-based diagnosis. This study is one of the few in which an AI automatically analyzes swallowing disorders after the entire video file is uploaded. Research on diagnosing diseases with AI from medical videos is still in its early stages, and suitable AI models remain experimental; nevertheless, medical video AI research is expected to remain active.
In our study, the average time taken to detect aspiration and penetration over an entire VFSS video (700 frames) using the proposed EfficientNetV2 and YOLOv7 models was 40–60 s, and the model exhibited an accuracy of 90% in diagnosing aspiration and penetration. Among the difficulties of using a VFSS, examining and accurately reading one is time-consuming for a doctor because it often requires multiple viewings. Stroke patients undergoing a VFSS often also have hemiplegia or quadriplegia in addition to dysphagia; therefore, the long waiting time in the clinic for their results puts considerable pressure on them. Additionally, the VFSS has the limitation of poor inter- and intra-rater reliability [16,22,26]. Inexperienced examiners often misinterpret VFSS results because of the complex anatomy of the human neck and the poor video quality that arises when recording uncooperative patients [16]. One study reported an inter-rater reliability coefficient of 0.9 between well-trained, experienced examiners and 0.6 between less skilled and less experienced examiners [26]. The AI model developed in our study to diagnose airway aspiration and penetration from an entire VFSS video can help reduce patient waiting times and improve the reliability and validity of testing [27].
In our study, four labelers labeled the food material in the swallowing process in the multiframe data so that dysphagia could be diagnosed from the food material in the VFSS videos of patients with dysphagia. However, when the food material overlapped ambiguously between sections, it was labeled differently depending on the labeler, resulting in poor learning results. To solve this problem, rehabilitation physicians manually read the VFSS tests and reviewed whether the labelers had labeled them properly; this verification improved the accuracy of the labeled training data.
This study has several limitations. Since the system was developed for patients at a single institution, additional validation tests on patients from external hospitals are required. Compared with previous studies, we used more patients’ VFSS videos to develop the AI models, but the model needs to be trained on more VFSS videos to become more reliable. In addition, many diseases cause swallowing disorders, and this study did not consider disease-specific characteristics. Airway invasion, such as aspiration or penetration, has the greatest diagnostic value in dysphagia; nevertheless, its manifestation may differ by disease. Optimizing AI VFSS video diagnostics by the cause of the swallowing disorder will be addressed in future studies.