Article

Automated Laryngeal Invasion Detector of Boluses in Videofluoroscopic Swallowing Study Videos Using Action Recognition-Based Networks

by Kihwan Nam 1, Changyeol Lee 2, Taeheon Lee 3, Munseop Shin 3, Bo Hae Kim 4,*,† and Jin-Woo Park 3,*,†

1 Graduate School of Management of Technology, Korea University, Seoul 02841, Republic of Korea
2 AimedAI, Seoul 06150, Republic of Korea
3 Department of Physical Medicine and Rehabilitation, Dongguk University Ilsan Hospital, College of Medicine, 27 Dongguk-ro, Ilsandong-gu, Goyang 10326, Republic of Korea
4 Department of Otorhinolaryngology-Head and Neck Surgery, Dongguk University Ilsan Hospital, College of Medicine, 27 Dongguk-ro, Ilsandong-gu, Goyang 10326, Republic of Korea
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work as corresponding authors.
Diagnostics 2024, 14(13), 1444; https://doi.org/10.3390/diagnostics14131444
Submission received: 23 May 2024 / Revised: 1 July 2024 / Accepted: 4 July 2024 / Published: 6 July 2024
(This article belongs to the Special Issue Advances in Diagnosis and Treatment in Otolaryngology)

Abstract

We aimed to develop an automated detector that determines laryngeal invasion during swallowing. Laryngeal invasion, which causes significant clinical problems, is defined as a score of two or more points on the penetration–aspiration scale (PAS). To detect laryngeal invasion (PAS 2 or higher) in videofluoroscopic swallowing study (VFSS) videos, we employed two three-dimensional (3D) stream networks for action recognition. To establish the robustness of our model, we compared its performance with those of various current image classification-based architectures. The proposed model achieved an accuracy of 92.10%. Precision, recall, and F1 scores for detecting laryngeal invasion (PAS 2 or higher) in VFSS videos were each 0.9470. The accuracy of our model in identifying laryngeal invasion surpassed that of other updated image classification models (60.58% for ResNet101, 60.19% for Swin-Transformer, 63.33% for EfficientNet-B2, and 31.17% for HRNet-W32). Our model is the first automated detector of laryngeal invasion in VFSS videos based on video action recognition networks. Considering its high and balanced performance, it may serve as an effective screening tool before clinicians review VFSS videos, ultimately reducing the burden on clinicians.

1. Introduction

Swallowing is successfully accomplished through the sequential and harmonious movement of the upper digestive tract structures [1]. Functional loss or anatomical deformities of these structures, which play a pivotal role in the swallowing process, can impair bolus transition, resulting in conditions such as aspiration pneumonia or malnutrition [2]. Videofluoroscopic swallowing study (VFSS) is the most valuable diagnostic test for dysphagia, elucidating the compromised swallowing mechanism by capturing serial images while the patient ingests a fluorescent bolus [3]. Converting the images acquired during VFSS into a chronological video format facilitates a visually intuitive and detailed evaluation of the dynamic movement of the swallowing structures and their correlation with bolus transition [4]. Consequently, clinicians primarily rely on reviewing VFSS videos. As physicians must assess the success of the intricate swallowing process through reconstructed VFSS videos, which document the entire process within a few seconds, this review demands a significant investment of time and experience from physicians. Moreover, issues related to intra- and inter-rater reliabilities need to be addressed when scrutinizing VFSS videos [5].
Artificial intelligence (AI) based on deep learning has been expanding its applications in various medical fields to enhance the diagnosis and treatment of diseases, alleviate the burden on physicians, and improve the reliability of image review [6,7]. Among the different deep learning models for detecting abnormalities in medical images, image-based analysis, which typically utilizes a two-dimensional (2D) convolutional neural network (CNN), is the most widely employed [8]. A 2D CNN can identify image characteristics by constructing multiple convolution and subsampling layers. It has demonstrated excellent performance in classifying and identifying abnormalities in medical images, despite the relatively limited volume of learning data available in the medical field compared with that in the information technology industry [8,9]. In a previous study, an image-based network exhibited a highly proficient ability to detect aspiration on VFSS [10]; however, image-based networks focus on identifying the spatial characteristics of an image and often overlook temporal information (ordered frame sequences) [11,12]. Therefore, a new deep learning network should be capable of analyzing data, considering both spatial and temporal information, as required in VFSS [13,14].
The most crucial aspect of accurate decision-making by AI using VFSS videos is the efficient extraction and learning of meaningful bolus movements from a limited number of VFSS videos [15]. Action recognition is a specialized model for video analysis that utilizes both spatial and temporal information [16]. Additionally, action recognition can produce more meaningful results in video analysis than conventional image-based analysis because it efficiently analyzes and predicts the characteristics of VFSS videos. Therefore, we aimed to develop an automated detector that determines laryngeal invasion during swallowing, defined as a score of two or more points on the penetration–aspiration scale (PAS). Because laryngeal invasion poses significant clinical problems, we sought to detect it in VFSS videos by employing two 3D stream networks for action recognition.

2. Materials and Methods

2.1. Videofluoroscopic Swallowing Study

All the VFSS procedures were supervised by a single experienced physician in the Department of Rehabilitation Medicine (J.-W.P.). VFSS images were acquired while the patients, in an upright position 1.5 m from the X-ray tube (Sonialvision-100; Shimadzu Corporation, Kyoto, Japan), swallowed a fluorescent bolus mixed with liquid barium and water. Each participant completed two swallows with a 5 mL bolus administered using a syringe. Lateral-view images were captured at 30 frames per second with a frame size of 1021 × 1021 pixels and digitally stored in a digital picture archive and communication system [17].

2.2. Dataset

We consecutively extracted 1300 VFSS videos from our institution’s database of patients complaining of dysphagia between January 2010 and April 2022. The study protocol was approved by our institutional review board (IRB) [IRB no. 2022-05-007-001]. The dataset included VFSS videos of appropriate image quality that recorded at least one task of the entire swallowing process, specifically focusing on the fluidic bolus, which is the most sensitive to laryngeal invasion. We included all recorded videos of liquid swallowing in adults aged 18 years or older, regardless of underlying illness. After excluding videos with poor quality or those that did not capture the entire swallowing process, 1023 videos were included.
Two specialized reviewers determined the PAS scores of the included VFSS videos. Any disagreements were resolved by a third reviewer. Laryngeal invasion during swallowing on VFSS videos was defined as PAS 2 or higher [18]. PAS 1 was assigned to 266 videos (26.0%), and 757 videos (74.0%) were assigned two or more points. The dataset was randomly divided into two groups: a training set (821 videos) and a testing set (202 videos). The testing set comprised 51 PAS 1 videos and 151 laryngeal invasion videos (PAS 2 or higher), serving to evaluate the performance of our AI model.
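For orientation, the random 821/202 split described above can be reproduced schematically as follows. This is an illustrative Python sketch only; the video identifiers, label list, and random seed are placeholders rather than the study’s actual data.

```python
from sklearn.model_selection import train_test_split

# labels[i] = 0 for PAS 1 (no laryngeal invasion), 1 for PAS 2 or higher
video_ids = [f"vfss_{i:04d}" for i in range(1023)]   # placeholder identifiers
labels = [0] * 266 + [1] * 757                        # 266 PAS 1 videos, 757 PAS >= 2 videos

train_ids, test_ids, train_labels, test_labels = train_test_split(
    video_ids, labels,
    test_size=202,       # 202-video testing set, leaving 821 videos for training
    random_state=0,      # any fixed seed keeps the split reproducible
)
print(len(train_ids), len(test_ids))   # 821 202
```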

2.3. Deep Learning Architecture for Action Recognition

This study mainly focused on the application of action recognition techniques to identify laryngeal invasion (PAS 2 or higher scores) in VFSS videos. A video classification model employing a 2D CNN, which disregards temporal factors, extracted features from individual frames using an image classification network. These features were then amalgamated to obtain the results (Figure 1A). However, because temporal factors were not considered, this approach exhibited suboptimal performance on intricate videos such as VFSS [19].
The fundamental principle of the action recognition model involves recognizing keyframe information in a multiframe sequence and classifying it based on the selected keyframe data. Four dimensions (two spatial dimensions, height and width; one temporal dimension; and one channel dimension) traversed the model, facilitating the learning of various temporal interactions between adjacent frames (Figure 1B) [19]. The total number of frames, denoted as “n”, and the specific starting and ending points relevant to the analysis varied for each VFSS video. To ensure an efficient analysis, we defined valid starting and ending points within the “n” frames and interpolated the selected frames using the nearest-neighbor method [20]. Consequently, “k” fixed-size frames were generated and employed as inputs for the action recognition network. To develop a model optimized for video-based action recognition, accuracy was enhanced and computational costs were reduced using a 3D group convolution network [21]. Model learning involved the incorporation of group and depth-wise convolutions. Group convolution was partitioned into channel interaction and spatial–temporal interaction to augment accuracy and introduce a regularization effect that mitigates overfitting [22]. Furthermore, depth-wise convolution was employed to reduce computational costs [22]. As the video underwent convolution, it was segmented into R, G, and B channels, enabling the model to learn from these three channels concurrently, thereby reducing the computational costs by a factor of three.
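The two ideas described above, nearest-neighbor interpolation of a variable-length clip to “k” fixed frames and a 3D block that separates channel interaction from depth-wise spatial–temporal interaction, can be sketched as follows. This is an illustrative PyTorch fragment under assumed shapes and layer sizes, not the authors’ implementation.

```python
import torch
import torch.nn as nn


def sample_k_frames(clip: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Nearest-neighbor temporal interpolation of n frames down to k frames.

    clip: (channels, n, height, width) tensor covering the valid start-to-end span.
    """
    n = clip.shape[1]
    # k evenly spaced positions over the n frames, rounded to the nearest frame index
    idx = torch.linspace(0, n - 1, steps=k).round().long()
    return clip[:, idx]


class FactorizedBlock3D(nn.Module):
    """Channel interaction (pointwise conv) followed by depth-wise spatial-temporal conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.channel = nn.Conv3d(in_channels, out_channels, kernel_size=1)
        self.spatial_temporal = nn.Conv3d(
            out_channels, out_channels, kernel_size=3, padding=1, groups=out_channels
        )
        self.norm = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.act(self.norm(self.spatial_temporal(self.channel(x))))


# Example: a 5 s lateral-view clip at 30 fps (150 frames), resized to 224 x 224,
# reduced to k = 32 frames before entering the network.
clip = torch.randn(3, 150, 224, 224)
x = sample_k_frames(clip, k=32).unsqueeze(0)    # (1, 3, 32, 224, 224)
features = FactorizedBlock3D(3, 64)(x)          # (1, 64, 32, 224, 224)
```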

2.4. Application of Architecture for Action Recognition on VFSS

The videos stored in our institutional database were converted into individual frame images. To enhance image comprehension, we equalized the brightness of all the images. The region of interest (ROI) on each image was divided into three areas where bolus transition occurs during swallowing: the pharynx, larynx, and vocal folds. Manual annotation was performed for each frame image in these areas. After selecting the frames indicating the initiation and completion of swallowing, we sorted the annotated images in the ROI according to time sequence. These images were initially interpolated using the nearest-neighbor method and subsequently used to train the deep learning model. We concatenated all the learned features extracted from the annotated images, followed by the extraction of the final features [23]. Finally, we analyzed the detector performance using a test set to identify abnormalities in the VFSS videos (Figure 2). To establish the model’s robustness, we compared its performance with that of various up-to-date image classification-based architectures (ResNet101, Swin-Transformer, EfficientNet-B2, and HRNet-W32).
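The preprocessing steps described above (brightness equalization, ROI extraction, and time-ordered frame collection) might look roughly like the sketch below. The OpenCV calls reflect one plausible realization; the ROI coordinates, file handling, and helper names are assumptions, since in the study the pharynx, larynx, and vocal-fold regions were annotated manually.

```python
import cv2
import numpy as np

ROI_BOXES = {                         # (x, y, w, h) placeholders, not real coordinates
    "pharynx":     (300, 200, 256, 256),
    "larynx":      (320, 380, 224, 224),
    "vocal_folds": (340, 470, 192, 192),
}

def preprocess_video(path: str, start: int, end: int):
    """Equalize brightness and crop the annotated ROIs for frames start..end."""
    cap = cv2.VideoCapture(path)
    frames = {name: [] for name in ROI_BOXES}
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if start <= index <= end:                      # keep only the swallow itself
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.equalizeHist(gray)              # brightness equalization
            for name, (x, y, w, h) in ROI_BOXES.items():
                frames[name].append(gray[y:y + h, x:x + w])
        index += 1
    cap.release()
    # Frames are appended in capture order, i.e. already sorted by time sequence.
    return {name: np.stack(f) for name, f in frames.items()}
```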

3. Results

Performance

The accuracy of our model for classifying laryngeal invasion was 92.10% (Table 1). The specific accuracy of our model was 84.31% (43/51 videos) for the prediction of PAS 1 and 94.70% (143/151 videos) for PAS 2 or higher (Figure 3A). The receiver operating characteristic (ROC) curve showed an area under the curve of 0.88 (Figure 3B). The precision, recall, and F1 scores for detecting laryngeal invasion (PAS 2 or higher) in the VFSS videos were 0.9470, 0.9470, and 0.9470, respectively (Table 1). As the precision, recall, and F1 score for detecting PAS 1 were 0.8431, 0.8431, and 0.8431, respectively (Table 1), our model performed better in detecting laryngeal invasion than in determining its absence on the VFSS videos. This result may be attributed to the fact that the number of laryngeal invasion videos used for training the model was greater than the number of PAS 1 videos. When using the proposed model to classify the absence of laryngeal invasion (PAS 1) and the presence of laryngeal invasion (PAS 2 or higher), the most common misclassification was PAS 2 classified as the absence of laryngeal invasion (PAS 1). The remaining classes, PAS 3 to PAS 8, showed solid prediction accuracy.
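For clarity, the reported per-class figures can be checked from the confusion-matrix counts implied in the text (43/51 PAS 1 videos and 143/151 PAS 2 or higher videos classified correctly); the short calculation below reproduces the precision, recall, and F1 values, treating laryngeal invasion as the positive class.

```python
# Counts implied by Figure 3A and the per-class accuracies reported above.
tp, fn = 143, 8   # invasion videos classified correctly / missed
tn, fp = 43, 8    # PAS 1 videos classified correctly / flagged as invasion

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 186 / 202 ≈ 0.92
precision = tp / (tp + fp)                           # 143 / 151 ≈ 0.9470
recall = tp / (tp + fn)                              # 143 / 151 ≈ 0.9470
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.9470

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```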
The accuracy of our model in identifying laryngeal invasion was higher than that of other up-to-date image classification models (60.58% for ResNet101, 60.19% for Swin-Transformer, 63.33% for EfficientNet-B2, and 31.17% for HRNet-W32) (Table 2).

4. Discussion

To determine laryngeal invasion in VFSS videos, we developed an automated detector based on two 3D stream networks for action recognition. The classifier was designed using 3D convolutions to learn both spatial features and movements based on the time sequence of the bolus [24,25]. Our detector for VFSS videos demonstrated an accuracy of 92.1%, surpassing that of various state-of-the-art image classification-based deep learning networks.
Action recognition in videos using deep learning, applied for various purposes such as ensuring public safety, preventing crimes, and enhancing the effective motion of athletes, is a technique that recognizes or classifies actions within the object of interest [26,27]. Advances in action recognition and handling of spatial–temporal information have been delayed compared with those in image analysis models owing to the lack of large-scale datasets, high computational cost, and less attention to temporal modeling [28]. However, video action recognition has recently progressed with the introduction of various architectures that can reduce computational costs, while effectively learning temporal and spatial information from videos [28]. VFSS contains 3D information on bolus transitions with height, width, and time sequence. Clinicians can determine abnormalities in VFSS by considering the bolus location and transition [4]. Therefore, spatial–temporal models for action recognition are suitable for application in VFSS deep learning. Additionally, VFSS videos composed of serial images can be used to capture the swallowing process in a short time. This is also an advantageous characteristic of VFSS videos when applying a spatial–temporal model for action recognition from the perspectives of computational cost and clinical application. We defined the bolus transition according to the time in the VFSS videos as an action of interest. Our model is composed of two main 3D convolutions, one for spatial–temporal interaction and the other for channel interaction, to handle both spatial and temporal information in the VFSS video. Moreover, convolution separation enables various calculations by simultaneous parallel processing and reduces the number of parameters used in each convolution, resulting in a reduced computational cost. This network is generally called the “two-stream 3D network for action recognition” [29]. The next important step in the final prediction was to fuse the two separately trained 3D streams by averaging the results of both convolution predictions. In our model, the final prediction was made based on the average output of the two streams through fusion using the concatenation of all learned features from the annotated images [23].
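To make the two-stream fusion concrete, the following schematic PyTorch sketch illustrates the idea described above: one stream emphasizing depth-wise spatial–temporal interaction, one emphasizing pointwise channel interaction, with the final prediction taken as the average of their class scores. Layer sizes and tensor shapes are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class SpatialTemporalStream(nn.Module):
    """Depth-wise 3D convolutions: each channel learns its own space-time pattern."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 24, kernel_size=1),                         # lift to 24 channels
            nn.Conv3d(24, 24, kernel_size=3, padding=1, groups=24),  # depth-wise space-time conv
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(24, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


class ChannelStream(nn.Module):
    """Pointwise 3D convolutions: channels interact without space-time mixing."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 24, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(24, 24, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(24, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


class TwoStreamDetector(nn.Module):
    """Fuses the two streams by averaging their class scores (late fusion)."""

    def __init__(self):
        super().__init__()
        self.spatial_temporal = SpatialTemporalStream()
        self.channel = ChannelStream()

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, height, width)
        return (self.spatial_temporal(clip) + self.channel(clip)) / 2


clip = torch.randn(1, 3, 32, 112, 112)
scores = TwoStreamDetector()(clip)   # (1, 2): [PAS 1, PAS 2 or higher] class scores
```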
An automated detector was recently developed to determine the presence of aspiration in VFSS by utilizing a 2D CNN [10]. This classifier focused on a specific pathological event and demonstrated an accuracy of 93.2%, with 91.2% recall and 88.1% precision in detecting aspiration during swallowing. The remarkable performance of that study may be attributed to the extraction of image characteristics through multiple convolutions with various kernels specialized in detecting a single pathological event such as aspiration [10]. Classifying VFSS videos into PAS 1 and PAS 2 or higher groups using a conventional CNN necessitates the definition and learning of all the diverse pathological events observed in VFSS, incurring excessive work and costs. We closely simulated the actual VFSS review process of clinicians using our deep learning model for bolus action recognition. This model exhibited high and balanced performance with 92.10% accuracy, 94.7% precision, 94.7% recall, and a 0.947 F1 score for detecting laryngeal invasion during swallowing in VFSS videos. In contrast to the 2D CNN, our action recognition model learns bolus characteristics by incorporating channel interaction and bolus transition, in addition to spatial–temporal interaction, to enhance accuracy and introduce a regularization effect. This is achieved without the need to define and train various types of pathological events in VFSS [30]. This aspect is a crucial control point that contributes to the high and balanced performance of our classifier while minimizing the training burden for the deep learning model.
Although various advanced techniques for action recognition have been employed to address the limited training data, only approximately 1000 VFSS videos were used for both model training and validation in this study. The scarcity of data poses a significant issue and is a common challenge in the development of AI models for medical applications. Moreover, external validation using VFSS data from other institutions was not performed during the model development phase. Discrepancies in image quality and VFSS protocols can affect the effectiveness of deep learning in AI development. The primary limitations of this study include the insufficient volume of data and the absence of external validation. These limitations could be mitigated by incorporating additional external data and expanding the dataset.
To the best of our knowledge, our model is the first automated detection of laryngeal invasion in VFSS videos that utilizes video action recognition networks. Owing to its demonstrated high and balanced performance, it may serve as an effective screening tool before clinicians review VFSS videos, potentially reducing their burden.

Author Contributions

Conceptualization: B.H.K. and J.-W.P.; methodology: K.N. and J.-W.P.; validation: T.L., M.S. and J.-W.P.; formal analysis: K.N. and C.L.; investigation: B.H.K., J.-W.P. and M.S.; data curation: T.L., M.S. and J.-W.P.; writing the original draft: K.N. and B.H.K.; writing—review and editing: K.N., B.H.K. and J.-W.P.; visualization: K.N.; supervision: J.-W.P.; project administration: B.H.K. and J.-W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2023-00252208).

Institutional Review Board Statement

This study was approved by the Institutional Review Board (IRB) of Dongguk University Ilsan Hospital [IRB no. 2022-05-007-001; approval date 31 May 2022].

Informed Consent Statement

Patient consent was waived because this was a retrospective study conducted on anonymized data; therefore, written informed consent was not required.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ethical issues.

Conflicts of Interest

Author Changyeol Lee was employed by the company AimedAI. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Matsuo, K.; Palmer, J.B. Anatomy and physiology of feeding and swallowing: Normal and abnormal. Phys. Med. Rehabil. Clin. N. Am. 2008, 19, 691–707. [Google Scholar] [CrossRef]
  2. Pauloski, B.R. Rehabilitation of dysphagia following head and neck cancer. Phys. Med. Rehabil. Clin. N. Am. 2008, 19, 889–928. [Google Scholar] [CrossRef]
  3. Martin-Harris, B.; Jones, B. The videofluorographic swallowing study. Phys. Med. Rehabil. Clin. N. Am. 2008, 19, 769–785. [Google Scholar] [CrossRef]
  4. Gramigna, G.D. How to perform video-fluoroscopic swallowing studies. GI Motil. Online 2006. [Google Scholar] [CrossRef]
  5. Edwards, A.; Froude, E.; Sharpe, G. Developing competent videofluoroscopic swallowing study analysts. Curr. Opin. Otolaryngol. Head Neck Surg. 2018, 26, 162–166. [Google Scholar] [CrossRef]
  6. Bhinder, B.; Gilvary, C.; Madhukar, N.S.; Elemento, O. Artificial Intelligence in Cancer Research and Precision Medicine. Cancer Discov. 2021, 11, 900–915. [Google Scholar] [CrossRef]
  7. Miller, D.D.; Brown, E.W. Artificial Intelligence in Medical Practice: The Question to the Answer? Am. J. Med. 2018, 131, 129–133. [Google Scholar] [CrossRef]
  8. Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional neural networks: An overview and application in radiology. Insights Imaging 2018, 9, 611–629. [Google Scholar] [CrossRef]
  9. Wang, R.; Chen, S.; Ji, C.; Fan, J.; Li, Y. Boundary-aware context neural network for medical image segmentation. Med. Image Anal. 2022, 78, 102395. [Google Scholar] [CrossRef]
  10. Lee, S.J.; Ko, J.Y.; Kim, H.I.; Choi, S.-I. Automatic Detection of Airway Invasion from Videofluoroscopy via Deep Learning Technology. Appl. Sci. 2020, 10, 6179. [Google Scholar] [CrossRef]
  11. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data. 2021, 8, 53. [Google Scholar] [CrossRef]
  12. Yang, Q.; Lu, T.; Zhou, H. A spatio-temporal motion network for action recognition based on spatial attention. Entropy 2022, 24, 368. [Google Scholar] [CrossRef]
  13. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  14. Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  15. Ramanathan, M.; Yau, W.-Y.; Teoh, E.K. Human action recognition with video data: Research and evaluation challenges. IEEE Trans. Hum.-Mach. Syst. 2014, 44, 650–663. [Google Scholar] [CrossRef]
  16. Li, T.; Foo, L.G.; Ke, Q.; Rahmani, H.; Wang, A.; Wang, J.; Liu, J. (Eds.) Dynamic spatio-temporal specialization learning for fine-grained action recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  17. Park, J.W.; Oh, J.C.; Lee, J.W.; Yeo, J.S.; Ryu, K.H. The effect of 5Hz high-frequency rTMS over contralesional pharyngeal motor cortex in post-stroke oropharyngeal dysphagia: A randomized controlled study. Neurogastroenterol. Motil. 2013, 25, 324-e250. [Google Scholar] [CrossRef]
  18. Rosenbek, J.C.; Robbins, J.A.; Roecker, E.B.; Coyle, J.L.; Wood, J.L. A penetration–aspiration scale. Dysphagia 1996, 11, 93–98. [Google Scholar] [CrossRef]
  19. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  20. Rukundo, O.; Cao, H. Nearest neighbor value interpolation. arXiv 2012, arXiv:1211.1768. [Google Scholar]
  21. Lin, C.-J.; Lin, C.-W. Using Three-dimensional Convolutional Neural Networks for Alzheimer’s Disease Diagnosis. Sens. Mater. 2021, 33, 3399–3413. [Google Scholar] [CrossRef]
  22. Liao, Y.; Lu, S.; Yang, Z.; Liu, W. Depthwise grouped convolution for object detection. Mach. Vision Appl. 2021, 32, 1–13. [Google Scholar] [CrossRef]
  23. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  24. Xu, H.; Das, A.; Saenko, K. Two-stream region convolutional 3D network for temporal activity detection. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2319–2332. [Google Scholar] [CrossRef]
  25. Feng, Z.; Sivak, J.A.; Krishnamurthy, A.K. Two-stream attention spatio-temporal network for classification of echocardiography videos. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
  26. Pham, H.H.; Khoudour, L.; Crouzil, A.; Zegers, P.; Velastin, S.A. Video-based human action recognition using deep learning: A review. arXiv 2022, arXiv:2208.03775. [Google Scholar]
  27. Huang, X.; Cai, Z. A review of video action recognition based on 3D convolution. Comput. Electr. Eng. 2023, 108, 108713. [Google Scholar] [CrossRef]
  28. Zhu, Y.; Li, X.; Liu, C.; Zolfaghari, M.; Xiong, Y.; Wu, C.; Zhang, Z.; Tighe, J.; Manmatha, R.; Li, M. A comprehensive study of deep video action recognition. arXiv 2020, arXiv:2012.06567. [Google Scholar]
  29. Liu, H.; Tu, J.; Liu, M. Two-stream 3D convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106. [Google Scholar]
  30. Jeong, S.Y.; Kim, J.M.; Park, J.E.; Baek, S.J.; Yang, S.N. Application of deep learning technology for temporal analysis of videofluoroscopic swallowing studies. Sci. Rep. 2023, 13, 17522. [Google Scholar] [CrossRef]
Figure 1. Presentations of network architecture for analyzing medical data: (A) imaging-based deep learning models primarily determine image normality based on extracted characteristics from the dataset without considering temporal information from serial images; (B) action recognition networks predict the abnormality of VFSS videos by incorporating both spatial and temporal information. VFSS—videofluoroscopic swallowing study.
Figure 2. Process for training the deep learning model using VFSS videos. The training of a two-stream 3D network is initiated by annotating the region of interest in each frame image. After sorting annotated images according to the time sequence, they are utilized for deep learning model training. All learned features from annotated images are concatenated, and final features are then extracted. VFSS—videofluoroscopic swallowing study.
Figure 3. Performance of our action recognition model for detecting laryngeal invasion in VFSS videos: (A) confusion matrix; (B) receiver operating characteristic curve for detecting laryngeal invasion using VFSS videos. VFSS—videofluoroscopic swallowing study; PAS—penetration–aspiration scale.
Table 1. Precision, recall, and F1 scores per video determining the normality of videofluoroscopic swallowing study videos.
Classification | Precision | Recall | F1 Score
Absence of laryngeal invasion (PAS 1) | 0.8431 | 0.8431 | 0.8431
Presence of laryngeal invasion (PAS 2 or higher) | 0.9470 | 0.9470 | 0.9470
PAS—penetration–aspiration scale.
Table 2. Comparison of accuracy between our model and other up-to-date image classification models.
Type | Model | Valid Accuracy
Image | ResNet101 | 60.39% (122/202)
Image | Swin-Transformer | 60.19% (125/202)
Image | EfficientNet-B2 | 63.36% (128/202)
Image | HRNet-W32 | 61.38% (124/202)
Video | Our model | 92.10% (186/202)
