BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. Models
2.3. Evaluation
3. Results
3.1. Comparison of Performance of CNNs and ViTs
3.2. Visualization of Heatmaps
3.3. Performance of Models in Different Age Groups
3.4. Training Techniques for Ensemble Models
3.5. Robustness of Models over Bad Examples
3.6. Sensitivity to Demographic Label Perturbation
4. Discussion
4.1. Model Comparison
4.2. Limitations
4.3. Clinical Implementation
4.4. Impact of Demographic Label Integrity
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Model | | Param. No. (M) | Input Res. (px) | Sex (Y/N) | RSNA MAE, Center (↓) | RSNA MAE, Multi-Crop (↓) | External MAE, Center (↓) | External MAE, Multi-Crop (↓)
---|---|---|---|---|---|---|---|---
RSNA Challenge winner | | >24 | 500² | Y | - | 4.2 | - | -
Ensemble-VGG 16 | | >138 | 600² | Y | - | - | 8.8 | -
Inception-V3 | Regression | 25 | 500² | N | 6.1 | 5.7 | 8.1 | 8.0
 | | | | | 4.6 | 4.2 | 7.2 | 7.5
Inception-V3 | Regression-S | 25 | 500² | - | M: 4.4 | M: 4.0 | M: 7.4 | M: 7.5
 | | | | | F: 4.7 | F: 5.0 | F: 6.9 | F: 7.4
Inception-V3 | Ensemble | 25 | 500² | Y | 4.8 | 4.4 | 7.1 | 7.0
ResNet50 | Regression | 24 | 500² | N | 6.8 | 6.7 | 8.6 | 8.4
 | | | | | 5.0 | 4.5 | 7.6 | 7.7
ResNet50 | Regression-S | 24 | 500² | - | M: 4.6 | M: 4.3 | M: 7.7 | M: 7.7
 | | | | | F: 5.3 | F: 4.7 | F: 7.4 | F: 7.6
ResNet50 | Ensemble | 24 | 500² | Y | 4.3 | 4.2 | 7.1 | 7.2
EfficientNet-B5 | Regression | 28 | 456² | N | 6.6 | 5.7 | 8.1 | 8.0
 | | | | | 5.4 | 4.9 | 7.2 | 7.1
EfficientNet-B5 | Regression-S | 30 | 456² | - | M: 4.9 | M: 4.5 | M: 7.5 | M: 7.1
 | | | | | F: 5.8 | F: 5.3 | F: 6.7 | F: 6.9
EfficientNet-B5 | Ensemble | 30 | 456² | Y | 5.5 | 4.9 | 7.4 | 7.2
TinyViT | Regression | 21 | 500² | N | 6.0 | 5.6 | 8.4 | 7.8
 | | | | | 4.6 | 4.4 | 7.0 | 7.1
TinyViT | Regression-S | 21 | 512² | - | M: 4.5 | M: 4.5 | M: 7.2 | M: 7.2
 | | | | | F: 4.7 | F: 4.3 | F: 6.7 | F: 6.8
TinyViT | Ensemble | 21 | 512² | Y | 4.9 | 4.7 | 6.9 | 7.0
BAE-ViT | | 21 | 512² | Y | 4.4 | 4.1 | 6.7 | 6.9

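
The MAE columns in the table above report mean absolute error, with some rows split into male (M) and female (F) values. As a minimal sketch of how such figures can be computed, assuming the standard MAE definition, the snippet below evaluates an overall MAE and a per-sex breakdown. The helper names `mae` and `mae_by_group` and all sample values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import fmean


def mae(pred, true):
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    return fmean(abs(p - t) for p, t in zip(pred, true))


def mae_by_group(pred, true, groups):
    """MAE computed separately for each group label (for example 'M' / 'F')."""
    result = {}
    for label in set(groups):
        idx = [i for i, g in enumerate(groups) if g == label]
        result[label] = mae([pred[i] for i in idx], [true[i] for i in idx])
    return result


# Hypothetical bone ages; illustrative values only, not data from the paper.
true = [120.0, 96.0, 150.0, 60.0]
pred = [124.0, 90.0, 149.0, 65.0]
sex = ["M", "F", "M", "F"]

overall = mae(pred, true)                 # pooled over all samples
per_sex = mae_by_group(pred, true, sex)   # one MAE per sex label
print(overall, per_sex)
```

Note that the pooled MAE is a sample-weighted average of the per-group values, so it only equals their simple mean when the groups are the same size.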
Models | End-to-End | Pretrained on ImageNet-1k, Fixed | Pretrained on ImageNet-1k, Non-Fixed | Pretrained on RSNA Data, Fixed | Pretrained on RSNA Data, Non-Fixed
---|---|---|---|---|---
Inception-V3 | 4.8 | 9.5 | 5.2 | 5.3 | 4.9
ResNet50 | 4.5 | 9.7 | 5.1 | 5.8 | 5.2
TinyViT | 4.9 | 9.4 | 5.8 | 5.2 | 5.8

Test Method | Center-Crop | Multi-Crop |
---|---|---|
Inception-V3 | 10.7 | 10.5 |
ResNet50 | 10.6 | 10.8 |
EfficientNet-B5 | 11.3 | 11.2 |
TinyViT | 10.3 | 10.5 |
BAE-ViT | 9.9 | 10.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, J.; Chen, W.; Joshi, T.; Zhang, X.; Loh, P.-L.; Jog, V.; Bruce, R.J.; Garrett, J.W.; McMillan, A.B. BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation. Tomography 2024, 10, 2058-2072. https://doi.org/10.3390/tomography10120146