DAFT-Net: Dual Attention and Fast Tongue Contour Extraction Using Enhanced U-Net Architecture
Abstract
1. Introduction
- The internal structure of U-Net has been reconfigured by removing one convolutional layer from each encoder and decoder block, significantly reducing the network's computational cost.
- A novel attention learning module has been developed that integrates the CBAM and AG modules into a single dual-function attention mechanism. This module replaces the traditional skip connections, strengthening the network's feature representation capabilities.
- The DAFT-Net model's performance was evaluated against other neural networks using Intersection over Union (IoU), loss, and processing time on three ultrasound datasets: NS, TJU, and TIMIT. On the NS dataset, the model achieved a segmentation accuracy (IoU) of 94.93% and a processing time of 34.55 ms per image.
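The dual attention mechanism in the second contribution can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the weight shapes, the intermediate channel count, and the reduction ratio `r` are illustrative assumptions. An additive attention gate (in the style of Attention U-Net) uses the decoder's gating signal to rescale encoder features spatially, and a CBAM-style channel branch then reweights channels with a shared MLP over average- and max-pooled descriptors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, wx, wg, w_psi):
    """Additive attention gate: the decoder signal g selects which spatial
    positions of the encoder feature x to keep.
    x, g: (C, H, W); wx, wg: (C_int, C); w_psi: (C_int,)."""
    # 1x1 convolutions act as per-pixel linear maps over channels.
    q = np.einsum('ic,chw->ihw', wx, x) + np.einsum('ic,chw->ihw', wg, g)
    alpha = sigmoid(np.einsum('i,ihw->hw', w_psi, np.maximum(q, 0.0)))  # (H, W)
    return x * alpha  # spatially rescaled encoder features

def channel_attention(x, w1, w2):
    """CBAM channel branch: shared two-layer MLP applied to avg- and
    max-pooled channel descriptors. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))                      # (C,)
    mx = x.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # shared MLP with ReLU
    scale = sigmoid(mlp(avg) + mlp(mx))            # (C,) channel weights
    return x * scale[:, None, None]

# Toy shapes and random weights, purely for illustration.
rng = np.random.default_rng(0)
C, H, W, C_int, r = 8, 16, 16, 4, 2
x = rng.standard_normal((C, H, W))   # encoder (skip) feature map
g = rng.standard_normal((C, H, W))   # decoder gating signal
gated = attention_gate(x, g,
                       rng.standard_normal((C_int, C)),
                       rng.standard_normal((C_int, C)),
                       rng.standard_normal(C_int))
out = channel_attention(gated,
                        rng.standard_normal((C // r, C)),
                        rng.standard_normal((C, C // r)))
print(out.shape)  # (8, 16, 16)
```

Chaining the two branches preserves the feature-map shape, so the combined module can stand in for a plain skip connection; CBAM's 7×7 spatial-attention convolution is omitted here for brevity.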
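The IoU metric reported above can be computed for binary segmentation masks as follows; the toy masks and the `eps` smoothing term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over Union (Jaccard index) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)  # eps avoids 0/0 on empty masks

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4-pixel square
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6-pixel rectangle
print(round(iou(a, b), 4))  # intersection 4, union 6 -> 0.6667
```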
2. Methods
2.1. The Proposed DAFT-Net
2.2. Network Internal Design
2.2.1. U-Net
2.2.2. U-Net Simplified Design
2.3. Integrated Attention Module
2.3.1. Gated Attention
2.3.2. CBAM
3. Experiment
3.1. Data Preparation
3.2. Data Preprocessing
3.3. Experimental Implementation
3.4. Evaluation Metrics
3.5. Experiment Setting
4. Experimental Results and Discussion
5. Ablation Experiment Validation
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Gilbert, J.M.; Gonzalez, J.A.; Cheah, L.A.; Ell, S.R.; Green, P.; Moore, R.K.; Holdsworth, E. Restoring speech following total removal of the larynx by a learned transformation from sensor data to acoustics. J. Acoust. Soc. Am. Express Lett. 2017, 141, EL307–EL313.
- Ji, Y.; Liu, L.; Wang, H.; Liu, Z.; Niu, Z.; Denby, B. Updating the silent speech challenge benchmark with deep learning. Speech Commun. 2018, 98, 42–50.
- Liu, H.; Zhang, J. Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism. arXiv 2021, arXiv:2106.11769.
- McKeever, L.; Cleland, J.; Delafield-Butt, J. Using ultrasound tongue imaging to analyse maximum performance tasks in children with Autism: A pilot study. Clin. Linguist. Phon. 2022, 36, 127–145.
- Eshky, A.; Cleland, J.; Ribeiro, M.S.; Sugden, E.; Richmond, K.; Renals, S. Automatic audiovisual synchronisation for ultrasound tongue imaging. Speech Commun. 2021, 132, 83–95.
- Stone, M. A guide to analysing tongue motion from ultrasound images. Clin. Linguist. Phon. 2005, 19, 455–501.
- Hsiao, M.Y.; Wu, C.H.; Wang, T.G. Emerging role of ultrasound in dysphagia assessment and intervention: A narrative review. Front. Rehabil. Sci. 2021, 2, 708102.
- Cloutier, G.; Destrempes, F.; Yu, F.; Tang, A. Quantitative ultrasound imaging of soft biological tissues: A primer for radiologists and medical physicists. Insights Imaging 2021, 12, 1–20.
- Trencsényi, R.; Czap, L. Possible methods for combining tongue contours of dynamic MRI and ultrasound records. Acta Polytech. Hung. 2021, 18, 143–160.
- Li, M.; Kambhamettu, C.; Stone, M. Automatic contour tracking in ultrasound images. Clin. Linguist. Phon. 2005, 19, 545–554.
- Ghrenassia, S.; Ménard, L.; Laporte, C. Interactive segmentation of tongue contours in ultrasound video sequences using quality maps. In Proceedings of the Medical Imaging 2014: Image Processing, San Diego, CA, USA, 16–18 February 2014; SPIE: Bellingham, WA, USA, 2014; Volume 9034, pp. 1046–1052.
- Laporte, C.; Ménard, L. Robust tongue tracking in ultrasound images: A multi-hypothesis approach. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 633–637.
- Xu, K.; Yang, Y.; Stone, M.; Jaumard-Hakoun, A.; Leboullenger, C.; Dreyfus, G.; Roussel, P.; Denby, B. Robust contour tracking in ultrasound tongue image sequences. Clin. Linguist. Phon. 2016, 30, 313–327.
- Tang, L.; Hamarneh, G. Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 154–161.
- Tang, L.; Bressmann, T.; Hamarneh, G. Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves. Med. Image Anal. 2012, 16, 1503–1520.
- Fabre, D.; Hueber, T.; Bocquelet, F.; Badin, P. Tongue tracking in ultrasound images using eigentongue decomposition and artificial neural networks. In Proceedings of the Interspeech 2015-16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Berry, J.; Fasel, I. Dynamics of tongue gestures extracted automatically from ultrasound. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 557–560.
- Li, B.; Xu, K.; Feng, D.; Mi, H.; Wang, H.; Zhu, J. Denoising convolutional autoencoder based B-mode ultrasound tongue image feature extraction. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7130–7134.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Li, G.; Chen, J.; Liu, Y.; Wei, J. wUnet: A new network used for ultrasonic tongue contour extraction. Speech Commun. 2022, 141, 68–79.
- Nie, W.; Yu, Y.; Zhang, C.; Song, D.; Zhao, L.; Bai, Y. Temporal-spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit. IEEE Trans. Biomed. Eng. 2023, 71, 583–595.
- Saha, P.; Liu, Y.; Gick, B.; Fels, S. Ultra2speech-a deep learning framework for formant frequency estimation and tracking from ultrasound tongue images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 473–482.
- Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4353–4361.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
- Nie, W.; Zhang, C.; Song, D.; Bai, Y.; Xie, K.; Liu, A.A. Chest X-ray Image Classification: A Causal Perspective. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 25–35.
- Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
- Li, X.; Zhang, W.; Ding, Q. Understanding and improving deep learning-based rolling bearing fault diagnosis with attention mechanism. Signal Process. 2019, 161, 136–154.
- Nie, W.; Zhang, C.; Song, D.; Zhao, L.; Bai, Y.; Xie, K.; Liu, A. Deep reinforcement learning framework for thoracic diseases classification via prior knowledge guidance. Comput. Med. Imaging Graph. 2023, 108, 102277.
- Eom, H.; Lee, D.; Han, S.; Hariyani, Y.S.; Lim, Y.; Sohn, I.; Park, K.; Park, C. End-to-end deep learning architecture for continuous blood pressure estimation using attention mechanism. Sensors 2020, 20, 2338.
- Kaul, C.; Manandhar, S.; Pears, N. Focusnet: An attention-based fully convolutional network for medical image segmentation. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 455–458.
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Tong, X.; Wei, J.; Sun, B.; Su, S.; Zuo, Z.; Wu, P. ASCU-Net: Attention gate, spatial and channel attention u-net for skin lesion segmentation. Diagnostics 2021, 11, 501.
- Santamaría, J. Testing the Robustness of JAYA Optimization on 3D Surface Alignment of Range Images: A Revised Computational Study. IEEE Access 2024, 12, 19009–19020.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
| Method | Dataset | IoU | Loss | Time (ms) | IoU Rank | Loss Rank |
|---|---|---|---|---|---|---|
| U-Net | NS | 0.9098 ± 0.0943 | 0.1740 ± 0.0387 | 38.03 ± 6.54 | 5 | 6 |
| U-Net | TJU | 0.8869 ± 0.0728 | 0.1415 ± 0.0305 | 39.25 ± 5.12 | 5 | 6 |
| U-Net | TIMIT | 0.8887 ± 0.0869 | 0.1346 ± 0.0642 | 38.76 ± 3.55 | 5 | 6 |
| Unet++ | NS | 0.9145 ± 0.0676 | 0.1653 ± 0.0432 | 45.88 ± 7.42 | 4 | 4 |
| Unet++ | TJU | 0.8935 ± 0.0784 | 0.1329 ± 0.0553 | 47.05 ± 4.25 | 4 | 3 |
| Unet++ | TIMIT | 0.8967 ± 0.0535 | 0.1332 ± 0.0798 | 46.86 ± 3.54 | 4 | 4 |
| SA-Unet | NS | 0.9290 ± 0.0627 | 0.1629 ± 0.0474 | 43.44 ± 8.38 | 3 | 3 |
| SA-Unet | TJU | 0.9057 ± 0.0898 | 0.1356 ± 0.0582 | 43.97 ± 4.65 | 3 | 5 |
| SA-Unet | TIMIT | 0.9102 ± 0.0369 | 0.1344 ± 0.0690 | 43.65 ± 5.47 | 3 | 5 |
| SegAN | NS | 0.9321 ± 0.0886 | 0.1577 ± 0.0354 | 43.52 ± 7.36 | 2 | 2 |
| SegAN | TJU | 0.9095 ± 0.0613 | 0.1305 ± 0.0576 | 45.16 ± 5.82 | 2 | 2 |
| SegAN | TIMIT | 0.9133 ± 0.0445 | 0.1299 ± 0.0787 | 44.22 ± 6.41 | 2 | 2 |
| Deeplab V3 | NS | 0.9091 ± 0.0629 | 0.1667 ± 0.0673 | 44.63 ± 8.91 | 6 | 5 |
| Deeplab V3 | TJU | 0.8853 ± 0.0776 | 0.1338 ± 0.0885 | 46.35 ± 6.85 | 6 | 4 |
| Deeplab V3 | TIMIT | 0.8876 ± 0.0334 | 0.1326 ± 0.0694 | 45.78 ± 5.57 | 6 | 3 |
| DAFT-Net | NS | 0.9493 ± 0.0587 | 0.1452 ± 0.0565 | 34.55 ± 4.83 | 1 | 1 |
| DAFT-Net | TJU | 0.9195 ± 0.0369 | 0.1262 ± 0.0682 | 35.17 ± 5.25 | 1 | 1 |
| DAFT-Net | TIMIT | 0.9206 ± 0.0592 | 0.1258 ± 0.0478 | 34.93 ± 3.71 | 1 | 1 |
| Method | Simplify | AG | CBAM | Accuracy | Time (ms) |
|---|---|---|---|---|---|
| U-Net | × | × | × | 0.9098 ± 0.0943 | 38.03 ± 6.54 |
| U-Net | ✔ | × | × | 0.8916 ± 0.0816 | 36.92 ± 5.81 |
| U-Net | × | ✔ | × | 0.9175 ± 0.0728 | 37.93 ± 8.26 |
| U-Net | × | × | ✔ | 0.9226 ± 0.0682 | 37.34 ± 6.35 |
| U-Net | ✔ | ✔ | × | 0.9132 ± 0.0553 | 36.53 ± 7.81 |
| U-Net | ✔ | × | ✔ | 0.9305 ± 0.0636 | 35.47 ± 6.53 |
| U-Net | × | ✔ | ✔ | 0.9337 ± 0.0534 | 35.68 ± 5.38 |
| DAFT-Net | ✔ | ✔ | ✔ | 0.9493 ± 0.0587 | 34.55 ± 4.83 |
Qualitative segmentation results were presented for the following models (the result images are not reproduced here): U-Net; U-Net + Simplify; U-Net + AG; U-Net + CBAM; U-Net + Simplify + AG; U-Net + Simplify + CBAM; U-Net + AG + CBAM; U-Net + Simplify + AG + CBAM.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, X.; Lu, W.; Liu, H.; Zhang, W.; Li, Q. DAFT-Net: Dual Attention and Fast Tongue Contour Extraction Using Enhanced U-Net Architecture. Entropy 2024, 26, 482. https://doi.org/10.3390/e26060482