Recent Advances in Synthesis and Interaction of Speech, Text, and Vision
Abstract
1. Introduction
- Equal access to information: Audio description gives individuals with visual impairments equal access to visual media such as live events, TV shows, films, and instructional videos.
- Social inclusion: Audio description gives visually impaired persons a sense of community and belonging by allowing them to take part in shared cultural and recreational activities, including athletic events, art exhibitions, and museum tours.
- Independent navigation: Audio description is a crucial aid to independent navigation for blind or visually impaired people in public spaces, helping them move safely and providing useful information about their surroundings.
- Accessibility in the learning process: In educational contexts, audio description is used to convey visual elements such as charts and graphs, enabling visually impaired students to participate fully in their studies across a wide range of subjects.
- Employability: People with visual impairments need accessible descriptions of work-related visual materials. Audio description promotes inclusion and equal employment opportunities by allowing them to engage with visual presentations, diagrams, and other workplace content.
- Multitasking and convenience [2]: Audio descriptions offer a convenient alternative for people who are multitasking or cannot focus on visual content, allowing them to take in information by ear while engaged in other activities.
- Language diversity: People who speak different languages or have different levels of proficiency in the language of visual content can benefit from audio descriptions because they provide an oral explanation that overcomes language barriers [3].
- Learning styles: Individuals differ in how they absorb information, and some may prefer or benefit more from auditory input [4]. Audio descriptions serve people who retain information better by listening than by seeing.
2. Integration of Natural Language Processing (NLP) and Computer Vision (CV)
2.1. Explanation of the Symbiotic Relationship between NLP and CV
2.2. Importance of Joint Processing in Converting Visual Data to Meaningful Audio Descriptions
- Form: Includes characters, places, words, or any recognizable shape or object.
- Motion: Refers to any state or sign of motion, including actions and the passage of time.
- Color: Includes the hue and skin tone of the characters.
- Sound: Refers to sounds that can only be discerned through visual cues.
- Camera Perspective: Includes aspects such as point of view, scale, bird’s eye view, and camera special effects.
- Supporting Information: Consists of supplementary contextual details, such as changes in on-screen information.
2.3. Examples of Successful AI-Powered Visual Assistance Applications
2.3.1. Object Recognition and Text-to-Speech
- Seeing AI (Microsoft) [13], a free app designed for blind and visually impaired people, uses Artificial Intelligence (AI) to describe the surroundings audibly. Its features include instant reading of short text, document text recognition, barcode scanning, facial recognition with age and gender estimation, currency recognition, scene description, audio augmented reality for exploring spaces, indoor navigation, color identification, handwriting reading, light estimation, and integration with other image recognition applications. This multifaceted tool allows users to navigate their surroundings with ease.
- Envision AI [14], an award-winning Optical Character Recognition (OCR) app designed for the visually impaired, uses AI and OCR to audibly interpret the visual world, promoting independence. With full spoken-language support, it quickly reads text in 60 languages, scans documents, recognizes PDFs and images, interprets handwritten notes, and describes scenes. The app also detects colors, scans barcodes for product information, and recognizes nearby people and objects. Envision allows images and documents to be shared from other applications and provides voice descriptions to enhance accessibility.
- TapTapSee [15] is a specialized application designed to assist blind or visually impaired people in identifying objects during their daily activities. Users can take pictures by tapping anywhere on the screen, which makes it easy to photograph two-dimensional or three-dimensional objects from different angles. The app then speaks the identification aloud, provided VoiceOver is enabled. Recognized for its usefulness, TapTapSee has received notable awards, including the American Foundation for the Blind 2014 Access Award and recognition by the RNIB (Royal National Institute of Blind People) as App of the Month in March 2013. Its features include image recognition; repetition of the last identification; uploading and saving images from the device's photo library with their corresponding definitions; and sharing of results via text message, email, or social networks. Developed by CloudSight Inc., a Los Angeles-based technology company specializing in image captioning and understanding, TapTapSee is a sophisticated solution that promotes the independence of the visually impaired. A minimal sketch of this recognize-then-speak pipeline follows this list.
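The applications above share the same basic recognize-then-speak pipeline. The sketch below illustrates it with an off-the-shelf ImageNet classifier from torchvision and the offline pyttsx3 TTS engine; the model choice, the spoken phrasing, and the "example.jpg" input path are assumptions for demonstration, and real assistive apps add far richer scene understanding.

```python
# Minimal sketch of a recognize-then-speak pipeline (assumptions: torchvision >= 0.13
# for the weights API, pyttsx3 installed, and a hypothetical input file "example.jpg").
import torch
import pyttsx3
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()   # off-the-shelf ImageNet classifier
preprocess = weights.transforms()          # matching resizing/normalization
labels = weights.meta["categories"]        # the 1000 ImageNet class names


def describe_aloud(image_path: str) -> None:
    """Classify the image and speak the top prediction aloud."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
    top = logits.softmax(dim=1).argmax(dim=1).item()
    sentence = f"This looks like a {labels[top]}."
    engine = pyttsx3.init()                # offline text-to-speech engine
    engine.say(sentence)
    engine.runAndWait()


describe_aloud("example.jpg")
```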
2.3.2. Navigation and Location Assistance
- BlindSquare [18], an innovative navigation solution for people with visual impairments, combines GPS, compass, and Foursquare data to provide comprehensive assistance indoors and outdoors. Developed with input from the blind community, the app uses algorithms to extract relevant information and delivers it through high-quality speech synthesis. Voice commands, available as a premium feature, give users additional control. BlindSquare acts as a general-purpose GPS solution, offering step-by-step directions, detailed information about nearby places, and compatibility with other navigation applications. As a Foursquare client, it also supports check-ins and related actions. It is a paid application, supports 25 languages, and has received awards such as the 2013 GSMA (Groupe Speciale Mobile Association) Global Mobile Award for the best mobile health product or service.
- Aira [19] is a visual interpreting service that provides real-time communication between visually impaired people and professionally trained visual interpreters through the Aira Explorer application. Assistance is requested with a single button on the main screen, increasing independence and efficiency in describing, reading, explaining, and navigating various environments. Live video streaming includes GPS location detection, which allows agents to interact with the user's environment through an integrated dashboard combining web data, maps, location tracking, search engines, text messages, and rideshare integration. Aira Access distributes the service to organizations in the Aira Access network, allowing visually impaired people to use it for free at partner sites, contributing to accessibility and inclusivity.
2.3.3. Face Recognition and Identification
2.3.4. Assistance from Sighted Volunteers
2.3.5. General Visual Assistance
2.3.6. Wearable Devices
2.3.7. Types of Hardware
- GPUs (Graphics Processing Units): GPUs remain the workhorse of most deep learning applications, including generative AI, because of their parallel processing capabilities. Top-of-the-line options include the NVIDIA A100 and H100 Tensor Core GPUs, which are designed specifically for AI and scientific workloads. GPUs widely used by researchers for intermediate or more manageable experiments include the NVIDIA RTX 30 and 40 series, high-end consumer GPUs with substantial power for generative models (a device-selection sketch follows this list).
- Specialized AI Accelerators: These chips offer even greater efficiency and performance for specific AI workloads. One popular option is Google's TPUs (Tensor Processing Units), which are optimized for the TensorFlow framework and commonly used in Google Cloud. Graphcore IPUs (Intelligence Processing Units) are designed for flexibility and for handling large, complex models. Additionally, AWS Trainium and Inferentia, Amazon's custom AI accelerator chips, are a viable option at the time of writing.
- Model Size and Complexity: Larger, more complex generative models require more powerful hardware with higher memory capacity.
- Image Tasks: Models for image generation often demand higher GPU memory (VRAM) compared to purely text-based models.
- Speech Tasks: Generating realistic speech can be computationally intensive and might require specialized speech-oriented hardware or careful optimization for real-time applications.
- Cloud vs. Local: Cloud-based solutions (e.g., Google Colab, AWS instances) offer access to powerful hardware without upfront investment but might incur recurring costs. Local hardware allows for full control and can be more economical for very frequent use.
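As a small illustration of the hardware and cloud-versus-local points above, the sketch below checks whether a CUDA GPU with sufficient memory is available before loading a generative model; the 8 GB threshold and the CPU fallback are illustrative assumptions, not recommendations from the cited sources.

```python
# Minimal sketch: pick a device and report accelerator memory before loading a
# large generative model. The VRAM threshold is an illustrative assumption.
import torch


def select_device(min_vram_gb: float = 8.0) -> torch.device:
    """Prefer a CUDA GPU with enough memory; otherwise fall back to CPU."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        total_gb = props.total_memory / 1024**3
        print(f"GPU: {props.name}, {total_gb:.1f} GB VRAM")
        if total_gb >= min_vram_gb:
            return torch.device("cuda:0")
        print("GPU found but below the VRAM threshold; using CPU instead.")
    return torch.device("cpu")


device = select_device()
```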
3. Dynamics of Vision, Text, and Sound in Artificial Intelligence
3.1. Methods and Techniques in Image Captioning
- The encoder–decoder architecture is the foundation of most image captioning models: a Convolutional Neural Network (CNN) encoder extracts visual features from the image, and a Recurrent Neural Network (RNN) or Transformer decoder generates the caption. This framework was widely adopted in early image captioning work because of its straightforward and efficient design [31,32,33,34,35]. Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks are RNN variants whose gating mechanisms control the flow of information; they mitigate the vanishing gradient problem and are commonly used in sequence-to-sequence tasks such as image captioning, effectively capturing sequential dependencies [32]. A minimal sketch of this architecture appears after this list.
- Reinforcement learning has been used to fine-tune image captioning models: the model is trained to maximize a reward signal, often computed from the quality of the generated captions. This allows the optimization of non-differentiable metrics and improves caption quality [32,40,41,42].
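Below is a minimal PyTorch sketch of the CNN-encoder/RNN-decoder pattern described above. The ResNet-50 backbone, embedding and hidden sizes, vocabulary size, and toy tensors are illustrative assumptions rather than settings from any cited model; reinforcement-learning fine-tuning would replace the cross-entropy loss at the end with a reward-weighted, sequence-level objective (e.g., self-critical sequence training).

```python
# Minimal sketch of a CNN-encoder / LSTM-decoder captioning model (PyTorch).
# All hyperparameters and the random data are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Extracts a fixed-size visual feature vector from an image."""
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=None)  # pretrained weights would normally be used
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                    # the encoder is often kept frozen
            feats = self.backbone(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                    # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generates a caption token-by-token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                   # (B, T, vocab_size)


# Toy forward pass with random data, just to show the shapes involved.
encoder, decoder = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 20))
logits = decoder(encoder(images), captions)      # (4, 20, 10000)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), captions.reshape(-1))
```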
3.2. Evolution of Text-to-Speech (TTS) Technologies and Techniques
3.3. Advancements in Image-to-Speech Systems
3.4. Image Generation Based on a Text Description
- Semantic enhancement GANs (DC-GANs, GAN-INT, GAN-CLS, GAN-INT-CLS, Dong-GAN, Paired-D GAN, and MC-GAN), which condition generation directly on a text embedding (see the sketch after this list);
- Resolution enhancement GANs (StackGAN, StackGAN++, AttGAN, obj-GANs, HDGAN, and DM-GAN);
- Diversity enhancement GANs (AC-GANs, TAC-GAN, Text-SeGAN, MirrorGAN, and Scene Graph GAN);
- Motion enhancement GANs (ObamaNet, T2S, T2V, and StoryGAN).
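As a concrete illustration of the semantic enhancement family noted above (in the spirit of GAN-CLS), the sketch below conditions both a generator and a discriminator on a sentence embedding. The layer sizes, 64x64 output resolution, and random stand-in embeddings are illustrative assumptions; real systems use deep convolutional architectures and matching-aware training losses.

```python
# Minimal sketch of a GAN-CLS-style text-conditioned generator and discriminator.
# Dimensions and the random "sentence embedding" are illustrative assumptions.
import torch
import torch.nn as nn


class TextConditionedGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, img_pixels), nn.Tanh(),   # pixel values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # Concatenate noise with the sentence embedding so the image depends on the text.
        return self.net(torch.cat([noise, text_embedding], dim=1)).view(-1, 3, 64, 64)


class TextConditionedDiscriminator(nn.Module):
    def __init__(self, text_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_pixels + text_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),                        # real/fake (and text-match) score
        )

    def forward(self, images, text_embedding):
        return self.net(torch.cat([images.flatten(1), text_embedding], dim=1))


# Toy forward pass: in practice the embedding comes from a pretrained text encoder.
G, D = TextConditionedGenerator(), TextConditionedDiscriminator()
z = torch.randn(2, 100)
txt = torch.randn(2, 256)            # stand-in for a sentence embedding
fake = G(z, txt)                      # (2, 3, 64, 64)
score = D(fake, txt)                  # (2, 1)
```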
- Subject terms: Denotes the subject of the image.
- Style modifiers: Indicates a specific artistic style for the image.
- Image prompts: Provides a reference image to convey the desired style or subject.
- Quality boosters: Terms intended to enhance the quality of generated images.
- Repeating terms: Repetition of subjects or style terms to reinforce desired elements.
- Magic terms: Terms that are semantically different from the rest of the prompt, aiming to produce unexpected or surprising results. A worked example combining these modifier types follows this list.
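The short example below assembles a prompt from the modifier types listed above. The wording and the comma-separated ordering are assumptions for demonstration only; text-to-image systems do not require a fixed prompt syntax.

```python
# Illustrative composition of a text-to-image prompt from the modifier types above.
subject = "a lighthouse on a rocky coast at dusk"        # subject term
style = "in the style of a watercolor painting"          # style modifier
quality = "highly detailed, sharp focus"                 # quality boosters
repetition = "watercolor, watercolor"                    # repeating a style term
magic = "the smell of sea salt"                          # semantically distant "magic" term

prompt = ", ".join([subject, style, quality, repetition, magic])
print(prompt)
# a lighthouse on a rocky coast at dusk, in the style of a watercolor painting,
# highly detailed, sharp focus, watercolor, watercolor, the smell of sea salt
```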
3.5. Ethical Considerations and Potential Unintended Consequences
3.6. Challenges and Opportunities
4. Overview of Existing Image Captioning, Text-to-Speech, and Image-to-Speech Datasets
4.1. Image Captioning Datasets
4.2. Text-to-Speech Datasets
4.3. Image-to-Speech Datasets
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- World Health Organization. Blindness and Vision Impairment. 2023. Available online: https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment (accessed on 13 October 2023).
- Sri, K.S.; Mounika, C.; Yamini, K. Audiobooks that converts Text, Image, PDF-Audio & Speech-Text: For physically challenged & improving fluency. In Proceedings of the 2022 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 20–22 July 2022; pp. 83–88. [Google Scholar]
- Unlocking Communication: The Power of Audio Description in Overcoming Language Barriers|Acadestudio. Available online: https://www.acadestudio.com/blog/how-audio-description-is-breaking-down-language-barriers/ (accessed on 21 January 2024).
- Pashler, H.; McDaniel, M.; Rohrer, D.; Bjork, R. Learning styles: Concepts and evidence. Psychol. Sci. Public Interest 2008, 9, 105–119. [Google Scholar] [CrossRef] [PubMed]
- Moens, M.F.; Pastra, K.; Saenko, K.; Tuytelaars, T. Vision and language integration meets multimedia fusion. IEEE Multimed. 2018, 25, 7–10. [Google Scholar] [CrossRef]
- Guo, J.; He, H.; He, T.; Lausen, L.; Li, M.; Lin, H.; Shi, X.; Wang, C.; Xie, J.; Zha, S.; et al. Gluoncv and gluonnlp: Deep learning in Computer Vision and natural language processing. J. Mach. Learn. Res. 2020, 21, 845–851. [Google Scholar]
- Mogadala, A.; Kalimuthu, M.; Klakow, D. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. J. Artif. Intell. Res. 2021, 71, 1183–1317. [Google Scholar] [CrossRef]
- Kleege, G. 7 Audio Description Described. In More than Meets the Eye: What Blindness Brings to Art; Oxford University Press: Oxford, UK, 2018. [Google Scholar] [CrossRef]
- Snyder, J. The Visual Made Verbal: A Comprehensive Training Manual and Guide to the History and Applications of audio Description; Academic Publishing: San Diego, CA, USA, 2020. [Google Scholar]
- Snyder, J. Audio description guidelines and best practices. In American Council of the Blind’s Audio Description Project; American Council of the Blind: Alexandria, VA, USA, 2010. [Google Scholar]
- Bittner, H. Audio description guidelines: A comparison. New Perspect. Transl. 2012, 20, 41–61. [Google Scholar]
- Massiceti, D. Computer Vision and Natural Language Processing for People with Vision Impairment. Ph.D. Thesis, University of Oxford, Oxford, UK, 2019. [Google Scholar]
- Microsoft Corporation. Seeing AI. Available online: https://www.microsoft.com/en-us/ai/seeing-ai (accessed on 3 November 2023).
- Envision. Envision—Perceive Possibility. Available online: https://www.letsenvision.com/ (accessed on 3 November 2023).
- CloudSight, Inc. TapTapSee—Blind and Visually Impaired Assistive Technology—Powered by CloudSight.ai Image Recognition API. Available online: https://www.taptapseeapp.com (accessed on 3 November 2023).
- GAATES, the Global Alliance for Accessible Technologies and Environments. Aipoly App Opens Up the World for People with Vision Disabilities. 2017. Available online: https://globalaccessibilitynews.com/2017/03/28/aipoly-app-opens-up-the-world-for-people-with-vision-disabilities/ (accessed on 3 November 2023).
- Turkel, A. iDentifi. Available online: https://www.getidentifi.com (accessed on 3 November 2023).
- BlindSquare. Available online: https://www.blindsquare.com/ (accessed on 3 November 2023).
- We’re Aira, a Visual Interpreting Service. Available online: https://aira.io/ (accessed on 3 May 2023).
- NoorCam. NoorCam MyEye. Available online: https://www.noorcam.com/en-ae/noorcam-myeye (accessed on 3 November 2023).
- Be My Eyes. Be My Eyes—See the world together. Available online: https://www.bemyeyes.com/ (accessed on 3 November 2023).
- Lookout—Assisted Vision—Apps on Google Play. Available online: https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.reveal&hl=en_US&pli=1 (accessed on 3 November 2023).
- Cyber Timez, Inc. Cyber Timez. Available online: https://www.cybertimez.com (accessed on 3 November 2023).
- Eyesynth—Visión a través del oído. Available online: https://eyesynth.com/ (accessed on 4 April 2024).
- eSight—Electronic Eyewear for the Visually Impaired. 2023. Available online: https://www.esighteyewear.com (accessed on 3 April 2024).
- GiveVision. Available online: https://www.givevision.net (accessed on 3 April 2024).
- NuEyes—Empowering Your Vision. Available online: https://www.nueyes.com/ (accessed on 4 April 2024).
- Beautemps, D.; Schwartz, J.L.; Sato, M. Analysis by Synthesis: A (Re-)Emerging Program of Research for Language and Vision. Biolinguistics 2010, 4, 287–300. [Google Scholar] [CrossRef]
- Vinciarelli, A.; Pantic, M.; Bourlard, H. Open Challenges in Modelling, Analysis and Synthesis of Human Behaviour in Human–Human and Human–Machine Interactions. Cogn. Comput. 2015, 7, 397–413. [Google Scholar] [CrossRef]
- Ashok, K.; Ashraf, M.; Thimmia Raja, J.; Hussain, M.Z.; Singh, D.K.; Haldorai, A. Collaborative analysis of audio-visual speech synthesis with sensor measurements for regulating human–robot interaction. Int. J. Syst. Assur. Eng. Manag. 2022, 1–8. [Google Scholar] [CrossRef]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 652–663. [Google Scholar] [CrossRef] [PubMed]
- Hossain, M.Z. Deep Learning Techniques for Image Captioning. Ph.D. Thesis, Murdoch University, Perth, WA, Australia, 2020. [Google Scholar]
- Seshadri, M.; Srikanth, M.; Belov, M. Image to language understanding: Captioning approach. arXiv 2020, arXiv:2002.09536. [Google Scholar]
- Chen, F.; Li, X.; Tang, J.; Li, S.; Wang, T. A Survey on Recent Advances in Image Captioning. J. Phys. Conf. Ser. 2021, 1914, 012053. [Google Scholar] [CrossRef]
- Wang, C.; Zhou, Z.; Xu, L. An integrative review of image captioning research. J. Phys. Conf. Ser. 2021, 1748, 042060. [Google Scholar] [CrossRef]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
- Jin, J.; Fu, K.; Cui, R.; Sha, F.; Zhang, C. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv 2015, arXiv:1506.06272. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8928–8937. [Google Scholar]
- Ren, Z.; Wang, X.; Zhang, N.; Lv, X.; Li, L.J. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 290–298. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
- Yan, S.; Wu, F.; Smith, J.S.; Lu, W.; Zhang, B. Image captioning using adversarial networks and reinforcement learning. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 248–253. [Google Scholar]
- Dai, B.; Fidler, S.; Urtasun, R.; Lin, D. Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2970–2979. [Google Scholar]
- Shetty, R.; Rohrbach, M.; Anne Hendricks, L.; Fritz, M.; Schiele, B. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4135–4144. [Google Scholar]
- Amirian, S.; Rasheed, K.; Taha, T.R.; Arabnia, H.R. Image captioning with generative adversarial network. In Proceedings of the 2019 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 5–7 December 2019; pp. 272–275. [Google Scholar]
- Cornia, M.; Baraldi, L.; Cucchiara, R. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8307–8316. [Google Scholar]
- Klatt, D. The Klattalk text-to-speech conversion system. In Proceedings of the ICASSP’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, France, 3–5 May 1982; Volume 7, pp. 1589–1592. [Google Scholar]
- Taylor, P. Text-to-Speech Synthesis; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
- Black, A.W.; Taylor, P.A. CHATR: A generic speech synthesis system. In Proceedings of the COLING-94, Kyoto, Japan, 5–9 August 1994; Volume 2, pp. 983–986. [Google Scholar]
- Campbell, N. Prosody and the selection of units for concatenative synthesis. In Proceedings of the ESCA/IEEE 2nd Workshop on Speech Synthesis, New Paltz, NY, USA, 12–15 September 1994; pp. 61–64. [Google Scholar]
- Hunt, A.J.; Black, A.W. Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; Volume 1, pp. 373–376. [Google Scholar]
- Campbell, N. CHATR: A high-definition speech re-sequencing system. In Proceedings of the 3rd Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, Honolulu, HI, USA, 2–6 December 1996. [Google Scholar]
- Tan, X.; Qin, T.; Soong, F.; Liu, T.Y. A survey on neural speech synthesis. arXiv 2021, arXiv:2106.15561. [Google Scholar]
- Yoshimura, T.; Tokuda, K.; Masuko, T.; Kobayashi, T.; Kitamura, T. Duration modeling for HMM-based speech synthesis. In Proceedings of the ICSLP, Sydney, NSW, Australia, 30 November–4 December 1998; Volume 98, pp. 29–32. [Google Scholar]
- Yoshimura, T.; Tokuda, K.; Masuko, T.; Kobayashi, T.; Kitamura, T. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of the Sixth European Conference on Speech Communication and Technology, Budapest, Hungary, 5–9 September 1999. [Google Scholar]
- Tokuda, K.; Yoshimura, T.; Masuko, T.; Kobayashi, T.; Kitamura, T. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, 5–9 June 2000; Volume 3, pp. 1315–1318. [Google Scholar]
- Yoshimura, T.; Tokuda, K.; Masuko, T.; Kobayashi, T.; Kitamura, T. Mixed excitation for HMM-based speech synthesis. In Proceedings of the Seventh European conference on speech Communication and Technology, Aalborg, Denmark, 3–7 September 2001. [Google Scholar]
- Zen, H.; Toda, T.; Tokuda, K. The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006. IEICE Trans. Inf. Syst. 2008, 91, 1764–1773. [Google Scholar] [CrossRef]
- Zen, H.; Tokuda, K.; Masuko, T.; Kobayashi, T.; Kitamura, T. Hidden semi-Markov model based speech synthesis. In Proceedings of the Eighth International Conference on Spoken Language Processing, Jeju Island, Republic of Korea, 4–8 October 2004. [Google Scholar]
- Tokuda, K.; Zen, H.; Black, A.W. An HMM-based speech synthesis system applied to English. In Proceedings of the IEEE Speech Synthesis Workshop, Santa Monica, CA, USA, 13 September 2002; pp. 227–230. [Google Scholar]
- Black, A.W.; Zen, H.; Tokuda, K. Statistical parametric speech synthesis. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007; Volume 4, pp. 1229–1232. [Google Scholar]
- Zen, H.; Tokuda, K.; Black, A.W. Statistical parametric speech synthesis. Speech Commun. 2009, 51, 1039–1064. [Google Scholar] [CrossRef]
- Zen, H.; Senior, A.; Schuster, M. Statistical parametric speech synthesis using deep neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7962–7966. [Google Scholar]
- Qian, Y.; Fan, Y.; Hu, W.; Soong, F.K. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3829–3833. [Google Scholar]
- Fan, Y.; Qian, Y.; Xie, F.L.; Soong, F.K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014. [Google Scholar]
- Zen, H. Acoustic modeling in statistical parametric speech synthesis-from HMM to LSTM-RNN. In Proceedings of the First International Workshop on Machine Learning in Spoken Language Processing (MLSLP 2015), Aizu, Japan, 19–20 September 2015; pp. 125–132. [Google Scholar]
- Zen, H.; Sak, H. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 4470–4474. [Google Scholar]
- Wang, W.; Xu, S.; Xu, B. First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 2243–2247. [Google Scholar]
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Wang, Y.; Skerry-Ryan, R.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135. [Google Scholar]
- Ping, W.; Peng, K.; Chen, J. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv 2018, arXiv:1807.07281. [Google Scholar]
- Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
- Donahue, J.; Dieleman, S.; Bińkowski, M.; Elsen, E.; Simonyan, K. End-to-end adversarial text-to-speech. arXiv 2020, arXiv:2006.03575. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Stephenson, B.; Hueber, T.; Girin, L.; Besacier, L. Alternate Endings: Improving prosody for incremental neural tts with predicted future text input. arXiv 2021, arXiv:2102.09914. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Luong, H.T.; Yamagishi, J. Nautilus: A versatile voice cloning system. IEEE/Acm Trans. Audio Speech Lang. Process. 2020, 28, 2967–2981. [Google Scholar] [CrossRef]
- Ruggiero, G.; Zovato, E.; Di Caro, L.; Pollet, V. Voice cloning: A multi-speaker text-to-speech synthesis approach based on transfer learning. arXiv 2021, arXiv:2102.05630. [Google Scholar]
- Arik, S.; Chen, J.; Peng, K.; Ping, W.; Zhou, Y. Neural voice cloning with a few samples. Adv. Neural Inf. Process. Syst. 2018, 31, 10019–10029. [Google Scholar]
- Hsu, W.N.; Harwath, D.; Song, C.; Glass, J. Text-free image-to-speech synthesis using learned segmental units. arXiv 2020, arXiv:2012.15454. [Google Scholar]
- Stephen, O.; Mishra, D.; Sain, M. Real time object detection and multilingual speech synthesis. In Proceedings of the 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kanpur, India, 6–8 July 2019; pp. 1–3. [Google Scholar]
- Ma, S.; McDuff, D.; Song, Y. Unpaired image-to-speech synthesis with multimodal information bottleneck. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7598–7607. [Google Scholar]
- Bourbakis, N. Automatic Image-to-Text-to-Voice Conversion for Interactively Locating Objects in Home Environments. In Proceedings of the 2008 20th IEEE International Conference on Tools with Artificial Intelligence, Dayton, OH, USA, 3–5 November 2008; Volume 2, pp. 49–55. [Google Scholar]
- Hasegawa-Johnson, M.; Black, A.; Ondel, L.; Scharenborg, O.; Ciannella, F. Image2speech: Automatically generating audio descriptions of images. Casablanca 2017, 2017, 65. [Google Scholar]
- Effendi, J.; Sakti, S.; Nakamura, S. End-to-end image-to-speech generation for untranscribed unknown languages. IEEE Access 2021, 9, 55144–55154. [Google Scholar] [CrossRef]
- Wang, X.; Van Der Hout, J.; Zhu, J.; Hasegawa-Johnson, M.; Scharenborg, O. Synthesizing spoken descriptions of images. IEEE/Acm Trans. Audio Speech Lang. Process. 2021, 29, 3242–3254. [Google Scholar] [CrossRef]
- Ning, H.; Zheng, X.; Yuan, Y.; Lu, X. Audio description from image by modal translation network. Neurocomputing 2021, 423, 124–134. [Google Scholar] [CrossRef]
- Ivezić, D.; Bagić Babac, M. Trends and Challenges of Text-to-Image Generation: Sustainability Perspective. Croat. Reg. Dev. J. 2023, 4, 56–78. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
- Agnese, J.; Herrera, J.; Tao, H.; Zhu, X. A survey and taxonomy of adversarial neural networks for text-to-image synthesis. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1345. [Google Scholar] [CrossRef]
- Jabbar, A.; Li, X.; Omar, B. A survey on generative adversarial networks: Variants, applications, and training. ACM Comput. Surv. (CSUR) 2021, 54, 1–49. [Google Scholar] [CrossRef]
- Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image diffusion model in generative ai: A survey. arXiv 2023, arXiv:2303.07909. [Google Scholar]
- DALL·E: Creating Images from Text. Available online: https://openai.com/research/dall-e (accessed on 16 February 2024).
- Liu, V.; Chilton, L.B. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 30 April–6 May 2022; pp. 1–23. [Google Scholar]
- Oppenlaender, J. A taxonomy of prompt modifiers for text-to-image generation. Behav. Inf. Technol. 2023, 1–14. [Google Scholar] [CrossRef]
- Ordonez, V.; Kulkarni, G.; Berg, T. Im2text: Describing images using 1 million captioned photographs. Adv. Neural Inf. Process. Syst. 2011, 24, 1143–1151. [Google Scholar]
- Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V 13. pp. 740–755. [Google Scholar]
- Chen, X.; Zitnick, C.L. Learning a recurrent visual representation for image caption generation. arXiv 2014, arXiv:1411.5654. [Google Scholar]
- Mathews, A.; Xie, L.; He, X. Senticap: Generating image descriptions with sentiments. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, VIC, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Gurari, D.; Zhao, Y.; Zhang, M.; Bhattacharya, N. Captioning images taken by people who are blind. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XVII 16. pp. 417–434. [Google Scholar]
- Pont-Tuset, J.; Uijlings, J.; Changpinyo, S.; Soricut, R.; Ferrari, V. Connecting vision and language with localized narratives. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part V 16. pp. 647–664. [Google Scholar]
- Sidorov, O.; Hu, R.; Rohrbach, M.; Singh, A. Textcaps: A dataset for image captioning with reading comprehension. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part II 16. pp. 742–758. [Google Scholar]
- Schuhmann, C.; Köpf, A.; Vencu, R.; Coombes, T.; Beaumont, R. Laion Coco: 600M Synthetic Captions From Laion2B-en|LAION. 2022. Available online: https://laion.ai/blog/laion-coco/ (accessed on 5 November 2023).
- Ito, K.; Johnson, L. The LJ Speech Dataset. 2017. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 5 November 2023).
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R.J.; Jia, Y.; Chen, Z.; Wu, Y. Libritts: A corpus derived from librispeech for text-to-speech. arXiv 2019, arXiv:1904.02882. [Google Scholar]
- Zandie, R.; Mahoor, M.H.; Madsen, J.; Emamian, E.S. Ryanspeech: A corpus for conversational text-to-speech synthesis. arXiv 2021, arXiv:2106.08468. [Google Scholar]
- Maniati, G.; Vioni, A.; Ellinas, N.; Nikitaras, K.; Klapsas, K.; Sung, J.S.; Jho, G.; Chalamandaris, A.; Tsiakoulis, P. SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis. arXiv 2022, arXiv:2204.03040. [Google Scholar]
- Jia, Y.; Ramanovich, M.T.; Wang, Q.; Zen, H. CVSS corpus and massively multilingual speech-to-speech translation. arXiv 2022, arXiv:2201.03713. [Google Scholar]
- Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv 2020, arXiv:2010.11567. [Google Scholar]
- Puchtler, P.; Wirth, J.; Peinl, R. Hui-audio-corpus-german: A high quality tts dataset. In Proceedings of the KI 2021: Advances in Artificial Intelligence: 44th German Conference on AI, Virtual Event, 27 September–1 October 2021; pp. 204–216. [Google Scholar]
- Mussakhojayeva, S.; Janaliyeva, A.; Mirzakhmetov, A.; Khassanov, Y.; Varol, H.A. Kazakhtts: An open-source kazakh text-to-speech synthesis dataset. arXiv 2021, arXiv:2104.08459. [Google Scholar]
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
- Harwath, D.; Recasens, A.; Surís, D.; Chuang, G.; Torralba, A.; Glass, J. Jointly discovering visual objects and spoken words from raw sensory input. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 649–665. [Google Scholar]
- Palmer, I.; Rouditchenko, A.; Barbu, A.; Katz, B.; Glass, J. Spoken ObjectNet: A bias-controlled spoken caption dataset. arXiv 2021, arXiv:2110.07575. [Google Scholar]
- Harwath, D.; Glass, J. Deep multimodal semantic embeddings for speech and images. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 237–244. [Google Scholar]
Image captioning datasets (Section 4.1):

Dataset | Source | Volume of Images | Volume of Captions | Annotation Style | Purpose (Usage) |
---|---|---|---|---|---|
SBU Captioned Photo Dataset [96] | Web-based, Flickr | Over 1 million | Over 1 million | Visual relevance, filtering | General image captioning |
Flickr8k [97] | Flickr | 8092 | 8092 × 5 | Crowdsourced, main aspects | Sentence-based image annotation, search |
Flickr30k [98] | Flickr | 31,783 | 31,783 × 5 | English captions, diverse | Image captioning, multimodal learning, and natural language processing |
MS COCO [99] | Internet, crowd workers | 328,000 | Over 2.5 million labeled instances | Object recognition and segmentation | Image captioning, object recognition, and segmentation |
MS COCO Captions [100] | MS COCO dataset | Over 330,000 | Over 1.5 million | Human-created, following guidelines | Caption quality, scene description |
SentiCap [101] | MS COCO dataset | Several thousand | Over 2000 | Sentiment-enriched captions | Emotionally expressive image captions |
Conceptual Captions [102] | Web-based, Flume pipeline | Approximately 3.3 million | Approximately 3.3 million | Image and text filtering, diverse styles | Evaluating image captioning models |
VizWiz-Captions [103] | VizWiz mobile app | 39,181 | 39,181 × 5 | Captions for blind users | Real-world image conditions for blind photographers |
Localized Narratives [104] | MS COCO, Flickr30k, ADE20K, Open Images | — | 849,000 + 671,000 | Voice descriptions with mouse traces | Vision and language research, image captioning |
TextCaps [105] | Open Images v3 dataset | 28,000 | 145,000 | OCR system and human annotators | Reading abilities of image captioning models |
LAION COCO [106] | Publicly available web-images, English subset of Laion-5B | 600 million | 600 million | Synthetic captions | Large-scale captioning resource, complementarity study |
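For reference, the snippet below shows one way to load MS COCO Captions from the table above using torchvision (which relies on pycocotools); the local directory and annotation-file paths are hypothetical, and the images and annotations must first be downloaded from the official COCO site.

```python
# Minimal sketch of loading MS COCO Captions with torchvision; paths are assumptions.
import torchvision.datasets as dsets
import torchvision.transforms as T

coco = dsets.CocoCaptions(
    root="data/coco/train2017",                                # hypothetical image directory
    annFile="data/coco/annotations/captions_train2017.json",   # hypothetical annotation path
    transform=T.Compose([T.Resize((224, 224)), T.ToTensor()]),
)

image, captions = coco[0]   # one image paired with its human-written captions
print(image.shape, len(captions))
```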
Text-to-Speech datasets (Section 4.2):

Dataset | Source | Volume | Purpose (Usage) |
---|---|---|---|
LJ Speech [107] | Non-fiction books published between 1884 and 1964, LibriVox project | 24 h | Research in TTS, voice synthesis models |
LibriTTS (SLR60) [109] | LibriSpeech, Project Gutenberg | 585 h | TTS research, contextual information extraction |
RyanSpeech [110] | Ryan Chatbot, Taskmaster-2, and LibriTTS datasets | 10 h | Development of TTS systems |
SOMOS [111] | Derived from LJ Speech, LPCNet vocoder | 20,000 synthetic utterances | Evaluation of TTS synthesis, refinement of models |
CVSS (Common Voice Speech-to-Speech) [112] | Common Voice speech corpus, CoVoST 2 Speech-to-Text translation dataset | 1900 h | Multilingual Speech-to-Speech translation |
AISHELL-3 [113] | Emotionally neutral recordings from 218 Mandarin speakers | 85 h | Training multilingual TTS systems |
HUI-Audio-Corpus-German [114] | LibriVox | Minimum 20 h per speaker | TTS research, especially for German |
KazakhTTS [115] | Manually extracted articles from news websites | 93 h | Advancing Kazakh TTS applications |
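Similarly, the LJ Speech corpus listed above can be loaded with torchaudio as sketched below; the root path is a placeholder assumption and the corpus is downloaded on first use.

```python
# Minimal sketch of loading LJ Speech with torchaudio; the root path is an assumption.
import torchaudio

ljs = torchaudio.datasets.LJSPEECH(root="data/ljspeech", download=True)
waveform, sample_rate, transcript, normalized_transcript = ljs[0]
print(sample_rate, waveform.shape, normalized_transcript[:60])
```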
Image-to-Speech datasets (Section 4.3):

Dataset | Source | Volume of Spoken Captions | Purpose (Usage) |
---|---|---|---|
Places Audio Captions [117] | Places 205 image dataset | Over 400k | Image-to-Speech tasks, scene understanding |
Spoken ObjectNet [118] | ObjectNet dataset | 50,273 | Image-to-Speech tasks, object recognition |
Localized Narratives [104] | MS COCO, Flickr30k, ADE20K, Open Images | 849,000 + 671,000 | Research at the interface of vision and language |
Flickr Audio Caption [119] | Flickr 8k | 40,000 | Image-to-Speech tasks, diverse image descriptions |
SpokenCOCO [80] | MS COCO | Approximately 600,000 | Image-to-Speech tasks, diverse image descriptions |