Search Results (39)

Search Parameters:
Keywords = lipreading

18 pages, 697 KB  
Review
Lip-Reading: Advances and Unresolved Questions in a Key Communication Skill
by Martina Battista, Francesca Collesei, Eva Orzan, Marta Fantoni and Davide Bottari
Audiol. Res. 2025, 15(4), 89; https://doi.org/10.3390/audiolres15040089 - 21 Jul 2025
Viewed by 609
Abstract
Lip-reading, i.e., the ability to recognize speech using only visual cues, plays a fundamental role in audio-visual speech processing, intelligibility, and comprehension. This capacity is integral to language development and functioning; it emerges in early development, and it slowly evolves. By linking psycholinguistics, psychophysics, and neurophysiology, the present narrative review explores the development and significance of lip-reading across different stages of life, highlighting its role in human communication in both typical and atypical development, e.g., in the presence of hearing or language impairments. We examined how relying on lip-reading becomes crucial when communication occurs in noisy environments and, conversely, the impact that visual barriers can have on speech perception. Finally, this review highlights individual differences and the role of cultural and social contexts for a better understanding of the visual counterpart of speech.

44 pages, 18791 KB  
Review
Spatiotemporal Feature Enhancement for Lip-Reading: A Survey
by Yinuo Ma and Xiao Sun
Appl. Sci. 2025, 15(8), 4142; https://doi.org/10.3390/app15084142 - 9 Apr 2025
Viewed by 1525
Abstract
Lip-reading, a crucial technique for recognizing human lip movement patterns and converting them into semantic output, has gained increasing attention due to its broad applications in public safety, healthcare, the military, and entertainment. Spatiotemporal feature enhancement techniques have played a significant role in advancing deep learning-based lip-reading research. This paper presents a comprehensive review of the latest advancements in lip-reading methods by exploring the key properties of a diverse range of enhancement techniques, involving spatial features, spatiotemporal convolution, attention mechanisms, pulse features, audio-visual features, and more. Six categories of spatiotemporal feature enhancement methods for lip-reading are presented according to their network structures, and each category is further divided into subclasses based on differences in architecture, feature attributes, and application type. This is followed by an in-depth discussion of state-of-the-art spatiotemporal feature enhancement methods, an analysis of their challenges and limitations, and a discussion of future research directions. By examining these techniques from multiple perspectives, this review reveals their limitations and intrinsic disparities across categories, helping scholars embark on innovative paths in the advancement of lip-reading.
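To make the front-end terminology concrete, here is a minimal sketch of the kind of spatiotemporal feature extractor these methods build on: one 3D convolution over a lip clip followed by spatial pooling, producing one feature vector per frame. It is a generic illustration (shapes and layer sizes are assumed), not any specific model from the survey.

```python
# Generic spatiotemporal front-end sketch (PyTorch); all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatioTemporalFrontEnd(nn.Module):
    """One 3D convolution over time and space, then per-frame spatial pooling."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv3d = nn.Conv3d(1, out_channels, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3), bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, squeeze space

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn(self.conv3d(clips)))       # [B, C, T, H', W']
        x = self.pool(x).squeeze(-1).squeeze(-1)        # [B, C, T]
        return x.transpose(1, 2)                        # [B, T, C] per-frame features

clips = torch.randn(2, 1, 29, 88, 88)                   # 29-frame grayscale lip crops
print(SpatioTemporalFrontEnd()(clips).shape)            # torch.Size([2, 29, 64])
```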

17 pages, 3439 KB  
Article
A Novel Approach for Visual Speech Recognition Using the Partition-Time Masking and Swin Transformer 3D Convolutional Model
by Xiangliang Zhang, Yu Hu, Xiangzhi Liu, Yu Gu, Tong Li, Jibin Yin and Tao Liu
Sensors 2025, 25(8), 2366; https://doi.org/10.3390/s25082366 - 8 Apr 2025
Cited by 1 | Viewed by 1058
Abstract
Visual speech recognition is a technology that relies on visual information, offering unique advantages in noisy environments or when communicating with individuals with speech impairments. However, this technology still faces challenges, such as limited generalization ability due to different speech habits, high recognition error rates caused by confusable phonemes, and difficulties adapting to complex lighting conditions and facial occlusions. This paper proposes a lip-reading data augmentation method, Partition-Time Masking (PTM), to address these challenges and improve the performance and generalization ability of lip-reading models. Applying nonlinear transformations to the training data enhances the model's generalization ability when handling diverse speakers and environmental conditions. A lip-reading recognition model architecture, Swin Transformer and 3D Convolution (ST3D), was designed to overcome the limitations of traditional lip-reading models that use ResNet-based front-end feature extraction networks. By adopting a strategy that combines the Swin Transformer with 3D convolution, the proposed model enhances performance. To validate the effectiveness of the Partition-Time Masking data augmentation method, experiments were conducted on the LRW video dataset using the DC-TCN model, achieving a peak accuracy of 92.15%. The ST3D model was validated on the LRW and LRW1000 video datasets, achieving a maximum accuracy of 56.1% on LRW1000 and 91.8% on LRW, outperforming current mainstream lip-reading models and demonstrating superior performance on challenging, easily confused samples.
(This article belongs to the Special Issue Sensors for Biomechanical and Rehabilitation Engineering)
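The abstract does not spell out the Partition-Time Masking procedure, so the sketch below is only a plausible reading of the name: split the frame sequence into equal temporal partitions and zero out a short random span inside each. The partition count and mask length are illustrative assumptions, not the paper's settings.

```python
# Hedged guess at a partition-style temporal masking augmentation; not the paper's exact PTM.
import torch

def partition_time_mask(clip: torch.Tensor, num_partitions: int = 4,
                        max_mask_frames: int = 3) -> torch.Tensor:
    """clip: [frames, H, W] video tensor; returns an augmented copy."""
    out = clip.clone()
    frames = clip.shape[0]
    part_len = frames // num_partitions
    for p in range(num_partitions):
        start, end = p * part_len, (p + 1) * part_len
        span = int(torch.randint(0, max_mask_frames + 1, (1,)))
        if span == 0 or end - start <= span:
            continue
        offset = int(torch.randint(start, end - span + 1, (1,)))
        out[offset:offset + span] = 0.0   # zero out the chosen frames in this partition
    return out

clip = torch.randn(29, 88, 88)             # one grayscale lip clip
print(partition_time_mask(clip).shape)     # torch.Size([29, 88, 88])
```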

31 pages, 6388 KB  
Article
Polymers Used in Transparent Face Masks—Characterization, Assessment, and Recommendations for Improvements Including Their Sustainability
by Katie E. Miller, Ann-Carolin Jahn, Brian M. Strohm, Shao M. Demyttenaere, Paul J. Nikolai, Byron D. Behm, Mariam S. Paracha and Massoud J. Miri
Polymers 2025, 17(7), 937; https://doi.org/10.3390/polym17070937 - 30 Mar 2025
Viewed by 1040
Abstract
By 2050, 700 million people will have hearing loss requiring rehabilitation services. For about 80% of deaf and hard-of-hearing individuals, face coverings hinder the ability to lip-read. The normal-hearing population also experiences difficulties socializing when wearing face masks. Therefore, there is a need to evaluate and further develop transparent face masks. In this work, the properties of polymers used in ten commercial transparent face masks were determined. The chemical composition of the polymers, including those in the nose bridges and ear loops, was determined by FTIR spectroscopy. The characterization focused on the polymers in the transparent portion of each face mask. In half of the masks, the transparent portion contained PET, while in the other masks it consisted of PETG, PC, iPP, PVC, or SR (silicone rubber). Most masks had been coated with anti-fog material, and a few with scratch-resistant compounds, as indicated by XRF/EDX, SEM/EDX, and contact angle measurements. Thermal, molecular weight, and mechanical properties were determined by TGA/DSC, SEC, and tensile tests, respectively. To measure optical properties, UV-Vis reflectance and UV-Vis haze were applied. An assessment of the ten masks and recommendations for developing better transparent face masks, including improvements to their sustainability, were made.
(This article belongs to the Section Polymer Applications)

10 pages, 1139 KB  
Proceeding Paper
Deepening Mathematical Understanding Using Visualization and Interactive Learning for Deaf Students
by Stefanie Amiruzzaman, Md Amiruzzaman, Heena Begum, Deepshikha Bhati and Tsung Heng Wu
Eng. Proc. 2025, 89(1), 4; https://doi.org/10.3390/engproc2025089004 - 21 Feb 2025
Viewed by 555
Abstract
Learning mathematical concepts is challenging for many students. Paying attention to class lectures and following the instructions for the different steps of solving a problem are keys to success in learning mathematical concepts. However, deaf and hard-of-hearing (DHH) students face challenges in focusing on their teacher's mouth to lip-read or in depending on an interpreter's sign language. Visualization is important in learning, as it enhances attention and keeps students focused on a subject. Interactive learning, or a hands-on approach, can help learners engage with a topic and provide an opportunity to better understand a concept. Combining these two techniques (i.e., visualization and interactive learning), we present an interactive number line (INL) tool to help DHH students understand mathematical concepts such as mean, median, mode, and range. The tool has proven useful for visual learning and for new learners, as it provides an activity-based learning environment and feedback.
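For reference, the four statistics the INL tool visualizes can be computed in a few lines; this is purely illustrative and unrelated to the tool's actual implementation.

```python
# Minimal sketch of mean, median, mode, and range (standard library only).
from statistics import mean, median, multimode

def describe(values):
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),           # all most-common values
        "range": max(values) - min(values),
    }

print(describe([3, 7, 7, 2, 9, 4]))
# {'mean': 5.333..., 'median': 5.5, 'mode': [7], 'range': 7}
```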

22 pages, 36914 KB  
Article
Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading
by Samar Daou, Achraf Ben-Hamadou, Ahmed Rekik and Abdelaziz Kallel
Technologies 2025, 13(1), 26; https://doi.org/10.3390/technologies13010026 - 9 Jan 2025
Cited by 2 | Viewed by 2580
Abstract
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
(This article belongs to the Section Information and Communication Technologies)
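As a rough illustration of the cross-attention fusion described above, the sketch below lets per-frame visual features attend to landmark-based geometric features. The feature dimension, head count, and residual connection are assumptions for the example, not the authors' exact LRW-AR architecture.

```python
# Hedged cross-attention fusion sketch (PyTorch); dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, geometric: torch.Tensor) -> torch.Tensor:
        # visual:    [B, T, dim]  e.g. 3D-conv + ResNet-18 frame embeddings
        # geometric: [B, T, dim]  e.g. graph-network embeddings of lip landmarks
        fused, _ = self.attn(query=visual, key=geometric, value=geometric)
        return self.norm(visual + fused)     # residual connection around the attention

vis, geo = torch.randn(2, 29, 256), torch.randn(2, 29, 256)
print(CrossAttentionFusion()(vis, geo).shape)    # torch.Size([2, 29, 256])
```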

22 pages, 10697 KB  
Article
Lip-Reading Classification of Turkish Digits Using Ensemble Learning Architecture Based on 3DCNN
by Ali Erbey and Necaattin Barışçı
Appl. Sci. 2025, 15(2), 563; https://doi.org/10.3390/app15020563 - 8 Jan 2025
Viewed by 1105
Abstract
Understanding others correctly is of great importance for maintaining effective communication. Factors such as hearing difficulties or environmental noise can disrupt this process. Lip reading offers an effective solution to these challenges. With the growing success of deep learning architectures, research on lip reading has gained momentum. The aim of this study is to create a lip reading dataset for Turkish digit recognition and to conduct predictive analyses. The dataset was divided into two subsets: the face region and the lip region. CNN, LSTM, and 3DCNN-based models, including C3D, I3D, and 3DCNN+BiLSTM, were used. While LSTM models are effective in processing temporal data, the 3DCNN-based models, which can process both spatial and temporal information, achieved higher accuracy in this study. Experimental results showed that the dataset containing only the lip region performed better; accuracy rates for CNN, LSTM, C3D, and I3D on the lip region were 67.12%, 75.53%, 86.32%, and 93.24%, respectively. The 3DCNN-based models achieved higher accuracy due to their ability to process spatio-temporal data. Furthermore, an additional 1.23% improvement was achieved through ensemble learning, with the best result reaching 94.53% accuracy. Ensemble learning, by combining the strengths of different models, provided a meaningful improvement in overall performance. These results demonstrate that 3DCNN architectures and ensemble learning methods yield high success in addressing the problem of lip reading in the Turkish language. While our study focuses on Turkish digit recognition, the proposed methods have the potential to be successful in other languages or broader lip reading applications.
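The ensemble step reported above amounts to combining the predictions of several trained networks; a minimal soft-voting sketch is shown below, with placeholder models standing in for the paper's C3D/I3D/3DCNN+BiLSTM networks.

```python
# Soft-voting ensemble sketch (PyTorch); the "models" are placeholders, not the paper's nets.
import torch
import torch.nn as nn

def ensemble_predict(models, clip: torch.Tensor) -> int:
    """clip: [1, C, T, H, W]; returns the digit class with the highest mean probability."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(clip), dim=-1))
    return int(torch.stack(probs).mean(dim=0).argmax(dim=-1))

# Stand-in "models": any modules mapping a clip to 10 digit logits would do here.
dummy_models = [nn.Sequential(nn.Flatten(), nn.Linear(1 * 16 * 44 * 44, 10)) for _ in range(3)]
clip = torch.randn(1, 1, 16, 44, 44)
print(ensemble_predict(dummy_models, clip))   # e.g. 7
```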

24 pages, 494 KB  
Article
Access to Sexual and Reproductive Health and Rights Services for Young Women with and Without Disabilities During a Pandemic
by Jill Hanass-Hancock, Ayanda Nzuza, Thesandree Padayachee, Kristin Dunkle, Samantha Willan, Mercilene Tanyaradzwa Machisa and Bradley Carpenter
Disabilities 2024, 4(4), 972-995; https://doi.org/10.3390/disabilities4040060 - 21 Nov 2024
Cited by 1 | Viewed by 1984
Abstract
Young women with and without disabilities in South Africa experience challenges accessing sexual and reproductive health and rights (SRHR) services, and this may increase during a crisis. We conducted a longitudinal cohort study with 72 young women with and without disabilities (18–25 years) in eThekwini, South Africa (2020–2022) via a series of in-depth interviews, collecting quantitative and qualitative data on participants' experiences during the COVID-19 pandemic and their access to SRHR services. Participants reported that barriers to accessing SRHR services included lockdown regulations, prioritization of COVID-19 at health care facilities, fear of COVID-19 infection, transport challenges, and youth-unfriendly clinics. Participants with disabilities experienced additional barriers to SRHR services, both ongoing (inaccessible services) and pandemic-specific (e.g., masks making lipreading impossible). Participants reported both non-partner and partner violence, with women with disabilities more frequently reporting physical and sexual partner violence as well as physical and emotional abuse from caregivers. Participants with disabilities did not report incidents of violence to caregivers or officials because they had 'normalized' the experience of violence, were not believed when trying to disclose, feared that reporting would increase their problems, or could not access services due to disability-related barriers. Inclusive and accessible SRHR information, education, and services are needed. This includes disability-specific staff training, disability audits, and caregiver support and training.

19 pages, 3739 KB  
Article
Segmenting Speech: The Role of Resyllabification in Spanish Phonology
by Iván Andreu Rascón
Languages 2024, 9(11), 346; https://doi.org/10.3390/languages9110346 - 7 Nov 2024
Cited by 1 | Viewed by 1715
Abstract
Humans segment speech naturally based on the transitional probabilities between linguistic elements. For bilingual speakers navigating between a first (L1) and a second language (L2), L1 knowledge can influence their perception, leading to transfer effects based on phonological similarities or differences. Specifically, in Spanish, resyllabification occurs when consonants at the end of a syllable or word boundary are repositioned as the onset of the subsequent syllable. While this process can lead to ambiguities in perception, current academic discussions debate the duration of canonical and resyllabified productions. However, the role of bilingualism in the visual perception of syllable and word segmentation remains unknown to date. The present study explores the use of bilingual skills in the perception of articulatory movements and visual cues in speech perception, addressing the gap in the literature regarding the visibility of syllable pauses in lipreading. The participants in this study, 80 native Spanish speakers and 195 L2 learners, completed audio-only, visual-only, and audiovisual conditions to assess their segmentation accuracy. The results indicated that both groups could segment speech effectively, with audiovisual cues providing the most significant benefit. Native speakers performed more consistently, while proficiency influenced L2 learners' accuracy. The results show that aural syllabic segmentation is acquired at early stages of proficiency, while visual syllabic segmentation is acquired at higher levels of proficiency.

19 pages, 5044 KB  
Article
Improving the Performance of Automatic Lip-Reading Using Image Conversion Techniques
by Ki-Seung Lee
Electronics 2024, 13(6), 1032; https://doi.org/10.3390/electronics13061032 - 9 Mar 2024
Cited by 2 | Viewed by 2543
Abstract
Variation in lighting conditions is a major cause of performance degradation in pattern recognition when using optical imaging. In this study, infrared (IR) and depth images were considered as possible robust alternatives against variations in illumination, particularly for improving the performance of automatic lip-reading. The variations due to lighting conditions were quantitatively analyzed for optical, IR, and depth images. Then, deep neural network (DNN)-based lip-reading rules were built for each image modality. Speech recognition techniques based on IR or depth imaging required an additional light source that emitted light in the IR range, along with a special camera. To mitigate this problem, we propose a method that does not use an IR/depth image directly, but instead estimates images based on the optical RGB image. To this end, a modified U-net was adopted to estimate the IR/depth image from an optical RGB image. The results show that the IR and depth images were rarely affected by the lighting conditions. The recognition rates for the optical, IR, and depth images were 48.29%, 95.76%, and 92.34%, respectively, under various lighting conditions. Using the estimated IR and depth images, the recognition rates were 89.35% and 80.42%, respectively. This was significantly higher than for the optical RGB images.
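To illustrate the image-conversion idea (estimating an IR or depth image from an RGB frame), here is a toy encoder-decoder with a single skip connection; it is a stand-in for, not a reproduction of, the modified U-net used in the paper.

```python
# Toy RGB-to-IR estimator sketch (PyTorch); layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyRGB2IR(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.out = nn.Conv2d(32, 1, 3, padding=1)    # 32 = 16 upsampled + 16 skip channels

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(rgb)                          # [B, 16, H, W]
        bottleneck = self.down(e1)                   # [B, 32, H/2, W/2]
        d1 = self.up(bottleneck)                     # [B, 16, H, W]
        return self.out(torch.cat([d1, e1], dim=1))  # single-channel IR/depth estimate

rgb = torch.randn(2, 3, 96, 96)
print(TinyRGB2IR()(rgb).shape)                       # torch.Size([2, 1, 96, 96])
```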

13 pages, 15764 KB  
Article
Lip-Reading Advancements: A 3D Convolutional Neural Network/Long Short-Term Memory Fusion for Precise Word Recognition
by Themis Exarchos, Georgios N. Dimitrakopoulos, Aristidis G. Vrahatis, Georgios Chrysovitsiotis, Zoi Zachou and Efthymios Kyrodimos
BioMedInformatics 2024, 4(1), 410-422; https://doi.org/10.3390/biomedinformatics4010023 - 4 Feb 2024
Cited by 10 | Viewed by 5079
Abstract
Lip reading, the art of deciphering spoken words from the visual cues of lip movements, has garnered significant interest for its potential applications in diverse fields, including assistive technologies, human–computer interaction, and security systems. With the rapid advancements in technology and the increasing emphasis on non-verbal communication methods, the significance of lip reading has expanded beyond its traditional boundaries. These technological advancements have led to the generation of large-scale and complex datasets, necessitating the use of cutting-edge deep learning tools that are adept at handling such intricacies. In this study, we propose an innovative approach combining 3D Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to tackle the challenging task of word recognition from lip movements. Our research leverages a meticulously curated dataset, named MobLip, encompassing various speech patterns, speakers, and environmental conditions. The synergy between the spatial information extracted by 3D CNNs and the temporal dynamics captured by LSTMs yields impressive results, achieving an accuracy rate of up to 87.5%, showcasing robustness to lighting variations and speaker diversity. Comparative experiments demonstrate our model's superiority over existing lip-reading approaches, underlining its potential for real-world deployment. Furthermore, we discuss ethical considerations and propose avenues for future research, such as multimodal integration with audio data and expanded language support. In conclusion, our 3D CNN-LSTM architecture presents a promising solution to the complex problem of word recognition from lip movements, contributing to the advancement of communication technology and opening doors to innovative applications in an increasingly visual world.
(This article belongs to the Special Issue Feature Papers on Methods in Biomedical Informatics)
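A minimal sketch of the 3D CNN + LSTM pattern described above: a small 3D-convolutional front-end produces per-frame features, an LSTM models their temporal dynamics, and a linear head classifies the word. Layer sizes and the vocabulary size are assumptions, not the MobLip configuration.

```python
# 3D CNN + LSTM word classifier sketch (PyTorch); sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Conv3DLSTMClassifier(nn.Module):
    def __init__(self, num_words: int = 500, hidden: int = 128):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # keep time, pool space
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_words)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(clips).squeeze(-1).squeeze(-1)   # [B, 32, T]
        seq, _ = self.lstm(feats.transpose(1, 2))              # [B, T, hidden]
        return self.head(seq[:, -1])                           # logits over the vocabulary

clips = torch.randn(2, 1, 29, 64, 64)
print(Conv3DLSTMClassifier()(clips).shape)                     # torch.Size([2, 500])
```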

30 pages, 4582 KB  
Article
Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features
by Zhongping Dong, Yan Xu, Andrew Abel and Dong Wang
Appl. Sci. 2024, 14(2), 798; https://doi.org/10.3390/app14020798 - 17 Jan 2024
Cited by 1 | Viewed by 3111
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
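As an illustration of Gabor-based visual features, the sketch below filters a mouth crop with a small OpenCV Gabor bank and summarizes each response by its mean and standard deviation; it approximates the general idea only, not the GaborFea2Speech pipeline.

```python
# Gabor filter-bank feature sketch (OpenCV); filter parameters are illustrative assumptions.
import cv2
import numpy as np

def gabor_descriptor(mouth_gray: np.ndarray, orientations: int = 4) -> np.ndarray:
    """mouth_gray: 2-D grayscale image; returns mean/std of each filtered response."""
    feats = []
    for k in range(orientations):
        theta = k * np.pi / orientations
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0.0)
        response = cv2.filter2D(mouth_gray.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])
    return np.asarray(feats, dtype=np.float32)       # 2 * orientations values

mouth = np.random.randint(0, 256, (64, 96), dtype=np.uint8)   # stand-in mouth crop
print(gabor_descriptor(mouth).shape)                           # (8,)
```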

27 pages, 843 KB  
Article
EMOLIPS: Towards Reliable Emotional Speech Lip-Reading
by Dmitry Ryumin, Elena Ryumina and Denis Ivanko
Mathematics 2023, 11(23), 4787; https://doi.org/10.3390/math11234787 - 27 Nov 2023
Cited by 1 | Viewed by 2621
Abstract
In this article, we present a novel approach for emotional speech lip-reading (EMOLIPS). This two-level approach to emotional speech-to-text recognition based on visual data processing is motivated by human perception and recent developments in multimodal deep learning. The proposed approach uses visual speech data to determine the type of speech emotion. The speech data are then processed using one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. We implemented these models as a combination of an EMO-3DCNN-GRU architecture for emotion recognition and a 3DCNN-BiLSTM architecture for automatic lip-reading. We evaluated the models on the CREMA-D and RAVDESS emotional speech corpora. In addition, this article provides a detailed review of recent advances in automated lip-reading and emotion recognition developed over the last five years (2018–2023). In comparison to existing research, we mainly focus on the valuable progress brought by the introduction of deep learning to the field and skip the description of traditional approaches. By considering the emotional features of the pronounced audio-visual speech, the EMOLIPS approach significantly improves state-of-the-art phrase recognition accuracy, reaching 91.9% and 90.9% for RAVDESS and CREMA-D, respectively. Moreover, we present an extensive experimental investigation that demonstrates how different emotions (happiness, anger, disgust, fear, sadness, and neutral), valence (positive, neutral, and negative), and binary (emotional vs. neutral) groupings affect automatic lip-reading.
(This article belongs to the Section E: Applied Mathematics)
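The two-level idea described above can be sketched as "classify the emotion, then route the clip to an emotion-specific lip-reader"; the networks below are placeholders, not the EMO-3DCNN-GRU and 3DCNN-BiLSTM models from the paper.

```python
# Two-level routing sketch (PyTorch); all networks here are placeholder stand-ins.
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "anger", "disgust", "fear", "sadness", "neutral"]

def two_level_lipread(clip: torch.Tensor, emotion_net: nn.Module,
                      lipread_nets: dict) -> torch.Tensor:
    """clip: [1, C, T, H, W]; returns phrase logits from the emotion-matched model."""
    with torch.no_grad():
        emotion = EMOTIONS[int(emotion_net(clip).argmax(dim=-1))]
        return lipread_nets[emotion](clip)

# Placeholder networks, one lip-reader per emotion class.
flat_dim = 1 * 16 * 44 * 44
emotion_net = nn.Sequential(nn.Flatten(), nn.Linear(flat_dim, len(EMOTIONS)))
lipread_nets = {e: nn.Sequential(nn.Flatten(), nn.Linear(flat_dim, 10)) for e in EMOTIONS}
print(two_level_lipread(torch.randn(1, 1, 16, 44, 44), emotion_net, lipread_nets).shape)
# torch.Size([1, 10])
```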

28 pages, 453 KB  
Review
Data-Driven Advancements in Lip Motion Analysis: A Review
by Shad Torrie, Andrew Sumsion, Dah-Jye Lee and Zheng Sun
Electronics 2023, 12(22), 4698; https://doi.org/10.3390/electronics12224698 - 18 Nov 2023
Cited by 1 | Viewed by 3847
Abstract
This work reviews the dataset-driven advancements that have occurred in the area of lip motion analysis, particularly visual lip-reading and visual lip motion authentication, in the deep learning era. We provide an analysis of datasets and their usage, creation, and associated challenges. Future research can utilize this work as a guide for selecting appropriate datasets and as a source of insights for creating new and innovative datasets. Large and varied datasets are vital to a successful deep learning system, and many significant advancements in these fields have been driven by larger datasets. There are indications that even larger, more varied datasets would yield further improvement over existing systems. We highlight the datasets that brought about the progression of lip-reading systems from digit- to word-level lip-reading, and then from word- to sentence-level lip-reading. Through an in-depth analysis of lip-reading system results, we show that datasets with greater diversity improve results substantially. We then discuss the next step for lip-reading systems, moving from sentence- to dialogue-level lip-reading, and emphasize that new datasets are required to make this transition possible. We then explore lip motion authentication datasets. While lip motion authentication has been well researched, the field has not converged on a particular implementation, and there is no benchmark dataset for comparing the various methods. As seen in the lip-reading analysis, large, diverse datasets are required to evaluate the robustness and accuracy of new methods. Such large datasets have driven progress in visual lip-reading; due to the lack of large, diverse, and publicly accessible datasets, visual lip motion authentication research has struggled to validate results and real-world applications. A new benchmark dataset is required to unify the studies in this area so that they can be compared to previous methods and new methods can be validated more effectively.
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)

18 pages, 3281 KB  
Article
The Effect of Cued-Speech (CS) Perception on Auditory Processing in Typically Hearing (TH) Individuals Who Are Either Naïve or Experienced CS Producers
by Cora Jirschik Caron, Coriandre Vilain, Jean-Luc Schwartz, Clémence Bayard, Axelle Calcus, Jacqueline Leybaert and Cécile Colin
Brain Sci. 2023, 13(7), 1036; https://doi.org/10.3390/brainsci13071036 - 7 Jul 2023
Cited by 3 | Viewed by 1806
Abstract
Cued Speech (CS) is a communication system that uses manual gestures to facilitate lipreading. In this study, we investigated how CS information interacts with natural speech using Event-Related Potential (ERP) analyses in French-speaking, typically hearing adults (TH) who were either naïve or experienced CS producers. The audiovisual (AV) presentation of lipreading information elicited an amplitude attenuation of the entire N1 and P2 complex in both groups, accompanied by N1 latency facilitation in the group of CS producers. Adding CS gestures to lipread information increased the magnitude of effects observed at the N1 time window, but did not enhance P2 amplitude attenuation. Interestingly, presenting CS gestures without lipreading information yielded distinct response patterns depending on participants' experience with the system. In the group of CS producers, AV perception of CS gestures facilitated the early stage of speech processing, while in the group of naïve participants, it elicited a latency delay at the P2 time window. These results suggest that, for experienced CS users, the perception of gestures facilitates early stages of speech processing, but when people are not familiar with the system, the perception of gestures impacts the efficiency of phonological decoding.
