1. Introduction
Lip reading recognition is a technology that identifies a speaker’s expressed content based on lip movements. This field integrates multiple research areas, including video processing, natural language processing, audio processing, and pattern recognition, demonstrating broad potential across various applications. Visual-assisted speech recognition leverages noise-resistant visual features, such as lip movements, to enhance speech recognition accuracy in noisy environments [1]. In the healthcare domain, this technology aids individuals with speech impairments by facilitating pronunciation training, supporting speech rehabilitation, and providing personalized articulation correction for patients with Parkinson’s disease, thereby improving their speech clarity and communication abilities [2,3,4]. Moreover, lip movement features have been utilized as a novel biometric modality in security applications, such as liveness detection [5].
Despite significant advancements in lip-reading technology in recent years, several technical challenges remain. The primary challenge arises from dynamic variations in ambient lighting, which interfere with the extraction of visual features; some studies have proposed lip-reading feature extraction methods tailored to varying illumination conditions [6]. Another major difficulty is the visual confusion caused by consonants with highly similar lip shapes and tongue positions during pronunciation, such as /m/, /b/, and /p/, and their syllabic combinations such as pat, mat, and bat [7]. This phenomenon of homophenes significantly increases the complexity of classifier discrimination. Additionally, variations in sequence length further complicate recognition and limit the generalization ability of models, a challenge discussed in prior research [8]. To address these issues, researchers have proposed various models that perform well in specific respects but still exhibit limitations. In the spatial domain, the local receptive field of CNNs constrains contextual modeling, making it difficult to resolve semantic ambiguities among visually similar phonemes [9]. In the temporal domain, fixed-window 3D CNNs struggle to accommodate the elastic durations of phonemes, necessitating self-attention mechanisms to capture long-range dependencies [10]. Feature space reconstruction methods based on autoencoders, such as multi-band feature fusion strategies [11], can enhance single-frame representations by decoupling deep features. From the feature representation perspective, the rigid convolutional kernels of traditional models are ill suited to the non-rigid nature of lip movements, motivating deformable convolutions for multi-scale dynamic perception.
To address the aforementioned challenges, this paper proposes a novel lip-reading data augmentation method, Partition-Time Masking (PTM), to expand and enrich the training data. It also introduces a new lip-reading recognition architecture, ST3D (Swin Transformer and 3D Convolution), which overcomes the limitations of traditional lip-reading models that rely on ResNet for front-end feature extraction by combining Swin Transformer and 3D convolution strategies. Experiments on the LRW and LRW1000 datasets validate both contributions.
Section 2 reviews related work on datasets, data augmentation, and lip-reading models.
Section 3 introduces the proposed data augmentation method (PTM) and lip-reading recognition model (ST3D).
Section 4 presents the experimental setup and results.
Section 5 discusses the performance of the proposed methods.
Section 6 concludes this paper.
2. Related Works
Researchers have constructed various types of lip reading recognition datasets, providing important support for related studies. Early representative datasets include AVLetters, an English lip-reading dataset recorded by 100 participants and covering the pronunciation of the 26 English letters [12]. Subsequently, the AVICAR dataset was proposed, focusing on lip reading recognition in in-vehicle environments; it contains samples of the 10 Arabic numerals recorded by 100 speakers in moving vehicles and addresses variations in lighting and background objects [13]. The GRID dataset focuses on phrase-level lip-reading recognition and contains 34,000 short sentences recorded by 34 speakers, each contributing 1000 samples [14]. The OuluVS dataset includes 53 speakers and covers 10 common greeting phrases; each phrase is repeated 5 times by each speaker, generating a total of 1000 samples, with 5 images from different viewing angles for each sample, thereby increasing the diversity and richness of the samples [15,16]. The LRS2 dataset [17] is a sentence-level video dataset compiled by extracting clips from BBC television programs. Similarly, the LRS3 dataset [18] is a sentence-level dataset created in the same manner, with 150,000 sentences sourced from TED and TEDx talks. The LRW dataset is derived from video clips of over 1000 speakers in television programs and contains 500 common word categories, each with thousands of samples; its training set contains 488,766 samples, and its validation and test sets each contain 25,000 samples. It is one of the largest publicly available English isolated-word lip-reading datasets to date [19]. The CAS-VSR-W1k (LRW-1000) dataset focuses on word-level recognition in unconstrained, in-the-wild settings, with 1000 categories, more than 2000 participants, and 718,018 video samples; the total number of samples exceeds one million Chinese character instances, with each category comprising one or more Chinese characters. This dataset exhibits significant variations in the number of samples per class, video resolution, and lighting conditions, as well as the speakers’ posture, age, gender, and makeup, in order to simulate real-world conditions [20].
Data augmentation methods enhance the performance of network models by increasing the number of samples in the dataset [21]. In lip-reading recognition research, data augmentation methods fall into two categories: those that ignore the temporal dimension of the data and those that account for it. Methods that ignore the temporal dimension process individual frames of the lip reading data without manipulating the temporal axis. The study in [22] proposed data augmentation by introducing model prior knowledge, while ref. [23] enhanced the data by randomly cropping portions of the input. The Mixup method increases the sample size by randomly mixing different training samples [24], and CutMix creates new training samples by overlaying part of one sample onto another [25].
Data augmentation methods that consider the temporal dimension include the Word Boundary method proposed by Stafylakis, which adds indicators containing word boundary information as an additional input to the model; these indicators are concatenated with the encoder’s input to form a new input, which is then processed by the temporal model [26]. Wang et al. [27] proposed the Time Masking data augmentation method based on SpecAugment [28]. Initially applied in automatic speech recognition (ASR), Time Masking has since been widely adopted in time-series research and remains one of the most effective data augmentation techniques to date.
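For concreteness, the snippet below gives a minimal sketch of frame-level Time Masking as it is commonly applied to a lip-reading clip. The tensor layout (T, H, W), the maximum mask length, and the choice of replacing masked frames with the clip's mean frame are illustrative assumptions rather than details taken from the cited implementations.

```python
import torch

def time_masking(clip: torch.Tensor, max_mask_len: int = 6) -> torch.Tensor:
    """Mask a random contiguous span of frames in a (T, H, W) clip.

    Masked frames are replaced with the mean frame of the clip
    (one common choice; an assumption here).
    """
    num_frames = clip.shape[0]
    mask_len = int(torch.randint(0, max_mask_len + 1, (1,)))
    if mask_len == 0 or mask_len >= num_frames:
        return clip
    start = int(torch.randint(0, num_frames - mask_len + 1, (1,)))
    out = clip.clone()
    out[start:start + mask_len] = clip.mean(dim=0)
    return out
```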
Traditional lip-reading recognition research dates back to 1984, when Petajan and his team developed the first lip-reading recognition system, with words as the smallest recognition unit [29]. They subsequently introduced the Nostril Tracking method to optimize lip position detection and localization [30]. Goldschen et al. [31] enhanced the spatiotemporal feature extraction of lip movement by employing hidden Markov models (HMMs) [32]. In 2007, Zhao and colleagues introduced a novel spatiotemporal local binary pattern feature extraction method to address the challenge of recognizing isolated English phrases in lip reading [33]. In 2011, Zhao et al. proposed a mathematical mapping approach that projects images from a high-dimensional space to a low-dimensional manifold, aiming to recognize the speech content itself independently of the individual speaker [34]. In 2013, Pei et al. proposed a new node-splitting criterion through unsupervised random forest manifold alignment, which effectively captured the motion trajectories of key lip points and demonstrated superior performance across various datasets [35]. In traditional lip-reading recognition research, the active appearance model (AAM) is commonly used; its core idea is to manually construct discriminative features that represent lip movements [36].
Deep learning-based lip-reading recognition has since become mainstream. In 2014, Kuniaki Noda and his team at Waseda University [9] adopted a variant of AlexNet [37] to extract lip features with a deep neural network. This study demonstrated the effectiveness of combining deep learning techniques with traditional speech processing methods, providing new perspectives and theoretical support for subsequent lip reading research. In 2017, Assael et al. [38] introduced the connectionist temporal classification (CTC) loss and employed a spatiotemporal convolutional network [39] as the front-end feature extractor, with a bidirectional GRU (Bi-GRU) for temporal classification at the back end. This architecture achieved remarkable results on the GRID dataset, with a recognition accuracy of 95.2%, and provided important insights for subsequent deep learning-based lip reading research. In the same year, Stafylakis et al. [40] combined a TCN [41] with ResNet34 [42] and used a Bi-GRU for temporal modeling. In 2020, Petridis et al. [43], inspired by DenseNet [44], proposed a novel architecture called the densely connected temporal convolutional network (DC-TCN), which reuses features from shallow layers to alleviate the vanishing gradient problem in deep network training. Combined with a ResNet18 feature extractor, it achieved an accuracy of 88.36% on the LRW dataset and 43.65% on the LRW1000 dataset. In the same year, Ma et al. [41] introduced the depthwise separable temporal convolutional network (DS-TCN), which replaces standard convolutions with depthwise separable convolutions and is paired with a ResNet front end for feature extraction. DS-TCN achieved a classification accuracy of 88.5% on the LRW dataset and 46.6% on the LRW1000 dataset. In 2022, Petridis and his team proposed a lip-reading recognition model based on DC-TCN and introduced the Time Masking data augmentation method [27], achieving a classification accuracy of 92.1% on the LRW dataset. That study demonstrated that refining and optimizing individual network components can significantly improve the performance of lip-reading recognition models, providing strong guidance for the future development of lip-reading technology.
The innovative applications of the Transformer in lip reading have primarily focused on spatiotemporal modeling optimization and multimodal feature fusion. In terms of spatiotemporal modeling, Ma et al. [45] were the first to introduce a hierarchical Transformer into sentence-level Chinese lip-reading tasks; their approach employed a pinyin-based phoneme separation strategy to better align with the characteristics of Mandarin pronunciation. Wang et al. [46] proposed the 3D convolutional visual transformer (3DCvT), which innovatively integrates dilated convolutional kernels with a bidirectional GRU, enabling joint modeling of short- and long-term motion features. To address the challenge of cross-speaker generalization, Feng et al. [47] designed LipFormer, which incorporates a dual-stream Transformer with cross-attention between visual and landmark features, effectively decoupling lip deformations from facial muscle movements; this method significantly enhances recognition robustness for unseen speakers. Regarding feature optimization, Koumparoulis et al. [48] identified that traditional 3D convolutional models with max-pooling layers tend to suppress the transmission of micro-expression features; their improved EfficientNetV2-Transformer architecture eliminates redundant downsampling operations, preserving fine-grained visual details for enhanced feature extraction.
4. Experiments and Results
4.1. Data Preprocessing
This study conducted experiments on the LRW and LRW1000 datasets. Prior to experimentation, preprocessing steps were applied to the datasets as follows.
LRW Dataset: First, the RetinaFace tool [24] is used to detect faces in the videos. Then, 25 consecutive frames covering the speaking portion (approximately 1 s) are extracted from each video, and the remaining frames are discarded. If a sample contains fewer than 25 valid frames, zero-valued padding frames are added to keep the number of input frames consistent [25]. Next, each frame is cropped to the lip region to obtain lip images of size 96 × 96 [22]. These images are further randomly cropped to 88 × 88 and randomly flipped horizontally. Finally, the RGB images are converted to grayscale before being fed into the model.
LRW1000 Dataset: The faces in the videos are detected, and all of the frames containing the speaker’s lip movements are extracted (the number of frames is variable). These frames are cropped to 128 × 128 pixel lip images, which are then resized to 88 × 88. The images are randomly horizontally flipped, and the RGB data are then converted to grayscale for model input.
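As an illustration of the LRW preprocessing steps described above, the sketch below pads or truncates a clip to 25 frames, applies one shared random 88 × 88 crop and horizontal flip, and converts the frames to grayscale. The crop position sampling and the assumption that 96 × 96 lip crops have already been extracted with a face detector are our own simplifications, not the exact pipeline.

```python
import torch
import torchvision.transforms.functional as TF

def preprocess_lrw_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 96, 96) RGB lip crops already extracted with a face detector.

    Returns a (25, 1, 88, 88) grayscale clip ready for the model.
    """
    num_frames = frames.shape[0]
    if num_frames < 25:                      # zero-pad short clips to 25 frames
        pad = torch.zeros(25 - num_frames, *frames.shape[1:])
        frames = torch.cat([frames, pad], dim=0)
    else:
        frames = frames[:25]

    # one random 88x88 crop position and one flip decision shared by all frames
    top = int(torch.randint(0, 96 - 88 + 1, (1,)))
    left = int(torch.randint(0, 96 - 88 + 1, (1,)))
    flip = bool(torch.rand(1) < 0.5)

    out = []
    for frame in frames:
        frame = TF.crop(frame, top, left, 88, 88)
        if flip:
            frame = TF.hflip(frame)
        frame = TF.rgb_to_grayscale(frame)
        out.append(frame)
    return torch.stack(out)
```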
Dataset Splitting: The LRW dataset consists of 488,766 training samples, with 25,000 samples for both the validation and test sets. The LRW1000 dataset contains 629,366 training samples, 63,381 validation samples, and 52,441 test samples.
4.2. Experimental Results of Data Augmentation Methods
To validate the performance enhancement of the Partition-Time Masking (PTM) method on lip reading recognition models, this study selected three models commonly used in lip reading research: DC-TCN, MS-TCN, and Bi-GRU. All experiments were conducted under the same software and hardware environment: an RTX 3090 GPU, the Windows 10 operating system, CUDA 11.1, TorchVision 0.9.0, and PyTorch 1.8.0. The initial learning rate was 0.0003, and the number of training epochs was 80. The DC-TCN and MS-TCN networks were trained on the LRW dataset, while the Bi-GRU network was trained on the LRW1000 dataset. Additionally, a control group using the Word Boundary method was included. The experimental results with different data augmentation methods are shown in Table 1.
An analysis of the experimental results in Table 1 revealed that, when the data are divided into two sub-sequences, the performance of the three network models (DC-TCN, MS-TCN, and Bi-GRU) improved in all experimental groups except for the DC-TCN with Word Boundary group, and the improvement in these groups surpassed that of the baseline group with Time Masking augmentation.
We analyzed the accuracy decline observed in the DC-TCN with Word Boundary experiment and found that, during the early stages of training, the accuracy and loss of this group still fluctuated, whereas the other models had stabilized in the last few training rounds. This suggests that the model might not have been fully trained due to an insufficient number of training epochs. To verify this hypothesis, we retrained the DC-TCN model with the Word Boundary and RD strategies, as well as the DC-TCN model with Word Boundary and Time Masking, using different numbers of training epochs. The experimental results are shown in Table 2, where it is evident that the optimal number of training epochs was 90.
Further experiments were conducted to determine how the number of subsequences and the mask selection in the Partition-Time Masking method affect the final network performance and to identify the optimal strategy combination. The DC-TCN network was trained on the LRW dataset with different augmentation strategies. The experimental accuracy was 90.03% for RD, 89.82% for RT, 89.61% for RF, 90.08% for PD, and 89.94% for PT. The results indicate that network performance improves as the number of subsequences increases from 2 to 3; however, increasing the number of subsequences to 5 can lead to overfitting and a drop in performance. Compared with using the average frame of the original input as the mask value, selecting the average frame of each subsequence as the mask value enhances the model’s performance more effectively. The best strategy is therefore PD, in which the number of subsequences is three and the mask value for each subsequence is the average frame of that subsequence. In summary, the Partition-Time Masking method improves network performance more than the Time Masking method; a minimal sketch of the procedure is given below.
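The sketch below illustrates the Partition-Time Masking idea under the best-performing setting described above: the clip is split into three contiguous subsequences, and Time Masking is applied inside each one, with the mask value taken as the mean frame of the corresponding subsequence. The function name, the per-partition mask-length budget, and the handling of very short partitions are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def partition_time_masking(clip: torch.Tensor,
                           num_parts: int = 3,
                           max_mask_len: int = 4) -> torch.Tensor:
    """Apply Time Masking independently inside each contiguous subsequence
    of a (T, H, W) clip; masked spans are filled with the mean frame of
    their own subsequence."""
    out = clip.clone()
    for idx in torch.chunk(torch.arange(clip.shape[0]), num_parts):
        if idx.numel() < 2:            # skip partitions too short to mask
            continue
        start_t, end_t = int(idx[0]), int(idx[-1]) + 1
        seg_len = end_t - start_t
        mask_len = int(torch.randint(0, min(max_mask_len, seg_len - 1) + 1, (1,)))
        if mask_len == 0:
            continue
        s = start_t + int(torch.randint(0, seg_len - mask_len + 1, (1,)))
        out[s:s + mask_len] = clip[start_t:end_t].mean(dim=0)
    return out
```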
4.3. Experimental Results of the Lip Reading Recognition Models
The experiment was conducted using the LRW and LRW1000 datasets. The primary goal of this experiment was to evaluate the accuracy of the models in the lip reading recognition task; thus, the highest output accuracy of each model was selected as the key performance metric. The accuracy was calculated as the ratio of correctly predicted samples to the total number of samples, providing a direct quantitative indicator of model performance. The experimental configuration was as follows: an Intel(R) Core(TM) i9-10900K CPU, an NVIDIA GeForce RTX 3090 GPU, and 32 GB of RAM. The software environment included Python 3.8 and PyTorch 1.10.1. The parameter settings during model training were as follows: the learning rate was 0.0004, the dropout rate was 0.3, the number of training epochs was 100, the optimizer was AdamW, and the batch size was 32.
To verify the generalization ability and scalability of the lip reading model ST3D, this study designed three variants of ST3D with different parameter configurations: ST3D-I, ST3D-II, and ST3D-III. The main differences between these variants lie in the number (N) of Swin Transformer blocks and the number (H) of attention heads in each stage. In ST3D-I, N is set to 1, 1, 3, 1 and H is set to 1, 3, 6, 9. In ST3D-II, N is set to 2, 2, 6, 2 and H is set to 3, 3, 6, 12. In ST3D-III, N is set to 2, 2, 9, 2 and H is set to 3, 9, 12, 18. These configurations were used to explore the scalability of the ST3D architecture. These models were experimentally tested on the LRW and LRW-1000 lip reading datasets with the other parameters held constant. On the LRW dataset, the accuracy of ST3D-I was 90.5%, ST3D-II was 91%, and ST3D-III was 91.8%. On the LRW-1000 dataset, ST3D-I achieved an accuracy of 55.2%, ST3D-II 56.1%, and ST3D-III 55.7%. The experimental results show that ST3D-III achieved the best recognition accuracy on the LRW dataset, while ST3D-II performed best on the LRW-1000 dataset. Therefore, different parameter configurations can be selected depending on the specific task requirements.
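For reference, the per-stage settings of the three variants can be summarized as follows; the dictionary layout and the parameter names (depths, num_heads) are our own notation for the N and H values listed above, not the authors' configuration files.

```python
# Per-stage Swin Transformer settings for the three ST3D variants:
# "depths" = number of Swin blocks (N) and "num_heads" = attention heads (H)
# in each of the four stages.
ST3D_VARIANTS = {
    "ST3D-I":   {"depths": (1, 1, 3, 1), "num_heads": (1, 3, 6, 9)},
    "ST3D-II":  {"depths": (2, 2, 6, 2), "num_heads": (3, 3, 6, 12)},
    "ST3D-III": {"depths": (2, 2, 9, 2), "num_heads": (3, 9, 12, 18)},
}
```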
This study also explored the model’s optimization strategies and validated their impact on lip reading performance through ablation experiments. The optimization strategies include the Word Boundary method, which adds an additional input dimension to the temporal model, the Mixup, CutMix, Time Masking, and Partition-Time Masking methods applied during data preprocessing, and label smoothing. The experimental results are shown in Table 3. As these strategies were applied cumulatively, the model’s accuracy gradually improved on both the LRW and LRW1000 datasets. The inclusion of Mixup and CutMix led to a notable accuracy boost, with improvements of 0.7% and 0.3% on LRW, respectively. The label smoothing mechanism further optimized the model’s performance, and the introduction of word boundary information significantly improved recognition accuracy. Finally, the Partition-Time Masking strategy yielded a more pronounced improvement than Time Masking, adding up to 0.4%.
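As one concrete example of the clip-level augmentations used in this ablation, the snippet below sketches Mixup on a batch of clips with one-hot labels; the Beta parameter (alpha = 0.2) and the tensor shapes are assumed values for illustration, not settings reported in the paper.

```python
import torch

def mixup_batch(clips: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """clips: (B, T, 1, 88, 88), labels: (B, num_classes) one-hot.

    Returns convex combinations of randomly paired samples, as in the
    Mixup augmentation listed in the ablation above.
    """
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(clips.shape[0])
    mixed_clips = lam * clips + (1.0 - lam) * clips[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_clips, mixed_labels
```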
5. Discussion
This paper first proposes a data augmentation method, PTM, and validates its effect on three models (DC-TCN, MS-TCN, and Bi-GRU) using the LRW and LRW1000 datasets. During the experiments, we added Word Boundary and Time Masking as control groups. Based on the experimental results, we found that PTM performed better in most experimental groups and outperformed Time Masking. The highest accuracy achieved on the LRW dataset was 91.84%, while on the LRW1000 dataset the highest accuracy was 55.7%.
However, in one experimental group of the DC-TCN model, the accuracy obtained with the proposed PTM method was noticeably lower than that obtained with Time Masking. We conducted an in-depth analysis of the training process of the DC-TCN network, observing the changes in loss and accuracy in each training round, and concluded that this may be due to an insufficient number of training epochs. To verify this hypothesis and improve model performance, we increased the number of training epochs, setting them to 90 and 100 as control groups, values commonly used in lip reading research, and retrained the DC-TCN network with the Word Boundary and RD strategies. The experimental results indicated that the proposed data augmentation method achieved its highest accuracy, 92.15%, at 90 epochs. We also explored the impact of different strategies on the performance of the data augmentation method; the results showed that the best strategy was PD, in which the number of subsequences is 3 and the mask value for each subsequence is the average frame of the respective subsequence.
This paper also proposes a new lip-reading recognition model, ST3D. We compared ST3D with several lip-reading recognition models commonly used in the literature; the experimental results are shown in Table 4. As seen from the table, compared with current lip-reading networks that use ResNet18 as the front-end feature extractor, the Swin Transformer structure applied in this study yielded better results. Integrating the Swin Transformer architecture into lip-reading models enhances their ability to process information, particularly by addressing the limitation of traditional CNNs, which focus primarily on local features. The Swin Transformer’s flexibility allows it to adaptively adjust the receptive field size, effectively capturing information ranging from fine-grained details to global context. This enables the model to recognize subtle lip movements and understand their significance within facial expressions and contextual information. Moreover, the self-attention mechanism embedded in the Swin Transformer is particularly advantageous for lip reading: it prioritizes key regions crucial for recognition, thereby improving the extraction and emphasis of important features. This is especially important for handling the non-rigid nature of lip movements, which requires the model to detect fine details and interpret dynamic changes in motion. The model achieved a recognition accuracy of 91.8% on the LRW dataset, comparable to the best current results, and 56.1% on the LRW-1000 dataset, surpassing the best existing results. These results validate the effectiveness of ST3D, demonstrating that the model performs well on both major datasets and offers clear advantages in feature extraction.
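To make this front-end design concrete, the sketch below shows one simple way to combine a 3D convolutional stem with a per-frame Swin Transformer backbone: the stem captures short-range motion, and each resulting frame is then encoded by the Swin backbone before being passed to the temporal model. The channel sizes, the use of torchvision's swin_t, and the pooling behavior are our own assumptions for illustration; this is not the authors' exact ST3D architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t

class ConvSwinFrontEnd(nn.Module):
    """3D-conv stem for short-range motion + Swin Transformer per frame."""

    def __init__(self):
        super().__init__()
        # Spatiotemporal stem: grayscale input -> 3 channels so that a
        # standard Swin backbone can consume each frame.
        self.stem = nn.Sequential(
            nn.Conv3d(1, 3, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.BatchNorm3d(3),
            nn.ReLU(inplace=True),
        )
        backbone = swin_t(weights=None)
        backbone.head = nn.Identity()      # keep 768-d per-frame features
        self.swin = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, T, H, W) grayscale clip
        x = self.stem(x)                   # (B, 3, T, H, W)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.swin(x)               # (B*T, 768)
        return feats.reshape(b, t, -1)     # per-frame features for the temporal model
```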
To investigate the advantages of the ST3D model in addressing visual ambiguity and the correlation between the lips and other facial features, we selected the most easily confused samples (those with low recognition accuracy) from the LRW and LRW1000 datasets, chosen based on a comprehensive analysis of the results from various models. The changes in the confusion rates of the 10 most easily confused samples from the LRW and LRW1000 datasets are listed in Table 5 and Table 6, respectively. As shown, these easily confused samples are very similar in both spelling and pronunciation. The confusion rate is the proportion of samples in a category incorrectly identified as other categories out of the total number of samples in that category. From Table 5, we can see that, in the LRW dataset, only 2 of the 10 easily confused samples had higher confusion rates under ST3D than under DC-TCN, 1 sample had the same confusion rate, and the remaining 7 samples had lower confusion rates. Table 6 shows that, in the LRW1000 dataset, 5 of the 10 easily confused samples had higher confusion rates under ST3D than under DC-TCN, while the remaining 5 had lower confusion rates. Overall, ST3D handled visual ambiguities more effectively on the LRW dataset, but did not show an improvement in this respect on the LRW1000 dataset.
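The confusion rate reported in Tables 5 and 6 can be computed directly from a per-class confusion matrix, as sketched below; the matrix layout (rows = true class, columns = predicted class) is an assumption about bookkeeping, not something specified in the paper.

```python
import numpy as np

def confusion_rate(conf_mat: np.ndarray, class_idx: int) -> float:
    """Fraction of samples of `class_idx` predicted as any other class.

    `conf_mat[i, j]` counts samples whose true class is i and whose
    predicted class is j.
    """
    row = conf_mat[class_idx]
    total = row.sum()
    return float((total - row[class_idx]) / total) if total > 0 else 0.0
```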
We further categorized the LRW1000 dataset into three difficulty levels based on the length of the input sequences (i.e., the number of video frames). Specifically, sequences with lengths from 15 to 30 frames were classified as simple, those with 5 to 15 frames as moderate, and those with fewer than 5 frames as difficult. We compared the performance of the ST3D and MS-TCN models across samples of different difficulty levels, with the results presented in Figure 3. The experimental results indicate that the ST3D model consistently outperformed the MS-TCN model across all difficulty levels, and both models exhibited performance improvements as sequence length increased. Further analysis revealed a positive correlation between recognition accuracy and the amount of information contained within a sequence: shorter sequences may lack sufficient information to support accurate recognition and are more susceptible to external noise. Moreover, the training dataset has an imbalanced distribution of sequence lengths, with significantly more long-sequence samples than short-sequence ones, which leads models to focus more on long-sequence samples during training. Additionally, compared with the MS-TCN model, the ST3D model demonstrated superior temporal modeling and feature extraction capabilities, enabling it to perform better on samples of the same difficulty level.
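The difficulty split described above can be reproduced with a simple rule on the number of valid frames per sample, as sketched below; the treatment of the boundary values (15 counted as simple, 5 as moderate) is our own assumption, since the text only gives the ranges.

```python
def difficulty_level(num_frames: int) -> str:
    """Bucket an LRW1000 sample by its number of valid frames."""
    if num_frames >= 15:
        return "simple"      # 15-30 frames
    if num_frames >= 5:
        return "moderate"    # 5-15 frames
    return "difficult"       # fewer than 5 frames
```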
While our approach significantly improves visual speech recognition, some limitations remain. Dataset Scope: This study focused on word-level recognition using LRW and LRW1000, which serve as challenging benchmarks. However, they do not encompass sentence-level lip-reading, which is crucial for real-world applications. Future work should extend our method to datasets like LRS2 and LRS3 to evaluate its generalizability in sentence-level tasks. Input Scale Analysis: We analyzed the impact of sequence length on recognition accuracy but did not explicitly investigate the variations in input scales (e.g., different spatial resolutions of lip regions). Examining their influence on feature extraction and model performance could provide valuable insights for optimization. Computational Efficiency: The integration of Swin Transformer and 3D convolution strengthens spatial–temporal feature learning but increases computational cost. Future research should explore lightweight architectures or knowledge distillation to improve efficiency without compromising accuracy. Addressing these limitations will further enhance the robustness and applicability of our approach.
6. Conclusions
This paper proposes a data augmentation method, PTM, which first partitions the input video data into multiple subsequences and then applies Time Masking to each subsequence individually. Using the DC-TCN model for training and validation on the LRW dataset, the highest accuracy reached 92.15%. In addition, we propose a lip-reading model, ST3D, which introduces the Swin Transformer structure into the lip-reading pipeline, enhancing the model’s information processing capability. This addresses the limitation of traditional CNNs, which focus on local information, and helps capture and analyze the complex dynamic features in lip-reading recognition, thereby improving overall recognition efficiency and accuracy. The highest accuracy reached 91.8% on the LRW dataset and 56.1% on the LRW1000 dataset. However, there is still room for improvement, and the recognition accuracy can be further enhanced. In future work, we will improve our lip-reading recognition algorithm to boost the recognition rate.
The data augmentation method and lip-reading model we propose have broad application potential. They can enhance the robustness of lip-reading recognition in complex scenarios, for example by assisting individuals with hearing impairments in noisy environments or multi-party conversations, or by enabling silent command recognition in public security settings. These advances can extend the boundaries of human–computer interaction applications. However, the research still has limitations. For instance, the model relies on the detection of high-quality lip regions, and if there are occlusions or extreme lighting changes in the video, the recognition performance may degrade significantly. Although the hybrid architecture of the Swin Transformer and 3D convolution enhances spatiotemporal feature fusion, its large number of parameters makes it unsuitable for real-time applications. Future research will focus on making the model lighter and improving its real-time performance, for example by adopting parameter compression techniques, compact 3D convolutions, and more efficient hybrid architectures, and by exploring deep multimodal fusion, such as combining residual audio signals or facial expression features, to improve recognition accuracy. Additional efforts will be made to enhance the model’s robustness and generalization ability, for example by developing adversarial training strategies to address interference from lighting and occlusion. Finally, practical application scenarios will be explored to validate the model’s usability in medical assistance and security monitoring while optimizing real-time interaction experiences.