Improved Localization and Recognition of Handwritten Digits on MNIST Dataset with ConvGRU
Abstract
1. Introduction
- Feature extraction and decoupling model: By incorporating an improved Generative Adversarial Network (GAN), a novel decoupling model is constructed. This model consists of two feature extractors that separate the input information into background and foreground features. This approach enables the prediction process to focus on the motion within the sequence, optimizing parameter selection for the digit position prediction network, reducing training time, and improving efficiency.
- Temporal feature learning: To address the insufficient learning of temporal features in digit position prediction, this paper integrates GRU into the position prediction network. GRU automatically captures long-term temporal dependencies between data points and, compared to Long Short-Term Memory (LSTM) networks, significantly reduces the number of parameters. This allows the network to better utilize learned temporal features, reducing the blurriness of predicted frames and accelerating the prediction of future video segment positions.
- Model stability in position prediction: To further enhance the stability and practicality of video position prediction, this paper changes the GRU's gating mechanism from linear operations to convolutional ones. The resulting Convolutional Gated Recurrent Network (ConvGRU) learns long-term spatiotemporal dependencies, ensuring the continuous transfer of feature information and reducing the loss of spatiotemporal features during prediction. This preserves the consistency of spatiotemporal information throughout the sequence, improving the model's prediction accuracy and training speed (a minimal ConvGRU cell sketch is given after this list).
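To make the convolutional gating concrete, the following is a minimal PyTorch sketch of a single ConvGRU cell, in which the update gate, reset gate, and candidate state are all computed with 2D convolutions rather than linear layers. Channel counts, kernel size, and input shapes are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A single ConvGRU cell: GRU gates realized with 2D convolutions."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Update (z) and reset (r) gates computed jointly from [x, h].
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size, padding=padding)
        # Candidate hidden state computed from [x, r * h].
        self.candidate = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state

# Example: roll the cell over a 10-frame sequence of 64x64 single-channel frames.
cell = ConvGRUCell(in_channels=1, hidden_channels=64)
h = torch.zeros(8, 64, 64, 64)                   # (batch, hidden, H, W)
for frame in torch.randn(10, 8, 1, 64, 64):      # (time, batch, C, H, W)
    h = cell(frame, h)
```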
2. Related Work
3. Methodology
3.1. Video Frame Feature Extraction
- Replace the pooling layers in the generator with transposed strided convolutions, and the pooling layers in the discriminator with strided convolutions for spatial downsampling. Their output sizes are computed as in Equations (2) and (3). Let $I$ denote the size of the input image, $K_t$ the kernel size of the transposed convolution, and $K_c$ the kernel size of the standard convolution; $P$ denotes the padding and $b$ the stride for both the convolution and transposed convolution operations:

  $O_t = b\,(I - 1) + K_t - 2P \qquad (2)$

  $O_c = \left\lfloor \dfrac{I - K_c + 2P}{b} \right\rfloor + 1 \qquad (3)$
- Replacing fully connected layers with average pooling layers improves model stability, since fully connected layers contain many parameters and are prone to overfitting. To speed up convergence, we connect the generator's input directly to the features of its convolutional layers and link the discriminator's output to the feature maps of its convolutional layers.
- Applying normalization throughout the network mitigates the excessive overall bias that can accumulate when many layers are stacked.
- In the generator, Leaky ReLU is used for the intermediate hidden layers and the sigmoid activation for the output layer; in the discriminator, the Maxout activation is used for all layers. A minimal sketch of these architectural choices follows this list.
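The sketch below illustrates these design choices in PyTorch: transposed strided convolutions in the generator, strided convolutions and global average pooling (instead of fully connected layers) in the discriminator, normalization throughout, Leaky ReLU plus a sigmoid output in the generator, and Maxout activations in the discriminator. Layer depths, channel counts, and the 16×16 image size are illustrative assumptions; the direct input/output feature connections described above are omitted for brevity, and Maxout is implemented as a max over the pieces of a widened convolution, since PyTorch has no built-in Maxout layer.

```python
import torch
import torch.nn as nn

class MaxoutConv2d(nn.Module):
    """Maxout activation: a convolution producing `pieces` candidate feature maps
    per output channel, followed by an element-wise maximum over the pieces."""
    def __init__(self, in_ch, out_ch, pieces=2, **conv_kwargs):
        super().__init__()
        self.pieces, self.out_ch = pieces, out_ch
        self.conv = nn.Conv2d(in_ch, out_ch * pieces, **conv_kwargs)

    def forward(self, x):
        y = self.conv(x)
        b, _, h, w = y.shape
        return y.view(b, self.pieces, self.out_ch, h, w).max(dim=1).values

# Generator: transposed strided convolutions upsample a latent map to an image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
    nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),     # 8x8 -> 16x16
    nn.Sigmoid(),
)

# Discriminator: strided convolutions downsample; Maxout activations throughout,
# and global average pooling replaces the final fully connected layer.
discriminator = nn.Sequential(
    MaxoutConv2d(1, 64, kernel_size=4, stride=2, padding=1),    # 16x16 -> 8x8
    MaxoutConv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 8x8 -> 4x4
    nn.AdaptiveAvgPool2d(1),                                    # average pooling instead of FC
    nn.Conv2d(128, 1, kernel_size=1),
    nn.Flatten(), nn.Sigmoid(),
)
```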
3.2. Image Decoupling Model
3.3. Video Prediction Model
- Data preparation: Before building the video prediction framework, we collect a video dataset. We use the MNIST dataset and apply simple preprocessing to adapt it for video prediction.
- Training the decoupling model: We use the training dataset to train a decoupling model that extracts background features and motion features from the sequences. We utilize the two abstract matrices extracted by the decoupling model as inputs for the prediction model.
- Preparing inputs for the prediction model: In a loop, the model concatenates the background feature matrix of the first frame with the motion feature matrix of the i-th frame along the first dimension. The concatenated matrix serves as the input to the ConvGRU-based prediction model.
- Training the prediction model: The prediction model uses the input features to learn the temporal dependencies between sequences and outputs the motion information of subsequent sequences.
- Generating future sequence frames: The model uses a decoder constructed with a deep transposed convolutional network to fuse the predicted motion features with the background features output by the decoupling model. It generates abstract matrices for future sequence frames.
- Visualizing results: Finally, the model visualizes the abstract matrices output by the fusion module to obtain visually perceivable future video sequences. A minimal sketch of this pipeline follows the list.
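A minimal sketch of the pipeline above, assuming hypothetical `decoupler`, `predictor`, and `decoder` callables; their names, interfaces, and tensor shapes are illustrative and not the exact implementation of this paper.

```python
import torch

def predict_future_frames(frames, decoupler, predictor, decoder, horizon=10):
    """frames: (T, B, C, H, W) observed sequence.
    decoupler -> (background, motion) features from the decoupling model
    predictor -> ConvGRU-based model that extrapolates motion features
    decoder   -> transposed-convolutional decoder fusing background and motion."""
    background, motion = decoupler(frames)        # (B, Cb, h, w) and (T, B, Cm, h, w)
    # Pair the first frame's background with each frame's motion features
    # (concatenated along the channel dimension in this sketch).
    inputs = torch.cat(
        [background.unsqueeze(0).expand(motion.size(0), -1, -1, -1, -1), motion], dim=2)
    future_motion = predictor(inputs, horizon)    # (horizon, B, Cm, h, w)
    # Fuse predicted motion with the static background and decode to pixel space.
    fused = torch.cat(
        [background.unsqueeze(0).expand(horizon, -1, -1, -1, -1), future_motion], dim=2)
    return decoder(fused)                         # (horizon, B, C, H, W) future frames
```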
3.4. Proposed Feature Extraction and Attention Mechanism
Algorithm 1 Feature extraction and attention mechanism
Input: batch of video frames
Output: processed features and attention maps
1: // Temporal feature extraction
2: Apply temporal convolution layers, batch normalization, and ReLU activation; store the result as the temporal feature map.
3: // Spatial feature extraction
4: Apply spatial convolution layers, batch normalization, and ReLU activation; store the result as the spatial feature map.
5: // Attention computation
6: Compute the temporal attention map from the temporal features.
7: Compute the spatial attention map from the spatial features.
8: // Feature fusion
9: Concatenate the temporal and spatial feature maps.
10: Apply channel-wise attention, then spatial attention.
11: Process the result with a convolution block and output the final fused features.
12: Return: the fused features and the attention maps.
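A compact PyTorch sketch of Algorithm 1 under stated assumptions: the temporal branch uses a 3D convolution over time, the spatial branch a 2D convolution on the latest frame, and the attention blocks are simple sigmoid-gated convolutions, since the exact attention formulas were not preserved in the text above. All channel counts are illustrative.

```python
import torch
import torch.nn as nn

class FeatureExtractionAttention(nn.Module):
    """Temporal + spatial feature extraction with channel-wise and spatial attention,
    followed by fusion (a sketch of Algorithm 1, not the exact published architecture)."""
    def __init__(self, in_ch=1, feat_ch=64):
        super().__init__()
        # Temporal branch: 3D convolution over (time, H, W).
        self.temporal = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_ch), nn.ReLU())
        # Spatial branch: 2D convolution applied to a single frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_ch), nn.ReLU())
        # Channel-wise attention (squeeze-and-excitation style) on the fused features.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * feat_ch, 2 * feat_ch, 1), nn.Sigmoid())
        # Spatial attention: a single-channel map over spatial locations.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_ch), nn.ReLU())

    def forward(self, clip):                      # clip: (B, C, T, H, W)
        f_t = self.temporal(clip).mean(dim=2)     # aggregate over time -> (B, F, H, W)
        f_s = self.spatial(clip[:, :, -1])        # spatial features of the latest frame
        fused = torch.cat([f_t, f_s], dim=1)
        a_c = self.channel_att(fused)             # channel attention map
        a_s = self.spatial_att(fused)             # spatial attention map
        fused = fused * a_c * a_s
        return self.fuse(fused), (a_c, a_s)
```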
Algorithm 2 ConvGRU-based training process
Require: video sequence, learning rate, maximum number of epochs
Ensure: trained model parameters, predictions
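A minimal training-loop sketch corresponding to Algorithm 2. The MSE reconstruction loss, the data-loader interface, and the device handling are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

def train(model, loader, lr=1e-3, max_epochs=100, device="cuda"):
    """loader yields (input_frames, target_frames) pairs of shape (B, T, C, H, W)."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                     # reconstruction loss (assumed)
    for epoch in range(max_epochs):
        running = 0.0
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs)                # predicted future frames
            loss = criterion(preds, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: loss = {running / len(loader):.4f}")
    return model
```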
4. Experiments
4.1. MNIST Dataset
4.2. Implementation Details
- Training configuration:
  - Optimizer: Adam.
  - Batch size: 32 samples per GPU.
  - Training epochs: 100.
  - Learning rate schedule: cosine annealing with warm restarts.
- Model architecture:
  - ConvGRU hidden dimensions: [64, 128, 256].
  - Attention heads: 8.
  - Feature fusion channels: 512.
  - Total parameters: 2.8 M.
- Data augmentation:
  - Random rotation.
  - Random scaling: [0.9, 1.1].
  - Random translation.
  - Gaussian noise.
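A sketch of how these settings could be wired up in PyTorch. The Adam beta values, rotation and translation ranges, and noise standard deviation were not preserved in the text above, so the values shown are placeholder assumptions; the scaling range [0.9, 1.1], batch size, epoch count, and scheduler choice follow the list.

```python
import torch
from torch import optim
from torchvision import transforms

def build_training_setup(model, steps_per_restart=10):
    # Adam optimizer; beta values are PyTorch defaults (assumed, not reported above).
    optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    # Cosine annealing with warm restarts, as listed in the training configuration.
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=steps_per_restart)
    return optimizer, scheduler

# Data augmentation pipeline; rotation/translation ranges and noise sigma are placeholders.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=10,                  # random rotation (range assumed)
                            translate=(0.1, 0.1),        # random translation (range assumed)
                            scale=(0.9, 1.1)),           # random scaling, as listed
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # Gaussian noise (sigma assumed)
])
```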
4.3. Experimental Results
4.4. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Khorsheed, E.A.; Al-Sulaifanie, A.K. Handwritten Digit Classification Using Deep Learning Convolutional Neural Network. J. Soft Comput. Data Min. 2024, 5, 79–90.
- Korovai, K.; Zhelezniakov, D.; Yakovchuk, O.; Radyvonenko, O.; Sakhnenko, N.; Deriuga, I. Handwriting Enhancement: Recognition-Based and Recognition-Independent Approaches for On-device Online Handwritten Text Alignment. IEEE Access 2024, 12, 99334–99348.
- Jagtap, J. Review of handwritten document recognition strategies: Patent perspective. Collnet J. Sci. Inf. Manag. 2023, 17, 323–355.
- Daniel, R.; Prasad, B.; Pasam, P.K.; Sudarsa, D.; Sudhakar, A.; Rajanna, B.V. Handwritten digit recognition using quantum convolution neural network. Int. J. Artif. Intell. 2024, 13, 533–541.
- Absur, M.N.; Nasif, K.F.A.; Saha, S.; Nova, S.N. Revolutionizing Image Recognition: Next-Generation CNN Architectures for Handwritten Digits and Objects. In Proceedings of the 2024 IEEE Symposium on Wireless Technology & Applications (ISWTA), Kuala Lumpur, Malaysia, 20–21 July 2024; pp. 173–178.
- Wang, S.T.; Li, I.H.; Wang, W.Y. Implementation of Handwritten Character Recognition and Writing in Pyramidal Manipulator Using CNN. Int. J. iRobotics 2023, 6, 12–16.
- Jabde, M.K.; Patil, C.H.; Vibhute, A.D.; Mali, S. A Comprehensive Literature Review on Air-written Online Handwritten Recognition. Int. J. Comput. Digit. Syst. 2024, 15, 307–322.
- Rakshit, P.; Mukherjee, H.; Halder, C.; Obaidullah, S.M.; Roy, K. Historical digit recognition using CNN: A study with English handwritten digits. Sādhanā 2024, 49, 39.
- Suresh Kumar, K.; Divya Bharathi, K. Integrating Handwritten Digit Recognition with Learning Management Systems for Evaluated Answer Scripts. In Proceedings of the International Conference on Emerging Trends in Expert Applications & Security; Springer: Berlin/Heidelberg, Germany, 2024; pp. 179–189.
- Kumari, R.; Srivastava, N. Variations of Left and Right Hand Writers in Forging Signatures Written in Nastaleeq Script; Punjab Academy of Forensic Medicine & Toxicology: Faridkot, India, 2022.
- Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. SimVP: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3170–3180.
- Hu, X.; Huang, Z.; Huang, A.; Xu, J.; Zhou, S. A dynamic multi-scale voxel flow network for video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6121–6131.
- Kern, F.; Tschanter, J.; Latoschik, M.E. Handwriting for Text Input and the Impact of XR Displays, Surface Alignments, and Sentence Complexities. IEEE Trans. Vis. Comput. Graph. 2024, 30, 2357–2367.
- Wang, S.; Sheng, H.; Zhang, Y.; Yang, D.; Shen, J.; Chen, R. Blockchain-empowered distributed multicamera multitarget tracking in edge computing. IEEE Trans. Ind. Inform. 2023, 20, 369–379.
- Wu, Y.; Sheng, H.; Zhang, Y.; Wang, S.; Xiong, Z.; Ke, W. Hybrid motion model for multiple object tracking in mobile devices. IEEE Internet Things J. 2022, 10, 4735–4748.
- Sheng, H.; Cong, R.; Yang, D.; Chen, R.; Wang, S.; Cui, Z. UrbanLF: A comprehensive light field dataset for semantic segmentation of urban scenes. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7880–7893.
- Wang, T.; Sheng, H.; Chen, R.; Yang, D.; Cui, Z.; Wang, S.; Cong, R.; Zhao, M. Light field depth estimation: A comprehensive survey from principles to future. High-Confid. Comput. 2024, 4, 100187.
- Cong, R.; Sheng, H.; Yang, D.; Cui, Z.; Chen, R. Exploiting spatial and angular correlations with deep efficient transformers for light field image super-resolution. IEEE Trans. Multimed. 2023, 26, 1421–1435.
- Sheng, H.; Wang, S.; Yang, D.; Cong, R.; Cui, Z.; Chen, R. Cross-view recurrence-based self-supervised super-resolution of light field. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7252–7266.
- Gupta, H.; Kaur, A.; Kavita; Verma, S.; Rawat, P. Recognition of Handwritten Digits Using Convolutional Neural Network in Python and Comparison of Performance for Various Hidden Layers. In Proceedings of the International Conference on Innovative Computing and Communication; Springer: Berlin/Heidelberg, Germany, 2023; pp. 727–739.
- Wu, B.; Nair, S.; Martin-Martin, R.; Fei-Fei, L.; Finn, C. Greedy hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2318–2328.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Kumar, M.; Babaeizadeh, M.; Erhan, D.; Finn, C.; Levine, S.; Dinh, L.; Kingma, D. VideoFlow: A flow-based generative model for video. arXiv 2019, arXiv:1903.01434.
- Fateh, A.; Birgani, R.T.; Fateh, M.; Abolghasemi, V. Advancing Multilingual Handwritten Numeral Recognition With Attention-Driven Transfer Learning. IEEE Access 2024, 12, 41381–41395.
- Ge, L.; Liao, W.; Wang, S.; Bak-Jensen, B.; Pillai, J.R. Modeling daily load profiles of distribution network for scenario generation using flow-based generative network. IEEE Access 2020, 8, 77587–77597.
- Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401.
- Barve, Y.; Saini, J.R.; Pal, K.; Kotecha, K. A novel evolving sentimental bag-of-words approach for feature extraction to detect misinformation. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 266–275.
- Torralba, E.M. Fibonacci Numbers as Hyperparameters for Image Dimension of a Convolutional Neural Network Image Prognosis Classification Model of COVID X-ray Images. Int. J. Multidiscip. Appl. Bus. Educ. Res. 2022, 3, 1703–1716.
- Cevikalp, H.; Chome, E. Robust and compact maximum margin clustering for high-dimensional data. Neural Comput. Appl. 2024, 36, 5981–6003.
- Pintea, S.L.; Sharma, S.; Vossepoel, F.C.; van Gemert, J.C.; Loog, M.; Verschuur, D.J. Seismic inversion with deep learning: A proposal for litho-type classification. Comput. Geosci. 2021, 26, 351–364.
- Walker, W. Probabilistic Unsupervised Learning Using Recognition Parameterized Models. Ph.D. Thesis, University College London, London, UK, 2024.
- Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A review on deep learning techniques for video prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2806–2826.
- Liu, Y.; Liu, B. A modified uncertain maximum likelihood estimation with applications in uncertain statistics. Commun. Stat.-Theory Methods 2024, 53, 6649–6670.
- Ilmi, N.; Budi, W.T.A.; Nur, R.K. Handwriting digit recognition using local binary pattern variance and K-Nearest Neighbor classification. In Proceedings of the 2016 4th International Conference on Information and Communication Technology (ICoICT), Bandung, Indonesia, 25–27 May 2016; pp. 1–5.
- Zhu, J.; Lai, J.; Gan, L.; Chen, Z.; Liu, H.; Xu, G. Video prediction model combining involution and convolution operators. J. Comput. Appl. 2024, 44, 113.
- Wang, J.; Hu, X. Convolutional neural networks with gated recurrent connections. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3421–3435.
- Liu, B.; Lv, J.; Fan, X.; Luo, J.; Zou, T. Application of an improved DCGAN for image generation. Mob. Inf. Syst. 2022, 2022, 9005552.
- Saxena, D.; Cao, J. Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Comput. Surv. 2021, 54, 1–42.
- Li, X.F.; Cheng, S.L.; Yang, H.Y.; Yan, Q.; Wang, B.; Sun, Y.T.; Yan, H.; Zhao, Q.X.; Xin, Y.J. Vibration characteristics and elastic wave propagation properties of mirror-symmetric structures of trichiral ligaments. Photonics Nanostructures-Fundam. Appl. 2023, 54, 101120.
- Shao, H.; Ma, E.; Zhu, M.; Deng, X.; Zhai, S. MNIST Handwritten Digit Classification Based on Convolutional Neural Network with Hyperparameter Optimization. Intell. Autom. Soft Comput. 2023, 36.
Model | SSIM (↑) | PSNR (↑) | MSE (↓) | FVD (↓) | Training Time (h) | Memory Usage (GB) |
---|---|---|---|---|---|---|
SimVP (CVPR’22) | 0.892 | 28.3 | 0.046 | 168.2 | 24.5 | 11.2 |
DMVFN (CVPR’23) | 0.901 | 29.1 | 0.042 | 156.3 | 18.3 | 9.8 |
PredRNN (NIPS’21) | 0.887 | 27.8 | 0.049 | 172.1 | 22.7 | 10.5 |
PhyDNet (CVPR’20) | 0.885 | 27.5 | 0.051 | 175.4 | 23.1 | 12.3 |
E3D-LSTM (ICLR’20) | 0.889 | 28.1 | 0.047 | 170.8 | 25.6 | 13.1 |
ConvLSTM (NIPS’15) | 0.873 | 26.4 | 0.058 | 185.2 | 20.2 | 8.7 |
MAU (ICCV’21) | 0.895 | 28.7 | 0.044 | 163.5 | 21.8 | 10.8 |
MotionRNN (ICLR’21) | 0.898 | 28.9 | 0.043 | 159.7 | 19.5 | 10.1 |
CrevNet (ICLR’20) | 0.888 | 27.9 | 0.048 | 171.3 | 23.8 | 11.5 |
Ours (ConvGRU) | 0.913 | 30.2 | 0.038 | 148.5 | 16.8 | 8.9 |
Scenario | Accuracy (%) | SSIM | PSNR (dB) | Processing Time (ms) | Memory Usage (MB) | Frame Rate (fps) |
---|---|---|---|---|---|---|
Standard Writing Conditions | ||||||
Isolated digits | 95.3 ± 0.4 | 0.913 | 31.2 | 15.2 | 256 | 65.8 |
Two-digit sequence | 94.1 ± 0.5 | 0.902 | 30.5 | 16.8 | 268 | 59.5 |
Three-digit sequence | 92.8 ± 0.6 | 0.894 | 29.8 | 18.4 | 275 | 54.3 |
Four-digit sequence | 91.5 ± 0.7 | 0.885 | 29.1 | 20.1 | 282 | 49.8 |
Writing Style Variations | ||||||
Cursive writing | 93.1 ± 0.5 | 0.895 | 30.1 | 16.8 | 264 | 59.5 |
Connected digits | 91.8 ± 0.7 | 0.882 | 29.4 | 18.5 | 271 | 54.1 |
Fast writing | 90.5 ± 0.8 | 0.875 | 28.9 | 17.9 | 268 | 55.9 |
Slow writing | 94.2 ± 0.4 | 0.908 | 30.8 | 16.5 | 262 | 60.6 |
Environmental Challenges | ||||||
Low light | 89.7 ± 0.9 | 0.865 | 28.3 | 19.8 | 273 | 50.5 |
Motion blur | 88.4 ± 1.0 | 0.858 | 27.9 | 20.5 | 275 | 48.8 |
Background noise | 90.2 ± 0.8 | 0.871 | 28.6 | 19.1 | 270 | 52.4 |
Perspective distortion | 87.9 ± 1.1 | 0.852 | 27.5 | 21.2 | 278 | 47.2 |
Special Cases | ||||||
Overlapping digits | 89.5 ± 1.0 | 0.867 | 28.4 | 20.3 | 276 | 49.3 |
Different fonts | 92.3 ± 0.6 | 0.889 | 29.7 | 17.5 | 267 | 57.1 |
Mixed styles | 91.1 ± 0.8 | 0.878 | 29.0 | 18.8 | 272 | 53.2 |
Non-standard forms | 88.7 ± 1.2 | 0.861 | 28.1 | 20.8 | 277 | 48.1 |
Model Configuration | Accuracy (%) | SSIM | PSNR (dB) | Time (ms) |
---|---|---|---|---|
Full model | 95.3 | 0.913 | 31.2 | 15.2 |
w/o temporal branch | 91.2 | 0.875 | 29.4 | 12.8 |
w/o spatial branch | 90.8 | 0.869 | 29.1 | 12.5 |
w/o attention | 92.4 | 0.888 | 30.2 | 14.1 |
w/o feature fusion | 89.7 | 0.862 | 28.8 | 13.6 |
Basic ConvGRU | 88.5 | 0.854 | 28.3 | 11.9 |
Attention Mechanism Variants | ||||
Channel attention only | 93.1 | 0.892 | 30.5 | 14.5 |
Spatial attention only | 92.8 | 0.889 | 30.3 | 14.3 |
Self-attention | 94.2 | 0.901 | 30.8 | 15.8 |
Cross-attention | 94.5 | 0.905 | 30.9 | 16.1 |
Feature Fusion Strategies | ||||
Concatenation | 93.8 | 0.898 | 30.6 | 14.8 |
Addition | 92.5 | 0.885 | 30.1 | 14.2 |
Weighted sum | 93.2 | 0.891 | 30.4 | 14.6 |
Adaptive fusion | 94.8 | 0.908 | 31.0 | 15.5 |