Article

End-to-End Deep Learning Framework for Arabic Handwritten Legal Amount Recognition and Digital Courtesy Conversion

1 Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Chhatrapati Sambhajinagar 431004, India
2 Department of Computer Science, Hodeidah University, Al-Hudaydah P.O. Box 3114, Yemen
3 Department of Software Engineering, Northwestern Polytechnical University, Xi’an 710072, China
4 Department of Artificial Intelligence and Data Science, College of AI Convergence, Daeyang AI Center, Sejong University, Seoul 05006, Republic of Korea
5 Department of Computer Science & Engineering, University of North Texas, Denton, TX 76205, USA
6 Department of Digital and Cyber Forensics, Government Institute of Forensic Science, Chhatrapati Sambhajinagar 431004, India
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(14), 2256; https://doi.org/10.3390/math12142256
Submission received: 11 June 2024 / Revised: 2 July 2024 / Accepted: 10 July 2024 / Published: 19 July 2024

Abstract

Arabic handwriting recognition and conversion are crucial for financial operations, particularly for processing handwritten amounts on cheques and financial documents. Research in this area remains relatively limited for Arabic compared with other languages. This study introduces an innovative AI-driven method for simultaneously recognizing and converting Arabic handwritten legal amounts into numerical courtesy forms. The framework consists of four key stages. First, a new dataset of Arabic legal amounts in handwritten form (“.png” image format) is collected and labeled by native speakers. Second, a YOLO-based AI detector extracts individual legal amount words from the entire input sentence images. Third, a robust hybrid classification model is developed, combining ensemble Convolutional Neural Networks (CNNs) with a Vision Transformer (ViT) to improve the prediction accuracy of single Arabic words. Finally, a novel conversion algorithm transforms the predicted Arabic legal amounts into digital courtesy forms. The framework’s performance is fine-tuned and assessed using 5-fold cross-validation tests on the proposed novel dataset, achieving a word level detection accuracy of 98.6% and a recognition accuracy of 99.02% at the classification stage. The conversion process yields an overall accuracy of 90%, with an inference time of 4.5 s per sentence image. These results demonstrate promising potential for practical implementation in diverse Arabic financial systems.

1. Introduction

The Arabic script is one of the most widely used writing systems, employed by over 300 million people worldwide [1]. However, the automatic recognition of Arabic handwriting remains an extremely challenging task. This difficulty stems from the rich complexity of Arabic script, including its cursive style, context-sensitive shapes, and numerous ligatures and overlaps [2,3]. Handwritten Arabic literal amounts are ubiquitous across financial documents and forms. Digitized Arabic documents could facilitate streamlined back-end data processing, improving efficiency and accuracy over manual entry [4]. For banking and finance, the fast digital conversion of deposit slips, cheques, and invoices would speed up transactions and reduce costs [5,6]. Converting handwritten records into searchable formats can also expand access to historical archives, manuscripts, and other cultural materials. Individuals with visual impairments or limited literacy skills further stand to benefit from technology translating pen strokes into text.

Recognizing literal amounts written in Arabic poses unique difficulties compared to other languages [7]. The optical character recognition (OCR) of handwritten Arabic amounts must contend with intricacies like overlapping characters, variant glyph shapes, fragmented characters, and diverse writing styles [7,8,9,10]. The cursive nature of Arabic script leads to heavy merging between characters, making individual glyph segmentation difficult [2,11]. Inconsistent handwriting and sloped alignment further complicate accurate recognition [7]. Prior approaches to recognizing Arabic handwritten legal amounts have focused on segmented words or sub-words [12,13,14,15,16]. However, the accurate segmentation of heavily slanted and merged characters remains challenging. Errors propagate through the recognition pipeline, severely limiting overall accuracy. Preprocessing and careful feature engineering are often needed to extract salient characteristics for classification [17]. The complexity of these traditional OCR pipelines makes end-to-end training difficult.

Recent advances in deep learning provide new opportunities for developing end-to-end Arabic handwriting recognition without explicit segmentation [18,19,20]. Deep neural networks can learn robust feature representations directly from raw image data. Convolutional Neural Networks (CNNs) have shown remarkable success in image and text classification tasks [21,22,23,24]. CNNs contain repeated convolutional layers that act as trainable feature extractors. Stacked convolutional layers enable the learning of hierarchical representations, from low-level edges to high-level semantic features [25]. CNNs require minimal preprocessing and can automatically learn discriminative visual characteristics compared to handcrafted features. Sophisticated CNN architectures, like Residual Networks [22] and DenseNets [26], have achieved state-of-the-art results by increasing depth and feature reuse. Transfer learning allows for the leveraging of knowledge from large CNNs pre-trained on datasets like ImageNet [27]. Fine-tuning on target data adapts the models to new tasks [28,29]. Recently, Vision Transformers (ViTs) have emerged as an alternative to CNNs [30,31]. ViTs rely entirely on self-attention to model long-range dependencies in images. By capturing global context, ViTs can complement CNNs’ strengths in local feature extraction [32]. Hybrid CNN–ViT models have proven very effective on image recognition benchmarks.
Object detection is a key component of many computer vision systems [33]. Deep learning methods like YOLO [34] and Faster R-CNN [35] have driven rapid advances in object detection accuracy and speed. YOLO divides images into grids and predicts bounding boxes and class probabilities directly from full images in one pass. This enables real-time processing while maintaining high accuracy. Faster R-CNN introduces a region proposal network to generate candidate object regions, improving on prior R-CNN models. These methods are used in various application domains, e.g., healthcare, autonomous vehicles, face recognition, and extracting fields from printed bank cheques; YOLO also provides highly performant frameworks for detecting objects such as text and handwriting in answer sheet images [36].

This work proposes an end-to-end framework for Arabic handwritten amount recognition using deep transfer learning. To the best of our knowledge, this represents the first approach targeting the end-to-end recognition of complete legal amounts in Arabic bank cheques. The framework integrates state-of-the-art object detection and image classification models—YOLOv5 for locating amount words and a hybrid CNN–ViT model for word recognition. Rather than segmenting words into individual glyphs, the models recognize complete amounts in a single forward pass. Fine-tuning on a dataset of Arabic amounts allows the models to learn specialized visual features for this task. By recognizing full amounts holistically, the approach aims to overcome the cascading errors of segmentation-based OCR. The proposed system requires no complex preprocessing, segmentation, or feature engineering. It can directly recognize amounts from raw cheque images after a simple resizing step. The major contributions of this work are briefly summarized here.
  • A novel end-to-end AI-based framework is proposed for the simultaneous word level detection and classification of Arabic handwritten amounts. To the best of our knowledge, this is the first method that aims to recognize full legal amounts in Arabic cheques end-to-end, overcoming the limitations of segmentation-based OCR methods.
  • A new dataset of Arabic legal amount handwriting images has been collected and annotated by native experts.
  • A hybrid AI classification model is introduced, which combines the local feature extraction of an ensemble of CNNs with the global feature extraction of a ViT. This combination of complementary AI networks enhances the extracted knowledge, resulting in a superior recognition rate.
  • A novel algorithm is proposed for converting a series of recognized Arabic handwritten words into numerical amounts for legal purposes.
The rest of this article is organized as follows: Section 2 reviews the relevant and recent AI-based Arabic literature studies. Section 3 explains the proposed methodology and describes the datasets, training process, and experiments. Section 4 presents the results and discussion. Section 5 concludes the paper.

2. Related Work

The recognition of Arabic legal amounts has been a topic of interest for several researchers, who have employed diverse methods to tackle the issue; however, these researchers have only applied their techniques to segmented words within the legal amount rather than to the whole legal amount field. In this section, we review the existing studies in the following three categories: Arabic handwritten legal amount datasets, Arabic handwriting word extraction, and the recognition of segmented legal amount words.

2.1. Arabic Handwritten Legal Amount Datasets

Access to a comprehensive Arabic bank cheque database is important for researchers engaged in Arabic cheque processing. This necessity extends beyond mere research, as such databases play a vital role in comparing and evaluating competing techniques. Our examination of existing databases related to Arabic bank cheque processing reveals that only one database is sourced from actual Arabic bank cheques, and it is available for acquisition at a cost [37]. Its authors presented the database for recognizing handwritten Arabic legal and courtesy amounts on cheques. Collecting the database involved collaboration with the Al Rajhi Banking and Investment Corporation to gather real-world gray-level cheque images. The work involved gathering real-life data, segmentation, binarization, tagging, and the validation of the tagging process. The resulting databases include Arabic cheques, legal amounts, sub-words, and courtesy amounts written in Indian digits. A limitation of the collected database is the uneven distribution of sub-word classes, which may cause training problems, particularly for rarely used classes; the training and testing of such classes become more difficult and may impose restrictions on the type of classifiers that can be used. In [38], Meslati et al. used 576 words for training their hybrid neuro-symbolic classifier. These words were taken from a vocabulary of 48 words, and four writers wrote each word three times. The authors also tested their classifier on 532 words, where three writers wrote the 48 words of the vocabulary thrice.
Additionally, they performed experiments with independent training and testing databases. The first database contained 480 words, where 10 writers wrote the 48 words, and the second database contained 1200 words, where 25 writers wrote the 48 words. The authors do not provide any information about the identity of the writers or their characteristics, and the dataset they used is not available online. In [39,40], the authors introduced a database comprising 4800 words corresponding to 48 frequently used Arabic legal terms; this database was built with the participation of 100 distinct writers for each word, and it is not available online. Al-Ma’adeed [41] introduced the AHDB (Arabic Handwriting Database), a comprehensive dataset encompassing Arabic words and texts written by one hundred individual writers. A subset of this database comprises handwritten representations of numerical values and quantities typically found on cheques. In total, the AHDB dataset encompasses 4970 distinct words corresponding to 50 distinct legal amounts, each written by 100 different writers; the legal amount sentence part is not available online.

2.2. Arabic Words Extraction from Whole Cheque Images

Much effort has been put into studying the extraction of Arabic handwriting. The commonly adopted methods for word extraction are mainly bottom-up approaches, utilizing connected component analysis [42], structural feature extraction [43], or a combination of both [44]. In [45], AlKhateeb et al. proposed a component-based approach that analyzed the connected components based on baseline information, resulting in a correct extraction rate of 85%. In [46], an SVM-based gap metric was proposed to segment text at the line level using a threshold to classify gaps as either “within” or “between” words. The method was tested on the ICDAR 2007 Handwriting Segmentation Contest datasets and achieved an F-measure of 93%. In [47], a gap metric method was introduced, employing a clustering algorithm to determine segmentation thresholds, distinguishing between “within the word” and “between words” gaps. Testing on the AHDB dataset [41] resulted in a correct extraction rate of 84.8%. In [48], Al-Dmour et al. proposed a method that used two spatial measures—connected component length and the gaps between them. The connected component lengths were clustered to differentiate between letters, sub-words, and words using the Self-Organizing Map (SOM) algorithm.
Additionally, gaps were clustered into two groups to indicate whether the gap occurred between words or within a word. The method was tested on the AHDB dataset [41] and achieved a correct extraction rate of 86.3%. However, most of these methods rely on gap classification, and the spaces between words cannot be used as a reliable characteristic for word extraction due to varying writing styles. In [43], N. Aouadi et al. proposed an automatic system for Arabic handwritten word extraction and recognition, which achieved an average recognition rate of 87% on historical handwritten documents. In [49], a CNN-BLSTM-CTC neural network architecture was used for word extraction. The CNN extracted features from the input handwritten text images, which were then fed into a BLSTM network, followed by a CTC layer to map the text-line images to their transcriptions. The method was tested on the KHATT Arabic database [50] and achieved an extraction success rate of 80.1%. T. Ben et al. [51] presented an innovative approach for text recognition using an Attention-based CNN-ConvLSTM model followed by a Connectionist Temporal Classification (CTC) function. An Attention-based Convolutional Neural Network (CNN) processed the text-line image inputs to extract key features. These features and the text-line transcriptions were then fed into a Convolutional Long Short-term Memory (ConvLSTM) network to establish a mapping between them. The CTC function was applied to learn the alignment between the text-line images and their corresponding transcriptions. The proposed model was evaluated on the KHATT dataset and achieved an extraction rate of 91.7%. The model also achieved a 92.8% extraction rate on the AHDB database and a 94.1% extraction rate on the IFN/ENIT database.

2.3. Arabic Words Recognition

In general, several studies used various methods for Arabic handwriting recognition, in particular for Arabic legal amount word recognition. Most researchers have focused on using handcrafted features in Arabic handwritten word recognition. A system for recognizing Arabic handwriting words was presented in [52]. This system utilizes several crucial structural features, including sub-words, diacritics, loops, ascenders, and descenders, which are extracted from the word skeleton and are then input into a neural network. The performance of the proposed system was evaluated using a set of images from the IFN/ENIT benchmark database. In [43], a handcrafted approach was employed to extract several significant structural features such as loops, stems, legs, and diacritics, taking into account their position within the word, such as the beginning, middle, or end, and their location within the upper, central, or lower bands. These extracted features were then subjected to a Markovian classifier. The performance of the model was evaluated using the IFN-ENIT dataset. Practically, several studies have been carried out on Arabic legal amount word recognition. Souici et al. [38] introduced a novel hybrid neuro-symbolic classifier to recognize Arabic legal amount words. The authors utilized contour tracing to extract features, including the number of loops, ascenders, descenders, connected components, and diacritical marks. The classifier was trained on 1200 words from 25 different writers. Hassen et al. [53] proposed an Arabic handwritten monetary word recognition system based on multiple statistical features. The system employed a Histogram of Oriented Gradients, Invariant Moments (IVs), and Gabor filters as statistical features, and utilized the SMO classifier for recognition. In another study, Al-Nuzaili et al. [54] employed Gabor features and utilized the Extreme Learning Machine and Sequential Minimal Optimization classifiers. The Gabor feature extraction was performed in two groups, with image sizes of 128 × 128 and 32 × 64. The performance of the proposed methods was assessed using the AHDB and CENPARMI datasets. Recently, several research efforts have been made in Arabic handwriting recognition. N. Altwaijry et al. [55] presented a Convolutional Neural Network (CNN) model for recognizing Arabic handwriting. The model was trained and evaluated on the Hijja and AHCD datasets, achieving a recognition accuracy of 88% and 97%, respectively. R. Maalej et al. [56] proposed a deep neural network system that combines CNNs and Long Short-term Memory (LSTM) networks. The CNN component extracts important features from the input images, while the LSTM network classifies the inputs based on these features. The system also includes a Connectionist Temporal Classification (CTC) layer, which improves the performance of the LSTM network by aligning its output with the input sequence. M. Elleuch et al. [57] proposed a handwriting recognition model that integrates a CNN with a Support Vector Machine (SVM) classifier. The model uses raw pixel data as input and has been tested on two Arabic handwriting datasets—HACDB and IFN/ENIT. El-Melegy et al. [58] developed a deep learning approach based on a CNN for recognizing Arabic handwritten amounts. The approach was trained and tested on the AHDB dataset.

3. Methods and Materials

The proposed AI framework for Arabic Handwritten Legal Amount Recognition and Digital Courtesy Conversion, as illustrated in Figure 1, offers a sophisticated, end-to-end solution that comprises two main stages—word detection and word classification. Initially, handwritten Arabic data are collected from 160 native speakers and are meticulously annotated at both word and sentence levels to facilitate precise detection and classification. The data undergo rigorous preprocessing, including image normalization, uniform resizing, data splitting, and augmentation, to optimize it for model training. The first stage employs an AI-based YOLO (You Only Look Once) detection model to localize and accurately extract Arabic words from the images. In the second stage, a hybrid classification model, which integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), is used to recognize the extracted words. A courtesy amount conversion algorithm then processes these recognized words to generate the final digital courtesy amount. This comprehensive framework is meticulously designed to ensure high accuracy and efficiency in recognizing and converting handwritten Arabic legal amounts, leveraging state-of-the-art AI techniques to achieve a robust performance.

3.1. Arabic Handwritten Legal Amount (AHLA) Dataset

The AHLA dataset was collected by distributing a specially designed report form to native Arabic speakers. Our dataset contains two kinds of Arabic handwritten expressions as follows:
  • Arabic word-level images are used to express the legal amounts of bank cheques, including the colloquial words used in writing Arabic numbers. A total of 10,660 legal amount word images across 33 different classes were collected and labeled, covering the variety of Arabic vocabulary needed to write out monetary sums, as presented in Table 1.
  • Arabic legal amount sentence images, for which 160 native Arabic writers participated in writing 160 legal amount reports in Arabic words, as shown in Figure 2.
The focused scope on legal/financial vocabulary and formats in word- and sentence-level samples is intended to provide challenging real-world examples for developing Arabic handwriting text recognition systems, especially for financial use cases such as processing monetary documents or bank cheques. The driving purpose behind assembling this multi-tiered dataset is to provide the appropriate Arabic language samples to train and test systems that automatically identify and understand handwritten legal amounts on financial documents and then convert those semantic phrases into corresponding numeric currency totals suitable for digital processing and banking.

3.2. Data Preprocessing and Preparation

3.2.1. Arabic Legal Amount Sentence Images

In the initial phase of preparing the legal amount sentences’ images for word detection, we embarked on the task of annotating the images. This annotation process involved the precise delineation of objects of interest (Arabic handwritten legal amount word) through the utilization of bounding boxes. To facilitate this task, we harnessed the capabilities of Roboflow’s annotation tool. This resource was pivotal in streamlining the annotation process, enhancing our labels’ accuracy and consistency, as shown in Figure 2. Following meticulous annotation, we partitioned the annotated images into subsets; the dataset was segregated into a training set comprising 65% of the data, a validation set encompassing 22%, and a testing set incorporating 13%, as shown in Table 2. Preprocessing steps were executed to prepare the data for training our models. These preprocessing steps included resizing images to conform to a uniform size and normalizing pixel values. The standardization of image sizes and pixel values is of paramount importance in ensuring the effectiveness of the model training process.
Furthermore, a suite of data augmentation techniques was applied to augment the dataset and enhance the model’s capacity to generalize across diverse data instances. These augmentation techniques encompassed operations such as image flipping, rotation, cropping, and adjustments to brightness and contrast. Notably, these data augmentation procedures were exclusively implemented on the training data. The rationale behind this approach is to prevent any bias in the evaluation metrics when assessing model performance on the validation and test datasets. Lastly, the dataset was exported in YOLOv5 PyTorch TXT format. This format was chosen to align with the requirements of the YOLOv5 model architecture, ensuring seamless compatibility between the dataset and the model.

3.2.2. Arabic Word-Level Images

Preprocessing plays a crucial role in Arabic handwritten legal amount word recognition because of the unique attributes of Arabic writing, variations in writing styles, and quality. A comprehensive preprocessing pipeline is critical to improve the input image quality and facilitate accurate segmentation and recognition. The preprocessing pipeline for Arabic handwritten legal amount word recognition includes image scaling, data splitting, and data augmentation. Image scaling resizes input images to a consistent resolution of 224 × 224 pixels, standardizing the image dimensions to ensure compatibility with models and to mitigate the impact of variations in image sizes. In the context of multiclass recognition approaches, the dataset undergoes a division process to facilitate training, validation, and testing procedures. Specifically, this division randomly allocates word images from each class into three subsets—training, testing, and validation. This division is executed for all multiclass recognition approaches, as follows: 70% of the word images corresponding to each class are allocated to the training set, 20% to the validation set, and 10% to the testing set. The data distribution across these subsets is presented in Table 3. Data augmentation methods are utilized on training data to enhance the robustness of models and to reduce the risk of overfitting. These techniques include rescaling to normalize pixel values, shear transformations to mimic diverse writing styles, random zooming to replicate varying text sizes, and horizontal flipping for orientation diversity. These operations generate new images from the existing ones, introducing controlled variations and enhancing the models’ generalization ability to real-world scenarios.
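A minimal sketch of this preprocessing and augmentation pipeline is given below, assuming the Keras ImageDataGenerator API in TensorFlow 2.x; the directory layout (`ahla_words/train`, `ahla_words/val`) and the exact augmentation ranges are illustrative assumptions, not the authors' settings.

```python
# Sketch of word-image preprocessing: resize to 224x224, rescale pixel values,
# and augment the training split only (shear, zoom, horizontal flip).
import tensorflow as tf

IMG_SIZE = (224, 224)  # uniform resolution for the word images

train_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,      # normalize pixel values
    shear_range=0.15,       # mimic slanted writing styles
    zoom_range=0.15,        # replicate varying text sizes
    horizontal_flip=True,   # orientation diversity, as described above
)
val_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

train_data = train_gen.flow_from_directory(
    "ahla_words/train", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")
val_data = val_gen.flow_from_directory(
    "ahla_words/val", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")
```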

3.2.3. Detection Task: AI-Based YOLO Detection Model

YOLOv5 (You Only Look Once version 5) is a state-of-the-art real-time object detection system developed by Glen Jocher and the Ultralytics team [59]. It is an open-source framework for training and deploying object detection models. YOLOv5 uses a single Convolutional Neural Network (CNN) architecture to predict the bounding boxes and class probabilities for objects in an image; the bounding boxes around detected objects allow for further processing, such as tracking and recognition. The architecture is designed to be fast and efficient, making it suitable for real-time applications. In this work, YOLOv5 is used as a pre-trained object detection model and fine-tuned, via transfer learning, for the Arabic handwritten legal amount word extraction task. YOLOv5 consists of three main architectural blocks, CSPDarknet, PANet, and the YOLO detection head, as shown in Figure 3. The CSPDarknet block is the backbone of the model and is responsible for extracting features from the input image. It is created by incorporating the cross-stage partial network (CSPNet) into Darknet. The CSPDarknet consists of a series of convolutional layers with Cross-Stage Partial connections, combining features from earlier layers with features from deeper layers. This helps improve the model’s accuracy and robustness and ensures inference speed. As seen in Figure 3, the C3 block is a building block of the CSPDarknet architecture; the block consists of a series of convolutional layers, each with a kernel size of 3 × 3, that extract features from the input image, and batch normalization and a Leaky ReLU activation function typically follow the C3 block. The Path aggregation network (PANet) is the neck of the YOLOv5 model, and it boosts information flow by adopting a new feature pyramid network (FPN) structure with an enhanced bottom-up path, which improves the propagation of low-level features.
Additionally, adaptive feature pooling links the feature grid and all feature levels, propagating useful information in each feature level directly to the following sub-network; this improves the utilization of accurate localization signals in lower layers, which can significantly enhance the location accuracy of objects. Finally, the detection block, also known as the model head, generates three different sizes of feature maps (18 × 18, 36 × 36, and 72 × 72) to achieve multi-scale prediction, allowing the model to handle small, medium, and large objects. This is particularly useful when working with Arabic handwritten words, as the sizes and shapes of these words can vary significantly between writers. Multi-scale detection ensures the model can adapt to size changes in detecting words. The final detection layers in YOLOv5 use the sigmoid activation function, while the middle or hidden layers utilize Leaky ReLU activation functions. The final output vectors of the detection block include the bounding boxes, confidence scores, and class probabilities for each detected object (word). This work ignores any prediction with a confidence score lower than 0.5. The detected words are arranged in descending order of their xmin values to match the right-to-left writing direction of Arabic; these words are then segmented and passed to the next step for recognition. To retrain the YOLOv5s model, we utilized 200 legal amount images from the AHLA dataset, which were annotated. The labeling and annotation of the images were accomplished online using the Roboflow platform (https://roboflow.com). The images were randomly partitioned into 80% training, 15% validation, and 5% testing sets.
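The detection post-processing described above (confidence filtering at 0.5, right-to-left ordering by xmin, and cropping of the detected words) can be sketched as follows using the public torch.hub interface of YOLOv5; the weight-file name is a hypothetical placeholder for the fine-tuned checkpoint.

```python
# Hedged sketch of the word-detection post-processing, not the authors' exact code.
import torch
from PIL import Image

model = torch.hub.load("ultralytics/yolov5", "custom", path="ahla_yolov5s.pt")

def extract_words(image_path, conf_thresh=0.5):
    """Detect word boxes, keep confident ones, and order them right-to-left."""
    image = Image.open(image_path)
    det = model(image).xyxy[0].tolist()   # [xmin, ymin, xmax, ymax, conf, cls]
    det = [d for d in det if d[4] >= conf_thresh]
    det.sort(key=lambda d: d[0], reverse=True)   # descending xmin = Arabic reading order
    # Crop each detected word for the classification stage.
    return [image.crop((d[0], d[1], d[2], d[3])) for d in det]
```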

3.3. Classification Task: The Proposed Hybrid Ensemble CNNs with the ViT

In this work, to recognize the Arabic handwritten amount words, we propose a novel deep learning architecture for Arabic handwritten word recognition that combines an ensemble of two CNN models—Xception and InceptionResNetV2—with a Vision Transformer (ViT), as shown in Figure 4. The model takes Arabic word images as input and passes them through the two CNNs independently, which act as feature extractors. Their outputs are flattened and concatenated to ensemble their learned visual features. In parallel, the image is fed to the ViT model, which encodes global contextual relationships. The concatenated CNN ensemble output is fused with the ViT output through concatenation. This allows the model to represent robust local features from the CNN ensemble and long-range dependencies from the ViT. The concatenated multi-model representation is fed to a classifier. By leveraging an ensemble of CNNs, our model can capture diverse, complementary local visual patterns. The ViT incorporates global context and long-range sequence relationships. By integrating the CNN ensemble with the ViT, our model architecture is designed to combine their respective strengths. The entire hybrid model is trained end-to-end to recognize Arabic handwritten words. Extensive experiments show that our hybrid ensembling approach performs better than individual CNN or ViT models, demonstrating the effectiveness of fusing diverse models.
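A minimal Keras sketch of this parallel fusion (Hybrid A) is given below; it is an illustrative reconstruction, not the authors' code. The two ImageNet-pretrained backbones are real `tf.keras.applications` models, while the ViT branch is a simplified stand-in (a small patch-embedding transformer encoder) because the exact pre-trained ViT used is not specified here; layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, applications

NUM_CLASSES = 33  # word classes in the AHLA dataset

def vit_branch(x, patch=16, dim=256, heads=4, depth=2):
    # Simplified ViT-style branch: patch embedding, transformer blocks, pooling.
    seq = layers.Conv2D(dim, patch, strides=patch)(x)   # non-overlapping patch embedding
    seq = layers.Reshape((-1, dim))(seq)                # (num_patches, dim) token sequence
    for _ in range(depth):
        attn = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(seq, seq)
        seq = layers.LayerNormalization()(seq + attn)   # residual + norm
        ff = layers.Dense(dim, activation="gelu")(seq)
        seq = layers.LayerNormalization()(seq + layers.Dense(dim)(ff))
    return layers.GlobalAveragePooling1D()(seq)

inputs = layers.Input(shape=(224, 224, 3))

# Frozen CNN backbones acting as fixed feature extractors.
xcep = applications.Xception(include_top=False, weights="imagenet", pooling="avg")
incres = applications.InceptionResNetV2(include_top=False, weights="imagenet", pooling="avg")
xcep.trainable = False
incres.trainable = False

cnn_features = layers.Concatenate()([xcep(inputs), incres(inputs)])  # CNN ensemble features
fused = layers.Concatenate()([cnn_features, vit_branch(inputs)])     # fuse with the ViT branch
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(layers.Dropout(0.5)(fused))
hybrid_a = Model(inputs, outputs, name="hybrid_a")
hybrid_a.summary()
```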

3.3.1. Local Knowledge Extraction: Ensemble Transfer Learning via Various CNNs

We carefully selected the best-performing AI models for the detection and classification stages based on empirical and experimental studies. This trial-and-error strategy evaluates various individual models and then selects those with the best recognition performance.
  • Xception: Xception is a deep Convolutional Neural Network architecture proposed by Chollet et al. [60]. It relies on depth-wise separable convolutions as basic building blocks. Depth-wise separable convolutions split a standard convolution into two layers—a depth-wise convolution to filter inputs per channel, followed by a pointwise convolution to combine channel outputs. This factorization dramatically reduces computation compared to traditional convolutions. Xception achieves a strong image classification performance with only 36 convolutional layers, much less than other state-of-the-art CNNs. It provides an efficient and lightweight alternative to models like Inception V3 and ResNet [60]. Using separable convolutions allows Xception to gain accuracy from considerably increased depth, while maintaining a relatively low computational budget. Key strengths of Xception include its efficiency, accuracy, and modular structure based on repeating convolution blocks [61].
  • InceptionResNetV2: Inception-ResNetV2 was proposed by Szegedy et al. [62] to improve upon earlier Inception architectures. It combines the Inception structure with residual connections, inspired by the success of ResNet models [22]. Inception modules employ multiple filter sizes in parallel to capture details at various scales. Residual connections help optimization and accuracy by allowing direct gradient propagation across layers. Inception-ResNetV2 achieves high accuracy on image classification benchmarks like ImageNet, surpassing its predecessors—the hybrid Inception–Resnet design benefits from multi-scale feature extraction and residual learning. Additional enhancements include batch normalization and factorized convolutions to reduce computational requirements. The model is readily transferable and can be a strong feature extractor for many vision tasks.

3.3.2. Global Knowledge Extraction: Vision Transformer (ViT)

The Vision Transformer (ViT) was proposed by Dosovitskiy et al. [30] as a new approach to applying transformers to image recognition tasks. While transformers have been hugely successful in NLP, ViT was the first model to demonstrate their potential for computer vision. The key insight is representing an image as a sequence of patches embedded as tokens, which allows modeling with standard transformer encoders. ViT relies solely on attention mechanisms without convolution layers to learn relationships between image patches. ViT achieves a strong performance on image classification benchmarks, rivaling state-of-the-art CNNs. The global receptive field of transformers allows modeling longer-range dependencies compared to CNNs. ViT has become hugely influential, leading to numerous further adaptations of transformers for vision [31].
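To make the tokenization step concrete, the following toy example shows how a 224 × 224 word image becomes a sequence of 14 × 14 = 196 patch tokens (16 × 16 patches of 768 raw values each) that a linear layer then projects to an embedding dimension; the embedding size of 256 is an illustrative choice, not a value from the paper.

```python
# Toy illustration of ViT patch tokenization and linear embedding.
import tensorflow as tf

image = tf.random.uniform((1, 224, 224, 3))          # stand-in for a word image
patches = tf.image.extract_patches(
    images=image, sizes=[1, 16, 16, 1], strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1], padding="VALID")
tokens = tf.reshape(patches, (1, 14 * 14, 16 * 16 * 3))   # (batch, 196, 768)
embedded = tf.keras.layers.Dense(256)(tokens)              # (batch, 196, 256)
print(embedded.shape)
```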

3.4. Courtesy Amount Conversion

The writing structure of an Arabic legal amount sentence runs from right to left, with the highest value on the right and decreasing toward the left; in other words, the largest value (the millions part) comes first on the right, followed by the smaller value (the thousands part), until the smallest value (the ones part) comes last on the left. In this section, we propose an algorithm, named the LegalToCourtesy algorithm (see Algorithm 1), to generate the courtesy amount from the Arabic legal amount sentence after each word has been recognized.
The algorithm computes the courtesy amount value (CAV) from the predicted legal amount words. The input is the set of predicted legal words (L) produced by the classification phase, and the output is the CAV. For example, let L = [6, 10, 1000000, 800, 20, 1000, 3, 100, 5, 60] be the list of predicted legal amount word values produced by the word detection and word classification phases for the legal amount sentence image in Figure 5.
Algorithm 1 LegalToCourtesy Algorithm
Input: L = {pw_1, pw_2, …, pw_n} ← set of predicted legal amount words
Output: CAV ← courtesy amount value
  • P = {1,000,000, 1000, 100} ← part values of numbers
  • Initialize MP ← Millions Part
  • Initialize TP ← Thousands Part
  • Initialize HP ← Hundreds Part
  • Initialize TOP ← Tens and Ones Part
  • For each part in P do
  •   If part exists in L then
  •     index = L.index(part)
  •     If match(part, 1,000,000) = true then MP = L[0:index]; delete MP and the marker from L
  •     Else if match(part, 1000) = true then TP = L[0:index]; delete TP and the marker from L
  •     Else if match(part, 100) = true then HP = L[0:index + 1]; delete HP from L
  • TOP = L ← the words remaining in L form the Tens and Ones Part
  • Compute MPV according to the CalMilThPartValue function ← Millions Part Value
  • Compute TPV according to the CalMilThPartValue function ← Thousands Part Value
  • HPV = the product of the elements of HP ← Hundreds Part Value
  • TOPV = sum(TOP) ← Tens and Ones Part Value
  • CAV = MPV + TPV + HPV + TOPV
The algorithm splits L into four sets of part values—the millions part (MP), thousands part (TP), hundreds part (HP), and the tens and ones part (TOP). The algorithm checks whether each part value exists in L; if it exists, the index of the part value is used to determine the corresponding set of words for that part. The values of the MP and TP sets are then calculated using the CalMilThPartValue function, as presented in Algorithm 2, and the results are kept in MPV and TPV for the MP and TP sets, respectively, as shown in Figure 6.
The HPV and TOPV are calculated from the HP and TOP element sets, respectively. Finally, the CAV is calculated as follows:
CAV = MPV + TPV + HPV + TOPV = 16,000,000 + 820,000 + 300 + 65 = 16,820,365
Algorithm 2 CalMilThPartValue Function
Define the CalMilThPartValue function as
Input: SP = {sp_1, sp_2, …, sp_n} ← a subset of the part words; Unit ← part value (1,000,000 or 1000)
Output: Value ← value of the part
  • Initialize Value = 0
  • If the hundreds word exists in SP then
  •   index = SP.index(hundreds)
  •   HSP = SP[0:index + 1] ← hundreds part of the subset
  •   Delete HSP from SP
  • USPV = Unit ← unit value of the subset part
  • TSP = SP ← tens and ones part of the subset
  • HSPV = the product of the elements of HSP (0 if HSP is empty) ← Hundreds Part Value
  • TSPV = the sum of the elements of TSP ← Tens Part Value
  • Value = (HSPV + TSPV) × USPV
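One possible Python reading of Algorithms 1 and 2 is sketched below; it is an interpretation for illustration only (in particular, how the part markers are removed from L is an assumption, and the function names `part_value` and `legal_to_courtesy` are ours), not the authors' reference implementation.

```python
# The predicted word values arrive in Arabic reading order (largest part first);
# the markers 1,000,000 / 1,000 / 100 split the list into millions, thousands,
# hundreds, and tens-and-ones sections.

def part_value(words, unit):
    """CalMilThPartValue: value of one millions/thousands section times its unit."""
    hundreds = 0
    if 100 in words:
        i = words.index(100)
        multiplier = words[:i] or [1]        # e.g. [3, 100] -> 3 * 100
        hundreds = multiplier[0] * 100
        words = words[i + 1:]
    return (hundreds + sum(words)) * unit    # remaining words are tens and ones

def legal_to_courtesy(predicted_words):
    """LegalToCourtesy: split on part markers and accumulate the courtesy amount."""
    remaining = list(predicted_words)
    total = 0
    for unit in (1_000_000, 1_000):
        if unit in remaining:
            i = remaining.index(unit)
            total += part_value(remaining[:i], unit)
            remaining = remaining[i + 1:]    # drop this section and its marker
    total += part_value(remaining, 1)        # hundreds plus tens-and-ones remainder
    return total

# Worked example from Figure 5: sixteen million, eight hundred twenty thousand,
# three hundred sixty-five.
L = [6, 10, 1_000_000, 800, 20, 1_000, 3, 100, 5, 60]
print(legal_to_courtesy(L))                  # 16820365
```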

3.5. Experimental Setup

This section details the specific configurations for the evaluation experiments. The experiments were conducted on Google Colab using an Nvidia Tesla T4 K80 GPU with 16 GB of RAM. The YOLOv5 algorithm was implemented using PyTorch 1.13.1 and CUDA cu116, and the hybrid CNN–ViT model was implemented using TensorFlow 2.9.2, based on the Python 3.8.10 programming language.

3.6. Evaluation Metrics

Evaluating the performance of object detection approaches involves utilizing various statistical and machine learning metrics, such as ROC curves, precision and Recall, F-scores, and false positives per image [54]. The detection accuracy is typically determined by comparing the results of the object detector to a reference set of ground-truth bounding boxes. Most object detection studies have employed the overlap criterion introduced by Everingham et al. [55] for the Pascal VOC challenge to determine the correctness of detection. This criterion involves assigning detections to ground-truth objects and determining true or false positives based on the overlap between the predicted and ground-truth bounding boxes. According to [55], a detection is correct if the overlap ratio between the predicted and ground-truth boxes exceeds 0.5 (50%).
The overlap criterion used in Pascal VOC is the intersection over union (IoU) and is calculated as follows:
\mathrm{IoU} = a_0 = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}
where IoU represents the intersection over union; a_0 represents the overlap ratio; B_p and B_gt denote the predicted and ground-truth bounding boxes, respectively; area(B_p ∩ B_gt) represents the overlap or intersection of the predicted and ground-truth bounding boxes; and area(B_p ∪ B_gt) represents the union of these two bounding boxes. By matching detections to the ground truth, it is possible to determine the number of correctly detected objects, referred to as true positives (TPs), incorrect detections or false positives (FPs), and ground-truth objects missed by the detector, or false negatives (FNs). A wide range of evaluation metrics can be computed using the total TP, FP, and FN counts.
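For completeness, the IoU criterion above can be transcribed directly into a small helper; boxes are given as [xmin, ymin, xmax, ymax], and a detection counts as correct when the returned value exceeds 0.5.

```python
def iou(box_p, box_gt):
    # Intersection rectangle
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two box areas minus the intersection
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))   # 0.333... -> not a correct detection
```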
Our YOLOv5-based word extraction module was assessed using four metrics—Precision, Recall, F1-score, and mean Average Precision (mAP). Precision measures the accuracy of detected legal amount words out of the total detected words.
\mathrm{Precision\ (PRE)} = \frac{TP}{TP + FP}
Conversely, Recall calculates the ratio of correctly detected words to the total number of legal amount words in the dataset.
\mathrm{Sensitivity\ (SEN)/Recall\ (RE)} = \frac{TP}{TP + FN}
The F1-score is the trade-off between Precision and Recall, giving a general idea of the algorithm’s performance.
\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
Finally, the Average Precision (AP), averaged over classes to obtain the mean Average Precision (mAP), evaluates the algorithm’s performance under different confidence thresholds.
AP = \int_0^1 P(R)\, dR
We utilized four metrics to assess the proposed hybrid ensemble’s (CNNs–ViT) performance—accuracy (ACC), Recall, Precision, and F1-score. For the model, we calculated a Confusion Matrix. From the Confusion Matrix, we obtained the values for false negatives (FNs), false positives (FPs), true negatives (TNs), and true positives (TPs).
\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + TN + FP + FN}
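These metrics can be computed from the model predictions with standard scikit-learn utilities; the short label vectors below are purely illustrative stand-ins for the validation labels and the hybrid model's argmax outputs.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 2, 1, 1, 2, 0]          # illustrative ground-truth class labels
y_pred = [0, 2, 1, 2, 2, 0]          # illustrative predicted class labels

acc = accuracy_score(y_true, y_pred)
pre = precision_score(y_true, y_pred, average="macro")   # PRE
sen = recall_score(y_true, y_pred, average="macro")      # SEN / Recall
f1 = f1_score(y_true, y_pred, average="macro")           # F1-score
cm = confusion_matrix(y_true, y_pred)                    # per-class TP/FP/FN/TN counts
print(acc, pre, sen, f1)
```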

4. Results and Discussion

This section discusses the experimental results. We evaluated the performance of our novel approach for recognizing legal amounts from a set of documents and generating the courtesy amount. Our approach consists of the following three major components: YOLOv5s-based word detection, hybrid CNN–ViT-based word recognition, and courtesy amount generation.

4.1. Hyperparameter Settings

4.1.1. Hyperparameter Settings of YOLOv5 Model

Several critical hyperparameters were configured in the fine-tuning process of the YOLOv5 algorithm for detecting Arabic handwritten words. The input image size was set to 640 × 640 pixels, with any unoccupied area filled with a white background, defining the dimensions of images used during training. Mini batches of 32 samples were employed for each iteration, influencing the model’s gradient updates. The training was conducted for 300 epochs (i.e., the entire dataset was processed 300 times for model optimization) using the stochastic gradient descent (SGD) algorithm with a learning rate of 0.01, a weight decay of 0.0005, and an SGD momentum of 0.937. The dataset configuration, including class information and file paths, was specified in the data.yaml file. For the initial weights of the YOLOv5 model, ‘yolov5s.pt’ was utilized. Finally, caching was not explicitly enabled in this setup. The other hyperparameters not explicitly mentioned remain consistent with the default values used in YOLOv5s.
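Assuming the Ultralytics YOLOv5 repository is cloned locally (and is the current working directory) and the annotated AHLA export is described by a data.yaml file, the configuration above corresponds approximately to the repository's standard command-line training entry point; this is a sketch, not the authors' exact invocation.

```python
# Launch YOLOv5 fine-tuning with the stated image size, batch size, epochs,
# dataset description, and initial weights.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",            # input image size
    "--batch", "32",           # mini-batch size
    "--epochs", "300",         # training epochs
    "--data", "data.yaml",     # dataset configuration (classes, file paths)
    "--weights", "yolov5s.pt", # initial pre-trained weights
], check=True)
```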

4.1.2. Hyperparameter Settings of Hybrid CNN–ViT

In the training process of the hybrid CNN–ViT model for Arabic handwritten word recognition, several key hyperparameters were configured. The input image size was set to 224 × 224 pixels, with any unoccupied area filled with white background, defining the dimensions of images used during training and inference. Mini batches of 16 samples were employed for each iteration, influencing the model’s gradient updates. The training was conducted for 80 epochs and used the Adam optimization algorithm with a learning rate of 1 × 10−4 and a categorical cross-entropy loss. For regularization, a dropout with a rate of 0.5 was applied to the classifier layers. The CNN layers were frozen during training, while the classifier layers were trained from scratch. The Xception, InceptionResNetV2, and ViT models used ImageNet pre-trained weights for initialization. Data augmentation was applied on the fly, including rotations, shifts, flips, and color jittering. The dataset configuration was specified in the data loading module. The other hyperparameters not explicitly mentioned remain consistent with the default values used in the original CNN and ViT model configurations. The model was trained end-to-end using the above settings to optimize it for Arabic handwritten word recognition.
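Reusing the hybrid model and data generators from the earlier sketches (`hybrid_a`, `train_data`, `val_data`), the stated settings translate roughly to the following Keras calls; the accuracy metric is an added assumption, and the batch size of 16 is set on the generators themselves.

```python
# Compile and train with the stated hyperparameters: Adam with lr = 1e-4,
# categorical cross-entropy, and 80 epochs.
hybrid_a.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = hybrid_a.fit(train_data, validation_data=val_data, epochs=80)
```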
We set these parameters to balance computational efficiency and model performance. The choice of learning rates, batch sizes, and optimization algorithms were determined through a combination of grid search and empirical validation, ensuring a robust performance across various training scenarios. This careful tuning was crucial for achieving the high accuracy rates reported in our results, and it reflects a meticulous approach to model optimization and validation.

4.2. YOLOv5s Model: Word Detection Outcomes

The YOLOv5s-based word extraction was used to detect and extract the words from the legal amount images. The purpose of the loss function is to assess the model’s training progress in each iteration and to compute the discrepancy between the predicted and actual values during the iteration. The YOLOv5 loss function is represented as object loss, box loss, and class loss, where box loss reflects how well the algorithm can pinpoint the object’s center and the accuracy of the predicted bounding box encompassing the object. The target loss (object loss) essentially gauges the likelihood of an object existing in the designated region of interest, and the classification loss (class loss) represents the category loss. As there is only one class in the training set for this study, class loss is equal to 0. The loss function curves for the training procedure are depicted in Figure 7. As demonstrated by the loss curve, at 180 epochs, the loss function of the training set dropped from an initial value of 0.233 to approximately 0.087, and the loss function of the validation set decreased from its starting value of 0.299 to 0.154. The precision of a model refers to its ability to identify an object accurately.
Meanwhile, Recall is a metric that evaluates the extent to which the model searches for all instances of the object when recognizing it. The variation of Precision and Recall during the model’s training, based on the number of epochs, is depicted in Figure 7. It is evident that the highest Precision reached by the model during its training process is 0.997, and the highest Recall achieved is 0.997. The mean Average Precision (mAP) is a crucial metric in determining the effectiveness of an object detection network. It measures the network’s performance by calculating the area under the Precision–Recall curve. The mAP@0.5 metric evaluates the network’s performance when the intersection over union (IoU) threshold is set to 0.5. On the other hand, mAP@0.5:0.95 considers the average precision across different IoU thresholds ranging from 0.5 to 0.95, evaluated at a step of 0.05. As depicted in Figure 8, the mAP curve during the training process is shown. The ultimate evaluation results show that the model achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.711. The confusion matrix for the model is shown in Figure 9. The model achieved an overall accuracy of 99.99%, with only 0.01% of objects misclassified as background. This demonstrates the model’s strong performance in accurately detecting words. The sample results in Figure 10 indicate a clear success for the YOLOv5 model in accurately detecting Arabic handwritten legal amount words. The model was tested on 75 Arabic legal amount images, with impressive results, as shown in Table 4; the Precision and Recall rates achieved by the model were 0.977 and 0.961, respectively. The model also achieved a mean Average Precision (mAP) score of 0.979 at an intersection over union (IoU) threshold of 0.5 and a mAP score of 0.596 at IoU thresholds between 0.5 and 0.95.
Additionally, it is important to note that the number of letters in an Arabic word does not significantly impact detection accuracy. Our model has been designed to effectively handle words of varying lengths, and the accuracy remains consistent regardless of the number of letters. Our model’s performance metrics and visual examples demonstrate this robustness throughout the training and evaluation processes.

4.3. Hybrid Models: Classification Outcomes

Two hybrid learning scenarios were designed to recognize Arabic handwritten words by integrating pre-trained CNN models with the Vision Transformer (ViT) model. In the first scenario, referred to as Hybrid A, a hybrid ensemble of two CNN models, Xception and InceptionResNetV2, was fused in parallel with the ViT, as depicted in Figure 4. In the second scenario, Hybrid B, features extracted from Xception and InceptionResNetV2 were concatenated in series with the ViT. These scenarios aim to determine the optimal positioning of the ViT to enhance classification performance. The classification performance of each hybrid deep learning model is comprehensively delineated in Table 5; regarding classification accuracy, Hybrid A achieved an accuracy of 99.02%, while Hybrid B performed lower, with an accuracy of 97.531%. This suggests that Hybrid A outperforms Hybrid B in terms of overall accuracy. Precision (PRE) and F1-Score metrics also demonstrate the superiority of Hybrid A with a PRE of 98.20% and an F1-Score of 98.17%, compared to Hybrid B’s PRE of 97.62% and F1-Score of 97.52%. These differences indicate that Hybrid A achieves a higher accuracy and exhibits a better precision and F1-Score.
Furthermore, the area under the ROC curve (AUC) and sensitivity (SEN) values are crucial for assessing the models’ performance. Hybrid A surpasses Hybrid B in both AUC (99.89% compared to 99.83%) and SEN (98.17% compared to 97.531%). This implies that Hybrid A can better distinguish between classes and is more sensitive in identifying true positive cases. Furthermore, a comprehensive presentation of the classification evaluation results, specifically regarding Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves, is summarized in Table 6. Hybrid A exhibits impressive results with a macro-average AUC of 99.89% and a micro-average AUC of 99.99%, as well as a macro-average Precision–Recall score of 99.28% and a micro-average Precision–Recall of 99.16%. On the other hand, Hybrid B still performs slightly lower, with a macro-average AUC of 99.80%, a micro-average AUC of 99.99%, along with a macro-average Precision–Recall score of 98.16%, and a micro-average Precision–Recall of 98.96%.
The number of accurately and mistakenly identified samples was assessed using a confusion matrix or contingency table, as depicted in Figure 11. For Hybrid A, 18 samples were misclassified. In contrast, as shown in Figure 12, Hybrid B demonstrated significantly higher misclassifications, with 29 samples being wrongly identified. These findings suggest that Hybrid A outperforms Hybrid B regarding classification accuracy. The lower number of misclassified samples in Hybrid A signifies the higher precision and reliability in its predictions, highlighting its superior performance compared to Hybrid B.

4.4. Courtesy Amount Generation Outcomes

We used 50 images in the test set that were not involved in training to assess the approach’s ability to generate the courtesy amount. The testing results are evaluated based on the generated courtesy amount values. Our approach could generate the courtesy amount with an accuracy rate of 90% and an inference time cost of 4.5 s per image. Our novel approach demonstrated excellent results in extracting the legal amount words, recognizing them, and generating the courtesy amount; Figure 13 shows the samples of courtesy amounts that were correctly generated. Additionally, our approach can detect words even in overlapping cases, as shown in Figure 14a, and can classify words despite spelling mistakes, as demonstrated in Figure 14b. This approach can automate legal amount recognition processing in financial documents, bank cheques, invoices, and other financial transactions. All improperly generated courtesy amounts are caused by inaccurate word detection estimates or incorrect word recognition of legal amounts. Figure 15 shows examples of improperly generated courtesy amounts. In Figure 15a, the error is due to the wrong classification of a word, which is caused by a spelling error that affected the classification process. In Figure 15b, the error arises from detecting a sub-word as a complete word, which impacts the generation of the courtesy amount.

4.5. Comparison of Proposed Methods with Existing Studies

Since there is no existing work on Arabic handwritten legal amount recognition that generates the Arabic courtesy amount from the recognized legal amount, we instead compare the YOLOv5-based word extraction method with existing Arabic word extraction methods, as shown in Table 7, and the hybrid CNN–ViT-based legal amount word recognition method with existing Arabic word recognition methods, as shown in Table 8.
Our experimental results demonstrate that ViTs, as well as the hybrid models integrating the ViT with the selected CNNs, exhibit an even better performance, emphasizing the potential of combining these two deep learning paradigms. Hybrid A emerges as the best-performing model, offering high accuracy and strong discriminative abilities, making it a promising choice for practical Arabic handwritten word classification tasks. The number of accurately and mistakenly identified samples was measured using a confusion matrix or contingency table, as shown in Figure 11 and Figure 12. For Hybrid B, 29 samples were misclassified, while Hybrid A exhibited a significantly lower number of misclassifications, with only 18 samples being wrongly identified. These results suggest that Hybrid A is more robust and accurate in classification than Hybrid B. The lower number of misclassified samples indicates the higher precision and reliability of Hybrid A’s predictions.

4.6. Limitations and Future Work

While promising, the proposed model has several limitations that provide opportunities for future work. Different individuals write the same characters differently, leading to variability in handwriting styles that can impact the model’s accuracy. Additionally, overlapping characters in handwritten Arabic complicate recognition, and spelling errors can affect word detection and classification. Our current framework does not explicitly handle all typos in handwritten words, though it has shown the capability to manage some typographical errors, as demonstrated in Figure 14b. The dataset used for training and evaluation contained only 160 legal amount images, which is relatively small for deep learning models. Training on larger and more diverse datasets would likely improve the model’s accuracy and robustness.
Additionally, the model was only evaluated on recognizing Arabic literal amounts, and its capabilities for other Arabic handwriting tasks remain unknown. Testing the model on additional datasets and tasks, such as recognizing courtesy amounts and free-form handwriting, would provide insight into its versatility. Integrating natural language processing techniques to recognize numerical values from words could expand the model’s capabilities. Furthermore, optimizing the model architecture and hyperparameter tuning could improve efficiency for real-time applications. Addressing these limitations presents worthwhile avenues for future research. With larger datasets, evaluation across tasks, integration of contextual understanding, and optimization, the proposed model can potentially generalize better and pave the way for production-ready Arabic handwriting recognition.

5. Conclusions

This paper introduced a novel deep learning framework for the end-to-end recognition of handwritten Arabic legal amounts on bank cheques, overcoming the limitations of traditional segmentation-dependent OCR techniques. The proposed pipeline integrates three main stages. First, a YOLOv5-based word detection stage accurately detects and extracts handwritten words, achieving over 99% accuracy. Second, a hybrid CNN–ViT model recognizes the extracted words by combining the strengths of CNNs and Vision Transformers, achieving over 99% accuracy. Third, the LegalToCourtesy algorithm converts the recognized words into their corresponding numerical courtesy amounts, accounting for the right-to-left structure of Arabic text. End-to-end testing demonstrated 90% accuracy in extracting handwritten Arabic legal amounts and correctly converting them into courtesy amounts, without requiring complex preprocessing or segmentation. The robust and accurate performance highlights the effectiveness of leveraging deep transfer learning for Arabic handwriting recognition tasks. This work advances the field of Arabic OCR and document image analysis, presenting new techniques for legal amount recognition in Arabic script. The high accuracy and practical implications suggest a significant potential for integration into real-world systems, particularly in banking and financial services for automating cheque processing and financial workflows.

Author Contributions

H.A.A.: conceptualization; data curation; software; writing—original draft. A.A.: validation; resources; investigation; methodology. M.A.A.-A.: conceptualization; supervision; validation; writing—review and editing; project administration; funding acquisition. R.R.M.: formal analysis; resources; visualization. M.T.: validation; resources; visualization; investigation. S.B.: data curation; conceptualization; supervision. Y.H.G.: formal analysis; resources; validation; writing—review and editing; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II210755, Dark data analysis technology for data scale and accuracy improvement).

Data Availability Statement

This study uses our AHLA dataset and the accompanying source code, which are available at the following URLs: the Arabic Handwritten Legal Amount (AHLA) dataset at https://doi.org/10.5281/zenodo.10845222 (accessed on 2 July 2024), and the source code of this implementation at https://github.com/Hakim-Abdo/ArabicHandwrittenLegalAmountToCourtesyAmount.git (accessed on 2 July 2024).

Acknowledgments

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II210755, Dark data analysis technology for data scale and accuracy improvement), and by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (No. RS-2022-00166402 and RS-2023-00256517).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence
YOLO: You Only Look Once
CNN: Convolutional Neural Network
ViT: Vision Transformer
OCR: Optical Character Recognition
TL: Transfer Learning
RE: Recall
ROC: Receiver Operating Characteristic
PR: Precision–Recall
AUC: Area Under the Curve
ACC: Accuracy
PRE: Precision
SEN: Sensitivity
CSP: Cross-stage partial network
SPP: Spatial Pyramid Pooling
Conv: Convolutional layer

References

  1. Al-Muhtaseb, H.A.; Mahmoud, S.A.; Qahwaji, R.S. Recognition of off-line printed Arabic text using Hidden Markov Models. Signal Process. 2008, 88, 2902–2912. [Google Scholar] [CrossRef]
  2. Tanvir Parvez, M.; Mahmoud, S.A. Arabic handwriting recognition using structural and syntactic pattern attributes. Pattern Recognit. 2013, 46, 141–154. [Google Scholar] [CrossRef]
  3. Suen, C.; Kharma, N.; Cheriet, M.; Liu, C.-L. Character Recognition Systems: A Guide for Students and Practitioners; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007; p. 326. [Google Scholar]
  4. Al-homed, L.S.; Jambi, K.M.; Al-Barhamtoshy, H.M. A Deep Learning Approach for Arabic Manuscripts Classification. Sensors 2023, 23, 8133. [Google Scholar] [CrossRef] [PubMed]
  5. Djaghbellou, S.; Bouziane, A.; Attia, A.; Akhtar, Z. A Survey on Arabic Handwritten Script Recognition Systems. Int. J. Artif. Intell. Mach. Learn. 2021, 11, 1–17. [Google Scholar] [CrossRef]
  6. Lawgali, A. A Survey on Arabic Character Recognition. Int. J. Signal Process. 2015, 8, 401–426. [Google Scholar] [CrossRef]
  7. Khayyat, M.; Lam, L.; Suen, C.Y. Learning-based word spotting system for Arabic handwritten documents. Pattern Recognit. 2014, 47, 1021–1030. [Google Scholar] [CrossRef]
  8. Slimane, F.; Ingold, R.; Kanoun, S.; Alimi, A.M.; Hennebert, J. A new Arabic printed text image database and evaluation protocols. In Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 946–950. [Google Scholar] [CrossRef]
  9. Shiu, C.W.; Chen, J.; Chen, Y.C. Low-Cost Online Handwritten Symbol Recognition System in Virtual Reality Environment of Head-Mounted Display. Mathematics 2020, 8, 1967. [Google Scholar] [CrossRef]
  10. Baek, S.B.; Shon, J.G.; Park, J.S. CAC: A Learning Context Recognition Model Based on AI for Handwritten Mathematical Symbols in e-Learning Systems. Mathematics 2022, 10, 1277. [Google Scholar] [CrossRef]
  11. Mezghani, N.; Mitiche, A.; Cheriet, M. On-line recognition of handwritten Arabic characters using a Kohonen neural network. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, ON, Canada, 6–8 August 2002; pp. 490–495. [Google Scholar] [CrossRef]
  12. Safabakhsh, R.; Adibi, P. Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM. Arab. J. Sci. Eng. 2005, 30, 95–118. [Google Scholar]
  13. Farooq, F.; Govindaraju, V.; Perrone, M. Pre-processing methods for handwritten Arabic documents. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Republic of Korea, 31 August–1 September 2005; Volume 2005, pp. 267–271. [Google Scholar] [CrossRef]
  14. Simultaneous Segmentation and Recognition of Arabic Characters in an Unconstrained On-Line Cursive Handwritten Document. Available online: https://www.researchgate.net/publication/242308716_Simultaneous_Segmentation_and_Recognition_of_Arabic_Characters_in_an_Unconstrained_On-Line_Cursive_Handwritten_Document (accessed on 10 November 2023).
  15. Parvez, M.T.; Mahmoud, S.A. Offline Arabic handwritten text recognition: A Survey. ACM Comput. Surv. 2013, 45, 1–35. [Google Scholar] [CrossRef]
  16. Abdo, H.A.; Abdu, A.; Manza, R.R.; Bawiskar, S. An approach to analysis of Arabic text documents into text lines, words, and characters. Indones. J. Electr. Eng. Comput. Sci. 2022, 26, 754–763. [Google Scholar] [CrossRef]
  17. Alma’adeed, S.; Higgens, C.; Elliman, D. Recognition of off-line handwritten Arabic words using Hidden Markov Model approach. In Proceedings of the 2002 International Conference on Pattern Recognition, Quebec City, QC, Canada, 11–15 August 2002; Volume 16, pp. 481–484. [Google Scholar] [CrossRef]
  18. Graves, A.; Schmidhuber, J. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 297–313. [Google Scholar] [CrossRef]
  19. Bluche, T.; Ney, H.; Kermorvant, C. Feature extraction with convolutional neural networks for handwritten word recognition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 285–289. [Google Scholar] [CrossRef]
  20. Krishnan, P.; Dutta, K.; Jawahar, C.V. Word spotting and recognition using deep embedding. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 1–6. [Google Scholar] [CrossRef]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  23. Nazir, A.; Cheema, M.N.; Sheng, B.; Li, P.; Li, H.; Xue, G.; Qin, J.; Kim, J.; Feng, D.D. ECSU-Net: An Embedded Clustering Sliced U-Net Coupled with Fusing Strategy for Efficient Intervertebral Disc Segmentation and Classification. IEEE Trans. Image Process. 2022, 31, 880–893. [Google Scholar] [CrossRef] [PubMed]
  24. Abdu, A.; Zhai, Z.; Abdo, H.A.; Algabri, R. Software Defect Prediction Based on Deep Representation Learning of Source Code From Contextual Syntax and Semantic Graph. IEEE Trans. Reliab. 2024, 73, 820–834. [Google Scholar] [CrossRef]
  25. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  26. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  28. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 2014, 4, 3320–3328. [Google Scholar]
  29. Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A Survey on Deep Transfer Learning. In Proceedings of the 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Volume 11141, pp. 270–279. [Google Scholar] [CrossRef]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  31. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  32. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. Proc. Mach. Learn. Res. 2020, 139, 10347–10357. [Google Scholar]
  33. Wu, F.; Wang, J.; Liu, J.; Wang, W. Vulnerability detection with deep learning. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 1298–1302. [Google Scholar] [CrossRef]
  34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  36. Fu, P.; Zhang, X.; Yang, H. Answer sheet layout analysis based on YOLOv5s-DC and MSER. Vis. Comput. 2023, 1–12. [Google Scholar] [CrossRef]
  37. Al-ohali, Y.; Cheriet, M.; Suen, C. Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 2003, 36, 111–121. [Google Scholar] [CrossRef]
  38. Souici-Meslati, L.; Sellami, M. A hybrid approach for arabic literal amounts recognition. Arab. J. Sci. Eng. 2004, 29, 177–194. [Google Scholar]
  39. Farah, N.; Souici, L.; Sellami, M. Classifiers combination and syntax analysis for Arabic literal amount recognition. Eng. Appl. Artif. Intell. 2006, 19, 29–39. [Google Scholar] [CrossRef]
  40. Farah, N.M.; Sellami, M.A. Fuzzy nearest neighbor system: An application to the recognition of handwritten Arabic literal amounts. Jordan J. Appl. Sci.-Nat. Sci. 2005, 7, 48–55. [Google Scholar]
  41. Al-Ma’adeed, S.; Elliman, D.; Higgins, C.A. A data base for Arabic handwritten text recognition research. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, ON, Canada, 6–8 August 2002; Volume 1, pp. 485–489. [Google Scholar] [CrossRef]
  42. Louloudis, G.; Gatos, B.; Pratikakis, I.; Halatsis, C. Text line and word segmentation of handwritten documents. Pattern Recognit. 2009, 42, 3169–3183. [Google Scholar] [CrossRef]
  43. Aouadi, N.; Echi, A.K. Word Extraction and Recognition in Arabic Handwritten Text. Int. J. Comput. Inf. Sci. 2016, 12, 17–23. [Google Scholar] [CrossRef]
  44. Elzobi, M.; Al-Hamadi, A.; Al Aghbari, Z. Off-line handwritten arabic words segmentation based on structural features and connected components analysis. In Proceedings of the 19th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic, 31 January–3 February 2011; pp. 135–142. [Google Scholar]
  45. AlKhateeb, J.H.; Jiang, J.; Ren, J.; Ipso, S. Interactive Knowledge Discovery for Baseline Estimation and Word Segmentation in Handwritten Arabic Text. In Recent Advances in Technologies; Intechopen: London, UK, 2009. [Google Scholar] [CrossRef]
  46. Papavassiliou, V.; Stafylakis, T.; Katsouros, V.; Carayannis, G. Handwritten document image segmentation into text lines and words. Pattern Recognit. 2010, 43, 369–377. [Google Scholar] [CrossRef]
  47. Al-dmour, A.; Fraij, F. Segmenting Arabic Handwritten Documents into Text lines and Words. Int. J. Adv. Comput. Technol. 2014, 6, 109–119. [Google Scholar]
  48. Al-Dmour, A.; Zitar, R.A. Word extraction from arabic handwritten documents based on statistical measures. Int. Rev. Comput. Softw. 2016, 11, 436–444. [Google Scholar] [CrossRef]
  49. Neche, C.; Belaïd, A.; Kacem-Echi, A. Arabic handwritten documents segmentation into text-lines and words using deep learning. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 22–25 September 2019; Volume 6, pp. 19–24. [Google Scholar] [CrossRef]
  50. Mahmoud, S.A.; Ahmad, I.; Al-Khatib, W.G.; Alshayeb, M.; Tanvir Parvez, M.; Märgner, V.; Fink, G.A. KHATT: An open Arabic offline handwritten text database. Pattern Recognit. 2014, 47, 1096–1112. [Google Scholar] [CrossRef]
  51. Gader, T.B.A.; Echi, A.K. Attention-based CNN-ConvLSTM for Handwritten Arabic Word Extraction. Electron. Lett. Comput. Vis. Image Anal. 2022, 21, 121–134. [Google Scholar] [CrossRef]
  52. Saidi, A.; Lakhdar, A.M.; Beladgham, M. Recognition of Offline Handwritten Arabic Words Using a Few Structural Features. Comput. Mater. Contin. 2021, 66, 2875–2889. [Google Scholar] [CrossRef]
  53. Hassen, H.; Al-Maadeed, S. Arabic handwriting recognition using sequential minimal optimization. In Proceedings of the 1st IEEE International Workshop on Arabic Script Analysis and Recognition, ASAR, Nancy, France, 3–5 April 2017. [Google Scholar]
  54. Al-Nuzaili, Q.; Al-Maadeed, S.; Hassen, H.; Hamdi, A. Arabic Bank Cheque Words Recognition Using Gabor Features. In Proceedings of the 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), London, UK, 12–14 March 2018; pp. 84–89. [Google Scholar] [CrossRef]
  55. Altwaijry, N.; Al-Turaiki, I. Arabic handwriting recognition system using convolutional neural network. Neural Comput. Appl. 2021, 33, 2249–2261. [Google Scholar] [CrossRef]
  56. Maalej, R.; Kherallah, M. Convolutional Neural Network and BLSTM for Offline Arabic Handwriting Recognition. In Proceedings of the ACIT 2018—19th International Arab Conference on Information Technology, Werdanye, Lebanon, 28–30 November 2018. [Google Scholar]
  57. Elleuch, M.; Maalej, R.; Kherallah, M. A New design based-SVM of the CNN classifier architecture with dropout for offline Arabic handwritten recognition. In Proceedings of the Procedia Computer Science, New York, NY, USA, 16–19 July 2016; Volume 80. [Google Scholar]
  58. El-Melegy, M.; Abdelbaset, A.; Abdel-Hakim, A.; El-Sayed, G. Recognition of Arabic Handwritten Literal Amounts Using Deep Convolutional Neural Networks. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Madrid, Spain, 1–4 July 2019; Volume 11868, pp. 169–176. [Google Scholar] [CrossRef]
  59. Jocher, G.; Stoken, A.; Borovec, J.; NanoCode012; ChristopherSTAN; Changyu, L.; Laughing; Hogan, A.; lorenzomammana; tkianai; et al. ultralytics/yolov5: v3.0; Zenodo: Genève, Switzerland, 2020. [Google Scholar] [CrossRef]
  60. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  61. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 10691–10700. [Google Scholar]
  62. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI’17: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 11–24. [Google Scholar] [CrossRef]
  63. Jamal, A.T.; Nobile, N.; Suen, C.Y. End-shape recognition for arabic handwritten text segmentation. In Proceedings of the IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Montreal, QC, Canada, 6–8 October 2014; Volume 8774, pp. 228–239. [Google Scholar] [CrossRef]
  64. Lamsaf, A.; Aitkerroum, M.; Boulaknadel, S.; Fakhri, Y. Text Line and Word Extraction of Arabic Handwritten Documents. In Innovations in Smart Cities Applications Edition 2; Ben Ahmed, M., Boudhir, A.A., Younes, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 492–503. [Google Scholar]
  65. Al-Nuzaili, Q.; Hamdi, A.; Hashim, S.Z.M.; Saeed, F.; Khalil, M.S. An enhanced quadratic angular feature extraction model for arabic handwritten literal amount recognition. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Berlin/Heidelberg, Germany, 2018; Volume 5, pp. 369–377. [Google Scholar] [CrossRef]
  66. Korichi, A.; Slatnia, S.; Tagougui, N.; Zouari, R.; Kherallah, M.; Aiadi, O. Recognizing Arabic Handwritten Literal Amount Using Convolutional Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and its Applications, El-Oued, Algeria, 28–30 September 2021; Volume 413, pp. 153–165. [Google Scholar] [CrossRef]
Figure 1. Proposed legal amount recognition end-to-end framework. The English explanation is provided specifically for non-Arabic speakers.
Figure 2. Samples of legal amount sentences. The English explanation is provided specifically for non-Arabic speakers.
Figure 3. YOLOv5 model structure for Arabic handwritten word detection. The English explanation is provided specifically for non-Arabic speakers.
Figure 4. The proposed hybrid classification pipeline for Arabic handwritten word recognition. The English explanation is provided specifically for non-Arabic speakers.
Figure 5. Sample of a legal amount image output from the word detection phase. The English explanation is provided specifically for non-Arabic speakers.
Figure 6. Sample of applying the LegalToCourtesy algorithm to calculate the courtesy amount value. The English explanation is provided specifically for non-Arabic speakers.
Figure 7. Training and validation convergence in terms of the loss function for the YOLOv5s-based word extraction model.
Figure 8. Prediction performance of the YOLOv5s-based word extraction model evaluated during training.
Figure 9. Confusion matrix of the YOLOv5-based word extraction model.
Figure 10. Sample of Arabic legal amount word detection results. The English explanation is provided specifically for non-Arabic speakers.
Figure 11. Performance assessment using the confusion matrix for the Hybrid A model.
Figure 12. Performance assessment using the confusion matrix for the Hybrid B model.
Figure 13. Samples of the proposed method's results with correctly generated courtesy amounts. The English explanation is provided specifically for non-Arabic speakers.
Figure 14. Samples of the proposed approach's ability to detect and classify in some complex cases: (a) word detection with overlapping letters and (b) classification of a word containing a spelling mistake. The English explanation is provided specifically for non-Arabic speakers.
Figure 15. Samples of improperly generated courtesy amounts: (a) incorrect word recognition and (b) inaccurate word detection.
Table 1. Arabic word-level image dataset in terms of 33 legal amount vocabulary classes. The Arabic vocabulary column consists of handwritten word images (i001–i033 in the original table). The English explanation is provided specifically for non-Arabic speakers.
Class # | English Meaning | Arabic Vocabulary | Class # | English Meaning | Arabic Vocabulary
1 | Eight | [image i001] | 18 | Seven | [image i002]
2 | Eight Hundred | [image i003] | 19 | Seven Hundred | [image i004]
3 | Eighty | [image i005] | 20 | Seventy | [image i006]
4 | Fifty | [image i007] | 21 | Six | [image i008]
5 | Five | [image i009] | 22 | Six Hundred | [image i010]
6 | Five Hundred | [image i011] | 23 | Sixty | [image i012]
7 | Forty | [image i013] | 24 | Ten | [image i014]
8 | Four | [image i015] | 25 | Thirty | [image i016]
9 | Four Hundred | [image i017] | 26 | Thousand | [image i018]
10 | Hundred | [image i019] | 27 | Three | [image i020]
11 | Million | [image i021] | 28 | Three Hundred | [image i022]
12 | Nine | [image i023] | 29 | Twenty | [image i024]
13 | Nine Hundred | [image i025] | 30 | Two | [image i026]
14 | Ninety | [image i027] | 31 | Two Hundred | [image i028]
15 | One | [image i029] | 32 | Two Million | [image i030]
16 | Only | [image i031] | 33 | Two Thousand | [image i032]
17 | Reyal | [image i033] | | |
Table 2. Distribution of dataset splitting for legal amount sentences and separated word images.
 | Training Set (70%) | Validation Set (20%) | Testing Set (10%) | Total
Original dataset | 369 | 122 | 75 | 566
Data augmentation | 1107 | 122 | 75 | 1304
Table 3. Distribution of word images for legal amounts by class.
Class # | Class Label | Training Set (70%) | Validation Set (20%) | Testing Set (10%) | Total
0 | Eight | 228 | 65 | 33 | 326
1 | Eight Hundred | 106 | 30 | 16 | 152
2 | Eighty | 215 | 61 | 32 | 308
3 | Fifty | 280 | 80 | 41 | 401
4 | Five | 268 | 76 | 40 | 384
5 | Five Hundred | 144 | 41 | 22 | 207
6 | Forty | 224 | 64 | 33 | 321
7 | Four | 226 | 64 | 33 | 323
8 | Four Hundred | 133 | 37 | 20 | 190
9 | Hundred | 261 | 74 | 38 | 373
10 | Million | 394 | 112 | 57 | 563
11 | Nine | 232 | 66 | 34 | 332
12 | Nine Hundred | 113 | 32 | 17 | 162
13 | Ninety | 219 | 62 | 32 | 313
14 | One | 226 | 64 | 33 | 323
15 | Only | 120 | 34 | 18 | 172
16 | Reyal | 270 | 77 | 39 | 386
17 | Seven | 244 | 70 | 36 | 350
18 | Seven Hundred | 118 | 34 | 18 | 170
19 | Seventy | 238 | 68 | 35 | 341
20 | Six | 230 | 66 | 34 | 330
21 | Six Hundred | 141 | 40 | 19 | 200
22 | Sixty | 235 | 67 | 35 | 337
23 | Ten | 226 | 64 | 34 | 324
24 | Thirty | 228 | 65 | 33 | 326
25 | Thousand | 394 | 112 | 58 | 564
26 | Three | 241 | 69 | 35 | 345
27 | Three Hundred | 148 | 42 | 22 | 212
28 | Twenty | 277 | 79 | 40 | 396
29 | Two | 443 | 126 | 65 | 634
30 | Two Hundred | 223 | 63 | 33 | 319
31 | Two Million | 203 | 58 | 29 | 290
32 | Two Thousand | 201 | 57 | 30 | 288
Total | | 7449 | 2119 | 1094 | 10,662
Table 4. YOLOv5s model assessment results for word detection in terms of PRE, RE, mAP50, mAP50–95, and inference time.
Images | Instances (Objects) | PRE | RE | mAP50 | mAP50–95 | Inference Time (ms)
75 | 1274 | 0.958 | 0.986 | 0.967 | 0.625 | 1.5 per image
Table 5. Classification assessment results of the hybrid deep learning models in terms of ACC, SEN, PRE, F1-Score, and AUC.
AI Hybrid Model | ACC | SEN | PRE | F1-Score | AUC
Hybrid A: Ensemble CNNs + ViT (parallel hybrid) | 0.9902 | 0.9817 | 0.9820 | 0.9817 | 0.9989
Hybrid B: Ensemble CNNs + ViT (serial hybrid) | 0.9753 | 0.9753 | 0.9762 | 0.9752 | 0.9983
Table 6. Classification assessment results of the hybrid models in terms of Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves.
AI Hybrid Model | ROC Macro-Average Area | ROC Micro-Average Area | PR Macro-Average Area | PR Micro-Average Area
Hybrid A: Ensemble CNNs + ViT (parallel hybrid) | 0.9989 | 0.9999 | 0.9928 | 0.9916
Hybrid B: Ensemble CNNs + ViT (serial hybrid) | 0.9980 | 0.9999 | 0.9916 | 0.9896
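For completeness, macro- and micro-averaged ROC and PR areas of the kind reported in Table 6 can be computed with standard tooling. The sketch below is illustrative only: it uses randomly generated placeholder labels and class probabilities instead of the actual hybrid model outputs, with 33 classes mirroring the AHLA label space.

```python
# Hedged sketch (placeholder data, not the paper's evaluation code):
# macro-/micro-averaged ROC-AUC and PR areas for a multi-class classifier.
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score, average_precision_score

n_classes = 33                                        # AHLA word classes
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, size=500)         # placeholder ground truth
y_prob = rng.dirichlet(np.ones(n_classes), size=500)  # placeholder class probabilities

y_onehot = label_binarize(y_true, classes=np.arange(n_classes))

roc_macro = roc_auc_score(y_onehot, y_prob, average="macro")
roc_micro = roc_auc_score(y_onehot, y_prob, average="micro")
pr_macro = average_precision_score(y_onehot, y_prob, average="macro")
pr_micro = average_precision_score(y_onehot, y_prob, average="micro")

print(roc_macro, roc_micro, pr_macro, pr_micro)
```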
Table 7. Comparison of the proposed word extraction method with existing Arabic word extraction methods.
Reference | Method | Dataset | Performance Accuracy (%)
Jamal (2014) [63] | Metric-based segmentation + ESL-based segmentation | IFN/ENIT | 93.135
Lamsaf (2019) [64] | Threshold of distances between connected components | AHDB | Word extraction rate = 87.9
Al-dmour (2014) [47] | Gap metric between connected components with the fuzzy c-means clustering algorithm | AHDB | Word extraction rate = 84.8
Neche (2019) [49] | Deep learning: Convolutional Neural Network + Bidirectional Long Short-Term Memory + Connectionist Temporal Classification | KHATT | 80.1
Gader (2022) [51] | Deep learning: Convolutional Neural Network + Attention + Convolutional Long Short-Term Memory + Connectionist Temporal Classification | AHDB; KHATT; IFN/ENIT | 92.8; 91.7; 94.1
Proposed word extraction method (2024) | YOLOv5-based word extraction | Own AHLA dataset | PRE = 96.3; SEN = 96.6; mAP = 98.6
Table 8. Comparison of the proposed word recognition method with existing Arabic word recognition methods.
Study | Method | Dataset | Performance Accuracy (%)
Al-Nuzaili (2018) [65] | Perceptual and Quadratic Angular features with an Extreme Learning Machine classifier | AHDB | 83.06
Hassen (2017) [53] | Multi-statistical features with a Sequential Minimal Optimization classifier | AHDB | 91.59
Al-Nuzaili (2018) [54] | Gabor features with Extreme Learning Machine (ELM) and Sequential Minimal Optimization (SMO) classifiers | AHDB; CENPARMI | 72.79 (ELM), 89.29 (SMO); 80.86 (ELM), 86.72 (SMO)
Aouadi (2016) [43] | Significant structural features with a Markovian classifier | IFN-ENIT | 87.0
El-Melegy (2019) [58] | CNN | AHDB | 97.85
Korichi (2022) [66] | CNN | AHDB | 98.50
Proposed word recognition method (2024) | Hybrid A: Ensemble CNNs + ViT | Own AHLA dataset | ACC = 99.02; SEN = 98.17; PRE = 98.20; F1-Score = 98.17; AUC = 99.89
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
