Article

End-to-End Deep Learning Framework for Arabic Handwritten Legal Amount Recognition and Digital Courtesy Conversion

1 Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Chhatrapati Sambhajinagar 431004, India
2 Department of Computer Science, Hodeidah University, Al-Hudaydah P.O. Box 3114, Yemen
3 Department of Software Engineering, Northwestern Polytechnical University, Xi’an 710072, China
4 Department of Artificial Intelligence and Data Science, College of AI Convergence, Daeyang AI Center, Sejong University, Seoul 05006, Republic of Korea
5 Department of Computer Science & Engineering, University of North Texas, Denton, TX 76205, USA
6 Department of Digital and Cyber Forensics, Government Institute of Forensic Science, Chhatrapati Sambhajinagar 431004, India
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(14), 2256; https://doi.org/10.3390/math12142256
Submission received: 11 June 2024 / Revised: 2 July 2024 / Accepted: 10 July 2024 / Published: 19 July 2024

Abstract

Arabic handwriting recognition and conversion are crucial for financial operations, particularly for processing handwritten amounts on cheques and financial documents. Research in this area remains relatively limited for Arabic compared with other languages. This study introduces an innovative AI-driven method for simultaneously recognizing and converting Arabic handwritten legal amounts into numerical courtesy forms. The framework consists of four key stages. First, a new dataset of Arabic legal amounts in handwritten form (“.png” image format) is collected and labeled by native speakers. Second, a YOLO-based AI detector extracts individual legal amount words from the entire input sentence images. Third, a robust hybrid classification model is developed, combining ensemble Convolutional Neural Networks (CNNs) with a Vision Transformer (ViT) to improve the prediction accuracy of single Arabic words. Finally, a novel conversion algorithm transforms the predicted Arabic legal amounts into digital courtesy forms. The framework’s performance is fine-tuned and assessed using 5-fold cross-validation tests on the proposed novel dataset, achieving a word level detection accuracy of 98.6% and a recognition accuracy of 99.02% at the classification stage. The conversion process yields an overall accuracy of 90%, with an inference time of 4.5 s per sentence image. These results demonstrate promising potential for practical implementation in diverse Arabic financial systems.

1. Introduction

The Arabic script is one of the most widely used writing systems, employed by over 300 million people worldwide [1]. However, the automatic recognition of Arabic handwriting remains an extremely challenging task. This difficulty stems from the rich complexity of Arabic script, including its cursive style, context-sensitive shapes, and numerous ligatures and overlaps [2,3]. Handwritten Arabic literal amounts are ubiquitous across financial documents and forms. Digitized Arabic documents could facilitate streamlined back-end data processing, improving efficiency and accuracy over manual entry [4]. For banking and finance, the fast digital conversion of deposit slips, cheques, and invoices would speed up transactions and reduce costs [5,6]. Converting handwritten records into searchable formats can also expand access to historical archives, manuscripts, and other cultural materials. Individuals with visual impairments or limited literacy skills further stand to benefit from technology translating pen strokes into text.

Recognizing literal amounts written in Arabic poses unique difficulties compared to other languages [7]. The optical character recognition (OCR) of handwritten Arabic amounts must contend with intricacies like overlapping characters, variant glyph shapes, fragmented characters, and diverse writing styles [7,8,9,10]. The cursive nature of Arabic script leads to heavy merging between characters, making individual glyph segmentation difficult [2,11]. Inconsistent handwriting and sloped alignment further complicate accurate recognition [7]. Prior approaches to recognizing Arabic handwritten legal amounts have focused on segmented words or sub-words [12,13,14,15,16]. However, the accurate segmentation of heavily slanted and merged characters remains challenging. Errors propagate through the recognition pipeline, severely limiting overall accuracy. Preprocessing and careful feature engineering are often needed to extract salient characteristics for classification [17]. The complexity of these traditional OCR pipelines makes end-to-end training difficult.

Recent advances in deep learning provide new opportunities for developing end-to-end Arabic handwriting recognition without explicit segmentation [18,19,20]. Deep neural networks can learn robust feature representations directly from raw image data. Convolutional Neural Networks (CNNs) have shown remarkable success in image and text classification tasks [21,22,23,24]. CNNs contain repeated convolutional layers that act as trainable feature extractors. Stacked convolutional layers enable the learning of hierarchical representations, from low-level edges to high-level semantic features [25]. CNNs require minimal preprocessing and can automatically learn discriminative visual characteristics compared to handcrafted features. Sophisticated CNN architectures, like Residual Networks [22] and DenseNets [26], have achieved state-of-the-art results by increasing depth and feature reuse. Transfer learning allows for the leveraging of knowledge from large CNNs pre-trained on datasets like ImageNet [27]. Fine-tuning on target data adapts the models to new tasks [28,29]. Recently, Vision Transformers (ViTs) have emerged as an alternative to CNNs [30,31]. ViTs rely entirely on self-attention to model long-range dependencies in images. By capturing global context, ViTs can complement CNNs’ strengths in local feature extraction [32]. Hybrid CNN–ViT models have proven very effective on image recognition benchmarks.
Object detection is a key component of many computer vision systems [33]. Deep learning methods like YOLO [34] and Faster R-CNN [35] have driven rapid advances in object detection accuracy and speed. YOLO divides images into grids and predicts bounding boxes and class probabilities directly from full images in one pass. This enables real-time processing while maintaining high accuracy. Faster R-CNN introduces a region proposal network to generate candidate object regions, improving on prior R-CNN models. These methods are used in various application domains, e.g., healthcare, autonomous vehicles, face recognition, and extracting fields from printed bank cheques; YOLO also provides highly performant frameworks for detecting objects such as text and handwriting in answer sheet images [36].

This work proposes an end-to-end framework for Arabic handwritten amount recognition using deep transfer learning. To the best of our knowledge, this represents the first approach targeting the end-to-end recognition of complete legal amounts in Arabic bank cheques. The framework integrates state-of-the-art object detection and image classification models—YOLOv5 for locating amount words and a hybrid CNN–ViT model for word recognition. Rather than segmenting words into individual glyphs, the models recognize complete amounts in a single forward pass. Fine-tuning on a dataset of Arabic amounts allows the models to learn specialized visual features for this task. By recognizing full amounts holistically, the approach aims to overcome the cascading errors of segmentation-based OCR. The proposed system requires no complex preprocessing, segmentation, or feature engineering. It can directly recognize amounts from raw cheque images after a simple resizing step. The major contributions of this work are briefly summarized here.
  • A novel end-to-end AI-based framework is proposed for the simultaneous word level detection and classification of Arabic handwritten amounts. To the best of our knowledge, this is the first method that aims to recognize full legal amounts in Arabic cheques end-to-end, overcoming the limitations of segmentation-based OCR methods.
  • A new dataset of Arabic legal amount handwriting images has been collected and annotated by native experts.
  • A hybrid AI classification model is introduced, which combines the local feature extraction of an ensemble of CNNs with the global feature extraction of a ViT. This combination of complementary AI networks enhances the extracted knowledge, resulting in a superior recognition rate.
  • A novel algorithm is proposed for converting a series of recognized Arabic handwritten words into numerical amounts for legal purposes.
The rest of this article is organized as follows: Section 2 reviews the relevant and recent AI-based Arabic literature studies. Section 3 explains the proposed methodology and describes the datasets, training process, and experiments. Section 4 presents the results and discussion. Section 5 concludes the paper.

2. Related Work

The recognition of Arabic legal amounts has been a topic of interest for several researchers, who have employed diverse methods to tackle the issue; however, these researchers have only applied their techniques to segmented words within the legal amount rather than to the whole legal amount field. In this section, we review the existing studies in the following three categories: Arabic handwritten legal amount datasets, Arabic handwriting word extraction, and the recognition of segmented legal amount words.

2.1. Arabic Handwritten Legal Amount Datasets

Access to a comprehensive Arabic bank cheque database is important for researchers engaged in Arabic cheque processing. This necessity extends beyond mere research, as such databases play a vital role in comparing and evaluating competing techniques. Our examination of existing databases related to Arabic bank cheque processing reveals that only one database is sourced from actual Arabic bank cheques, and it is available for acquisition at a cost [37]. Its authors presented the database for recognizing handwritten Arabic legal and courtesy amounts on cheques. Collecting the database involved collaboration with the Al Rajhi Banking and Investment Corporation to gather real-world gray-level cheque images. The work involved gathering real-life data, segmentation, binarization, tagging, and the validation of the tagging process. The resulting databases include Arabic cheques, legal amounts, sub-words, and courtesy amounts written in Indian digits. A limitation of the collected database is the uneven distribution of sub-word classes, which may cause training problems, particularly for rarely used classes; the training and testing of such classes become more difficult and may impose restrictions on the type of classifiers that can be used. In [38], Meslati et al. used 576 words for training their hybrid neuro-symbolic classifier. These words were taken from a vocabulary of 48 words, and four writers wrote each word three times. The authors also tested their classifier on 532 words, where three writers wrote the 48 words of the vocabulary thrice.
Additionally, they performed experiments with independent training and testing databases. The first database contained 480 words, where 10 writers wrote the 48 words, and the second database contained 1200 words, where 25 writers wrote the 48 words. The authors do not provide any information about the identity of the writers or their characteristics, and the dataset they used is not available online. In [39,40], the authors introduced a database comprising 4800 words corresponding to 48 frequently used Arabic legal terms; this database was built with the participation of 100 distinct writers for each word, and it is not available online. Al-Ma’adeed [41] introduced the AHDB (Arabic Handwriting Database), a comprehensive dataset encompassing Arabic words and texts written by one hundred individual writers. A subset of this database comprises handwritten representations of numerical values and quantities typically found on cheques. In total, the AHDB dataset encompasses 4970 distinct words corresponding to 50 distinct legal amounts, each written by 100 different writers; the legal amount sentence part is not available online.

2.2. Arabic Words Extraction from Whole Cheque Images

Much effort has been put into studying the extraction of Arabic handwriting. The commonly adopted methods for word extraction are mainly bottom-up approaches, utilizing connected component analysis [42], structural feature extraction [43], or a combination of both [44]. In [45], AlKhateeb et al. proposed a component-based approach that analyzed the connected components based on baseline information, resulting in a correct extraction rate of 85%. In [46], an SVM-based gap metric was proposed to segment text at the line level using a threshold to classify gaps as either “within” or “between” words. The method was tested on the ICDAR 2007 Handwriting Segmentation Contest datasets and achieved an F-measure of 93%. In [47], a gap metric method was introduced, employing a clustering algorithm to determine segmentation thresholds, distinguishing between “within the word” and “between words” gaps. Testing on the AHDB dataset [41] resulted in a correct extraction rate of 84.8%. In [48], Al-Dmour et al. proposed a method that used two spatial measures—connected component length and the gaps between them. The connected component lengths were clustered to differentiate between letters, sub-words, and words using the Self-Organizing Map (SOM) algorithm.
Additionally, gaps were clustered into two groups to indicate whether the gap occurred between words or within a word. The method was tested on the AHDB dataset [41] and achieved a correct extraction rate of 86.3%. However, most of these methods rely on gap classification, and the spaces between words cannot be used as a reliable characteristic for word extraction due to varying writing styles. In [43], N. Aouadi et al. proposed an automatic system for Arabic handwritten word extraction and recognition, which achieved an average recognition rate of 87% on historical handwritten documents. In [49], a CNN-BLSTM-CTC neural network architecture was used for word extraction. The CNN extracted features from the input handwritten text images, which were then fed into a BLSTM network, followed by a CTC layer to map the text-line images to their transcriptions. The method was tested on the KHATT Arabic database [50] and achieved an extraction success rate of 80.1%. T. Ben et al. [51] presented an innovative approach for text recognition using an Attention-based CNN-ConvLSTM model followed by a Connectionist Temporal Classification (CTC) function. An Attention-based Convolutional Neural Network (CNN) processed the text-line image inputs to extract key features. These features and the text-line transcriptions were then fed into a Convolutional Long Short-term Memory (ConvLSTM) network to establish a mapping between them. The CTC function was applied to learn the alignment between the text-line images and their corresponding transcriptions. The proposed model was evaluated on the KHATT dataset and achieved an extraction rate of 91.7%. The model also achieved a 92.8% extraction rate on the AHDB database and a 94.1% extraction rate on the IFN/ENIT database.

2.3. Arabic Words Recognition

In general, several studies used various methods for Arabic handwriting recognition, in particular for Arabic legal amount word recognition. Most researchers have focused on using handcrafted features in Arabic handwritten word recognition. A system for recognizing Arabic handwriting words was presented in [52]. This system utilizes several crucial structural features, including sub-words, diacritics, loops, ascenders, and descenders, which are extracted from the word skeleton and are then input into a neural network. The performance of the proposed system was evaluated using a set of images from the IFN/ENIT benchmark database. In [43], a handcrafted approach was employed to extract several significant structural features such as loops, stems, legs, and diacritics, taking into account their position within the word, such as the beginning, middle, or end, and their location within the upper, central, or lower bands. These extracted features were then subjected to a Markovian classifier. The performance of the model was evaluated using the IFN-ENIT dataset. Practically, several studies have been carried out on Arabic legal amount word recognition. Souici et al. [38] introduced a novel hybrid neuro-symbolic classifier to recognize Arabic legal amount words. The authors utilized contour tracing to extract features, including the number of loops, ascenders, descenders, connected components, and diacritical marks. The classifier was trained on 1200 words from 25 different writers. Hassen et al. [53] proposed an Arabic handwritten monetary word recognition system based on multiple statistical features. The system employed a Histogram of Oriented Gradients, Invariant Moments (IVs), and Gabor filters as statistical features, and utilized the SMO classifier for recognition. In another study, Al-Nuzaili et al. [54] employed Gabor features and utilized the Extreme Learning Machine and Sequential Minimal Optimization classifiers. The Gabor feature extraction was performed in two groups, with image sizes of 128 × 128 and 32 × 64. The performance of the proposed methods was assessed using the AHDB and CENPARMI datasets. Recently, several research efforts have been made in Arabic handwriting recognition. N. Altwaijry et al. [55] presented a Convolutional Neural Network (CNN) model for recognizing Arabic handwriting. The model was trained and evaluated on the Hijja and AHCD datasets, achieving a recognition accuracy of 88% and 97%, respectively. R. Maalej et al. [56] proposed a deep neural network system that combines CNNs and Long Short-term Memory (LSTM) networks. The CNN component extracts important features from the input images, while the LSTM network classifies the inputs based on these features. The system also includes a Connectionist Temporal Classification (CTC) layer, which improves the performance of the LSTM network by aligning its output with the input sequence. M. Elleuch et al. [57] proposed a handwriting recognition model that integrates a CNN with a Support Vector Machine (SVM) classifier. The model uses raw pixel data as input and has been tested on two Arabic handwriting datasets—HACDB and IFN/ENIT. El-Melegy et al. [58] developed a deep learning approach based on a CNN for recognizing Arabic handwritten amounts. The approach was trained and tested on the AHDB dataset.

3. Methods and Materials

The proposed AI framework for Arabic Handwritten Legal Amount Recognition and Digital Courtesy Conversion, as illustrated in Figure 1, offers a sophisticated, end-to-end solution that comprises two main stages—word detection and word classification. Initially, handwritten Arabic data are collected from 160 native speakers and are meticulously annotated at both word and sentence levels to facilitate precise detection and classification. The data undergo rigorous preprocessing, including image normalization, uniform resizing, data splitting, and augmentation, to optimize it for model training. The first stage employs an AI-based YOLO (You Only Look Once) detection model to localize and accurately extract Arabic words from the images. In the second stage, a hybrid classification model, which integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), is used to recognize the extracted words. A courtesy amount conversion algorithm then processes these recognized words to generate the final digital courtesy amount. This comprehensive framework is meticulously designed to ensure high accuracy and efficiency in recognizing and converting handwritten Arabic legal amounts, leveraging state-of-the-art AI techniques to achieve a robust performance.

3.1. Arabic Handwritten Legal Amount (AHLA) Dataset

The AHLA dataset was collected by distributing a specially designed report form to native Arabic speakers. Our dataset contains two kinds of Arabic handwritten expressions as follows:
  • Arabic word-level images are used to express the legal amounts of bank cheques, including the colloquial words used in writing Arabic numbers. A total of 10,660 legal amount word images across 33 different classes were collected and labeled, covering the variety of Arabic vocabulary needed to write out monetary sums, as presented in Table 1.
  • Arabic legal amount sentence images, for which 160 native Arabic writers participated in writing 160 legal amount reports in Arabic words, as shown in Figure 2.
The focused scope on legal/financial vocabulary and formats in word- and sentence-level samples is intended to provide challenging real-world examples for developing Arabic handwriting text recognition systems, especially for financial use cases such as processing monetary documents or bank cheques. The driving purpose behind assembling this multi-tiered dataset is to provide the appropriate Arabic language samples to train and test systems that automatically identify and understand handwritten legal amounts on financial documents and then convert those semantic phrases into corresponding numeric currency totals suitable for digital processing and banking.

3.2. Data Preprocessing and Preparation

3.2.1. Arabic Legal Amount Sentence Images

In the initial phase of preparing the legal amount sentences’ images for word detection, we embarked on the task of annotating the images. This annotation process involved the precise delineation of objects of interest (Arabic handwritten legal amount word) through the utilization of bounding boxes. To facilitate this task, we harnessed the capabilities of Roboflow’s annotation tool. This resource was pivotal in streamlining the annotation process, enhancing our labels’ accuracy and consistency, as shown in Figure 2. Following meticulous annotation, we partitioned the annotated images into subsets; the dataset was segregated into a training set comprising 65% of the data, a validation set encompassing 22%, and a testing set incorporating 13%, as shown in Table 2. Preprocessing steps were executed to prepare the data for training our models. These preprocessing steps included resizing images to conform to a uniform size and normalizing pixel values. The standardization of image sizes and pixel values is of paramount importance in ensuring the effectiveness of the model training process.
Furthermore, a suite of data augmentation techniques was applied to augment the dataset and enhance the model’s capacity to generalize across diverse data instances. These augmentation techniques encompassed operations such as image flipping, rotation, cropping, and adjustments to brightness and contrast. Notably, these data augmentation procedures were exclusively implemented on the training data. The rationale behind this approach is to prevent any bias in the evaluation metrics when assessing model performance on the validation and test datasets. Lastly, the dataset was exported in YOLOv5 PyTorch TXT format. This format was chosen to align with the requirements of the YOLOv5 model architecture, ensuring seamless compatibility between the dataset and the model.

3.2.2. Arabic Word-Level Images

Preprocessing plays a crucial role in Arabic handwritten legal amount word recognition because of the unique attributes of Arabic writing, variations in writing styles, and quality. A comprehensive preprocessing pipeline is critical to improve the input image quality and facilitate accurate segmentation and recognition. The preprocessing pipeline for Arabic handwritten legal amount word recognition includes image scaling, data splitting, and data augmentation. Image scaling resizes input images to a consistent resolution of 224 × 224 pixels, standardizing the image dimensions to ensure compatibility with models and to mitigate the impact of variations in image sizes. In the context of multiclass recognition approaches, the dataset undergoes a division process to facilitate training, validation, and testing procedures. Specifically, this division randomly allocates word images from each class into three subsets—training, testing, and validation. This division is executed for all multiclass recognition approaches, as follows: 70% of the word images corresponding to each class are allocated to the training set, 20% to the validation set, and 10% to the testing set. The data distribution across these subsets is presented in Table 3. Data augmentation methods are utilized on training data to enhance the robustness of models and to reduce the risk of overfitting. These techniques include rescaling to normalize pixel values, shear transformations to mimic diverse writing styles, random zooming to replicate varying text sizes, and horizontal flipping for orientation diversity. These operations generate new images from the existing ones, introducing controlled variations and enhancing the models’ generalization ability to real-world scenarios.
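A minimal sketch of this preprocessing and augmentation pipeline is given below, assuming the Keras ImageDataGenerator API in TensorFlow 2.x; the directory layout (`ahla_words/train`, `ahla_words/val`) and the exact augmentation ranges are illustrative assumptions, not the authors' settings.

```python
# Sketch of word-image preprocessing: resize to 224x224, rescale pixel values,
# and augment the training split only (shear, zoom, horizontal flip).
import tensorflow as tf

IMG_SIZE = (224, 224)  # uniform resolution for the word images

train_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,      # normalize pixel values
    shear_range=0.15,       # mimic slanted writing styles
    zoom_range=0.15,        # replicate varying text sizes
    horizontal_flip=True,   # orientation diversity, as described above
)
val_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

train_data = train_gen.flow_from_directory(
    "ahla_words/train", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")
val_data = val_gen.flow_from_directory(
    "ahla_words/val", target_size=IMG_SIZE, batch_size=16, class_mode="categorical")
```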

3.2.3. Detection Task: AI-Based YOLO Detection Model

YOLOv5 (You Only Look Once version 5) is a state-of-the-art real-time object detection system developed by Glen Jocher and the Ultralytics team [59]. It is an open-source framework for training and deploying object detection models. YOLOv5 uses a single Convolutional Neural Network (CNN) architecture to predict the bounding boxes and class probabilities for objects in an image; the bounding boxes around detected objects allow for further processing, such as tracking and recognition. The architecture is designed to be fast and efficient, making it suitable for real-time applications. In this work, YOLOv5 is used as a pre-trained object detection model and fine-tuned, via transfer learning, for the Arabic handwritten legal amount word extraction task. YOLOv5 consists of three main architectural blocks, CSPDarknet, PANet, and the YOLO detection head, as shown in Figure 3. The CSPDarknet block is the backbone of the model and is responsible for extracting features from the input image. It is created by incorporating the cross-stage partial network (CSPNet) into Darknet. The CSPDarknet consists of a series of convolutional layers with Cross-Stage Partial connections, combining features from earlier layers with features from deeper layers. This helps improve the model’s accuracy and robustness and ensures inference speed. As seen in Figure 3, the C3 block is a building block of the CSPDarknet architecture; the block consists of a series of convolutional layers, each with a kernel size of 3 × 3, that extract features from the input image, and batch normalization and a Leaky ReLU activation function typically follow the C3 block. The Path aggregation network (PANet) is the neck of the YOLOv5 model, and it boosts information flow by adopting a new feature pyramid network (FPN) structure with an enhanced bottom-up path, which improves the propagation of low-level features.
Additionally, adaptive feature pooling links the feature grid and all feature levels, propagating useful information in each feature level directly to the following sub-network; this improves the utilization of accurate localization signals in lower layers, which can significantly enhance the location accuracy of objects. Finally, the detection block, also known as the model head, generates three different sizes of feature maps (18 × 18, 36 × 36, and 72 × 72) to achieve multi-scale prediction, allowing the model to handle small, medium, and large objects. This is particularly useful when working with Arabic handwritten words, as the sizes and shapes of these words can vary significantly between writers. Multi-scale detection ensures the model can adapt to size changes in detecting words. The final detection layers in YOLOv5 use the sigmoid activation function, while the middle or hidden layers utilize Leaky ReLU activation functions. The final output vectors of the detection block include the bounding boxes, confidence scores, and class probabilities for each detected object (word). This work ignores any prediction with a confidence score lower than 0.5. The detected words are arranged in descending order of their xmin values to match the right-to-left writing direction of Arabic; these words are then segmented and passed to the next step for recognition. To retrain the YOLOv5s model, we utilized 200 legal amount images from the AHLA dataset, which were annotated. The labeling and annotation of the images were accomplished online using the Roboflow platform (https://roboflow.com). The images were randomly partitioned into 80% training, 15% validation, and 5% testing sets.
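The detection post-processing described above (confidence filtering at 0.5, right-to-left ordering by xmin, and cropping of the detected words) can be sketched as follows using the public torch.hub interface of YOLOv5; the weight-file name is a hypothetical placeholder for the fine-tuned checkpoint.

```python
# Hedged sketch of the word-detection post-processing, not the authors' exact code.
import torch
from PIL import Image

model = torch.hub.load("ultralytics/yolov5", "custom", path="ahla_yolov5s.pt")

def extract_words(image_path, conf_thresh=0.5):
    """Detect word boxes, keep confident ones, and order them right-to-left."""
    image = Image.open(image_path)
    det = model(image).xyxy[0].tolist()   # [xmin, ymin, xmax, ymax, conf, cls]
    det = [d for d in det if d[4] >= conf_thresh]
    det.sort(key=lambda d: d[0], reverse=True)   # descending xmin = Arabic reading order
    # Crop each detected word for the classification stage.
    return [image.crop((d[0], d[1], d[2], d[3])) for d in det]
```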

3.3. Classification Task: The Proposed Hybrid Ensemble CNNs with the ViT

In this work, to recognize the Arabic handwritten amount words, we propose a novel deep learning architecture for Arabic handwritten word recognition that combines an ensemble of two CNN models—Xception and InceptionResNetV2—with a Vision Transformer (ViT), as shown in Figure 4. The model takes Arabic word images as input and passes them through the two CNNs independently, which act as feature extractors. Their outputs are flattened and concatenated to ensemble their learned visual features. In parallel, the image is fed to the ViT model, which encodes global contextual relationships. The concatenated CNN ensemble output is fused with the ViT output through concatenation. This allows the model to represent robust local features from the CNN ensemble and long-range dependencies from the ViT. The concatenated multi-model representation is fed to a classifier. By leveraging an ensemble of CNNs, our model can capture diverse, complementary local visual patterns. The ViT incorporates global context and long-range sequence relationships. By integrating the CNN ensemble with the ViT, our model architecture is designed to combine their respective strengths. The entire hybrid model is trained end-to-end to recognize Arabic handwritten words. Extensive experiments show that our hybrid ensembling approach performs better than individual CNN or ViT models, demonstrating the effectiveness of fusing diverse models.
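A minimal Keras sketch of this parallel fusion (Hybrid A) is given below; it is an illustrative reconstruction, not the authors' code. The two ImageNet-pretrained backbones are real `tf.keras.applications` models, while the ViT branch is a simplified stand-in (a small patch-embedding transformer encoder) because the exact pre-trained ViT used is not specified here; layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, applications

NUM_CLASSES = 33  # word classes in the AHLA dataset

def vit_branch(x, patch=16, dim=256, heads=4, depth=2):
    # Simplified ViT-style branch: patch embedding, transformer blocks, pooling.
    seq = layers.Conv2D(dim, patch, strides=patch)(x)   # non-overlapping patch embedding
    seq = layers.Reshape((-1, dim))(seq)                # (num_patches, dim) token sequence
    for _ in range(depth):
        attn = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(seq, seq)
        seq = layers.LayerNormalization()(seq + attn)   # residual + norm
        ff = layers.Dense(dim, activation="gelu")(seq)
        seq = layers.LayerNormalization()(seq + layers.Dense(dim)(ff))
    return layers.GlobalAveragePooling1D()(seq)

inputs = layers.Input(shape=(224, 224, 3))

# Frozen CNN backbones acting as fixed feature extractors.
xcep = applications.Xception(include_top=False, weights="imagenet", pooling="avg")
incres = applications.InceptionResNetV2(include_top=False, weights="imagenet", pooling="avg")
xcep.trainable = False
incres.trainable = False

cnn_features = layers.Concatenate()([xcep(inputs), incres(inputs)])  # CNN ensemble features
fused = layers.Concatenate()([cnn_features, vit_branch(inputs)])     # fuse with the ViT branch
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(layers.Dropout(0.5)(fused))
hybrid_a = Model(inputs, outputs, name="hybrid_a")
hybrid_a.summary()
```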

3.3.1. Local Knowledge Extraction: Ensemble Transfer Learning via Various CNNs

We carefully selected the best-performing AI models for the detection and classification stages based on empirical and experimental studies. This trial-and-error strategy evaluates various individual models and then selects those with the best recognition performance.
  • Xception: Xception is a deep Convolutional Neural Network architecture proposed by Chollet et al. [60]. It relies on depth-wise separable convolutions as basic building blocks. Depth-wise separable convolutions split a standard convolution into two layers—a depth-wise convolution to filter inputs per channel, followed by a pointwise convolution to combine channel outputs. This factorization dramatically reduces computation compared to traditional convolutions. Xception achieves a strong image classification performance with only 36 convolutional layers, much less than other state-of-the-art CNNs. It provides an efficient and lightweight alternative to models like Inception V3 and ResNet [60]. Using separable convolutions allows Xception to gain accuracy from considerably increased depth, while maintaining a relatively low computational budget. Key strengths of Xception include its efficiency, accuracy, and modular structure based on repeating convolution blocks [61].
  • InceptionResNetV2: Inception-ResNetV2 was proposed by Szegedy et al. [62] to improve upon earlier Inception architectures. It combines the Inception structure with residual connections, inspired by the success of ResNet models [22]. Inception modules employ multiple filter sizes in parallel to capture details at various scales. Residual connections help optimization and accuracy by allowing direct gradient propagation across layers. Inception-ResNetV2 achieves high accuracy on image classification benchmarks like ImageNet, surpassing its predecessors—the hybrid Inception–Resnet design benefits from multi-scale feature extraction and residual learning. Additional enhancements include batch normalization and factorized convolutions to reduce computational requirements. The model is readily transferable and can be a strong feature extractor for many vision tasks.

3.3.2. Global Knowledge Extraction: Vision Transformer (ViT)

The Vision Transformer (ViT) was proposed by Dosovitskiy et al. [30] as a new approach to applying transformers to image recognition tasks. While transformers have been hugely successful in NLP, ViT was the first model to demonstrate their potential for computer vision. The key insight is representing an image as a sequence of patches embedded as tokens, which allows modeling with standard transformer encoders. ViT relies solely on attention mechanisms without convolution layers to learn relationships between image patches. ViT achieves a strong performance on image classification benchmarks, rivaling state-of-the-art CNNs. The global receptive field of transformers allows modeling longer-range dependencies compared to CNNs. ViT has become hugely influential, leading to numerous further adaptations of transformers for vision [31].
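To make the tokenization step concrete, the following toy example shows how a 224 × 224 word image becomes a sequence of 14 × 14 = 196 patch tokens (16 × 16 patches of 768 raw values each) that a linear layer then projects to an embedding dimension; the embedding size of 256 is an illustrative choice, not a value from the paper.

```python
# Toy illustration of ViT patch tokenization and linear embedding.
import tensorflow as tf

image = tf.random.uniform((1, 224, 224, 3))          # stand-in for a word image
patches = tf.image.extract_patches(
    images=image, sizes=[1, 16, 16, 1], strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1], padding="VALID")
tokens = tf.reshape(patches, (1, 14 * 14, 16 * 16 * 3))   # (batch, 196, 768)
embedded = tf.keras.layers.Dense(256)(tokens)              # (batch, 196, 256)
print(embedded.shape)
```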

3.4. Courtesy Amount Conversion

The writing structure of an Arabic legal amount sentence runs from right to left, with the highest value on the right and decreasing toward the left; in other words, the largest value (the millions part) comes first on the right, followed by the smaller value (the thousands part), until the smallest value (the ones part) comes last on the left. In this section, we propose an algorithm, named the LegalToCourtesy algorithm (see Algorithm 1), to generate the courtesy amount from the Arabic legal amount sentence after each word has been recognized.
The algorithm computes the courtesy amount value (CAV) from the predicted legal amount words. The input is the set of predicted legal words (L) produced by the classification phase, and the output is the CAV. For example, let L = [6, 10, 1000000, 800, 20, 1000, 3, 100, 5, 60] be the list of predicted legal amount word values produced by the word detection and word classification phases for the legal amount sentence image in Figure 5.
Algorithm 1 LegalToCourtesy Algorithm
Input: L = {pw_1, pw_2, …, pw_n} ← set of predicted legal amount words
Output: CAV ← courtesy amount value
  • P = {1,000,000, 1000, 100} ← part values of numbers
  • Initialize MP ← Millions Part
  • Initialize TP ← Thousands Part
  • Initialize HP ← Hundreds Part
  • Initialize TOP ← Tens and Ones Part
  • For each part in P do
  •   If part exists in L then
  •     index = L.index(part)
  •     If match(part, 1,000,000) = true then MP = L[0:index]; delete MP and the marker from L
  •     Else if match(part, 1000) = true then TP = L[0:index]; delete TP and the marker from L
  •     Else if match(part, 100) = true then HP = L[0:index + 1]; delete HP from L
  • TOP = L ← the words remaining in L form the Tens and Ones Part
  • Compute MPV according to the CalMilThPartValue function ← Millions Part Value
  • Compute TPV according to the CalMilThPartValue function ← Thousands Part Value
  • HPV = the product of the elements of HP ← Hundreds Part Value
  • TOPV = sum(TOP) ← Tens and Ones Part Value
  • CAV = MPV + TPV + HPV + TOPV
The algorithm splits L into four sets of part values—the millions part (MP), thousands part (TP), hundreds part (HP), and the tens and ones part (TOP). The algorithm checks whether each part value exists in L; if it exists, the index of the part value is used to determine the corresponding set of words for that part. The values of the MP and TP sets are then calculated using the CalMilThPartValue function, as presented in Algorithm 2, and the results are kept in MPV and TPV for the MP and TP sets, respectively, as shown in Figure 6.
The HPV and TOPV are calculated from the HP and TOP element sets, respectively. Finally, the CAV is calculated as follows:
CAV = MPV + TPV + HPV + TOPV = 16,000,000 + 820,000 + 300 + 65 = 16,820,365
Algorithm 2 CalMilThPartValue Function
Define the CalMilThPartValue function as
Input: SP = {sp_1, sp_2, …, sp_n} ← a subset of the part words; Unit ← part value (1,000,000 or 1000)
Output: Value ← value of the part
  • Initialize Value = 0
  • If the hundreds word exists in SP then
  •   index = SP.index(hundreds)
  •   HSP = SP[0:index + 1] ← hundreds part of the subset
  •   Delete HSP from SP
  • USPV = Unit ← unit value of the subset part
  • TSP = SP ← tens and ones part of the subset
  • HSPV = the product of the elements of HSP (0 if HSP is empty) ← Hundreds Part Value
  • TSPV = the sum of the elements of TSP ← Tens Part Value
  • Value = (HSPV + TSPV) × USPV
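One possible Python reading of Algorithms 1 and 2 is sketched below; it is an interpretation for illustration only (in particular, how the part markers are removed from L is an assumption, and the function names `part_value` and `legal_to_courtesy` are ours), not the authors' reference implementation.

```python
# The predicted word values arrive in Arabic reading order (largest part first);
# the markers 1,000,000 / 1,000 / 100 split the list into millions, thousands,
# hundreds, and tens-and-ones sections.

def part_value(words, unit):
    """CalMilThPartValue: value of one millions/thousands section times its unit."""
    hundreds = 0
    if 100 in words:
        i = words.index(100)
        multiplier = words[:i] or [1]        # e.g. [3, 100] -> 3 * 100
        hundreds = multiplier[0] * 100
        words = words[i + 1:]
    return (hundreds + sum(words)) * unit    # remaining words are tens and ones

def legal_to_courtesy(predicted_words):
    """LegalToCourtesy: split on part markers and accumulate the courtesy amount."""
    remaining = list(predicted_words)
    total = 0
    for unit in (1_000_000, 1_000):
        if unit in remaining:
            i = remaining.index(unit)
            total += part_value(remaining[:i], unit)
            remaining = remaining[i + 1:]    # drop this section and its marker
    total += part_value(remaining, 1)        # hundreds plus tens-and-ones remainder
    return total

# Worked example from Figure 5: sixteen million, eight hundred twenty thousand,
# three hundred sixty-five.
L = [6, 10, 1_000_000, 800, 20, 1_000, 3, 100, 5, 60]
print(legal_to_courtesy(L))                  # 16820365
```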

3.5. Experimental Setup

This section details the specific configurations for the evaluation experiments. The experiments were conducted on Google Colab using an Nvidia Tesla T4 K80 GPU with 16 GB of RAM. The YOLOv5 algorithm was implemented using PyTorch 1.13.1 and CUDA cu116, and the hybrid CNN–ViT model was implemented using TensorFlow 2.9.2, based on the Python 3.8.10 programming language.

3.6. Evaluation Metrics

Evaluating the performance of object detection approaches involves utilizing various statistical and machine learning metrics, such as ROC curves, precision and Recall, F-scores, and false positives per image [54]. The detection accuracy is typically determined by comparing the results of the object detector to a reference set of ground-truth bounding boxes. Most object detection studies have employed the overlap criterion introduced by Everingham et al. [55] for the Pascal VOC challenge to determine the correctness of detection. This criterion involves assigning detections to ground-truth objects and determining true or false positives based on the overlap between the predicted and ground-truth bounding boxes. According to [55], a detection is correct if the overlap ratio between the predicted and ground-truth boxes exceeds 0.5 (50%).
The overlap criterion used in Pascal VOC is the intersection over union (IoU) and is calculated as follows:
\mathrm{IoU} = a_0 = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}
where IoU represents the intersection over union; a_0 represents the overlap ratio; B_p and B_gt denote the predicted and ground-truth bounding boxes, respectively; area(B_p ∩ B_gt) represents the overlap or intersection of the predicted and ground-truth bounding boxes; and area(B_p ∪ B_gt) represents the union of these two bounding boxes. By matching detections to the ground truth, it is possible to determine the number of correctly detected objects, referred to as true positives (TPs), incorrect detections or false positives (FPs), and ground-truth objects missed by the detector, or false negatives (FNs). A wide range of evaluation metrics can be computed using the total TP, FP, and FN counts.
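For completeness, the IoU criterion above can be transcribed directly into a small helper; boxes are given as [xmin, ymin, xmax, ymax], and a detection counts as correct when the returned value exceeds 0.5.

```python
def iou(box_p, box_gt):
    # Intersection rectangle
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of the two box areas minus the intersection
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))   # 0.333... -> not a correct detection
```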
Our YOLOv5-based word extraction module was assessed using four metrics—Precision, Recall, F1-score, and mean Average Precision (mAP). Precision measures the accuracy of detected legal amount words out of the total detected words.
\mathrm{Precision\ (PRE)} = \frac{TP}{TP + FP}
Conversely, Recall calculates the ratio of correctly detected words to the total number of legal amount words in the dataset.
\mathrm{Sensitivity\ (SEN)/Recall\ (RE)} = \frac{TP}{TP + FN}
The F1-score is the trade-off between Precision and Recall, giving a general idea of the algorithm’s performance.
\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
Finally, the Average Precision (AP), averaged over classes to obtain the mean Average Precision (mAP), evaluates the algorithm’s performance under different confidence thresholds.
AP = \int_0^1 P(R)\, dR
We utilized four metrics to assess the proposed hybrid ensemble’s (CNNs–ViT) performance—accuracy (ACC), Recall, Precision, and F1-score. For the model, we calculated a Confusion Matrix. From the Confusion Matrix, we obtained the values for false negatives (FNs), false positives (FPs), true negatives (TNs), and true positives (TPs).
\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + TN + FP + FN}
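These metrics can be computed from the model predictions with standard scikit-learn utilities; the short label vectors below are purely illustrative stand-ins for the validation labels and the hybrid model's argmax outputs.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 2, 1, 1, 2, 0]          # illustrative ground-truth class labels
y_pred = [0, 2, 1, 2, 2, 0]          # illustrative predicted class labels

acc = accuracy_score(y_true, y_pred)
pre = precision_score(y_true, y_pred, average="macro")   # PRE
sen = recall_score(y_true, y_pred, average="macro")      # SEN / Recall
f1 = f1_score(y_true, y_pred, average="macro")           # F1-score
cm = confusion_matrix(y_true, y_pred)                    # per-class TP/FP/FN/TN counts
print(acc, pre, sen, f1)
```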

4. Results and Discussion

This section discusses the experimental results. We evaluated the performance of our novel approach for recognizing legal amounts from a set of documents and generating the courtesy amount. Our approach consists of the following three major components: YOLOv5s-based word detection, hybrid CNN–ViT-based word recognition, and courtesy amount generation.

4.1. Hyperparameter Settings

4.1.1. Hyperparameter Settings of YOLOv5 Model

Several critical hyperparameters were configured in the fine-tuning process of the YOLOv5 algorithm for detecting Arabic handwritten words. The input image size was set to 640 × 640 pixels, with any unoccupied area filled with a white background, defining the dimensions of images used during training. Mini batches of 32 samples were employed for each iteration, influencing the model’s gradient updates. The training was conducted for 300 epochs (i.e., the entire dataset was processed 300 times for model optimization) using the stochastic gradient descent (SGD) algorithm with a learning rate of 0.01, a weight decay of 0.0005, and an SGD momentum of 0.937. The dataset configuration, including class information and file paths, was specified in the data.yaml file. For the initial weights of the YOLOv5 model, ‘yolov5s.pt’ was utilized. Finally, caching was not explicitly enabled in this setup. The other hyperparameters not explicitly mentioned remain consistent with the default values used in YOLOv5s.
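Assuming the Ultralytics YOLOv5 repository is cloned locally (and is the current working directory) and the annotated AHLA export is described by a data.yaml file, the configuration above corresponds approximately to the repository's standard command-line training entry point; this is a sketch, not the authors' exact invocation.

```python
# Launch YOLOv5 fine-tuning with the stated image size, batch size, epochs,
# dataset description, and initial weights.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",            # input image size
    "--batch", "32",           # mini-batch size
    "--epochs", "300",         # training epochs
    "--data", "data.yaml",     # dataset configuration (classes, file paths)
    "--weights", "yolov5s.pt", # initial pre-trained weights
], check=True)
```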

4.1.2. Hyperparameter Settings of Hybrid CNN–ViT

In the training process of the hybrid CNN–ViT model for Arabic handwritten word recognition, several key hyperparameters were configured. The input image size was set to 224 × 224 pixels, with any unoccupied area filled with white background, defining the dimensions of images used during training and inference. Mini batches of 16 samples were employed for each iteration, influencing the model’s gradient updates. The training was conducted for 80 epochs and used the Adam optimization algorithm with a learning rate of 1 × 10−4 and a categorical cross-entropy loss. For regularization, a dropout with a rate of 0.5 was applied to the classifier layers. The CNN layers were frozen during training, while the classifier layers were trained from scratch. The Xception, InceptionResNetV2, and ViT models used ImageNet pre-trained weights for initialization. Data augmentation was applied on the fly, including rotations, shifts, flips, and color jittering. The dataset configuration was specified in the data loading module. The other hyperparameters not explicitly mentioned remain consistent with the default values used in the original CNN and ViT model configurations. The model was trained end-to-end using the above settings to optimize it for Arabic handwritten word recognition.
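Reusing the hybrid model and data generators from the earlier sketches (`hybrid_a`, `train_data`, `val_data`), the stated settings translate roughly to the following Keras calls; the accuracy metric is an added assumption, and the batch size of 16 is set on the generators themselves.

```python
# Compile and train with the stated hyperparameters: Adam with lr = 1e-4,
# categorical cross-entropy, and 80 epochs.
hybrid_a.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = hybrid_a.fit(train_data, validation_data=val_data, epochs=80)
```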
We set these parameters to balance computational efficiency and model performance. The choice of learning rates, batch sizes, and optimization algorithms were determined through a combination of grid search and empirical validation, ensuring a robust performance across various training scenarios. This careful tuning was crucial for achieving the high accuracy rates reported in our results, and it reflects a meticulous approach to model optimization and validation.

4.2. YOLOv5s Model: Word Detection Outcomes

The YOLOv5s-based word extraction was used to detect and extract the words from the legal amount images. The purpose of the loss function is to assess the model’s training progress in each iteration and to compute the discrepancy between the predicted and actual values during the iteration. The YOLOv5 loss function is represented as object loss, box loss, and class loss, where box loss reflects how well the algorithm can pinpoint the object’s center and the accuracy of the predicted bounding box encompassing the object. The target loss (object loss) essentially gauges the likelihood of an object existing in the designated region of interest, and the classification loss (class loss) represents the category loss. As there is only one class in the training set for this study, class loss is equal to 0. The loss function curves for the training procedure are depicted in Figure 7. As demonstrated by the loss curve, at 180 epochs, the loss function of the training set dropped from an initial value of 0.233 to approximately 0.087, and the loss function of the validation set decreased from its starting value of 0.299 to 0.154. The precision of a model refers to its ability to identify an object accurately.
Meanwhile, Recall is a metric that evaluates the extent to which the model searches for all instances of the object when recognizing it. The variation of Precision and Recall during the model’s training, based on the number of epochs, is depicted in Figure 7. It is evident that the highest Precision reached by the model during its training process is 0.997, and the highest Recall achieved is 0.997. The mean Average Precision (mAP) is a crucial metric in determining the effectiveness of an object detection network. It measures the network’s performance by calculating the area under the Precision–Recall curve. The mAP@0.5 metric evaluates the network’s performance when the intersection over union (IoU) threshold is set to 0.5. On the other hand, mAP@0.5:0.95 considers the average precision across different IoU thresholds ranging from 0.5 to 0.95, evaluated at a step of 0.05. As depicted in Figure 8, the mAP curve during the training process is shown. The ultimate evaluation results show that the model achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.711. The confusion matrix for the model is shown in Figure 9. The model achieved an overall accuracy of 99.99%, with only 0.01% of objects misclassified as background. This demonstrates the model’s strong performance in accurately detecting words. The sample results in Figure 10 indicate a clear success for the YOLOv5 model in accurately detecting Arabic handwritten legal amount words. The model was tested on 75 Arabic legal amount images, with impressive results, as shown in Table 4; the Precision and Recall rates achieved by the model were 0.977 and 0.961, respectively. The model also achieved a mean Average Precision (mAP) score of 0.979 at an intersection over union (IoU) threshold of 0.5 and a mAP score of 0.596 at IoU thresholds between 0.5 and 0.95.
Additionally, it is important to note that the number of letters in an Arabic word does not significantly impact detection accuracy. Our model has been designed to effectively handle words of varying lengths, and the accuracy remains consistent regardless of the number of letters. Our model’s performance metrics and visual examples demonstrate this robustness throughout the training and evaluation processes.

4.3. Hybrid Models: Classification Outcomes

Two hybrid learning scenarios were designed to recognize Arabic handwritten words by integrating pre-trained CNN models with the Vision Transformer (ViT) model. In the first scenario, referred to as Hybrid A, a hybrid ensemble of two CNN models, Xception and InceptionResNetV2, was fused in parallel with the ViT, as depicted in Figure 4. In the second scenario, Hybrid B, features extracted from Xception and InceptionResNetV2 were concatenated in series with the ViT. These scenarios aim to determine the optimal positioning of the ViT to enhance classification performance. The classification performance of each hybrid deep learning model is comprehensively delineated in Table 5; regarding classification accuracy, Hybrid A achieved an accuracy of 99.02%, while Hybrid B performed lower, with an accuracy of 97.531%. This suggests that Hybrid A outperforms Hybrid B in terms of overall accuracy. Precision (PRE) and F1-Score metrics also demonstrate the superiority of Hybrid A with a PRE of 98.20% and an F1-Score of 98.17%, compared to Hybrid B’s PRE of 97.62% and F1-Score of 97.52%. These differences indicate that Hybrid A achieves a higher accuracy and exhibits a better precision and F1-Score.
Furthermore, the area under the ROC curve (AUC) and sensitivity (SEN) values are crucial for assessing the models’ performance. Hybrid A surpasses Hybrid B in both AUC (99.89% compared to 99.83%) and SEN (98.17% compared to 97.531%). This implies that Hybrid A can better distinguish between classes and is more sensitive in identifying true positive cases. Furthermore, a comprehensive presentation of the classification evaluation results, specifically regarding Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves, is summarized in Table 6. Hybrid A exhibits impressive results with a macro-average AUC of 99.89% and a micro-average AUC of 99.99%, as well as a macro-average Precision–Recall score of 99.28% and a micro-average Precision–Recall of 99.16%. On the other hand, Hybrid B still performs slightly lower, with a macro-average AUC of 99.80%, a micro-average AUC of 99.99%, along with a macro-average Precision–Recall score of 98.16%, and a micro-average Precision–Recall of 98.96%.
The number of accurately and mistakenly identified samples was assessed using a confusion matrix or contingency table, as depicted in Figure 11. For Hybrid A, 18 samples were misclassified. In contrast, as shown in Figure 12, Hybrid B demonstrated significantly higher misclassifications, with 29 samples being wrongly identified. These findings suggest that Hybrid A outperforms Hybrid B regarding classification accuracy. The lower number of misclassified samples in Hybrid A signifies the higher precision and reliability in its predictions, highlighting its superior performance compared to Hybrid B.

4.4. Courtesy Amount Generation Outcomes

We used 50 images in the test set that were not involved in training to assess the approach’s ability to generate the courtesy amount. The testing results are evaluated based on the generated courtesy amount values. Our approach could generate the courtesy amount with an accuracy rate of 90% and an inference time cost of 4.5 s per image. Our novel approach demonstrated excellent results in extracting the legal amount words, recognizing them, and generating the courtesy amount; Figure 13 shows the samples of courtesy amounts that were correctly generated. Additionally, our approach can detect words even in overlapping cases, as shown in Figure 14a, and can classify words despite spelling mistakes, as demonstrated in Figure 14b. This approach can automate legal amount recognition processing in financial documents, bank cheques, invoices, and other financial transactions. All improperly generated courtesy amounts are caused by inaccurate word detection estimates or incorrect word recognition of legal amounts. Figure 15 shows examples of improperly generated courtesy amounts. In Figure 15a, the error is due to the wrong classification of a word, which is caused by a spelling error that affected the classification process. In Figure 15b, the error arises from detecting a sub-word as a complete word, which impacts the generation of the courtesy amount.

4.5. Comparison of Proposed Methods with Existing Studies

Since there is no existing work on Arabic handwritten legal amount recognition that generates the Arabic courtesy amount from the recognized legal amount, we instead compare the YOLOv5-based word extraction method with existing Arabic word extraction methods, as shown in Table 7, and the hybrid CNN–ViT-based legal amount word recognition method with existing Arabic word recognition methods, as shown in Table 8.
Our experimental results demonstrate that ViTs, as well as the hybrid models integrating the ViT with the selected CNNs, exhibit an even better performance, emphasizing the potential of combining these two deep learning paradigms. Hybrid A emerges as the best-performing model, offering high accuracy and strong discriminative abilities, making it a promising choice for practical Arabic handwritten word classification tasks. The number of accurately and mistakenly identified samples was measured using a confusion matrix or contingency table, as shown in Figure 11 and Figure 12. For Hybrid B, 29 samples were misclassified, while Hybrid A exhibited a significantly lower number of misclassifications, with only 18 samples being wrongly identified. These results suggest that Hybrid A is more robust and accurate in classification than Hybrid B. The lower number of misclassified samples indicates the higher precision and reliability of Hybrid A’s predictions.

4.6. Limitations and Future Work

While promising, the proposed model has several limitations that provide opportunities for future work. Different individuals write the same characters differently, leading to variability in handwriting styles that can impact the model’s accuracy. Additionally, overlapping characters in handwritten Arabic complicate recognition, and spelling errors can affect word detection and classification. Our current framework does not explicitly handle all typos in handwritten words, though it has shown the capability to manage some typographical errors, as demonstrated in Figure 14b. The dataset used for training and evaluation contained only 160 legal amount images, which is relatively small for deep learning models. Training on larger and more diverse datasets would likely improve the model’s accuracy and robustness.
Additionally, the model was only evaluated on recognizing Arabic literal amounts, and its capabilities for other Arabic handwriting tasks remain unknown. Testing the model on additional datasets and tasks, such as recognizing courtesy amounts and free-form handwriting, would provide insight into its versatility. Integrating natural language processing techniques to recognize numerical values from words could expand the model’s capabilities. Furthermore, optimizing the model architecture and hyperparameter tuning could improve efficiency for real-time applications. Addressing these limitations presents worthwhile avenues for future research. With larger datasets, evaluation across tasks, integration of contextual understanding, and optimization, the proposed model can potentially generalize better and pave the way for production-ready Arabic handwriting recognition.

5. Conclusions

This paper introduced a novel deep learning framework for the end-to-end recognition of handwritten Arabic legal amounts on bank cheques, overcoming the limitations of traditional segmentation-dependent OCR techniques. The proposed pipeline integrates three main stages. First, a YOLOv5-based word detection stage accurately detects and extracts handwritten words, achieving over 99% accuracy. Second, a hybrid CNN–ViT model recognizes the extracted words by combining the strengths of CNNs and Vision Transformers, achieving over 99% accuracy. Third, the LegalToCourtesy algorithm converts the recognized words into their corresponding numerical courtesy amounts, accounting for the right-to-left structure of Arabic text. End-to-end testing demonstrated 90% accuracy in extracting handwritten Arabic legal amounts and correctly converting them into courtesy amounts, without requiring complex preprocessing or segmentation. The robust and accurate performance highlights the effectiveness of leveraging deep transfer learning for Arabic handwriting recognition tasks. This work advances the field of Arabic OCR and document image analysis, presenting new techniques for legal amount recognition in Arabic script. The high accuracy and practical implications suggest a significant potential for integration into real-world systems, particularly in banking and financial services for automating cheque processing and financial workflows.

Author Contributions

H.A.A.: conceptualization; data curation; software; writing—original draft. A.A.: validation; resources; investigation; methodology. M.A.A.-A.: conceptualization; supervision; validation; writing—review and editing; project administration; funding acquisition. R.R.M.: formal analysis; resources; visualization. M.T.: validation; resources; visualization; investigation. S.B.: data curation; conceptualization; supervision. Y.H.G.: formal analysis; resources; validation; writing—review and editing; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II210755, Dark data analysis technology for data scale and accuracy improvement).

Data Availability Statement

This study uses our AHLA dataset and the accompanying source code, which are available at the following URLs: the Arabic Handwritten Legal Amount (AHLA) dataset at https://doi.org/10.5281/zenodo.10845222 (accessed on 2 July 2024), and the source code of this implementation at https://github.com/Hakim-Abdo/ArabicHandwrittenLegalAmountToCourtesyAmount.git (accessed on 2 July 2024).

Acknowledgments

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II210755, Dark data analysis technology for data scale and accuracy improvement), and by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (No. RS-2022-00166402 and RS-2023-00256517).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence
YOLO: You Only Look Once
CNN: Convolutional Neural Network
ViT: Vision Transformer
OCR: Optical Character Recognition
TL: Transfer Learning
RE: Recall
ROC: Receiver Operating Characteristic
PR: Precision–Recall
AUC: Area Under the Curve
ACC: Accuracy
PRE: Precision
SEN: Sensitivity
CSP: Cross-stage partial network
SPP: Spatial Pyramid Pooling
Conv: Convolutional layer

References

  1. Al-Muhtaseb, H.A.; Mahmoud, S.A.; Qahwaji, R.S. Recognition of off-line printed Arabic text using Hidden Markov Models. Signal Process. 2008, 88, 2902–2912. [Google Scholar] [CrossRef]
  2. Tanvir Parvez, M.; Mahmoud, S.A. Arabic handwriting recognition using structural and syntactic pattern attributes. Pattern Recognit. 2013, 46, 141–154. [Google Scholar] [CrossRef]
  3. Suen, C.; Kharma, N.; Cheriet, M.; Liu, C.-L. Character Recognition Systems: A Guide for Students and Practitioners; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007; p. 326. [Google Scholar]
  4. Al-homed, L.S.; Jambi, K.M.; Al-Barhamtoshy, H.M. A Deep Learning Approach for Arabic Manuscripts Classification. Sensors 2023, 23, 8133. [Google Scholar] [CrossRef] [PubMed]
  5. Djaghbellou, S.; Bouziane, A.; Attia, A.; Akhtar, Z. A Survey on Arabic Handwritten Script Recognition Systems. Int. J. Artif. Intell. Mach. Learn. 2021, 11, 1–17. [Google Scholar] [CrossRef]
  6. Lawgali, A. A Survey on Arabic Character Recognition. Int. J. Signal Process. 2015, 8, 401–426. [Google Scholar] [CrossRef]
  7. Khayyat, M.; Lam, L.; Suen, C.Y. Learning-based word spotting system for Arabic handwritten documents. Pattern Recognit. 2014, 47, 1021–1030. [Google Scholar] [CrossRef]
  8. Slimane, F.; Ingold, R.; Kanoun, S.; Alimi, A.M.; Hennebert, J. A new Arabic printed text image database and evaluation protocols. In Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 946–950. [Google Scholar] [CrossRef]
  9. Shiu, C.W.; Chen, J.; Chen, Y.C. Low-Cost Online Handwritten Symbol Recognition System in Virtual Reality Environment of Head-Mounted Display. Mathematics 2020, 8, 1967. [Google Scholar] [CrossRef]
  10. Baek, S.B.; Shon, J.G.; Park, J.S. CAC: A Learning Context Recognition Model Based on AI for Handwritten Mathematical Symbols in e-Learning Systems. Mathematics 2022, 10, 1277. [Google Scholar] [CrossRef]
  11. Mezghani, N.; Mitiche, A.; Cheriet, M. On-line recognition of handwritten Arabic characters using a Kohonen neural network. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, ON, Canada, 6–8 August 2002; pp. 490–495. [Google Scholar] [CrossRef]
  12. Safabakhsh, R.; Adibi, P. Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM. Arab. J. Sci. Eng. 2005, 30, 95–118. [Google Scholar]
  13. Farooq, F.; Govindaraju, V.; Perrone, M. Pre-processing methods for handwritten Arabic documents. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Republic of Korea, 31 August–1 September 2005; Volume 2005, pp. 267–271. [Google Scholar] [CrossRef]
  14. Simultaneous Segmentation and Recognition of Arabic Characters in an Unconstrained On-Line Cursive Handwritten Document. Available online: https://www.researchgate.net/publication/242308716_Simultaneous_Segmentation_and_Recognition_of_Arabic_Characters_in_an_Unconstrained_On-Line_Cursive_Handwritten_Document (accessed on 10 November 2023).
  15. Parvez, M.T.; Mahmoud, S.A. Offline Arabic handwritten text recognition: A Survey. ACM Comput. Surv. 2013, 45, 1–35. [Google Scholar] [CrossRef]
  16. Abdo, H.A.; Abdu, A.; Manza, R.R.; Bawiskar, S. An approach to analysis of Arabic text documents into text lines, words, and characters. Indones. J. Electr. Eng. Comput. Sci. 2022, 26, 754–763. [Google Scholar] [CrossRef]
  17. Alma’adeed, S.; Higgens, C.; Elliman, D. Recognition of off-line handwritten Arabic words using Hidden Markov Model approach. In Proceedings of the 2002 International Conference on Pattern Recognition, Quebec City, QC, Canada, 11–15 August 2002; Volume 16, pp. 481–484. [Google Scholar] [CrossRef]
  18. Graves, A.; Schmidhuber, J. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 297–313. [Google Scholar] [CrossRef]
  19. Bluche, T.; Ney, H.; Kermorvant, C. Feature extraction with convolutional neural networks for handwritten word recognition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 285–289. [Google Scholar] [CrossRef]
  20. Krishnan, P.; Dutta, K.; Jawahar, C.V. Word spotting and recognition using deep embedding. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 1–6. [Google Scholar] [CrossRef]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  23. Nazir, A.; Cheema, M.N.; Sheng, B.; Li, P.; Li, H.; Xue, G.; Qin, J.; Kim, J.; Feng, D.D. ECSU-Net: An Embedded Clustering Sliced U-Net Coupled with Fusing Strategy for Efficient Intervertebral Disc Segmentation and Classification. IEEE Trans. Image Process. 2022, 31, 880–893. [Google Scholar] [CrossRef] [PubMed]
  24. Abdu, A.; Zhai, Z.; Abdo, H.A.; Algabri, R. Software Defect Prediction Based on Deep Representation Learning of Source Code From Contextual Syntax and Semantic Graph. IEEE Trans. Reliab. 2024, 73, 820–834. [Google Scholar] [CrossRef]
  25. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  26. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  28. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 2014, 4, 3320–3328. [Google Scholar]
  29. Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A Survey on Deep Transfer Learning. In Proceedings of the 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Volume 11141, pp. 270–279. [Google Scholar] [CrossRef]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  31. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  32. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. Proc. Mach. Learn. Res. 2020, 139, 10347–10357. [Google Scholar]
  33. Wu, F.; Wang, J.; Liu, J.; Wang, W. Vulnerability detection with deep learning. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 1298–1302. [Google Scholar] [CrossRef]
  34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  36. Fu, P.; Zhang, X.; Yang, H. Answer sheet layout analysis based on YOLOv5s-DC and MSER. Vis. Comput. 2023, 1–12. [Google Scholar] [CrossRef]
  37. Al-ohali, Y.; Cheriet, M.; Suen, C. Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 2003, 36, 111–121. [Google Scholar] [CrossRef]
  38. Souici-Meslati, L.; Sellami, M. A hybrid approach for arabic literal amounts recognition. Arab. J. Sci. Eng. 2004, 29, 177–194. [Google Scholar]
  39. Farah, N.; Souici, L.; Sellami, M. Classifiers combination and syntax analysis for Arabic literal amount recognition. Eng. Appl. Artif. Intell. 2006, 19, 29–39. [Google Scholar] [CrossRef]
  40. Farah, N.M.; Sellami, M.A. Fuzzy nearest neighbor system: An application to the recognition of handwritten Arabic literal amounts. Jordan J. Appl. Sci.-Nat. Sci. 2005, 7, 48–55. [Google Scholar]
  41. Al-Ma’adeed, S.; Elliman, D.; Higgins, C.A. A data base for Arabic handwritten text recognition research. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, ON, Canada, 6–8 August 2002; Volume 1, pp. 485–489. [Google Scholar] [CrossRef]
  42. Louloudis, G.; Gatos, B.; Pratikakis, I.; Halatsis, C. Text line and word segmentation of handwritten documents. Pattern Recognit. 2009, 42, 3169–3183. [Google Scholar] [CrossRef]
  43. Aouadi, N.; Echi, A.K. Word Extraction and Recognition in Arabic Handwritten Text. Int. J. Comput. Inf. Sci. 2016, 12, 17–23. [Google Scholar] [CrossRef]
  44. Elzobi, M.; Al-Hamadi, A.; Al Aghbari, Z. Off-line handwritten arabic words segmentation based on structural features and connected components analysis. In Proceedings of the 19th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic, 31 January–3 February 2011; pp. 135–142. [Google Scholar]
  45. AlKhateeb, J.H.; Jiang, J.; Ren, J.; Ipso, S. Interactive Knowledge Discovery for Baseline Estimation and Word Segmentation in Handwritten Arabic Text. In Recent Advances in Technologies; Intechopen: London, UK, 2009. [Google Scholar] [CrossRef]
  46. Papavassiliou, V.; Stafylakis, T.; Katsouros, V.; Carayannis, G. Handwritten document image segmentation into text lines and words. Pattern Recognit. 2010, 43, 369–377. [Google Scholar] [CrossRef]
  47. Al-dmour, A.; Fraij, F. Segmenting Arabic Handwritten Documents into Text lines and Words. Int. J. Adv. Comput. Technol. 2014, 6, 109–119. [Google Scholar]
  48. Al-Dmour, A.; Zitar, R.A. Word extraction from arabic handwritten documents based on statistical measures. Int. Rev. Comput. Softw. 2016, 11, 436–444. [Google Scholar] [CrossRef]
  49. Neche, C.; Belaïd, A.; Kacem-Echi, A. Arabic handwritten documents segmentation into text-lines and words using deep learning. In Proceedings of the 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Sydney, Australia, 22–25 September 2019; Volume 6, pp. 19–24. [Google Scholar] [CrossRef]
  50. Mahmoud, S.A.; Ahmad, I.; Al-Khatib, W.G.; Alshayeb, M.; Tanvir Parvez, M.; Märgner, V.; Fink, G.A. KHATT: An open Arabic offline handwritten text database. Pattern Recognit. 2014, 47, 1096–1112. [Google Scholar] [CrossRef]
  51. Gader, T.B.A.; Echi, A.K. Attention-based CNN-ConvLSTM for Handwritten Arabic Word Extraction. Electron. Lett. Comput. Vis. Image Anal. 2022, 21, 121–134. [Google Scholar] [CrossRef]
  52. Saidi, A.; Lakhdar, A.M.; Beladgham, M. Recognition of Offline Handwritten Arabic Words Using a Few Structural Features. Comput. Mater. Contin. 2021, 66, 2875–2889. [Google Scholar] [CrossRef]
  53. Hassen, H.; Al-Maadeed, S. Arabic handwriting recognition using sequential minimal optimization. In Proceedings of the 1st IEEE International Workshop on Arabic Script Analysis and Recognition, ASAR, Nancy, France, 3–5 April 2017. [Google Scholar]
  54. Al-Nuzaili, Q.; Al-Maadeed, S.; Hassen, H.; Hamdi, A. Arabic Bank Cheque Words Recognition Using Gabor Features. In Proceedings of the 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), London, UK, 12–14 March 2018; pp. 84–89. [Google Scholar] [CrossRef]
  55. Altwaijry, N.; Al-Turaiki, I. Arabic handwriting recognition system using convolutional neural network. Neural Comput. Appl. 2021, 33, 2249–2261. [Google Scholar] [CrossRef]
  56. Maalej, R.; Kherallah, M. Convolutional Neural Network and BLSTM for Offline Arabic Handwriting Recognition. In Proceedings of the ACIT 2018—19th International Arab Conference on Information Technology, Werdanye, Lebanon, 28–30 November 2018. [Google Scholar]
  57. Elleuch, M.; Maalej, R.; Kherallah, M. A New design based-SVM of the CNN classifier architecture with dropout for offline Arabic handwritten recognition. In Proceedings of the Procedia Computer Science, New York, NY, USA, 16–19 July 2016; Volume 80. [Google Scholar]
  58. El-Melegy, M.; Abdelbaset, A.; Abdel-Hakim, A.; El-Sayed, G. Recognition of Arabic Handwritten Literal Amounts Using Deep Convolutional Neural Networks. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Madrid, Spain, 1–4 July 2019; Volume 11868, pp. 169–176. [Google Scholar] [CrossRef]
  59. Jocher, G.; Stoken, A.; Borovec, J.; NanoCode012; ChristopherSTAN; Changyu, L.; Laughing; Hogan, A.; lorenzomammana; tkianai; et al. ultralytics/yolov5: v3.0; Zenodo: Genève, Switzerland, 2020. [Google Scholar] [CrossRef]
  60. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  61. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 10691–10700. [Google Scholar]
  62. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI’17: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 11–24. [Google Scholar] [CrossRef]
  63. Jamal, A.T.; Nobile, N.; Suen, C.Y. End-shape recognition for arabic handwritten text segmentation. In Proceedings of the IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Montreal, QC, Canada, 6–8 October 2014; Volume 8774, pp. 228–239. [Google Scholar] [CrossRef]
  64. Lamsaf, A.; Aitkerroum, M.; Boulaknadel, S.; Fakhri, Y. Text Line and Word Extraction of Arabic Handwritten Documents. In Innovations in Smart Cities Applications Edition 2; Ben Ahmed, M., Boudhir, A.A., Younes, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 492–503. [Google Scholar]
  65. Al-Nuzaili, Q.; Hamdi, A.; Hashim, S.Z.M.; Saeed, F.; Khalil, M.S. An enhanced quadratic angular feature extraction model for arabic handwritten literal amount recognition. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Berlin/Heidelberg, Germany, 2018; Volume 5, pp. 369–377. [Google Scholar] [CrossRef]
  66. Korichi, A.; Slatnia, S.; Tagougui, N.; Zouari, R.; Kherallah, M.; Aiadi, O. Recognizing Arabic Handwritten Literal Amount Using Convolutional Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and its Applications, El-Oued, Algeria, 28–30 September 2021; Volume 413, pp. 153–165. [Google Scholar] [CrossRef]
Figure 1. Proposed legal amount recognition end-to-end framework. The English explanation is provided specifically for non-Arabic speakers.
Figure 2. Samples of legal amount sentences. The English explanation is provided specifically for non-Arabic speakers.
Figure 3. YOLOv5 model structure for Arabic handwritten word detection. The English explanation is provided specifically for non-Arabic speakers.
Figure 4. The proposed hybrid classification pipeline for Arabic handwritten word recognition. The English explanation is provided specifically for non-Arabic speakers.
Figure 5. Sample of a legal amount image output from the word detection phase. The English explanation is provided specifically for non-Arabic speakers.
Figure 6. Sample of applying the LegalToCourtesy algorithm to calculate the courtesy amount value. The English explanation is provided specifically for non-Arabic speakers.
Figure 7. Training and validation convergence in terms of the loss function for the YOLOv5s-based word extraction model.
Figure 8. Prediction performance of the YOLOv5s-based word extraction model evaluated during training.
Figure 9. Confusion matrix of the YOLOv5-based word extraction model.
Figure 10. Sample of Arabic legal amount word detection results. The English explanation is provided specifically for non-Arabic speakers.
Figure 11. Performance assessment using the confusion matrix for the Hybrid A model.
Figure 12. Performance assessment using the confusion matrix for the Hybrid B model.
Figure 13. Samples of the proposed method's results with correctly generated courtesy amounts. The English explanation is provided specifically for non-Arabic speakers.
Figure 14. Samples of the proposed approach's ability to detect and classify in some complex cases: (a) word detection with overlapping letters and (b) classification of a word containing a spelling mistake. The English explanation is provided specifically for non-Arabic speakers.
Figure 15. Samples of improperly generated courtesy amounts: (a) incorrect word recognition and (b) inaccurate word detection.
Table 1. Arabic word-level image dataset in terms of 33 legal amount vocabulary classes. The Arabic vocabulary column consists of handwritten word images (i001–i033 in the original table). The English explanation is provided specifically for non-Arabic speakers.
Class # | English Meaning | Arabic Vocabulary | Class # | English Meaning | Arabic Vocabulary
1 | Eight | [image i001] | 18 | Seven | [image i002]
2 | Eight Hundred | [image i003] | 19 | Seven Hundred | [image i004]
3 | Eighty | [image i005] | 20 | Seventy | [image i006]
4 | Fifty | [image i007] | 21 | Six | [image i008]
5 | Five | [image i009] | 22 | Six Hundred | [image i010]
6 | Five Hundred | [image i011] | 23 | Sixty | [image i012]
7 | Forty | [image i013] | 24 | Ten | [image i014]
8 | Four | [image i015] | 25 | Thirty | [image i016]
9 | Four Hundred | [image i017] | 26 | Thousand | [image i018]
10 | Hundred | [image i019] | 27 | Three | [image i020]
11 | Million | [image i021] | 28 | Three Hundred | [image i022]
12 | Nine | [image i023] | 29 | Twenty | [image i024]
13 | Nine Hundred | [image i025] | 30 | Two | [image i026]
14 | Ninety | [image i027] | 31 | Two Hundred | [image i028]
15 | One | [image i029] | 32 | Two Million | [image i030]
16 | Only | [image i031] | 33 | Two Thousand | [image i032]
17 | Reyal | [image i033] | | |
Table 2. Distribution of dataset splitting for legal amount sentences and separated word images.
 | Training Set (70%) | Validation Set (20%) | Testing Set (10%) | Total
Original dataset | 369 | 122 | 75 | 566
Data augmentation | 1107 | 122 | 75 | 1304
Table 3. Distribution of word images for legal amounts by class.
Class # | Class Label | Training Set (70%) | Validation Set (20%) | Testing Set (10%) | Total
0 | Eight | 228 | 65 | 33 | 326
1 | Eight Hundred | 106 | 30 | 16 | 152
2 | Eighty | 215 | 61 | 32 | 308
3 | Fifty | 280 | 80 | 41 | 401
4 | Five | 268 | 76 | 40 | 384
5 | Five Hundred | 144 | 41 | 22 | 207
6 | Forty | 224 | 64 | 33 | 321
7 | Four | 226 | 64 | 33 | 323
8 | Four Hundred | 133 | 37 | 20 | 190
9 | Hundred | 261 | 74 | 38 | 373
10 | Million | 394 | 112 | 57 | 563
11 | Nine | 232 | 66 | 34 | 332
12 | Nine Hundred | 113 | 32 | 17 | 162
13 | Ninety | 219 | 62 | 32 | 313
14 | One | 226 | 64 | 33 | 323
15 | Only | 120 | 34 | 18 | 172
16 | Reyal | 270 | 77 | 39 | 386
17 | Seven | 244 | 70 | 36 | 350
18 | Seven Hundred | 118 | 34 | 18 | 170
19 | Seventy | 238 | 68 | 35 | 341
20 | Six | 230 | 66 | 34 | 330
21 | Six Hundred | 141 | 40 | 19 | 200
22 | Sixty | 235 | 67 | 35 | 337
23 | Ten | 226 | 64 | 34 | 324
24 | Thirty | 228 | 65 | 33 | 326
25 | Thousand | 394 | 112 | 58 | 564
26 | Three | 241 | 69 | 35 | 345
27 | Three Hundred | 148 | 42 | 22 | 212
28 | Twenty | 277 | 79 | 40 | 396
29 | Two | 443 | 126 | 65 | 634
30 | Two Hundred | 223 | 63 | 33 | 319
31 | Two Million | 203 | 58 | 29 | 290
32 | Two Thousand | 201 | 57 | 30 | 288
Total | | 7449 | 2119 | 1094 | 10,662
Table 4. YOLOv5s model assessment results for word detection in terms of PRE, RE, mAP50, mAP50–95, and inference time.
Images | Instances (Objects) | PRE | RE | mAP50 | mAP50–95 | Inference Time (ms)
75 | 1274 | 0.958 | 0.986 | 0.967 | 0.625 | 1.5 per image
Table 5. Classification assessment results of the hybrid deep learning models in terms of ACC, SEN, PRE, F1-Score, and AUC.
AI Hybrid Model | ACC | SEN | PRE | F1-Score | AUC
Hybrid A: Ensemble CNNs + ViT (parallel hybrid) | 0.9902 | 0.9817 | 0.9820 | 0.9817 | 0.9989
Hybrid B: Ensemble CNNs + ViT (serial hybrid) | 0.9753 | 0.9753 | 0.9762 | 0.9752 | 0.9983
Table 6. Classification assessment results of the hybrid models in terms of Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves.
AI Hybrid Model | ROC Macro-Average Area | ROC Micro-Average Area | PR Macro-Average Area | PR Micro-Average Area
Hybrid A: Ensemble CNNs + ViT (parallel hybrid) | 0.9989 | 0.9999 | 0.9928 | 0.9916
Hybrid B: Ensemble CNNs + ViT (serial hybrid) | 0.9980 | 0.9999 | 0.9916 | 0.9896
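For completeness, macro- and micro-averaged ROC and PR areas of the kind reported in Table 6 can be computed with standard tooling. The sketch below is illustrative only: it uses randomly generated placeholder labels and class probabilities instead of the actual hybrid model outputs, with 33 classes mirroring the AHLA label space.

```python
# Hedged sketch (placeholder data, not the paper's evaluation code):
# macro-/micro-averaged ROC-AUC and PR areas for a multi-class classifier.
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score, average_precision_score

n_classes = 33                                        # AHLA word classes
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, size=500)         # placeholder ground truth
y_prob = rng.dirichlet(np.ones(n_classes), size=500)  # placeholder class probabilities

y_onehot = label_binarize(y_true, classes=np.arange(n_classes))

roc_macro = roc_auc_score(y_onehot, y_prob, average="macro")
roc_micro = roc_auc_score(y_onehot, y_prob, average="micro")
pr_macro = average_precision_score(y_onehot, y_prob, average="macro")
pr_micro = average_precision_score(y_onehot, y_prob, average="micro")

print(roc_macro, roc_micro, pr_macro, pr_micro)
```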
Table 7. Comparison of the proposed word extraction method with existing Arabic word extraction methods.
Reference | Method | Dataset | Performance Accuracy (%)
Jamal (2014) [63] | Metric-based segmentation + ESL-based segmentation | IFN/ENIT | 93.135
Lamsaf (2019) [64] | Threshold of distances between connected components | AHDB | Word extraction rate = 87.9
Al-dmour (2014) [47] | Gap metric between connected components with the fuzzy c-means clustering algorithm | AHDB | Word extraction rate = 84.8
Neche (2019) [49] | Deep learning: Convolutional Neural Network + Bidirectional Long Short-Term Memory + Connectionist Temporal Classification | KHATT | 80.1
Gader (2022) [51] | Deep learning: Convolutional Neural Network + Attention + Convolutional Long Short-Term Memory + Connectionist Temporal Classification | AHDB; KHATT; IFN/ENIT | 92.8; 91.7; 94.1
Proposed word extraction method (2024) | YOLOv5-based word extraction | Own AHLA dataset | PRE = 96.3; SEN = 96.6; mAP = 98.6
Table 8. Comparison of the proposed word recognition method with existing Arabic word recognition methods.
Study | Method | Dataset | Performance Accuracy (%)
Al-Nuzaili (2018) [65] | Perceptual and Quadratic Angular features with an Extreme Learning Machine classifier | AHDB | 83.06
Hassen (2017) [53] | Multi-statistical features with a Sequential Minimal Optimization classifier | AHDB | 91.59
Al-Nuzaili (2018) [54] | Gabor features with Extreme Learning Machine (ELM) and Sequential Minimal Optimization (SMO) classifiers | AHDB; CENPARMI | 72.79 (ELM), 89.29 (SMO); 80.86 (ELM), 86.72 (SMO)
Aouadi (2016) [43] | Significant structural features with a Markovian classifier | IFN-ENIT | 87.0
El-Melegy (2019) [58] | CNN | AHDB | 97.85
Korichi (2022) [66] | CNN | AHDB | 98.50
Proposed word recognition method (2024) | Hybrid A: Ensemble CNNs + ViT | Own AHLA dataset | ACC = 99.02; SEN = 98.17; PRE = 98.20; F1-Score = 98.17; AUC = 99.89
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
