Article

A Multi-Layer Holistic Approach for Cursive Text Recognition

1 Faculty of Information Technology & Computer Science, University of Central Punjab, Lahore 54000, Pakistan
2 Faculty of Computers and Information Technology, University of Tabuk, Tabuk 47921, Saudi Arabia
3 Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore 54000, Pakistan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(24), 12652; https://doi.org/10.3390/app122412652
Submission received: 22 October 2022 / Revised: 24 November 2022 / Accepted: 1 December 2022 / Published: 9 December 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Urdu is widely spoken and written in several South Asian countries and in communities worldwide. Urdu text is relatively hard to recognize compared to other languages because of its cursive writing style. The Urdu script belongs to the non-Latin cursive family of scripts, like Arabic, Hindi and Chinese. Urdu is written in several styles, among which 'Nastaleeq' is the most popular and widely used font style. Localization/detection and recognition of Urdu Nastaleeq text remains a challenge, as the script follows a modified version of the Arabic script. This study presents a methodology to recognize and classify Urdu text in the Nastaleeq font, regardless of the text position in the image. The proposed solution comprises a two-step methodology. In the first step, text detection is performed using Connected Component Analysis (CCA) and a Long Short-Term Memory (LSTM) neural network. In the second step, a hybrid Convolutional Neural Network and Recurrent Neural Network (CNN-RNN) architecture recognizes the detected text. The image containing Urdu text is binarized and segmented to produce single-line text images, which are fed to the hybrid CNN-RNN model; the model recognizes the text and saves it in a text file. The proposed technique outperforms existing ones, achieving an overall accuracy of 97.47%.

1. Introduction

The development of automation has led to digitization in the modern world to save time and enhance capacity. Digitization is a significant stage in the progress of technology, and Optical Character Recognition (OCR) is an essential part of digitizing manuscripts and papers. OCR is the process of converting typed script in images into editable text. OCR systems are developed for a particular language, or for a particular font within a language, as needed. This research mainly tackles Urdu OCR for the Nastaleeq font in scanned document images.
South-Asian languages like Urdu, Hindi and Persian are cursive in nature. They are largely mutually intelligible as spoken languages, but they belong to different writing systems: Urdu and Persian follow a modified version of the Arabic script, while Hindi follows Devanagari [1]. Urdu is a popular language in many South-Asian countries, including Pakistan, India, Afghanistan and Nepal, and it is used by large communities in the USA, the UK, Australia, Canada and the Middle East [2]. Urdu is spoken as a first language by nearly 70 million people and as a second language by more than 100 million people, predominantly in Pakistan and India. Many official documents and works of literature need to be digitized, and this is where OCR comes in handy for Urdu text recognition. Due to its cursive nature, Urdu letters can take many written forms; these forms vary between isolated use and combination with other letters. These variations are briefly explained below in Section 1.1.

1.1. Urdu Script

Urdu script is written cursively from right to left, meaning that each letter in a word is joined to its neighbours and can take four different forms depending on its position. The four glyph forms of an Urdu letter, shown in Figure 1, are as follows:
  • Isolated.
  • Initial.
  • Middle.
  • Final.
In Figure 1, the shape of a single Urdu character is indicated in each word according to its position, along with the Roman Urdu and English translation of each word. Because Urdu letters join in these varied forms, the connected combination of letters that makes up a word unit is called a ligature.

Nastaleeq Typeface

Nastaleeq is the preferred font for Urdu language scripting [3]. Nastaleeq supports OpenType Format (OTF) font specifications and has been available with Microsoft Windows since Windows 2000 for writing official documentation, books and manuscripts.

1.2. Text Recognition

Text or script recognition is the procedure of extracting typed text from images; OCR is used for this purpose. Each language can have different OCR systems depending on text and image properties, and an OCR targeting one language needs to follow the rules of that language to produce effective recognition results. Once documents are converted into digital copies, the written content can easily be modified as needed. There is therefore a strong need for text recognition solutions that fulfill the basic requirements of an Urdu OCR, and finding such a solution is the major obstacle in digitizing Urdu documentation. Several solutions exist, but each has limitations, leaving a gap in finding an optimal one. Urdu script is cursive and very different from Latin script, and it has multiple fonts and font styles. Because of the limited research on Urdu, only a small amount of data is available. The biggest challenge is to localize the words or sentences in an image before recognizing the text: images undergoing text recognition can have different background colors and text types for each source, so localizing the text in the image before identification is a necessary step. Furthermore, a central reason for the low accuracy in past research is the lack of suitable datasets; each study with its own requirements needs a dataset that matches those requirements. Due to the complex nature of Urdu text styles, text detection and recognition is a complicated problem. The main contributions of this research work are as follows:
  • A unified approach for position free Urdu text detection through Connected Component Analysis (CCA) and Long Short-Term Memory (LSTM).
  • A hybrid Convolution Neural Network and Recurrent Neural Network (CNN-RNN) approach for text recognition.
  • Evaluation of the proposed methodology on self-synthesized and UPTI data sets.

2. Literature Review

Several studies have been conducted on script recognition, mostly for English text. However, comparatively little research has addressed Urdu, and no single work covers every aspect of Urdu writing. The following section briefly reviews research on Urdu text recognition.
Explicitly segmenting text into characters is a highly challenging task in cursive scripts like Urdu; instead, research has been carried out that segments text into characters implicitly with the help of deep learning. The literature on Urdu text recognition can be organized by the unit of recognition, for example characters or ligatures. The research community has created two benchmark datasets of printed Urdu text: the Urdu Printed Text Line Images (UPTI) dataset [4,5] and the Center of Language Engineering (CLE) dataset. The UPTI dataset consists of more than 10,000 lines of text, while the CLE dataset consists of more than 2000 groups of high-frequency Urdu ligatures and scanned pages of books.
Considering the strategies evaluated on the CLE dataset, Javed et al. subdivide ligatures at branching points. Raw pixel values are used to train Hidden Markov Models (HMMs) on each ligature fragment, and evaluation on 1692 essential ligatures yields an accuracy of 92%. This work is extended in [6,7,8], in which 250 graphemes are used to train HMMs. In [9], a detailed recognition approach is presented that uses Discrete Cosine Transform (DCT) features and HMMs; the authors modified the Tesseract engine to include recognition of Urdu ligatures, achieving around a 97% recognition rate for ligatures in fixed font sizes. In [10], a test on 6000 query ligatures achieved an identification rate of 97.93%, with statistical features taken from 2028 high-frequency ligature (HFL) clusters; the association between primary and secondary ligatures was also handled.
Among different identification techniques, Sabbour and Shafait [11] use object descriptors to recognize ligatures in the UPTI database, resulting in an 89% recognition rate. Many implicit segmentation-based techniques have also been evaluated on the UPTI database [12,13,14]. In such methods, the learning parameters are determined from the ground-truth transcription, and the images of text lines are provided to the learning algorithm. Refs. [15,16] used statistical features with multi-dimensional Recurrent Neural Networks (MD-RNN), whereas [17,18] used bidirectional Long Short-Term Memory (LSTM) networks on raw pixel values. Ligature-based approaches evaluated on the UPTI database are fewer because the database is labeled at line level, so ligatures have to be extracted from it manually. The study [19] reports a 95% classification rate on around 100 ligature classes. In [20,21], statistical features are used with HMMs to identify more than 1500 ligature classes in the UPTI database.
From the perspective of video text identification, research on cursive scripts is limited. A few video OCRs targeting recognition of Arabic text have been reported in the literature [22]. In [23], the authors deployed deep learning-based methods for identifying Arabic video text; three classifiers, a deep belief network, a multilayer perceptron, and a deep autoencoder, achieved recognition rates of 90.73%, 88.5%, and 94.36%, respectively, evaluated on a database containing 47 h of video. The studies [8,24] present an approach for detecting, localizing, and recognizing Urdu text in videos, obtaining identification rates of around 91–93%; the system is evaluated on 10 h of video from each of three different channels.
A segmentation-based technique for OCR using a Support Vector Machine (SVM) was proposed in [25], evaluating local and global features on segmented cursive characters.
Most of the work in the domain of Urdu OCR has relied on handcrafted features and nearest-neighbour approaches to perform identification. Ref. [26] used contour extraction techniques and shape context information to create feature descriptors for each ligature/glyph. Similarly, [27] used morphological operations and character-specific filters to pre-process each segmented character/glyph from a line image, then applied a heuristics-based approach on the character chain code to determine which class label (Urdu glyph) the segmented image should be assigned. In [28], a CNN takes the input data and extracts low-level features, which are then fed to an MDLSTM; all resulting vectors are concatenated into a single vector and passed to a Connectionist Temporal Classification (CTC) transcription layer.
Supervised learning methods for text recognition [29,30] are generally more complex than unsupervised procedures and classify text and non-text blocks. AI approaches are among the most recent, and an Artificial Neural Network (ANN) is used in [31]. These AI-based methodologies require a large amount of training data to achieve adequate classification rates. A Character Proposal Network (CPN) that searches for character proposals is used for text identification in [32], while [28] learns a set of mid-level primitives ("strokelets") that capture the segmental structure of characters.
Unsupervised methodologies are commonly categorized as edge-based, CCA-based, texture-based, and color-based strategies. In unsupervised techniques, image analysis approaches and segmentation strategies (edges, spatial grouping) are used to distinguish text from the rest of the image. Texture-based strategies [33,34] treat textual content in the image as a distinctive texture that separates it from the non-text parts; texture attributes are computed from gray-level images or by first transforming the image through filtering or frequency-domain transforms. Numerous investigations [35] apply various textural measures to detect and verify text regions in an image.
Edge-based methodologies [36] exploit the high contrast between text and its background by searching for edges in the image; regions of high edge density are then merged under specific heuristics to filter out non-text areas. Connected-component-based techniques [37,38] exploit the color/intensity of text pixels along with geometric heuristics to separate text from the background.
Although multilingual pretraining can generally enhance a model's monolingual performance, it has been demonstrated that as a multilingual model learns more languages, the capacity available for each language falls off [39]. Recent research has shown that multilingual language models perform worse than their monolingual equivalents [40]. The challenges of multilingual models are explained in detail in [41]. Even bilingual language modeling has been found to perform better than multilingual modeling [42]. One study shows that monolingual models outperform traditional multilingual models on all datasets; moreover, better sentence representations are also generated by the monolingual models [43]. However, training and maintaining a monolingual model for each language is not cost-effective in terms of time and resources, and since Urdu is a resource-starved language, large pre-trained language models cannot be built from a single corpus.
A significant amount of work has previously been done on object detection in photos using deep neural networks [44]. In another study, fast text classification is performed by processing only the text part of the image instead of the whole picture [45]. Considering text as an object, existing object-detection methodologies can be applied to text detection. This concept is mainly followed in the detection and recognition of text in different languages, including English [46,47,48,49], Chinese [50,51,52], Arabic [53,54], Persian [55,56], Urdu [34,57,58,59,60] and Hindi [61,62,63]. Some languages (including Urdu) have multiple ligatures and orthographic forms. Manuscripts of the same language can have their own unique writing style, which can create script lines of different lengths with the same word appearing at different locations. Moreover, a manuscript can contain numerous font styles and colors within a single image, as shown in Figure 2.
Within the same text, character size and spacing are largely consistent; however, this may not hold in complex scenes. For arbitrary text positions, some studies have proposed solutions for reliable detection of oriented scene text [64]. A recent study has proposed a dynamic receptive field adaptation method for reliably identifying scene text in English [65]. However, for cursive languages like Urdu, challenges remain; for example, text localization is relatively hard when there is no symmetry in text placement and style. A survey paper on the history and progress of text detection and recognition from images summarizes the benefits and drawbacks of different approaches and also describes the conventional methods in detail [66].
Over the years, researchers have proposed solutions to the challenge of Urdu text recognition. Several use distinctive methods, but none covers the problem completely. A gap remains in accurately localizing the text in a provided image as well as in identifying highly cursive words.

3. Proposed Methodology

This section contains a detailed description of the proposed methodology for developing the Nastaleeq Urdu font OCR. Any OCR has two functions: text localization and text recognition. The proposed system consists of a hybrid CNN-RNN architecture for text recognition, with text localization carried out by CCA and LSTM.

3.1. Dataset Description

The Urdu language is not spoken in many parts of the globe; thus, Urdu datasets are not readily available, and despite its significance the amount of Urdu data is much smaller than that for Arabic and English. This study uses the previously available UPTI dataset (described in Section 3.1.2) and a self-synthesized dataset. The self-synthesized dataset is used for the initial training of the proposed model, which is then further trained on the UPTI dataset.

3.1.1. Self-Synthesized Dataset

The main aim of this study is to detect and recognize Urdu text from scanned images of documents written in the Nastaleeq typeface, so pages from an Urdu novel were scanned for early model training. The self-synthesized dataset contains 71 scanned images of Urdu documents, comprising 144,788 characters in total, along with the text files associated with the images. This small-scale dataset was used for the initial training of the model on basic scanned documents; the model was then improved by training it on a larger dataset.

3.1.2. Available Dataset

For extensive model training, the adopted dataset for the Nastaleeq typeface is the Urdu Printed Text Line Images (UPTI) dataset. The UPTI [24] contains 10,063 Urdu line images and 187,039 ligatures. The length of the text in the images varies because the source material is varied; for example, images containing poetry have shorter script lines than those containing regular manuscript text. This variation in the images helped the model train more accurately and efficiently.

3.1.3. Dataset Transcription

Before training the machine learning models, the dataset must be transcribed. Dataset transcription in the proposed study is divided into the following steps:
  • Binarization of the scanned image;
  • Separation of the text lines of every image into new, separate images;
  • Generation of a ground truth text file for each text line image, and of a manifest file that lists the text line image file names;
  • Updating of each ground truth text file with the text of the corresponding text line image.
Ground truth text files are generated automatically by a script for each text line image. The procedure of generating text line images, manifest files, and ground truth text files is shown in the block diagram of Figure 3.
Figure 3 shows the process through which document images are passed to the transcription stage, in which each text line is separated using pixel values and (x, y) coordinates. A ground truth text file is generated for every text line image and then manually updated with the corresponding text of the image. The manifest is a file that lists the file names of all the text line images. These generated files are then used to train the developed model.
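The transcription steps above can be scripted end to end. Below is a minimal sketch of such a pipeline, assuming OpenCV and a simple horizontal-projection line splitter; the transcribe_document helper, the file-naming scheme, and the empty ground-truth stubs (to be filled manually, as described above) are illustrative rather than the authors' actual tooling.

import os
import cv2

def transcribe_document(image_path, out_dir, manifest_path):
    """Binarize a scanned page, split it into line images, and write a
    ground-truth stub plus a manifest entry for each line."""
    os.makedirs(out_dir, exist_ok=True)
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu binarization; text becomes white (255) on black for the projection
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Horizontal projection: rows containing ink belong to a text line
    row_has_ink = binary.sum(axis=1) > 0
    lines, start = [], None
    for y, ink in enumerate(row_has_ink):
        if ink and start is None:
            start = y
        elif not ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(row_has_ink)))

    base = os.path.splitext(os.path.basename(image_path))[0]
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        for i, (y0, y1) in enumerate(lines):
            img_name = f"{base}_line{i:03d}.png"
            cv2.imwrite(os.path.join(out_dir, img_name), gray[y0:y1, :])
            # Empty ground-truth file, to be filled manually with the Urdu text
            open(os.path.join(out_dir, img_name.replace(".png", ".gt.txt")),
                 "w", encoding="utf-8").close()
            manifest.write(img_name + "\n")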

3.2. Text Detection

Text detection is the first and foremost step of any OCR, including this study. It refers to localizing the text lines in a given image. This study proposes a position-free text localization approach based on CCA and LSTM. The next section elucidates the unified application of these techniques for text detection; the recognition of the detected text using a hybrid CNN-RNN model is also explained.

3.2.1. Connected Component Analysis

CCA is used to find the pixels that belong to the same connected region, i.e., neighbouring pixels that share the same set of values V [25]. For a binary image V = {1}, while for a grayscale image V may be a range of intensity values, for example V = {41, 42, 43, ..., 87, 88, 89, 90}; connected component labeling can therefore be applied to both binary and gray-level images. The labeling operator scans the image row by row until it reaches a pixel p whose value lies in V. When such a pixel is found, the operator examines the neighbours of p that have already been visited during the scan, i.e., the neighbour to the left, the one above, and the two upper diagonal neighbours. Based on these neighbours, p is labeled as follows:
  • If all of these neighbours are 0, assign a new label to p.
  • If exactly one neighbour has a value in V, assign its label to p.
  • If more than one neighbour has a value in V, assign one of their labels to p and note the equivalences between the labels.
After the scan is complete, the equivalent label pairs are grouped into equivalence classes, and each class is assigned a unique label. In the final phase, a second pass is made over the image, in which each label is replaced by the label of its equivalence class.
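In practice, this two-pass labeling and the bounding boxes it yields are available off the shelf. The sketch below, which assumes OpenCV, shows how connected components and their bounding boxes might be extracted from a binarized page; the area thresholds are illustrative, not values from the paper.

import cv2

def extract_blob_boxes(binary_img, min_area=50, max_area=50000):
    """Label connected components in a binarized image and return bounding
    boxes of blobs whose area is plausible for text."""
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary_img, connectivity=8)
    boxes = []
    for label in range(1, num_labels):  # label 0 is the background
        x, y, w, h, area = stats[label]
        if min_area <= area <= max_area:
            boxes.append((x, y, w, h))
    return boxes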

3.2.2. The Process of Text Detection

The process of identifying text in an image takes several steps. First, each input image is binarized to improve the visibility of the text. The binarized image is then passed through CCA and LSTM, where text blobs are separated from blobs of other kinds. For this purpose, CCA is applied to the image and identifies the region occupied by each blob, i.e., the (x, y) coordinates of its pixels, from which a bounding box is formed. Next, the pixel values within each bounding box are passed to the pre-trained LSTM model, which classifies whether the box contains text or some other type of object. Bounding boxes classified as text are cropped into text line images, ready to be fed to the neural network that recognizes the text. The process works as follows:
 For each scanned image:
   binarize the image;
   pass the binarized image to CCA;
    the image is segmented into blobs of objects;
    pass each blob to the pretrained LSTM model;
     if the blob contains text:
      crop a new image using the blob dimensions;
      store the new image locally for text recognition;
     else:
      ignore the blob;
   move to the next image.
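The pseudocode above can be expressed as a short Python routine. The sketch below reuses the extract_blob_boxes helper from Section 3.2.1 and treats the paper's pretrained LSTM text/non-text classifier as an opaque callable, here named lstm_is_text; that name and its interface are assumptions for illustration.

import cv2

def detect_text_lines(image_path, lstm_is_text, min_area=50, max_area=50000):
    """Binarize a scanned page, find candidate blobs via CCA, keep those the
    pretrained LSTM classifies as text, and return the cropped line images."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    crops = []
    for (x, y, w, h) in extract_blob_boxes(binary, min_area, max_area):
        patch = gray[y:y + h, x:x + w]
        # lstm_is_text stands in for the paper's pretrained LSTM: it receives
        # the patch pixels and returns True when the blob contains text.
        if lstm_is_text(patch):
            crops.append(patch)
    return crops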

3.3. Recognition of the Text

The principal purpose of any OCR is to identify the text, and the effectiveness of an OCR depends on how accurately and rapidly the text is found. A variety of neural networks and pattern-matching approaches can be used; the choice of recognition method determines the system's accuracy, speed, and memory usage. Ideally, the system should be efficient, robust, fast, and lightweight in terms of memory for the best results. The model applied in this study is a lightweight hybrid CNN-RNN model.

3.3.1. Structure of the Proposed Model

The proposed structure incorporates a hybrid of CNN and RNN architectures: the CNN extracts features, and the RNN performs text recognition on the extracted features. The proposed model consists of twelve layers, which are as follows:
  • A convolutional layer with 3 × 3 filter size, 32 nodes and ReLU activation function;
  • A dropout layer with probability of 0.1;
  • A max-pooling layer with 2 × 2 kernel and stride value 2;
  • A convolutional layer with 3 × 3 filter size, 64 nodes and ReLU activation function;
  • A dropout layer with probability of 0.1;
  • A max-pooling layer with 2 × 2 kernel and stride value 2;
  • A reshape layer;
  • A recurrent neural network layer with 64 nodes;
  • A recurrent neural network layer with 128 nodes;
  • A recurrent neural network layer with 256 nodes;
  • A dropout layer with probability of 0.5;
  • A regularization layer.
Figure 4 represents the implemented model. The model is a twelve-layer architecture that takes a single text line image at a time, together with the corresponding ground truth text file, and trains itself in a supervised manner. The model scores each text line image against its ground truth file using the Connectionist Temporal Classification (CTC) loss function.
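For concreteness, a minimal tf.keras sketch of such a twelve-layer stack is given below. The input size, the use of plain SimpleRNN cells, the Permute before the reshape, and the final Dense softmax projection needed for CTC are assumptions not stated in the paper; the paper's "regularization layer" is only approximated here by the final dropout and output projection.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_rnn(img_height=32, img_width=256, num_classes=100):
    """Sketch of the 12-layer hybrid CNN-RNN line recognizer (sizes assumed)."""
    inp = layers.Input(shape=(img_height, img_width, 1), name="line_image")

    # Convolutional feature extractor (layers 1-6 of the list above)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = layers.Dropout(0.1)(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)

    # Reshape layer (layer 7): one feature vector per horizontal position
    new_h, new_w = img_height // 4, img_width // 4
    x = layers.Permute((2, 1, 3))(x)            # (width, height, channels)
    x = layers.Reshape((new_w, new_h * 64))(x)

    # Recurrent stack and dropout (layers 8-11)
    x = layers.SimpleRNN(64, return_sequences=True)(x)
    x = layers.SimpleRNN(128, return_sequences=True)(x)
    x = layers.SimpleRNN(256, return_sequences=True)(x)
    x = layers.Dropout(0.5)(x)

    # Per-time-step class scores, to be trained and decoded with CTC
    out = layers.Dense(num_classes, activation="softmax", name="ctc_scores")(x)
    return models.Model(inp, out)

Here num_classes is a placeholder covering the Urdu symbol inventory plus the CTC blank; the actual class count is not stated in the paper.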

3.3.2. Model Training

The dataset is divided into two parts using an 80:20 ratio: 80% of the dataset is used for training, and the remaining 20% is reserved for evaluation and validation. The flow of the training process from scratch can be observed in the block diagram in Figure 5.
In Figure 5, the process of training the model with dataset transcription can be seen. The document images are transcribed to generate a single-line text image, a separate ground truth text file, and a manifest file. Each text line image is picked from the directory along with its ground truth text file and is fed to the model for supervised learning. A CoreML model is generated at the end of each epoch. The training process is stopped based on “early stopping”, and the CoreML model with the best accuracy is saved at the end of the process.
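A training loop matching this description could look roughly like the sketch below. It assumes the build_cnn_rnn model from the previous listing, line images x_train/x_val with padded integer label sequences y_train/y_val from the 80:20 split, and illustrative hyperparameters; the CTC wrapper uses full padded lengths for brevity, and the Core ML export is indicated only as a comment.

import tensorflow as tf

def ctc_loss(y_true, y_pred):
    """CTC loss wrapper; for brevity the full padded lengths are used here,
    whereas real training should pass the true sequence lengths."""
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_cnn_rnn()                       # sketch from Section 3.3.1
model.compile(optimizer="adam", loss=ctc_loss)

callbacks = [
    # "Early stopping": halt when validation loss stops improving
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Keep the best model seen so far, analogous to saving the best epoch
    tf.keras.callbacks.ModelCheckpoint("best_urdu_ocr.keras", monitor="val_loss",
                                       save_best_only=True),
]

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=32, epochs=100, callbacks=callbacks)

# The paper exports a Core ML model at the end of each epoch; with coremltools
# the final conversion would look roughly like:
#   import coremltools as ct
#   ct.convert(model).save("urdu_ocr.mlpackage")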

3.4. Experimentation

Multiple scanned images of different documents are passed to the system for experimentation. During this process, each document image is binarized, text localization is applied to the binarized image, and the text lines are cropped. The extracted text line images are passed to the CoreML model one at a time, which recognizes the text. The recognized text from every iteration is written to a file, which is then stored on the local machine. Figure 6 shows the flow of the proposed system.
In Figure 7, the input image passed to the proposed method and the output text generated and written to the text file can be observed.
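An end-to-end pass over one scanned page, matching Figures 6 and 7, could be sketched as follows. It reuses detect_text_lines from Section 3.2.2 and the trained model from Section 3.3; the preprocess helper, the index_to_char mapping, and the file names are assumptions for illustration, not part of the paper.

import cv2
import numpy as np
import tensorflow as tf

def preprocess(img, height=32, width=256):
    """Resize a grayscale crop to the model input size and scale to [0, 1]."""
    img = cv2.resize(img, (width, height)).astype("float32") / 255.0
    return img[..., np.newaxis]

def recognize_lines(line_images, model, index_to_char):
    """Run the trained CNN-RNN on cropped line images and greedily decode
    the CTC output into text strings."""
    texts = []
    for img in line_images:
        x = preprocess(img)
        y_pred = model.predict(x[None, ...])      # shape (1, time_steps, num_classes)
        seq_len = np.array([y_pred.shape[1]])
        decoded, _ = tf.keras.backend.ctc_decode(y_pred, seq_len, greedy=True)
        labels = decoded[0][0].numpy()
        texts.append("".join(index_to_char[i] for i in labels if i >= 0))  # -1 is padding
    return texts

# Flow of Figure 6: detect the lines, recognize them, write the text file.
lines = detect_text_lines("scanned_page.png", lstm_is_text)
with open("scanned_page.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(recognize_lines(lines, model, index_to_char)))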

4. Results

This section presents the experimental results, the performance metrics, and the accuracy comparison of the proposed model with previously available solutions.

4.1. Variable Layer Architectures

Three different architectures of the CNN-RNN hybrid model were developed, each with two CNN layers of 32 and 64 nodes, respectively, and a variable number of RNN layers, as follows:
  • One RNN layer with 100 nodes.
  • Two RNN layers with 128 and 256 nodes, respectively.
  • Three RNN layers with 64, 128 and 256 nodes, respectively.
The three architectures are trained and evaluated on the same dataset, and Table 1 shows the results.
As can be seen in Table 1, training and validation accuracy increase with the number of RNN layers. The accuracy progression of the proposed model can be seen in Figure 8.
Figure 8 shows the training accuracy of the model over the epochs. The model started at a low accuracy that was initially inconsistent, but as training progressed the accuracy became more consistent. When the accuracy was almost constant, the training process was stopped on the basis of early stopping.

4.2. Performance Metrics

A confusion matrix is an efficient way to check the performance of a supervised learning process. The following metrics are derived from the confusion matrix, as given in Equations (1)–(4).
Sensitivity = (TP × 100) / (TP + FN)    (1)
Specificity = (TN × 100) / (TN + FP)    (2)
Positive predicted value = (TP × 100) / (TP + FP)    (3)
Negative predicted value = (TN × 100) / (TN + FN)    (4)
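For reference, the four quantities in Equations (1)–(4) can be computed directly from the confusion-matrix counts; the small helper below is a plain restatement of those equations, not the authors' evaluation code.

def confusion_metrics(tp, tn, fp, fn):
    """Equations (1)-(4): percentages derived from confusion-matrix counts."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "positive_predicted_value": 100.0 * tp / (tp + fp),
        "negative_predicted_value": 100.0 * tn / (tn + fn),
    }

# Example usage with hypothetical counts:
# confusion_metrics(tp=35_000, tn=3_400, fp=120, fn=460)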

4.2.1. Training Confusion Matrix

A total of 144,788 characters are used in training, for which the confusion matrix is presented in Table 2.
The outcomes of the implemented system of training are mentioned in Table 3.

4.2.2. Validation Confusion Matrix

A total of 36,197 characters are used in validation, for which the confusion matrix is presented in Table 4.
The outcomes of the validation are mentioned in Table 5.

4.2.3. UPTI Validation Confusion Matrix

A total of 493,204 characters from the UPTI dataset are used in validation, for which the confusion matrix is presented in Table 6.
The outcomes of the validation are mentioned in Table 7.

4.2.4. Sensitivity Analysis

The degree to which positive classes are correctly identified as positive is the recall or sensitivity; as the rate of correctly identified positives increases, so does the efficiency of the model. With a validation sensitivity of 97.41%, the proposed model recognizes text with high efficiency.

4.3. Accuracy Comparison

The proposed system achieves an accuracy of 98.45% during training and 97.47% on validation. Table 8 compares this accuracy with models that have denser architectures and were trained on GPUs using the UPTI dataset.
The proposed system scores the highest accuracy upon validation on the same dataset used by the compared architectures. It not only exceeds the previous accuracy levels but also stands out as an architecture that runs efficiently on a CPU.

5. Conclusions

This research proposed an efficient and lightweight OCR-based text detection and recognition methodology for Urdu written in Nastaleeq font. The proposed solution is a multi-layer hybrid CNN-RNN architecture for text recognition in combination with CCA and LSTM for text detection. In comparison to existing techniques, this architecture offers a promising accuracy of 97.47%. In the future, the model could be tested for font-free cursive text.

Author Contributions

Conceptualization, M.U., M.Z. and F.D.; methodology, M.U., M.Z. and F.D.; formal analysis, S.A., M.U. and M.Z.; investigation, M.U., M.Z., S.A. and M.S.B.; supervision, M.U. and M.Z.; writing—original draft preparation, M.U. and M.Z.; writing—review and editing, M.U., M.Z., F.D., S.A., M.S.B., M.H. and A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data is publicly available at https://tukl.seecs.nust.edu.pk/downloads.html (accessed on 10 January 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

Code is available on request ([email protected]).

Abbreviations

The following abbreviations are used in this manuscript:
CCA	Connected Component Analysis
LSTM	Long Short-Term Memory
OCR	Optical Character Recognition
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
OTF	Open Type Format
UPTI	Urdu Printed Text Line Images
CLE	Center of Language Engineering
DCT	Discrete Cosine Transform
ANN	Artificial Neural Network
HMM	Hidden Markov Model
CTC	Connectionist Temporal Classification
MD-RNN	Multi-Dimensional Recurrent Neural Network
CPN	Character Proposal Network
ML	Machine Learning

References

  1. Hindustani Language. Available online: https://www.britannica.com/topic/Hindustani-language (accessed on 21 November 2022).
  2. World Data.info. Urdu as Language—Urdu Speaking Countries. Available online: https://www.worlddata.info/languages/urdu.php (accessed on 22 October 2022).
  3. Computers & Writing Systems. Nastaliq Navees Features—Preferred Urdu Language Script. Available online: https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=nastaliq_features (accessed on 17 January 2022).
  4. Javed, S.T.; Hussain, S.; Maqbool, A.; Asloob, S.; Jamil, S.; Moin, H. Segmentation Free Nastalique Urdu OCR. Int. J. Comput. Inf. Eng. 2010, 4, 1514–1519. [Google Scholar]
  5. Ud Din, I.; Siddiqi, I.; Khalid, S.; Azam, T. Segmentation-free optical character recognition for printed Urdu text. EURASIP J. Image Video Process. 2017, 2017, 62. [Google Scholar] [CrossRef] [Green Version]
  6. Hussain, S.; Ali, S.; Akram, Q.U.A. Nastalique segmentation-based approach for Urdu OCR. Int. J. Doc. Anal. Recognit. (IJDAR) 2015, 18, 357–374. [Google Scholar] [CrossRef]
  7. Hayat, U.; Aatif, M.; Zeeshan, O.; Siddiqi, I. Ligature Recognition in Urdu Caption Text using Deep Convolutional Neural Networks. In Proceedings of the 2018 14th International Conference on Emerging Technologies (ICET), Ohrid, North Macedonia, 21–22 November 2018; pp. 1–6. [Google Scholar] [CrossRef]
  8. Zhang, D.; Liu, Y.; Wang, Z.; Wang, D. OCR with the deep CNN model for ligature script-based languages like Manchu. Sci. Program. 2021, 2021, 5520338. [Google Scholar] [CrossRef]
  9. Akram, Q.U.A.; Hussain, S.; Niazi, A.; Anjum, U.; Irfan, F. Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique. In Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems, Tours, France, 7–10 April 2014; pp. 191–195. [Google Scholar] [CrossRef]
  10. Akram, Q.u.A.; Hussain, S.; Adeeba, F.; ur Rehman, S.; Saeed, M. Framework of Urdu Nastalique Optical Character Recognition System; University of Engineering and Technology: Lahore, Pakistan, 2014. [Google Scholar]
  11. Sabbour, N.; Shafait, F. A segmentation-free approach to Arabic and Urdu OCR. Document Recognition and Retrieval XX ADS Bibcode: 2013SPIE.8658E..0NS. In Proceedings of the IS&T/SPIE Electronic Imaging Symposium, Burlingame, CA, USA, 5–7 February 2013; Volume 8658, p. 86580N. [Google Scholar] [CrossRef] [Green Version]
  12. Javed, N.; Shabbir, S.; Siddiqi, I.; Khurshid, K. Classification of Urdu Ligatures Using Convolutional Neural Networks—A Novel Approach. In Proceedings of the 2017 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 18–20 December 2017; pp. 93–97. [Google Scholar] [CrossRef]
  13. Halima, M.B.; Karray, H.; Alimi, A.M. A Comprehensive Method for Arabic Video Text Detection, Localization, Extraction and Recognition. In Proceedings of the Conference on Advances in Multimedia Information Processing—PCM, Shanghai, China, 21–24 September 2010; Qiu, G., Lam, K.M., Kiya, H., Xue, X.Y., Kuo, C.C.J., Lew, M.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 648–659. [Google Scholar] [CrossRef]
  14. Camastra, F. A SVM-based cursive character recognizer. Pattern Recognit. 2007, 40, 3721–3727. [Google Scholar] [CrossRef]
  15. Nawaz, T. Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Matching Technique; University of Engineering and Tehnology: Taxila, Pakistan, 2004; pp. 92–104. [Google Scholar]
  16. Ahmed, S.B.; Naz, S.; Razzak, M.I.; Rashid, S.F.; Afzal, M.Z.; Breuel, T.M. Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Comput. Appl. 2016, 27, 603–613. [Google Scholar]
  17. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376. [Google Scholar] [CrossRef]
  18. Ul-Hasan, A.; Ahmed, S.B.; Rashid, F.; Shafait, F.; Breuel, T.M. Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington DC, USA, 25–28 August 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1061–1065. [Google Scholar]
  19. Naz, S.; Umar, A.I.; Ahmed, R.; Razzak, M.I.; Rashid, S.F.; Shafait, F. Urdu Nasta’liq text recognition using implicit segmentation based on multi-dimensional long short term memory neural networks. SpringerPlus 2016, 5, 2010. [Google Scholar] [CrossRef] [Green Version]
  20. Naz, S.; Ahmed, S.B.; Ahmad, R.; Razzak, M.I. Zoning Features and 2DLSTM for Urdu Text-line Recognition. Procedia Comput. Sci. 2016, 96, 16–22. [Google Scholar] [CrossRef] [Green Version]
  21. Naz, S.; Umar, A.I.; Ahmad, R.; Ahmed, S.B.; Shirazi, S.H.; Siddiqi, I.; Razzak, M.I. Offline cursive Urdu-Nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing 2016, 177, 228–241. [Google Scholar] [CrossRef]
  22. Yin, X.C.; Zuo, Z.Y.; Tian, S.; Liu, C.L. Text Detection, Tracking and Recognition in Video: A Comprehensive Survey. IEEE Trans. Image Process. 2016, 25, 2752–2773. [Google Scholar] [CrossRef]
  23. Yousfi, S.; Berrani, S.A.; Garcia, C. Deep Learning and Recurrent Connectionist-based Approaches for Arabic Text Recognition in Videos. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Lausanne, Switzerland, 23–26 August 2015. [Google Scholar]
  24. Ahmad, I.; Wang, X.; Mao, Y.h.; Liu, G.; Ahmad, H.; Ullah, R. Ligature based Urdu Nastaleeq sentence recognition using gated bidirectional long short term memory. Clust. Comput. 2018, 21, 703–714. [Google Scholar] [CrossRef]
  25. Khattak, I.U.; Siddiqi, I.; Khalid, S.; Djeddi, C. Recognition of Urdu ligatures-a holistic approach. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 71–75. [Google Scholar]
  26. Nicolaou, A.; Bagdanov, A.D.; Gómez, L.; Karatzas, D. Visual Script and Language Identification. In Proceedings of the 2016 12th IAPR Workshop on Document Analysis Systems (DAS), Santorini, Greece, 11–14 April 2016; pp. 393–398. [Google Scholar] [CrossRef] [Green Version]
  27. Ren, X.; Chen, K.; Yang, X.; Zhou, Y.; He, J.; Sun, J. A new unsupervised convolutional neural network model for Chinese scene text detection. In Proceedings of the 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, China, 12–15 July 2015; pp. 428–432. [Google Scholar] [CrossRef]
  28. Bai, X.; Yao, C.; Liu, W. Strokelets: A Learned Multi-Scale Mid-Level Representation for Scene Text Recognition. IEEE Trans. Image Process. 2016, 25, 2789–2802. [Google Scholar] [CrossRef]
  29. Wen, W.; Huang, X.; Yang, L.; Yang, Z.; Zhang, P. An Efficient Method for Text Location and Segmentation. In Proceedings of the 2009 WRI World Congress on Software Engineering, Washington, DC, USA, 19–21 May 2009; Volume 3, pp. 3–7. [Google Scholar] [CrossRef]
  30. Pan, Y.F.; Hou, X.; Liu, C.L. A Hybrid Approach to Detect and Localize Texts in Natural Scene Images. IEEE Trans. Image Process. 2011, 20, 800–813. [Google Scholar] [CrossRef]
  31. Sami-Ur-Rehman, B.; Tayyab, B.U.; Naeem, M.F.; Ul-Hasan, A.; Shafait, F. A Multi-faceted OCR Framework for Artificial Urdu News Ticker Text Recognition. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24 April 2018; pp. 211–216. [Google Scholar] [CrossRef]
  32. Zhang, S.; Lin, M.; Chen, T.; Jin, L.; Lin, L. Character proposal network for robust text extraction. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 2633–2637. [Google Scholar] [CrossRef] [Green Version]
  33. Javed, S.T.; Hussain, S. Segmentation Based Urdu Nastalique OCR. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications; Lecture Notes in Computer Science; Ruiz-Shulcloper, J., Sanniti di Baja, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 41–49. [Google Scholar] [CrossRef] [Green Version]
  34. Arafat, S.Y.; Iqbal, M.J. Urdu-text detection and recognition in natural scene images using deep learning. IEEE Access 2020, 8, 96787–96803. [Google Scholar] [CrossRef]
  35. Khatri, M.J.; Shetty, A.; Gupta, A.; Sharma, G. Video OCR for Indexing and Retrieval. Int. J. Comput. Appl. 2015, 118, 30–33. [Google Scholar]
  36. Jamil, A.; Siddiqi, I.; Arif, F.; Raza, A. Edge-Based Features for Localization of Artificial Urdu Text in Video Images. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 21 September 2011; pp. 1120–1124. [Google Scholar] [CrossRef]
  37. Khan, N.; Adnan, A.; Waheed, A.; Zareei, M.; Aldosary, A.; Mohamed, E. Urdu Ligature Recognition System: An Evolutionary Approach. Comput. Mater. Contin. 2020, 66, 1347–1367. [Google Scholar] [CrossRef]
  38. Huang, J.; Haq, I.U.; Dai, C.; Khan, S.; Nazir, S.; Imtiaz, M. Isolated Handwritten Pashto Character Recognition Using a K-NN Classification Tool based on Zoning and HOG Feature Extraction Techniques. Complexity 2021, 2021, 5558373. [Google Scholar] [CrossRef]
  39. Conneau, A.; Lample, G. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32. [Google Scholar]
  40. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  41. Nayef, N.; Patel, Y.; Busta, M.; Chowdhury, P.N.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; Burie, J.C.; Liu, C.l.; et al. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—RRC-MLT-2019. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1582–1587. [Google Scholar]
  42. Khalid, U.; Beg, M.O.; Arshad, M.U. Rubert: A bilingual roman urdu bert using cross lingual transfer learning. arXiv 2021, arXiv:2102.11278. [Google Scholar]
  43. Velankar, A.; Patil, H.; Joshi, R. Mono vs. multilingual bert for hate speech detection and text classification: A case study in marathi. In Proceedings of the IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Montreal, QC, Canada, 6–8 October 2014; Springer: New York, NY, USA, 2023; pp. 121–128. [Google Scholar]
  44. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Las Vegas, NV, USA, 26–30 June 2016; Springer: New York, NY, USA, 2016; pp. 21–37. [Google Scholar]
  45. Kralicek, J.; Matas, J. Fast Text vs. Non-text Classification of Images. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; Springer: New York, NY, USA, 2021; pp. 18–32. [Google Scholar]
  46. Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv 2016, arXiv:1601.07140. [Google Scholar]
  47. Agnihotri, L.; Dimitrova, N. Text detection for video analysis. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL’99), Fort Collins, CO, USA, 22 June 1999; IEEE: Piscataway, NJ, USA, 1999; pp. 109–113. [Google Scholar]
  48. Panhwar, M.A.; Memon, K.A.; Abro, A.; Zhongliang, D.; Khuhro, S.A.; Memon, S. Signboard detection and text recognition using artificial neural networks. In Proceedings of the 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, 12–14 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 16–19. [Google Scholar]
  49. Reddy, S.; Mathew, M.; Gomez, L.; Rusinol, M.; Karatzas, D.; Jawahar, C. Roadtext-1k: Text detection & recognition dataset for driving videos. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11074–11080. [Google Scholar]
  50. Tang, C.W.; Liu, C.L.; Chiu, P.S. HRRegionNet: Chinese Character Segmentation in Historical Documents with Regional Awareness. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; Springer: New York, NY, USA, 2021; pp. 3–17. [Google Scholar]
  51. Huang, Y.; Jin, L.; Peng, D. Zero-shot Chinese text recognition via matching class embedding. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; Springer: New York, NY, USA, 2021; pp. 127–141. [Google Scholar]
  52. Tang, C.W.; Liu, C.L.; Chiu, P.S. HRCenterNet: An anchorless approach to Chinese character segmentation in historical documents. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1924–1930. [Google Scholar]
  53. Moumen, R.; Chiheb, R.; Faizi, R. Real-time Arabic scene text detection using fully convolutional neural networks. Int. J. Electr. Comput. Eng. 2021, 11, 2. [Google Scholar]
  54. Oulladji, L.; Feraoun, K.; Batouche, M.; Abraham, A. Arabic text detection using ensemble machine learning. Int. J. Hybrid Intell. Syst. 2018, 14, 233–238. [Google Scholar] [CrossRef]
  55. Fateh, A.; Rezvani, M.; Tajary, A.; Fateh, M. Persian printed text line detection based on font size. Multimed. Tools Appl. 2022, 1–26. [Google Scholar] [CrossRef]
  56. Kheirinejad, S.; Riaihi, N.; Azmi, R. Persian Text Based Traffic sign Detection with Convolutional Neural Network: A New Dataset. In Proceedings of the 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 60–64. [Google Scholar]
  57. Arafat, S.Y.; Ashraf, N.; Iqbal, M.J.; Ahmad, I.; Khan, S.; Rodrigues, J.J. Urdu signboard detection and recognition using deep learning. Multimed. Tools Appl. 2022, 81, 11965–11987. [Google Scholar]
  58. Butt, M.A.; Ul-Hasan, A.; Shafait, F. TraffSign: Multilingual Traffic Signboard Text Detection and Recognition for Urdu and English. In Proceedings of the International Workshop on Document Analysis Systems, La Rochelle, France, 22–25 May 2022; Springer: New York, NY, USA, 2022; pp. 741–755. [Google Scholar]
  59. Balobaid, A.; Mohan, S.M.; Mahalakshmi, V. Contemporary Methods on Text Detection and Localization from Natural Scene Images and Applications. J. Algebr. Stat. 2022, 13, 2802–2811. [Google Scholar]
  60. Chandio, A.A.; Asikuzzaman, M.; Pickering, M.R.; Leghari, M. Cursive Text Recognition in Natural Scene Images Using Deep Convolutional Recurrent Neural Network. IEEE Access 2022, 10, 10062–10078. [Google Scholar] [CrossRef]
  61. Shwait, K.; Buttar, P.K.; Gautam, R. Detection and recognition of Hindi text from natural scenes and its translation to English. Int. J. Adv. Res. Comput. Sci. 2022, 13, 86. [Google Scholar]
  62. Garg, N.K.; Kaur, L.; Jindal, M.K. A new method for line segmentation of handwritten Hindi text. In Proceedings of the 2010 Seventh International Conference on Information Technology: New Generations, Las Vegas, CA, USA, 12–14 April 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 392–397. [Google Scholar]
  63. Palakollu, S.; Dhir, R.; Rani, R. Handwritten Hindi text segmentation techniques for lines and characters. In Proceedings of the World Congress on Engineering and Computer Science, San Francisco, CA, USA, 24–26 October 2012; Volume 1, pp. 24–26. [Google Scholar]
  64. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5909–5918. [Google Scholar]
  65. Qin, H.; Yang, C.; Zhu, X.; Yin, X. Dynamic Receptive Field Adaptation for Attention-Based Text Recognition. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; Springer: New York, NY, USA, 2021; pp. 225–239. [Google Scholar]
  66. Lin, H.; Yang, P.; Zhang, F. Review of scene text detection and recognition. Arch. Comput. Methods Eng. 2020, 27, 433–454. [Google Scholar] [CrossRef]
Figure 1. Different styles of an Urdu character used in cursive words.
Figure 2. Nastaleeq font in different styles, colors and sizes.
Figure 3. Dataset transcription.
Figure 4. Proposed model block diagram.
Figure 5. Model training block diagram.
Figure 6. Proposed model data processing, input image to an output text file.
Figure 7. Text detection and recognition from an input image into an output text.
Figure 8. Training accuracy progression of the proposed model over the course of epochs.
Table 1. Variable layer architectures accuracy comparison.

RNN Layers | Training Accuracy | Validation Accuracy
1 | 94.53% | 93.72%
2 | 96.97% | 96.12%
3 | 98.45% | 97.47%

Table 2. Training confusion matrix.

 | Predicted Wrong | Predicted Correct
Actual Wrong | True Negative (3099) | False Positive (854)
Actual Correct | False Negative (3953) | True Positive (136,882)

Table 3. Training outcomes.

Specificity | 97.86%
Sensitivity | 95.13%
Positive predicted values | 97.98%
Negative predicted values | 87.3%

Table 4. Validation confusion matrix.

 | Predicted Wrong | Predicted Correct
Actual Wrong | True Negative (3397) | False Positive (117)
Actual Correct | False Negative (457) | True Positive (35,226)

Table 5. Validation outcomes.

Specificity | 97.32%
Sensitivity | 94.07%
Positive predicted values | 97.328%
Negative predicted values | 86.47%

Table 6. Validation confusion matrix (UPTI).

 | Predicted Wrong | Predicted Correct
Actual Wrong | True Negative (12,113) | False Positive (315)
Actual Correct | False Negative (12,428) | True Positive (468,348)

Table 7. Validation outcomes (UPTI).

Specificity | 97.47%
Sensitivity | 97.41%
Positive predicted values | 97.478%
Negative predicted values | 49.35%

Table 8. Accuracy comparison.

Methods | Percentage Accuracy
Ahmed et al. [16] | 96.71%
Naz et al. [19] | 93.39%
Naz et al. [21] | 96.72%
Ul-Hassan et al. [18] | 80.34%
Proposed Method | 97.47%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Umair, M.; Zubair, M.; Dawood, F.; Ashfaq, S.; Bhatti, M.S.; Hijji, M.; Sohail, A. A Multi-Layer Holistic Approach for Cursive Text Recognition. Appl. Sci. 2022, 12, 12652. https://doi.org/10.3390/app122412652

