1. Introduction
Advances in automation have driven digitization, which saves time and increases capacity. Digitization is a significant stage in the progress of technology, and Optical Character Recognition (OCR) is an essential step in digitizing paper documents. OCR converts typed text in images into editable text. OCR systems are developed for a particular language, or for a particular font within a language, as needed. This research addresses Urdu OCR for the Nastaleeq font in scanned document images.
South Asian languages such as Urdu, Hindi and Persian are cursive in nature. They are mutually intelligible as spoken languages, but they belong to different writing systems: Urdu and Persian follow a modified version of the Arabic script, while Hindi follows Devanagari [1]. Urdu is widely used in many South Asian countries, including Pakistan, India, Afghanistan and Nepal, and by large communities in the USA, the UK, Australia, Canada and the Middle East [2]. It is spoken as a first language by nearly 70 million people and as a second language by more than 100 million people, predominantly in Pakistan and India. Many official documents and literary works in Urdu need to be digitized, and this is where OCR for Urdu text recognition is useful. Because the script is cursive, Urdu letters can take several written forms, which differ depending on whether a letter is written in isolation or joined to other letters. These variations are described in Section 1.1.
1.1. Urdu Script
Urdu script is written cursively from right to left: each letter in a word is joined to its neighbors and can take one of four different shapes depending on its position. The four positional forms of an Urdu letter, shown in Figure 1, are as follows:
Isolated.
Initial.
Middle.
Final.
In Figure 1, the form of a single character is highlighted in each word according to its position, along with the Roman Urdu transliteration and English translation of each word. Because letters join in these position-dependent ways, the connected group of letters that forms a word or part of a word is called a ligature.
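The four positional forms can also be inspected programmatically. The following is a minimal sketch, assuming the letter BEH (ب) as an illustrative example (it is not taken from Figure 1) and using the Unicode Arabic Presentation Forms-B code points. Unicode names the "Middle" form "medial". The names `BEH_FORMS`, `describe_forms` and `base_letter` are hypothetical helpers introduced only for this illustration.

```python
import unicodedata

# Presentation-form code points for the letter BEH, used here only as an
# illustrative letter; any joining letter has an analogous set of forms.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "initial": "\uFE91",
    "medial": "\uFE92",   # called "Middle" in the text above
    "final": "\uFE90",
}


def describe_forms(forms):
    """Return {position: official Unicode character name} for each glyph."""
    return {position: unicodedata.name(glyph) for position, glyph in forms.items()}


def base_letter(glyph):
    """Map a presentation-form glyph back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", glyph)


for position, name in describe_forms(BEH_FORMS).items():
    print(f"{position:>8}: {name}")

# All four glyphs normalize to the same underlying letter, U+0628.
print({hex(ord(base_letter(glyph))) for glyph in BEH_FORMS.values()})
```

In practice, text is stored as base letters and the rendering engine selects the positional form. An OCR system has to do the reverse and map each observed shape back to its base letter.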
Nastaleeq Typeface
Nastaleeq is the preferred font for writing Urdu [3]. Nastaleeq fonts support OpenType Format (OTF) specifications and have been used with Microsoft Windows since Windows 2000 for official documentation, books and manuscripts.
1.2. Text Recognition
Text recognition is the process of extracting typed text from images, and OCR is used for this purpose. Each language can have different OCR systems depending on the properties of its text and images. An OCR system that targets one language must follow the rules of that language's script to produce accurate output. Once documents are converted into digital copies, their contents can be edited easily as needed. Text recognition solutions for Urdu are therefore in demand, and finding such a solution is the major obstacle in digitizing Urdu documents. Several solutions exist, but each has limitations, which leaves a gap in finding the optimal solution. Urdu script is cursive and very different from Latin script, and it has multiple fonts and font styles. Because research on Urdu is limited, only a small number of data sets are available. The biggest challenge is to localize the words or sentences in an image before recognizing the text. Images submitted for text recognition can have different background colors or text types depending on their source, so localizing the text in the image is a necessary step before recognition. Furthermore, the central reason for low accuracy in past research is the lack of a suitable data set: each study follows its own guidelines and requires a data set that matches them. Due to the complex nature of Urdu text styles, text detection and recognition is a complicated problem. The main contributions of this research work are as follows:
A unified approach for position-free Urdu text detection through Connected Component Analysis (CCA) and Long Short-Term Memory (LSTM).
A hybrid Convolutional Neural Network and Recurrent Neural Network (CNN-RNN) approach for text recognition.
Evaluation of the proposed methodology on self-synthesized and UPTI data sets.
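To make the first contribution more concrete, the following is a minimal sketch of Connected Component Analysis on a binarized image. It is a generic breadth-first labeling routine, not the implementation used in this work. The function name `connected_components`, the 8-connectivity default and the toy image are assumptions made for illustration.

```python
from collections import deque


def connected_components(binary, connectivity=8):
    """Return one inclusive bounding box (top, left, bottom, right)
    per connected group of foreground pixels (value 1)."""
    rows, cols = len(binary), len(binary[0])
    if connectivity == 8:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]
    else:  # 4-connectivity: only horizontal and vertical neighbours
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy binary image: 1 marks ink, 0 marks background.
image = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0],
]
print(connected_components(image))
# [(0, 1, 1, 2), (1, 4, 2, 5), (3, 0, 3, 0)]
```

In a text detection setting, each bounding box is a candidate ligature or diacritic, and neighboring boxes would then be grouped into text lines by geometric heuristics or a learned model.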
2. Literature Review
Several studies have been conducted on script recognition, mostly on English text. Research on Urdu, however, is comparatively limited and does not cover every aspect of Urdu writing. This section briefly reviews prior work on Urdu text recognition.
Explicit segmentation of text into characters is a highly challenging task in cursive scripts like Urdu. Research has therefore been carried out on segmenting text into characters implicitly with the help of deep learning. The literature on Urdu text recognition can be organized by the unit of recognition, for example, characters or ligatures. The research community has created two benchmark datasets of printed Urdu text, the Urdu Printed Text Line Images (UPTI) dataset [4,5] and the Center of Language Engineering (CLE) dataset. The UPTI dataset consists of 1000 lines of text, while the CLE dataset consists of more than 2000 clusters of high-frequency Urdu ligatures and scanned pages of books.
Among the techniques evaluated on the CLE dataset, Javed et al. split the ligatures at branching points. Raw pixel values are used to train Hidden Markov Models (HMMs) on each fragment of a ligature. Evaluations are carried out on 1692 primary ligatures with an accuracy of 92%. This work is extended in [6,7,8], in which 250 graphemes are used to train HMMs. In [9], a complete recognition system is presented, which uses Discrete Cosine Transform (DCT) features and HMMs. The authors modified the Tesseract engine to include recognition of Urdu ligatures, achieving a recognition rate of around 97% for ligatures in fixed font sizes. In [10], statistical features are extracted from 2028 high-frequency ligature (HFL) clusters, and a test on 6000 query ligatures achieved a recognition rate of 97.93%. The association of primary and secondary ligatures was also carried out.
Among other recognition techniques, Sabbour and Shafait [11] use shape descriptors to recognize ligatures in the UPTI database, resulting in an 89% recognition rate. Many implicit segmentation-based techniques have also been evaluated on the UPTI database [12,13,14]. In such methods, the images of text lines are provided to the learning algorithm together with their ground-truth transcriptions, from which the model parameters are learned. Refs. [15,16] used statistical features with multi-dimensional Recurrent Neural Networks (MD-RNN), whereas [17,18] used bidirectional Long Short-Term Memory (LSTM) networks on raw pixel values. Ligature-based approaches evaluated on the UPTI database are fewer, because the database is labeled at the line level and ligatures have to be extracted from it manually. The study [19] reports a 95% classification rate on around 100 ligature classes. In [20,21], statistical features are used with HMMs to recognize more than 1500 ligature classes in the UPTI database.
From the perspective of video text recognition, research on cursive scripts is limited. A few video OCR systems targeting recognition of Arabic text have been reported in the literature [22]. In [23], the authors deployed deep learning-based methods for recognizing Arabic video text. Three classifiers, namely deep belief networks, multilayer perceptron, and deep autoencoder, achieved recognition rates of 90.73%, 88.5%, and 94.36%, respectively. The method was evaluated on a database of 47 h of video. The studies [8,24] presented an approach for detecting, localizing, and recognizing Urdu text in videos. Recognition rates of around 91–93% were obtained. The system was evaluated on 10 h of video from each of three different channels.
A segmentation-based technique for OCR using a Support Vector Machine (SVM) was proposed in [25]. The authors evaluated local and global features computed on segmented cursive characters.
Most of the work in the domain of Urdu OCR relied on handcrafted features and used a nearest-neighbor approach to perform recognition. Ref. [26] used contour extraction techniques and shape context information to create feature descriptors for each ligature/glyph. Similarly, [27] used morphological operations and character-specific filters to pre-process each segmented character/glyph from a line image. They used a heuristics-based approach on the character chain code to determine which class label (Urdu glyph) is assigned to the segmented image. In a deep learning alternative, a CNN takes the input data and extracts low-level features, which are then fed to a multi-dimensional LSTM (MDLSTM). All resulting vectors are concatenated to form a single vector, which is passed on to a Connectionist Temporal Classification (CTC) transcription layer [28].
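The role of the CTC transcription layer can be illustrated with best-path (greedy) decoding, which is one common way to turn per-frame class scores into a label sequence. The sketch below is a generic illustration and makes no claim about the decoding strategy used in [28]. The function name `ctc_greedy_decode`, the choice of index 0 for the blank symbol, and the Latin placeholder alphabet are assumptions.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding: take the top-scoring class in every frame,
    merge consecutive repeats, then drop the blank symbol."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Placeholder alphabet: index 0 is the CTC blank, written "-" here.
alphabet = ["-", "a", "b"]

# Synthetic per-frame scores whose arg-max path is a a - a b b -.
path = [1, 1, 0, 1, 2, 2, 0]
frames = [[0.9 if k == i else 0.05 for k in range(len(alphabet))] for i in path]

print(ctc_greedy_decode(frames, alphabet))  # aab
```

The blank between the two runs of "a" is what allows a doubled letter to survive the merging step. Without it, the three "a" frames would collapse into a single symbol.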
Supervised learning methods for text detection [29,30] are generally more complex than unsupervised methods and learn to distinguish text blocks from non-text blocks. Machine learning approaches are among the most recent; an Artificial Neural Network (ANN) is used in [31]. These learning-based methods require a large amount of training data to reach adequate classification rates. A Character Proposal Network (CPN) for locating character proposals is used for text detection in [32], whereas a set of mid-level primitives that capture the sub-structures of characters (termed “strokelets”) is used in [28].
Unsupervised methods are commonly categorized as edge-based, connected component-based, texture-based, and color-based. In unsupervised techniques, image analysis approaches are applied, and segmentation strategies (edges, spatial grouping) are used to separate text from the rest of the image. Texture-based methods [33,34] treat the textual content in the image as a distinctive texture that distinguishes itself from the non-text parts. Texture features are computed from gray-level images, or after first transforming the image by filtering or applying frequency-domain transforms. Numerous studies [35] exploit various textural measures to detect and verify text regions in an image.
Edge-based methods [36] exploit the high contrast between text and its background by searching for edges in the image. Regions of high edge density are then merged under specific heuristics to filter out non-text areas. Connected component-based techniques [37,38] exploit the color/intensity of text pixels along with geometric heuristics to separate text from the background.
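The edge-density idea behind edge-based methods can be sketched in a few lines. The example below is a deliberately simplified illustration, not the method of [36]: it uses a horizontal intensity difference as the edge detector and flags image rows whose edge density exceeds a cut-off. The function names and the values `threshold=50` and `min_density=0.3` are assumptions chosen for the toy data.

```python
def horizontal_edge_map(gray, threshold=50):
    """Mark a pixel as an edge when the absolute intensity difference
    to its right-hand neighbour exceeds `threshold`."""
    return [
        [1 if abs(row[x + 1] - row[x]) > threshold else 0
         for x in range(len(row) - 1)]
        for row in gray
    ]


def row_edge_density(edge_map):
    """Fraction of edge pixels in each row of the edge map."""
    return [sum(row) / len(row) for row in edge_map]


def candidate_text_rows(gray, threshold=50, min_density=0.3):
    """Indices of rows whose edge density is at least `min_density`."""
    densities = row_edge_density(horizontal_edge_map(gray, threshold))
    return [i for i, density in enumerate(densities) if density >= min_density]


# Toy gray-level image: bright background (200) with dark strokes (20)
# in the two middle rows.
gray = [
    [200, 200, 200, 200, 200, 200, 200, 200],
    [200, 20, 200, 20, 200, 20, 200, 200],
    [200, 20, 20, 200, 200, 20, 200, 200],
    [200, 200, 200, 200, 200, 200, 200, 200],
]
print(candidate_text_rows(gray))  # [1, 2]
```

A real detector would use a two-dimensional edge operator and merge dense regions into boxes, but the principle is the same: text produces many closely spaced intensity transitions, whereas a smooth background produces few.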
Although multilingual pretraining can generally enhance a model’s monolingual performance, it has been demonstrated that as a multilingual model learns more languages, the capacity available for each language falls off [39]. Recent research has demonstrated that multilingual language models perform worse than their monolingual equivalents [40]. The challenges of multilingual models are explained in detail in [41]. Even bilingual language modeling has been found to perform better than multilingual modeling [42]. One study shows that monolingual models outperform traditional multilingual models on all datasets and also generate better sentence representations [43]. However, training and maintaining a separate monolingual model for each language is not cost-effective in terms of time and resources. Because Urdu is a resource-starved language, large pre-trained language models cannot be built on a single corpus.
A significant amount of work has been done on object detection in images using deep neural networks [44]. In another study, fast text classification is performed by processing only the text part of the image instead of the whole image [45]. If text is treated as an object, existing object detection methods can be applied to text detection. This idea is widely followed in the detection and recognition of text in different languages, including English [46,47,48,49], Chinese [50,51,52], Arabic [53,54], Persian [55,56], Urdu [34,57,58,59,60] and Hindi [61,62,63]. In some languages, including Urdu, there exist multiple ligatures and orthographic variants. Manuscripts in the same language can each have a unique writing style, which produces script lines of different lengths with the same word appearing at different locations. Moreover, a single image can contain numerous font styles and colors, as shown in Figure 2.
Within a single text, character size and spacing are largely consistent. However, this may not hold in complex scenes. For arbitrary text positions, some studies have proposed solutions for reliable detection of oriented scene text [64]. A recent study proposed a dynamic receptive field adaptation method for reliably identifying English scene text [65]. For cursive languages like Urdu, however, challenges remain; for example, text localization is relatively hard when there is no regularity in text placement and style. A survey paper on the history and progress of scene text detection and recognition summarizes the conventional methods in detail, along with the benefits and drawbacks of the different approaches [66].
Over the years, researchers have proposed various solutions to the problem of Urdu text recognition. Some use distinctive methods, but none covers the problem completely. A gap remains in accurately localizing the text in a given image and in recognizing highly cursive words.