**1. Introduction**

Reading text from natural images is currently one of the most active research topics in computer vision and document processing. It has many applications, including indexing large image and video databases by their textual content, as in Bing Maps, Apple Maps, and Google Street View. It also enables image mining, office automation, and assistance for the visually impaired. Scene text reading is therefore important for a wide range of services throughout the world. However, reading text from natural images poses several challenges: texts appear in different fonts (color, type, and size) and may be written in more than one script. Moreover, imperfect imaging conditions distort the text, and complex, interfering backgrounds add unpredictability. As a result, reading or spotting text in a natural image is a challenging task.

Previously, considerable research was presented on scene text detection [1–5] and scene text recognition [6,7] independently, which leads to computational complexity and integration problems when the two are combined into a text-reading task. To address this, end-to-end scene text spotting methods were presented in [8–10], but they still need improvement in terms of recognition accuracy, memory usage, and speed. For instance, in [11,12] a fully convolutional network is applied to scene text detection and recognition, treating the detection and recognition problems independently. For scene text detection, a convolutional neural network (CNN) extracts feature maps from the input image, and different decoders then decode and detect the text regions based on the extracted features [5,13,14].

Using the sequences of features extracted in the scene text detection phase, characters/words are predicted with sequence prediction models [15,16]. Such approaches incur a heavy time cost and ignore the correlation in visual cues for images with many text regions, even though the two operations are naturally integrated. In general, previously proposed scene text detection and recognition approaches struggle when the texts in an image are written in more than one script, vary in size, or have irregular shapes. Furthermore, most research has focused on English, and only a few works address other languages such as Arabic and Chinese. Apart from our previously presented scene text recognition method [17], there is no published work on scene text reading, or even scene text detection, for Ethiopic script-based languages. The Ethiopic script is used as a writing system for more than 43 languages, including Amharic, Geez, and Tigrigna.

Amharic is the official language of Ethiopia and the second-largest Semitic language after Arabic [18]. English, in turn, is used as the medium of instruction in secondary schools and higher education. As a result, English and Amharic are used concurrently for many activities in most areas of the country. Designing scene text detection and scene text recognition as independent applications requires multiple networks for the individual sub-problems, which increases computational complexity and causes accuracy and integration problems. Treating detection and recognition as independent sub-problems also hinders the recognition of rotated and irregular texts. The characteristics of individual characters in a complex script such as that of Amharic, and the presence of bilingual text in natural images, make scene text recognition challenging when detection and recognition are handled independently. Text detection and text recognition are closely related tasks and complement each other.

Recently, the multilingual end-to-end scene text spotting systems proposed in [9,15,19] achieved good results for several languages, but not for Ethiopic script-based languages. Moreover, these methods consider neither the frequency of features (high and low) nor the effect of word length on recognition. In this paper, a bilingual end-to-end trainable scene text reading model is proposed that extracts features from the input image based on their frequency and uses a time-restricted self-attention encoder-decoder module for recognition. Between the feature-extraction and recognition layers, a region proposal network detects the text area and predicts the bounding boxes.
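The core idea of time-restricted (windowed) self-attention can be illustrated with a minimal sketch: each position attends only to neighbors within a fixed temporal window, rather than to the whole sequence. The following numpy toy example is our own illustration of the general mechanism, not the paper's implementation; the function names and the `window` parameter are illustrative.

```python
import numpy as np

def time_restricted_mask(seq_len, window):
    # Boolean mask: position i may attend only to positions j
    # with |i - j| <= window (a restricted temporal context).
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def time_restricted_attention(q, k, v, window):
    # Scaled dot-product attention with the window mask applied
    # before the softmax, so out-of-window scores get ~zero weight.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(time_restricted_mask(len(q), window), scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Restricting attention this way keeps the cost per position proportional to the window size instead of the sequence length, which is one way such a module can mitigate the effect of word length on recognition.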

Figure 1 shows the architecture of the proposed system, which contains feature-extraction, detection, and recognition layers. In the first layer of our network, we use a feature pyramid network (FPN) [20] with ResNet-50 to extract features. Inspired by reference [21], the vanilla convolutions of ResNet-50 are replaced by octave convolutions (OctConv), except for the first convolution layer. OctConv factorizes feature tensors according to their frequency (high and low), which effectively enlarges the receptive field in the original pixel space and improves recognition performance. It also reduces the memory requirement by avoiding redundancy. As reported in [21], OctConv improves object-recognition performance and achieves state-of-the-art results. In the second layer, a region proposal network (RPN) predicts text/non-text regions and regresses the bounding boxes of the predicted text regions using the features extracted in the first layer. Finally, Region of Interest (RoI) pooling is applied to the extracted features based on the predicted bounding boxes, and word prediction is performed with a time-restricted self-attention encoder-decoder module. Our proposed bilingual text-reading model reads texts from natural images in an end-to-end manner. The major contributions of the article are summarized as follows:
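The frequency factorization behind OctConv can be sketched as follows: a fraction alpha of the channels (the low-frequency group) is stored at half the spatial resolution, which is where the memory saving comes from. The numpy snippet below illustrates only this tensor factorization, not the full octave convolution operator; `octave_split` and `alpha` are illustrative names of our own.

```python
import numpy as np

def octave_split(x, alpha=0.5):
    # Factorize a feature tensor of shape (C, H, W) into a
    # high-frequency group at full resolution and a low-frequency
    # group at half resolution, as in octave convolution.
    c = x.shape[0]
    c_low = int(alpha * c)
    high = x[c_low:]                       # (C - c_low, H, W)
    low = x[:c_low]                        # (c_low, H, W)
    # 2x2 average pooling halves the spatial size of the low group.
    low = low.reshape(c_low, low.shape[1] // 2, 2,
                      low.shape[2] // 2, 2).mean(axis=(2, 4))
    return high, low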


The rest of the paper is organized as follows. Related work is presented in Section 2. In Section 3, we discuss the proposed bilingual end-to-end scene text reading methodology. Section 4 gives a short description of the Ethiopic script and the datasets used for training and evaluating the proposed model. The experimental setup and results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

**Figure 1.** The architecture of the proposed bilingual end-to-end scene text reader model.

**2. Related Work**

Reading text from natural images is currently an active research area in computer vision and document analysis. In this section, we introduce related work on scene text detection, scene text recognition, and text spotting (the combination of detection and recognition).
