### *2.2. Scene Text Recognition*

In the text-reading pipeline for natural images, text recognition is the second stage, following scene text detection. It can be run independently or downstream of a detection stage: cropped text regions, supplied either by the detector or directly from a prepared input dataset, are decoded into sequences of labels. Early attempts detected individual characters and then refined misclassified ones. Such methods require training a strong character detector that can accurately localize and crop each character out of the original word, which is especially difficult for Ethiopic scripts because of their complexity. Beyond character-level methods, word recognition [12], sequence-to-label [30], and sequence-to-sequence [31] methods have been presented. Liu et al. [32] and Shi et al. [15] presented spatial attention mechanisms that rectify distorted text regions in irregular input images into a canonical pose suitable for recognition. However, the performance of both the detection and recognition tasks ultimately depends on the extracted features, and previously proposed deep learning and conventional machine learning feature extractors for scene text detection and recognition do not consider the frequency content of the input image. Following [21], in this paper we propose an OctConv-based ResNet-50 feature extractor, which extracts features by factorizing them according to their spatial frequencies.
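The frequency factorization behind octave convolution can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it simplifies the spatial convolutions of [21] to 1x1 (channel-mixing) convolutions so only the core idea remains visible, namely that feature maps are split into a high-frequency group at full resolution and a low-frequency group at half resolution, with four information paths (H→H, H→L, L→H, L→L) connecting them. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling over the spatial dims; x has shape (C, H, W)
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    # nearest-neighbour 2x upsampling of the spatial dims
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    # a 1x1 convolution is just channel mixing: w has shape (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def octave_conv(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """One (simplified) octave convolution step.
    x_h: high-frequency maps at full resolution, shape (C_h, H, W)
    x_l: low-frequency maps at half resolution, shape (C_l, H/2, W/2)
    Returns the updated (y_h, y_l) pair."""
    # high-frequency output: H->H path plus upsampled L->H path
    y_h = conv1x1(x_h, w_hh) + upsample2(conv1x1(x_l, w_lh))
    # low-frequency output: pooled H->L path plus L->L path
    y_l = conv1x1(avg_pool2(x_h), w_hl) + conv1x1(x_l, w_ll)
    return y_h, y_l
```

Because the low-frequency group lives at half resolution, its convolutions touch a quarter of the spatial positions, which is the source of OctConv's memory and compute savings over vanilla convolution.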

### *2.3. Scene Text Spotting*

Recently, several end-to-end scene text spotting methods have been introduced and have shown remarkable results compared with independent scene text detection and recognition approaches. For instance, Li et al. [10] introduced an end-to-end text spotting technique for natural images that uses an RPN as the text detector and an attention-based Long Short-Term Memory (LSTM) network as the text recognizer. Liao et al. [8] presented an end-to-end scene text-reading method using a Single Shot Detector (SSD) [33] and a convolutional recurrent neural network (CRNN) for scene text detection and recognition, respectively. Liu et al. [34] introduced a unified network to detect and recognize multi-oriented scene text in natural images. Lunadren et al. [35] introduced an octave-based fully convolutional neural network with fewer layers and parameters to precisely detect multilingual scene text. The most recently proposed scene text-reading models are summarized in Table 1.



Improving the feature extraction and recognition networks improves scene text detection, recognition, and text spotting alike. In [21], an OctConv feature extraction method was proposed for object detection and improved its performance. Octave convolution addresses spatial redundancy, which was not handled by previously proposed methods; it does not change the connectivity between feature maps, which distinguishes it from multi-path Inception designs [36,37]. In our proposed bilingual text-reading method, we replace the vanilla convolutions of ResNet-50 with OctConv, which operates faster and extracts more accurate features. As stated in [38], the limitations of Connectionist Temporal Classification (CTC), attention encoder-decoder, and hybrid (CTC and attention) methods were mitigated by a time-restricted self-attention method for automatic speech recognition. In our proposed method, we integrate a time-restricted self-attention encoder-decoder module for recognition with the feature extraction and bounding-box detection layers.
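The time-restricted self-attention idea from [38] can be sketched as ordinary scaled dot-product self-attention whose score matrix is masked so that each time step attends only to a local window of neighbouring frames. The sketch below is a simplified single-head NumPy illustration under that assumption; the `radius` parameter and all names are ours, not the cited system's.

```python
import numpy as np

def time_restricted_self_attention(x, wq, wk, wv, radius):
    """Single-head self-attention where step t attends only to steps
    in [t - radius, t + radius]. x: (T, d) sequence of feature frames;
    wq, wk, wv: (d, d) projection matrices."""
    T = x.shape[0]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])            # (T, T) similarities
    t = np.arange(T)
    outside = np.abs(t[:, None] - t[None, :]) > radius
    scores[outside] = -np.inf                         # block distant frames
    # row-wise softmax; the diagonal is always in-window, so each row is finite
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

Restricting the attention window keeps the context local and monotonic-friendly, which is what makes the scheme attractive for frame-synchronous recognition tasks such as speech, and, in our setting, for decoding text-line feature sequences.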
