**6. Conclusions**

This paper introduced an end-to-end trainable bilingual (English and Ethiopic) scene text reading system built on octave convolution (OctConv) and a time-restricted attention encoder-decoder module. The proposed model consists of three stages. First, an FPN with a ResNet-50 backbone serves as the feature extractor, with vanilla convolutions replaced by OctConv. Second, an RPN predicts bounding boxes and detects text regions. Third, text is recognized by segmenting the text areas given by the bounding boxes detected in the second stage and feeding them to the time-restricted attention encoder-decoder network. To measure the effectiveness of the proposed model, we collected and synthetically generated a bilingual dataset, and additionally used the well-known ICDAR2013, ICDAR2015, and Total-Text datasets. On the prepared bilingual dataset, the proposed method achieves F1-measures of 61.04% and 85.25% on scene text reading and scene text detection, respectively. Compared with state-of-the-art recognition performance, our model shows promising results; it nonetheless achieves state-of-the-art results for end-to-end text reading on ICDAR2013 and Total-Text. Due to the large number of Ethiopic characters, their visual similarity, and the limited number of training samples for certain characters, recognition errors occurred during testing. To improve the recognition performance of the system, future work should prepare additional training data with sufficient samples for every character. Upon publication, the implementation code and a link to the prepared dataset will be freely available to researchers at https://github.com/direselign/amh\_eng.
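As a rough illustration of why OctConv can replace vanilla convolution in the backbone at lower cost, the following sketch computes the channel split and the relative multiply-add cost for a given octave ratio alpha. This is a back-of-envelope model, not the paper's implementation: the helper names are hypothetical, and it assumes the standard OctConv formulation in which the low-frequency feature map is kept at half the spatial resolution.

```python
# Sketch (hypothetical helpers): channel bookkeeping and relative cost of
# octave convolution (OctConv) versus vanilla convolution, as used to
# replace vanilla convolutions in an FPN/ResNet-50 backbone.

def octconv_split(channels, alpha):
    """Split channels into low-frequency (alpha * C) and high-frequency parts."""
    low = int(alpha * channels)
    high = channels - low
    return high, low

def octconv_relative_flops(alpha):
    """Relative multiply-add cost of OctConv vs. vanilla convolution.

    Assumption: the low-frequency map is stored at half the spatial
    resolution, so any of the four inter-frequency paths that reads or
    writes it costs 1/4 as much per channel pair.
    """
    hh = (1 - alpha) ** 2          # high -> high, full resolution
    hl = (1 - alpha) * alpha / 4   # high -> low, written at half resolution
    lh = alpha * (1 - alpha) / 4   # low -> high, computed at half resolution
    ll = alpha ** 2 / 4            # low -> low, both at half resolution
    return hh + hl + lh + ll

high, low = octconv_split(256, 0.5)   # e.g. a 256-channel backbone stage
print(high, low)                      # 128 high-frequency, 128 low-frequency
print(octconv_relative_flops(0.5))    # ~0.44 of the vanilla conv cost
```

Under this cost model the parameter count is unchanged (the four paths together cover all input-output channel pairs), while the computation drops because a fraction of the features live at half resolution.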

**Author Contributions:** Conceptualization, D.A.T.; experiments, D.A.T. and V.-D.T.; validation, D.A.T. and V.-D.T.; writing – original draft preparation, D.A.T.; writing – review and editing, C.-M.L. and V.-D.T.; supervision, C.-M.L.; funding acquisition, C.-M.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the Ministry of Science and Technology and National Taipei University of Technology through the Applied Computing Research Laboratory, Taiwan, under Grant MOST 107-2221-E-027-099-MY2 and Grant NTUT-BIT-109-03.

**Conflicts of Interest:** The authors declare no conflicts of interest.
