4.2.2. Real Scene Text Dataset

In addition to the synthetic dataset, we collected 1200 bilingual real scene text benchmark images using a photo camera and Google image search. The images were captured from local markets, navigation and traffic signs, banners, billboards, and governmental offices. We also incorporated several office logos, most of which were written in both Amharic and English with curved shapes. Beyond our prepared datasets, we pre-trained the proposed model on the Synthetic [22] dataset together with our synthetic dataset. To fine-tune the pre-trained model and compare its performance with state-of-the-art models, we used the ICDAR2013 [44], ICDAR2015 [40], and Total-Text [45] datasets. The datasets used in the proposed model are summarized in Table 2, and sample images from the collected datasets are depicted in Figure 3.

**Figure 3.** Sample of collected real scene text images.



### **5. Experiments and Discussions**

The effectiveness of the proposed model was evaluated and compared against state-of-the-art methods by first pre-training it on our synthetically generated dataset together with the Synthetic [22] dataset. The pre-trained model was then fine-tuned on the merged real scene text datasets described above.
