*5.2. Experimental Results*

Throughout our experimental analysis, we evaluated a single model trained in a multilingual setup, as explained in Section 3. To improve the performance of the model, we first pre-trained it on the synthetic dataset of [22] and on our synthetically generated bilingual dataset, which covers a total of 430 characters. We then fine-tuned the pre-trained model on a combination of the real scene text datasets mentioned above. Text recognition results are reported in an unconstrained setup, that is, without using any predefined lexicon (set of words).

The performance of the trained model was verified on our prepared test dataset and on the well-known ICDAR datasets. As discussed in Section 4.2, the images collected in our dataset contain horizontal, arbitrarily oriented, and curved text, and both the detection and recognition results were promising for all three categories. On our prepared real scene text dataset, scene text detection achieved 88.3% precision (P), 82.4% recall (R), and 85.25% F1-score (F), while end-to-end scene text reading achieved 80.88% P, 49.01% R, and 61.04% F. The scene text detection performance of the proposed method does not differ much between English and Amharic words; in the end-to-end scene text reading task, however, 63.4% of the errors occurred in the recognition of Amharic words. Some of the incorrectly recognized characters had too few samples in the real and synthetic datasets. Sample detection and recognition results are depicted in Figure 4. Most of the detection errors of the proposed method stem from false detections on non-text background regions.
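As a sanity check, the reported F1-scores follow from the standard harmonic mean of precision and recall; a minimal sketch, using the detection and end-to-end figures reported above:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# Detection on our dataset: P = 88.3, R = 82.4
print(round(f1_score(88.3, 82.4), 2))    # 85.25
# End-to-end reading: P = 80.88, R = 49.01
print(round(f1_score(80.88, 49.01), 2))  # 61.04
```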

**Figure 4.** Sample detection and recognition results for our prepared dataset.

In addition to our own test dataset, we evaluated the proposed model on the ICDAR2013, ICDAR2015, and Total-Text test datasets, which contain only English text. The model was fine-tuned jointly for English and Amharic as a single model, not separately per language. The results of our proposed method and of previously proposed methods are shown in Table 3. The experiments show that our method achieves better recognition results on the ICDAR2013 and Total-Text datasets, while its scene text detection results are close to those of the recently proposed Mask TextSpotter [9] method. We used their architecture and implementation code with small modifications to the feature extraction and recognition layers: we replaced the ResNet-50 feature extractor with an octave-based ResNet-50, and the text recognition part with a self-attention encoder-decoder model, whereas the preprocessing and RPN implementation are taken from Mask TextSpotter unchanged. In Table 4, we compare the scene text detection results of our method with previously proposed methods on the ICDAR2013, ICDAR2015, and Total-Text datasets.
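The octave-based feature extraction mentioned above replaces standard convolutions with octave convolutions, which split feature maps into a high-frequency branch at full resolution and a low-frequency branch at half resolution, exchanging information between the two. A minimal sketch of one such layer in PyTorch (class and parameter names are our own illustration, not from the Mask TextSpotter codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    """Minimal octave convolution: channels are split into a high-frequency
    branch (full resolution) and a low-frequency branch (half resolution),
    with cross-branch information exchange. Names are illustrative."""

    def __init__(self, in_ch, out_ch, alpha=0.5, kernel_size=3, padding=1):
        super().__init__()
        self.in_lo = int(alpha * in_ch)    # low-frequency input channels
        self.in_hi = in_ch - self.in_lo    # high-frequency input channels
        self.out_lo = int(alpha * out_ch)
        self.out_hi = out_ch - self.out_lo
        # Four paths: high->high, high->low, low->high, low->low.
        self.conv_hh = nn.Conv2d(self.in_hi, self.out_hi, kernel_size, padding=padding)
        self.conv_hl = nn.Conv2d(self.in_hi, self.out_lo, kernel_size, padding=padding)
        self.conv_lh = nn.Conv2d(self.in_lo, self.out_hi, kernel_size, padding=padding)
        self.conv_ll = nn.Conv2d(self.in_lo, self.out_lo, kernel_size, padding=padding)

    def forward(self, x_hi, x_lo):
        # High branch: keep resolution; also downsample to feed the low branch.
        y_hh = self.conv_hh(x_hi)
        y_hl = self.conv_hl(F.avg_pool2d(x_hi, 2))
        # Low branch: keep half resolution; upsample its output for the high branch.
        y_ll = self.conv_ll(x_lo)
        y_lh = F.interpolate(self.conv_lh(x_lo), scale_factor=2, mode="nearest")
        return y_hh + y_lh, y_ll + y_hl

# With alpha = 0.5 and 64 channels, each branch carries 32 channels.
oc = OctConv(64, 64)
y_hi, y_lo = oc(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 8, 8))
```

The low-frequency branch operates at half the spatial resolution, which is what reduces the computational cost of the backbone relative to a plain ResNet-50.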


**Table 3.** F1-Score experimental results of the proposed unconstrained scene text reading system compared with previous methods.

\* indicates that the model was trained for English only; \*\* indicates that the model was trained on multilingual datasets. Our model was trained for English and Amharic, covering 430 characters.

**Table 4.** Scene text detection results of the proposed method compared with previous methods.


In the experiments, the proposed bilingual scene text reading method showed limitations on scene text with small font sizes and on severely distorted images. Furthermore, due to the large number of Ethiopic characters, their visual similarity, and the limited number of training samples for certain characters, recognition errors occurred at test time. To improve the recognition performance of the system, and of scene text reading systems in general, it is necessary to prepare more training data containing sufficient samples for every character.
