*5.1. Implementation Details*

The proposed model was first pre-trained on our synthetically generated bilingual dataset and Synthetic [22], and then fine-tuned on the union of the real-world datasets described in Section 4.2.2. Because real sample images were scarce in the fine-tuning stage, we applied data augmentation and multi-scale training: brightness, hue, and contrast were randomly modified, and images were randomly rotated by an angle between −30° and 30°. Following [9], for multi-scale training the shorter side of each input image was randomly resized to one of five scales (600, 800, 1000, 1200, 1400). We used the Adam optimizer [46] with a base learning rate of 0.0001, β1 = 0.9, β2 = 0.999, and no weight decay. Following the result of [21], we set α = 0.25, where α denotes the ratio of the low-frequency part.
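The augmentation and optimizer setup above can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual code: the function names, the jitter magnitudes, and the placeholder model are assumptions, and the rotation and hue components of the augmentation are omitted for brevity.

```python
import random
import torch
import torch.nn.functional as NF

# Shorter-side scales for multi-scale training, following [9].
SCALES = (600, 800, 1000, 1200, 1400)

def multi_scale_resize(img: torch.Tensor) -> torch.Tensor:
    """Resize a (C, H, W) image so its shorter side matches a random scale."""
    target = random.choice(SCALES)
    _, h, w = img.shape
    factor = target / min(h, w)
    new_size = (round(h * factor), round(w * factor))
    return NF.interpolate(img.unsqueeze(0), size=new_size,
                          mode="bilinear", align_corners=False).squeeze(0)

def photometric_jitter(img: torch.Tensor,
                       brightness: float = 0.3,
                       contrast: float = 0.3) -> torch.Tensor:
    """Randomly jitter brightness and contrast of a [0, 1] image tensor.

    The jitter ranges here are illustrative; the paper does not state them.
    """
    img = img * random.uniform(1 - brightness, 1 + brightness)
    mean = img.mean()
    img = (img - mean) * random.uniform(1 - contrast, 1 + contrast) + mean
    return img.clamp(0.0, 1.0)

# Adam with the stated hyperparameters; the Linear layer is a stand-in
# for the actual bilingual scene text reading model.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=0)
```

In a training loop, each image would pass through `photometric_jitter` and `multi_scale_resize` before being batched, so every iteration sees a different scale and photometric perturbation.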

Experiments with the proposed bilingual scene text reading model were conducted on an Ubuntu machine with an Intel Core i7-7700 (3.60 GHz) CPU, 64 GB of RAM, and a GeForce GTX 1080 Ti GPU (11,176 MiB). The model was implemented in Python 3.7 with PyTorch 1.2.
