*3.1. Overall View of the Architecture*

Our proposed architecture follows the design presented in [9,21] and has three functional components: a feature-extraction layer, a text/non-text detection layer, and a recognition layer. In the feature-extraction layer, features are extracted from input natural images using an FPN [20] with ResNet-50 [39], in which the vanilla convolution is replaced with an octave convolution, and are then passed to the next layer. Next, taking the features extracted by the first layer as input, a region proposal network (RPN) [40] predicts text/non-text areas and the bounding box of each text area. Finally, by applying RoI to the outputs of the second layer, text segmentation and word prediction are performed using the time-restricted self-attention encoder-decoder module. Details of each layer are presented below.
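
The data flow between the three components can be illustrated with a minimal Python sketch. It is only a structural outline under stated assumptions, not the implementation used in this work: the names `run_pipeline`, `Proposal`, `Word`, and `score_threshold` are hypothetical, each stage is passed in as a placeholder callable, and a plain rectangular crop of the feature map stands in for the RoI step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1) on the feature map
Grid = List[List[float]]             # 2-D image or feature map as nested lists


@dataclass
class Proposal:
    box: Box            # bounding box predicted by the detection layer
    text_score: float   # text/non-text confidence in [0, 1]


@dataclass
class Word:
    box: Box
    text: str


def run_pipeline(
    image: Grid,
    extract_features: Callable[[Grid], Grid],         # layer 1 placeholder
    propose_regions: Callable[[Grid], List[Proposal]],  # layer 2 placeholder
    recognize: Callable[[Grid], str],                 # layer 3 placeholder
    score_threshold: float = 0.5,                     # assumed cut-off
) -> List[Word]:
    """Chain the three layers: features -> proposals -> per-region words."""
    features = extract_features(image)
    proposals = propose_regions(features)

    words: List[Word] = []
    for proposal in proposals:
        # Skip regions that the detection layer scored as non-text.
        if proposal.text_score < score_threshold:
            continue
        x0, y0, x1, y1 = proposal.box
        # Simplified stand-in for the RoI step: crop the region's features.
        roi = [row[x0:x1] for row in features[y0:y1]]
        words.append(Word(box=proposal.box, text=recognize(roi)))
    return words


# Toy stand-ins so the sketch runs end to end.
def toy_features(image: Grid) -> Grid:
    return image  # identity instead of an FPN


def toy_proposals(features: Grid) -> List[Proposal]:
    return [Proposal((1, 0, 4, 2), 0.9), Proposal((0, 2, 2, 4), 0.1)]


def toy_recognizer(roi: Grid) -> str:
    return f"{len(roi)}x{len(roi[0])}"  # reports the crop size as "text"


image = [[0.0] * 6 for _ in range(4)]
words = run_pipeline(image, toy_features, toy_proposals, toy_recognizer)
```

With the toy stand-ins, only the first proposal passes the threshold, so `words` holds a single entry whose crop spans two rows and three columns. In a real system each callable would be replaced by the corresponding trained network.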
