*3.3. Text Region Detection Layer*

Using an RPN that takes the extracted feature maps as input, text and non-text regions are detected. Following [9] and [20], we assign anchors at five stages {P2, P3, P4, P5, P6} with anchor areas {32², 64², 128², 256², 512²}, respectively. In addition, to handle different text shapes, aspect ratios {0.5, 1, 2} are used at each stage. This step generates text proposals, whose features are then extracted using RoI align [41], which preserves more accurate locations than RoI pooling. Finally, Fast R-CNN [41] generates precise bounding boxes for the text found in the input natural image. Where a text instance receives more than one bounding box, we select a single box using soft non-maximum suppression (Soft-NMS) [42].
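
As an illustration of the anchor configuration described above, the following minimal sketch enumerates the anchor shapes per stage. The ratio convention (height divided by width) and the names `ANCHOR_SIZES`, `anchor_shapes`, and `anchors_at` are assumptions made for this example and are not taken from the original implementation.

```python
import math

# Anchor side length per stage, so that the areas are 32^2 ... 512^2 as in the text.
ANCHOR_SIZES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(size, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs of area size**2, one per aspect ratio.

    Assumption: ratio = height / width, so 0.5 gives a wide box that suits
    horizontal text and 2 gives a tall box.
    """
    area = float(size * size)
    shapes = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        shapes.append((width, height))
    return shapes


def anchors_at(cx, cy, size):
    """Return (x1, y1, x2, y2) anchors centred on one feature-map location."""
    return [
        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        for w, h in anchor_shapes(size)
    ]


if __name__ == "__main__":
    for stage, size in ANCHOR_SIZES.items():
        rounded = [(round(w, 1), round(h, 1)) for w, h in anchor_shapes(size)]
        print(stage, rounded)
```

Each stage therefore contributes three anchor shapes per location, all sharing the same area, so small text is matched at P2 and large text at P6.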

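The Soft-NMS step can be sketched as follows. This is a simplified, pure-Python version of the Gaussian score-decay variant; the text does not state which variant or which parameters were used, so the Gaussian penalty, `sigma=0.5`, and `score_thresh=0.001` are assumptions, as are the function names.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    discarding them outright. Returns (box, score) pairs in selection order.
    """
    candidates = [(tuple(b), float(s)) for b, s in zip(boxes, scores)]
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        kept.append((box, score))
        # Decay the remaining scores according to their overlap with it.
        survivors = []
        for other_box, other_score in candidates:
            overlap = iou(box, other_box)
            decayed = other_score * math.exp(-(overlap ** 2) / sigma)
            if decayed >= score_thresh:
                survivors.append((other_box, decayed))
        candidates = survivors
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    for box, score in soft_nms(boxes, scores):
        print(box, round(score, 3))
```

In this sketch a near-duplicate box keeps a reduced score rather than being removed, so a final score threshold is what leaves one box per text instance.
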