*3.2. Feature Extraction Layer*

Feature extraction is one of the crucial steps in machine learning problems. In the deep learning era, several automatic feature extraction methods have been proposed, including [40–43]. These methods were applied to several problem domains and produced good results. Recently, Chen et al. [21] proposed the OctConv method, which extracts features based on their spatial frequencies. We adopt Chen et al.'s feature extraction method to detect text/non-text regions. Text found in natural images naturally varies in size, orientation, shape, and color. This variability makes it challenging to detect text/non-text regions accurately, which directly affects the performance of the recognition task. To overcome this challenge, we build high-level semantic feature maps using an FPN with ResNet-50. Different from [9], in our proposed feature extraction layer we replace vanilla convolutions with OctConv. This factorizes the mixed feature map tensor into high- and low-frequency maps, where the high-frequency feature maps encode fine details, whereas the low-frequency feature maps encode global structures. Compared to vanilla convolution, OctConv reduces spatial redundancy, memory cost, and computation cost.

For given spatial dimensions *w* and *h* and number of feature maps *c*, the input feature tensor of a convolution layer is $X \in \mathbb{R}^{c \times h \times w}$. In OctConv, the input tensor $X$ is factorized along the channel dimension into low-frequency ($X^L$) and high-frequency ($X^H$) feature maps. As stated in [21], the high- and low-frequency feature map tensors are defined as follows:

$$X^{H} \in \mathbb{R}^{(1-\alpha)c \times h \times w} \tag{1}$$

$$X^{L} \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}} \tag{2}$$

where $\alpha \in [0, 1)$ denotes the fraction of channels allocated to the low-frequency feature map, which is stored at half the spatial resolution.
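As a concrete illustration, the channel-wise factorization of Eqs. (1)–(2) can be sketched in a few lines of NumPy. The function name `factorize` and the 2×2 average pooling used to halve the low-frequency resolution are illustrative choices consistent with [21], not code from the paper:

```python
import numpy as np

def factorize(x, alpha=0.25):
    """Split a feature tensor X in R^{c x h x w} along the channel
    dimension into a high-frequency part (Eq. 1) and a low-frequency
    part stored at half spatial resolution (Eq. 2)."""
    c, h, w = x.shape
    c_low = int(alpha * c)         # alpha*c low-frequency channels
    x_high = x[c_low:]             # shape: ((1-alpha)*c, h, w)
    # 2x2 average pooling halves the spatial resolution of the low branch
    x_low = x[:c_low].reshape(c_low, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return x_high, x_low

x = np.random.rand(64, 32, 32)     # c=64, h=w=32
xh, xl = factorize(x, alpha=0.25)
# xh has 48 channels at 32x32; xl has 16 channels at 16x16
```

With $\alpha = 0.25$ and $c = 64$, the high-frequency tensor keeps $(1-\alpha)c = 48$ channels at full resolution, while the low-frequency tensor keeps $\alpha c = 16$ channels at half resolution in each spatial dimension.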

In the factorization process, fine details are captured by the high-frequency feature maps, whereas information that varies slowly across spatial locations is captured by the low-frequency feature maps. This yields a compact representation in which spatially redundant feature maps are stored at a lower resolution. On these feature maps, an octave convolution is applied, since vanilla convolution cannot operate directly on the differing resolutions of the high- and low-frequency maps. The octave convolution enables efficient inter-frequency communication and operates effectively on both the low- and high-frequency tensors. For the factorized high ($X^H$) and low ($X^L$) feature tensors, there are corresponding output feature tensors $Y^H$ and $Y^L$, respectively. Each output tensor is obtained from an inter-frequency ($Y^{H \rightarrow L}$, $Y^{L \rightarrow H}$) and an intra-frequency ($Y^{L \rightarrow L}$, $Y^{H \rightarrow H}$) convolution update. Each output feature map at location $(p, q)$ is computed using the appropriate kernels ($W^L$ and $W^H$): regular convolution is applied for the intra-frequency update, while the inter-frequency communication folds up/down-sampling into the convolution, removing the need to explicitly compute and store resampled feature maps:

$$\mathbf{Y}\_{p,q}^{H} = \sum\_{i,j \in \mathbb{N}\_k} \left( \mathbf{W}\_{i + \frac{k-1}{2}, j + \frac{k-1}{2}}^{H \rightarrow H} \right)^T \mathbf{X}\_{p+i, q+j}^{H} + \sum\_{i,j \in \mathbb{N}\_k} \left( \mathbf{W}\_{i + \frac{k-1}{2}, j + \frac{k-1}{2}}^{L \rightarrow H} \right)^T \mathbf{X}\_{\left(\lfloor \frac{p}{2} \rfloor + i\right), \left(\lfloor \frac{q}{2} \rfloor + j\right)}^{L} \tag{3}$$

$$\mathbf{Y}\_{p,q}^{L} = \sum\_{i,j \in \mathbb{N}\_k} \left( \mathbf{W}\_{i + \frac{k-1}{2}, j + \frac{k-1}{2}}^{L \rightarrow L} \right)^T \mathbf{X}\_{p+i, q+j}^{L} + \sum\_{i,j \in \mathbb{N}\_k} \left( \mathbf{W}\_{i + \frac{k-1}{2}, j + \frac{k-1}{2}}^{H \rightarrow L} \right)^T \mathbf{X}\_{(2p + 0.5 + i), (2q + 0.5 + j)}^{H} \tag{4}$$
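The inter- and intra-frequency updates of Eqs. (3)–(4) can be sketched in NumPy. To keep the example short we use 1×1 kernels (so each convolution reduces to a channel mixing), with explicit average pooling for the $H \rightarrow L$ path and nearest-neighbour upsampling for the $L \rightarrow H$ path, as in [21]; all function and variable names are illustrative:

```python
import numpy as np

def avg_pool2(x):
    """2x spatial downsampling by average pooling (H -> L path)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """2x nearest-neighbour upsampling (L -> H path)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(w, x):
    """Pointwise convolution: w is (c_out, c_in), x is (c_in, h, w)."""
    return np.einsum('oi,ihw->ohw', w, x)

def octconv(xh, xl, w_hh, w_lh, w_ll, w_hl):
    """Octave convolution with 1x1 kernels:
    Y^H = W^{H->H} * X^H + upsample(W^{L->H} * X^L)   (Eq. 3)
    Y^L = W^{L->L} * X^L + W^{H->L} * avg_pool(X^H)   (Eq. 4)"""
    yh = conv1x1(w_hh, xh) + upsample2(conv1x1(w_lh, xl))
    yl = conv1x1(w_ll, xl) + conv1x1(w_hl, avg_pool2(xh))
    return yh, yl

rng = np.random.default_rng(0)
xh = rng.standard_normal((48, 32, 32))   # high-frequency input
xl = rng.standard_normal((16, 16, 16))   # low-frequency input
# output: 32 channels with alpha = 0.25 -> 24 high, 8 low
w_hh = rng.standard_normal((24, 48))
w_lh = rng.standard_normal((24, 16))
w_ll = rng.standard_normal((8, 16))
w_hl = rng.standard_normal((8, 48))
yh, yl = octconv(xh, xl, w_hh, w_lh, w_ll, w_hl)
# yh has 24 channels at 32x32; yl has 8 channels at 16x16
```

Each output branch is the sum of one intra-frequency term ($H \rightarrow H$ or $L \rightarrow L$) and one inter-frequency term ($L \rightarrow H$ or $H \rightarrow L$), mirroring the two summations in each of Eqs. (3) and (4).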

The recognition performance of the model improves because OctConv provides a larger receptive field for the low-frequency feature maps, and text found in natural images most commonly lies in the low-frequency band. Since the low-frequency maps are processed at half resolution, each low-frequency convolution covers a receptive field twice as large as that of a vanilla convolution with the same kernel size.
