**2. Related Work**

#### *2.1. Anchor-Based and Anchor-Free*

Both anchor-based detectors and anchor-free detectors have been used in recent natural scene text detection tasks.

Specifically, anchor-based methods traverse the feature maps produced by the convolutional layers and place a large number of pre-defined anchors on each image; the categories of these anchors are predicted and their coordinates are refined, and the refined anchors are taken as detection results. Exploiting the aspect-ratio characteristics of text regions, TextBoxes [16] equips each point with six anchors of different aspect ratios as initial text detection boxes. TextBoxes++ [17] can detect text in any direction, using oriented text boxes to handle irregularly shaped text. DMPNet [18] retains the traditional horizontal sliding window and additionally sets six candidate text boxes with different inclination angles according to the inherent shape characteristics of text: two 45-degree rectangular windows inside the square window, two long parallelogram windows inside the long rectangular window, and two tall parallelogram windows inside the tall rectangular window. The coordinates of the four vertices of the quadrilateral represent the text candidate box.

Anchor-free detectors locate objects directly, without pre-defined anchors, in two main ways. One is to locate several pre-defined or self-learned key points that bound the spatial extent of the target. The other is to use the center point or central area of the object to define positive samples and then predict the four distances from each positive location to the object boundary. For example, in FCOS [19], the introduction of centerness effectively suppresses the production of low-quality boxes. At the same time, it avoids the complex anchor-related computation, such as calculating overlaps during training, and saves memory during training. AF-RPN [20] solves the problem that the classic RPN algorithm cannot effectively predict text boxes in arbitrary directions. Instead of detecting on fused features from different levels, it matches text size to the scale of the multiscale features extracted by the feature pyramid network. Its RPN stage abandons anchors and directly regresses the coordinates of the four corners of the bounding box from a single point, and then shrinks the text area to generate the text core region.

PSENet [21] is slightly different from these anchor-free methods. It segments the fused features output at different scales by the FPN network. Each text instance is reduced to multiple text segmentation maps of different scales through a shrinking method. The segmentation maps of different scales are then merged by the progressive scale expansion algorithm, based on breadth-first search, which reconstructs each text instance as a whole to get the final detected text. The progressive scale expansion algorithm can detect scene text more accurately and distinguish text instances that are close together or stuck to each other, making it another method that handles text well without anchors.

#### *2.2. One-Stage and Two-Stage Algorithms*

The representative one-stage and two-stage algorithms are YOLO and Faster R-CNN, respectively.

The most significant advantage of one-stage detection algorithms is speed. They provide category and location information directly through the backbone network, without using an RPN network to generate candidate regions, so their accuracy is slightly lower than that of two-stage methods. With the development of object detection, the accuracy of one-stage detectors has also improved. Gupta et al. proposed the FCRN model [22], which extracts features with a fully convolutional network and then performs regression on the feature map through convolution operations. Unlike the per-pixel category labels predicted in FCN [23], it predicts the bounding-box parameters of each enclosing word, including the center-coordinate offsets, width, height, and angle. EAST [24] directly predicts arbitrary quadrilateral text based on a fully convolutional network (FCN); it uses NMS to process overlapping bounding boxes and generates multi-channel pixel-level text score maps and geometric figures with an end-to-end model. R-YOLO [25] proposed a real-time detector with a fourth-scale detection branch based on YOLOv4 [26], which effectively improves the detection of small-scale text.

Two-stage methods are more precise but slower than one-stage ones. A two-stage network extracts deep features through a convolutional neural network and then splits detection into two stages: the first generates candidate regions that may contain objects through the RPN network and classifies these regions to make a preliminary prediction of the target's position; the second further classifies and calibrates the candidate regions accurately to obtain the final detection result. The overall network structure of RRPN [9] is the same as Faster R-CNN and is divided into two parts: one predicts the category, and the other regresses a rotated rectangular box to detect text in any direction. Its two stages consist of using RRPN to generate candidate regions with rotation-angle information, then adding an RRoI pooling layer to generate a fixed-length feature vector, followed by two fully connected layers that classify the candidate regions. Mask TextSpotter [27] is also a two-stage text detection network based on Mask R-CNN [28]; it replaces the RoI pooling layer of Faster R-CNN with the RoIAlign layer and adds an FCN branch that predicts the segmentation mask. TextFuseNet [29] merges the ideas of Mask TextSpotter and Mask R-CNN to extract multi-level features from different paths and obtain richer features.

#### *2.3. ResNet and FPN*

In addition to the design and improvement of detection algorithms that focus on different parts of the pipeline, a practical detector, whether one-stage or two-stage, usually has the following two parts: the backbone network and the neck.

For a CNN, the backbone comprises a series of convolution layers, nonlinear layers, and downsampling layers; it captures image features over the global receptive field to describe the image. VGGNet [30] improves performance by continuously deepening the network structure: the increase in the number of layers does not cause an explosion in the number of parameters, while the ability to learn features becomes stronger. Batch normalization [31] suppresses the problem that small parameter changes are amplified as the network deepens and makes the network more robust to parameter changes; its effectiveness has made the BN layer a standard component of current convolutional networks. ResNet establishes a direct shortcut between input and output, so the parameterized layers concentrate on learning the residual between them, which alleviates gradient explosion and vanishing gradients as the network grows deeper.

Backbones for object detection include VGG, ResNet, etc. CTPN [8] first used the VGG16 backbone for feature extraction, and the SSD network [5] also uses VGG-16 as its base network. A ResNet-50 module was first used for feature extraction in the method proposed by Yang et al. [32], and most later networks adopt the ResNet series. Work on backbones has also produced many excellent networks, such as DenseNet. Instead of deepening the network as in ResNet or widening it as in Inception, DenseNet establishes connections between different layers through feature reuse and bypass settings, further reducing the vanishing-gradient problem and achieving good training results. Besides, its bottleneck and transition layers make the network narrower and reduce the number of parameters, effectively suppressing overfitting. Some detectors use DenseNet as the backbone for feature extraction.

With the popularity of multi-scale prediction methods such as FPN, many lightweight modules integrating different feature pyramids have been proposed. In FPN, information from adjacent layers of the bottom-up and top-down streams is combined. Target texts of different sizes use feature maps at different levels and are detected separately, which can lead to repeated predictions, and the information of feature maps at other levels cannot be used. The neck part of the network has also been developed further, for example in PANet. Among detection algorithms, YOLOv4 [26] adds the PANet method on top of the FPN module of YOLOv3 [33] to aggregate parameters for the training phase and improve detector performance, which proves the effectiveness of PANet. Such multi-level fusion architectures have been widely used recently.

**3. Principle of the Method**

This paper is based on PSENet, which is anchor-free and one-stage, and explores common text detection components such as ResNet and FPN in other directions. The proposed framework is mainly divided into two modules: the SENet module and the MPANet module. SE blocks are inserted into the residual structure of ResNet. The original PANet merges adjacent layers through addition operations; the MPANet used in this paper modifies PANet to concatenate the feature maps of adjacent layers, which improves the effect. Figure 1 describes the proposed architecture of the scene text detection algorithm.

**Figure 1.** An illustration of our framework. It includes a basic structure with SE blocks; a backbone of feature pyramid networks; bottom-up path augmentation; the progressive scale expansion algorithm, which predicts text regions, kernels, and similarity vectors to describe the text instances. Note that we omit the channel dimensions of feature maps for brevity.

#### *3.1. SENet Block*

Convolutional neural networks can only learn local spatial dependencies determined by the size of the receptive field. SENet introduces a weight at the feature-map level that models the relationships between feature channels; applying a different weight to each channel's features improves the network's ability to learn features. Note that the SE module adds weights along the channel dimension. YOLOv4 uses the SE module in object detection tasks, showing that the SE module can improve a network.

In terms of function, the framework shown in Figure 2 consists of three parts: first, a backbone network is constructed to generate the shared feature map, and then squeeze-and-excitation blocks are inserted. The key of this framework is adding three operations to the residual structure: squeeze (feature compression), excitation, and weight recalibration.

**Figure 2.** Illustration of an SE block in our model.

Main steps of SENet:

(1) The spatial dimensions of the features are compressed: global average pooling captures the global context and compresses all spatial information into channel statistics, reducing each feature map from H × W to 1 × 1. This one-dimensional 1 × 1 descriptor summarizes the global H × W view, so the perceived area is wider. The result is the statistic $z \in \mathbb{R}^C$, whose c-th element is calculated by the following formula:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{1}$$

where $F_{sq}(\cdot)$ is the squeeze operation and $u_c$ is the c-th feature map.

(2) A 1 × 1 convolution and a ReLU operation follow, reducing the dimension by a factor of 16 from 256; that is, the channel number is transformed to 16. Here $\delta(x) = \max(0, x)$ is the ReLU activation function and $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is the dimension-reduction layer parameter. A 1 × 1 convolution dimension-increase layer then restores the number of channels to the original 256:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z)) \tag{2}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid activation function, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ is the dimension-increase layer parameter, $F_{ex}(\cdot)$ is the excitation operation, and $s = [s_1, s_2, s_3, \ldots, s_C]$ is the vector of channel weights;

(3) The learned weights, one per feature channel, represent each channel's importance after feature selection; they are multiplied channel by channel with the original features to complete the recalibration of the original features along the channel dimension.

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \tag{3}$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ denotes the channel-wise product between the feature map $u_c \in \mathbb{R}^{H \times W}$ and the scalar $s_c$.
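For concreteness, the three steps above can be expressed as a small PyTorch module. This is a minimal sketch, not the authors' released code; the names `SEBlock`, `channels`, and `reduction` are ours, with C = 256 and the reduction ratio 16 taken from the text:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block following Equations (1)-(3)."""

    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        # Squeeze: global average pooling compresses each H x W map to 1 x 1, Eq. (1).
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck pair W1 (C -> C/r) and W2 (C/r -> C), Eq. (2).
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)            # statistics z, Eq. (1)
        s = self.excitation(z).view(b, c, 1, 1)   # channel weights s, Eq. (2)
        return u * s                              # channel-wise recalibration, Eq. (3)
```

For example, `SEBlock(256)(torch.randn(1, 256, 64, 64))` returns a tensor of the same shape with every channel rescaled by its learned weight.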

#### *3.2. Architecture of MPANet*

Inspired by FPN, which obtains the semantic features of multi-scale targets, we propose the path aggregation network described in Figure 3; it can be added to the FPN to make the features of different scales deeper and more expressive. The emphasis is on fusing low-level features with adaptive features at the top level.

Our framework improves the bottom-up path augmentation. Following FPN, we define the layers that generate feature maps of the same spatial size as belonging to the same network stage, with each feature level corresponding to one specific stage. We use ResNet-50 as the basic structure; the outputs of Conv2-x, Conv3-x, Conv4-x, and Conv5-x in the ResNet network are denoted C2, C3, C4, and C5. P5, P4, P3, and P2 represent the feature levels generated by FPN from top to bottom.
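As an illustration of this stage layout, the four outputs can be pulled from a standard ResNet-50, where torchvision names the Conv2-x through Conv5-x stages `layer1` through `layer4`. This is a sketch only; the 640 × 640 input size and the variable names are our assumptions:

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Conv2-x..Conv5-x correspond to torchvision's layer1..layer4; their outputs
# are the pyramid inputs C2 (1/4 scale, 256 ch) .. C5 (1/32 scale, 2048 ch).
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
features = backbone(torch.randn(1, 3, 640, 640))
for name, f in features.items():
    print(name, tuple(f.shape))
```

In a standard FPN, 1 × 1 lateral convolutions would then unify these channel counts to 256 before the top-down merge of Equation (4).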

$$P_i = \begin{cases} f_1^{3 \times 3}(C_i) & i = 5, \\ f_2^{3 \times 3}\{C_i \oplus F_{upsample}^{\times 2}[f_1^{3 \times 3}(P_{i+1})]\} & i = 2, 3, 4, \end{cases} \tag{4}$$

where $f_1^{3 \times 3}$ means that each $P_{i+1}$ first passes through a 3 × 3 convolutional layer to reduce the number of channels; the feature map is then upsampled to the same size as $C_i$ and added to $C_i$ element-wise; $f_2^{3 \times 3}$ means that the summed feature map undergoes another 3 × 3 convolution to generate $P_i$.
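One top-down merge step of Equation (4) can be sketched as follows, assuming (as in a standard FPN) that $C_i$ and $P_{i+1}$ have already been reduced to a common 256 channels; the class name `TopDownMerge` is ours, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """P_i from C_i and the coarser P_{i+1}: the i = 2, 3, 4 case of Eq. (4)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, 3, padding=1)  # f_1^{3x3}
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)  # f_2^{3x3}

    def forward(self, c_i: torch.Tensor, p_next: torch.Tensor) -> torch.Tensor:
        # Convolve P_{i+1}, upsample it to C_i's size, and add element-wise.
        top = F.interpolate(self.f1(p_next), size=c_i.shape[-2:], mode="nearest")
        return self.f2(c_i + top)  # the i = 5 case is simply f_1^{3x3}(C_5)
```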

$$N_i = \begin{cases} P_i & i = 2, \\ f_2^{3 \times 3}\left\{ f^{1 \times 1}[P_i \,\|\, f_1^{3 \times 3}(N_{i-1})] \right\} & i = 3, 4, 5, \end{cases} \tag{5}$$

Our augmented path starts from the bottom level P2 and gradually approaches P5; the spatial size is downsampled by a factor of 2 at each step from P2 to P5. We use N2, N3, N4, and N5 to denote the newly generated feature maps. Note that N2 is simply P2, without any processing, and retains the information of the original feature map.

**Figure 3.** An illustration of our modification of the bottom-up path augmentation.

As shown in Figure 3, each building block takes a higher-resolution feature map $N_i$ and a coarser map $P_{i+1}$ and generates a new feature map $N_{i+1}$.

$f_1^{3 \times 3}$ means that each feature map $N_i$ first passes through a 3 × 3 convolution layer with stride 2 to reduce the spatial size.

"" means that the feature map P*i*+<sup>1</sup> of each layer is connected horizontally, not added, but concatenated with the downsampled map.

After this operation, the concatenation has doubled the number of channels; $f^{1 \times 1}$ denotes a 1 × 1 convolution layer with stride 1 that restores the channel number to 256.

$f_2^{3 \times 3}$ means that the fused feature map is then processed by a 3 × 3 convolution to generate the $N_{i+1}$ layer for the next step. This is an iterative process that ends when it reaches P5. In these building blocks, the feature maps have 256 channels throughout, as in the sketch below.
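A single building block of Equation (5), combining the four operations just described, might look like this in PyTorch (a sketch under the stated 256-channel assumption; `BottomUpBlock` and its argument names are ours):

```python
import torch
import torch.nn as nn

class BottomUpBlock(nn.Module):
    """N_{i+1} from N_i and P_{i+1}: the i = 3, 4, 5 case of Eq. (5)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # f_1^{3x3}: stride-2 3 x 3 convolution halves the spatial size of N_i.
        self.f1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # f^{1x1}: concatenation doubles the channels; restore them to 256.
        self.reduce = nn.Conv2d(2 * channels, channels, 1)
        # f_2^{3x3}: fuse the reduced map into the new level N_{i+1}.
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, n_i: torch.Tensor, p_next: torch.Tensor) -> torch.Tensor:
        down = self.f1(n_i)                      # downsample N_i to P_{i+1}'s size
        cat = torch.cat([p_next, down], dim=1)   # lateral concatenation, not addition
        return self.f2(self.reduce(cat))
```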

$$N = N_2 \,\|\, F_{upsample}^{\times 2}(N_3) \,\|\, F_{upsample}^{\times 4}(N_4) \,\|\, F_{upsample}^{\times 8}(N_5) \tag{6}$$

Then, the features are gathered from the new feature maps N2, N3, N4, and N5: N3, N4, and N5 are upsampled to the size of N2, where $F_{upsample}^{\times 2}$, $F_{upsample}^{\times 4}$, and $F_{upsample}^{\times 8}$ denote 2×, 4×, and 8× upsampling, and the four levels are concatenated into one feature map N.

$$input_{PSE} = F_{upsample}\{ f^{1 \times 1}[f^{3 \times 3}(N)] \} \tag{7}$$

where $f^{3 \times 3}$ is a convolution operation that reduces the number of channels to 256, $f^{1 \times 1}$ generates the 7 segmentation results, and $F_{upsample}$ upsamples to the size of the original image; the 7-channel output is fed into the PSE block.
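Equations (6) and (7) together amount to the following fusion head (a minimal sketch, not the authors' implementation; the name `PSEHead`, the `img_size` argument, and the nearest-neighbor upsampling mode are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSEHead(nn.Module):
    """Fuse N2..N5 (Eq. 6) and produce the 7-channel PSE input (Eq. 7)."""

    def __init__(self, channels: int = 256, num_maps: int = 7):
        super().__init__()
        self.f3x3 = nn.Conv2d(4 * channels, channels, 3, padding=1)  # 1024 -> 256
        self.f1x1 = nn.Conv2d(channels, num_maps, 1)                 # 256 -> 7

    def forward(self, n2, n3, n4, n5, img_size):
        # Eq. (6): upsample N3/N4/N5 to N2's scale and concatenate the channels.
        n = torch.cat([
            n2,
            F.interpolate(n3, scale_factor=2, mode="nearest"),
            F.interpolate(n4, scale_factor=4, mode="nearest"),
            F.interpolate(n5, scale_factor=8, mode="nearest"),
        ], dim=1)
        # Eq. (7): reduce to 256 channels, emit 7 maps, upsample to image size.
        return F.interpolate(self.f1x1(self.f3x3(n)), size=img_size, mode="nearest")
```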
