2.4.4. H-SPP Module

In computer vision, capturing rich context information helps a network model the relationships between pixels and improves detection performance. To aggregate context information, pyramid pooling modules based on max-pooling have been widely adopted [47–49]. These works assume that max-pooling suffices because the object of interest tends to produce the largest pixel values. We argue that average-pooling provides another important cue for extracting global information, an idea inspired by [29], to which we refer interested readers. We therefore adopt a hybrid spatial pyramid pooling (H-SPP) module to further enhance the network's capacity to extract global context information. The module is aware of both the local and global content of the feature maps and emphasizes key small-ship features.

Unlike previous works [47–49], the H-SPP module aggregates feature maps generated by both average-pooling and max-pooling operations with different pooling sizes. Figure 11 shows the detailed structure of the H-SPP module, whose principle is described next.

As shown in Figure 11, the input feature map *F*<sub>in</sub> ∈ ℝ<sup>*W* × *H* × *C*</sup> generated by the backbone is first passed through a CBL module to produce a feature map *F* ∈ ℝ<sup>*W* × *H* × 0.5*C*</sup> with refined channel information. Then, (1) max-pooling with three kernel sizes (i.e., 5 × 5, 9 × 9 and 13 × 13) is applied in parallel to generate three local receptive field feature maps, and (2) average-pooling with the same three kernel sizes is applied to generate three global receptive field feature maps. Next, the six pooled feature maps and the original feature map *F* (i.e., *F*<sub>1</sub>–*F*<sub>7</sub> in Figure 11) are concatenated into a single feature map. This realizes feature-map-level fusion of local and global features and enriches the expressive power of the final feature map *F*<sub>out</sub> ∈ ℝ<sup>*W* × *H* × 3.5*C*</sup> (seven maps of 0.5*C* channels each).
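The shape bookkeeping described above can be checked with a short sketch. It assumes that every pooling layer uses stride 1 and padding of `kernel // 2`, which the text does not state explicitly but which is one way to keep the *W* × *H* spatial size needed for concatenation. The function names `pooled_size` and `hspp_output_shape` are hypothetical and used only for illustration.

```python
def pooled_size(size: int, kernel: int, stride: int = 1) -> int:
    """Output length along one axis of a pooling layer with padding kernel // 2."""
    padding = kernel // 2  # assumption: size-preserving padding for odd kernels
    return (size + 2 * padding - kernel) // stride + 1


def hspp_output_shape(width: int, height: int, channels: int,
                      kernels=(5, 9, 13)) -> tuple:
    """Shape of F_out for an input F_in of shape width x height x channels."""
    reduced = channels // 2  # the CBL module halves the channels: F has 0.5C
    for k in kernels:
        # Each pooled branch must keep the spatial size, or concatenation fails.
        assert pooled_size(width, k) == width
        assert pooled_size(height, k) == height
    branches = 1 + 2 * len(kernels)  # F itself + 3 max-pooled + 3 average-pooled
    return width, height, branches * reduced


print(hspp_output_shape(20, 20, 512))  # (20, 20, 1792), i.e. 3.5 * 512 channels
```

The 20 × 20 × 512 input is an arbitrary example size, not a value taken from the paper.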

In short, the above process can be expressed as

$$F_{out} = \mathrm{Conv}_{1\times1}(F_{in}) \;\text{©}\; \mathrm{MaxPool}(\mathrm{Conv}_{1\times1}(F_{in})) \;\text{©}\; \mathrm{AvgPool}(\mathrm{Conv}_{1\times1}(F_{in})) \tag{16}$$

where *Conv*<sub>1 × 1</sub> denotes the 1 × 1 convolution operation, *MaxPool* denotes the max-pooling operations (with kernel sizes of 5 × 5, 9 × 9 and 13 × 13, respectively), *AvgPool* denotes the average-pooling operations (with kernel sizes of 5 × 5, 9 × 9 and 13 × 13, respectively), and *©* denotes the concatenation operation.
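To make Equation (16) concrete, the following is a minimal pure-Python sketch of the pooling and concatenation steps, not the authors' implementation. It starts from the feature map *F* after the 1 × 1 convolution, represented as a list of 2D channel maps. The stride of 1, the size-preserving padding, the zero padding counted in the average, and the branch order (original, max-pooled, average-pooled) are all assumptions; the helper names `pool2d`, `mean` and `hspp` are hypothetical.

```python
def pool2d(fmap, kernel, reduce_fn, pad_value):
    """Stride-1 pooling over a 2D list, padded by kernel // 2 to keep its size."""
    h, w = len(fmap), len(fmap[0])
    r = kernel // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Collect the kernel x kernel window, padding outside the map.
            window = [
                fmap[y][x] if 0 <= y < h and 0 <= x < w else pad_value
                for y in range(i - r, i + r + 1)
                for x in range(j - r, j + r + 1)
            ]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


def hspp(feature, kernels=(5, 9, 13)):
    """Concatenate F with its max-pooled and average-pooled versions.

    `feature` is a list of 2D channel maps standing in for Conv1x1(F_in).
    """
    # Assumption: -inf padding for max-pooling, zero padding for average-pooling.
    max_branches = [[pool2d(ch, k, max, float("-inf")) for ch in feature]
                    for k in kernels]
    avg_branches = [[pool2d(ch, k, mean, 0.0) for ch in feature]
                    for k in kernels]
    out = list(feature)
    for branch in max_branches + avg_branches:
        out.extend(branch)  # channel-wise concatenation
    return out


# Toy input: one 7 x 7 channel with a single bright pixel at the centre.
toy = [[[0.0] * 7 for _ in range(7)]]
toy[0][3][3] = 1.0
result = hspp(toy)
print(len(result))  # 7 channels: 1 original + 3 max-pooled + 3 average-pooled
```

On this toy input, the max-pooled branches spread the bright pixel over a neighbourhood that grows with the kernel size, while the average-pooled branches dilute it over the same neighbourhood, which illustrates why the two kinds of pooling carry different information.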

By aggregating feature maps with a range of receptive fields, the H-SPP module captures context information at different scales and enhances the network's ability to capture both local and global information, which can improve the accuracy of the final prediction. In this paper, the H-SPP module is used to further improve the detection performance of Lite-YOLOv5.

**Figure 11.** The detailed structure of the H-SPP module. MaxPool denotes the max-pooling layer, AvgPool denotes the average-pooling layer, and © denotes the concatenation operation.
