2.4.3. CSA Module

The essence of the attention mechanism is to extract the information most valuable to the task objective in the target area and to suppress or ignore irrelevant details. Channel attention addresses the "what" problem, i.e., what plays an important role in the whole image. However, ships in an image are generally sparsely distributed and occupy a small proportion of the pixels, so only part of the pixel space is valuable in a ship detection task. Spatial attention addresses the "where" problem, i.e., where the ship is in the whole image. Spatial attention supplements channel attention: spatial features are selectively aggregated through spatial weighting. Therefore, unlike SENet [45], which focuses only on channel attention, we apply channel and spatial attention together in what we call the CSA Module.

To achieve this, we apply channel and spatial attention sequentially. As shown in Figure 10, given the input feature map *F* ∈ ℝ<sup>*W* × *H* × *C*</sup>, channel attention generates the channel weight *W<sub>C</sub>* ∈ ℝ<sup>*C* × 1 × 1</sup> and spatial attention generates the spatial weight *W<sub>S</sub>* ∈ ℝ<sup>*H* × *W* × 1</sup>. The different color depths in Figure 10 represent different weight values.

**Figure 10.** The detailed structure of the CSA module, which is composed of channel attention and spatial attention.

Specifically, for channel attention, given the input feature map *F* ∈ ℝ<sup>*W* × *H* × *C*</sup>, global average pooling (GAP) [46] first generates the average spatial response and global max pooling (GMP) [46] generates the maximum spatial response. Each response is then passed to the multi-layer perceptron (MLP), which encodes the channel information and helps to infer finer channel attention. Next, the two encoded outputs are added element-wise, and the combined channel information is activated by the sigmoid function to obtain the channel weight feature map, i.e., the channel weight matrix *W<sub>C</sub>*. Finally, the original feature map *F* is multiplied element-wise by the obtained channel weight matrix *W<sub>C</sub>*.

In short, the above can be defined by

$$F' = F \odot W_{C} \tag{12}$$

where *F*′ denotes the weighted feature map, *F* denotes the input feature map, ⊙ denotes element-wise multiplication, and *W<sub>C</sub>* denotes the obtained channel weight matrix, i.e.,

$$W_{C} = \sigma \left\{ f_{c\text{-encode}}(\operatorname{GAP}(F)) \oplus f_{c\text{-encode}}(\operatorname{GMP}(F)) \right\} \tag{13}$$

where *GAP* denotes the global average-pooling operation, *GMP* denotes the global max-pooling operation, ⊕ denotes element-wise summation, *f<sub>c-encode</sub>* denotes the channel encoder, and *σ* denotes the sigmoid activation function.
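Equations (12) and (13) can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this work: the text does not specify the MLP, so the sketch assumes a two-layer perceptron with a ReLU hidden layer whose weights `W1` and `W2` are shared by the GAP and GMP branches, and it assumes an (H, W, C) array layout. The names `channel_attention`, `W1`, and `W2` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic activation, the sigma in Equation (13)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W1, W2):
    """Sketch of Equations (12)-(13).

    F  : feature map, shape (H, W, C)  (layout is an assumption)
    W1 : first MLP layer, shape (C, C_hidden)  (assumed encoder)
    W2 : second MLP layer, shape (C_hidden, C) (assumed encoder)
    Returns the weighted map F' and the channel weights W_C (one per channel).
    """
    gap = F.mean(axis=(0, 1))  # average spatial response, shape (C,)
    gmp = F.max(axis=(0, 1))   # maximum spatial response, shape (C,)

    def encode(v):
        # Assumed channel encoder: linear -> ReLU -> linear.
        return np.maximum(v @ W1, 0.0) @ W2

    W_C = sigmoid(encode(gap) + encode(gmp))  # Equation (13)
    return F * W_C, W_C                       # Equation (12), broadcast over H and W


# Usage with random stand-in data and weights.
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))
W1 = rng.standard_normal((16, 4)) * 0.1
W2 = rng.standard_normal((4, 16)) * 0.1
F_prime, W_C = channel_attention(F, W1, W2)
print(F_prime.shape, W_C.shape)  # (8, 8, 16) (16,)
```

Because the sigmoid output lies strictly between 0 and 1, each channel of *F* is scaled down by its own weight, and channels with larger encoded responses are attenuated less.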

As for spatial attention, given the input feature map *F* ∈ ℝ<sup>*W* × *H* × *C*</sup>, global average pooling (GAP) [46] first generates the average spatial response and global max pooling (GMP) [46] generates the maximum spatial response. The two results are then concatenated into a single feature map. Next, a space encoder of our design encodes the spatial information, which helps to infer finer spatial attention, and its output is activated by the sigmoid function to obtain the spatial weight feature map, i.e., the spatial weight matrix *W<sub>S</sub>*. Finally, the original feature map *F* is multiplied element-wise by the obtained spatial weight matrix *W<sub>S</sub>*.

In short, the above can be defined by

$$F'' = F' \odot W_{S} \tag{14}$$

where *F*″ denotes the weighted feature map, *F*′ denotes the input feature map (the output of channel attention), ⊙ denotes element-wise multiplication, and *W<sub>S</sub>* denotes the obtained spatial weight matrix, i.e.,

$$W_{S} = \sigma \left\{ f_{s\text{-encode}}\left(\operatorname{GAP}(F) \;\text{©}\; \operatorname{GMP}(F)\right) \right\} \tag{15}$$

where *GAP* denotes the global average-pooling operation, *GMP* denotes the global max-pooling operation, **©** denotes the concatenation operation, *f<sub>s-encode</sub>* denotes the space encoder, and *σ* denotes the sigmoid activation function.
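Equations (14) and (15) can be sketched in the same way. The space encoder is not detailed in the text, so this sketch substitutes a single zero-padded *k* × *k* convolution over the two pooled maps, which is an assumption and not the encoder designed here. It also assumes that the pooling runs across the channel axis, since the spatial weight must hold one value per spatial location, and that arrays use an (H, W, C) layout. The names `spatial_attention` and `kernel` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic activation, the sigma in Equation (15)."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(F, kernel):
    """Sketch of Equations (14)-(15).

    F      : feature map, shape (H, W, C)  (layout is an assumption)
    kernel : stand-in space encoder, shape (k, k, 2) with k odd  (assumption)
    Returns the weighted map and the spatial weights W_S of shape (H, W, 1).
    """
    avg = F.mean(axis=2)                  # average response per location, (H, W)
    mx = F.max(axis=2)                    # maximum response per location, (H, W)
    pooled = np.stack([avg, mx], axis=2)  # concatenation, shape (H, W, 2)

    k = kernel.shape[0]
    pad = k // 2
    # Zero-pad the spatial axes so the output keeps the H x W size.
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))

    H, W = avg.shape
    logits = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Slide the k x k x 2 window over the concatenated map.
            logits[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)

    W_S = sigmoid(logits)[..., None]  # Equation (15), shape (H, W, 1)
    return F * W_S, W_S               # Equation (14), broadcast over C


# Usage with random stand-in data; F_prime plays the role of the channel-weighted map.
rng = np.random.default_rng(0)
F_prime = rng.standard_normal((8, 8, 16))
kernel = rng.standard_normal((7, 7, 2)) * 0.1
F_out, W_S = spatial_attention(F_prime, kernel)
print(F_out.shape, W_S.shape)  # (8, 8, 16) (8, 8, 1)
```

The explicit loops keep the sliding-window step readable; a practical implementation would normally use a framework's convolution layer instead.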

Finally, we obtain finer channel and spatial information. With the CSA module inserted into the network, each level can extract both rich spatial and rich semantic information, which helps to improve the detection performance for small ships.
