## 2.1. Balanced-Attention FPN

Previous works [40–44] tended to adopt the feature pyramid network (FPN) [45] as the backbone, because low-level feature maps with higher resolution suit small-scale ship detection, while high-level feature maps with richer semantic information suit large-scale targets. However, integrating multi-level information through lateral connections makes the network attend mainly to adjacent layers, so the semantic information of non-adjacent layers is diluted repeatedly as it propagates through the network. In SAR ship datasets [21,46,47], the scale span of ship targets tends to be very large, which makes the ship features exhibit significant differences across levels. To address this problem, inspired by Ref. [48], we use a balanced-attention FPN (BAFPN) to balance and enhance the feature maps of different scales. As shown in Figure 3, BAFPN mainly consists of three parts: (1) feature extraction and fusion, (2) a self-attention enhancement module, and (3) feature pyramid recovery.

### 2.1.1. Feature Extraction and Fusion

Considering the feature dilution caused by the multi-level lateral connection structure in the traditional FPN, we adopt a different way of fusing features. First, we use ResNet-50 to extract feature maps at different scales. As shown in Figure 4, the extracted feature maps of five levels are denoted by $\{C\_1, C\_2, C\_3, C\_4, C\_5\}$. Since $C\_3$ lies in the middle of the pyramid, it can better synthesize top-level semantic information and bottom-level spatial information [48]. We therefore resize $\{C\_1, C\_2, C\_4, C\_5\}$ to the $C\_3$ resolution with max-pooling and up-sampling, respectively. The rescaled feature maps are denoted by $\{C\_1', C\_2', C\_3', C\_4', C\_5'\}$.

**Figure 4.** Architecture of feature fusion module.
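The resizing step described above can be sketched as follows. This is a minimal PyTorch illustration, not the authors' released code: the function name `rescale_to_mid_level`, the use of nearest-neighbor interpolation, and adaptive max-pooling are our assumptions; the paper only specifies up-sampling for low-resolution levels and max-pooling for high-resolution ones.

```python
import torch
import torch.nn.functional as F_nn


def rescale_to_mid_level(feats):
    """Resize feature maps {C1..C5} to the spatial size of C3.

    `feats` is a list [C1, ..., C5] of tensors shaped (N, C, H_i, W_i),
    ordered from highest to lowest resolution. Levels finer than C3 are
    down-sampled with max-pooling; levels coarser than C3 are up-sampled
    with nearest-neighbor interpolation (an assumed choice of mode).
    """
    target = feats[2].shape[-2:]  # spatial size of C3
    rescaled = []
    for f in feats:
        if f.shape[-2] > target[0]:      # C1, C2: higher resolution
            f = F_nn.adaptive_max_pool2d(f, target)
        elif f.shape[-2] < target[0]:    # C4, C5: lower resolution
            f = F_nn.interpolate(f, size=target, mode="nearest")
        rescaled.append(f)               # C3 passes through unchanged
    return rescaled
```

After this step, every level shares the spatial resolution of $C\_3$, so element-wise fusion becomes possible.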

Then, we fuse these rescaled feature maps by averaging:

$$F = \frac{1}{5}\sum\_{i=1}^{5} C\_i' \tag{1}$$

where *i* indexes the pyramid level and $C\_i'$ denotes the *i*-th rescaled feature map.

Through averaging, the fused feature map aggregates information from all resolutions. This feature fusion reduces the impact of varying ship scales on detection and classification performance.
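The averaging fusion of Equation (1) amounts to a simple element-wise mean over the rescaled levels. A one-line sketch (the function name `fuse_by_averaging` is ours, assuming the rescaled maps already share one spatial size):

```python
import torch


def fuse_by_averaging(rescaled):
    """Element-wise average of the rescaled feature maps, as in Eq. (1).

    `rescaled` is a list of L tensors with identical shape (N, C, H, W);
    stacking on a new leading dim and taking the mean gives (1/L) * sum.
    """
    return torch.stack(rescaled, dim=0).mean(dim=0)
```

Because all inputs share the $C\_3$ resolution, the result $F$ is a single map that every pyramid level has contributed to equally.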
