*3.2. Experimental Details*

We follow the work of Pang et al. [48] and use ResNet-50 [58] pretrained on ImageNet as the backbone of RBFAN; the large number of images in ImageNet enables ResNet-50 to learn better low-level features. Limited by GPU memory, we set the batch size to 4. Similar to Ref. [59], we use stochastic gradient descent (SGD) [60] as the optimizer, with a learning rate of 0.005, a momentum of 0.9 and a weight decay of 0.0001. The learning rate is further divided by 10 at epoch 130 and again at epoch 140 to ensure sufficient loss reduction. In addition, the intersection over union (IoU) threshold in the experiments is set to 0.5. Following the configuration in Ref. [48], we set *α* = 0.5 and *γ* = 1.5. The hyper-parameter *λ* in the loss function is set to 1, and the hyper-parameter *C* in the balanced L1 loss is set to 0.5.

According to Refs. [13,30,48], we set the following parameters. In BAFPN, the resolutions of the pyramid levels are [512 × 512, 256 × 256, 128 × 128, 64 × 64, 32 × 32]. The size of the fused feature map is 128 × 128 × 256, and the restored feature pyramid has the same resolutions, [512 × 512, 256 × 256, 128 × 128, 64 × 64, 32 × 32]. In AFAN, the convolution kernel size is 3 × 3. In RDN, all convolution kernels are 3 × 3, and the number of convolution layers stacked in the classification subnet and the regression subnet is 2.
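The per-level resizing implied by fusing the pyramid to 128 × 128 can be checked with a short sketch. The level list and target size follow the values above; the helper itself is illustrative, not part of BAFPN's published code:

```python
# Pyramid level resolutions (H, W) as stated in the text.
pyramid = [(512, 512), (256, 256), (128, 128), (64, 64), (32, 32)]
fused_hw = (128, 128)  # spatial size of the fused map (256 channels)

def scale_to_fused(level_hw, target_hw=fused_hw):
    """Scale factor needed to resize one pyramid level to the fused
    resolution (>1 means upsampling, <1 means downsampling)."""
    return target_hw[0] / level_hw[0]

scales = [scale_to_fused(hw) for hw in pyramid]
# The 512 and 256 levels are downsampled (0.25, 0.5), the 128 level is
# kept as-is (1.0), and the 64 and 32 levels are upsampled (2.0, 4.0).
```

Restoring the pyramid after fusion simply applies the inverse of each scale factor, which is consistent with the restored resolutions listed above matching the input resolutions.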
