*4.2. Implementation Details*

The network is implemented in TensorFlow and trained end-to-end. The MB-RPN loss and the detection loss are optimized simultaneously on an Nvidia 1080Ti GPU under the Ubuntu operating system [17]. A pre-trained ResNet-101 network is adopted to extract image features, and the remaining convolution layers are initialized randomly [14]. Network parameters are optimized with the Adam optimizer, using lr = 10<sup>−6</sup>, *β*<sub>1</sub> = 0.9, and *β*<sub>2</sub> = 0.999 [18].
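To make the role of these hyper-parameters concrete, the following is a minimal, dependency-free sketch of a single Adam update for one scalar parameter, using the values stated above (lr = 10<sup>−6</sup>, *β*<sub>1</sub> = 0.9, *β*<sub>2</sub> = 0.999). The function name and the epsilon constant are illustrative assumptions, not taken from the paper's code; in practice this is handled by TensorFlow's built-in Adam optimizer.

```python
def adam_step(param, grad, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m, v are the running first- and second-moment estimates; t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # exponential average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v
```

With such a small learning rate, each step changes a parameter by roughly lr in the steady state, which suits fine-tuning from a pre-trained backbone.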

The input image size is set to 600 × 600. For the DOTA dataset, the training and testing images are much larger than the input size; to reduce the loss of image resolution, the images are cropped into input-sized patches with a stride of 300 for both training and testing, and the test results are then merged back to the original image shape. For the UAVB dataset, the image shapes are already close to the input size, so the images are simply resized. The anchor-box sizes for layers Conv1∼Conv5 are [32, 64, 128, 256, 512], consistent with the default values of the FPN method. For the DOTA dataset, the anchor-box aspect ratios are [1/7, 1/5, 1/3, 1/2, 1, 2, 3, 5, 7] to accommodate categories with both normal and slender shapes, such as bridges. For the UAVB dataset, all categories have normal shapes, so the aspect ratios are the same as in the default FPN method. Mean average precision (mAP) is adopted to evaluate the proposed approach [19].
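The cropping and anchor settings above can be sketched as follows. The crop enumeration slides a 600 × 600 window with stride 300 and clamps the last window to the image border so the full image is covered; the anchor helper assumes the common FPN convention that an aspect ratio rescales height and width while preserving the anchor's area. Function names and the border-clamping detail are assumptions for illustration, not specifics from the paper.

```python
def crop_windows(height, width, size=600, stride=300):
    """Top-left/bottom-right corners (y0, x0, y1, x1) of size-by-size crops
    covering the image, sliding with the given stride; the last window in
    each direction is clamped to the border if the stride overshoots."""
    ys = list(range(0, max(height - size, 0) + 1, stride))
    xs = list(range(0, max(width - size, 0) + 1, stride))
    if ys[-1] + size < height:
        ys.append(height - size)
    if xs[-1] + size < width:
        xs.append(width - size)
    return [(y, x, y + size, x + size) for y in ys for x in xs]


ANCHOR_SIZES = [32, 64, 128, 256, 512]               # one size per layer Conv1-Conv5
DOTA_RATIOS = [1/7, 1/5, 1/3, 1/2, 1, 2, 3, 5, 7]    # height/width ratios for DOTA

def anchor_wh(size, ratio):
    """Width and height of an anchor with area size**2 and the given
    height/width ratio (standard FPN-style anchor generation)."""
    h = size * ratio ** 0.5
    w = size / ratio ** 0.5
    return w, h
```

For example, a 1500 × 1500 DOTA image yields a 4 × 4 grid of overlapping 600 × 600 crops, and each feature-map location on a given layer gets nine anchors (one per ratio) of that layer's size.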
