*2.4. End-to-End Training*

The proposed detection method adopts the detection architecture of Faster R-CNN, and we define the loss function of Faster R-CNN as *LFaster*. In this paper, the loss functions of the newly proposed difference maximum loss and focused feature enhancement module are *Ld* (Equation (1)) and *Lf* (Equation (2)), respectively. Thus, the loss function of the whole detection network is defined as

$$L = L\_{Faster} + L\_d + L\_f, \tag{3}$$

The base networks CNN1 and CNN2 are initialized with ResNet-50 weights pre-trained on ImageNet classification [12]. The stochastic gradient descent (SGD) optimizer [47] is used to optimize the network parameters. The detection network is trained end-to-end: it outputs detection results directly from the input images, without any additional operations. We use an NVIDIA GTX 1080 Ti GPU to train and test the detection network. The network weights are updated with a learning rate of 10<sup>−4</sup> for the first 50k iterations and 10<sup>−5</sup> for the next 50k iterations. The momentum, weight decay, and batch size are set to 0.9, 0.0005, and 2, respectively. The proposed detection network is implemented in PyTorch [48].
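The training configuration above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: the tiny linear model is a hypothetical stand-in for the full detection network (the CNN1/CNN2 backbones plus the Faster R-CNN head), and the helper name `current_lr` is ours.

```python
import torch

# Hypothetical stand-in for the full detection network
# (CNN1/CNN2 backbones + Faster R-CNN detection head).
model = torch.nn.Linear(8, 2)

# SGD with the hyperparameters reported in this section:
# initial learning rate 10^-4, momentum 0.9, weight decay 0.0005.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-4,
    momentum=0.9,
    weight_decay=0.0005,
)

# Decay the learning rate to 10^-5 after the first 50k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000], gamma=0.1,
)

def current_lr() -> float:
    """Return the learning rate currently applied by the optimizer."""
    return optimizer.param_groups[0]["lr"]

# Each training iteration would compute the combined loss of
# Equation (3), loss = l_faster + l_d + l_f, then call
# loss.backward(), optimizer.step(), and scheduler.step().
```

Because the schedule is stated per iteration rather than per epoch, `scheduler.step()` would be called once per training iteration, so the milestone of 50,000 steps matches the 50k-iteration boundary in the text.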
