**5. Discussion**

**A. The effectiveness of the proposed difference maximum loss, focused feature enhancement, and cascaded semantic extension.** This paper proposes three novel modules: the difference maximum loss function, the focused feature enhancement module, and the cascaded semantic extension module. To demonstrate their effectiveness, we conduct extensive ablation experiments; representative results are listed in Table 3, where × denotes that a module is not used in the detection network and - denotes that it is. All testing images are used to compute the detection accuracy (mAP). Without any of the three proposed modules, the accuracy drops to 78.9%. Using only the difference maximum loss function yields 80.7%, using only the focused feature enhancement module yields 80.1%, and using only the cascaded semantic extension module yields 81.6%. Hence, each module improves detection performance, with the cascaded semantic extension module contributing the largest single-module gain. Combining any two modules improves accuracy further, and using all three modules reaches the maximum of 83.7%.

**Table 3.** The effectiveness of difference maximum loss, focused feature enhancement, and cascaded semantic extension.
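The single-module gains reported above can be organized as a small ablation lookup. The configuration tuples, the `gain_over_baseline` helper, and the dictionary layout below are illustrative (they are not the paper's code); the mAP values are the ones reported in the text, and the two-module rows mentioned there are omitted because their exact numbers are not given.

```python
# Ablation results from Section 5.A, keyed by which proposed modules are on:
# (difference maximum loss, focused feature enhancement, cascaded semantic extension)
REPORTED_MAP = {
    (False, False, False): 78.9,  # baseline, no proposed modules
    (True,  False, False): 80.7,  # difference maximum loss only
    (False, True,  False): 80.1,  # focused feature enhancement only
    (False, False, True):  81.6,  # cascaded semantic extension only
    (True,  True,  True):  83.7,  # all three modules
}

def gain_over_baseline(config):
    """Return the mAP gain (in points) of a configuration over the plain network."""
    return round(REPORTED_MAP[config] - REPORTED_MAP[(False, False, False)], 1)

# Largest single-module gain: the cascaded semantic extension module.
single_module_gains = {c: gain_over_baseline(c) for c in REPORTED_MAP if sum(c) == 1}
best_single = max(single_module_gains, key=single_module_gains.get)
```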


**B. The effectiveness for small objects and large objects.** To assess detection performance separately for small and large objects, we define objects smaller than 48 × 48 pixels in the testing images as small objects and objects larger than 96 × 96 pixels as large objects. Our method achieves an mAP of 76.5% for small objects; without the focused feature enhancement module, this drops to 74.2%. For large objects, our method achieves an mAP of 88.6%; without the cascaded semantic extension module, this drops to 87.1%.
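The size-based split above can be sketched as a small helper. The function name, the `(x1, y1, x2, y2)` box format, and the side-wise (rather than area-based) comparison are our assumptions; only the 48 × 48 and 96 × 96 thresholds come from the text.

```python
def size_category(box, small_max=48, large_min=96):
    """Classify a box as 'small' (< 48x48), 'large' (> 96x96), or 'medium'.

    A box counts as small only if both sides are below the small threshold,
    and large only if both sides exceed the large one; boxes in between
    fall into neither of the two groups evaluated in Section 5.B.
    """
    w = box[2] - box[0]
    h = box[3] - box[1]
    if w < small_max and h < small_max:
        return "small"
    if w > large_min and h > large_min:
        return "large"
    return "medium"
```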

**C. The first stage in CNN1 and CNN2.** In the proposed detection network, the first stage of CNN1 and CNN2 is excluded from both the difference maximum loss function and the focused feature enhancement module (see Figure 1), for two reasons. First, many other state-of-the-art object detection networks [19,59] likewise exclude the first stage of the base CNN from their designed modules. Second, the first stage has one-half the resolution of the input image, so its features carry little semantic information and mainly capture edges and gradients, which are of little use for detecting small objects. Moreover, edge and gradient features differ very little between the two base CNNs, so the difference maximum loss function also gains nothing from the first stage. Table 4 lists the detection accuracy with and without the first stage; the results confirm that the detection network performs better without it.
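The stage selection above can be written as a one-line rule: the two proposed modules consume every backbone stage except the first. The function name and the default stage count are illustrative assumptions; the actual number of stages in CNN1 and CNN2 is not restated here.

```python
def stages_for_modules(num_stages=5):
    """Return the 1-indexed backbone stages fed to the difference maximum
    loss and the focused feature enhancement: all stages except stage 1,
    whose half-resolution edge/gradient features are skipped (Section 5.C)."""
    return list(range(2, num_stages + 1))
```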


**Table 4.** The mAP with and without the first stage for difference maximum loss and focused feature enhancement.

**D. Automatically generated segmentation labels for the focused feature enhancement.** In the focused feature enhancement module, we use automatically generated segmentation labels rather than ground-truth segmentation labels to save annotation time and effort (see Figure 3). Although the automatically generated labels are less accurate than ground-truth labels, they are sufficient for the focused feature enhancement module: they allow the module to effectively concentrate on strengthening the features of small objects. Table 5 compares the detection accuracy obtained with ground-truth segmentation labels and with automatically generated segmentation labels. The comparison shows that our strategy (automatically generated segmentation labels) produces satisfactory detection results while substantially reducing annotation cost.

**Table 5.** The mAP for the ground-truth segmentation labels and the automatically generated segmentation labels.
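One common way to auto-generate coarse segmentation labels is to rasterize the ground-truth bounding boxes into a binary foreground mask. This is a hedged sketch of that idea only; the actual generation procedure in Figure 3 may differ, and `boxes_to_mask` with its `(x1, y1, x2, y2)` box format is our illustrative naming.

```python
import numpy as np

def boxes_to_mask(image_hw, boxes):
    """Fill each ground-truth box region with 1 to form a coarse binary
    foreground mask, usable as a segmentation label for feature enhancement."""
    mask = np.zeros(image_hw, dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1  # box interior marked as foreground
    return mask
```

Such box-filled masks are coarser than pixel-accurate annotations, but they localize the small-object regions well enough to guide where features should be strengthened.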

