4.1. Experimental Settings
Dataset. To validate the performance of the DA-FPN network structure proposed in this study, we performed all our experiments on the MS COCO [55] dataset. The MS COCO dataset has both MS COCO 2014 and MS COCO 2017 versions. Specifically, the method proposed in this paper was evaluated on the MS COCO 2017 dataset, which contains 80 object categories and 1.5 million object instances. In total, 80k images were used for training, 40k images for validation, and 20k images for testing. All models in this paper were trained on trainval35k. We then used another 5k images from the validation set for testing and visualization.
Metrics. The MS COCO dataset uses the AP metrics to characterize detector performance. Average precision (AP) is calculated across 10 IoU thresholds (i.e., 0.5:0.05:0.95) and all categories. AP is regarded as the most important metric for the MS COCO dataset and is also reported at various object scales: small objects (area < 32²), medium objects (32² < area < 96²), and large objects (area > 96²).
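As a simplified illustration of how the COCO-style metrics above are aggregated (a sketch of the conventions only, not the official pycocotools implementation), the ten IoU thresholds and the area-based scale buckets can be written as:

```python
# Sketch of COCO-style metric conventions: the primary AP averages
# per-threshold APs over IoU thresholds 0.50, 0.55, ..., 0.95, and
# objects are assigned to scale buckets by pixel area.

def coco_iou_thresholds():
    """The ten IoU thresholds 0.5:0.05:0.95 used for the primary AP metric."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def mean_ap(ap_per_threshold):
    """Average the per-threshold AP values into the single COCO AP number."""
    return sum(ap_per_threshold) / len(ap_per_threshold)

def scale_bucket(area):
    """Assign an object to the small/medium/large bucket by pixel area."""
    if area < 32 ** 2:
        return "small"    # evaluated as AP for small objects
    elif area < 96 ** 2:
        return "medium"   # evaluated as AP for medium objects
    return "large"        # evaluated as AP for large objects

print(coco_iou_thresholds())   # [0.5, 0.55, ..., 0.95]
print(scale_bucket(30 * 30))   # small
print(scale_bucket(100 * 100)) # large
```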
Implementation Details. All experiments in this paper are based on the MMDetection framework. We trained three network models, FoveaBox, GFL [56], and SABL, setting the initial learning rate to 0.0025 for FoveaBox and GFL and to 0.005 for SABL. The number of epochs was set to 12. Unless otherwise specified, the backbone feature extraction network was ResNet50 pre-trained on ImageNet. For all three models, we replaced the original FPN structure with our DA-FPN structure. We used Ubuntu 18.04 as the operating system, Python 3.7 as the programming language, PyTorch 1.7.0 as the deep learning framework, and two 10 GB NVIDIA 2080 Ti GPUs with CUDA 10.1 for training.
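As a hedged sketch of how such a neck swap is typically expressed in an MMDetection-style Python config (the module name 'DAFPN' and its registration are assumptions for illustration; only 'FPN' ships with MMDetection), the settings above might look like:

```python
# Hypothetical MMDetection-style config fragment: replacing the standard
# FPN neck with a custom DA-FPN neck. The type name 'DAFPN' is an assumed
# name for a custom registered module, not an MMDetection built-in.
model = dict(
    backbone=dict(
        type='ResNet',
        depth=50,
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='DAFPN',  # assumed custom neck; the stock choice is type='FPN'
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5))

# Training schedule matching the settings reported above (FoveaBox/GFL).
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=12)
```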
4.2. Results
To verify the effectiveness and generality of our DA-FPN structure relative to the traditional FPN structure, in this section we compare our method with popular two-stage object detection algorithms such as Faster R-CNN [24] and Cascade R-CNN [25], as well as the single-stage object detection algorithms FCOS [35] and ATSS [36]. We also replace the FPN structures in the two-stage detector SABL and the single-stage detectors FoveaBox and GFL with the DA-FPN structure proposed in this paper. The experimental results are listed in
Table 1. The data in this table show that when ResNet50 is used as the backbone, our method improves the AP of the original FoveaBox, GFL, and SABL object detection algorithms by 1.7%, 2.4%, and 2.4%, respectively. Consistent gains of 2.2%, 2.2%, and 2%, and of 2%, 4.3%, and 3.2%, are also observed for the three detectors on the threshold- and scale-specific AP metrics reported in Table 1. When ResNet101 is used as the backbone for feature extraction, our method improves the AP of the original FoveaBox, GFL, and SABL algorithms by 1.5%, 2%, and 1.4%, respectively. These results illustrate that the proposed method can effectively alleviate the loss of detection accuracy caused by the insufficient extraction of object boundary information and shallow information in the traditional FPN.
4.3. Ablation Study
To verify that the deformable convolution module (DCM), feature alignment module (LFAM), and bottom-up module (BUM) in the proposed DA-FPN structure each improve the accuracy of the object detection algorithm, we performed ablation experiments on the single-stage detector GFL; the results are shown in Table 2. The table shows that the deformable convolution module alone improves detection accuracy by 1.2%, the largest improvement among the three modules. This is because replacing the 1 × 1 convolution with a 3 × 3 deformable convolution captures more boundary information of small objects and thus improves detection accuracy. In addition, the bottom-up module improves two of the scale-specific AP metrics in Table 2 by 1.6% and 2.8%, respectively, because it effectively fuses the shallow information of the object into the high-level feature maps, thereby improving the detection accuracy of large objects.
In the DA-FPN structure, we replaced the 1 × 1 convolution used in the FPN lateral connection with a 3 × 3 deformable convolution. There are two lateral connection locations in the traditional FPN structure: after the backbone feature maps and after the top-down feature maps. To determine where to place our 3 × 3 deformable convolution module, we performed corresponding experiments on the single-stage detector FoveaBox. The results are shown in Table 3. From these data, we find that replacing only the 1 × 1 convolution after the backbone feature maps with a 3 × 3 deformable convolution module improves detection accuracy by 0.5%, while accuracy remains unchanged when only the 1 × 1 convolution after the top-down feature maps is replaced. This occurs because the backbone feature maps are obtained from shallower layers through multiple convolution and pooling operations, losing a large amount of object boundary information along the information flow; a 3 × 3 deformable convolution is therefore needed to expand the receptive field and recover more boundary information before feature fusion. In the bottom-up module, by contrast, each feature map is obtained from the preceding one through only a few convolution operations, so little boundary information is lost, and replacing the 1 × 1 convolution with a 3 × 3 deformable convolution has little effect. Since the deformable convolution module adds some computation, only the 1 × 1 convolution after the backbone feature maps is replaced by the deformable convolution module in the DA-FPN structure.
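To make the receptive-field argument concrete, the following is a minimal pure-Python sketch of the deformable-convolution sampling idea (our illustration, not the paper's implementation): each tap of a 3 × 3 kernel is displaced by a learned offset and the input is read by bilinear interpolation, so the effective receptive field can extend beyond the fixed grid sampled by a plain convolution.

```python
# Minimal sketch of deformable sampling: a 3x3 kernel whose taps are
# displaced by per-tap offsets and read via bilinear interpolation.

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bot = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deformable_3x3(feat, cy, cx, weights, offsets):
    """One output value: sum over 9 taps at offset-shifted sample points."""
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            oy, ox = offsets[k]  # learned per-tap displacement
            out += weights[k] * bilinear(feat, cy + ky + oy, cx + kx + ox)
            k += 1
    return out

feat = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
w = [1.0 / 9] * 9                # averaging kernel for illustration
zero = [(0.0, 0.0)] * 9          # zero offsets -> plain 3x3 convolution
shift = [(0.0, 0.5)] * 9         # all taps shifted right by half a pixel
print(deformable_3x3(feat, 1, 1, w, zero))   # 5.0: plain 3x3 average at (1,1)
print(deformable_3x3(feat, 1, 1, w, shift))  # 5.5: samples half a pixel right
```

In the real module the offsets are predicted per position by an extra convolution branch; in PyTorch this mechanism is available as `torchvision.ops.DeformConv2d`.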
We also added a feature alignment module (RFAM) to the bottom-up module. To verify its effectiveness, we performed the corresponding experiments on the single-stage detector GFL and the two-stage detector SABL; the results are shown in Table 4 and Table 5. Adding the feature alignment module to the bottom-up module improves the overall detection accuracy of GFL and SABL by 1.0% and 0.8%, respectively, and one of the scale-specific AP metrics by 1.9% and 2.1%, respectively. The reason is that retransmitting shallow information to the high-level feature maps through the bottom-up path involves a 2× downsampling operation, which causes feature misalignment during feature fusion. The information lost to this misalignment is especially important in two-stage object detection algorithms, where classification and regression of objects rely on the high-level feature maps.
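The misalignment caused by 2× downsampling can be illustrated with a small pure-Python example (our sketch, independent of the paper's RFAM implementation): stride-2 pooling followed by nearest-neighbor 2× upsampling shifts feature positions, which is why an alignment step is needed before element-wise fusion.

```python
# Sketch of the feature-misalignment problem: an activation spike at
# position 5 is downsampled by 2 (stride-2 max pooling) and upsampled
# by nearest neighbor; its apparent position moves by one cell.

def downsample2(x):
    """Stride-2 max pooling over a 1D feature vector."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]

def upsample2_nearest(x):
    """Nearest-neighbor 2x upsampling of a 1D feature vector."""
    out = []
    for v in x:
        out.extend([v, v])
    return out

feat = [0.0] * 8
feat[5] = 1.0                  # spike at index 5
round_trip = upsample2_nearest(downsample2(feat))
print(feat.index(1.0))         # 5
print(round_trip.index(1.0))   # 4: the spike has shifted by one cell
```

Fusing `feat` and `round_trip` element-wise would therefore mix features from neighboring spatial locations unless the maps are first realigned.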
Table 2 shows that among the three modules of DA-FPN, LFAM and BUM perform very well in the single-stage object detection algorithms. To confirm the generality of these two modules in two-stage object detection algorithms, we performed ablation experiments with them on the two-stage detector SABL. The results are shown in Table 6. These data indicate that the two modules are also effective in two-stage object detection algorithms, raising the overall detection accuracy of SABL from 40.0% to 41.9%, an increase of 1.9%. When only the feature alignment module was used, the small-object detection accuracy improved from 22.9% to 24.6%, an increase of 1.7%. This shows that feature misalignment also degrades small-object detection accuracy in two-stage object detection algorithms. When only the bottom-up module was used, the large-object detection accuracy improved from 52.0% to 54.5%, an increase of 2.5%. This shows that the bottom-up module effectively improves the detection of large objects by fusing information from the low-level feature maps into the high-level feature maps and reducing the loss of shallow information.
To verify whether the proposed method can better extract the boundary information of objects, we performed visualization experiments with FoveaBox. The results are shown in Figure 5, which demonstrates that the proposed method accurately extracts object boundary information and thus improves detection accuracy. For example, in the first column, the bounding box generated by FoveaBox does not completely enclose the truck because FoveaBox does not extract enough object boundary information, whereas the bounding box generated by our method encloses the truck completely. In the second column, FoveaBox fails to capture the boundary information of the zebra's ears and tail, while the proposed method extracts this boundary information accurately.
To visualize whether the proposed DA-FPN performs better than the traditional FPN structure, we conducted comparison experiments on both the single-stage detector GFL and the two-stage detector SABL; the results are visualized in Figure 6. The figure shows that DA-FPN can correct false detections and recover missed objects, especially small ones. For example, in the left image of the second row, the proposed GFL-DAFPN detector finds small objects such as umbrellas and potted plants that the GFL algorithm missed and correctly classifies some objects that GFL misclassified. In the middle image of the second row, the DA-FPN algorithm removes the tennis racket that GFL incorrectly localized. In the right image of the second row, it detects the missed traffic light and handbag and removes the incorrectly detected person. Similarly, in the left image of the fourth row, the SABL-DAFPN algorithm removes the ties that SABL incorrectly localized. In the middle image of the fourth row, SABL-DAFPN removes the objects that SABL misdetected as kites. In the right image of the fourth row, SABL-DAFPN detects the missed benches as well as the distant hot air balloons. This analysis shows that the proposed DA-FPN method can effectively correct false and missed detections and detect more small objects.